Google Cloud Data Loss Prevention (DLP) for XML data; an example of invoking the REST API

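It took a bit of work to decipher the Google Cloud documentation for content:deidentify. After some trial and error, this is what worked for me. In the request below, :dlp is the service endpoint (https://dlp.googleapis.com), :project is your project ID, and :token is an OAuth access token.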

POST :dlp/v2/projects/:project/content:deidentify
content-type: application/json
x-goog-user-project: :project
Authorization: Bearer :token

{
  "inspectConfig": {
    "infoTypes": [ { "name": "EMAIL_ADDRESS" }, { "name": "PHONE_NUMBER" }, { "name": "URL" } ]
  },
  "deidentifyConfig": {
    "infoTypeTransformations": {
      "transformations": [ {
        "infoTypes": [
          {
            "name": "URL"
          }
        ],
        "primitiveTransformation": {
          "characterMaskConfig": {
            "numberToMask": -8,
            "reverseOrder": true
          }
        }
      },
      {
        "infoTypes": [
          {
            "name": "PHONE_NUMBER"
          }
        ],
        "primitiveTransformation": {
          "characterMaskConfig": {
            "numberToMask": -1,
            "reverseOrder": false,
            "charactersToIgnore": [
              {
                "charactersToSkip": ".-"
              }
            ]
          }
        }
      },
      {
        "infoTypes": [
          {
            "name": "EMAIL_ADDRESS"
          }
        ],
        "primitiveTransformation": {
          "characterMaskConfig": {
            "numberToMask": -3,
            "reverseOrder": false,
            "charactersToIgnore": [
              {
                "charactersToSkip": ".@"
              }
            ]
          }
        }
      } ]
    }
  },
  "item": {
    "value": "<doc xmlns=\"urn:932F4698-0A64-49D4-963F-E6615BC399E8\">  <Name>Marcia</Name>  <URL>https://marcia.com</URL>  <Email>marcia@example.com</Email><Phone>412-343-0919</Phone></doc>"
  }
}
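
For anyone who would rather issue that call from code, here is a minimal sketch that builds the same request with the JDK's built-in HTTP client (Java 11 or later). The class name, method name, and placeholder values are made up for illustration, and the endpoint constant assumes the public DLP host stands in for :dlp. The sketch only builds the request; sending it is left as a comment.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class DlpDeidentifyRequestSketch {
    // Assumption: the public DLP endpoint stands in for the :dlp placeholder.
    static final String DLP_ENDPOINT = "https://dlp.googleapis.com";

    /** Builds (but does not send) a content:deidentify request. */
    static HttpRequest build(String project, String token, String jsonBody) {
        return HttpRequest.newBuilder()
            .uri(URI.create(DLP_ENDPOINT + "/v2/projects/" + project + "/content:deidentify"))
            .header("content-type", "application/json")
            // The quota project header; leaving it out leads to the 403 shown in the notes.
            .header("x-goog-user-project", project)
            .header("Authorization", "Bearer " + token)
            .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
            .build();
    }

    public static void main(String[] args) {
        // Hypothetical values; substitute your own project, token, and JSON payload.
        HttpRequest request = build("my-project", "placeholder-token",
            "{ \"item\": { \"value\": \"<doc/>\" } }");
        System.out.println(request.method() + " " + request.uri());
        // To actually send it:
        //   HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

The JSON payload is passed in as a string, so the body shown above can be used unchanged.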

Notes:

  • As described here, you need to specify the header x-goog-user-project: :project (replacing :project with your own project ID); otherwise you will get the dreaded 403 error, like this:
    {
       "error": {
         "code": 403,
         "message": "Your application is authenticating by using local Application Default Credentials. The dlp.googleapis.com API requires a quota project, which is not set by default. To learn how to set your quota project, see https://cloud.google.com/docs/authentication/adc-troubleshooting/user-creds .",
         "status": "PERMISSION_DENIED",
         "details": [
           {
             "@type": "type.googleapis.com/google.rpc.ErrorInfo",
             "reason": "SERVICE_DISABLED",
             "domain": "googleapis.com",
             "metadata": {
               "service": "dlp.googleapis.com",
               "consumer": "projects/325555555"
             }
           }
         ]
       }
     }
  • You can also specify the de-identify config by reference, as a template. An example follows:
    POST :dlp/v2/projects/:project/content:deidentify
    content-type: application/json
    x-goog-user-project: :project
    Authorization: Bearer :token
    
    {
      "inspectConfig": {
        "infoTypes": [ { "name": "EMAIL_ADDRESS" }, { "name": "PHONE_NUMBER" }, { "name": "URL" } ]
      },
      "deidentifyTemplateName": "projects/my-project-name-12345/deidentifyTemplates/3816550063387353440",
      "item": {
        "value": "<doc xmlns=\"urn:932F4698-0A64-49D4-963F-E6615BC399E8\">  <Name>Marcia</Name>  <URL>https://marcia.com</URL>  <Email>marcia@example.com</Email><Phone>412-343-0919</Phone></doc>"
      }
    }
    

Jackson and XmlMapper – reading arbitrary data into a java.util.Map

I like the Jackson library from FasterXML. Really handy for reading JSON, writing JSON. Or I should say “serialization” and “deserialization”, ’cause that’s what the cool kids say. And the license is right. (If you need a basic overview of Jackson, I suggest this one from Eugen at Stackify.)

But not everything is JSON. Sometimes ya just wanna read some XML, amiright?

I work on projects where Jackson is included as a dependency. And I am aware that there is a jackson-dataformat-xml module that teaches Jackson how to read and write XML, using the same simple model that it uses for JSON.

Most of the examples I’ve seen show how to read XML into a POJO – in other words “databinding”. If my XML doc has an element named “Fidget” then upon de-serialization, the value there is used to populate the field or property on the Java object called “Fidget” (subject to name remapping of course).

That’s nice and handy, but like I said, sometimes ya just wanna read some XML. And it’s not known what the schema is. And you don’t have a pre-compiled Java class to hold the data. What I really want is to read XML into a java.util.Map<String,Object>. Very similar to what I would do in JavaScript with JSON.parse(). How can I do that?

It’s pretty easy, actually: create an XmlMapper and call something like xmlMapper.readValue(xml, Map.class).

This works but there are some problems.

  1. The root element is lost. This is an inadvertent side-effect of using a JSON-oriented library to read XML.
  2. For any element that appears multiple times, only the last value is retained.

What I mean is this:
Suppose the source XML is:

<Root>
  <Parameters>
    <Parameter name='A'>valueA</Parameter>
    <Parameter name='B'>valueB</Parameter>
  </Parameters>
</Root>

Suppose you deserialize that into a map, and then re-serialize it as JSON. The output will be:

{
  "Parameters" : {
    "Parameter" : {
      "name" : "B",
      "" : "valueB"
    }
  }
}

What we really want is to retain the root element and also infer an array when there are repeated child elements in the source XML.

I wrote a custom deserializer, and a decorator for XMLStreamReader, to solve these problems. Using them looks like this:

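import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import com.fasterxml.jackson.databind.module.SimpleModule;
import com.fasterxml.jackson.dataformat.xml.XmlMapper;
import org.testng.Assert; // assumed: TestNG, given the (actual, expected) argument order
// RootSniffingXMLStreamReader and ArrayInferringUntypedObjectDeserializer come from the repo linked below.

String xmlInput = "<Root><Messages><Message>Hello</Message><Message>World</Message></Messages></Root>";
InputStream is = new ByteArrayInputStream(xmlInput.getBytes(StandardCharsets.UTF_8));
// The decorator remembers the root element name, which the mapper would otherwise discard.
RootSniffingXMLStreamReader sr = new RootSniffingXMLStreamReader(XMLInputFactory.newFactory().createXMLStreamReader(is));
XmlMapper xmlMapper = new XmlMapper();
// The custom deserializer collects repeated sibling elements into a List.
xmlMapper.registerModule(new SimpleModule().addDeserializer(Object.class, new ArrayInferringUntypedObjectDeserializer()));
Map map = (Map) xmlMapper.readValue(sr, Object.class);
Assert.assertEquals( sr.getLocalNameForRootElement(), "Root");
Object messages = map.get("Messages");
Assert.assertTrue( messages instanceof Map, "map");
Object list = ((Map)messages).get("Message");
Assert.assertTrue( list instanceof List, "list");
Assert.assertEquals( ((List)list).get(0), "Hello");
Assert.assertEquals( ((List)list).get(1), "World");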

And for the Parameters document shown earlier, re-serialized as JSON, the output looks like this:

{
  "Parameters" : {
    "Parameter" : [
      {
        "name" : "A",
        "" : "valueA"
      },{
        "name" : "B",
        "" : "valueB"
      }
    ]
  }
}

…which is what we wanted.

Find the source code here: https://github.com/DinoChiesa/deserialize-xml-arrays-jackson

Hat tip to Jegan for the custom deserializer.
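
To make the two rules concrete (keep the root element, and turn repeated child elements into a list), here is a small sketch that uses only the JDK's DOM parser. This is not the Jackson deserializer from the repo above; it is a stand-in to illustrate the inference rule, and the class and method names are invented for the example. It follows the same convention as the JSON above of storing element text under the empty-string key when an element also has attributes or children. Unlike the decorator approach, it keeps the root by wrapping the result in a single-entry map.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

class XmlToMapSketch {
    /** Parses XML and returns a single-entry map keyed by the root element name. */
    static Map<String, Object> parse(String xml) throws Exception {
        Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
            .getDocumentElement();
        Map<String, Object> result = new LinkedHashMap<>();
        result.put(root.getTagName(), convert(root));
        return result;
    }

    /** Converts an element to a String (text only) or a Map (attributes and/or children). */
    @SuppressWarnings("unchecked")
    static Object convert(Element element) {
        Map<String, Object> map = new LinkedHashMap<>();
        NamedNodeMap attributes = element.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            map.put(attributes.item(i).getNodeName(), attributes.item(i).getNodeValue());
        }
        StringBuilder text = new StringBuilder();
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Element) {
                Object value = convert((Element) child);
                String name = child.getNodeName();
                Object existing = map.get(name);
                if (existing == null) {
                    map.put(name, value);                 // first occurrence: store the plain value
                } else if (existing instanceof List) {
                    ((List<Object>) existing).add(value); // third and later occurrences: append
                } else {
                    List<Object> list = new ArrayList<>(); // second occurrence: promote to a list
                    list.add(existing);
                    list.add(value);
                    map.put(name, list);
                }
            } else if (child.getNodeType() == Node.TEXT_NODE
                    || child.getNodeType() == Node.CDATA_SECTION_NODE) {
                text.append(child.getNodeValue());
            }
        }
        String trimmed = text.toString().trim();
        if (map.isEmpty()) {
            return trimmed;                               // leaf element: just its text
        }
        if (!trimmed.isEmpty()) {
            map.put("", trimmed);                         // text alongside attributes or children
        }
        return map;
    }
}
```

One known limitation of this sketch: an attribute and a child element with the same name would collide in the map.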

SAML – the standard that wasn’t

SAML, the Security Assertion Markup Language, is quite successful. SAML was born in 2002 out of OASIS, the somnolent standards body that enjoyed its heyday in the 2000s, forming so many of the XML-oriented standards like WS-BPEL, UDDI, UBL, and ODF. Today SAML enjoys success satisfying a key need in enterprises: browser-based single sign-on across origins. Sign into www.mycompany.com and then later visit www.serviceprovider.com and get automatically authorized. The benefit is that people type in their passwords just once.

SAML wasn’t designed for just that problem, or anyway, not for that specific problem. SAML was designed to address the general problem of exchanging claims securely. The summary on the first page of the spec says that SAML “defines the syntax and semantics for XML-encoded assertions about authentication, attributes, and authorization, and for the protocols that convey this information.” Hence the name, “Security Assertion Markup Language”. But in actual use, SAML is heavily oriented towards browser-based SSO.

In SAML, the claims or assertions are statements about people. When people (via apps or browser pages) make requests of systems – like “let me see this file”, or “let me transfer funds” – the system that receives this request can use trusted claims about the requester to make authorization decisions. The key is that the system needs to trust the claims, and the claims need to be relevant.

An example: If I go to the grocery store, I can present my debit card to the checkout person, to pay for my groceries. The card is basically a set of claims “asserted” by a bank about me:

  1. that the person named on the card is a customer of the bank,
  2. that the person named is authorized to use a particular account.

This set of assertions is also decorated with some other information, like valid dates, and the author of the claims. The author of the claims is a bank, and that bank is affiliated with a card payment network, in my case, Visa. Also: My debit card expires in a given month and year. The implicit rule is that all parties agree that the claims presented in the plastic card do not hold after that date. This card is good if the merchant trusts the bank, and Visa, and if the dates are valid. Some merchants want to ensure that the person named on the card is the same as the person presenting the card, so they’ll ask for a government-issued picture ID that has the same name.

The SAML Analogue

SAML works in a similar way, except the set of claims is formatted digitally, in an XML document, rather than on a plastic card. The set of claims enclosed in a SAML token is general – it can be any set of claims about a person, or “subject”. Claims such as “Dino is male”, or “Dino has no tattoos”, or “Dino is of sound mind and body” are all acceptable. But more often the claims are statements that are relevant to information-processing organizations, such as “Dino is an employee of XYZ corp.” and then some detailed information such as “Dino’s email address is Dino@xyzcorp.com”, “Dino is a member of the Aviation group in XYZ”, and so on. In the general case these are claims about a person’s identity; SAML calls them “Attributes” of the subject. Statements not about the particular person don’t belong in SAML. “It is sunny today” may or may not be true but it is completely unrelated to me – the person in question aka “subject” – therefore not suitable as an attribute in a SAML assertion about me.

Such claims about me could be used by an organization or company to decide whether to grant service to me, when I request it. If my company, XYZ corp, has a partnership agreement with another company, LMN Corp, then when I present my claims to LMN, along with my request, LMN can take a decision on whether to grant my request.

How Trustworthy are your claims?

The claims in a SAML Assertion are just statements, coded in an XML document. Though SAML is a particularly florid and ornate language, it’s still XML, and anyone could create such a document. For a system to be able to rely on that information, to trust that information, there must be some assurance that the presented claims are bona fide and originate from a trusted source, and also that they are valid at a given moment in time. At one point, “Dino is in the Eighth grade” was a true statement about me, but that statement is no longer true. SAML uses digital signatures based on public-key cryptography to give assurance about the author of the claims, and explicit time windows on claims (e.g., NotBefore and NotOnOrAfter) to circumscribe their validity.

The “Relying Party” or RP examining a SAML Assertion must verify the signature on the XML document, to ensure that the claims can be trusted. The relying party must also evaluate the time windows on the claims. And then finally, the RP must evaluate the claims themselves. It may be that “Dino is a member of the recreation committee” does not grant me permission to see the early draft of the company’s 10-K filing. On the other hand if I am a senior director at the auditing firm, maybe “Dino is an employee at XYZ Auditors” and “Dino is a senior director” is a good enough claim to allow me to see or edit the document.
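
The time-window part of that evaluation is the easiest piece to show in code. Here is a minimal sketch of the check, with invented class and method names, assuming the NotBefore and NotOnOrAfter values have already been parsed out of the assertion into java.time.Instant values. It deliberately does not cover signature verification or evaluation of the claims themselves.

```java
import java.time.Instant;

class AssertionWindowSketch {
    /**
     * Returns true when notBefore <= now < notOnOrAfter.
     * Either bound may be null, which is treated here as "no limit on that side".
     */
    static boolean isWithinWindow(Instant notBefore, Instant notOnOrAfter, Instant now) {
        boolean started = notBefore == null || !now.isBefore(notBefore);
        boolean notExpired = notOnOrAfter == null || now.isBefore(notOnOrAfter);
        return started && notExpired;
    }

    public static void main(String[] args) {
        // Hypothetical five-minute validity window.
        Instant notBefore = Instant.parse("2024-01-01T00:00:00Z");
        Instant notOnOrAfter = Instant.parse("2024-01-01T00:05:00Z");
        System.out.println(isWithinWindow(notBefore, notOnOrAfter, Instant.now()));
    }
}
```

Note the asymmetry in the names: the lower bound is inclusive and the upper bound is exclusive. A real relying party would probably also allow for a little clock skew between itself and the issuer; that is left out of this sketch.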

Simple in Concept, Complex in Execution

SAML is simple enough in principle. I’ve explained the broad strokes here, in just a few paragraphs. Of course, it builds on a large stack of technologies, starting with XML, XML Schema, XML namespaces, URIs, XML digital signatures, and X.509. That alone is a daunting set of technologies, though there is some relief in the maturity of the relevant specifications.

But the details of SAML itself have led to additional complexity. First, the SAML 2.0 spec is 86 pages. Even there, it is not self-contained. One example: SAML has an element called AuthnContextClassRef. I’m guessing this implies “Authentication Context Class Reference”. For those of you scoring at home, that’s four nouns in a row. What exactly is this thing?

Helpfully, the OASIS spec defines this thing as

A URI reference identifying an authentication context class that describes the authentication context declaration that follows.

All clear? We now interrupt this essay to present a completely unrelated Dilbert comic.

[Image: Dilbert comic]

In addition, the SAML spec document suggests, “See the Authentication Context specification [SAMLAuthnCxt] for a full description of authentication context information.” That document is itself an additional 70 pages. Ready to dive in?

This kind of complexity and standards-speak led, even early on in the life of SAML, to complaints of impracticality from the people who had hoped to be able to use it. Even as early as 2003, just a few months after SAML 1.0 was launched, IBM, one of the original authors of the spec, was employing its partners to bravely assert that the idea that SAML was complex was a myth.

[Image: The Wizard]

I can hear the wizard inveighing: Pay no attention to that AuthnContextClassRef!!

But the complaints about complexity were not academic. They were based on real-world attempts to get disparate implementations of “the standard” to interoperate. Even today, connecting an Identity Provider and a Relying Party via SAML is a challenge worthy of a platoon of IBM consultants. Have we got a mismatch in the AuthnContextClassRef? Well, we’re gonna have to figure out how to persuade the Relying Party to allow it, or to persuade the IdP to provide a different one. Have you got the wrong NameID Format? Transient, Persistent or Unspecified? Which side needs to give ground in this negotiation?

That’s what I mean when I call SAML “the standard that wasn’t.” It’s a standard, all right, but there are so many different options that despite the rigor of the specification, getting compliant systems to interoperate is still a huge challenge. Despite the challenges, the standard IS valuable – it works mostly, and it solves specific problems that many companies have. But it isn’t automatic.

Lessons for History

SAML is designed to address much more than browser-based single sign-on. But the lion’s share of adopters use SAML for just that, and only that.

There’s a lesson here regarding over-reach of standards: SAML could have been simpler, quicker to get adopted, and easier to use, had its designers restricted their design goals to addressing what 90% of people use it for today, anyway.

Why Bother?

But why am I even talking about SAML? My passion and intention is to work on APIs and enabling new interconnections. That’s why I’m at Apigee today. APIs means “SOAP on Steroids” or if you like, “all the benefits of SOAP without that unsightly residue”. It means getting better connections, faster, and allowing new customer experiences, better mobile apps, better connections between customers and companies. So if I am all about connecting systems with APIs, why do I care about SAML? Have I been sucked into a time-portal and time-warped back into 2007?

Ah, but no! See, the thing is with large companies, they move deliberately. Many still use SAML and still need any systems they install to integrate with their SAML-based Enterprise Identity system. So if I want to work with enterprises in helping them adopt APIs to supercharge their businesses, I need to get SAML working with the various web apps that enable API management and adoption. Get SAML integration done, then the enterprise can innovate with APIs. See?

Pretty printing XML from within emacs

I use emacs. Can’t help it. Been using it for years, and the cost of switching to something “more modern” has never reached the payoff threshold.

Today I want to show you how I pretty-print XML from within emacs.

The elisp for the pretty-printing logic was originally from a Stack Overflow answer. I modified it slightly and post it here:

        
        
      

Thanks to isagalaev for highlight.js.