Anonymise and tag simultaneously

One way; but not quite


Pseudonymising rows in an RDBMS can be done with one-way functions. Hashing is a way to do that. If you want to keep data sane, but don't want to give away names of people, consider the following SQL-code:

select md5( name ) as name, 
       md5( firstname ) as firstname, 
       md5( coalesce(inbetween,'') ) as inbetween,
       md5( department ) as department
  from ( values ('klaassen', 'jan', 'van', 'hr') ,
                ('jansen',   'piet', null, 'hr') 
       ) employees( name, firstname, inbetween, department );
name firstname inbetween department
fc726eb28c1ea71e8098f65c7235a3ff fa27ef3ef6570e32a79e74deca7c1bc3 957d2fa52c19a5aff4ccf5d9a959adab adab7b701f23bb82014c8506d3dc784e
bbfe0ebad1e3b608bca2b49d4f86bd09 734442e33d8eff4b48b62d603966bdb5 d41d8cd98f00b204e9800998ecf8427e adab7b701f23bb82014c8506d3dc784e

Tagging along

If one needs to tag username and date along the anonymisation, one can add that in a simple matter:

select md5( name                   || to_char(CURRENT_DATE,'yyyymmdd') || CURRENT_USER ) as name, 
       md5( firstname              || to_char(CURRENT_DATE,'yyyymmdd') || CURRENT_USER ) as firstname, 
       md5( coalesce(inbetween,'') || to_char(CURRENT_DATE,'yyyymmdd') || CURRENT_USER ) as inbetween,
       md5( department             || to_char(CURRENT_DATE,'yyyymmdd') || CURRENT_USER ) as department
  from ( values ('klaassen', 'jan', 'van', 'hr') ,
                ('jansen',   'piet', null, 'hr') 
       ) employees( name, firstname, inbetween, department );
name firstname inbetween department
004e447c7370e18d937db633eddfff52 4c512fa45b79da834b332132cbb44101 55c6fa3db27bf988db061b6109e62781 280d7b807bce5dca556d7ad05a6dd6d2
967bfb6b7e279c41bb32d54a836a72ff 65b496740bbce094823b3f9977874198 ad21b4772b49244f4fb0791c5a40be4c 280d7b807bce5dca556d7ad05a6dd6d2

Is it useful to do this?

Nobody suffers from reading md5-sums instead of real family names. If someone wants to correlate family names with height of salary; too bad. This person should request access to the crown jewels and will probably be turned down.

Otherwise, most reports on department sizes, income, revenues and kpi's can do without written out names, but can't do without access to employee records. Note that grouping by department, or even by name, still works as expected in the above resultset.

Dates go wrong however and small text-fields turn into 32-character fields, but these are small things and can be dealt with.

For more on this subject, but then a bit more theoretical and in depth, refer to John D. Cook on Applied privacy and John D. Cook on Adding Laplace or Guassian Noise

Does it work?

Of course this works fine, but not 100%. I'm quite sure an auditor will be impressed by all these md5-sums in reports, in-between-data and temporary results. It's debatable whether it is acceptable for making harmless data from riskful subjects. There are other hashing and encryption algorithms though, please choose the one which suits you best. See Pgcrypto for more information on how to do this in PostgreSQL.

One note however. It is not enough to just hash the name of an individual if one can still make out which department he or she is working and if one can still find all courses he or she went to. Two or three queries further and any SQL script-kid can identify people. So, please pseudonymise more than just the name.

Get Back! Get Back!

For third parties it is not easy to go from an md5-sum back to the real name of the subject; i.e. the process of going from harmless to riskful is hard. I will leave how hard this is to the math people.

For a trusted DBA to do this, this is not hard at all. Given that he or she knows all available names, firstnames, inbetweens and departments, it is not that hard to write a 15-line program which checks who retrieved leaked data and on which day exactly. The bigger the cardinality, the bigger the job to decipher names. This is in line with pretext item (26) of the GDPR and will help you making data harmless.

An implementation please

Vital for any scheme to work is that applications need no change.

For an implementation I've given a lot away, really, in the above text. But it is a bit clumsy to prepare a view for all possible authorisation levels. When adding or deleting columns in the original table, one must redo all view e.g.. Also when changing types, views must be redone.

Another solution is to use rules. These can be used to rewrite queries and allows the application to use the datamodel it is built for.

CREATE RULE "harmless" AS
ON SELECT TO employees
    SELECT pseudonymise( name ) as name,  
           pseudonymise( firstname ) as firstname,  
           col2, col3
      FROM employees;

I will implement and test these rules and some authentication levels in a life-like environment later, check on PostgreSQL RULES for pseudonymisation at litpro.nl for more on this and why the above doesn't work exactly.

If you need to implement different authorization levels, this gets quite complicated. There is row-level-security, allowing access to roles and role-membership automatically. There is column-level-security in the same way.

The snag is that these columns cannot be mentioned in the select-clause without returning a permission error. This makes applications barf.

One would like access to any column in a table for any role, but for some columns some kind of scrambling should be applied first. If these columns are then exempt from being updated/inserted (with traditional column level security), things are fine and it works with any application.

This is a feature which is non-existent in SQL and also in any implementation I know of1.

One of the feature-requests for PostgreSQL under the flag of GDPR now is: In a GRANT SELECT statement allow, in the column-list, a way to apply some kind of function on columns. The following could be valid SQL (this one allows a lot to the head of HR, and likewise, but scrambled, to HR interns):

GRANT SELECT ( name, firstname, inbetween, department )
          ON employees to head_of_hr;
GRANT SELECT ( department )
      MAPPED ( name, firstname, inbetween ) BY pseudonymise( text )
          ON employees to hr_interns;

About this title

The first four or five titles were written in some kind of rage after visiting the Big Data Expo, Utrecht 2017. I then knew about GDPR and had implemented various mechanisms to avoid running risks. The commercial heavy lifting on that expo was terrible. People should be informed about GDPR without FUD.


To be totally honest; I never check other implementations than PostgreSQL. This has never failed me.