Anonymise and tag simultaneously
One-way; but not quite
Anonymity
Pseudonymising rows in an RDBMS can be done with one-way functions, and hashing is one way to do that. If you want to keep the data usable, but don't want to give away people's names, consider the following SQL:
```sql
select md5( name )                      as name,
       md5( firstname )                 as firstname,
       md5( coalesce( inbetween, '' ) ) as inbetween,
       md5( department )                as department
  from ( values ('klaassen', 'jan', 'van', 'hr')
              , ('jansen', 'piet', null, 'hr')
       ) employees( name, firstname, inbetween, department );
```
| name | firstname | inbetween | department |
|---|---|---|---|
| fc726eb28c1ea71e8098f65c7235a3ff | fa27ef3ef6570e32a79e74deca7c1bc3 | 957d2fa52c19a5aff4ccf5d9a959adab | adab7b701f23bb82014c8506d3dc784e |
| bbfe0ebad1e3b608bca2b49d4f86bd09 | 734442e33d8eff4b48b62d603966bdb5 | d41d8cd98f00b204e9800998ecf8427e | adab7b701f23bb82014c8506d3dc784e |
Tagging along
If one needs to tag the username and date along with the anonymisation, one can add that in a simple manner:
```sql
select md5( name || to_char(CURRENT_DATE,'yyyymmdd') || CURRENT_USER )                      as name,
       md5( firstname || to_char(CURRENT_DATE,'yyyymmdd') || CURRENT_USER )                 as firstname,
       md5( coalesce( inbetween, '' ) || to_char(CURRENT_DATE,'yyyymmdd') || CURRENT_USER ) as inbetween,
       md5( department || to_char(CURRENT_DATE,'yyyymmdd') || CURRENT_USER )                as department
  from ( values ('klaassen', 'jan', 'van', 'hr')
              , ('jansen', 'piet', null, 'hr')
       ) employees( name, firstname, inbetween, department );
```
| name | firstname | inbetween | department |
|---|---|---|---|
| 004e447c7370e18d937db633eddfff52 | 4c512fa45b79da834b332132cbb44101 | 55c6fa3db27bf988db061b6109e62781 | 280d7b807bce5dca556d7ad05a6dd6d2 |
| 967bfb6b7e279c41bb32d54a836a72ff | 65b496740bbce094823b3f9977874198 | ad21b4772b49244f4fb0791c5a40be4c | 280d7b807bce5dca556d7ad05a6dd6d2 |
Is it useful to do this?
Nobody suffers from reading md5 sums instead of real family names. If someone wants to correlate family names with salary levels: too bad. That person should request access to the crown jewels and will probably be turned down.
Otherwise, most reports on department sizes, income, revenues and KPIs can do without written-out names, but cannot do without access to employee records. Note that grouping by department, or even by name, still works as expected in the above resultset.
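To illustrate, a minimal sketch on the same inline VALUES list as above: grouping on the hashed department gives the same groups and counts as grouping on the clear-text column would.

```sql
-- grouping on the hashed column yields the same groups as the original
select md5( department ) as department, count(*) as headcount
  from ( values ('klaassen', 'hr')
              , ('jansen', 'hr')
       ) employees( name, department )
 group by md5( department );
```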
Dates go wrong, however, and small text fields turn into 32-character fields, but these are small things and can be dealt with.
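For example, md5() takes text, so a date has to be cast first; the result is no longer a date and any date arithmetic downstream breaks. The column name and value below are made up for illustration.

```sql
-- the cast to text is required; the hashed result has lost its date type
select md5( birthdate::text ) as birthdate
  from ( values (date '1970-04-01') ) employees( birthdate );
```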
For more on this subject, somewhat more theoretical and in depth, refer to John D. Cook on Applied privacy and John D. Cook on Adding Laplace or Gaussian Noise.
Does it work?
Of course this works fine, but not 100%. I'm quite sure an auditor will be impressed by all these md5 sums in reports, in-between data and temporary results. Whether md5 itself is acceptable for making harmless data out of risky subjects is debatable, though. There are other hashing and encryption algorithms; please choose the one that suits you best. See pgcrypto for more information on how to do this in PostgreSQL.
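As a minimal sketch, assuming the pgcrypto extension can be installed: digest() gives access to stronger hashes such as sha256, and hmac() adds a secret key to the mix (the key below is a made-up placeholder).

```sql
create extension if not exists pgcrypto;

-- sha256 instead of md5; hmac() with a secret makes brute-forcing by outsiders harder
select encode( digest( name, 'sha256' ), 'hex' )              as name_sha256,
       encode( hmac( name, 'some-secret', 'sha256' ), 'hex' ) as name_hmac
  from ( values ('klaassen'), ('jansen') ) employees( name );
```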
One note, however. It is not enough to just hash the name of an individual if one can still make out in which department he or she works and which courses he or she attended. Two or three queries further and any SQL script kiddie can identify people. So, please pseudonymise more than just the name.
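To sketch the problem (the department values below are made up): every hashed name that is alone in its department is effectively de-pseudonymised by the quasi-identifier sitting next to it.

```sql
-- a department with only one employee gives the hashed name away immediately
select md5( name ) as name,
       department,
       count(*) over ( partition by department ) as dept_size
  from ( values ('klaassen', 'hr')
              , ('jansen', 'facilities')
       ) employees( name, department );
-- every row with dept_size = 1 points at exactly one real person
```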
Get Back! Get Back!
For third parties it is not easy to go from an md5 sum back to the real name of the subject; i.e. the process of going from harmless back to risky is hard. I will leave exactly how hard to the math people.
For a trusted DBA this is not hard at all. Given that he or she knows all available names, firstnames, inbetweens and departments, it is not that hard to write a 15-line program which checks who retrieved leaked data and on which day exactly. The bigger the cardinality, the bigger the job of deciphering names. This is in line with recital 26 of the GDPR and will help you make data harmless.
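A minimal sketch of such a check, in SQL rather than a separate program; the leaked value below is a placeholder, the date range is a guess, and the candidate users are simply taken from pg_user.

```sql
-- try every known name against every plausible day and every database user;
-- a match tells you who pulled the data and when
select e.name, d.day::date as leak_date, u.username
  from ( values ('klaassen'), ('jansen') ) e( name )
 cross join generate_series( date '2018-01-01', date '2018-12-31', interval '1 day' ) d( day )
 cross join ( select usename as username from pg_user ) u
 where md5( e.name || to_char( d.day, 'yyyymmdd' ) || u.username )
       = '00000000000000000000000000000000';  -- placeholder for the leaked md5 sum
```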
An implementation please
For any scheme to work, it is vital that applications need no change.
For an implementation I have given a lot away already in the text above. But it is a bit clumsy to prepare a view for every possible authorisation level. When adding or deleting columns in the original table, one must redo all views, for example; the same holds when changing types.
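A minimal sketch of that view approach, with a made-up view name, to show where the clumsiness comes from:

```sql
-- one view (and one set of grants) per authorisation level
create view employees_pseudonymised as
select md5( name )                      as name,
       md5( firstname )                 as firstname,
       md5( coalesce( inbetween, '' ) ) as inbetween,
       department
  from employees;

grant select on employees_pseudonymised to hr_interns;
-- every column added to or dropped from employees means revisiting this view
```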
Another solution is to use rules. These can be used to rewrite queries and allow the application to keep using the data model it was built for.
CREATE RULE "harmless" AS ON SELECT TO employees DO INSTEAD SELECT pseudonymise( name ) as name, pseudonymise( firstname ) as firstname, col2, col3 FROM employees;
I will implement and test these rules and some authorisation levels in a lifelike environment later; check PostgreSQL RULES for pseudonymisation at litpro.nl for more on this and on why the above doesn't work exactly.
If you need to implement different authorisation levels, this gets quite complicated. There is row-level security, which ties access to roles and role membership automatically, and there is column-level security in the same vein.
The snag is that columns a role has no privilege on cannot even be mentioned in the select-clause without a permission error being returned. This makes applications barf.
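A minimal sketch of how that plays out with plain column-level security, reusing the hr_interns role from later in this text:

```sql
grant select ( department ) on employees to hr_interns;

-- as a member of hr_interns:
select department from employees;        -- fine
select name, department from employees;  -- fails with a permission error on table employees
```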
One would like access to any column in a table for any role, but for some columns some kind of scrambling should be applied first. If those columns are then exempt from being updated or inserted (with traditional column-level security), things are fine and it works with any application.
This is a feature which does not exist in SQL, nor in any implementation I know of.
One of the feature requests for PostgreSQL under the flag of the GDPR now is: in a GRANT SELECT statement, allow, in the column-list, a way to apply some kind of function to columns. The following could be valid SQL (this one allows everything to the head of HR, and likewise, but scrambled, to HR interns):
```sql
GRANT SELECT ( name, firstname, inbetween, department )
    ON employees TO head_of_hr;

GRANT SELECT ( department )
    MAPPED ( name, firstname, inbetween ) BY pseudonymise( text )
    ON employees TO hr_interns;
```
About this title
The first four or five titles were written in some kind of rage after visiting the Big Data Expo, Utrecht 2017. By then I knew about the GDPR and had implemented various mechanisms to avoid running risks. The commercial heavy lifting at that expo was terrible. People should be informed about the GDPR without FUD.