1 - Static Masking

Static Masking is the simplest way to hide personal information! This idea is simply to destroy the original data or replace it with an artificial one.

The story

Over the years, Paul has collected data about his customers and their purchases in a simple database. He recently installed a brand new sales application and the old database is now obsolete. He wants to save it and he would like to remove all personal information before archiving it.

How it works

Learning Objective

In this section, we will learn:

  • How to write simple masking rules
  • The advantage and limitations of static masking
  • The concept of "Singling Out" a person

The "customer" table

DROP TABLE IF EXISTS customer CASCADE;
DROP TABLE IF EXISTS payout CASCADE;
CREATE TABLE customer ( id SERIAL PRIMARY KEY, firstname TEXT, lastname TEXT, phone TEXT, birth DATE, postcode TEXT );

Insert a few persons:

INSERT INTO customer
VALUES (107,'Sarah','Conor','060-911-0911', '1965-10-10', '90016'),
       (258,'Luke', 'Skywalker', NULL, '1951-09-25', '90120'),
       (341,'Don', 'Draper','347-515-3423', '1926-06-01', '04520') ;
SELECT *
FROM customer;
id firstname lastname phone birth postcode
107 Sarah Conor 060-911-0911 1965-10-10 90016
258 Luke Skywalker None 1951-09-25 90120
341 Don Draper 347-515-3423 1926-06-01 04520

The "payout" table

Sales are tracked in a simple table:

CREATE TABLE payout ( id SERIAL PRIMARY KEY, fk_customer_id INT REFERENCES customer(id), order_date DATE, payment_date DATE, amount INT );

Let's add some orders:

INSERT INTO payout
VALUES (1,107,'2021-10-01','2021-10-01', '7'),
       (2,258,'2021-10-02','2021-10-03', '20'),
       (3,341,'2021-10-02','2021-10-02', '543'),
       (4,258,'2021-10-05','2021-10-05', '12'),
       (5,258,'2021-10-06','2021-10-06', '92') ;

Activate the extension

CREATE EXTENSION IF NOT EXISTS anon;

Declare the masking rules

Paul wants to hide the last name and the phone numbers of his clients. He will use the dummy_last_name() and partial() functions for that:

SECURITY LABEL FOR anon ON COLUMN customer.lastname IS 'MASKED WITH FUNCTION anon.dummy_last_name()';
SECURITY LABEL FOR anon ON COLUMN customer.phone IS 'MASKED WITH FUNCTION anon.partial(phone,2,$$X-XXX-XX$$,2)';

Apply the rules permanently

SELECT anon.anonymize_table('customer');
anonymize_table
True
SELECT id,
       firstname,
       lastname,
       phone
FROM customer;
id firstname lastname phone
107 Sarah Mayert 06X-XXX-XX11
258 Luke Anderson None
341 Don Mraz 34X-XXX-XX23

This is called Static Masking because the real data has been permanently replaced. We'll see later how we can use dynamic anonymization or anonymous exports.

Exercices

E101 - Mask the client's first names

Declare a new masking rule and run the static anonymization function again.

E102 - Hide the last 3 digits of the postcode

Paul realizes that the postcode gives a clear indication of where his customers live. However he would like to have statistics based on their postcode area.

Add a new masking rule to replace the last 3 digits by 'x'.

E103 - Count how many clients live in each postcode area?

Aggregate the customers based on their anonymized postcode.

E104 - Keep only the year of each birth date

Paul wants age-based statistic. But he also wants to hide the real birth date of the customers.

Replace all the birth dates by January 1rst, while keeping the real year.

Hint

You can use the [make_date] or [date_trunc] functions !

E105 - Singling out a customer

Even if the "customer" is properly anonymized, we can still isolate a given individual based on data stored outside of the table. For instance, we can identify the best client of Paul's boutique with a query like this:

WITH best_client AS
  (SELECT SUM(amount),
          fk_customer_id
   FROM payout
   GROUP BY fk_customer_id
   ORDER BY 1 DESC
   LIMIT 1)
SELECT c.*
FROM customer c
JOIN best_client b ON (c.id = b.fk_customer_id)
id firstname lastname phone birth postcode
341 Don Mraz 34X-XXX-XX23 1926-06-01 04520

Note

This is called [Singling Out] a person.

We need to anonymize even further by removing the link between a person and its company. In the order table, this link is materialized by a foreign key on the field fk_company_id. However we can't remove values from this column or insert fake identifiers because if would break the foreign key constraint.


How can we separate the customers from their payouts while respecting the integrity of the data?

Find a function that will shuffle the column fk_company_id of the payout table

Tip

Check out the [static masking] section of the [documentation].

Solutions

S101

SECURITY LABEL
FOR anon ON COLUMN customer.firstname IS 'MASKED WITH FUNCTION anon.dummy_first_name()';


SELECT anon.anonymize_table('customer');


SELECT id,
       firstname,
       lastname
FROM customer;

S102

SECURITY LABEL
FOR anon ON COLUMN customer.postcode IS 'MASKED WITH FUNCTION anon.partial(postcode,2,$$xxx$$,0)';


SELECT anon.anonymize_table('customer');


SELECT id,
       firstname,
       lastname,
       postcode
FROM customer;

S103

SELECT postcode,
       COUNT(id)
FROM customer
GROUP BY postcode;
postcode count
90xxx 2
04xxx 1

S104

SECURITY LABEL FOR anon ON FUNCTION pg_catalog.date_trunc(text,interval) IS 'TRUSTED';
SECURITY LABEL FOR anon ON COLUMN customer.birth IS $$ MASKED WITH FUNCTION pg_catalog.date_trunc('year',birth) $$;
SELECT anon.anonymize_table('customer');
SELECT id, firstname, lastname, birth FROM customer;

S105

Let's mix up the values of the fk_customer_id:

SELECT anon.shuffle_column('payout', 'fk_customer_id', 'id');
shuffle_column
True

Now let's try to single out the best client again :

WITH best_client AS
  (SELECT SUM(amount),
          fk_customer_id
   FROM payout
   GROUP BY fk_customer_id
   ORDER BY 1 DESC
   LIMIT 1)
SELECT c.*
FROM customer c
JOIN best_client b ON (c.id = b.fk_customer_id);
id firstname lastname phone birth postcode
341 Sadye Bins 34X-XXX-XX23 1926-01-01 04xxx

WARNING

Note that the link between a customer and its payout is now completely false. For instance, if a customer A had 2 payouts. One of these payout may be linked to a customer B, while the second one is linked to a customer C.

In other words, this shuffling method with respect the foreign key constraint (aka the referential integrity) but it will break the data integrity. For some use case, this may be a problem.

In this case, Pierre will not be able to produce a BI report with the shuffle data, because the links between the customers and their payments are fake.