3- Anonymous Dumps
In many situation, what we want is simply to export the anonymized data into another database (for testing or to produce statistics). This is what pg_dump_anon does!
The Story
Paul has a website and a comment section where customers can express their views.
He hired a web agency to develop a new design for his website. The
agency asked for a SQL export (dump) of the current website database.
Paul wants to clean
the database export and remove any personal
information contained in the comment section.
How it works
Learning Objective
- Extract the anonymized data from the database
- Write a custom masking function to handle a JSON field.
Load the data
DROP TABLE IF EXISTS website_comment CASCADE;
CREATE TABLE website_comment (id SERIAL PRIMARY KEY,
message JSONB);
INSERT INTO website_comment
VALUES (1, json_build_object('meta', json_build_object('name', 'Lee Perry', 'ip_addr','40.87.29.113'), 'content', 'Hello Nasty!')),
(2, json_build_object('meta', json_build_object('name', '', 'email', 'biz@bizmarkie.com'), 'content', 'Great Shop')),
(3,json_build_object('meta', json_build_object('name','Jimmy'), 'content','Hi ! This is me, Jimmy James'));
Check the content of the website comments:
SELECT message->'meta'->'name' AS name,
message->'content' AS content
FROM website_comment
ORDER BY id ASC
name | content |
---|---|
Lee Perry | Hello Nasty! |
Great Shop | |
Jimmy | Hi ! This is me, Jimmy James |
Activate the extension
CREATE EXTENSION IF NOT EXISTS anon;
Masking a JSON column
The comment
field is filled with personal information and the fact the
field does not have a standard schema makes our tasks harder.
In general, unstructured data are difficult to mask.
As we can see, web visitors can write any kind of information in the comment section. Our best option is to remove this key entirely because there's no way to extract personal data properly.
We can clean the comment column simply by removing the content
key!
SELECT message - ARRAY['content']
FROM website_comment
WHERE id=1;
?column? |
---|
{\'meta\': {\'name\': \'Lee Perry\', \'ip_addr\': \'40.87.29.113\'}} |
First let's create a dedicated schema and declare it as trusted. This
means the anon
extension will accept the functions located in this
schema as valid masking functions. Only a superuser should be able to
add functions in this schema.
CREATE SCHEMA IF NOT EXISTS my_masks;
SECURITY LABEL
FOR anon ON SCHEMA my_masks IS 'TRUSTED';
Now we can write a function that remove the message content:
CREATE OR REPLACE FUNCTION my_masks.remove_content(j JSONB) RETURNS JSONB AS $func$ SELECT j - ARRAY['content'] $func$ LANGUAGE SQL ;
Let's try it!
SELECT my_masks.remove_content(message)
FROM website_comment
remove_content |
---|
{\'meta\': {\'name\': \'Lee Perry\', \'ip_addr\': \'40.87.29.113\'}} |
{\'meta\': {\'name\': \'\', \'email\': \'biz@bizmarkie.com\'}} |
{\'meta\': {\'name\': \'Jimmy\'}} |
And now we can use it in a masking rule:
SECURITY LABEL
FOR anon ON COLUMN website_comment.message IS 'MASKED WITH FUNCTION my_masks.remove_content(message)';
CREATE ROLE dump_anon LOGIN PASSWORD 'CHANGEME';
ALTER ROLE dump_anon
SET anon.transparent_dynamic_masking TO TRUE;
SECURITY LABEL
FOR anon ON ROLE dump_anon IS 'MASKED';
GRANT pg_read_all_data TO dump_anon;
For convenience, add a new entry in the .pgpass
file.
cat > ~/.pgpass << EOL
*:*:boutique:dump_anon:CHANGEME
EOL
Finally we can export an anonymous dump of the table with pg_dump
:
export PATH=$PATH:$(pg_config --bindir)
export PGHOST=localhost
pg_dump -U dump_anon boutique --table=website_comment > /tmp/dump.sql
Exercices
E301 - Dump the anonymized data into a new database
Create a database named boutique_anon
and transfer the entire database
into it.
E302 - Pseudonymize the meta fields of the comments
Pierre plans to extract general information from the metadata. For instance, he wants to calculate the number of unique visitors based on the different IP addresses. But an IP address is an indirect identifier, so Paul needs to anonymize this field while maintaining the fact that some values appear multiple times.
Replace the remove_content
function with a better one called
clean_comment
that will:
- Remove the content key
- Replace the
name
value with a fake last name - Replace the
ip_address
value with its MD5 signature - Nullify the
email
key
HINT: Look at the
jsonb_set()
andjsonb_build_object()
functions
Solutions
S301
export PATH=$PATH:$(pg_config --bindir)
export PGHOST=localhost
dropdb -U paul --if-exists boutique_anon
createdb -U paul boutique_anon --owner paul
pg_dump -U dump_anon boutique | psql -U paul --quiet boutique_anon
export PGHOST=localhost
psql -U paul boutique_anon -c 'SELECT COUNT(*) FROM company'
S302
CREATE OR REPLACE FUNCTION my_masks.clean_comment(message JSONB) RETURNS JSONB VOLATILE LANGUAGE SQL AS $func$ SELECT jsonb_set( message, ARRAY['meta'], jsonb_build_object( 'name',anon.fake_last_name(), 'ip_address', md5((message->'meta'->'ip_addr')::TEXT), 'email', NULL ) ) - ARRAY['content']; $func$;
SELECT my_masks.clean_comment(message)
FROM website_comment;
clean_comment |
---|
{\'meta\': {\'name\': \'Marshall\', \'email\': None, \'ip_address\': \'1d8cbcdef988d55982af1536922ddcd1\'}} |
{\'meta\': {\'name\': \'Campbell\', \'email\': None, \'ip_address\': None}} |
{\'meta\': {\'name\': \'Dodson\', \'email\': None, \'ip_address\': None}} |
SECURITY LABEL
FOR anon ON COLUMN website_comment.message IS 'MASKED WITH FUNCTION my_masks.clean_comment(message)';