Various Masking Strategies
The extension provides functions to implement 8 main anonymization strategies:
- Destruction
- Adding Noise
- Randomization
- Faking
- Advanced Faking
- Pseudonymization
- Generic Hashing
- Partial scrambling
- Conditional masking
- Generalization
- Using pg_catalog functions
- Image bluring
- Write your own Masks !
Depending on your data, you may need to use different strategies on different columns :
- For names and other 'direct identifiers' , Faking is often useful
- Shuffling is convenient for foreign keys
- Adding Noise is interesting for numeric values and dates
- Partial Scrambling is perfect for email address and phone numbers
- etc.
Destruction
First of all, the fastest and safest way to anonymize a data is to destroy it :-)
In many cases, the best approach to hide the content of a column is to replace all the values with a single static value.
For instance, you can replace a entire column by the word 'CONFIDENTIAL' like this:
SECURITY LABEL FOR anon
ON COLUMN users.address
IS 'MASKED WITH VALUE ''CONFIDENTIAL'' ';
Adding Noise
This is also called Variance. The idea is to "shift" dates and numeric values. For example, by applying a +/- 10% variance to a salary column, the dataset will remain meaningful.
-
anon.noise(original_value,ratio)
where original_value can be aninteger
, abigint
or adouble precision
. If the ratio is 0.33, the return value will be the original value randomly shifted with a ratio of +/- 33% -
anon.dnoise(original_value, interval)
where original_value can be a date, a timestamp, or a time. If interval = '2 days', the return value will be the original value randomly shifted by +/- 2 days
WARNING : The noise()
masking functions are vulnerable to a form of
repeat attack, especially with Dynamic Masking. A masked user can guess
an original value by requesting its masked value multiple times and then simply
use the AVG()
function to get a close approximation. (See
demo/noise_reduction_attack.sql
for more details). In a nutshell, these
functions are best fitted for Anonymous Dumps and Static Masking.
They should be avoided when using Dynamic Masking.
Randomization
The extension provides a large choice of functions to generate purely random data :
Basic Random values
anon.random_date()
returns a dateanon.random_string(n)
returns a TEXT value containingn
lettersanon.random_zip()
returns a 5-digit codeanon.random_phone(p)
returns a 8-digit phone withp
as a prefixanon.random_hash(seed)
returns a hash of a random string for a given seed
Random between
To pick any value inside between two bounds:
anon.random_date_between(d1,d2)
returns a date betweend1
andd2
anon.random_int_between(i1,i2)
returns an integer betweeni1
andi2
anon.random_bigint_between(b1,b2)
returns a bigint betweenb1
andb2
NOTE: With these functions, the lower and upper bounds are included.
For instance anon.random_int_between(1,3)
returns either 1, 2 or 3.
For more advanced interval descriptions, check out the [Random in Range] section.
Random in Array
The random_in
function returns an element a given array
For example:
anon.random_in(ARRAY[1,2,3])
returns an int between 1 and 3anon.random_in(ARRAY['red','green','blue'])
returns a text
Random in Enum
This is one especially useful when working with ENUM
types!
anon.random_in_enum(variable_of_an_enum_type)
returns any val
CREATE TYPE card AS ENUM ('visa', 'mastercard', ‘amex’);
SELECT anon.random_in_enum(NULL::CARD);
random_in_enum
----------------
mastercard
CREATE TABLE customer (
id INT,
...
credit_card CARD
);
SECURITY LABEL FOR anon ON COLUMN customer.creditcard
IS 'MASKED WITH FUNCTION anon.random_in_enum(creditcard)'
Random in Range
RANGE types are a powerful way to describe an interval of values, where can define inclusive or excluvive bounds:
https://www.postgresql.org/docs/current/rangetypes.html#RANGETYPES-EXAMPLES
There a function for each subtype of range:
anon.random_in_int4range('[5,6)')
returns an INT of value 5anon.random_in_int8range('(6,7]')
returns a BIGINT of value 7anon.random_in_numrange('[0.1,0.9]')
returns a NUMERIC between 0.1 and 0.9anon.random_in_daterange('[2001-01-01, 2001-12-31)')
returns a date in 2001anon.random_in_tsrange('[2022-10-01,2022-10-31]')
returns a TIMESTAMP in october 2022anon.random_in_tstzrange('[2022-10-01,2022-10-31]')
returns a TIMESTAMP WITH TIMEZONE in october 2022
NOTE: It is not possible to get a random value from a RANGE with an
infinite bound. For example anon.random_in_int4range('[2022,)')
returns NULL.
Random Sequence ID
When masking a SERIAL columns it can be useful to general a UNIQUE
value
based on a sequence.
anon.random_id()
returns a BIGINTanon.random_id_int()
returns a INTanon.random_id_small_int()
returns a SMALLINT
Each call to these functions will return a incremented value much like the [nextval()] function.
At any time, you can reset the current sequence value with a new value. For instance:
SELECT pg_catalog.setval('anon.random_id_seq', 42);
Faking
The idea of Faking is to replace sensitive data with random-but-plausible values. The goal is to avoid any identification from the data record while remaining suitable for testing, data analysis and data processing.
In order to use the faking functions, you have to init()
the extension
in your database first:
SELECT anon.init();
The init()
function will import a default dataset of random data (iban,
names, cities, etc.).
This dataset is in English and very small ( 1000 values for each category ). If you want to use localized data or load a specific dataset, please read the Custom Fake Data section.
Once the fake data is loaded, you have access to these faking functions:
anon.fake_address()
returns a complete post addressanon.fake_city()
returns an existing cityanon.fake_country()
returns a countryanon.fake_company()
returns a generic company nameanon.fake_email()
returns a valid email addressanon.fake_first_name()
returns a generic first nameanon.fake_iban()
returns a valid IBANanon.fake_last_name()
returns a generic last nameanon.fake_postcode()
returns a valid zipcodeanon.fake_siret()
returns a valid SIRET
For TEXT and VARCHAR columns, you can use the classic Lorem Ipsum generator:
anon.lorem_ipsum()
returns 5 paragraphsanon.lorem_ipsum(2)
returns 2 paragraphsanon.lorem_ipsum( paragraphs := 4 )
returns 4 paragraphsanon.lorem_ipsum( words := 20 )
returns 20 wordsanon.lorem_ipsum( characters := 7 )
returns 7 charactersanon.lorem_ipsum( characters := anon.length(table.column) )
returns the same amount of characters as the original string
Advanced Faking
Generating fake data is a complex topic. The fake_
functions provided above
are limited to basic use case. For more advanced faking methods, in particular
if you are looking for localized fake data, PostgreSQL Anonymizer
provides an advanced faking engine with localisation support.
This engine (fake-rs) is available via more than 70 functions with the
dummy_
prefix:
tips:
The fake_* and dummy_* functions achieve the same goal.
The fake_* functions are the first implementation in pl/pgsql. They were introduced in Version 1. It's a rather naïve and limited approach.
The dummy_* functions are a new implementation based on a Rust library. It provides a more advanced fake generator and adds localization. It was introduced in Version 2.
New users should always prefer the dummy_* functions. The fake_* functions are kept for backward compatibility.
- anon.dummy_bic()
- anon.dummy_bs()
- anon.dummy_bs_adj()
- anon.dummy_bs_noun()
- anon.dummy_bs_verb()
- anon.dummy_building_number()
- anon.dummy_buzzword()
- anon.dummy_buzzword_middle()
- anon.dummy_buzzword_tail()
- anon.dummy_catchphrase()
- anon.dummy_cell_number()
- anon.dummy_city_name()
- anon.dummy_city_prefix()
- anon.dummy_city_suffix()
- anon.dummy_color()
- anon.dummy_company_name()
- anon.dummy_company_suffix()
- anon.dummy_country_code()
- anon.dummy_country_name()
- anon.dummy_credit_card_number()
- anon.dummy_currency_code()
- anon.dummy_currency_name()
- anon.dummy_currency_symbol()
- anon.dummy_dir_path()
- anon.dummy_domain_suffix()
- anon.dummy_file_extension()
- anon.dummy_file_name()
- anon.dummy_file_path()
- anon.dummy_first_name()
- anon.dummy_free_email()
- anon.dummy_free_email_provider()
- anon.dummy_health_insurance_code()
- anon.dummy_hex_color()
- anon.dummy_hsl_color()
- anon.dummy_hsla_color()
- anon.dummy_industry()
- anon.dummy_ip()
- anon.dummy_ipv4()
- anon.dummy_ipv6()
- anon.dummy_isbn()
- anon.dummy_isbn13()
- anon.dummy_isin()
- anon.dummy_last_name()
- anon.dummy_latitude()
- anon.dummy_licence_plate()
- anon.dummy_longitude()
- anon.dummy_mac_address()
- anon.dummy_name()
- anon.dummy_name_with_title()
- anon.dummy_phone_number()
- anon.dummy_post_code()
- anon.dummy_profession()
- anon.dummy_rfc_status_code()
- anon.dummy_rgb_color()
- anon.dummy_rgba_color()
- anon.dummy_safe_email()
- anon.dummy_secondary_address()
- anon.dummy_secondary_address_type()
- anon.dummy_state_abbr()
- anon.dummy_state_name()
- anon.dummy_street_name()
- anon.dummy_street_suffix()
- anon.dummy_suffix()
- anon.dummy_timezone()
- anon.dummy_title()
- anon.dummy_user_agent()
- anon.dummy_username()
- anon.dummy_uuidv1()
- anon.dummy_uuidv3()
- anon.dummy_uuidv4()
- anon.dummy_uuidv5()
- anon.dummy_valid_statux_code()
- anon.dummy_word()
- anon.dummy_words(int4range)
- anon.dummy_zip_code()
For each of this function, you can add the _locale(...)
suffix and specify
in which local context you want.
For example:
SELECT anon.dummy_last_name();
dummy_last_name
------------------------
Tillman
SELECT anon.dummy_last_name_locale('fr_FR');
dummy_last_name_locale
------------------------
Granier
SELECT anon.dummy_last_name_locale('pt_BR');
dummy_last_name_locale
------------------------
Barreto
Currently 7 locales are available: ar_SA, en_US(default), fr_FR, ja_JP, pt_BR, zh_CN, zh_TW.
Not that some dummy_
functions are not implemented for certain locales.
If you wish to contribute or ask for missing fake data, please contact directly
the fake-rs project, which is the library that this extension is using under the
hood !
Pseudonymization
Pseudonymization is similar to Faking in the sense that it generates realistic values. The main difference is that the pseudonymization is deterministic : the functions always will return the same fake value based on a seed and an optional salt.
In order to use the faking functions, you have to init()
the extension
in your database first:
SELECT anon.init();
Once the fake data is loaded you have access to 10 pseudo functions:
anon.pseudo_first_name(seed,salt)
returns a generic first nameanon.pseudo_last_name(seed,salt)
returns a generic last nameanon.pseudo_email(seed,salt)
returns a valid email addressanon.pseudo_city(seed,salt)
returns an existing cityanon.pseudo_country(seed,salt)
returns a countryanon.pseudo_company(seed,salt)
returns a generic company nameanon.pseudo_iban(seed,salt)
returns a valid IBANanon.pseudo_siret(seed,salt)
returns a valid SIRET
The second argument (salt
) is optional. You can call each function with
only the seed like this anon.pseudo_city('bob')
. The salt is here to increase
complexity and avoid dictionary and brute force attacks (see warning below).
If a specific salt is not given, the value of the anon.salt
GUC parameter is
used instead (see the Generic Hashing section for more details).
The seed can be any information related to the subject. For instance, we can consistently generate the same fake email address for a given person by using her login as the seed :
SECURITY LABEL FOR anon
ON COLUMN users.emailaddress
IS 'MASKED WITH FUNCTION anon.pseudo_email(users.login) ';
NOTE : You may want to produce unique values using a pseudonymization
function. For instance, if you want to mask an email
column that is declared
as UNIQUE
. In this case, you will need to initialize the extension with a fake
dataset that is way bigger than the numbers of rows of the table. Otherwise you
may see some "collisions" happening, i.e. two different original values producing
the same pseudo value.
WARNING : Pseudonymization is often confused with anonymization but in fact
they serve 2 different purposes : pseudonymization
is a way to protect the
personal information but the pseudonymized data is still "linked" to the real data.
The GDPR makes it very clear that personal data which has undergone
pseudonymization is still related to a person. (see GDPR Recital 26)
Generic hashing
In theory, hashing is not a valid anonymization technique, however in practice it is sometimes necessary to generate a determinist hash of the original data.
For instance, when a pair of primary key / foreign key is a "natural key", it may contain actual information ( like a customer number containing a birth date or something similar).
Hashing such columns allows to keep referential integrity intact even for relatively unusual source data. Therefore, the
-
anon.digest(value,salt,algorithm)
lets you choose a salt, and a hash algorithm from a pre-defined list -
anon.hash(value)
will return a text hash of the value using a secret salt (defined by theanon.salt
parameter) and hash algorithm (defined by theanon.algorithm
parameter). The default value ofanon.algorithm
issha256
and possible values are: md5, sha224, sha256, sha384 or sha512. The default value ofanon.salt
is an empty string. You can modify these values with:
sql
ALTER DATABASE foo SET anon.salt TO 'xsfnjefnjsnfjsnf';
ALTER DATABASE foo SET anon.algorithm TO 'sha384';
Keep in mind that hashing is a form a Pseudonymization. This means that the
data can be "de-anonymized" using the hashed value and the masking function. If an
attacker gets access to these 2 elements, he or she could re-identify
some persons using brute force
or dictionary
attacks. Therefore, the
salt and the algorithm used to hash the data must be protected with the
same level of security that the original dataset.
In a nutshell, we recommend that you use the anon.hash()
function rather than
anon.digest()
because the salt will not appear clearly in the masking rule.
Furthermore: in practice the hash function will return a long string of character like this:
SELECT anon.hash('bob');
hash
----------------------------------------------------------------------------------------------------------------------------------
95b6accef02c5a725a8c9abf19ab5575f99ca3d9997984181e4b3f81d96cbca4d0977d694ac490350e01d0d213639909987ef52de8e44d6258d536c55e427397
For some columns, this may be too long and you may have to cut some parts the hash in order to fit into the column. For instance, if you have a foreign key based on a phone number and the column is a VARCHAR(12) you can transform the data like this:
SECURITY LABEL FOR anon ON COLUMN people.phone_number
IS 'MASKED WITH FUNCTION anon.left(anon.hash(phone_number),12)';
SECURITY LABEL FOR anon ON COLUMN call_history.fk_phone_number
IS 'MASKED WITH FUNCTION anon.left(anon.hash(fk_phone_number),12)';
Of course, cutting the hash value to 12 characters will increase the risk of "collision" (2 different values having the same fake hash). In such case, it's up to you to evaluate this risk.
Partial Scrambling
Partial scrambling leaves out some part of the data. For instance : a credit card number can be replaced by '40XX XXXX XXXX XX96'.
2 functions are available:
anon.partial('abcdefgh',1,'xxxx',3)
will return 'axxxxfgh';anon.partial_email('daamien@gmail.com')
will become 'da**@gm****.com'
Conditional Masking
In some situations, you may want to apply a masking filter only for some value or for a limited number of lines in the table.
For instance, if you want to "preserve NULL values", i.e. masking only the lines
that contains a value, you can use the anon.ternary
function, which works
like a CASE WHEN x THEN y ELSE z
statement :
SECURITY LABEL FOR anon ON COLUMN player.score
IS 'MASKED WITH FUNCTION anon.ternary(score IS NULL,
NULL,
anon.random_int_between(0,100));
You may also want to exclude some lines within the table. Like keeping the password of some users so that they still may be able to connect to a testing deployment of your application:
SECURITY LABEL FOR anon ON COLUMN account.password
IS 'MASKED WITH FUNCTION anon.ternary( id > 1000, NULL::TEXT, password)';
WARNING : Conditional masking may create a partially deterministic "connection" between the original data and the masked data. And that connection can be used to retrieve personal information from the masked data. For instance, if NULL values are preserved for a "deceased_date" column, it will reveal which persons are still actually alive... In a nutshell: conditional masking may often produce a dataset that is not fully anonymized and therefore would still technically contain personal information.
Generalization
Generalization is the principle of replacing the original value by a range containing this value. For instance, instead of saying 'Paul is 42 years old', you would say 'Paul is between 40 and 50 years old'.
The generalization functions are a data type transformation. Therefore it is not possible to use them with the dynamic masking engine. However they are useful to create anonymized views. See example below.
Let's imagine a table containing health information:
SELECT * FROM patient;
id | name | zipcode | birth | disease
----+----------+----------+------------+---------------
1 | Alice | 47678 | 1979-12-29 | Heart Disease
2 | Bob | 47678 | 1959-03-22 | Heart Disease
3 | Caroline | 47678 | 1988-07-22 | Heart Disease
4 | David | 47905 | 1997-03-04 | Flu
5 | Eleanor | 47909 | 1999-12-15 | Heart Disease
6 | Frank | 47906 | 1968-07-04 | Cancer
7 | Geri | 47605 | 1977-10-30 | Heart Disease
8 | Harry | 47673 | 1978-06-13 | Cancer
9 | Ingrid | 47607 | 1991-12-12 | Cancer
We can build a view upon this table to suppress some columns ( SSN
and name
) and generalize the zipcode and the birth date like
this:
CREATE VIEW anonymized_patient AS
SELECT
'REDACTED' AS lastname,
anon.generalize_int4range(zipcode,100) AS zipcode,
anon.generalize_tsrange(birth,'decade') AS birth
disease
FROM patients;
The anonymized table now looks like that:
SELECT * FROM anonymized_patient;
lastname | zipcode | birth | disease
----------+---------------+-----------------------------+---------------
REDACTED | [47600,47700) | ["1970-01-01","1980-01-01") | Heart Disease
REDACTED | [47600,47700) | ["1950-01-01","1960-01-01") | Heart Disease
REDACTED | [47600,47700) | ["1980-01-01","1990-01-01") | Heart Disease
REDACTED | [47900,48000) | ["1990-01-01","2000-01-01") | Flu
REDACTED | [47900,48000) | ["1990-01-01","2000-01-01") | Heart Disease
REDACTED | [47900,48000) | ["1960-01-01","1970-01-01") | Cancer
REDACTED | [47600,47700) | ["1970-01-01","1980-01-01") | Heart Disease
REDACTED | [47600,47700) | ["1970-01-01","1980-01-01") | Cancer
REDACTED | [47600,47700) | ["1990-01-01","2000-01-01") | Cancer
The generalized values are still useful for statistics because they remain true, but they are less accurate, and therefore reduce the risk of re-identification.
PostgreSQL offers several RANGE data types which are perfect for dates and numeric values.
For numeric values, 3 functions are available:
generalize_int4range(value, step)
generalize_int8range(value, step)
generalize_numrange(value, step)
...where value
is the data that will be generalized, and step
is the size of
each range.
Using pg_catalog
functions
Since version 1.3, the pg_catalog
schema is not trusted by default. This is
a security measure designed to prevent users from using sophisticated functions
inside masking rules (such as pg_catalog.query_to_xml
, pg_catalog.ts_stat
or the system administration functions ) that should not be used as masking
functions.
However, the extension allows using some useful and safe functions from the
pg_catalog
schema for your convenience. These are small subset of functions
that are declared as TRUSTED
for anonymization.
The list of TRUSTED pg_catalog functions is available via the
anon.pg_trusted_functions
views :
SELECT * FROM anon.pg_trusted_functions;
If you need to use a pg_catalog function which is not in this list, you can ask a superuser to trust it with:
SECURITY LABEL FOR anon ON FUNCTION pg_catalog.foo IS 'TRUSTED';
Note: Even when multiple masking policies are defined, the functions must be declared as
TRUSTED
in the "anon" policy and they will be trusted for all policies.
Image bluring
Images can show some sensitive data, for example
- A photo concerning personal data.
- A barcode representing personal data.
it is possible to blur this image using
anon.image_blur(data,sigma)
returns a bytea- data type
bytea
: the image data - sigma type
numeric
: This parameter controls the amount of blurring. A higher sigma value results in a more blurred image, while a lower sigma value results in a less blurred image.
usage :
CREATE TABLE images (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
image_data BYTEA NOT NULL
);
create extension anon;
SELECT anon.init();
SECURITY LABEL FOR anon ON COLUMN images.image_data
IS 'MASKED WITH FUNCTION anon.image_blur(image_data,1.0)';
SELECT anon.anonymize_database();
Write your own Masks !
You can also use your own function as a mask. The function must either be destructive (like Partial Scrambling) or insert some randomness in the dataset (like Faking).
Especially for complex data types, you may have to write your own function. This will be a common use case if you have to hide certain parts of a JSON field.
For example:
CREATE TABLE company (
business_name TEXT,
info JSONB
)
The info
field contains unstructured data like this:
SELECT jsonb_pretty(info) FROM company WHERE business_name = 'Soylent Green';
jsonb_pretty
----------------------------------
{
"employees": [
{
"lastName": "Doe",
"firstName": "John"
},
{
"lastName": "Smith",
"firstName": "Anna"
},
{
"lastName": "Jones",
"firstName": "Peter"
}
]
}
(1 row)
Using the PostgreSQL JSON functions and operators, you can walk through the keys and replace the sensitive values as needed.
CREATE SCHEMA custom_masks;
CREATE FUNCTION custom_masks.remove_last_name(j JSONB)
RETURNS JSONB
VOLATILE
LANGUAGE SQL
AS $func$
SELECT
json_build_object(
'employees' ,
array_agg(
jsonb_set(e ,'{lastName}', to_jsonb(anon.fake_last_name()))
)
)::JSONB
FROM jsonb_array_elements( j->'employees') e
$func$;
-- This step requires superuser privilege
SECURITY LABEL FOR anon ON FUNCTION custom_masks.remove_last_name IS 'TRUSTED';
Then check that the function is working correctly:
SELECT custom_masks.remove_last_name(info) FROM company;
When that's ok you can declare this function as the mask of
the info
field:
SECURITY LABEL FOR anon ON COLUMN company.info
IS 'MASKED WITH FUNCTION custom_masks.remove_last_name(info)';
And try it out !
# SELECT anonymize_table('company');
# SELECT jsonb_pretty(info) FROM company WHERE business_name = 'Soylent Green';
jsonb_pretty
-------------------------------------
{
"employees": [ +
{ +
"lastName": "Prawdzik",+
"firstName": "John" +
}, +
{ +
"lastName": "Baltazor",+
"firstName": "Anna" +
}, +
{ +
"lastName": "Taylan", +
"firstName": "Peter" +
} +
] +
}
(1 row)
This is just a quick and dirty example. As you can see, manipulating a sophisticated JSON structure with SQL is possible, but it can be tricky at first! There are multiple ways of walking through the keys and updating values. You will probably have to try different approaches, depending on your real JSON data and the performance you want to reach.