Wizards | What is anonymization?

What is anonymization?

#anonymization #gdpr #personaldata

It would be hard to miss all the articles on the topic of GDPR and all the various, terrifying sanctions that could be put upon an entity for non-compliance. Few, however, delve into important details, such as the significance of anonymization or data retention, which allow for avoiding all these sanctions and make the work of developers significantly easier. For this reason, we decided to explain in an accessible way what anonymization and retention of personal data are and show why their proper implementation is of such importance in the software development process. Today, let us tackle anonymization.

What is anonymization?

Anonymization is a process that allows you to permanently remove the link between personal data and the person to whom the data relates. Thanks to this, what was previously deemed as personal ceases to be that.

What does it look like in practice?

The definition above becomes less complicated when presented with an example. Let us imagine, for example, Superman – a comic hero from Krypton who wants to hide his identity and blend in with the crowd.

Name	Superman
Occupation	Superhero
Origin	Krypton

During the anonymization process, Superman enters the telephone booth, puts on glasses and a tweed suit, and becomes Clark Kent, a reporter from Kansas.

Name	Clark Kent
Occupation	Reporter
Origin	Kansas, USA

Through the anonymization process, Superman’s data turned into Clark Kent’s, and there is no connection between these two people. This is fictitious data that can be safely used, e.g. in test environments.

The example above illustrates the process of anonymization itself. Let us now consider why it is important that the anonymization is of good quality.

Irreversibility

The foundation of anonymization is its irreversibility. We should never be able to find out what the original data looked like, based on the anonymized data. Clark’s associates should not be able to discover his true identity.

When we anonymize a data set, usually only a fragment of the data will undergo change. However, we must ensure that non-anonymized data does not allow the anonymization process to be reversed for the entire set. In our example, we would not have to change Superman’s favorite color. However, if we do not anonymize his origin, we would certainly cause a sensation.

True to reality

An important qualitative measure of anonymization is also how well it imitates reality. If Superman and all other people in the data set are anonymized as follows:

Name	X
Occupation	Y
Origin	Z

we have no doubt that the process is irreversible, but its usefulness is questionable. Person X does not look like someone who exists in reality, and the nature of the original data has not been preserved. The length of the names were not preserved, and the data itself looks unbelievable with all of the people having the same name. In the case of IT systems, the tester using such data would run into a lot of issues, he would not even be able to distinguish between people.

Repeatability

Another feature of good anonymization is its repeatability. When anonymizing the data set, we want to make sure that each time the data set would be anonymized in the same way. We want Superman to always become Clark Kent, no matter whether it’s the first or the tenth anonymization. This is especially important from the point of view of Quality Assurance. Testers often create test cases based on specific data. If this data were changed every time, the tester’s work would certainly be more difficult!

Integrated systems

Today’s IT world is represented by countless systems connected with each other. Hardly any application can function as a single organism. Systems connect with each other, exchanging data and using each other’s services. Therefore, when approaching anonymization, we must consider the process not only for one system, but for many systems at once. The challenge is for anonymized data to be consistent throughout the entire ecosystem. This means that if the Daily Planet (Clark’s workplace) has a human resources system and a blog, then in both applications Superman will become Clark Kent.

Efficiency

The last key parameter affecting the quality of anonymization, from my point of view, is performance. IT systems process huge data sets measured in gigabytes or even terabytes. Anonymization of such databases can be time consuming, therefore, we must ensure not only security but also good speed of the anonymization process. One of the things Superman learned after arriving on Earth is that time is money. This saying rings even more true in the case of modern IT.

I invite everyone interested in the topic of data retention to read my next article, which I plan on publishing shortly.

Artur Żórawski, Founder & CTO