Wizards

Good-quality tests require good data: data that represents reality as accurately as possible. A copy of production data is very often used for this purpose. Such a dedicated test environment is often used to reproduce tickets, debug data-related issues and perform stress tests. This practice is most often incompatible with the GDPR. Beyond that, while the production environment is monitored and audited like a fortress and only a few people have access to it, non-production environments are treated much less restrictively, and the number of people with access to them (not including the users) is much larger. Many serious leaks of personal data were caused not by hacking into the “fortress”, but by abuse of these “unprotected settlements”.

In the area of test data, there are usually two extremes: either testers and developers process personal data in copies of the production database, or we wait half a year for test environments to be refreshed with artificial data that is usually poorly prepared. Anonymization could solve this problem, but as it turns out, implementing it is not an easy task.

Challenges associated with designing the anonymization process

Simple data masking can work in simple cases, but it quickly proves insufficient for the applications we work with every day. When reviewing existing solutions, we noticed that they did not meet our needs: most often they did not support mechanisms for maintaining data consistency between different databases. It was also difficult to find a solution that supported automation of the anonymization process. The most popular tools did not allow you to define your own generators, whether operating on a single record or taking the distribution of data into account. Anyone who tries to implement a solution that meets these requirements themselves will quickly encounter obstacles:

Simple data masking leads to application errors because it can violate the data formats the application expects, e.g. a personal identification number with a value of 8501XXXXX11 will cause validation errors. This approach also very quickly leads to duplicate data.
The use of native mechanisms (e.g. Dynamic Data Masking in SQL Server) may be sufficient for viewing data directly in the database, but it does not allow an anonymized copy of the production environment to be prepared in a secure manner. In such solutions, the data, although presented as masked, is still stored in the database in its raw form.
A solution based on SQL scripts turns out to be insufficient. Simple, context-free generators (with the number of keys on the order of several hundred thousand or more) will increasingly conflict with previously generated data, causing uniqueness violations. Solving this problem through lookups significantly degrades anonymization performance. Introducing additional database structures to store the generated identities would be cumbersome to maintain. When many schemas have to be anonymized, possibly on different database servers, this approach quickly becomes extremely inefficient, and with different database engines it becomes unusable. The scripts grow increasingly complex and require more and more time to keep running.
Changes in the database structure (e.g. a new version of the application that adds or modifies tables and columns) can make the anonymization process outdated.
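The first and third obstacles can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not a description of how any particular tool works: it assumes the identifier is an 11-digit number with a PESEL-style control digit (weights 1, 3, 7, 9, ...), simplifies the date part, and uses hypothetical function names. It shows that a masked value fails an application-style format check, that a generated value can pass it, and that generating values independently of each other ("context-free") starts to repeat identifiers once the number of records grows.

```python
import random

# Weights of the PESEL-style checksum (the identifier format is an assumption).
WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)


def checksum_digit(first_ten: str) -> int:
    """Return the control digit for the first ten digits."""
    total = sum(int(d) * w for d, w in zip(first_ten, WEIGHTS))
    return (10 - total % 10) % 10


def is_valid(number: str) -> bool:
    """Mimic an application-side check: 11 digits and a matching control digit."""
    if len(number) != 11 or not (number.isascii() and number.isdigit()):
        return False
    return checksum_digit(number[:10]) == int(number[10])


def generate_id(rng: random.Random) -> str:
    """Generate a synthetic, checksum-valid identifier (simplified date part)."""
    date_part = (
        f"{rng.randrange(100):02d}{rng.randint(1, 12):02d}{rng.randint(1, 28):02d}"
    )
    serial = f"{rng.randrange(10_000):04d}"
    first_ten = date_part + serial
    return first_ten + str(checksum_digit(first_ten))


def count_duplicates(rng: random.Random, n: int) -> int:
    """Count how many of n independently generated identifiers repeat an earlier one."""
    seen = set()
    duplicates = 0
    for _ in range(n):
        value = generate_id(rng)
        if value in seen:
            duplicates += 1
        seen.add(value)
    return duplicates


if __name__ == "__main__":
    rng = random.Random(42)
    print(is_valid("8501XXXXX11"))         # the masked value from the text
    print(is_valid(generate_id(rng)))      # a generated, format-preserving value
    print(count_duplicates(rng, 300_000))  # repeats among independent draws
```

With a few hundred thousand records, the duplicate count is typically non-zero even though the simplified value space here holds hundreds of millions of combinations, which is why a generator that ignores previously produced values eventually violates uniqueness constraints.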

Happy medium

There is, however, a happy medium: free access to high-quality data that reflects the characteristics of production data, combined with a secure solution that complies with legal regulations. This happy medium is Nocturno, a data anonymization tool that we designed together as a team. While working on this solution, we made sure to take care of:


Maintaining full data consistency – not only within the schema or database, but across all data sources within the organization (databases of various suppliers, LDAP, file sources, etc.)
Reflecting the characteristics of real data
Supporting automation fully
Creating the anonymization process easily
Ensuring full security of the solution – no possibility of reverse anonymization
Establishing configurability and extensibility – a wide, built-in set of algorithms and generators, with the ability to write custom components
Enabling the anonymization process to be versioned together with the anonymized system (Git, SVN, etc.)
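One common way to keep replacement values consistent across several data sources without a shared lookup table is to derive them deterministically from the original value with a keyed hash. The Python sketch below illustrates that general technique only; it is not taken from Nocturno, and the function names, the replacement pool and the sample rows are hypothetical.

```python
import hashlib
import hmac

# Hypothetical replacement pool; a real generator would use a much larger dictionary.
FIRST_NAMES = ("Anna", "Jan", "Maria", "Piotr", "Ewa", "Adam", "Zofia", "Tomasz")


def _digest(secret_key: bytes, field: str, original: str) -> bytes:
    """Keyed hash (HMAC-SHA256) of the field name and the original value."""
    message = f"{field}\x00{original}".encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).digest()


def replacement_name(secret_key: bytes, original: str) -> str:
    """Pick a replacement first name; the same input always yields the same name."""
    index = int.from_bytes(_digest(secret_key, "first_name", original)[:8], "big")
    return FIRST_NAMES[index % len(FIRST_NAMES)]


def replacement_login(secret_key: bytes, original: str) -> str:
    """Derive a replacement login token from the original login."""
    return "user_" + _digest(secret_key, "login", original).hex()[:12]


if __name__ == "__main__":
    # Assumption: the key exists only for the duration of the anonymization run.
    key = b"per-run secret, never stored with the anonymized data"

    # The same person as stored in two separate (hypothetical) systems.
    crm_row = {"login": "jkowalski", "first_name": "Jan"}
    billing_row = {"login": "jkowalski", "first_name": "Jan"}

    for row in (crm_row, billing_row):
        print(replacement_login(key, row["login"]),
              replacement_name(key, row["first_name"]))
```

Both rows receive identical replacement values, so references between the two systems still line up after anonymization. Two caveats apply to this sketch: mapping into a small pool produces repeated names by design, so columns with uniqueness constraints need a different strategy, and the result is only hard to reverse as long as the key is discarded or kept secret.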

What do we gain by implementing good-quality anonymization?

By implementing anonymization, we are able to reduce the number of people who have access to personal data to the absolute minimum. Because the anonymized data is of good quality, using it for software development is transparent and compliant with the GDPR. The process based on Nocturno is easy for developers to configure and maintain, and it can be developed in the same codebase as the application.

Nocturno supports two main implementation scenarios:

Administrative launch of anonymization on the indicated database instance (anonymization on request)
Automatic process of creating an anonymous copy of the production database – preparation of an anonymous backup ready for use in test environments and by developers

The picture above portrays Nocturno’s role in the automatic process of providing anonymized copies of databases.

More information about Nocturno can be found here: https://wizards.io/en/nocturno-en/. If you have questions about the anonymization process, please feel free to reach out.

Marcin Gorgoń, Senior Software Engineer

Soon it will be twenty years since I joined the world of IT. During this time, I have observed how the environment has changed, how development processes have evolved and which new tools have come into use. Over time, many processes, including repetitive tasks, were automated. Companies implemented Continuous Integration and Continuous Delivery. All of this change has been motivated by a single thought: let software developers focus on system and business development.

Enter GDPR

When GDPR came into force, it shook the IT world and changed the rules of the game. The development process became more complicated, and operating on personal data became a big risk that had to be addressed. Working in a software house, we saw these issues clearly because they occurred in each of our projects. In theory, we were prepared for GDPR. We had completed the appropriate courses, and the company was armed with documents and records. In practice, it turned out that the legal restrictions and the uncertainty surrounding the new regulation affected our everyday work. Gone was my dream of unhindered development, where we could focus solely on producing quality software.

Shortly after GDPR came into force, we started looking for available solutions. The tools that we were able to find did not meet our project needs, because every day we were developing entire integrated ecosystems, built in various technologies, that exchanged personal data. I felt as if I had travelled two decades back in time.

Change of status quo

Ultimately, a group of people emerged in the company who set themselves the goal of changing the status quo. We knew what was required and how our plan could be implemented, even though we had never faced such a challenge before. Together, we managed to create a set of tools that ended up being a godsend for us.

Anonymization of data

We started by anonymizing data in test environments. We created a tool that was able to handle many applications at once, take into account the specifics of Polish law, and do its work efficiently.

The solution was meant to support all of our projects, so high configurability and the ability to adapt to various requirements were the priority. We included anonymization in our Continuous Integration processes and quickly rolled it out across our projects. It turned out that the most painful aspects of GDPR were now handled automatically and no longer caused the development team sleepless nights.

Retention of personal data

The next step was the retention of personal data, which is necessary in almost every system. Taking care of this aspect in a single application is easy. Performing data retention in ten integrated systems is much more difficult, and in a hundred – virtually impossible. It was clear to us that we did not want to repeat the same functionality in all systems that we produce. This is how another tool was born, relieving us of this burden.

Everything was back on track, just as I had dreamed. Fortunately, GDPR turned out to be only a bump on the road in our projects.

With all of this in mind, we founded a startup. We came to the conclusion that many development teams were experiencing the same problems we had been dealing with, and we now had a ready solution.

That is why we decided to create Nocturno and Oblivio, about which you will be able to read more soon on our company profile.

Artur Żórawski, Founder & CTO of Wizards