Good quality tests require good data – data that is the most accurate representation of reality. A copy of production data is very often used for this purpose. Such a dedicated test environment is often used to reproduce tickets, debugging issues with data and performing stress tests. Setting aside the fact that this practice is most often incompatible with the GDPR, while the production environment is monitored and audited like a fortress, and only a few people have access to it, non-production environments are treated much less restrictively. The number of people with access to them (not including the users) is also much larger. Many serious leaks of personal data were not caused by hacking into the “fortress”, but by abuse of these “unprotected settlements”.
In the area of test data, there are usually two extremes – personal data is either processed by testers and developers in production database copies, or, we wait half a year to refresh test environments with artificial data, usually poorly prepared. The solution to this problem could be the implementation of anonymization, but as it turns out, this is not an easy task.
Simple data masking can work in simple cases, but you can quickly see that this is not enough for applications that we usually work with every day. On the other hand, when reviewing existing solutions, we noticed that they did not meet our needs – most often they did not support mechanisms to maintain data consistency between different databases. It was also difficult to find a solution that supported the automation of the anonymization process. The most popular tools didn’t allow for defining your own generators, not only regarding a single record, but also taking into account the distribution of data. By implementing a solution that meets these requirements yourself, one will quickly encounter obstacles:
However, there exists a happy medium – ensuring free access to high-quality data reflecting the characteristics of production data, while ensuring the security of the solution and compliance with legal regulations. This happy medium is Nocturno – a data anonymization tool that we designed together as a team. While working on this solution, we decided to take care of:
– Maintaining full data consistency – not only within the schema or database, but all data sources within the organization (databases of various suppliers, LDAP, file sources, etc.)
By implementing anonymization, we are able to reduce the number of people who have access to personal data to the absolute minimum. Due to the good quality of the anonymized data, its use for software development purposes is transparent and compliant with the GDPR. The process based on Nocturno is easily configurable and maintainable by developers – it can be simultaneously developed in the same codebase as the application.
Nocturno supports two main implementation scenarios:
The picture above portrays Nocturno’s role in the automatic process of providing anonymized copies of databases.
Marcin Gorgoń, Senior Software Engineer