The application of artificial intelligence and machine learning to today's problems requires access to large amounts of data, and one of the key obstacles analysts face is obtaining that access. Synthetic data can help solve this problem in a privacy-preserving manner. Many different types of data can be synthesized, including images, video, audio, text and structured data.
What is synthetic data?
Data synthesis is an emerging privacy-enhancing technology that can enable access to realistic data: information that is artificially generated but has the properties of an original dataset. At the same time, it ensures that such information can be used and disclosed with reduced obligations under contemporary privacy statutes. Because synthetic data retains the statistical properties of the original data, there is a growing number of use cases in which it can serve as a proxy for real data.
Synthetic data is created by taking an original (real) dataset and building a model, called the synthesizer, that characterizes the distributions and relationships in that data. The synthesizer is typically an artificial neural network or another machine learning technique that learns these characteristics of the original data. Once the model is trained, it can be used to generate synthetic data.
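As a minimal sketch of this pipeline, the following Python example uses a multivariate Gaussian as the synthesizer: it estimates the mean vector and covariance matrix of a small simulated "real" dataset and then samples fresh records from the fitted model. The dataset and parameters here are hypothetical, and a production synthesizer would typically be a more expressive model such as a GAN or a copula.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical "real" dataset: 1,000 records with two correlated
    # numeric attributes (think age and income, both simulated here).
    real = rng.multivariate_normal(
        mean=[45.0, 60_000.0],
        cov=[[120.0, 30_000.0], [30_000.0, 2.5e8]],
        size=1_000,
    )

    # "Train" the synthesizer: here it is simply the estimated mean
    # vector and covariance matrix of the real data.
    mu = real.mean(axis=0)
    sigma = np.cov(real, rowvar=False)

    # Generate synthetic records by sampling from the fitted model;
    # each record is a fresh draw, not a copy of a real record.
    synthetic = rng.multivariate_normal(mean=mu, cov=sigma, size=1_000)

    # The synthetic data should reproduce statistical properties of the
    # original, such as the correlation between the two attributes.
    print(np.corrcoef(real, rowvar=False)[0, 1])
    print(np.corrcoef(synthetic, rowvar=False)[0, 1])

Because the model captures only distributional information, the two printed correlations should be close to each other even though no individual synthetic record corresponds to a real one.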
Because the data is generated from the model rather than copied, there is no 1:1 mapping between synthetic and real records, and the likelihood of linking a synthetic record to a real individual is very small; for that reason, the output is generally not considered personal information. However, if the synthesizer is overfit to the real data, the generated records will replicate the original ones. The synthesizer therefore has to be constructed in a manner that avoids such overfitting, and a formal privacy assurance should also be performed on the synthesized data to validate that the mapping between synthetic records and real individuals is weak.
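One simple way to test for such a weak mapping is a distance-to-closest-record check: if synthetic records sit much closer to the training records than the training records sit to one another, the synthesizer may have memorized individuals. The sketch below illustrates the idea on simulated numeric data; it is one heuristic among several, not a complete formal privacy assurance, and the data here is purely illustrative.

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(1)

    # Stand-ins for the (scaled) training data and the generated output.
    real = rng.normal(size=(1_000, 4))
    synthetic = rng.normal(size=(1_000, 4))

    # Distance from each synthetic record to its nearest real record.
    tree = cKDTree(real)
    dcr, _ = tree.query(synthetic, k=1)

    # Baseline: distance from each real record to its nearest *other*
    # real record (k=2 skips the record itself, which is at distance 0).
    ref, _ = tree.query(real, k=2)
    baseline = ref[:, 1]

    # Synthetic records that are dramatically closer to real records
    # than real records are to each other suggest memorization.
    print(f"median distance, synthetic to real: {np.median(dcr):.3f}")
    print(f"median distance, real to real:      {np.median(baseline):.3f}")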
The use cases synthetic data can support include AI, machine learning and other data science projects that require realistic data for model building and validation, as well as software testing, technology evaluations and open data initiatives.