In a blog post, computer scientist Mary Theofanos of NIST (the National Institute of Standards and Technology) discussed “differential privacy” and how it can be used to unlock valuable information in data without exposing our personal details.
What is differential privacy? The concept of differential privacy is actually easy to understand. Let’s say there’s a dataset that has your personal information in it and you don’t want your data to be identifiable in that dataset. The easy answer is that we just eliminate your row of data.
But we can’t do that for everybody because then we wouldn’t have any data. And even if we just eliminated rows of individuals who specifically opted out, there’s a good chance we wouldn’t have enough data to analyze, which means we would never discover the interesting trends or actionable information in the data.
With differential privacy, you take a dataset that has personal information in it and make it so that the personal information is not identifiable. You essentially add noise to the data, but in a very prescribed, mathematically rigorous way that preserves the properties of the overall data while hiding individual identities.
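The interview doesn’t name a specific mechanism, but a standard example of adding noise in that “prescribed, mathematically rigorous way” is the Laplace mechanism. The sketch below privatizes a simple counting query; the dataset and field name are illustrative assumptions, not from the post.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # Sample from Laplace(0, scale) via inverse transform sampling.
    u = random.random() - 0.5          # uniform on [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon: float) -> float:
    # A counting query has sensitivity 1: adding or removing one person
    # changes the count by at most 1. Laplace noise with scale 1/epsilon
    # then satisfies epsilon-differential privacy for this query.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Toy dataset (the "over_65" field is a hypothetical attribute).
people = [{"over_65": True}, {"over_65": False}, {"over_65": True}]

# Each released answer is noisy, so no individual's presence can be
# pinned down, yet the answer stays close to the true count of 2.
noisy = private_count(people, lambda r: r["over_65"], epsilon=0.5)
```

Smaller epsilon means more noise and stronger privacy; averaged over many queries or large datasets, the overall trend survives the noise.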
How do we ensure we have valuable data while protecting individuals’ privacy? In a data-driven world, we need to make good decisions about how we analyze data while protecting personally identifiable information (PII). Differential privacy allows us to do that.
Can you give us an example of how differential privacy can be used to solve a problem? For instance, 911 call data could tell us a lot about patient outcomes, but it contains a great deal of PII. If we de-identify that data, taking out all the personally identifiable fields like name and street address, we can answer questions such as: What are the outcomes relative to the length and type of care the patient received on scene? Is it better to do more on the scene to try to aid the patient, which delays the trip to the hospital, or is it better to get to the ER quickly?
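The de-identification step described here can be sketched as stripping the direct-identifier fields before analysis. All record fields and values below are hypothetical, and note that dropping identifiers alone is a weaker guarantee than full differential privacy:

```python
# Hypothetical 911-call records; the schema is an illustrative assumption.
calls = [
    {"name": "A. Smith", "address": "12 Oak St", "on_scene_min": 25, "survived": True},
    {"name": "B. Jones", "address": "9 Elm St",  "on_scene_min": 8,  "survived": True},
    {"name": "C. Lee",   "address": "3 Pine St", "on_scene_min": 30, "survived": False},
]

PII_FIELDS = {"name", "address"}

def de_identify(record: dict) -> dict:
    # Drop direct identifiers, keeping only the analytic fields.
    return {k: v for k, v in record.items() if k not in PII_FIELDS}

clean = [de_identify(r) for r in calls]

def survival_rate(outcomes: list) -> float:
    return sum(outcomes) / len(outcomes) if outcomes else float("nan")

# Compare "stay and stabilize" (long on-scene time) vs. "load and go".
long_scene  = [r["survived"] for r in clean if r["on_scene_min"] >= 15]
short_scene = [r["survived"] for r in clean if r["on_scene_min"] < 15]
```

The cleaned records no longer carry names or addresses, yet still support the outcome comparison the interview describes.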
One reason differential privacy is coming to the forefront is artificial intelligence. If artificial intelligence tools have good data going in, we’re going to see better results. Right now, we often have to train on synthetic, completely made-up data because of privacy concerns. But if applying differential privacy techniques to real datasets lets you use data you otherwise couldn’t, you’re going to get better results.
Also, more organizations are now looking at their statistical databases and trying to figure out how they can make their data accessible to other organizations that may want to run statistical analyses and identify trends. Differential privacy could be used to identify trends in a number of arenas, including medicine, transportation, energy, agriculture, economics and financial services. There are just so many ways it could benefit everybody.
What made you want to get involved in the differential privacy space? A couple of things. First, while I’m not a cybersecurity expert, I actually started my career in cybersecurity and worked for several agencies looking for vulnerabilities. Because of that, I’m very concerned about my personal privacy and I take a lot of measures to minimize my exposure from a privacy perspective.
But I also recognize the value of data today and that sharing data and merging large databases in many ways can improve all sorts of outcomes. I recently read a book called Invisible Women: Data Bias in a World Designed for Men, by Caroline Criado Perez, which talks about how data is missing on large segments of the population. If we’re using incomplete data today to show trends and how things are moving forward, and we’re using that data to train AI tools, then we’re propagating bad data.
So, given that the decisions we make today may impact us for the next 30, 40 or 50 years, how do we fill in those gaps and ensure we have good data, and the most data possible? Especially when some people don’t want to be part of those datasets if their PII is going to be exposed. I realized that differential privacy is a solution to that issue.
What should we be concerned about when it comes to differential privacy? There’s obviously controversy with any new technique. Some people are concerned that differential privacy may hide some trends or that some of the techniques used could change the data in some way such that it’s not reflective of the actual dataset. This is part of the measurement problem that NIST is trying to address. At what point are the trends hidden or does the deep analysis change because we applied differential privacy? And we don’t really know the answers to all those questions yet.