Differential privacy and new privacy approaches
How can we keep patient data truly private in this new era of technology?
As we’ve discussed, the current approaches to patient privacy are not sufficient for modern AI technology due to the risk of re-identification. Luckily, there are many non-healthcare reasons to protect privacy, so there are a large number of smart people working on this problem.
What healthcare currently uses is called “statistical disclosure limitation,” which is a fancy way of saying that some data are withheld to decrease the privacy risk to individuals. HIPAA and data de-identification are examples of this approach.
There are a few innovative ways to increase privacy without markedly affecting the ability to study datasets that include patient records, though most still have limited use cases for practical reasons.
Differential Privacy
The inherent problems with statistical disclosure limitation prompted researchers, about 20 years ago, to start developing an approach that would protect against information leakage across a range of threats, including attacks using technology that hadn’t yet been invented. These forward-thinking mathematicians developed differential privacy.
The idea is that a model’s behavior barely changes when any one individual is removed from or added to the database; you can’t tell whether a particular person is included by looking at the algorithm’s output. It does this by adding noise to the results so that individual data isn’t distinguishable; i.e., if a person in a database has a disease, someone querying the database can learn that there are people with that disease, but can’t tell which specific people they are. The details of how this works involve complex math and many complicated equations. It’s especially useful in the context of machine learning, and can even improve generalization in machine learning algorithms.
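To make the noise-adding idea concrete, here’s a minimal sketch of the classic Laplace mechanism applied to a count query. The tiny dataset and the epsilon value are made up purely for illustration; a real deployment would use a vetted library rather than hand-rolled code.

```python
import numpy as np

rng = np.random.default_rng()

def noisy_count(records, condition, epsilon):
    """Return a differentially private count of records matching `condition`.

    A counting query has sensitivity 1 (adding or removing one person changes
    the count by at most 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy.
    """
    true_count = sum(1 for r in records if condition(r))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Toy dataset: True means the patient has the disease (illustrative only).
cohort = [True, False, True, True, False, False, True, False]
cohort_without_one_person = cohort[:-1]  # the same data with one person removed

# With epsilon = 0.5, the two noisy answers are hard to tell apart,
# so the output does not reveal whether the last person was included.
print(noisy_count(cohort, lambda has_disease: has_disease, epsilon=0.5))
print(noisy_count(cohort_without_one_person, lambda has_disease: has_disease, epsilon=0.5))
```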
You can imagine, though, that if you’re adding “random noise” to something like statistics, there are going to be tradeoffs. For example, some functions are more sensitive than others, so more noise has to be added to protect privacy, and that extra noise can make the results meaningless.
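As a hypothetical illustration of that sensitivity problem, compare a count query (where one person can change the answer by at most 1) with a sum-of-ages query (where one person can change the answer by up to the maximum age). The ages and the assumed sensitivity of 120 below are invented for the example.

```python
import numpy as np

rng = np.random.default_rng()
epsilon = 0.5
ages = [34, 61, 47, 29, 55]  # tiny illustrative cohort

# Count query: adding or removing one person changes the count by at most 1.
count_sensitivity = 1
noisy_count = len(ages) + rng.laplace(scale=count_sensitivity / epsilon)

# Sum-of-ages query: one person can change the sum by up to ~120 years,
# so the Laplace noise has to be ~120x larger to give the same guarantee.
sum_sensitivity = 120
noisy_sum = sum(ages) + rng.laplace(scale=sum_sensitivity / epsilon)

print(f"true count = {len(ages)}, noisy count = {noisy_count:.1f}")
print(f"true sum   = {sum(ages)}, noisy sum   = {noisy_sum:.1f}")  # often wildly off for a small cohort
```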
Differential Privacy and the 2020 US Census
The US Census Bureau adopted differential privacy to protect the 2020 census, and that decision illustrates the tradeoffs involved. There are also situations that are fundamentally incompatible with differential privacy, like detecting outliers or identifying where a specific person is for things like contact tracing during a pandemic. So there are inherent tradeoffs between accuracy and privacy, which requires people to make complex decisions and customize the tools to the purpose for which they’re being used.
As Scientific American explains it:
“If we allow algorithms that give out almost exactly the same information in the two cases, then useful and efficient algorithms do exist. This “almost” is a precisely calibrated parameter, a measurable quantification of privacy. Individuals or social institutions could decide what value of this parameter represents an acceptable loss of privacy, and then differentially private algorithms could be chosen that guarantee that the privacy loss is less than the selected parameter.”
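For readers who want the formal version of that “precisely calibrated parameter,” it is conventionally written as ε (epsilon), and the standard definition of ε-differential privacy is:

```latex
% A randomized algorithm M is epsilon-differentially private if, for every pair
% of datasets D and D' that differ in one person's record, and for every set of
% possible outputs S:
%
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D') \in S]
%
% Smaller epsilon means the two output distributions are closer together,
% i.e., less privacy is lost by any one individual.
```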
This sounds great, right? Why aren’t all patient datasets protected with differential privacy?
Problems with differential privacy
A few major reasons:
Right now, specialized software engineers have to “write code, tune parameters, and optimize the trade-off between the privacy and accuracy of statistical releases”. In other words:
It’s expensive
It’s time-consuming
Not many differential privacy experts exist. Some of the highest-profile releases have come from enormous organizations with access to these specialized engineers, like Google, Amazon, and the US Census Bureau.
Since you have to make a series of assumptions and decisions about how close the data needs to be to “real life,” exploratory data analysis is hard.
PSI: a Private data Sharing Interface
PSI (a “Private data Sharing Interface,” not to be confused with private set intersection) was created to let institutions and researchers with access to sensitive datasets release information “without being experts in computer science, statistics, or privacy”. It’s a user interface that translates some of the complex math into terms a researcher would find more familiar, to make differential privacy techniques easier to use. It’s still a relatively nascent interface, and a reasonable understanding of the tradeoffs inherent in differential privacy is still required to use it.
Privacy Budget
The concept of a privacy budget was first developed in the context of differential privacy, but it’s one of the most pleasantly named engineering concepts because it describes so well what it does: it sets a budget for how much privacy the system is allowed to take from a person, and once that budget is used up the system returns an error or a standard response. In the example from Inpher linked below, a third-party application starts with a budget of three queries about a customer, and once those are used up it no longer returns results.
(Image source: https://inpher.io/blog/privacy-budget-and-the-data-driven-enterprise)
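Here’s a minimal sketch of that idea in code; the three-query budget and the class name are made up for illustration rather than taken from any particular library.

```python
class PrivacyBudget:
    """Tracks how much privacy 'spend' remains for queries about one person."""

    def __init__(self, total_queries=3):
        self.remaining = total_queries

    def query(self, run_query):
        """Run a query if budget remains; otherwise return a standard refusal."""
        if self.remaining <= 0:
            return "Budget exhausted: no further results for this customer."
        self.remaining -= 1
        return run_query()


budget = PrivacyBudget(total_queries=3)
for i in range(4):
    # The first three calls return answers; the fourth gets the standard response.
    print(budget.query(lambda: f"answer to query {i + 1}"))
```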
Homomorphic encryption
This means “same structure”, so it’s another nicely named encryption method: it keeps the structure of the original data so another institution can work with it without actually decrypting the data, thus preserving patient privacy. The other party can run actual math operations on the dataset without ever seeing the dataset itself.
Different types of homomorphic encryption allow different kinds of math operations:
“Partially Homomorphic Encryption (PHE). In PHE, ‘partially’ means that only a single mathematical function can be performed on encrypted values. So only one action — either addition or multiplication — can be performed an unlimited number of times on the encrypted data.
Somewhat Homomorphic Encryption (SHE). ‘Somewhat’ is more general than PHE in that it supports homomorphic operations with additions and multiplications. However, only a limited number of operations can be performed on the encrypted data.
Fully Homomorphic Encryption (FHE). Where PHE and SHE have limited operations, fully homomorphic encryption has the capability of using both operations, addition and multiplication, with no limit on the number of times they’re performed on the encrypted data.”
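To make “partially homomorphic” concrete, here’s a toy sketch using textbook RSA, which happens to be multiplicatively homomorphic. The tiny primes are for illustration only; nothing like this would be secure in practice.

```python
# Textbook RSA with absurdly small primes, purely to show the homomorphic property.
p, q = 61, 53
n = p * q                  # modulus (3233)
phi = (p - 1) * (q - 1)    # 3120
e = 17                     # public exponent (coprime with phi)
d = pow(e, -1, phi)        # private exponent (modular inverse of e)

def encrypt(m):
    return pow(m, e, n)

def decrypt(c):
    return pow(c, d, n)

a, b = 12, 7
# Multiplying the *ciphertexts* yields the ciphertext of the *product*,
# so a third party can compute on the data without ever decrypting it.
product_ciphertext = (encrypt(a) * encrypt(b)) % n
assert decrypt(product_ciphertext) == (a * b) % n
print(decrypt(product_ciphertext))  # 84, computed without ever seeing a or b in the clear
```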
The disadvantage of homomorphic encryption is that it’s slow and expensive, so it’s usually used to encrypt things like keys rather than the data itself. This is better than nothing, but still leaves the possibility of privacy violations once the data are decrypted.
Additional privacy techniques
A few other privacy techniques you may see if you continue reading about privacy protection are:
K-anonymity, which was popular in the privacy research literature for a while but is now seen as fairly weak and is not commonly used.
RSA (Rivest–Shamir–Adleman), which is a common way to protect data through encryption and decryption using a public and private key pair and is even used in digital verification like SSL certificates; textbook RSA is actually a partially homomorphic encryption scheme (as in the toy sketch above) and, like homomorphic encryption generally, it’s compute-heavy.
Conclusion
We know that HIPAA is not going to be sufficient to protect patient privacy as AI becomes cheaper and more accessible. Differential privacy, data-sharing interfaces like PSI, homomorphic encryption, and the privacy budget are emerging techniques to help protect our patients’ right to privacy. Hopefully these approaches can be incorporated into more healthcare dataset analyses in the future.