[Image: person in the know calling me out.]
Since this Comey Twitter leak is such a nice example, I'm going to provide more context by revisiting a taxonomy I used in my spring software security course, adding statistical privacy to the list. (Last time I had to use a much less exciting example, about my mother spying on my browser cookies.)
- Access control mechanisms resolve permissions on individual pieces of data, independently of the program that uses the data. An access control policy could say, for instance, that only Comey's followers may see who he is following. You can use access control policies to check data as it's leaving a database, or anywhere else in the code. What people care about with access control is whether the policy language can express the desired policies, provide provable guarantees that policies won't accidentally grant access, and be checked reasonably efficiently. (See the first sketch after this list.)
- Information flow mechanisms check the interaction of sensitive data with the rest of the program. In the case of this Comey leak, access control policies were in place some of the time: if you went to Comey's profile page, you couldn't see who he was following. The journalist ended up finding his page by looking at the other users the recommendation algorithm suggested after requesting to follow the hypothesized Comey account. (This was helped by the fact that Comey follows few people.) In this case, it seems that Instagram was feeding secret follow information into the recommendation algorithm without realizing that the results could leak that information. An information flow mechanism would make sure that no computation based on secret follow information could make its way into the output of a recommendation algorithm. If the follow list is secret, then so are the length of that list, the people followed by people on the list, photos of people from the list, and so on. (See the second sketch after this list.)
- Statistical privacy mechanisms prevent aggregate computations from revealing too much information about individual sensitive values. For instance, you might want to develop a machine learning algorithm that uses medical patient records to do automated diagnosis from symptoms. It's clear that individual patient records need to be kept secret--in fact, there are laws requiring this. But there is a lot of good we could do if we could use sensitive patient information to help other patients. What we want, then, is to allow algorithms to use this data, but with a guarantee that an observer has a very low probability of tracing diagnoses back to individual patients. The most popular formulation of statistical privacy is differential privacy, a property of computations requiring that an observer of the results can tell the original data apart from data that differs in a single individual's record with only very low probability. Differential privacy is very hot right now: you may have read that Apple is starting to use it. It's also not a solved problem: my collaborator and co-instructor Matt Fredrikson has an interesting paper about the tension between differential privacy and social good, calling for a reformulation of statistical privacy to address the current flaws. (See the third sketch after this list.)
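To make the access control idea concrete, here is a minimal sketch in Python of what a policy check over a follow list might look like. The data model, account names, and function names are all invented for illustration; this is not how Instagram or Twitter actually implement their checks.

```python
# Hypothetical sketch of an access control check on a "following" list.
from dataclasses import dataclass, field

@dataclass
class Account:
    username: str
    is_private: bool = False
    followers: set = field(default_factory=set)    # usernames approved to follow this account
    following: list = field(default_factory=list)  # accounts this user follows

def can_view_following(viewer: str, owner: Account) -> bool:
    """Policy: anyone may see a public account's follows; for a private
    account, only the owner and approved followers may see them."""
    if not owner.is_private:
        return True
    return viewer == owner.username or viewer in owner.followers

def get_following(viewer: str, owner: Account) -> list:
    # The policy is checked at the point where data leaves the "database".
    if not can_view_following(viewer, owner):
        raise PermissionError(f"{viewer} may not view {owner.username}'s follows")
    return owner.following

secret_account = Account("hypothesized_comey", is_private=True,
                         followers={"approved_friend"},
                         following=["someone_he_follows"])
print(get_following("approved_friend", secret_account))  # allowed
# get_following("journalist", secret_account)            # raises PermissionError
```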
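The information flow idea can be sketched with a toy dynamic taint-tracking wrapper: anything computed from a secret value stays secret, and the public output channel refuses to display it. Again, the `Secret` wrapper and all names here are hypothetical; real language-based or runtime information flow systems are far more sophisticated than this.

```python
# Minimal sketch of dynamic information flow tracking (taint tracking).
class Secret:
    """Wraps a sensitive value. Anything computed from it stays Secret."""
    def __init__(self, value):
        self.value = value

def taint_aware_len(xs):
    # Even the length of a secret list is derived from the secret, so it stays secret.
    return Secret(len(xs.value)) if isinstance(xs, Secret) else len(xs)

def recommend(candidates, follow_list):
    # A toy "recommendation": score candidates that overlap the follow list.
    follows = follow_list.value if isinstance(follow_list, Secret) else follow_list
    scores = {c: (1 if c in follows else 0) for c in candidates}
    # The result was computed from secret data, so it carries the secret label.
    return Secret(scores) if isinstance(follow_list, Secret) else scores

def render_to_user(output):
    # The public sink refuses to emit anything derived from secret data.
    if isinstance(output, Secret):
        raise RuntimeError("information flow violation: secret data reached a public sink")
    return output

secret_follows = Secret(["alice", "bob"])
length = taint_aware_len(secret_follows)          # still Secret
suggestions = recommend(["alice", "carol"], secret_follows)
try:
    render_to_user(suggestions)
except RuntimeError as e:
    print(e)  # the recommendations depend on the secret follow list
```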
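For statistical privacy, the textbook example is the Laplace mechanism applied to a counting query. The sketch below uses a hypothetical list of patient records and shows how adding noise calibrated to the query's sensitivity yields epsilon-differential privacy for questions like "how many patients have the flu?"

```python
# Minimal sketch of the Laplace mechanism for a differentially private count.
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample from Laplace(0, scale) via inverse transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(records, predicate, epsilon: float) -> float:
    """Answer 'how many records satisfy predicate?' with epsilon-DP.
    A count has sensitivity 1 (adding or removing one patient changes the
    true answer by at most 1), so the noise scale is 1/epsilon."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical patient records: (age, diagnosis)
patients = [(34, "flu"), (61, "diabetes"), (45, "flu"), (29, "asthma")]
print(private_count(patients, lambda p: p[1] == "flu", epsilon=0.5))
```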
For those wondering why I didn't talk about encryption: encryption focuses on the orthogonal problem of putting a lock on an individual piece of data, where locks can have varying cost and varying strength. Encryption involves a different kind of math--and we also don't cover encryption in my spring course for this reason.
[Image: another discussion I had on Twitter.]
Discussion. Some people may wonder whether the Comey Twitter leak is really an information flow leak or some other kind of leak. It's true that in many cases this Instagram bug would be less obvious, because someone following many people gives the recommendation algorithm more to work with. I would argue that it falls squarely within the purview of information flow mechanisms: if follow information is secret, then recommendation algorithms should not be able to compute using this data. (Here, it seems like what one means by "deducible" is "computed from," and that's an information flow property.) This also isn't a situation where the recommendation engines are aggregating information from thousands of users to do something important. It's very easy for information to leak here, and it's simply not worth the loss of privacy!
[Image: poor, and in violation of our privacy settings.]
Takeaways. We should stand up for ourselves when it comes to our data. Companies like Facebook make recommendations based on private information all the time, and not only is it creepy, it violates our privacy settings, and they can definitely do something about it. My student Scott recently made $1000 from Facebook's bug bounty program by reporting that photos from protected accounts were showing up in keep-in-touch emails from Instagram. If principle alone doesn't provide enough motivation, maybe the $$ will incentivize you to call tech companies out when you encounter sloppy data privacy practices.