Friday, March 31, 2017

Five Research Ideas Instagram Could Have Used to Protect Comey's Secret Twitter

Even though cybersecurity is one of the hottest topics on the Internet, my specific area of research, information flow security, has remained relatively obscure. Until now, that is.

You may have heard "information flow" thrown around alongside terms like "data breach," "information leak," and "1337 hax0r." You may not be aware that information flow is a specific technical term, referring to the practice of tracking sensitive data as it flows through a program. While techniques like access control and encryption protect individual pieces of data (for instance, as they leave a database), information flow techniques additionally protect the results of any computations on sensitive data.

Information flow bugs are usually not the kinds of glamorous bugs that make headlines. Many of the data leaks that have been in the public consciousness, for instance the Target and Sony hacks, happened because the data was not protected properly at all. In these cases, having the appropriate access checks, or encrypting the data, should do the trick. But "why we need to protect data more better" is harder to explain. Up through my PhD thesis defense, I had such a difficult time finding headlines that were actually information flow bugs that I resorted to general software security motivations (cars! skateboards! rifles!) instead.

Then along came "This Is Almost Certainly James Comey's Twitter Account," an article I have been waiting for since I started working on information flow in 2010. The basic idea behind the article is this: a journalist named Ashley Feinberg wanted to find FBI director James Comey's secret Twitter account, and so started digging around the Internet. Feinberg succeeded within four hours thanks to being clever--and to a key information leak in Instagram: when you request to follow an Instagram account, Instagram algorithmically suggests other accounts to follow. In this case, the suggestions for Comey's son Brien included several family members, among them James Comey's wife--and the account that Feinberg deduced to be James Comey's. It turns out that Comey uses the same "anonymous" handle on Instagram as he does on Twitter. And so Instagram's failure to protect Brien Comey's protected "following" list led to the discovery of James Comey's Twitter account.

So what happened here? Instagram promises to protect secret accounts, which it (sometimes*) does. When you directly view the Instagram page of a protected user, you cannot access that person's photos, who that user is following, or who follows that user. This might lead a person to think that all of this information is protected all of the time. Wrong! It turns out the protected account information is visible to the algorithms that suggest other users to follow, a feature that becomes--incorrectly--visible to all viewers once a follow is requested, presumably because whoever implemented this functionality forgot an access check. This leak is particularly insidious because while the profile photos and names of the suggested users are all already public, they are shown as the result of a computation on secret information: Brien Comey's protected follow list. (This is a subtle case to remember to check!) In information flow nomenclature, this is called an implicit flow. For someone involved in a lot of Instagram activity, the implicit flow of the follow information might not be so apparent. But when many of the recommended follows are Comey family members, many of whom use their actual names, the leak becomes far more serious!
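To make the implicit flow concrete, here is a minimal Python sketch of the shape of the bug, with all account names and function names invented for illustration (this is not Instagram's code): each profile card shown is public on its own, but the *selection* of cards is computed from a protected follow list.

```python
# Hypothetical data: Brien's follow list is protected, visible only to
# approved followers; the profile cards themselves are public.
protected_follows = {"brien": ["patrice", "jim_reinhold", "cousin_kate"]}
approved_followers = {"brien": {"family_friend"}}
public_profiles = {"patrice": "Patrice C.",
                   "jim_reinhold": "Reinhold Niebuhr",
                   "cousin_kate": "Kate C."}

def suggest_follows(viewer, target):
    # BUG: no check that viewer may see target's follow list, even though
    # the suggestions are computed from it -- an implicit flow.
    return [public_profiles[u] for u in protected_follows[target]]

def suggest_follows_fixed(viewer, target):
    # The fix: refuse to compute on the protected list for outsiders.
    if viewer not in approved_followers[target]:
        return []
    return [public_profiles[u] for u in protected_follows[target]]

print(suggest_follows("journalist", "brien"))        # leaks the selection
print(suggest_follows_fixed("journalist", "brien"))  # []
```

The fixed version is exactly the access check that was forgotten: the leak is not in showing any individual card, but in letting protected data choose which cards to show.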

In the world of information flow, this article is a Big Deal because it so perfectly illustrates why information flow analyses are useful. For years, I had been jumping up and down and waving my arms (see here and here, for instance) about why we need to check data in more places than the point where it leaves the database. Applications aren't just showing sensitive values directly anymore, but the results of all kinds of computations on those values! (In this case it was a recommendations algorithm.) We don't always know where sensitive data is eventually going! (As was the case when Brien Comey's protected "following" list was handed over to the algorithm.) Policies might depend on sensitive data! We may even compute where sensitive data is going based on other sensitive data! In a world where we can search over anything, no data is safe!

Until recently, my explanations have seemed subtle and abstract to most, in direct contrast to the sexy flashy security work that people imagine after watching Hackers or reading Crypto. By now, though, we information flow researchers should have your attention. We have all kinds of computations over all kinds of data going to all kinds of people, and nobody has any clue what is going on in the code. Even though digital security should be one of the main concerns of the FBI, Comey is not able to avoid the problems that arise from the mess of policy spaghetti that is modern code.

Fortunately, information flow researchers have been working for years on preventing precisely this kind of Comey leak**. In fashionable BuzzFeed style, I will list exactly five research ideas Instagram could adapt to prevent such leaks in the future:
  1. Static label-based type-checking. In most programming languages, program values have types. Types usually tell you simple things like whether something is a number or a list, but they can be arbitrarily fancy. Types may be checked at compile time, before the program runs, or at run time, while the program is running. There has been a line of work on static (compile time) label-based information flow type systems (starting with Jif for Java, with a survey paper here describing more of this work) that lets programmers label data values with security levels (for instance, secret or not) as part of their types, and that propagates those labels through the program to make sure sensitive information does not flow anywhere less sensitive. These type systems give guarantees about any run of any program that type-checks. The beauty of these type systems is that while they look simple, they are clever enough to capture the kind of implicit flow that we saw with the algorithm leaking Brien Comey's follow information. (We'd label the follow lists as sensitive, and then any values computed from them couldn't be leaked!)
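The core of such a type system can be sketched in a few lines. This is a from-scratch toy illustration, not Jif's actual API: labels form a two-point lattice PUBLIC < SECRET, the label of a computed value is the join (least upper bound) of its inputs' labels, and a flow into a less-secret sink is rejected before the program ever "runs."

```python
PUBLIC, SECRET = 0, 1  # a two-point label lattice

def join(l1, l2):
    # least upper bound in the lattice; with two points, just max
    return max(l1, l2)

def check_assign(sink_label, input_labels):
    # the label of a computed expression is the join of its inputs' labels
    expr_label = PUBLIC
    for l in input_labels:
        expr_label = join(expr_label, l)
    if expr_label > sink_label:
        raise TypeError("illegal flow: sensitive data into a less-secret sink")
    return expr_label

# a count computed from a SECRET follow list and a PUBLIC constant is SECRET:
assert check_assign(SECRET, [SECRET, PUBLIC]) == SECRET

# ...so assigning it to a PUBLIC recommendations list fails to "compile,"
# which is exactly the implicit flow the Instagram code missed:
try:
    check_assign(PUBLIC, [SECRET])
except TypeError as err:
    print(err)
```

The point of doing this statically is that the error above is raised before deployment, for every possible execution, rather than at run time for one unlucky user.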
  2. Static verification. Label-based type-checking is a light-weight way of proving the correctness of programs according to some logical specification. There are also heavier-weight ways of doing it, using systems that translate programs automatically into logical representations and check them against the specification. Various lines of work on refinement types--super fancy types that depend on program values--could be used for information flow. (An example of a refinement type is {int x | canSee(alice, x)}, the type of an integer x that can only exist if user "alice" is allowed to see it according to the "canSee" function/predicate.) Researchers have also demonstrated ways of proving information flow properties in systems like Ironclad and mCertiKOS. These efforts are pretty hardcore and require a lot of programmer effort, but they allow people to prove all sorts of boutique guarantees on boutique systems (as opposed to the generic guarantees a type system gives for the subset of a language it supports).
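To give a feel for the refinement-type reading of {int x | canSee(alice, x)}, here is a rough dynamic analogue in Python (invented names throughout; a real refinement checker would discharge the predicate at compile time, typically with an SMT solver, rather than testing it at run time):

```python
from dataclasses import dataclass, field

@dataclass
class Labeled:
    value: int
    readers: set = field(default_factory=set)  # who may see this value

def refine(pred, v):
    # A term of refinement type {v | pred(v)} exists only if pred holds;
    # constructing the value is the proof obligation.
    if not pred(v):
        raise ValueError("refinement violated: no such value exists")
    return v

# canSee(viewer, v), curried so it reads like the predicate in the type
can_see = lambda viewer: (lambda v: viewer in v.readers)

count = Labeled(3, readers={"alice"})
ok = refine(can_see("alice"), count)      # inhabits {v | canSee(alice, v)}
try:
    refine(can_see("mallory"), count)     # this type is uninhabited
except ValueError as err:
    print(err)
```

In the static setting the failed case would be a compile-time type error, so code that hands a protected follow list to an unauthorized viewer simply would not build.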
  3. Label-based dynamic information flow tracking. Static label-based type-checking, while useful, often requires the programmer to put labels all over the program. Systems such as HiStar, Flume (whose specific motivation was the OkCupid web server), and Hails allow labeling data much as static label-based type systems do, but track the flow of information dynamically, while the program is running. The run-time tracking spares programmers from putting labels everywhere, but it comes at a cost. First, it introduces performance slowdowns. Second, we can't know before a program runs whether it will give us some kind of "access denied" error, so there could be accesses denied all over the place. Many of these systems handle these problems at the process level: if there is an unintended leak anywhere in the process, the whole process aborts. (Those who haven't heard of processes can think of a process as encapsulating a whole big task, rather than an individual action like a single arithmetic operation.)
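Here is a toy sketch of run-time label tracking in the spirit of these systems (a from-scratch illustration, not HiStar's, Flume's, or Hails's actual API): every value carries a label, operations join the labels of their inputs, and an output to a channel below the value's label aborts at run time.

```python
SECRET, PUBLIC = frozenset({"secret"}), frozenset()  # labels as taint sets

class Tracked:
    def __init__(self, value, label=PUBLIC):
        self.value, self.label = value, label

    def __add__(self, other):
        other = other if isinstance(other, Tracked) else Tracked(other)
        # the result is tainted by both inputs: join = set union
        return Tracked(self.value + other.value, self.label | other.label)

def output(channel_label, v):
    # only release v on channels at or above its label
    if not v.label <= channel_label:
        raise RuntimeError("flow violation: aborting")  # cf. process abort
    return v.value

salary = Tracked(100_000, SECRET)
bonus = salary + 5_000              # derived value stays SECRET
print(output(SECRET, bonus))        # fine on a secret channel
try:
    output(PUBLIC, bonus)           # caught only when the program runs
except RuntimeError as err:
    print(err)
```

Note the trade-off the paragraph above describes: nothing had to be annotated except the one secret input, but the violation only surfaces at run time.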
  4. Secure multi-execution. Secure multi-execution is a nice trick for running black-box code (code that you don't want to--or can't--change) in a way that is secure with respect to information flow. The trick is this: every time the program reaches a sensitive input, you continue the computation with the sensitive value in one process, and you spawn another process that runs on a secure default value instead. The process separation guarantees that sensitive values can't leak into the process containing the default value, so you know you are always allowed to show that process's result. As you might guess, secure multi-execution can slow a program down quite a bit, as it needs to spawn a new process every time it sees a sensitive value. To mitigate this, my collaborators Tom Austin and Cormac Flanagan developed a faceted execution semantics for programs that lets you execute a program on multiple values at the same time, with all of the security guarantees of secure multi-execution.
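A faceted value can be sketched in a few lines of Python (a heavily simplified toy in the style of Austin and Flanagan's semantics; the real semantics also tracks the program counter to handle implicit flows through control flow): one value carries both the secret and a public default, computations run on both facets at once, and each viewer sees only the facet they are cleared for.

```python
class Facet:
    def __init__(self, secret, default):
        self.secret, self.default = secret, default

    def map(self, f):
        # run the same computation on both facets -- the single-process
        # analogue of secure multi-execution's two processes
        return Facet(f(self.secret), f(self.default))

    def project(self, viewer_cleared):
        # authorized viewers see the secret facet; everyone else the default
        return self.secret if viewer_cleared else self.default

# the protected follow list, with the empty list as its secure default:
follows = Facet(["patrice", "jim_reinhold"], [])
n = follows.map(len)                     # computed on both facets at once
print(n.project(viewer_cleared=True))    # a follower sees the real count
print(n.project(viewer_cleared=False))   # everyone else sees the default
```

Because the default facet was never influenced by the secret one, showing it to an arbitrary viewer is always safe, which is exactly the guarantee secure multi-execution buys with two processes.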
  5. Policy-agnostic programming. While all of these other approaches can prevent sensitive values from leaking, if we want programs to actually run most of the time, somebody needs to make sure programs are written not to leak information in the first place. It turns out this is pretty difficult, so I have been working on a programming model that factors information flow policies out of the rest of the program. (If I'm going to write a whole essay about information flow, of course I'm going to write about my own research too!) Instead of implementing information flow policies as checks across the program, where any missing check can lead to a bug, type error, or runtime "access denied," programmers can now specify each policy once, associated with the data, along with a default value, and rely on the language runtime and/or compiler to make the code execute according to the policies. In a policy-agnostic system, the programmer can say that Brien Comey's follows should only be visible to his followers, and the machine becomes responsible for making sure this policy is enforced everywhere, including in the code implementing the recommendations algorithm. The challenges are that policies can depend on sensitive values, that sensitive values may be shown to viewers whose identities are themselves computed from sensitive values, and that enforcing policies usually means implementing access checks scattered across the code. Our semantics for the Jeeves programming language (paper here) addresses all of these issues using a dynamic faceted execution approach, and we have also extended the programming model to handle applications with a SQL database backend (paper here). We are also working on a static type-driven repair approach (draft here).
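To show the shape of the policy-agnostic idea, here is a sketch that is Jeeves-like in spirit but uses an invented Python API (this is not Jeeves's actual syntax): the policy and the default are attached to the data exactly once, and every output path, including the recommender, is mediated by that one policy.

```python
class Protected:
    """A value bundled with its default and its visibility policy."""
    def __init__(self, value, default, policy):
        self.value, self.default, self.policy = value, default, policy

    def reveal(self, viewer):
        # the single enforcement point for every use of this value
        return self.value if self.policy(viewer) else self.default

approved_followers = {"family_friend"}
brien_follows = Protected(
    value=["patrice", "jim_reinhold"],
    default=[],
    policy=lambda viewer: viewer in approved_followers,  # stated once
)

def recommend(viewer):
    # the recommender is written with no policy checks at all;
    # enforcement happens inside reveal, wherever the data flows
    return sorted(brien_follows.reveal(viewer))

print(recommend("family_friend"))   # a follower gets real suggestions
print(recommend("journalist"))      # everyone else gets the safe default
```

Had the recommendations code been written this way, forgetting the access check would not have been possible: there is no separate check to forget.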
I don't know how much this Twitter account leak upset the Comeys, but reading this article was pretty much the most exciting thing that I have ever done. Up until now, most people have thought about security in terms of protecting individual data items, rather than in terms of a complex and subtle interaction with the programs that use them. This has started to change in the last few years as people have been realizing just how much of our data is online, and just how unreliable the code is that we trust with this data. I hope that this Comey leak will cause even more people to realize how important it is to reason deeply about what our software is doing. (And to fund and use our research. :))

* A student in my spring software security course (basically, programming languages applied to security), Scott, had noticed earlier this semester that emails from Instagram allowed previews of protected accounts he was not following. He reported this to Facebook's bug bounty program and made $1000. I told him to please write in the course reviews that the course helped him make money.
** Note that a lot of other things are going on in this Comey story. The reporter used facts about Comey to figure out the setup, and also some clever inference. But this clever inference exploited a specific information leak from the secret follows list to the recommendations list, and this post focuses on this kind of leak.


Anonymous said...

These are interesting, but data security is seldom so binary. An algorithm might return an aggregate value which is typed as public, not secret, but might still end up inadvertently leaking secure data. The type checker won't save you from incorrect or outdated assessments of the security of derived data.

The cost of a leak is also not binary. A security failure might look like "we leaked inconsequential information about the users, but saved the lives of thousands of people", or it might look like "we compromised the safety of our users, but made record profits". You'll have trouble capturing that with a type system.

winterkoninkje said...

@Anonymous, Yes, you're correct that information can be leaked at a finer granularity than all or none, and that policies can change over time. Luckily, a number of the researchers in this area have already started tackling those issues. Those five bullet points are nowhere near the state of the art in this field; the first few are the bare minimum ideas, and are covered in any introductory course on the topic.

Kristopher Micinski said...

I agree with Wren that there are many good follow-up research directions that allow declassification policies to be more flexible. But I worry that there is a more fundamental issue at play that might be harder to address: no matter what happens, I suspect any information flow policy that protects users will necessarily produce worse results to ensure fairness and security. This is going to be a political problem as much as it is a technical one: we'll have to convince companies that they need to integrate transparent machine learning algorithms into their pipelines. On the technical end, we have to make it as easy as possible for real companies to be able to accomplish this, and provide auditing mechanisms so that users can understand what is happening with their data and why results of machine learning algorithms leak information.