Saturday, September 16, 2017

The Genius Fallacy

During our department's PhD Open House this past spring, a student asked what I thought made a PhD student successful. I realized that my answer now is different than it would have been a few years ago.

My friend Seth tells me I need to build more suspense in my writing*, so let me first tell my life story.

The whole time I was growing up, I was slightly disappointed that I wasn't some kind of prodigy. It seemed that my parents were telling me every day about so-and-so's toddler son who was playing Beethoven concertos from memory, so-and-so's daughter who, as an infant, had already completed a course on special relativity. In order to give me the same opportunities to demonstrate my genius, my parents spent all their money on piano lessons, gymnastic classes, writing camps, art camps, tennis camps, and extracurricular math classes. Unfortunately, nobody ever said, "This is the best kid I have ever seen. I must take her away from her family to train her for greatness."

"Child prodigies have hard lives," my father would tell me, probably trying hiding his disappointment. "It can be difficult for them to make friends because others can't relate to how gifted they are."

"Just work hard, be a nice person, and try to be happy," my mother would tell me. "You didn't know how to cry when you were born. I'm glad you're able to talk in full sentences."

Despite the comforting words from my parents, there was always a part of me that held out hope of discovering a secret prodigious talent. But the angst of not being a prodigy was small compared to the existential angst of being newly alive and so I mostly tried to work hard, to be a nice person, and to be happy. This got me all the way to college, where I thought I could leave all this prodigy nonsense behind me.

In college, I discovered that the pressure to be immediately and wildly gifted came in another form. In my first two years of school, I attended many talks and panels by professors telling us what we should do with our lives. I attended a research panel in the economics department, where one of the professors kept repeating the word "star."

"You have to be a super star to succeed in a department like ours," he said about what it meant to be on the tenure track in the economics department. "I want undergraduate researchers who are stars."

I didn't know what a star was and I didn't presume to be one, but I liked the professor's research, so I emailed him my resume and said I would like to work with him.

He never wrote back.

I resigned myself to not being a star. I took hard classes with people who had medaled in math, informatics, and science olympiads, wondering how it would feel to do the problem sets if I had such a gifted, well-trained mind. I also became concerned about my future. What was my place in a world that worshipped instatalent?

It all began to change when I started talking more with the professors in the Computer Science department. Despite my lack of apparent star quality, my professors seemed to like answering the questions I asked them. They pitched me projects I could do, and before I knew it I was applying to PhD programs and preparing to spend the next few years doing academic research. As I was graduating, I spoke with one of my professors to get advice about my future in research.

"Research isn't just about smarts," my professor told me. At the time, I thought this was a white lie that professors told to their students who weren't prodigies.

Then she told me something that turned my worldview upside down. "My biggest concern for you, Jean, is that you need to start finishing projects," she told me. "You need to focus."

It was then that I began to realize that maybe the myth of the instagenius was but a myth. I had gone from interest to interest, from project to project, waiting to find It, that easy fit, that continuous honeymoon. With some projects I had It for a while, long enough to demonstrate to myself and others that I could finish. Then I moved on, waiting to fall in love with a problem, waiting for a problem to choose me. What I had failed to see was that this relationship with a problem didn't just happen: I had to do my share of the work.

Still, I clung to the dream of the easy problem. At Google, employees get to have a 20% project: a side project they spend the equivalent of one day a week working on that may or may not make its way into production eventually. In graduate school, my 20% project was looking for an easier project--a project with which I had more chemistry, a project with fewer days lost to dead ends and angst. One of my hobbies involved interviewing for internships in completely different research areas. Another one of my hobbies was fantasizing about becoming a classics PhD student, despite knowing no ancient languages. (I once took an upper-level literature seminar on Aristotle with the leading world scholar on Homeric poetry and I thought he had a pretty good life.)

But because I like to finish what I started, the PhD became a process of learning to persevere. Instead of indulging the temptation to switch projects, advisors, or even schools, I kept going. I endured something like five rounds of rejections on the first paper towards my PhD thesis, and multiple years of people telling me that maybe I should find another topic, because I didn't seem in love. Eventually, I learned that every problem that looks like it might be easy has hard parts, every problem that looks like it might be fun has boring parts, and all problems worth solving are full of dead ends. I finally learned, in the words of my friend Seth, that "the grass is brown everywhere."

And this shattering of my belief in instagenius has shaped my conception of what makes a student a star. There was a time when I, like many people, thought that the superstars were the ones who sounded the most impressive when they spoke, or who had the most raw brainpower. If you asked me what I thought made a good researcher, I may have said some other traits like creativity and good taste in problems. And while all these certainly help with being a good researcher, there are plenty of people with these traits who do not end up being successful.

What I have learned is that discipline and the ability to persevere are as important to success as being able to look like a smart person in meetings, if not more so. All of the superstars I've known have worked harder--and often faced more obstacles, in part due to the high volume of work--than other people, despite how much it might look from the outside like they are flying from one brilliant result to another. Because of this, I now want students who accept that life is hard and that they are going to fail. I want students who accept that sometimes work is going to feel like it's going nowhere, to the point that they wish they were catastrophically failing instead because then at least something would be happening. While confidence might signal resilience and a formidable intellect might decrease the number of obstacles, the main differentiator between a star and simply a smart person is the ability to keep showing up when things do not go well.

It has become especially important for me to fight the idolization of the lone genius because it is not just distracting, but also harmful. Currently, people who "look smart" (which often translates into looking white, male, and/or socioeconomically privileged) have a significant advantage for two main reasons. The first reason has to do with self-perception. Committing to hard work and overcoming obstacles is easier if you think it will pay off. If someone already does not feel like they belong, it is easier for them to stop trying and self-select out of a pursuit when they hit a snag. The second reason has to do with perception by others. Research suggests that in fields that value innate talent, women and other minorities are often stereotyped to have less of it, leading to unfair treatment.

And so I've written this post not just to reveal my longstanding delusions of grandeur, but also to start a discussion about how the myth of instagenius holds us back, as individual researchers and as a community. I would love to hear your thoughts about how we can move past the genius fallacy.

Related writing:

* Seth also tells me the main idea of this blog post is the same as Angela Duckworth's book Grit. I guess I should tell you that you could read that instead of this. On the subject of the lack of originality of my ideas, you should also read what Cal Newport has to say about the "passion trap."

Wednesday, August 09, 2017

Guest Post: The Real Problem Isn't Gender; It's the Modern Media

This guest post by Seth Stephens-Davidowitz was adapted from a comment he wrote on a Facebook post of mine sharing this essay.

While gender in tech is certainly an issue, a lot of the controversy over it is unnecessary. What recently happened with the Google memo is a classic case of Scott Alexander’s Toxoplasma of Rage, one of the most brilliant pieces I have ever read. Read his post. Then read it again.

The stories that go viral are those that maximize anger and foster the most disagreement.

Guy writes a memo with a lot of true statements but an aggressive tone bound to infuriate some people. Within two days, everybody is predictably furious.

My hypothesis is that an overwhelming majority of people actually agree on many of the points of contention--or would agree if they were phrased a little less aggressively, in a tone less likely to create controversy and less likely to go viral.

How many people, for example, agree with the following statement?

"We do not know why CS majors are 80% male. It is possible that, even though millions of women have a passion for computer science, there are, in aggregate, fewer women than men who have this passion. We don't know since computer science is kind of new. And also we don't really understand why female CS majors rose to 40% and then plummeted. Since it is possible that discrimination and stereotypes play a role, we should devote resources to making sure everybody with interest in these high-status jobs has ample opportunity to pursue them. Also, everybody should be judged based on their own interest and aptitude in a job, not how many people of their gender would want that job. Finally, the majority of women in tech--as well as many other high-powered fields--have said they have faced sexism, and we should work really hard to stop that."

This addresses many of the controversies that were raised by the James Damore memo and the responses to it, but is phrased in a way such that few people would find it objectionable. Perhaps we should stop falling for these traps that maximize rage and instead try sober analysis. We may find a surprising amount of consensus.

Lastly, no young person, man or woman, should actually be training for anything--driving cars, teaching kids, diagnosing diseases, or writing programs--because AI will soon do all that for us. ;)

For 352 pages of sober analysis on even more controversial topics, you can check out Seth’s book Everybody Lies.

Saturday, April 01, 2017

Techniques for Protecting Comey's Twitter: A Taxonomy

Person in the know calling me out.
After my post about how the Comey Twitter leak was the most exciting thing ever for information flow security researchers, I had some conversations with people wanting to know how to tell the difference between information that is directly leaked and information that is deduced. Someone also pointed out that I didn't mention differential privacy, a kind of statistical privacy that talks about how much information an observer can infer. It's true: there are many mechanisms for protecting sensitive information, and I focused on a particular one, both because it was the relevant one and because it's what I work on. :)

Since this Comey Twitter leak is such a nice example, I'm going to provide more context by revisiting a taxonomy I used in my spring software security course, adding statistical privacy to the list. (Last time I had to use a much less exciting example, about my mother spying on my browser cookies.)

  • Access control mechanisms resolve permissions on individual pieces of data, independently of a program that uses the data. An access control policy could say, for instance, that only Comey's followers could see who he is following. You can use access control policies to check data as it's leaving a database, or anywhere in the code. What people care about with respect to access control is that the access control language can express the desired policies while providing provable guarantees that policies won't accidentally grant access, and that policies can be checked reasonably efficiently.
  • Information flow mechanisms check the interaction of sensitive data with the rest of the program. In the case of this Comey leak, access control policies were in place some of the time. For example, if you went to Comey's profile page, you couldn't see who he was following. The journalist ended up finding his page by looking at the other users suggested by the recommendation algorithm after requesting to follow hypothesized-Comey. (This was aided by the fact that Comey is following few people.) In this case, it seems that Instagram was feeding secret follow information into the recommendation algorithm and not realizing that the results could leak follow information. An information flow mechanism would make sure that any computation based on secret follow information could not make its way into the output from a recommendation algorithm. If the follow list is secret, then so is the length of that list, people followed by people on the follow list, photos of people from the list, etc.
  • Statistical privacy mechanisms prevent aggregate computations from revealing too much information about individual sensitive values. For instance, you might want to develop a machine learning algorithm that uses medical patient record information to do automated diagnosis given symptoms. It's clear that individual patient record information needs to be kept secret--in fact, there are laws that require people to keep this secret. But there can be a lot of good if we can use sensitive patient information to help other patients. What we want, then, is to allow algorithms to use this data, but with a guarantee that an observer has a very low probability of tracing diagnoses back to individual patients. The most popular formulation of statistical privacy is differential privacy, a property over computations that allows computations only if observers can tell the original data apart from slightly different data with very low probability. Differential privacy is very hot right now: you may have read that Apple is starting to use this. It's also not a solved problem: my collaborator and co-instructor Matt Fredrikson has an interesting paper about the tension between differential privacy and social good, calling for a reformulation of statistical privacy to address the current flaws.
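
To make the statistical privacy item a bit more concrete, here is a minimal sketch of the Laplace mechanism, one standard textbook way of getting differential privacy for a counting query. Nothing here comes from the systems mentioned above: the function names, the toy patient records, and the choice of epsilon are all assumptions made for illustration, and a production implementation would need more care (for example, around floating-point behavior).

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise.

    The difference of two i.i.d. exponential samples with mean `scale`
    follows a Laplace distribution with location 0 and that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng):
    """Return a noisy count of the records matching `predicate`.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon is the
    usual calibration for epsilon-differential privacy.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical toy data: 30 flu diagnoses out of 100 patient records.
records = [{"diagnosis": "flu"}] * 30 + [{"diagnosis": "cold"}] * 70
rng = random.Random(0)
noisy = private_count(records, lambda r: r["diagnosis"] == "flu", 1.0, rng)
print(noisy)  # a value near 30; the exact number depends on the noise
```

Each individual answer is blurred, so an observer cannot tell with confidence whether any one patient is in the data, while the average over many records stays useful. A smaller epsilon means more noise and more privacy.
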
For those wondering why I didn't talk about encryption: encryption focuses on the orthogonal problem of putting a lock on an individual piece of data, where locks can have varying cost and varying strength. Encryption involves a different kind of math--and we also don't cover encryption in my spring course for this reason.

Another discussion I had on Twitter.
Discussion. Some people may wonder if the Comey Twitter leak is an information flow leak, or some other kind of leak. It is true that in many cases, this Instagram bug may not be so obvious because someone is following many people, and the recommendation algorithm has more to work with. I would argue that it is squarely in the purview of information flow mechanisms. If follow information is secret, then recommendation algorithms should not be able to compute using this data. (Here, it seems like what one means by "deducible" is "computed from," and that's an information flow property.) We're not in a situation where these recommendation engines are taking information from thousands of users and doing something important. It's very easy for information to leak here, and it's simply not worth the loss to privacy!
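
The "computed from" reading can be sketched in a few lines of code. This is a toy model of label propagation, not any particular research system: the `Labeled` class, the `compute` and `release` helpers, and the account names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Labeled:
    """A value paired with a secrecy flag (hypothetical, for illustration)."""
    value: object
    secret: bool

def compute(fn, *inputs):
    """Apply fn to the raw values; the result is secret if any input was."""
    result = fn(*(item.value for item in inputs))
    return Labeled(result, any(item.secret for item in inputs))

def release(labeled, viewer_is_authorized):
    """Output check: results derived from secrets go only to authorized viewers."""
    if labeled.secret and not viewer_is_authorized:
        return None  # nothing computed from secret data is shown
    return labeled.value

def recommend(follows, popular):
    """Toy recommender: accounts this user follows, then popular accounts."""
    return sorted(set(follows)) + [p for p in popular if p not in follows]

follows = Labeled(["dad_account", "mom_account"], secret=True)
popular = Labeled(["celebrity"], secret=False)

suggestions = compute(recommend, follows, popular)
print(suggestions.secret)                                # True
print(release(suggestions, viewer_is_authorized=False))  # None
```

The recommender itself contains no checks. The suggestions are withheld because they were computed from the secret follow list, which is the property that distinguishes information flow from access control on individual values.
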

Poor, and in violation of our privacy settings.
Takeaways. We should stand up for ourselves when it comes to our data. Companies like Facebook are making recommendations based on private information all the time, and not only is it creepy, but it violates our privacy policies, and they can definitely do something about it. My student Scott recently made $1000 from Facebook's bug bounty program reporting that photos from protected accounts were showing up in keep-in-touch emails from Instagram. If principles alone don't provide enough motivation, maybe the $$ will incentivize you to call tech companies out when you encounter sloppy data privacy practices.

Friday, March 31, 2017

Five Research Ideas Instagram Could Have Used to Protect Comey's Secret Twitter

Even though cybersecurity is one of the hottest topics on the Internet, my specific area of research, information flow security, has remained relatively obscure. Until now, that is.

You may have heard of "information flow" as a term that has been thrown around with terms like "data breach," "information leak," and "1337 hax0r." You may not be aware that information flow is a specific term, referring to the practice of tracking sensitive data as it flows through a program. While techniques like access control and encryption protect individual pieces of data (for instance, as they leave a database), information flow techniques additionally protect the results of any computations on sensitive data.

Information flow bugs are usually not the kinds of glamorous bugs that make headlines. Many of the data leaks that have been in the public consciousness, for instance the Target and Sony hacks, happened because the data was not protected properly at all. In these cases, having the appropriate access checks, or encrypting the data, should do the trick. But "why we need to protect data more better" is harder to explain. Up through my PhD thesis defense, I had such a difficult time finding headlines that were actually information flow bugs that I resorted to general software security motivations (cars! skateboards! rifles!) instead.

From the article.
Then along came "This Is Almost Certainly James Comey's Twitter Account," an article I have been waiting for since I started working on information flow in 2010. The basic idea behind the article is this: a journalist named Ashley Feinberg wanted to find FBI director James Comey's secret Twitter account, and so started digging around the Internet. Feinberg succeeded within four hours thanks to some cleverness and a key information leak in Instagram: when you request to follow an Instagram account, it makes algorithmic suggestions of who else to follow. And in the case of this article, the algorithmic suggestions for Comey's son Brien included several family members, including James Comey's wife--and the account that Feinberg deduced to be James Comey's. And it seems that Comey uses the same "anonymous" handle on Instagram as he does on Twitter. And so Instagram's failure to protect Brien Comey's protected "following" list led to the discovery of James Comey's Twitter account.

So what happened here? Instagram promises to protect secret accounts, which it (sometimes*) does. When one directly views the Instagram page of a protected user, they cannot access that person's photos, who that user is following, or who follows that user. This might lead a person to think that all of this information is protected all of the time. Wrong! It turns out the protected account information is visible to algorithms that suggest other users to follow, a feature that becomes--incorrectly--visible to all viewers once a follow is requested, because, presumably, whoever implemented this functionality forgot an access check. In this case the leak is particularly insidious because while the profile photos and names of the users shown are all already public, they are likely shown as a result of computations on secret information: Brien Comey's protected follow information. (This is a subtle case to remember to check!) In information flow nomenclature, this is called an implicit flow. When someone is involved in a lot of Instagram activity, the implicit flow of the follow information may not be so apparent. But when many of the recommended follows are Comey family members, many of whom use their actual names, this leak becomes more serious!
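
The shape of this bug can be sketched in a few lines. This is a guess at the general pattern, not Instagram's actual code: every name and data value below is hypothetical.

```python
# Hypothetical data: one protected account, its follow list, and its followers.
FOLLOWS = {"son": ["dad_anon", "mom", "sister"]}
FOLLOWERS = {"son": {"friend"}}

def can_see_follows(viewer, owner):
    """Access policy: only the owner and approved followers may see follows."""
    return viewer == owner or viewer in FOLLOWERS.get(owner, set())

def view_following(viewer, owner):
    """Direct path: the profile page applies the access check."""
    if not can_see_follows(viewer, owner):
        return []
    return FOLLOWS[owner]

def suggestions_after_follow_request(viewer, owner):
    """Derived path: computes on the protected list with no check (the leak)."""
    return [account for account in FOLLOWS[owner] if account != viewer]

def suggestions_checked(viewer, owner):
    """Same computation, but its input goes through the access check."""
    return [a for a in view_following(viewer, owner) if a != viewer]

print(view_following("reporter", "son"))                    # []
print(suggestions_after_follow_request("reporter", "son"))  # ['dad_anon', 'mom', 'sister']
print(suggestions_checked("reporter", "son"))               # []
```

Every value the unchecked function returns is individually public (an account name), which is why the missing check is easy to overlook. What leaks is the fact that these particular accounts were selected using the protected list.
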

Creepy Facebook search, from express.co.uk.
In the world of information flow, this article is a Big Deal because it so perfectly illustrates why information flow analyses are useful. For years, I had been jumping up and down and waving my arms (see here and here, for instance) about why we need to check data in more places than the point where it leaves the database. Applications aren't just showing sensitive values directly anymore, but the results of all kinds of computations on those values! (In this case it was a recommendations algorithm.) We don't always know where sensitive data is eventually going! (As was the case when Brien Comey's protected "following" list was handed over to the algorithm.) Policies might depend on sensitive data! We may even compute where sensitive data is going based on other sensitive data! In a world where we can search over anything, no data is safe!

Until recently, my explanations have seemed subtle and abstract to most, in direct contrast to the sexy flashy security work that people imagine after watching Hackers or reading Crypto. By now, though, we information flow researchers should have your attention. We have all kinds of computations over all kinds of data going to all kinds of people, and nobody has any clue what is going on in the code. Even though digital security should be one of the main concerns of the FBI, Comey is not able to avoid the problems that arise from the mess of policy spaghetti that is modern code.

Fortunately, information flow researchers have been working for years on preventing precisely this kind of Comey leak**. In fashionable BuzzFeed style, I will list exactly five research ideas Instagram could adapt to prevent such leaks in the future:
  1. Static label-based type-checking. In most programming languages, program values have types. Types usually tell you simple things like whether something is a number or a list, but they can be arbitrarily fancy. Types may be checked at compile time, before the program runs, or at run time, while the program is running. There has been a line of work on static (compile time) label-based information flow type systems (starting with Jif for Java, with a survey paper here describing more of this work) that allow programmers to label data values with security levels (for instance, secret or not) as types, and that propagate these labels through the program to make sure sensitive information does not flow to places that are less sensitive. These type systems give guarantees about any program that runs. The beauty of these type systems is that while they look simple, they are clever enough to be able to capture the kind of implicit flow that we saw with algorithms leaking Brien Comey's follow information. (We'd label the follow lists as sensitive, and then any values computed from them couldn't be leaked!)
  2. Static verification. Label-based type-checking is a light-weight way of proving the correctness of programs according to some logical specification. There are also heavier-weight ways of doing it, using systems that translate programs automatically into logical representations, and check them against the specification. Various directions of work using refinement types, super fancy types that depend on program values, could be used for information flow. (An example of a refinement type is {int x | canSee(alice, x)}, the type of an integer x that can only exist if user "alice" is allowed to see it according to the "canSee" function/predicate.) Researchers have also demonstrated ways of proving information flow properties in systems like IronClad and mCertiKOS. These efforts are pretty hardcore and require a lot of programmer effort, but they allow people to prove all sorts of boutique guarantees on boutique systems (as opposed to the generic type system guarantees using the subset of a language that is supported).
  3. Label-based dynamic information flow tracking. Static label-based type-checking, while useful, often requires the programmer to put labels all over programs. Systems such as HiStar, Flume (the specific motivation of which was the OKCupid web server), and Hails allow labeling of data in a way similar to static label-based type systems, but track the flow of information dynamically, while the program is running. The run-time tracking, while it makes it so that programmers don't have to put labels everywhere, comes at a cost. First, it introduces performance slowdowns. Second, we can't know if a program is going to give us some kind of "access denied" error before it runs, so there could be accesses denied all over the place. Many of these systems handle these problems by doing things at the process level: if there is an unintended leak anywhere in the process, the whole process aborts. (Those who haven't heard of processes can think of the process as encapsulating a whole big task, rather than an individual action, like doing a single arithmetic operation.)
  4. Secure multi-execution. Secure multi-execution is a nice trick for running black-box code (code that you don't want to--or can't--change) in a way that is secure with respect to information flow. The trick is this: every time you reach a sensitive value, you execute with the sensitive value in one process, and you spawn another process using a secure default input. The process separation guarantees that sensitive values won't leak into the process containing the default value, so you know you should always be allowed to show the result of that one. As you might guess, secure multi-execution can slow down the program quite a bit, as it needs to spawn a new process every time it sees a sensitive value. To mitigate this, my collaborators Tom Austin and Cormac Flanagan developed a faceted execution semantics for programs that lets you execute a program on multiple values at the same time, with all of the security guarantees of secure multi-execution.
  5. Policy-agnostic programming. While all of these other approaches can prevent sensitive values from leaking information, if we want programs to run most of the time, somebody needs to make sure that programs are written not to leak information in the first place. It turns out this is pretty difficult, so I have been working on a programming model that factors information flow policies out of the rest of the program. (If I'm going to write a whole essay about information flow, of course I'm going to write about my own research too!) Instead of having to implement information flow policies as checks across the program, where any missing check can lead to a bug, type error, or runtime "access denied," programmers can now specify each policy once, associated with the data, along with a default value, and rely on the language runtime and/or compiler to make the code execute according to the policies. In the policy-agnostic system, the programmer can say that Brien Comey's follows should only be visible to followers, and the machine becomes responsible for making sure this policy is enforced everywhere, including the code implementing the recommendations algorithm. That policies can depend on sensitive values, that sensitive values may be shown to viewers whose identities are computed from sensitive values, and that enforcing policies usually requires implementing access checks across the code are all challenges. Our semantics for the Jeeves programming language (paper here) addresses all of these issues using a dynamic faceted execution approach, and we have also extended this programming model to handle applications with a SQL database backend (paper here). We are also working on a static type-driven repair approach (draft here).
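
As a rough illustration of the last two ideas, here is a much-simplified sketch of a faceted value: a secret facet, a public default, and a policy attached once to the data. This is not the Jeeves API or its semantics (which also handle policies that depend on sensitive values, nested facets, and control flow); the `Faceted` class and all names below are assumptions made up for this example.

```python
class Faceted:
    """A sensitive value with a secret facet, a public default, and a policy.

    Hypothetical sketch: `policy(viewer)` decides which facet a viewer gets.
    """

    def __init__(self, secret, public, policy):
        self.secret = secret
        self.public = public
        self.policy = policy

    def map(self, fn):
        """Run fn on both facets, so the result carries the same policy."""
        return Faceted(fn(self.secret), fn(self.public), self.policy)

    def show(self, viewer):
        """Resolve to a concrete value only at output time."""
        return self.secret if self.policy(viewer) else self.public

def recommend(follows):
    """Policy-free application code: no access checks appear here."""
    return ["suggested:" + account for account in follows]

followers = {"friend"}
follows = Faceted(
    secret=["dad_anon", "mom"],  # the real follow list
    public=[],                   # default shown to everyone else
    policy=lambda viewer: viewer in followers,
)

suggestions = follows.map(recommend)
print(suggestions.show("friend"))    # ['suggested:dad_anon', 'suggested:mom']
print(suggestions.show("reporter"))  # []
```

The policy is written once, next to the data, and the recommender runs on both facets without knowing a policy exists. An unauthorized viewer sees the result of the computation on the default value rather than an "access denied" error.
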
I don't know how much this Twitter account leak upset the Comeys, but reading this article was pretty much the most exciting thing that I have ever done. Up until now, most people have thought about security in terms of protecting individual data items, rather than in terms of a complex and subtle interaction with the programs that use them. This has started to change in the last few years as people have been realizing just how much of our data is online, and just how unreliable the code is that we trust with this data. I hope that this Comey leak will cause even more people to realize how important it is to reason deeply about what our software is doing. (And to fund and use our research. :))

* A student in my spring software security course (basically, programming languages applied to security), Scott, had noticed earlier this semester that emails from Instagram allowed previews of protected accounts he was not following. He reported this to Facebook's bug bounty program and made $1000. I told him to please write in the course reviews that the course helped him make money.
** Note that a lot of other things are going on in this Comey story. The reporter used facts about Comey to figure out the setup, and also some clever inference. But this clever inference exploited a specific information leak from the secret follows list to the recommendations list, and this post focuses on this kind of leak.

Wednesday, March 08, 2017

Autoresponse: Striking: A Day Without a Woman

In front of the Federal Building, Pittsburgh.
Dear Message Sender,

  I am not responding to email on March 8, 2017 because I am observing A Day Without a Woman. In the afternoon, I will be joining students at CMU in a silent protest and attending a rally at the City-County building in downtown Pittsburgh.

  Despite the efforts and progress made towards gender equality, women do not have an equal voice, and we are not appreciated equally in society. For example:
  • The gender wage gap persists, and two-thirds of minimum wage earners are women.
  • The House and Senate are currently 19% women. This means an 81% male group is making decisions that affect women's health and lives.
  • The United States still has not had a female president, even though many countries that we like to think we are more progressive than currently have a woman in power.
  • Only 24 of the Fortune 500 CEOs are women. Money is power, and women have less of it.
Some may say that women are simply less ambitious, or don't want to be in positions of power and influence as much as men do. Study after study--and I'm happy to talk in more detail--has shown that women who do have the ambition face far more obstacles than men do. Also, my statistics above focus on what people like to call "privileged" women, but the undervaluing of female labor (including domestic and emotional labor) makes life even harder for those in less fortunate circumstances.

  There are many ways you can show support. The first is to attend local rallies, especially if your employment situation means you will face few consequences. Even if you are not a woman and/or not striking today, here are some things you can do:
  • Listen to women, and call people out when women's voices are not heard.
  • Question your own biases. (You can have biases even if you are a woman!)
  • Vote for women. Champion women. Mentor women. (In that order.)
  • Support people who are striking, and who are more actively fighting for women's rights and the appreciation of women's labor, both financially and by amplifying their voices.

Yours in solidarity,
Jean