Sunday, May 01, 2016

Myth: "CS Researchers Don't Publish Code or Data"

A collaboration with Sam Tobin-Hochstadt, Assistant Professor at Indiana University.

There has been some buzz on social media about this "Extremely Angry" Twitter thread. Mike Hoye, Engineering Community Manager for Firefox at Mozilla, expressed frustration about getting access to the products of research. It turns out that many other people are angry about this too.

While there are certainly legitimate aspects to these complaints, we’d like to address a specific misperception from this Twitter thread: the claim that "CS researchers don't publish code or data." The data simply shows this is not true.

First of all, while the Repeatability in Computer Science study from a few years ago highlighted some issues with reproducibility in our field, it also revealed that a significant fraction of researchers (226 out of 402) at systems conferences had code available, either directly linked from the paper or on request.

Additionally, in the last few years, conferences in Programming Languages and Software Engineering have been pushing for more standardization of code-sharing and repeatability of results through Artifact Evaluation Committees. There is a comprehensive summary of Artifact Evaluation in our field here. (In fact, Jean is co-chairing the POPL 2017 AEC with Stephen Chong.) According to the site, artifacts are evaluated according to the following criteria:
  • Consistent with the paper. Does the artifact substantiate and help to reproduce the claims in the paper?
  • Complete. What is the fraction of the results that can be reproduced?
  • Well documented. Does the artifact describe and demonstrate how to apply the presented method to a new input?
  • Easy to reuse. How easy is it to reuse the provided artifact? 
The most detailed documentation is associated with the AEC for OOPSLA 2013, where 50 papers were accepted, 18 artifacts passed evaluation, and 3 artifacts were rejected. For PLDI 2014, 20 of 50 papers submitted artifacts and 12 passed. By PLDI 2015, 27 papers (out of 52) had approved artifacts. Even POPL, the “theoretical” PL conference, had 21 papers with approved artifacts by 2016.

For those wondering why more artifacts are not passing yet, here is a transcribed discussion by Edward Yang from PLDI 2014. The biggest takeaways are that 1) many people care about getting the community to share reproducible and reusable code and 2) it takes time to figure out the best ways to share research code. (That academia’s job is not to produce shippable products, as Sam pointed out on Twitter, is the subject of a longer conversation.)

While it’s going to take time for us to develop practices and standards that encourage reproducibility and reusability, we’ve already seen some improvements. Over the years, Artifact Evaluation has become more standardized and committees have moved towards asking researchers to package code in VMs if possible to ensure long-term reproducibility. Here are the latest instructions for authors.
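To make the packaging idea concrete, here is a minimal sketch of one thing artifact authors often do alongside a VM: record the exact environment an experiment ran in, so an evaluator can later check whether their setup matches. The file layout, function name, and the example dependency are hypothetical, not taken from any particular AEC's instructions:

```python
import json
import platform
import sys
from importlib import metadata  # stdlib in Python 3.8+

def environment_manifest(packages):
    """Capture interpreter, OS, and package versions for an artifact README."""
    manifest = {
        "python": sys.version.split()[0],
        "os": platform.platform(),
        "packages": {},
    }
    for name in packages:
        try:
            manifest["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Flag dependencies that are missing in the current environment.
            manifest["packages"][name] = None
    return manifest

if __name__ == "__main__":
    # "numpy" stands in for whatever the artifact actually depends on.
    print(json.dumps(environment_manifest(["numpy"]), indent=2))
```

An evaluator can diff the manifest generated inside the VM against the one shipped with the artifact before trying to reproduce any results.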

Yes, we can always do better to push towards making all of our papers and code available and reusable. Yes, researchers can do better in helping bridge the communication gap between academia and industry--and this is something we've both worked at. But the evidence shows that we in the academic community are certainly sharing our code--and that we’ve been doing a better job of it each year.

Note: It would be really cool if someone did a survey of individual researchers. As Sam pointed out on Twitter, many of our colleagues use GitHub or other social version control and push their code even before the papers come out.

--

UPDATE! Here is a survey for academics to report on how we share code. Please fill it out so we can see what the numbers are like! Thanks to Emery Berger, Professor at UMass Amherst, for conducting the survey.

Related update. Some conversations with others reminded me that the times I haven't shared my code, it has been because I was collaborating with companies and corporate IP policies prevented me from sharing. (In fact, this was one of the reasons I preferred to stay in academia.) The survey above asks about this. I'm curious how the numbers come out.

7 comments:

Tragic Cynic said...

To paraphrase what you've said:
> 50% of 'public research' have either:
> public examples of reproducible experiments (ie. code)
> or code on request. (good luck contacting the dead! :P)
>
> we need a committee to make sure experiments are repeatable
>
> people complain because our experiments (code) cannot be even attempted (won't run).
> it's hard to make reproducible experiments. But this is not that bad, because (direct quote)
> > "That academia’s job is not to produce shippable
> > products..." (Last sentence, paragraph 6)
> yes, we are bad at this. But we are improving year by year. (no hurry :<} )

This paraphrasing, of course, assumes that being able to run code is evidence for the thesis at hand.

When you guys:
Stop taking government checks
Make lectures in shorts & thongs, like Carmack (maybe chub&tuck as well?)
keep all your tax-payer(industry) funded findings public
By all that, I mean not doing this
Or DMCA claiming a student's homework for using your code libraries rather than dealing with the issue on-campus, asking for a library gitignore, etc...

Then,
and only then would I consider your expensive existence ( To those who fund you, through tax or fee ) a boon.

parenz said...

> This paraphrasing, of course, assumes that being able to run code is evidence for the thesis at hand.

Indeed, for a huge portion of CS research providing a runnable piece of code is not a priority.

> Make lectures in shorts & thongs, like Carmack (maybe chub&tuck as well?)

Is this a joke?

> keep all your tax-payer(industry) funded findings public

Believe me, most of the researchers support Open Access models.

> and only then would I consider your expensive existence ( To those who fund you, through tax or fee ) a boon.

So let me be clear, unless a researcher produces a runnable piece of code, you consider their research useless?

Paolo Giarrusso said...

First, don't confuse scientists with those suing Swartz — those were scientific publishers. I didn't know about MIT, but I don't support their decision. We'd like to get rid of scientific publishers, but there's lots of inertia, even though we don't get one penny out of what one might pay to read an article. Yet, it's happening.

Second, the device you're typing on wouldn't exist without research, and some of that research is too expensive for companies to fund — so it's either government-funded or won't happen. Even this functional programming thing that Carmack now loves talking about exists thanks to researchers (who've been at it for >50 years), not thanks to the companies who've ignored it.

Grigori Fursin said...

As an artifact evaluation chair for CGO and PPoPP (http://cTuning.org/ae), I can mention that the major problem is not about code and data sharing, but about sharing artifacts as reusable and customizable components. This would allow researchers not only to validate current claims, but also to try techniques in a different environment and with different parameters.

However, after surveying authors, we noticed that there is simply no supporting technology for that. This motivated us to develop an open-source framework (Collective Knowledge) to let researchers share their artifacts as Python components with a JSON API and meta-information, and even to crowdsource empirical experiments (we focus mainly on application performance analysis and optimization in computer systems research):
* http://github.com/ctuning/ck

We also promote a new publication model where articles and artifacts are submitted to an open archive at the time of submission, and all reviews are public (i.e. it's a cooperative effort to validate and improve shared code, data and results rather than just shaming and rejecting problematic ones). We tried it at our ADAPT workshop this year, and it was very successful:
* http://adapt-workshop.org
* http://arxiv.org/abs/1406.4020

Hope it will help address some of the raised issues!

Vicky Steeves said...

So there's a big difference between reproducibility & making code and data available. While open access is a big ingredient in reproducibility, there's the problem of "dependency hell": the tangle of dependencies required to successfully rerun and reuse someone else's code & data. If you are on a different OS, if you have a different version of a software library, or if the steps aren't documented (publishing a bunch of CSVs and scripts does nothing to help people reproduce things), then the research won't be reproducible. While there's great OA in CS, there's not great reproducibility.
