Sunday, May 01, 2016

Myth: "CS Researchers Don't Publish Code or Data"

A collaboration with Sam Tobin-Hochstadt, Assistant Professor at Indiana University.

There has been some buzz on social media about this "Extremely Angry" Twitter thread. Mike Hoye, Engineering Community Manager for Firefox at Mozilla, expressed frustration about getting access to the products of research. It turns out that many other people are angry about this too.

While there are certainly legitimate aspects to these complaints, we’d like to address a specific misperception from this Twitter thread: the claim that "CS researchers don't publish code or data." The data simply shows this is not true.

First of all, while the Repeatability in Computer Science study from a few years ago highlighted some issues with reproducibility in our field, it revealed that more than half of researchers (226 out of 402, about 56%) in systems conferences have code available, either directly linked from the paper or on request.

Additionally, in the last few years, conferences in Programming Languages and Software Engineering have been pushing for more standardization of code-sharing and repeatability of results through Artifact Evaluation Committees. There is a comprehensive summary of Artifact Evaluation in our field here. (In fact, Jean is co-chairing the POPL 2017 AEC with Stephen Chong.) According to the site, artifacts are evaluated according to the following criteria:
  • Consistent with the paper. Does the artifact substantiate and help to reproduce the claims in the paper?
  • Complete. What is the fraction of the results that can be reproduced?
  • Well documented. Does the artifact describe and demonstrate how to apply the presented method to a new input?
  • Easy to reuse. How easy is it to reuse the provided artifact? 
The most detailed documentation is associated with the AEC for OOPSLA 2013, where 50 papers were accepted, 18 artifacts passed evaluation, and 3 artifacts were rejected. For PLDI 2014, 20 of 50 papers submitted artifacts and 12 passed. By PLDI 2015, 27 papers (out of 52) had approved artifacts. Even POPL, the “theoretical” PL conference, had 21 papers with approved artifacts by 2016.
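
As a rough check on the figures quoted so far, the sketch below turns each count into a percentage. The helper name `share` and the labels are illustrative assumptions, not taken from the studies; the counts are copied from the text above, and POPL 2016 is left out because no total is given for it.

```python
def share(part: int, total: int) -> float:
    """Return part/total as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * part / total, 1)


# (label, count, total) as quoted in the post; labels are illustrative
FIGURES = [
    ("Repeatability study: code available", 226, 402),
    ("OOPSLA 2013: artifacts passed / accepted papers", 18, 50),
    ("PLDI 2014: artifacts passed / papers", 12, 50),
    ("PLDI 2015: approved artifacts / papers", 27, 52),
]

for label, part, total in FIGURES:
    print(f"{label}: {part}/{total} = {share(part, total)}%")
```

Read this way, the shares are partial but not negligible, and the PLDI numbers roughly double from 2014 to 2015.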

For those wondering why more artifacts are not passing yet, here is a transcribed discussion by Edward Yang from PLDI 2014. The biggest takeaways are that 1) many people care about getting the community to share reproducible and reusable code and 2) it takes time to figure out the best ways to share research code. (That academia’s job is not to produce shippable products, as Sam pointed out on Twitter, is the subject of a longer conversation.)

While it’s going to take time for us to develop practices and standards that encourage reproducibility and reusability, we’ve already seen some improvements. Over the years, Artifact Evaluation has become more standardized and committees have moved towards asking researchers to package code in VMs if possible to ensure long-term reproducibility. Here are the latest instructions for authors.

Yes, we can always do better to push towards making all of our papers and code available and reusable. Yes, researchers can do better in helping bridge the communication gap between academia and industry--and this is something we've both worked at. But the evidence shows that the academic community is certainly sharing our code--and that we’ve been doing a better job of it each year.

Note: It would be really cool if someone did a survey of individual researchers. As Sam pointed out on Twitter, many of our colleagues use GitHub or other social version control and push their code even before the papers come out.

--

UPDATE! Here is a survey for academics to report on how we share code. Please fill it out so we can see what the numbers are like! Thanks to Emery Berger, Professor at UMass Amherst, for conducting the survey.

Related update. Some conversations with others reminded me that the times I haven't shared my code, it has been because I was collaborating with companies and corporate IP policies prevented me from sharing. (In fact, this was one of the reasons I preferred to stay in academia.) The survey above asks about this. I'm curious how the numbers come out.

30 comments:

Tragic Cynic said...

To paraphrase what you've said:
> 50% of 'public research' have either:
> public examples of reproducible experiments (ie. code)
> or code on request. (good luck contacting the dead! :P)
>
> we need a committee to make sure experiments are repeatable
>
> people complain because our experiments (code) cannot be even attempted (won't run).
> it's hard to make reproducible experiments. But this is not that bad, because (direct quote)
> > "That academia’s job is not to produce shippable
> > products..." (Last sentence, paragraph 6)
> yes, we are bad at this. But we are improving year by year. (no hurry :<} )

This paraphrasing, of course, assumes that being able to run code is evidence for the thesis at hand.

When you guys:
Stop taking government checks
Make lectures in shorts & thongs, like Carmack (maybe chub&tuck as well?)
keep all your tax-payer(industry) funded findings public
By all that, I mean not doing this
Or DMCA claiming a student's homework for using your code libraries rather than dealing with the issue on-campus, asking for a library gitignore, etc...

Then,
and only then would I consider your expensive existence ( To those who fund you, through tax or fee ) a boon.

Anonymous said...

> This paraphrasing, of course, assumes that being able to run code is evidence for the thesis at hand.

Indeed, for a huge portion of CS research providing a runnable piece of code is not a priority.

> Make lectures in shorts & thongs, like Carmack (maybe chub&tuck as well?)

Is this a joke?

> keep all your tax-payer(industry) funded findings public

Believe me, most researchers support Open Access models.

> and only then would I consider your expensive existence ( To those who fund you, through tax or fee ) a boon.

So let me be clear, unless a researcher produces a runnable piece of code, you consider their research useless?

Unknown said...

First, don't confuse scientists with those suing Swartz — those were scientific publishers. I didn't know about MIT, but I don't support their decision. We'd like to get rid of scientific publishers, but there's lots of inertia, even though we don't get one penny out of what one might pay to read an article. Yet, it's happening.

Second, the device you're typing on wouldn't exist without research, and some of that research is too expensive for companies to fund — so it's either government-funded or won't happen. Even this functional programming thing that Carmack now loves talking about exists thanks to researchers (who've been at it for >50 years), not thanks to the companies who've ignored it.

Grigori Fursin said...

As an artifact evaluation chair for CGO and PPoPP (http://cTuning.org/ae), I can mention that the major problem is not about code and data sharing, but about sharing artifacts as reusable and customizable components. This would allow researchers not only to validate current claims, but also to try techniques in a different environment and with different parameters.

However, after surveying authors, we noticed that there is simply no supporting technology for that. This motivated us to develop an open-source framework (Collective Knowledge) to let researchers share their artifacts as Python components with JSON API and meta information, and even to crowdsource empirical experiments (we focus mainly on application performance analysis and optimization in computer systems' research):
* http://github.com/ctuning/ck

We also promote a new publication model where articles and artifacts are submitted to an open archive at the time of submission, while all reviews are public (i.e. it's a cooperative effort to validate and improve shared code, data and results rather than just shaming and rejecting problematic ones). We tried it at our ADAPT workshop this year, and it was very successful:
* http://adapt-workshop.org
* http://arxiv.org/abs/1406.4020

Hope it will help address some of the raised issues!

Vicky Steeves said...

So there's a big difference between reproducibility & making code and data available. While open access is a big ingredient in reproducibility, there's the problem of "dependency hell." This basically amounts to the crazy amount of dependencies required to successfully rerun and reuse someone else's code & data -- for instance, if you are on a different OS, if you have a different version of a software library, if you don't document what you are doing (publishing a bunch of CSVs and scripts does nothing to help people reproduce things), then your research won't be reproducible. While there's great OA in CS, there's not great reproducibility.
