Sunday, May 01, 2016

Myth: "CS Researchers Don't Publish Code or Data"

A collaboration with Sam Tobin-Hochstadt, Assistant Professor at Indiana University.

There has been some buzz on social media about this "Extremely Angry" Twitter thread. Mike Hoye, Engineering Community Manager for Firefox at Mozilla, expressed frustration about getting access to the products of research. It turns out that many other people are angry about this too.

While there are certainly legitimate aspects to these complaints, we’d like to address a specific misperception from this Twitter thread: the claim that "CS researchers don't publish code or data." The data simply shows this is not true.

First of all, while the Repeatability in Computer Science study from a few years ago highlighted some issues with reproducibility in our field, it revealed that a significant fraction of researchers (226 out of 402) in systems conferences have code available either directly linked from the paper, or on request.

Additionally, in the last few years, conferences in Programming Languages and Software Engineering have been pushing for more standardization of code-sharing and repeatability of results through Artifact Evaluation Committees. There is a comprehensive summary of Artifact Evaluation in our field here. (In fact, Jean is co-chairing the POPL 2017 AEC with Stephen Chong.) According to the site, artifacts are evaluated according to the following criteria:
  • Consistent with the paper. Does the artifact substantiate and help to reproduce the claims in the paper?
  • Complete. What is the fraction of the results that can be reproduced?
  • Well documented. Does the artifact describe and demonstrate how to apply the presented method to a new input?
  • Easy to reuse. How easy is it to reuse the provided artifact? 
The most detailed documentation is associated with the AEC for OOPSLA 2013, where 50 papers were accepted, 18 artifacts passed evaluation, and 3 artifacts were rejected. For PLDI 2014, 20 of 50 papers submitted artifacts and 12 passed. By PLDI 2015, 27 papers (out of 52) had approved artifacts. Even POPL, the “theoretical” PL conference, had 21 papers with approved artifacts by 2016.
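
For readers who want the proportions behind these counts, here is a small Python sketch that turns the figures quoted in this post into percentages. The counts come from the text above; the `share` and `summarize` helpers and the labels in `COUNTS` are our own illustrative choices and are not part of any study's tooling.

```python
# Counts quoted in this post, stored as (part, whole) pairs.
# The labels and layout are illustrative assumptions, not an official dataset.
COUNTS = {
    "Repeatability study: code available": (226, 402),
    "OOPSLA 2013: evaluated artifacts that passed": (18, 18 + 3),
    "PLDI 2014: papers (of 50) that submitted artifacts": (20, 50),
    "PLDI 2014: submitted artifacts that passed": (12, 20),
    "PLDI 2015: papers with approved artifacts": (27, 52),
}


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


def summarize(counts: dict) -> dict:
    """Map each label to its percentage, rounded to one decimal place."""
    return {label: round(share(p, w), 1) for label, (p, w) in counts.items()}


if __name__ == "__main__":
    for label, pct in summarize(COUNTS).items():
        print(f"{label}: {pct}%")
```

Read this way, the Repeatability study figure comes out at a little over half of the surveyed researchers, and the PLDI share of papers with approved artifacts roughly doubles between 2014 and 2015.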

For those wondering why more artifacts are not passing yet, here is a transcribed discussion by Edward Yang from PLDI 2014. The biggest takeaways are that 1) many people care about getting the community to share reproducible and reusable code and 2) it takes time to figure out the best ways to share research code. (That academia’s job is not to produce shippable products, as Sam pointed out on Twitter, is the subject of a longer conversation.)

While it’s going to take time for us to develop practices and standards that encourage reproducibility and reusability, we’ve already seen some improvements. Over the years, Artifact Evaluation has become more standardized and committees have moved towards asking researchers to package code in VMs if possible to ensure long-term reproducibility. Here are the latest instructions for authors.

Yes, we can always do better to push towards making all of our papers and code available and reusable. Yes, researchers can do better in helping bridge the communication gap between academia and industry--and this is something we've both worked at. But the evidence shows that the academic community is certainly sharing our code--and that we’ve been doing a better job of it each year.

Note: It would be really cool if someone did a survey of individual researchers. As Sam pointed out on Twitter, many of our colleagues use GitHub or other social version control and push their code even before the papers come out.

--

UPDATE! Here is a survey for academics to report on how we share code. Please fill it out so we can see what the numbers are like! Thanks to Emery Berger, Professor at UMass Amherst, for conducting the survey.

Related update. Some conversations with others reminded me that the times I haven't shared my code, it has been because I was collaborating with companies and corporate IP policies prevented me from sharing. (In fact, this was one of the reasons I preferred to stay in academia.) The survey above asks about this. I'm curious how the numbers come out.

46 comments:

Tragic Cynic said...

To paraphrase what you've said:
> 50% of 'public research' have either:
> public examples of reproducible experiments (ie. code)
> or code on request. (good luck contacting the dead! :P)
>
> we need a committee to make sure experiments are repeatable
>
> people complain because our experiments (code) cannot be even attempted (won't run).
> it's hard to make reproducible experiments. But this is not that bad, because (direct quote)
> > "That academia’s job is not to produce shippable
> > products..." (Last sentence, paragraph 6)
> yes, we are bad at this. But we are improving year by year. (no hurry :<} )

This paraphrasing, of course, assumes that being able to run code is evidence for the thesis at hand.

When you guys:
Stop taking government checks
Make lectures in shorts & thongs, like Carmack (maybe chub&tuck as well?)
keep all your tax-payer(industry) funded findings public
By all that, I mean not doing this
Or DMCA claiming a student's homework for using your code libraries rather than dealing with the issue on-campus, asking for a library gitignore, etc...

Then,
and only then would I consider your expensive existence ( To those who fund you, through tax or fee ) a boon.

Anonymous said...

> This paraphrasing, of course, assumes that being able to run code is evidence for the thesis at hand.

Indeed, for a huge portion of CS research providing a runnable piece of code is not a priority.

> Make lectures in shorts & thongs, like Carmack (maybe chub&tuck as well?)

Is this a joke?

> keep all your tax-payer(industry) funded findings public

Believe me, most of the researchers support Open Access models.

> and only then would I consider your expensive existence ( To those who fund you, through tax or fee ) a boon.

So let me be clear, unless a researcher produces a runnable piece of code, you consider their research useless?

Unknown said...

First, don't confuse scientists with those suing Swartz — those were scientific publishers. I didn't know about MIT, but I don't support their decision. We'd like to get rid of scientific publishers, but there's lots of inertia, even though we don't get one penny out of what one might pay to read an article. Yet, it's happening.

Second, the device you're typing on wouldn't exist without research, and some of that research is too expensive for companies to fund — so it's either government-funded or won't happen. Even this functional programming thing that Carmack now loves talking about exists thanks to researchers (who've been at it for >50 years), not thanks to the companies who've ignored it.

Grigori Fursin said...

As an artifact evaluation chair from CGO and PPoPP (http://cTuning.org/ae), I can mention that the major problem is not about code and data sharing, but about sharing artifacts as reusable and customizable components. This would allow researchers not only to validate current claims, but also to try techniques in a different environment and with different parameters.

However, after surveying authors, we noticed that there is simply no supporting technology for that. This motivated us to develop an open-source framework (Collective Knowledge) to let researchers share their artifacts as Python components with JSON API and meta information, and even to crowdsource empirical experiments (we focus mainly on application performance analysis and optimization in computer systems' research):
* http://github.com/ctuning/ck

We also promote a new publication model where articles and artifacts are submitted to an open archive at the time of submission, while all reviews are public (i.e. it's a cooperative effort to validate and improve shared code, data and results rather than just shaming and rejecting problematic ones). We tried it at our ADAPT workshop this year, and it was very successful:
* http://adapt-workshop.org
* http://arxiv.org/abs/1406.4020

Hope it will help address some of the raised issues!

Vicky Steeves said...

So there's a big difference between reproducibility & making code and data available. While open access is a big ingredient in reproducibility, there's the problem of "dependency hell." This basically amounts to the crazy amount of dependencies required to successfully rerun and reuse someone else's code & data -- for instance, if you are on a different OS, if you have a different version of a software library, if you don't document what you are doing (publishing a bunch of CSVs and scripts does nothing to help people reproduce things), then your research won't be reproducible. While there's great OA in CS, there's not great reproducibility.
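
One small step against the dependency drift described here is to record the environment an experiment actually ran in and publish that record next to the code and data. Below is a minimal Python sketch, assuming a Python-based artifact; the function name `environment_manifest` and the manifest layout are hypothetical choices for illustration, and a manifest like this does not replace a VM or container image.

```python
import json
import platform
from importlib import metadata


def environment_manifest() -> dict:
    """Collect the OS, Python version, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        try:
            name = dist.metadata.get("Name")
            version = dist.version
        except Exception:
            # Skip distributions whose metadata cannot be read.
            continue
        if name:
            packages[name] = version
    return {
        "os": platform.platform(),
        "python": platform.python_version(),
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    # Print the manifest as JSON so it can be saved alongside the results.
    print(json.dumps(environment_manifest(), indent=2))
```

Someone rerunning the work can then compare their own manifest against the published one to see which OS or library versions differ.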
