A collaboration with Sam Tobin-Hochstadt, Assistant Professor at Indiana University.
There has been some buzz on social media about this "Extremely Angry" Twitter thread. Mike Hoye, Engineering Community Manager for Firefox at Mozilla, expressed frustration about getting access to the products of research. It turns out that many other people are angry about this too.
While there are certainly legitimate aspects to these complaints, we’d like to address a specific misperception from this Twitter thread: the claim that "CS researchers don't publish code or data." The data simply shows this is not true.
First of all, while the Repeatability in Computer Science study from a few years ago highlighted some issues with reproducibility in our field, it also found that more than half of the researchers it covered in systems conferences (226 out of 402) had code available, either linked directly from the paper or on request.
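To put the quoted figure in proportion, here is a minimal sketch of the arithmetic. The two counts are the ones cited above; the variable names are ours, chosen only for illustration.

```python
# Counts as quoted above from the Repeatability in Computer Science study.
with_code = 226  # code linked from the paper or available on request
total = 402      # total in the quoted figure

share = with_code / total
print(f"{share:.1%} had code available")
```

By this count, a little over half had code available in some form.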
Additionally, in the last few years, conferences in Programming Languages and Software Engineering have been pushing for more standardization of code-sharing and repeatability of results through Artifact Evaluation Committees. There is a comprehensive summary of Artifact Evaluation in our field here. (In fact, Jean is co-chairing the POPL 2017 AEC with Stephen Chong.) According to the site, artifacts are evaluated according to the following criteria:
- Consistent with the paper. Does the artifact substantiate and help to reproduce the claims in the paper?
- Complete. What is the fraction of the results that can be reproduced?
- Well documented. Does the artifact describe and demonstrate how to apply the presented method to a new input?
- Easy to reuse. How easy is it to reuse the provided artifact?
For those wondering why more artifacts are not passing yet, here is a transcribed discussion by Edward Yang from PLDI 2014. The biggest takeaways are that 1) many people care about getting the community to share reproducible and reusable code and 2) it takes time to figure out the best ways to share research code. (That academia’s job is not to produce shippable products, as Sam pointed out on Twitter, is the subject of a longer conversation.)
While it’s going to take time for us to develop practices and standards that encourage reproducibility and reusability, we’ve already seen some improvements. Over the years, Artifact Evaluation has become more standardized and committees have moved towards asking researchers to package code in VMs if possible to ensure long-term reproducibility. Here are the latest instructions for authors.
Yes, we can always do better to push towards making all of our papers and code available and reusable. Yes, researchers can do better in helping bridge the communication gap between academia and industry--and this is something we've both worked at. But the evidence shows that the academic community is certainly sharing our code--and that we’ve been doing a better job of it each year.
Note: It would be really cool if someone did a survey of individual researchers. As Sam pointed out on Twitter, many of our colleagues use GitHub or other social version control and push their code even before the papers come out.
--
UPDATE! Here is a survey for academics to report on how we share code. Please fill it out so we can see what the numbers are like! Thanks to Emery Berger, Professor at UMass Amherst, for conducting the survey.
Related update. Some conversations with others reminded me that the times I haven't shared my code, it has been because I was collaborating with companies and corporate IP policies prevented me from sharing. (In fact, this was one of the reasons I preferred to stay in academia.) The survey above asks about this. I'm curious how the numbers come out.
To paraphrase what you've said:
> 50% of 'public research' have either:
> public examples of reproducible experiments (ie. code)
> or code on request. (good luck contacting the dead! :P)
>
> we need a committee to make sure experiments are repeatable
>
> people complain because our experiments (code) cannot be even attempted (won't run).
> it's hard to make reproducible experiments. But this is not that bad, because (direct quote)
> > "That academia’s job is not to produce shippable
> > products..." (Last sentence, paragraph 6)
> yes, we are bad at this. But we are improving year by year. (no hurry :<} )
This paraphrasing, of course, assumes that being able to run code is evidence for the thesis at hand.
When you guys:
Stop taking government checks
Make lectures in shorts & thongs, like Carmack (maybe chub&tuck as well?)
keep all your tax-payer(industry) funded findings public
By all that, I mean not doing this
Or DMCA claiming a student's homework for using your code libraries rather than dealing with the issue on-campus, asking for a library gitignore, etc...
Then,
and only then would I consider your expensive existence ( To those who fund you, through tax or fee ) a boon.
> This paraphrasing, of course, assumes that being able to run code is evidence for the thesis at hand.
Indeed, for a huge portion of CS research providing a runnable piece of code is not a priority.
> Make lectures in shorts & thongs, like Carmack (maybe chub&tuck as well?)
Is this a joke?
> keep all your tax-payer(industry) funded findings public
Believe me, most of the researchers support Open Access models.
> and only then would I consider your expensive existence ( To those who fund you, through tax or fee ) a boon.
So let me be clear, unless a researcher produces a runnable piece of code, you consider their research useless?
First, don't confuse scientists with those suing Swartz — those were scientific publishers. I didn't know about MIT, but I don't support their decision. We'd like to get rid of scientific publishers, but there's lots of inertia, even though we don't get one penny out of what one might pay to read an article. Yet, it's happening.
Second, the device you're typing on wouldn't exist without research, and some of that research is too expensive for companies to fund — so it's either government-funded or won't happen. Even this functional programming thing that Carmack now loves talking about exists thanks to researchers (who've been at it for >50 years), not thanks to the companies who've ignored it.
As an artifact evaluation chair for CGO and PPoPP (http://cTuning.org/ae), I can say that the major problem is not code and data sharing as such, but sharing artifacts as reusable and customizable components. This would allow researchers not only to validate current claims, but also to try techniques in a different environment and with different parameters.
However, after surveying authors, we noticed that there is simply no supporting technology for that. This motivated us to develop an open-source framework (Collective Knowledge) that lets researchers share their artifacts as Python components with a JSON API and meta information, and even crowdsource empirical experiments (we focus mainly on application performance analysis and optimization in computer systems research):
* http://github.com/ctuning/ck
We also promote a new publication model in which articles and artifacts are submitted to an open archive at the time of submission, and all reviews are public (i.e. it's a cooperative effort to validate and improve shared code, data and results rather than just shaming and rejecting problematic ones). We tried it at our ADAPT workshop this year, and it was very successful:
* http://adapt-workshop.org
* http://arxiv.org/abs/1406.4020
Hope it will help address some of the raised issues!
So there's a big difference between reproducibility & making code and data available. While open access is a big ingredient in reproducibility, there's the problem of "dependency hell." This basically amounts to the crazy amount of dependencies required to successfully rerun and reuse someone else's code & data -- for instance, if you are on a different OS, if you have a different version of a software library, if you don't document what you are doing (publishing a bunch of CSVs and scripts does nothing to help people reproduce things), then your research won't be reproducible. While there's great OA in CS, there's not great reproducibility.
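One small, partial step against the undocumented-environment problem described above is to record the environment alongside the results. The following is only a sketch using the Python standard library: `environment_snapshot` is a hypothetical helper name, and it captures just the OS, the Python version, and installed Python packages, not system libraries or data.

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot():
    """Collect the OS, Python version, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            packages[name] = dist.version
    return {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    # Print as JSON so the snapshot can be saved next to the results.
    print(json.dumps(environment_snapshot(), indent=2))
```

Saving this output with the published CSVs and scripts does not make the work reproducible by itself, but it tells a reader which versions to compare against when something fails to run.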