Automatically pack your research to be run elsewhere!
ReproZip allows you to pack your research along with all necessary data files, libraries, environment variables and options.
Then anybody can reproduce the research on a different machine, without tracking down and installing the dependencies, or even having to run the same operating system!
How It Works
ReproZip works by tracing the system calls used by the experiment to automatically identify which files should be included. You can review and edit this list and the metadata before creating the final package file. Packages can be reproduced in different ways, including chroot environments, Vagrant-built virtual machines, and Docker containers; more can be added through plugins.
$ pip install reprozip
$ reprozip trace ./myexperiment -my --options inputs/somefile.csv other_file_here.bin
experiment: 0%... 25%... 50%... 75%... 100%
result: 42.137
Configuration file written in .reprozip/config.yml
Edit that file then run the packer -- use 'reprozip pack -h' for help
$ reprozip pack my_experiment.rpz
[REPROZIP] 17:26:42.588 INFO: Creating pack my_experiment.rpz...
[REPROZIP] 17:26:42.589 INFO: Adding files from package coreutils...
[REPROZIP] 17:26:42.601 INFO: Adding files from package libc6...
[REPROZIP] 17:26:42.906 INFO: Adding other files...
[REPROZIP] 17:26:43.450 INFO: Adding metadata...
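The idea of deriving a file list from traced system calls can be illustrated with a small sketch. This is not ReproZip's implementation: it is a toy that assumes the trace is already available as strace-style text lines, and the function name `files_touched`, the regular expression, and the sample trace are all hypothetical, made up for illustration. It keeps the path from each successful `open`, `openat`, `execve`, or `stat` call and skips calls that failed, since a file that could not be opened does not need to be packed.

```python
import re

# Hypothetical pattern for strace-style lines such as:
#   openat(AT_FDCWD, "/etc/hosts", O_RDONLY) = 3
# Only a few file-related system calls are recognized in this sketch.
SYSCALL_RE = re.compile(
    r'^(?P<name>openat|open|execve|stat)\('  # system call name
    r'(?:AT_FDCWD,\s*)?'                      # optional dirfd argument of openat
    r'"(?P<path>[^"]+)"'                      # first quoted argument: the path
    r'.*\)\s*=\s*(?P<ret>-?\d+)'              # return value after the closing paren
)


def files_touched(trace_lines):
    """Return the sorted paths used by successful file-related calls."""
    paths = set()
    for line in trace_lines:
        match = SYSCALL_RE.match(line.strip())
        if match is None:
            continue  # not one of the calls this sketch understands
        if int(match.group("ret")) < 0:
            continue  # the call failed, so the file was not actually used
        paths.add(match.group("path"))
    return sorted(paths)


# Made-up trace for illustration only; a real trace would be far longer.
SAMPLE_TRACE = [
    'execve("./myexperiment", ["./myexperiment", "-my", "--options"], 0x7ffd /* 20 vars */) = 0',
    'openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3',
    'openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3',
    'openat(AT_FDCWD, "/usr/lib/libmissing.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)',
    'openat(AT_FDCWD, "inputs/somefile.csv", O_RDONLY) = 4',
    'write(1, "result: 42.137\\n", 15) = 15',
]

if __name__ == "__main__":
    for path in files_touched(SAMPLE_TRACE):
        print(path)
```

In this sketch the failed lookup of `/usr/lib/libmissing.so` and the `write` call are both ignored, so only the executable, the two library files, and the input CSV end up in the list. A real tracer also has to follow child processes and resolve relative paths and symbolic links, which this example leaves out.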
Reproducibility is a core component of the scientific process: it helps researchers all around the world to verify results and to build on them, allowing science to move forward. In the natural sciences, a long tradition requires experiments to be described in enough detail that they can be reproduced by other researchers. The same standard, however, has not been widely applied to computational science, where researchers often have to rely on the plots, tables, and figures included in papers, which only loosely describe the obtained results.
The truth is that computational reproducibility can be very painful to achieve, for a number of reasons. Take the author-reviewer scenario of a scientific paper as an example. Authors must generate a compendium that encapsulates all the inputs needed to correctly reproduce their experiments: the data, a complete specification of the experiment and its steps, and information about the originating computational environment (OS, hardware architecture, and library dependencies). Keeping track of this information manually is rarely feasible: it is both time-consuming and error-prone. First, computational environments are complex, consisting of many layers of hardware and software, and the configuration of the OS is often hidden. Second, tracking library dependencies is challenging, especially for large experiments. If authors did not plan for reproducibility from the beginning of the project, it is drastically hampered.
For reviewers, even with a compendium in their hands, it may be hard to reproduce the results. There may be no instructions on how to execute the code and explore it further; the experiment may not run on their operating system; libraries may be missing; library versions may differ; and several issues may arise while trying to install all the required dependencies, a problem colloquially known as dependency hell.
ReproZip helps alleviate these problems by allowing the user to easily capture all the necessary components in a single, distributable package. The tool also makes it easier to reproduce an experiment by providing different unpacking methods and interfaces that avoid the need to install all the required dependencies and that make it possible to run the experiment with different inputs.
Various examples of ReproZip packages, including instructions on how to reproduce them, are available in the reprozip-examples GitHub repository.
The following video shows how to use ReproZip to make your experiment reproducible. The example used in the video is based on this blog post from B. Hanzra on digit recognition using OpenCV and scikit-learn, and is available for download here.
If you wish to cite ReproZip in a paper, please use the following:
ReproZip: Computational Reproducibility With Ease, F. Chirigati, R. Rampin, D. Shasha, and J. Freire. In Proceedings of the 2016 ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 2085-2088, 2016
Contact us at [email protected] for feedback, questions, concerns, and issues. Also, please use this mailing list to share your use cases with us, as well as to report on best practices and lessons learned for reproducibility!
ReproZip is currently being developed at NYU. The team includes:
Zhonheng Li (summer 2017)