
[duration]
15
[end_user_slide]
16
[font_size]
12
[notes]
### 1
When designing scientific experiments, it is of utmost importance to constrain and control variables. The goal is to prevent unimportant circumstances from distorting analyses and introducing misleading statistical artefacts. In the ideal experiment all variables are tightly constrained.
### 2
In an effort to approximate this ideal, wetlab researchers carefully control the experiment environment, design their methods to avoid batch effects, and keep tabs on every ingredient of their experiments: how it was sourced or synthesized, when it was added, at what temperature it was stored, etc.
When done right, this allows findings to be shared with the scientific community in the knowledge that others will be able to repeat the experiments and confirm the conclusions.
### 3
In short: to repeat an experiment we first need to be able to reproduce (or recreate) its full environment.
### 4
This is not only true for wetlab experiments but also for experiments and analyses involving computers.
To repeat a computer-supported experiment, we need to first reproduce its software environment. This could be on the same computer a few months later, on an HPC cluster in the same institute, or even at a different site in the lab of someone building upon your work. The point is: we need to be able to capture all relevant state on one machine and be able to recreate it somewhere else.
### 5
How hard could this possibly be?
It turns out that the answer is: very.
### 6
Software is much more complex than we like to think.
A real-life genomics analysis pipeline can consist of hundreds of applications and libraries that affect its behaviour.
What you see here are dozens of interconnected software packages. A package could be a software library or an application. Changes to any of these packages could be significant. And yet, this is just a *tiny* fraction of the complexity of a real-life genomics pipeline.
For software authors, it is *not* feasible to record every version and configuration manually. Likewise, for users it would not be feasible to follow manual instructions for hundreds of applications and libraries.
At this point I expect some in this room to begin whispering to the person sitting next to them that we already *have* a solution to this problem.
### 7
A common approach is to shrink-wrap the environment and distribute it as a so-called “container”. You could also call it an “application bundle” or a “virtual disk” containing the software and its dependencies. While this makes it much easier to *install* an application, it does not help us to *recreate* the environment independently, exactly, or with deliberate fine-grained modifications.
We don't only want to recreate an environment; we may also want the option of making *specific* changes without anything else in the environment changing.
When container images are built, they modify or extend existing third-party images by fetching network resources that are not guaranteed to be fixed. Building a container from a Dockerfile on one day and again a month later will usually result in very different containers.
### 8
Containers are opaque. A container is much like a smoothie: you cannot see what ingredients went into it. You have no guarantee that the binary you *have* really corresponds to the source code you *want*.
Considering that we are increasingly processing personal data in the coming age of personalized medicine, we have a responsibility to pay close attention to what exactly our container smoothies are made of.
Of course, differences don't have to be malicious. In fact, when you build the same software on different machines, or on the same computer at different times, it is not uncommon to get two different binaries, and you can't easily tell why.
These problems of reproducibility, usability, and the usability of reproducibility were on our minds when our research group started building a collection of genomics pipelines called PiGx.
### 9
PiGx stands for “pipelines in genomics”.
The pipelines that are part of PiGx were designed to automate the exploratory analysis of common kinds of data sets. This includes RNAseq, ChIPseq, single-cell RNAseq, and bisulfite sequencing data sets. We are currently working on supporting even more kinds of sequencing data, such as ATACseq or nanopore data sets.
Under the hood, the pipelines connect battle-tested bioinformatics tools with the help of a workflow scheduler called Snakemake. Here's a simplified workflow diagram for one of the pipelines:
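(A generic illustration, not an actual PiGx rule: in Snakemake, each step is declared as a rule with inputs, outputs, and a command, and the scheduler derives the execution graph by matching outputs to inputs. The rule name and file paths below are made up.)

    rule decompress_reads:
        # Hypothetical example rule; PiGx rules wrap real bioinformatics tools.
        input:
            "raw/{sample}.fastq.gz"
        output:
            "work/{sample}.fastq"
        shell:
            "gzip -dc {input} > {output}"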
### 10
(explain the diagram)
The users don't need to know any of this. We wanted to empower our friends in the wetlab, who are not bioinformaticians, to see patterns in their own data even without the help of experienced bioinformaticians.
### 11
To this end, all pipelines provide a consistent and intuitive interface to users. The only inputs to the pipelines other than the raw data are a sample sheet describing the experimental design and a settings file to override defaults. The output is an interactive HTML report and session files to resume the analysis.
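(Illustration only: the exact option names differ per pipeline and version, and the file names here are made up. Running the RNAseq pipeline looks roughly like this, with the sample sheet and settings file as the only inputs besides the raw data:)

    pigx-rnaseq my-sample-sheet.csv -s my-settings.yaml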
### 12
We wanted the pipelines to be easy to install. We also wanted to guarantee that any two users installing the pipelines get bit-for-bit the same software, without having to impose tedious reproducibility protocols *and* without resorting to low-level application bundles.
For this important task we picked a tool called *Guix*.
### 13
Guix is a general purpose software package manager that is designed with reproducibility in mind. It is not a special bioinformatics software system — it just so happens that its design is exactly what we need in computational science.
Guix comes with a rich language that enables the user community to describe complex software environments comprehensively and recursively. Guix evaluates such a description by building each package in a clean, isolated environment that contains *only* the declared dependencies and nothing more, not even core system libraries.
It does so for the target package and for all of its dependencies recursively. With this mechanism there cannot be any ambiguity: a recursive package definition in Guix describes the software environment *comprehensively* as a graph with zero degrees of freedom. All software variables are constrained.
This means that when you use the same version of Guix to build a piece of software twice, on different computers, running different operating systems, at different points in time, you will get the same binary, bit for bit. If not, that's a bug. Guix provides source-to-binary transparency.
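(For reference, and assuming the package name “pigx”: Guix can rebuild an already-built package and compare the results, which is how such reproducibility bugs get caught:)

    guix build --check pigx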
### 14
We packaged PiGx and its dependencies for Guix, so that it can be installed reproducibly with just the single command you saw before.
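(As a reminder, assuming the package is named “pigx” in the Guix collection, installation is a one-liner along these lines:)

    guix package -i pigx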
But you don't *have* to use Guix to use PiGx. Guix can also export complete software environments reproducibly as Docker- or Singularity-flavoured bundles. If someone gives you one of these smoothies, you don't need to trust them: Guix makes it trivial to rebuild and verify them.
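(These bundles are produced with “guix pack”; again assuming the package name “pigx”:)

    guix pack -f docker pigx      # Docker image
    guix pack -f squashfs pigx    # Singularity-compatible image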
### 15
We built all PiGx pipelines and the more than 300 runtime dependencies repeatedly on very different machines to see to what degree Guix can guarantee that the generated binaries are identical. While doing this we found a handful of minor reproducibility bugs, but close to 98% of all packages were bit for bit the same.
That's very high, and it will get better as the community removes sources of non-determinism from packages.
### 16
Summary!
Questions:
What single-cell protocols are supported?
- single cell: UMI [unique molecular identifier]-based protocols (10x Genomics, Drop-seq, …)