{ "pdfpcFormat": 1, "duration": 60, "endSlide": 34, "disableMarkdown": true, "noteFontSize": 14, "pages": [ { "idx": 0, "label": "1", "overlay": 0, "note": "Reproducible software deployment in scientific computing\n\n" }, { "idx": 1, "label": "2", "overlay": 0, "note": "For most of my career I’ve been doing\nwhat would nowadays be called “DevOps”, an unholy mix of system\nadministration and software engineering for enterprises that don’t\nwant to pay enough money for either. These days I mostly write Lisp\ncode and do hazy things in the cloud.\n\nSince 2014 I work at the MDC in Berlin to develop and provide support\nfor reproducible scientific software and reproducible science. In the\nsame period I’ve been contributing to the GNU Guix project and\nvolunteered as a co-maintainer for a couple of years.\n\n" }, { "idx": 2, "label": "3", "overlay": 0, "note": "What’s in a title?\nLet’s work our way backwards: we use computers (laptops, workstations,\nHPC clusters, rented on-demand virtual machines known as the “cloud”)\nto do science. The scientific method involves testing hypotheses\nthrough repeated experiments. In scientific computing these\nexperiments largely happen by executing software that has been\ndeployed (or literally “rolled out”) on the computing substrate.\n\nThe last word — “reproducible” — is the most confusing, because it\nmeans different things to different people. For some definitions an\nexperiment is considered reproducible when the most important ideas\nare described so that someone else could perform a similar experiment.\nSome call software installation “reproducible” when installation can\nbe scripted. Some insist that an experiment is reproducible only when\nall experimental conditions are comprehensively described and the\nresult is invariably identical on every run.\n\nFor my purposes today, reproducibility is the ability to rerun an\nexperiment and produce results that are virtually indistinguishable\nfrom the published results.\n\n" }, { "idx": 3, "label": "4", "overlay": 0, "note": "Let’s take a step back and ask ourselves why reproducibility is desirable at all. There are two goals: + establish trust, and + facilitate further experimentation. Konrad Hinsen, a researcher at the French CNRS, has written extensively on the methodology of computational science. In a blog post entitled “Reproducibility does not imply reproduction” he explains the role of doubt: " }, { "idx": 4, "label": "5", "overlay": 0, "note": " [T]here is no point in repeating a computation identically. The results will be the same. So the only reason to re-run a computation is when there are doubts about the exact software or data that were used […]. The point of computational reproducibility is to dispel those doubts. The holy grail of computational reproducibility is not a world in which every computation is run five times, but a world in which a straightforward and cheap analysis of the published material verifies that it is reproducible, so that there is no need to run it again. " }, { "idx": 5, "label": "6", "overlay": 0, "note": " The second point is to facilitate experimentation. We need to have fine-grain control over the variables that might affect the outcome of our scientific experiments. We want to start with a known good state and make deliberate changes to one variable at a time, so that we can be sure that changes in the results are only due the variables we deliberately changed. 
So we are in need of a system with strong reproducibility guarantees and flexible high-level abstractions that enable us to do /interesting/ things to the software environment instead of being sentenced to low-level drudgery. " }, { "idx": 6, "label": "7", "overlay": 0, "note": " The first condition when repeating an experiment is to reproduce its environment. In computational science the experimental environment is largely defined by software, so we need to be able to easily reproduce the complete software environment. " }, { "idx": 7, "label": "8", "overlay": 0, "note": " Examples include: + a researcher resumes a project after a few months on the same computer + a researcher begins collaboration with a colleague and wants to set up the same software on the colleague’s computer + a researcher wants to run the computations on an HPC cluster and needs to deploy the software there + after publication independent researchers elsewhere want to continue the project + a decade later researchers want to revisit the experiment in light of new discoveries " }, { "idx": 8, "label": "9", "overlay": 0, "note": " All of these scenarios are fundamentally the same. It’s just bits on deterministic machines, right? How hard could it possibly be to recreate a software environment? " }, { "idx": 9, "label": "10", "overlay": 0, "note": " It turns out that the scale of the problem is often much larger than anticipated. What you see here are dozens of interconnected software packages. A package could be a software library or an application. Changes to any of these packages could be significant. We don’t know if all of them are relevant for the subset of the behaviors we care about, but we can’t easily dismiss them either. Only *one* of these nodes corresponds to the code of the application itself — if all you know is the application name and its version you are missing a very large number of other ingredients that potentially influence the behavior of the application. On the other hand, a user who receives a comprehensive book full of all library names and version strings and configuration options would still have no practical way of actually building an environment according to these specifications. " }, { "idx": 10, "label": "11", "overlay": 0, "note": " * What options do we have? In the 1980s and early 90s people built software manually. We would configure, make, and make install libraries and applications. In the early days this was fine because most applications would depend only on a very small number of large domain-specific libraries. In the late 1990s we had package managers such as APT on Debian or RPM, so system administrators could more easily install large amounts of software into a single global namespace for all users. Selected alternative tools would be provided on shared systems (such as HPC clusters or supercomputers) via environment modules: a set of environment variables pointing to alternative locations in the shared file system to override selected defaults. These systems are heavily dependent on system administrators who provide global defaults, install alternatives globally, and manually resolve combinatorial conflicts. The difficulty in avoiding these conflicts is one of the reasons why environment modules rarely represent the full diversity of user requirements. " }, { "idx": 11, "label": "12", "overlay": 0, "note": " * Example: Conda Enter Conda, an example of user-controlled management of software, independent of system administrators. 
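A typical session looks roughly like this (the environment name and package list are placeholders chosen just for this sketch):
conda create --name analysis python numpy
conda activate analysis
conda env export > environment.yml
The exported environment.yml is what collaborators usually receive when asked to set up the same environment.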
Conda is incredibly popular because it solves a common problem: it frees users on a shared system from having to petition system administrators to install software for them globally, enabling them to install applications and libraries into independent environments. It performs this task very well. " }, { "idx": 12, "label": "13", "overlay": 0, "note": " And yet Conda is known to have repeatedly failed to recreate environments. Conda lets users impose version constraints for their software environments. A SAT solver then finds an actual environment that satisfies these constraints. Since the result of the solver depends on the state of the Conda binary repositories, it can vary over time. An even more common problem: Conda binaries often contain references to system libraries such as symbols in the GNU C library, which may only be satisfied by *some* GNU+Linux distributions (say a recent Ubuntu) and not others (a dusty RHEL). Conda environments are incomplete and lossy. In recent years, steps have been taken to refine the constraints to allow for a longer shelf-life of exported environments, but it is hard to predict when old environment files will stop working again due to the lack of a rigorous method of capturing all ingredients. " }, { "idx": 13, "label": "14", "overlay": 0, "note": " * Containers and reproducibility Another popular attempt to bundle up software environments is to use container tools such as Docker. Docker and tools like it made the fiddly Linux kernel features of process isolation and file system virtualization easily accessible. Let's take a quick look at what the kernel does. " }, { "idx": 14, "label": "15", "overlay": 0, "note": " The kernel, Linux, presents a number of interfaces to the C library and low-level applications. Your code --- sometimes through many intermediaries --- talks to the kernel either via the C library or, less commonly, via direct system calls. These kernel interfaces together provide processes with the familiar Unix persona. When the kernel prepares to launch a process it creates a view of the slice of the system hardware that is relevant to the process: the process appears to have exclusive access to the CPU and memory, while the kernel actually just virtualizes these resources. " }, { "idx": 15, "label": "16", "overlay": 0, "note": " The kernel can *also* virtualize other resources that make up the Unix interface. It can present independent namespaces for different subsystems to a process. These namespaces include the process table, network devices, facilities for inter-process communication, the file system, user IDs, and virtual time. Give a process an empty mount namespace and it cannot see the host system's files any more. Give it a separate user namespace and it will think that it is the only user on the system, and also has root access. In the common case of Docker (or Singularity/Apptainer), people run a process in a separate mount namespace, so that it cannot access the root file system (and all the globally installed software) and instead uses a provided binary bundle as the root file system. " }, { "idx": 16, "label": "17", "overlay": 0, "note": " This makes it much easier to *install* an application, but there is no way to *recreate* the bundled root file system independently, exactly, or with deliberate fine-grained modifications. This fine-grained control is a prerequisite for the interactive, exploratory process of computational science. 
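(Aside, in case of questions: the namespace mechanics from the previous slide can be sketched with plain util-linux, assuming an unpacked ./rootfs directory that contains a shell:
unshare --mount --user --map-root-user chroot ./rootfs /bin/sh
The process now sees ./rootfs as its root file system and believes it is root; container runtimes essentially automate this step and add image management on top.)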
We don’t just want to clone an environment; we also want the option of making *specific* changes without anything else in the environment changing. Containers lack transparency. Looking at the binary image you cannot tell what ingredients really went into it. You have no guarantee that the binary you *received* really corresponds to the source code you *reviewed*. " }, { "idx": 17, "label": "18", "overlay": 0, "note": " When container images are built, they modify or extend existing third-party images by fetching network resources that are not guaranteed to be immutable as time passes. Dockerfiles are imperative and execute traditional package management commands or perform downloads and successively mutate the new root file system. We end up with the raw bits of a new root file system and sacrifice any higher-order abstractions. When building a container image from the same Dockerfile on one day and again a month later, it is not unusual to get two very different containers. The secret ingredient in even the most transparent container smoothie is the current state of the internet. " }, { "idx": 18, "label": "19", "overlay": 0, "note": " So far we have looked at software management approaches that are derived from the traditional practice of mutating shared storage. Even with modern containers we still work with a single blob of a shared root file system; we just ignore the existing system’s file system. Computing practices are not so different from biological evolution. Evolution is descent with modification. All modification is subject to the cumulative constraints of past modifications; this means that backtracking is often prohibitively expensive. Giraffes, for example, are stuck with their ridiculously long laryngeal nerve that takes a detour from the head down the neck, around the aortic arch, all the way back up to the head. Software deployment practices are likewise limited by the burden of decisions in the past that continue to influence the trajectory of our projects. What if we didn’t try to tack reproducibility onto legacy methods of software installation but instead built a system from reproducible primitives? " }, { "idx": 19, "label": "20", "overlay": 0, "note": " * The functional approach In 2006 Eelco Dolstra published his PhD thesis entitled “The Purely Functional Software Deployment Model”. The core idea is simple: treat the transformation from source code and tools to executable files as a pure function. The output of a function is fully determined by its inputs and nothing else. " }, { "idx": 20, "label": "21", "overlay": 0, "note": " * The functional approach In 2006 Eelco Dolstra published his PhD thesis entitled “The Purely Functional Software Deployment Model”. The core idea is simple: treat the transformation from source code and tools to executable files as a pure function. The output of a function is fully determined by its inputs and nothing else. " }, { "idx": 21, "label": "22", "overlay": 0, "note": " Inputs are source code files, any tools that run to translate the code to a binary, any libraries needed by these tools, any libraries that the software needs to link with, etc. " }, { "idx": 22, "label": "23", "overlay": 0, "note": " The output is a tree of files, some executable, some not." }, { "idx": 23, "label": "24", "overlay": 0, "note": " GNU Guix is one implementation of this functional idea. 
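Schematically the functional model reads: output directory = f(source code, compilers, libraries, build scripts, configuration). Change any input and the output changes. In command form this looks as follows (the hash prefix of the store path is elided here because it is derived from the exact set of inputs):
guix build hello
/gnu/store/…-hello-2.12.1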
Guix comes with a very large collection of about 28,000 package recipes that are each built in complete isolation (no internet, no root file system, no /bin, no /lib). This is enforced by a daemon that spawns jails where dedicated unprivileged user accounts build software. When compiling these packages, only declared inputs are available, nothing else. The resulting files are stored in a unique output directory that is derived from the set of all inputs. Any change to any of the inputs results in a new output directory. This simple property ensures that countless variants of applications and libraries can be installed on the same system without conflicts. Existing software doesn’t affect new software, and new software cannot affect existing software. A Guix package is unambiguously described by its complete dependency graph; this includes all libraries it needs, any tools that are used to create it, any source code, and any configurations. Building the same package twice on different machines will (in the absence of bugs) result in the exact same files. It doesn’t matter whether you are using Ubuntu or RHEL, or whether you are doing this in 2020 or 2024. Guix heavily caches builds and deduplicates identical files, so the overall space consumption is lower than one would expect. " }, { "idx": 24, "label": "25", "overlay": 0, "note": " This same simple idea is easily extended from individual packages to groups of packages in the same environment... " }, { "idx": 25, "label": "26", "overlay": 0, "note": " ...or to lightweight containers *without* the need to replace the root file system... " }, { "idx": 26, "label": "27", "overlay": 0, "note": " ...or even to full-blown GNU+Linux systems... " }, { "idx": 27, "label": "28", "overlay": 0, "note": " ...whether that be system containers, virtual machines, or bare-metal installations. Let us next look at the simplest features and work our way up. " }, { "idx": 28, "label": "29", "overlay": 0, "note": " As mentioned earlier, version numbers fail to describe software completely. Let me show you an example with the humble “hello” package. [...] All of these applications are “hello” version 2 point 12 point one, but some have patches, others use GCC 11, and yet others use different configuration flags. With Guix these are all distinct. " }, { "idx": 29, "label": "30", "overlay": 0, "note": " “guix build” is a low-level command. Users don’t need to care about all these /gnu/store locations. They would instead use Guix like a traditional package manager or like a shell. [demo] " }, { "idx": 30, "label": "31", "overlay": 0, "note": " Guix can be used declaratively. A manifest file declares what software the environment should contain, and Guix can instantiate an environment according to these specifications. Previously I said that recreating environments is just a necessary step to facilitate further experimentation. Guix has a number of built-in transformations to modify selected parts of the massive dependency graph. Libraries can be replaced with variants that are optimized for specific CPUs, recursively for any package that uses them. Or a patch can be applied to a selected package, keeping the rest of the stack unaltered. Beyond these transformation presets Guix is fully programmable in Scheme and packages can be rewritten with the flexibility of a modern general-purpose programming language. " }, { "idx": 31, "label": "32", "overlay": 0, "note": " All the information about all dependencies of any package available through Guix is part of Guix itself. 
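Concretely (package names chosen just for this sketch; channels.scm is a file we will generate in a moment):
guix time-machine -C channels.scm -- shell python python-numpy
This drops us into an environment built by the exact Guix revision recorded in channels.scm.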
So by changing the version of Guix we can move backwards and forwards in time to install software as it was available when that particular version of Guix was current. This means that for fully reproducible environments we only need two pieces of information: + the exact version of Guix we used at the time, and + the name(s) of the package(s). " }, { "idx": 32, "label": "33", "overlay": 0, "note": " We can let Guix describe itself in a way that it can understand. guix describe -f channels > channels.scm You can think of the output as a complete snapshot of all the software known to Guix (including all the relationships between libraries, configurations, and tools) at this time. " }, { "idx": 33, "label": "34", "overlay": 0, "note": " Given that Guix knows exactly what relationships there are between applications and their dependencies, it can also export all the bits in whatever format you want. “guix pack” is a way to generate application bundles --- in Docker format, as a Singularity image, or just as a plain tarball. This lets you share the bits with people who are not (yet) using Guix. But remember that the resulting blob is an OUTPUT. By dumping all the bits into a file system image we lose all the higher-level abstractions that enable us to perform controlled computational experiments. " } ] }