maintenance/talks/2023-dfn/script.pdfpc

{
"pdfpcFormat": 1,
"duration": 60,
"endSlide": 34,
"disableMarkdown": true,
"noteFontSize": 14,
"pages": [
{
"idx": 0,
"label": "1",
"overlay": 0,
"note": "Reproducible software deployment in scientific computing\n\n"
},
{
"idx": 1,
"label": "2",
"overlay": 0,
"note": "For most of my career Ive been doing\nwhat would nowadays be called “DevOps”, an unholy mix of system\nadministration and software engineering for enterprises that dont\nwant to pay enough money for either. These days I mostly write Lisp\ncode and do hazy things in the cloud.\n\nSince 2014 I work at the MDC in Berlin to develop and provide support\nfor reproducible scientific software and reproducible science. In the\nsame period Ive been contributing to the GNU Guix project and\nvolunteered as a co-maintainer for a couple of years.\n\n"
},
{
"idx": 2,
"label": "3",
"overlay": 0,
"note": "Whats in a title?\nLets work our way backwards: we use computers (laptops, workstations,\nHPC clusters, rented on-demand virtual machines known as the “cloud”)\nto do science. The scientific method involves testing hypotheses\nthrough repeated experiments. In scientific computing these\nexperiments largely happen by executing software that has been\ndeployed (or literally “rolled out”) on the computing substrate.\n\nThe last word — “reproducible” — is the most confusing, because it\nmeans different things to different people. For some definitions an\nexperiment is considered reproducible when the most important ideas\nare described so that someone else could perform a similar experiment.\nSome call software installation “reproducible” when installation can\nbe scripted. Some insist that an experiment is reproducible only when\nall experimental conditions are comprehensively described and the\nresult is invariably identical on every run.\n\nFor my purposes today, reproducibility is the ability to rerun an\nexperiment and produce results that are virtually indistinguishable\nfrom the published results.\n\n"
},
{
"idx": 3,
"label": "4",
"overlay": 0,
"note": "Lets take a step back and ask ourselves why reproducibility is
desirable at all. There are two goals:
+ establish trust, and
+ facilitate further experimentation.
Konrad Hinsen, a researcher at the French CNRS, has written
extensively on the methodology of computational science. In a blog
post entitled “Reproducibility does not imply reproduction” he
explains the role of doubt:
"
},
{
"idx": 4,
"label": "5",
"overlay": 0,
"note": "
[T]here is no point in repeating a computation identically. The
results will be the same. So the only reason to re-run a computation
is when there are doubts about the exact software or data that were
used […].
The point of computational reproducibility is to dispel those
doubts. The holy grail of computational reproducibility is not a world
in which every computation is run five times, but a world in which a
straightforward and cheap analysis of the published material verifies
that it is reproducible, so that there is no need to run it again.
"
},
{
"idx": 5,
"label": "6",
"overlay": 0,
"note": "
The second point is to facilitate experimentation. We need to have
fine-grained control over the variables that might affect the outcome
of our scientific experiments. We want to start with a known good
state and make deliberate changes to one variable at a time, so that
we can be sure that changes in the results are due only to the
variables we deliberately changed.
So we need a system with strong reproducibility guarantees and
flexible high-level abstractions that enable us to do /interesting/
things to the software environment instead of being sentenced to
low-level drudgery.
"
},
{
"idx": 6,
"label": "7",
"overlay": 0,
"note": "
The first condition when repeating an experiment is to reproduce its
environment. In computational science the experimental environment is
largely defined by software, so we need to be able to easily reproduce
the complete software environment.
"
},
{
"idx": 7,
"label": "8",
"overlay": 0,
"note": "
Examples include:
+ a researcher resumes a project after a few months on the same
computer
+ a researcher begins collaborating with a colleague and wants to set
up the same software on the colleague's computer
+ a researcher wants to run the computations on an HPC cluster and
needs to deploy the software there
+ after publication independent researchers elsewhere want to continue
the project
+ a decade later researchers want to revisit the experiment in light
of new discoveries
"
},
{
"idx": 8,
"label": "9",
"overlay": 0,
"note": "
All of these scenarios are fundamentally the same. It's just bits on
deterministic machines, right? How hard could it possibly be to
recreate a software environment?
"
},
{
"idx": 9,
"label": "10",
"overlay": 0,
"note": "
Turns out that the scale of the problem is often much larger than
anticipated.
What you see here are dozens of interconnected software packages. A
package could be a software library or an application. Changes to any
of these packages could be significant. We don't know if all of them
are relevant for the subset of the behaviors we care about, but we
can't easily dismiss them either.
Only *one* of these nodes corresponds to the code of the application
itself — if all you know is the application name and its version you
are missing a very large number of other ingredients that potentially
influence the behavior of the application.
On the other hand, a user who receives a comprehensive book full of
all library names and version strings and configuration options would
still have no practical way of actually building an environment
according to these specifications.
"
},
{
"idx": 10,
"label": "11",
"overlay": 0,
"note": "
* What options do we have?
In the 1980s and early 90s people built software manually. We would
configure, make, and make install libraries and applications. In the
early days this was fine because most applications would depend only
on a very small number of large domain-specific libraries.
In the late 1990s we had package managers such as APT on Debian or
RPM, so system administrators could more easily install large amounts
of software into a single global namespace for all users. Selected
alternative tools would be provided on shared systems (such as HPC
clusters or supercomputers) via environment modules: a set of
environment variables pointing to alternative locations in the shared
file system to override selected defaults.
These systems are heavily dependent on system administrators who
provide global defaults, install alternatives globally, and manually
resolve combinatorial conflicts. The difficulty in avoiding these
conflicts is one of the reasons why environment modules rarely
represent the full diversity of user requirements.
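As a rough illustration (module names are site-specific and purely
hypothetical here), a user on such a cluster might type:
module avail
module load gcc/11.2.0 openmpi/4.1.4    # hypothetical module names
These commands merely adjust environment variables such as PATH and
LD_LIBRARY_PATH; they say nothing about how the software was built.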
"
},
{
"idx": 11,
"label": "12",
"overlay": 0,
"note": "
* Example: Conda
Enter Conda, an example of user-controlled management of software,
independent from system administrators. Conda is incredibly popular
because it solves a common problem: it frees users on a shared system
from having to petition system administrators to install software for
them globally, enabling them to install applications and libraries
into independent environments. It performs this task very well.
"
},
{
"idx": 12,
"label": "13",
"overlay": 0,
"note": "
And yet Conda is known to have repeatedly failed to recreate
environments. Conda lets users impose version constraints for their
software environments. A SAT solver then finds an actual environment
that satisfies these constraints. Since the result of the solver
depends on the state of the Conda binary repositories, it can vary
over time.
An even more common problem: Conda binaries often contain references
to system libraries, such as symbols in the GNU C library, which may
only be satisfied by *some* GNU+Linux distributions (say, a recent
Ubuntu) and not others (a dusty RHEL). Conda environments are
incomplete and lossy.
In recent years, steps have been taken to refine the constraints to
allow for a longer shelf-life of exported environments, but it is hard
to predict when old environment files will stop working again due to
the lack of a rigorous method of capturing all ingredients.
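For illustration, the usual export/recreate round trip looks roughly
like this:
conda env export > environment.yml
conda env create -f environment.yml
The exported file pins package versions and build strings, but it does
not capture the system libraries underneath them, which is why
recreation can still fail on a different distribution or at a later
date.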
"
},
{
"idx": 13,
"label": "14",
"overlay": 0,
"note": "
* Containers and reproducibility
Another popular attempt to bundle up software environments is to use
container tools such as Docker. Docker and tools like it made the
fiddly Linux kernel features of process isolation and file system
virtualization easily accessible.
Let's take a quick look at what the kernel does.
"
},
{
"idx": 14,
"label": "15",
"overlay": 0,
"note": "
The kernel, Linux, presents a number of interfaces to the C library
and low-level applications. Your code --- sometimes through many
intermediaries --- talks to the kernel either via the C library or,
less commonly, via direct system calls.
These kernel interfaces together provide processes with the familiar
Unix persona.
When the kernel prepares to launch a process it creates a view on the
slice of the system hardware that is relevant to the process: the
process appears to have exclusive access to the CPU and memory, while
the kernel actually just virtualizes these resources.
"
},
{
"idx": 15,
"label": "16",
"overlay": 0,
"note": "
The kernel can *also* virtualize other resources that make up the Unix
interface. It can present independent namespaces for different
subsystems to a process. These namespaces include the process table,
network devices, facilities for inter-process communication, the file
system, user ids, and virtual time.
Give a process an empty mount namespace and it cannot see the host
system's files any more. Give it a separate user namespace and it
will think that it is the only user on the system, and also has root
access.
In the common case of Docker (or Singularity/Apptainer), people run a
process in a separate mount namespace, so that it cannot access the
root file system (and all the globally installed software) and instead
uses a provided binary bundle as the root file system.
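You can try this yourself with util-linux, entirely without Docker (a
minimal sketch):
unshare --user --map-root-user --mount sh
Inside that shell, id reports uid 0, and any mounts you create are
invisible to the host; yet nothing about the software itself has
changed.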
"
},
{
"idx": 16,
"label": "17",
"overlay": 0,
"note": "
This makes it much easier to *install* an application, but there is no
way to *recreate* the bundled root file system independently, exactly,
or with deliberate fine-grained modifications. This fine-grained
control is a necessary requirement for the interactive, exploratory
process of computational science.
We don't just want to clone an environment; we also want the option
of making *specific* changes without having anything else in the
environment change.
Containers lack transparency. Looking at the binary image you cannot
tell what ingredients really went into it. You have no guarantee that
the binary you *received* really corresponds to the source code you
*reviewed*.
"
},
{
"idx": 17,
"label": "18",
"overlay": 0,
"note": "
When a container image is built, it typically modifies or extends an
existing third-party image by fetching network resources that are not
guaranteed to be immutable as time passes. Dockerfiles are imperative:
they run traditional package management commands or perform downloads,
successively mutating the new root file system. We end up with the
raw bits of a new root file system and sacrifice any higher-order
abstractions.
When building a container image from the same Dockerfile one day and
again a month later, it is not unusual to get two very different
containers.
The secret ingredient in even the most transparent container smoothie
is the current state of the internet.
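A minimal sketch of this pattern (base image and package names are
just examples):
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y python3 python3-numpy
Both the contents behind the base image tag and the apt archives
consulted by that RUN line change over time, so the resulting image is
a moving target.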
"
},
{
"idx": 18,
"label": "19",
"overlay": 0,
"note": "
So far we have looked at software management approaches that are
derived from the traditional approach of mutating shared storage.
Even with modern containers we still work with a single blob of a
shared root file system; we just ignore the existing system's file
system.
Computing practices are not so different from biological evolution.
Evolution is descent with modification. All modification is subject
to the cumulative constraints of past modifications; this means that
backtracking is often prohibitively expensive. Giraffes, for example,
are stuck with their ridiculously long laryngeal nerve that takes a
detour from the head down the neck, around the aortic arch, all the
way back up to the head.
Software deployment practices likewise are limited by the burden of
decisions in the past that continue to influence the trajectory of our
projects. What if we didn't try to tack reproducibility onto legacy
methods of software installation but instead built a system
from reproducible primitives?
"
},
{
"idx": 19,
"label": "20",
"overlay": 0,
"note": "
* The functional approach
In 2006 Eelco Dolstra published his PhD thesis entitled “The Purely
Functional Software Deployment Model”. The core idea is simple: treat
the transformation from source code and tools to executable files as a
pure function. The output of a function is fully determined by its
inputs and nothing else.
"
},
{
"idx": 20,
"label": "21",
"overlay": 0,
"note": "
* The functional approach
In 2006 Eelco Dolstra published his PhD thesis entitled “The Purely
Functional Software Deployment Model”. The core idea is simple: treat
the transformation from source code and tools to executable files as a
pure function. The output of a function is fully determined by its
inputs and nothing else.
"
},
{
"idx": 21,
"label": "22",
"overlay": 0,
"note": "
Inputs are source code files,
any tools that run to translate the code to a binary,
any libraries needed by these tools,
any libraries that the software needs to link with, etc.
"
},
{
"idx": 22,
"label": "23",
"overlay": 0,
"note": "
The output is a tree of files, some executable, some not."
},
{
"idx": 23,
"label": "24",
"overlay": 0,
"note": "
GNU Guix is one implementation of this functional idea.
Guix comes with a very large collection of about 28000 package
recipes that are each built in complete isolation (no internet, no
root file system, no /bin, no /lib). This is enforced by a daemon
that spawns jails where dedicated unprivileged user accounts build
software. When compiling these packages only declared inputs are
available, nothing else. The resulting files are stored in a unique
output directory that is derived from the set of all inputs. Any
change to any of the inputs results in a new output directory.
This simple property ensures that countless variants of applications
and libraries can be installed on the same system without conflicts.
Existing software doesn't affect new software, and new software cannot
affect existing software.
A Guix package is unambiguously described by its complete dependency
graph; this includes all libraries it needs, any tools that are used
to create it, any source code, and any configurations.
Building the same package twice on different machines will (in the
absence of bugs) result in the exact same files. It doesn't matter
whether you are using Ubuntu or RHEL, or whether you are doing
this in 2020 or 2024.
Guix heavily caches builds and deduplicates identical files, so the
overall space consumption is lower than one would expect.
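For illustration (the hash is a placeholder; the real one is derived
from the complete set of inputs):
guix build hello
/gnu/store/<hash>-hello-2.12.1
Change any input, even a build flag of a distant dependency, and the
hash, and with it the output directory, changes as well.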
"
},
{
"idx": 24,
"label": "25",
"overlay": 0,
"note": "
This same simple idea is easily extended from individual packages to
groups of packages in the same environment...
"
},
{
"idx": 25,
"label": "26",
"overlay": 0,
"note": "
...or to lightweight
containers *without* the need to replace the root file system...
"
},
{
"idx": 26,
"label": "27",
"overlay": 0,
"note": "
...or even to full blown GNU+Linux systems...
"
},
{
"idx": 27,
"label": "28",
"overlay": 0,
"note": "
...whether that be system
containers, virtual machines, or bare-metal installations.
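To give a flavor of that range (file and package names are
illustrative):
guix shell python python-numpy
guix shell --container python python-numpy
guix system vm my-system.scm    # my-system.scm: hypothetical system definition
guix system image my-system.scm
The same package definitions stand behind every one of these commands.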
Let us next look at the simplest features and work our way up.
"
},
{
"idx": 28,
"label": "29",
"overlay": 0,
"note": "
As mentioned earlier, version numbers fail to describe software
completely.
Let me show you an example with the humble “hello” package. [...]
All of these applications are “hello” version 2 point 12 point one, but
some have patches, others use GCC 11, and yet others use different
configuration flags. With Guix these are all distinct.
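Variants like these can be produced with package transformation
options, for example:
guix build hello
guix build hello --with-c-toolchain=hello=gcc-toolchain@11
guix build hello --with-patch=hello=./my-fix.patch    # my-fix.patch: hypothetical patch file
Each invocation yields its own /gnu/store item because the inputs
differ.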
"
},
{
"idx": 29,
"label": "30",
"overlay": 0,
"note": "
“guix build” is a low-level command. Users don't need to care about
all these /gnu/store locations. They would instead use Guix like a
traditional package manager or like a shell.
[demo]
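Illustrative commands of this kind (package names are just examples,
not necessarily the exact demo):
guix install hello
guix shell python python-numpy -- python3
The first installs into the user's default profile; the second drops
into a temporary environment containing just the requested packages
and runs python3 there.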
"
},
{
"idx": 30,
"label": "31",
"overlay": 0,
"note": "
Guix can be used declaratively. A manifest file declares what
software the environment should contain, and Guix can instantiate an
environment according to these specifications.
Previously I said that recreating environments is just a necessary
step to facilitate further experimentation. Guix has a number of
built-in transformations to modify selected parts of the massive
dependency graph. Libraries can be replaced with variants that are
optimized for specific CPUs, recursively for any package that uses
them. Or a patch can be applied to a selected package, keeping the
rest of the stack unaltered.
Beyond these transformation presets, Guix is fully programmable in
Scheme, and packages can be rewritten with the flexibility of a modern
general-purpose programming language.
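A rough sketch of that workflow, assuming a manifest.scm file that
lists package specifications (package names, CPU type, and patch file
are illustrative):
guix shell -m manifest.scm
guix shell -m manifest.scm --tune=skylake
guix shell -m manifest.scm --with-patch=python-numpy=./my-fix.patch    # hypothetical patch
Note that --tune only affects packages that are explicitly marked as
tunable.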
"
},
{
"idx": 31,
"label": "32",
"overlay": 0,
"note": "
All the information about all dependencies of any package
available through Guix is part of Guix itself. So by changing the
version of Guix we can move backwards and forwards in time to
install software as it was available when that particular version
of Guix was current.
This means that for fully reproducible environments we only need two
pieces of information:
+ the exact version of Guix we used at the time, and
+ the name(s) of the package(s)
"
},
{
"idx": 32,
"label": "33",
"overlay": 0,
"note": "
We can let Guix describe itself in a way that it can understand.
guix describe -f channels > channels.scm
You can think of the output as a complete snapshot of all the software
known to Guix (including all the relationships between libraries,
configurations, and tools) at this time.
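Later, anyone can replay that snapshot (the package names are just
examples):
guix time-machine -C channels.scm -- shell python python-numpy
This runs guix shell with the exact Guix revision recorded in
channels.scm.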
"
},
{
"idx": 33,
"label": "34",
"overlay": 0,
"note": "
Given that Guix knows exactly what relationships there are between
applications and their dependencies, it can also export all the bits
in whatever format you want. GUIX PACK is a way to generate
application bundles --- in Docker format, as a Singularity image, or
just as a plain tarball.
This lets you share the bits with people who are not (yet) using Guix.
But remember that the resulting blob is an OUTPUT. By dumping all the
bits into a file system image we lose all the higher-level
abstractions that enable us to perform controlled computational
experiments.
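A few illustrative invocations (the package name is just an example):
guix pack hello
guix pack -f docker hello
guix pack -f squashfs hello
The first produces a plain tarball, the second a Docker image, and the
third a SquashFS image suitable for Singularity/Apptainer.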
"
}
]
}