Building operating systems for science with Docker

Jonathan Karr

May 16, 2024

Over the past decade, Docker has become the standard for sharing individual scientific applications. For example, Docker images for over 10,000 applications are freely available from BioContainers, including images for running Slurm batch jobs, Nextflow workflows, and RShiny visualizations. Docker makes it easy both for developers to share their software and for users to run it. Docker is particularly helpful for circumventing the package hell of Python, the most popular programming language for science.

To interactively explore data and run simulations, scientists often need entire operating systems (OSes) with multiple applications. Out of the box, Docker is tailored for individual applications rather than OSes. For example, Docker containers typically run only a single service at a time, whereas scientists often run multiple programs simultaneously, such as JupyterLab and SSH. Virtual machine images, such as VirtualBox images, can capture OSes. However, they are harder to build and share.

Can we use Docker to manage OSes? Can we create OSes with Docker’s portability, reproducibility, and ease of use? Read on to learn the key principles. Spoiler alert – while possible, it takes technical knowledge and effort. To focus on your science, check out the Deep Origin platform.

The gap from Docker images for individual apps to OSes

To use Docker to manage OSes, first we need to understand what OSes require and how Docker images fall short.

Multiple scientific applications: Typically, Docker images are designed to provide a single application, such as a single RShiny data visualization. To install multiple applications and enable users to install yet more applications, package management systems, such as pip and conda, should be installed and then used to install scientific packages such as NumPy and SciPy.
Service management: By default, Docker containers can run a single service, such as JupyterLab. Consequently, Docker images typically lack service management tools, such as systemd, for running multiple services. For interactive computing, process managers should be added to Docker images.
Startup applications: Unlike Docker containers, OSes typically launch multiple programs when they start. For example, a Windows desktop could be configured to automatically launch Slack and Zoom. A scientist might want to automatically launch programs such as RStudio Server. To set up Docker images to automatically launch programs, a service needs to be defined for each startup program.
User accounts: Docker containers are usually run as root. To protect users against mistakes, users should interact with containers as non-root users.
Inline help: For efficient downloading and bootup, most Docker images do not contain manuals for programs. This help information isn’t needed for the typical Docker use case. For interactive data analysis and simulation, these manuals should be installed.
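
One way to see these gaps in a concrete image is to run a small audit inside a container. The sketch below is illustrative only: the function names (`has_any`, `status`, `audit_os_features`) and the specific commands it probes for (such as `pip`, `conda`, `s6-rc`, `systemctl`, and `man`) are assumptions chosen for this example, not a definitive checklist. It does not try to detect startup applications, which depend on how services are configured.

```shell
#!/bin/sh
# Report which OS-like capabilities are present in the current environment.
# Illustrative sketch: the probed command names are assumptions.

has_any() {
    # Succeeds if at least one of the named commands is on PATH.
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            return 0
        fi
    done
    return 1
}

status() {
    # Print "label: present" or "label: missing" depending on the given check.
    label="$1"
    shift
    if "$@"; then
        echo "$label: present"
    else
        echo "$label: missing"
    fi
}

is_non_root() {
    # Succeeds when the current user is not root (UID 0).
    [ "$(id -u)" -ne 0 ]
}

audit_os_features() {
    status "package manager" has_any pip pip3 conda mamba micromamba
    status "service manager" has_any s6-rc systemctl supervisord
    status "non-root user" is_non_root
    status "manual pages" has_any man
}

audit_os_features
```

Run in a minimal image, most of these lines would likely read "missing", which is the gap the steps below aim to close.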

How to build a Docker image for an OS

Now that you understand the typical capabilities of a Docker image, follow the steps below to expand your favorite image into an OS for your science.

Choose a base image which can be expanded into an OS. We recommend Ubuntu because it’s the most popular desktop Linux operating system among scientists. Begin the Dockerfile for your OS with the statement below.
```
FROM ubuntu:jammy
```

To protect against mistakes, set the password for the root user and add a non-root user by adding the following to your Dockerfile.

```
# Replace the placeholder with your own root password. Build arguments are
# recorded in the image history, so do not reuse a sensitive password here.
ARG ROOT_PASSWORD=*******
RUN echo "root:${ROOT_PASSWORD}" | chpasswd \
    \
    && groupadd user \
    && useradd \
        --system \
        --create-home \
        --home-dir /home/user \
        --shell /bin/bash \
        --gid user \
        --groups user \
        --uid 1000 \
        user \
    && echo "export PATH=\$HOME/.local/bin:\$PATH" >> "/home/user/.bashrc"
```
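
Rather than hard-coding a password in the Dockerfile, the value of `ROOT_PASSWORD` can be supplied at build time. The following is a minimal sketch, assuming the Dockerfile declares `ARG ROOT_PASSWORD` as above. The helper names `random_password` and `build_image` and the image tag `science-os` are hypothetical. Build arguments remain visible in the image history, so this is a convenience rather than secure secret handling.

```shell
#!/bin/sh
# Generate a random root password and pass it to the build as a build argument.
# Sketch only: the helper names and the image tag "science-os" are hypothetical.

random_password() {
    # Keep the alphanumeric characters from 512 random bytes and take the first 24.
    head -c 512 /dev/urandom | LC_ALL=C tr -dc 'A-Za-z0-9' | cut -c1-24
}

build_image() {
    # $1 = image tag, $2 = root password
    docker build --build-arg "ROOT_PASSWORD=$2" --tag "$1" .
}

password="$(random_password)"
echo "Generated a ${#password}-character root password"

# Example call (requires Docker and a Dockerfile in the current directory):
# build_image science-os "$password"
```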

Install the scientific applications that you need and make it easy to install more applications as necessary. For Python, we recommend installing pip, along with mamba and micromamba (efficient implementations of conda), as well as your favorite Python packages such as Matplotlib, NumPy, and SciPy. Add the following statements to your Dockerfile.

```
# curl and ca-certificates are needed to download micromamba over HTTPS
RUN apt update --yes \
    && apt install --yes --no-install-recommends \
        bzip2 \
        ca-certificates \
        curl

USER user
# Download to /tmp because the non-root user cannot write to the default
# working directory (/). Tools are called by absolute path because .bashrc
# is not read during the build.
RUN curl https://api.anaconda.org/download/conda-forge/micromamba/1.5.1/linux-64/micromamba-1.5.1-0.tar.bz2 \
        --location \
        --show-error \
        --silent \
        --output '/tmp/micromamba-1.5.1-0.tar.bz2' \
    && mkdir --parents '/home/user/.local' \
    && tar \
        --file='/tmp/micromamba-1.5.1-0.tar.bz2' \
        --bzip2 \
        --extract \
        --directory='/home/user/.local' \
        bin/micromamba \
    && rm '/tmp/micromamba-1.5.1-0.tar.bz2' \
    && mkdir --parents '/home/user/.apps/conda' \
    && /home/user/.local/bin/micromamba install \
        --root-prefix '/home/user/.apps/conda' \
        --name base \
        --channel conda-forge \
        --yes \
        python \
        mamba \
        pip \
    && /home/user/.apps/conda/bin/mamba init bash --user \
    && mkdir --parents '/home/user/.apps/conda/etc/conda/activate.d' \
    && echo '#!/bin/sh' > "/home/user/.apps/conda/etc/conda/activate.d/env_vars.sh" \
    && sed --in-place "s|'/home/user/.apps/conda/bin/conda'|\"\$HOME/.apps/conda/bin/conda\"|g" "/home/user/.bashrc" \
    && sed --in-place "s|\"/home/user/|\"\$HOME/|g" "/home/user/.bashrc" \
    && /home/user/.apps/conda/bin/mamba config --set pip_interop_enabled True \
    \
    && /home/user/.local/bin/micromamba install \
        --root-prefix '/home/user/.apps/conda' \
        --name base \
        --channel conda-forge \
        --yes \
        jupyterlab \
        matplotlib \
        nodejs \
        numpy \
        pandas \
        scipy \
        seaborn
USER root
```

To enable your Docker containers to manage multiple services, such as JupyterLab and RStudio Server, install a service manager and configure it to launch when you run your image. We recommend using s6-overlay. Add the following statements to your Dockerfile.

```
RUN apt update --yes \
    && apt install --yes --no-install-recommends \
        ca-certificates \
        curl \
        xz-utils \
    \
    && curl https://github.com/just-containers/s6-overlay/releases/download/v3.1.5.0/s6-overlay-noarch.tar.xz \
        --location \
        --show-error \
        --silent \
        --output s6-overlay-noarch-3.1.5.0.tar.xz \
    && tar \
        --directory=/ \
        --xz \
        --extract \
        --preserve-permissions \
        --file=s6-overlay-noarch-3.1.5.0.tar.xz \
    \
    && curl https://github.com/just-containers/s6-overlay/releases/download/v3.1.5.0/s6-overlay-x86_64.tar.xz \
        --location \
        --show-error \
        --silent \
        --output s6-overlay-x86_64-3.1.5.0.tar.xz \
    && tar \
        --directory=/ \
        --xz \
        --extract \
        --preserve-permissions \
        --file=s6-overlay-x86_64-3.1.5.0.tar.xz \
    \
    && curl https://github.com/just-containers/s6-overlay/releases/download/v3.1.5.0/s6-overlay-symlinks-noarch.tar.xz \
        --location \
        --show-error \
        --silent \
        --output s6-overlay-symlinks-noarch-3.1.5.0.tar.xz \
    && tar \
        --directory=/ \
        --xz \
        --extract \
        --preserve-permissions \
        --file=s6-overlay-symlinks-noarch-3.1.5.0.tar.xz \
    \
    && curl https://github.com/just-containers/s6-overlay/releases/download/v3.1.5.0/s6-overlay-symlinks-arch.tar.xz \
        --location \
        --show-error \
        --silent \
        --output s6-overlay-symlinks-arch-3.1.5.0.tar.xz \
    && tar \
        --directory=/ \
        --xz \
        --extract \
        --preserve-permissions \
        --file=s6-overlay-symlinks-arch-3.1.5.0.tar.xz
# 0 disables the timeout while s6 waits for services to start
ENV S6_CMD_WAIT_FOR_SERVICES_MAXTIME=0
ENTRYPOINT ["/init"]
```
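
The statements above hard-code the `x86_64` build of s6-overlay. As a hedged sketch of how the download URLs could be derived for another architecture, the helper below builds the four URLs from a version and an architecture name. The function `s6_overlay_urls` and the two variables are hypothetical names introduced for this example, and it assumes the release publishes an asset named after the output of `uname -m` (which holds for `x86_64` and `aarch64`; check the release page for other architectures).

```shell
#!/bin/sh
# Build the s6-overlay download URLs for a given machine architecture.
# Sketch only: assumes the release has an asset named after `uname -m`.

S6_OVERLAY_VERSION="3.1.5.0"
S6_OVERLAY_BASE_URL="https://github.com/just-containers/s6-overlay/releases/download"

s6_overlay_urls() {
    # $1 = architecture (defaults to the current machine's architecture)
    arch="${1:-$(uname -m)}"
    for asset in noarch "$arch" symlinks-noarch symlinks-arch; do
        echo "${S6_OVERLAY_BASE_URL}/v${S6_OVERLAY_VERSION}/s6-overlay-${asset}.tar.xz"
    done
}

# Print the URLs used in the Dockerfile statements above.
s6_overlay_urls x86_64
```

Each printed URL could then be passed to the same `curl` and `tar` commands used in the Dockerfile.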

To configure JupyterLab to automatically run when your image is run, create an s6 service for JupyterLab.

Save the following script to jupyterlab-run within the context for your Dockerfile, and make it executable (for example, with chmod +x jupyterlab-run) so that s6 can run it. This script instructs s6 how to run JupyterLab as the non-root user we created earlier.

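```
#!/command/exec bash
# Run JupyterLab as the non-root user in a clean login environment.
# --ip=0.0.0.0 makes JupyterLab listen on all interfaces so that a port
# published by Docker can reach it; by default it listens only on localhost.
cd /home/user
exec \
    runuser \
        --user user \
        --pty \
        -- \
        env \
            -i \
            HOME="/home/user" \
            bash \
                -i \
                -l \
                -c "jupyter lab --ip=0.0.0.0 --port=8080"
```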

Append the following statements to your Dockerfile. These statements tell s6 that JupyterLab should be a long-running application (an application which is launched at startup and automatically restarted if it fails), tell s6 to launch JupyterLab after s6 has launched itself, and copy the script above, which tells s6 how to run JupyterLab.

```
RUN mkdir --parents /etc/s6-overlay/s6-rc.d/jupyterlab \
    && echo "longrun" > /etc/s6-overlay/s6-rc.d/jupyterlab/type \
    && mkdir --parents /etc/s6-overlay/s6-rc.d/jupyterlab/dependencies.d \
    && touch /etc/s6-overlay/s6-rc.d/jupyterlab/dependencies.d/base \
    && mkdir --parents /etc/s6-overlay/s6-rc.d/user/contents.d \
    && touch /etc/s6-overlay/s6-rc.d/user/contents.d/jupyterlab
COPY jupyterlab-run /etc/s6-overlay/s6-rc.d/jupyterlab/run
```
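
The same layout applies to any other startup program, such as RStudio Server. The sketch below wraps the steps above in a reusable function and tries it in a scratch directory so that the resulting files can be inspected without touching `/`. The helper `make_s6_service` and the `demo` service are hypothetical names introduced for this example; they are not part of s6-overlay.

```shell
#!/bin/sh
# Scaffold an s6-rc longrun service definition under a given root directory.
# Sketch only: make_s6_service is a hypothetical helper, not part of s6-overlay.

make_s6_service() {
    # $1 = root directory (e.g. / inside the image)
    # $2 = service name
    # $3 = path to the run script
    rc_dir="$1/etc/s6-overlay/s6-rc.d"
    service_dir="$rc_dir/$2"

    mkdir -p "$service_dir/dependencies.d" "$rc_dir/user/contents.d"
    echo "longrun" > "$service_dir/type"        # launched at startup, restarted on failure
    touch "$service_dir/dependencies.d/base"    # start after s6 has launched itself
    touch "$rc_dir/user/contents.d/$2"          # include the service in the "user" bundle
    cp "$3" "$service_dir/run"
    chmod +x "$service_dir/run"                 # the run file must be executable
}

# Try it in a scratch directory rather than in /.
scratch="$(mktemp -d)"
printf '#!/bin/sh\nexec sleep infinity\n' > "$scratch/demo-run"
make_s6_service "$scratch" demo "$scratch/demo-run"

# List the files that were created.
find "$scratch/etc" -type f | sort
```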

To make it easy to get help for the applications in your OS, install the manual for each application by adding the following to your Dockerfile.
```
RUN apt update --yes && (yes ||:) | unminimize
```

How to reliably build Docker images for users

To reliably build Docker images, we further recommend building images modularly and reproducibly, and rigorously testing your images. Below are three of our favorite tools. Same spoiler alert – they take effort to learn and use. To remain focused on your science, check out our platform.

dockerfile-x: Tool for defining Dockerfiles modularly.
repro-sources-list.sh: Tool for configuring package managers to build images reproducibly.
Container Structure Tests: Framework for validating Docker images.

Deep Origin: The computational platform for life scientists

Tired yet? Shoehorning DevOps tools into science isn’t easy. These tools weren’t made for biologists! They were made for software engineers to deploy software at scale, such as for e-commerce shops.

As investigators pursue increasingly complex studies involving more, and higher-dimensional, experimental modalities that generate larger and more heterogeneous data, engineering, operations, and infrastructure are becoming increasingly vital. How can scientists prepare for this future? One solution is for scientists to cross-train as DevOps engineers. This is impractical. Science education is already too onerous. A second solution is for each lab to build a DevOps team. This is expensive. For example, most academic projects cannot afford a DevOps engineer.

Just as software engineers need DevOps tools like Docker to run applications at scale, scientists need SciOps tools to run experiments at scale. Similarly, just as DevOps tools enable software engineers to focus on software development, SciOps tools should enable scientists to focus on designing and analyzing experiments.

To remain focused on your science, check out our R&D platform. With our platform, you can start analyzing your data in minutes. Simply choose one of our scientific OSes, or software blueprints, and use it to configure a cloud workstation. We have OSes for a variety of topics, from single-cell genomics to molecular simulation. You can also contact us about developing a custom OS for you.

Conclusion

Modern scientists need suites of applications, or OSes, to interactively analyze data and run simulations within increasingly large and complex studies. Docker is a powerful DevOps tool for running software applications. However, significant technical knowledge and effort are required to build Docker images for scientific OSes.

Just as DevOps tools help software engineers run software, scientists need SciOps tools to run experiments. We founded Deep Origin to solve these problems for the life science community. To remain focused on your science, check out our platform or contact us.

