This post briefly describes a useful trick for executing binaries across different mount-namespaces. It combines setns(2) and fexecve(3).

fxe-rs

A demo repository illustrating this trick is available at fxe-rs. It contains just fxe: a small, pure-Rust program that can execute Linux binaries across mount-namespaces.

As an example, let's take a busybox container and pretend we want to run modinfo crc16 to inspect the details of that specific kernel module. This typically requires asking the container runtime to bind-mount all relevant directories beforehand:

$ docker run -v /lib/modules:/lib/modules busybox modinfo crc16

filename:       /lib/modules/4.11.0-1-amd64/kernel/lib/crc16.ko
description:    CRC16 calculations
license:        GPL
depends:        
intree:         Y
vermagic:       4.11.0-1-amd64 SMP mod_unload modversions

This is usually the simplest way to proceed and it works well in most cases. In some other cases, however, it may introduce additional issues: the source path may not be known beforehand (e.g. parts of it may be dynamic), it may sometimes not exist, it may contain symlinks pointing outside of it, or an intermediate bind-mount may itself be harmful (e.g. when performing new mounts, due to different subtree and propagation settings).

In such cases, it may be useful to trade the bind-mounts for a shared PID-namespace, so that the containerized process can escape its mount-namespace and execute other binaries directly in the target. Here, we use the fxe example binary to escape to the host mount-namespace and execute modinfo there, without any additional bind-mount:

$ docker run --privileged --pid=host quay.io/lucab/fxe:latest /fxe /proc/1/ns/mnt

filename:       /lib/modules/4.11.0-1-amd64/kernel/lib/crc16.ko
description:    CRC16 calculations
license:        GPL
depends:        
intree:         Y
vermagic:       4.11.0-1-amd64 SMP mod_unload modversions

This is a simple (and mostly pointless) example, meant only to show the trick applied in a running scenario. Better use cases typically involve mounts and host manipulation, see references below.
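To see whether a process actually sits in a different mount-namespace than PID 1, the namespace links under procfs can be compared. The following is a minimal sketch, not part of fxe: the helper names `parse_ns_inode` and `mount_ns_id` are hypothetical, and it assumes procfs is mounted at /proc and that the caller is allowed to read /proc/1/ns/mnt (which usually requires privileges and a shared PID-namespace).

```rust
use std::fs;
use std::io;

/// Extracts the inode number from a namespace link such as `mnt:[4026531840]`.
/// (Hypothetical helper, for illustration only.)
fn parse_ns_inode(link: &str) -> Option<u64> {
    let rest = link.strip_prefix("mnt:[")?;
    let digits = rest.strip_suffix(']')?;
    digits.parse().ok()
}

/// Reads the mount-namespace identifier of a procfs entry like "self" or "1".
fn mount_ns_id(pid: &str) -> io::Result<Option<u64>> {
    let target = fs::read_link(format!("/proc/{}/ns/mnt", pid))?;
    Ok(target.to_str().and_then(parse_ns_inode))
}

fn main() {
    let own = mount_ns_id("self");
    let init = mount_ns_id("1");
    match (&own, &init) {
        (Ok(Some(a)), Ok(Some(b))) => {
            println!("same mount-namespace as PID 1: {}", a == b)
        }
        _ => eprintln!("could not compare: {:?} / {:?}", own, init),
    }
}
```

Two processes share a mount-namespace exactly when these identifiers match, so the check reports `false` inside a regular container started with `--pid=host`.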

Technical details

The core of it is the following:

use std::ffi::CString;
use std::fs;
use std::os::unix::io::IntoRawFd;

// Get a FD to the `busybox` binary, while it is still reachable by path.
let exe_fd = fs::File::open("/bin/busybox")?.into_raw_fd();

// Move to PID-1 mount-namespace.
let ns_fd = fs::File::open("/proc/1/ns/mnt")?.into_raw_fd();
nix::sched::setns(ns_fd, nix::sched::CLONE_NEWNS)?;
nix::unistd::close(ns_fd)?;

// Execute the binary in the target namespace, by FD instead of by path.
let args: Vec<CString> = vec![/* "modinfo", "crc16" */];
let env: Vec<CString> = vec![/* ... */];
nix::unistd::fexecve(exe_fd, args.as_slice(), env.as_slice())?;

As the comments say, this can be roughly split into three steps:

1. The underlying executable (busybox) must be opened, in order to obtain a File Descriptor (FD) for fexecve(3).
2. The running process needs to move itself to the target mount-namespace (PID-1 as seen from procfs) via setns(2).
3. Once running in a different namespace, the FD obtained in the first step can be used to execute the binary.

The interesting part here is right before the last step. At that point, the process is running in a different namespace where the original binary is no longer available (e.g. in the case of two unrelated containers), so it would not be able to execute it by path. However, it can still reference it through the file descriptor obtained in the first step, and thus execute it via fexecve(3).
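For readers who want to try the three steps without any external crate, here is a minimal, std-only sketch. It is not the code from the repository: the libc bindings are declared by hand, the helper names (`to_cstrings`, `to_ptr_array`, `exec_in_mount_ns`) are hypothetical, the environment is left empty, and it assumes a Linux target whose libc exports setns and fexecve. On the Rust 2024 edition the binding block has to be spelled `unsafe extern "C"`.

```rust
use std::ffi::CString;
use std::fs::File;
use std::io;
use std::os::raw::{c_char, c_int};
use std::os::unix::io::AsRawFd;
use std::ptr;

// Hand-written libc bindings, so that the sketch needs no external crate.
extern "C" {
    fn setns(fd: c_int, nstype: c_int) -> c_int;
    fn fexecve(fd: c_int, argv: *const *const c_char, envp: *const *const c_char) -> c_int;
}

/// Value of CLONE_NEWNS from <sched.h> on Linux.
const CLONE_NEWNS: c_int = 0x0002_0000;

/// Converts string arguments into owned, NUL-terminated C strings.
fn to_cstrings(items: &[&str]) -> io::Result<Vec<CString>> {
    items
        .iter()
        .map(|s| CString::new(*s).map_err(|e| io::Error::new(io::ErrorKind::InvalidInput, e)))
        .collect()
}

/// Builds the NULL-terminated pointer array expected by exec-style calls.
/// The pointers borrow from `items`, which must outlive the returned vector.
fn to_ptr_array(items: &[CString]) -> Vec<*const c_char> {
    items
        .iter()
        .map(|s| s.as_ptr())
        .chain(std::iter::once(ptr::null()))
        .collect()
}

/// Opens `exe`, joins the mount-namespace at `ns_path`, then executes by FD.
/// On success the process image is replaced, so this only returns on failure.
fn exec_in_mount_ns(exe: &str, ns_path: &str, args: &[&str]) -> io::Result<()> {
    // Step 1: grab the FD while the binary is still reachable by path.
    let exe_file = File::open(exe)?;
    let ns_file = File::open(ns_path)?;
    let argv = to_cstrings(args)?;
    let envp: Vec<CString> = Vec::new(); // empty environment, for brevity

    // Step 2: move to the target mount-namespace.
    if unsafe { setns(ns_file.as_raw_fd(), CLONE_NEWNS) } != 0 {
        return Err(io::Error::last_os_error());
    }
    drop(ns_file);

    // Step 3: execute through the FD obtained in step 1.
    let argv_ptrs = to_ptr_array(&argv);
    let envp_ptrs = to_ptr_array(&envp);
    unsafe { fexecve(exe_file.as_raw_fd(), argv_ptrs.as_ptr(), envp_ptrs.as_ptr()) };
    Err(io::Error::last_os_error())
}

fn main() {
    let args: Vec<String> = std::env::args().skip(1).collect();
    if args.len() < 3 {
        eprintln!("usage: <executable> <ns-path> <argv0> [args...]");
        return;
    }
    let argv: Vec<&str> = args[2..].iter().map(String::as_str).collect();
    if let Err(err) = exec_in_mount_ns(&args[0], &args[1], &argv) {
        eprintln!("failed: {}", err);
        std::process::exit(1);
    }
}
```

Rust opens files with the close-on-exec flag set. That is fine for an ELF binary such as busybox, but fexecve(3) documents that executing a script through a close-on-exec FD fails.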

As a note for advanced readers, the same technique can be implemented with a direct usage of execveat(2).
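As a rough illustration of that variant, the sketch below invokes execveat(2) with an empty path and AT_EMPTY_PATH, so that the kernel executes the file referred to by the FD itself. It is an assumption-laden example rather than what fxe does: it goes through the generic libc `syscall` entry point, hard-codes the syscall number for x86_64 and aarch64 only, and the name `execveat_fd` is hypothetical. On the Rust 2024 edition the binding block has to be spelled `unsafe extern "C"`.

```rust
use std::ffi::CString;
use std::fs::File;
use std::io;
use std::os::raw::{c_char, c_int, c_long};
use std::os::unix::io::AsRawFd;
use std::ptr;

extern "C" {
    // Variadic libc entry point for raw system calls.
    fn syscall(number: c_long, ...) -> c_long;
}

/// Value of AT_EMPTY_PATH from <fcntl.h> on Linux.
const AT_EMPTY_PATH: c_int = 0x1000;

/// Syscall number of execveat(2); this is architecture-specific.
#[cfg(target_arch = "x86_64")]
const SYS_EXECVEAT: c_long = 322;
#[cfg(target_arch = "aarch64")]
const SYS_EXECVEAT: c_long = 281;

/// Builds the NULL-terminated pointer array expected by exec-style calls.
fn to_ptr_array(items: &[CString]) -> Vec<*const c_char> {
    items
        .iter()
        .map(|s| s.as_ptr())
        .chain(std::iter::once(ptr::null()))
        .collect()
}

/// Executes the file behind `fd`. On success the process image is replaced,
/// so the function only returns when the call failed.
fn execveat_fd(fd: c_int, argv: &[CString], envp: &[CString]) -> io::Error {
    let empty_path = CString::new("").expect("literal has no interior NUL");
    let argv_ptrs = to_ptr_array(argv);
    let envp_ptrs = to_ptr_array(envp);
    unsafe {
        syscall(
            SYS_EXECVEAT,
            fd as c_long,
            empty_path.as_ptr(),
            argv_ptrs.as_ptr(),
            envp_ptrs.as_ptr(),
            AT_EMPTY_PATH as c_long,
        );
    }
    io::Error::last_os_error()
}

fn main() {
    let path = match std::env::args().nth(1) {
        Some(p) => p,
        None => {
            eprintln!("usage: <executable>");
            return;
        }
    };
    let file = match File::open(&path) {
        Ok(f) => f,
        Err(err) => {
            eprintln!("cannot open {}: {}", path, err);
            return;
        }
    };
    // A setns(2) call would go here, between opening the file and executing it.
    let argv = vec![CString::new(path).expect("argument has no interior NUL")];
    let err = execveat_fd(file.as_raw_fd(), &argv, &[]);
    eprintln!("execveat failed: {}", err);
}
```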

For more details, check the documentation and the source in the repository.

Caveats

This of course comes with several caveats, and it should not be seen as a general-purpose technique.

It explicitly trades off some high-level isolation (sandboxing, PID segmentation, privilege dropping) for more low-level flexibility. This is generally useless and dangerous for unprivileged workloads; however, some system helpers may already require it (e.g. containerized mount utilities).

This actively exploits the non-orthogonality of Linux namespaces, using a shared PID-namespace to pivot across private mount-namespaces. As usual, this may make people very angry and can be widely regarded as a bad move.

Moreover, the syscalls shown here impose additional restrictions on the running process (single-threaded namespace manipulation, exec by FD), which may make this technique impossibly hard in some languages (e.g. Golang).
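The single-thread restriction can be checked up front: setns(2) documents that a process cannot join a new mount-namespace while sharing filesystem attributes (CLONE_FS) with another task, which is the default between threads. The following is a minimal sketch of such a guard, assuming procfs is mounted at /proc; the helper names `parse_thread_count` and `current_thread_count` are hypothetical and not part of fxe.

```rust
use std::fs;
use std::io;

/// Extracts the `Threads:` field from the text of /proc/<pid>/status.
/// (Hypothetical helper, for illustration only.)
fn parse_thread_count(status: &str) -> Option<u32> {
    status
        .lines()
        .find_map(|line| line.strip_prefix("Threads:"))
        .and_then(|value| value.trim().parse().ok())
}

/// Reports how many threads the current process has, if procfs exposes it.
fn current_thread_count() -> io::Result<Option<u32>> {
    let status = fs::read_to_string("/proc/self/status")?;
    Ok(parse_thread_count(&status))
}

fn main() {
    match current_thread_count() {
        Ok(Some(1)) => println!("single-threaded: setns(2) on a mount-namespace can be attempted"),
        Ok(Some(n)) => eprintln!("{} threads running: joining a mount-namespace is likely to fail", n),
        Ok(None) => eprintln!("no Threads: field found in /proc/self/status"),
        Err(err) => eprintln!("cannot read /proc/self/status: {}", err),
    }
}
```

A runtime that spawns threads before user code runs cannot easily satisfy this condition, which is the difficulty mentioned above.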

Motivations

There are several reasons behind this project and post, all of them equally important.

The initial one was experimenting with bind-mounts. I have hit use cases where the source path was not known a priori, and I was looking for some way to remove this requirement.

Then, there is currently a push in the Kubernetes world to move to containerized mount-utilities. This is however stepping into the minefield of recursive-shared mount propagation through the init mount-namespace. I've shot myself in the foot with that before, and I wanted to look for an alternative approach.

Moreover, I stumbled upon fexecve(3) and I wanted to verify that the usage described in this post was actually feasible.

The last part of the story is clearly personal research. I feel that Rust is generally well suited for container-related technologies, and that this field may benefit from some more experiments and docs. This is my little contribution to it :)

Notes

No further notes or edits on this post so far.

Binary execution across Linux mount-namespaces
