Rebound: auto-restarting processes in containers

I just added rebound as an automatic process restarter for use in Docker containers, mostly because I needed one and could not find an existing implementation.

It is, like gosu and tini, intended to be used inside Docker containers and is used to automatically restart a process, typically a daemon.

rebound /usr/bin/my-server --port 8080

Rebound takes a binary and its arguments and automatically restarts the process if it crashes, for example because it runs out of memory or has a bug that causes a segfault. If the process exits cleanly with an exit code of 0, it is not restarted (although you can request that by passing the option -0, if you need to). Rebound also exits cleanly on SIGTERM and SIGINT, which allows an orderly shutdown of the daemon process. In addition, it reaps child processes to prevent zombies and has crash loop protection to deal with misbehaving processes.
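To give an idea of what crash loop protection can mean in practice, here is a minimal Python sketch of one common approach: allow at most a fixed number of restarts within a sliding time window. This is an illustration only and not necessarily how rebound does it; the class name CrashLoopGuard and the parameters max_restarts and window_seconds are hypothetical.

```python
from collections import deque


class CrashLoopGuard:
    """Hypothetical sliding-window limiter for restarts.

    Assumption: a crash loop is defined as more than `max_restarts`
    restarts within `window_seconds`. Rebound's real policy may differ.
    """

    def __init__(self, max_restarts=5, window_seconds=60.0):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._times = deque()  # timestamps of recent restarts

    def allow_restart(self, now):
        """Return True and record the restart if it is within budget."""
        # Forget restarts that have fallen out of the window.
        while self._times and now - self._times[0] > self.window_seconds:
            self._times.popleft()
        if len(self._times) >= self.max_restarts:
            return False
        self._times.append(now)
        return True
```

A supervisor would call allow_restart(time.monotonic()) before each restart and give up (or back off) when it returns False.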

To understand why it is needed, we need to do a quick digression into Kubernetes.

Kubernetes pods. Kubernetes has become the standard container management system. In brief, it contains the logic to manage multiple Docker containers and group them into Pods, and it also manages the full life-cycle of containers, taking care of scaling, fail-over, load-balancing, service discovery, and a bunch of other things. Most of you reading this are familiar with that, so nothing new here.

Kubernetes organizes a set of containers into a pod (see figure), which is a collection of containers, volumes, and other components that need to run on the same machine. If one of the containers crashes or becomes unresponsive, Kubernetes knows how to restart it or start a new one. In addition, volumes are shared between containers, so if a container crashes, the data in the volume remains until the container has restarted.

Because of this, it is normally not necessary to have any restart logic for processes inside a container: you just create your pod and connect all containers using volumes. However, there is one particular case where you need to deal with restarting the process yourself and that is related to databases.

Kubernetes and database containers. If you have a container with the database process running, you want to avoid restarting it unnecessarily. A restart requires flushing shared buffers and making sure everything is persisted to disk. Once the database server is running again, it has to load data back into shared buffers before the system performs well again.

PostgreSQL control plane. If you’re building a more complex database system where you want to control the life-cycle of the database server yourself rather than letting Kubernetes handle it in the normal manner, you need to add a control plane (process) to the PostgreSQL container. It presents an API to the other processes in the pod and can call pg_ctl to initialize, start, and stop the PostgreSQL process. However, if you start a database process using pg_ctl, it starts as a process inside the same container as the control plane.

This poses no problem normally: the database process will happily run inside the container alongside the control plane, but all software can crash, including the control plane process. The control plane process itself is (or can be implemented as) a lightweight binary, and restarting it when necessary does not pose a problem. However, if the control plane process crashes due to a bug or lack of resources (memory, for example), you want to avoid restarting the entire container since that would mean that you need to restart the database process as well.
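As a rough illustration of the pg_ctl calls such a control plane would make, here is a small Python sketch. The function names pg_ctl_command and run_pg_ctl are made up for this example, and the chosen options (-w to wait for startup, -m fast for shutdown) are assumptions about what a control plane might want, not a description of any particular implementation.

```python
import subprocess


def pg_ctl_command(action, data_dir):
    """Build the pg_ctl command line for a control-plane action.

    Hypothetical helper: the mapping from action to options is an
    assumption made for illustration.
    """
    if action == "init":
        return ["pg_ctl", "init", "-D", data_dir]
    if action == "start":
        # -w: wait until the server is accepting connections.
        return ["pg_ctl", "start", "-D", data_dir, "-w"]
    if action == "stop":
        # -m fast: disconnect clients and shut down without waiting for them.
        return ["pg_ctl", "stop", "-D", data_dir, "-m", "fast"]
    raise ValueError(f"unknown action: {action}")


def run_pg_ctl(action, data_dir):
    """Run pg_ctl and return its exit code (requires pg_ctl on PATH)."""
    return subprocess.run(pg_ctl_command(action, data_dir)).returncode
```

Since pg_ctl starts the server as a separate background process, the PostgreSQL server keeps running even if the process that called run_pg_ctl crashes later, which is exactly the situation described above.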

Automatically restarting the control plane. Here rebound is useful. If the control plane process crashes for unexpected reasons (that is, it does not do a normal shutdown), you want to quickly restart it so that you do not have to restart the container.

The implementation is trivial: it just spawns a sub-process executing the binary provided and then uses waitpid(2) to wait for it to exit. If the sub-process exits abnormally, rebound simply starts it again. If it exits normally or due to SIGTERM or SIGINT (or SIGKILL, which can’t be blocked), rebound will also exit since this means that the container is shutting down. Spawning a process and waiting for it to exit using waitpid, wait4, or any of the other wait functions is an old trick, but to my surprise I could not find any standard tool for this to use in Docker containers, so I just tossed one together.
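The spawn-and-wait loop described above can be sketched in a few lines of Python using os.fork and os.waitpid. This is not rebound's actual code, just a minimal illustration of the trick on a POSIX system. It assumes that "abnormal" means a non-zero exit code or death by a signal other than SIGTERM, SIGINT, or SIGKILL; the function names and the max_starts parameter (used only to keep the example bounded) are hypothetical.

```python
import os
import signal

# Signals that are read here as "the container is shutting down".
SHUTDOWN_SIGNALS = (signal.SIGTERM, signal.SIGINT, signal.SIGKILL)


def run_once(argv):
    """Fork, exec argv in the child, and return the raw wait status."""
    pid = os.fork()
    if pid == 0:
        try:
            os.execvp(argv[0], argv)  # only returns (by raising) on failure
        finally:
            os._exit(127)  # exec failed; leave the child immediately
    _, status = os.waitpid(pid, 0)
    return status


def should_restart(status):
    """Decide from a wait status whether the child exited abnormally."""
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status) != 0
    if os.WIFSIGNALED(status):
        return os.WTERMSIG(status) not in SHUTDOWN_SIGNALS
    return False


def supervise(argv, max_starts=None):
    """Run argv until it stops normally; return how often it was started."""
    starts = 0
    while True:
        status = run_once(argv)
        starts += 1
        if not should_restart(status):
            return starts
        if max_starts is not None and starts >= max_starts:
            return starts
```

A real tool also has to forward signals it receives to the child and reap orphaned grandchildren, which this sketch leaves out.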

Using rebound in your Docker container. If you want, you can download and compile it yourself. There are example containers for libc builds (Debian to be precise) and musl builds (Alpine image), but they are also available on DockerHub, so if you use a multi-stage build, you can add this code to your container build.

FROM mkindahl/rebound:musl AS rebound

FROM alpine:3
COPY --from=rebound /usr/local/bin/rebound /usr/local/bin/rebound
RUN apk add --no-cache my-app
ENTRYPOINT ["rebound"]
CMD ["my-app"]

Mats

dbmsdrops.kindahl.net

Long time developer with a keen interest in databases, distributed systems, and programming languages. Currently working as Database Architect at Timescale.
