Docker Images - Finally Understandable

Last updated on April 22, 2024

Tips & tricks for building Docker images as quickly as possible and keeping them as small as possible.

What are we trying to understand?

Whenever you build a Docker image, say, to bake your Java/Node/Python application into one, you’ll be confronted with the following two questions:

  • How can I make the docker build command run as fast as possible?

  • How can I make sure that the resulting Docker image is as small as possible?

You will want to continue reading for answers to these questions.

Docker Image Layers 101

Take a look at the following Dockerfile.

FROM eclipse-temurin:17-jdk
ARG JAR_FILE=build/libs/*.jar
COPY  ${JAR_FILE} app.jar
ENTRYPOINT ["java","-jar","/app.jar"]

Running docker build -t myapp . on this Dockerfile gives you one Docker image, which is based on a Java 17 (Eclipse Temurin) image and contains and runs our Java application (the app.jar file).

What might not be immediately obvious is that every single line of your Dockerfile results in one Docker image layer, and every image consists of several such layers.

You can confirm this by running e.g.:

 docker image history myapp

This prints the image layers, one per line:

IMAGE          CREATED         CREATED BY                                      SIZE      COMMENT
3ca5a60826f0   8 minutes ago   ENTRYPOINT ["java" "-jar" "/app.jar"]           0B        buildkit.dockerfile.v0
<missing>      8 minutes ago   COPY build/libs/*.jar app.jar # buildkit        19.7MB    buildkit.dockerfile.v0
<missing>      8 minutes ago   ARG JAR_FILE=build/libs/*.jar                   0B        buildkit.dockerfile.v0
... (other layers from the base image left out)

There is a layer for our ENTRYPOINT line, one for COPY and one for ARG.

The layer containing our app.jar file (COPY) is roughly 20MB large, with 0B metadata layers for the ENTRYPOINT and ARG lines.

Now, what do we do with this information?

Your layers can easily bloat

Imagine you want to install a package through your package manager, and for that, you want to run apt update, which updates the package manager’s index.

FROM eclipse-temurin:17-jdk
RUN apt update -y
ARG JAR_FILE=build/libs/*.jar
COPY  ${JAR_FILE} app.jar
ENTRYPOINT ["java","-jar","/app.jar"]

Let’s have a look at the resulting layers (docker image history myapp) and focus on the very last line (RUN /bin/sh -c…​):

IMAGE          CREATED         CREATED BY                                      SIZE      COMMENT
c14a18a04751   8 seconds ago   ENTRYPOINT ["java" "-jar" "/app.jar"]           0B        buildkit.dockerfile.v0
<missing>      8 seconds ago   COPY build/libs/*.jar app.jar # buildkit        19.7MB    buildkit.dockerfile.v0
<missing>      8 seconds ago   ARG JAR_FILE=build/libs/*.jar                   0B        buildkit.dockerfile.v0
<missing>      8 seconds ago   RUN /bin/sh -c apt update -y # buildkit         45.7MB    buildkit.dockerfile.v0

Wooha! Running apt update has added a new layer with a whopping 45.7MB to our resulting Docker image. Now every time you push or pull your image, you’ll need to transfer those additional megabytes.

Layers are additive

Let’s continue with the example above and add a couple more run commands, to install the latest mysql package.

FROM eclipse-temurin:17-jdk
RUN apt update -y
RUN apt install -y mysql-server
RUN rm -rf /var/lib/apt/lists/*
ARG JAR_FILE=build/libs/*.jar
COPY  ${JAR_FILE} app.jar
ENTRYPOINT ["java","-jar","/app.jar"]

In addition, we’re removing the apt index cache (the 45.7MB from above) with the rm -rf /var/lib/apt/lists/* command. Let’s see what our image history now looks like:

59f82a5b4c5a   6 seconds ago   ENTRYPOINT ["java" "-jar" "/app.jar"]           0B        buildkit.dockerfile.v0
<missing>      6 seconds ago   COPY build/libs/*.jar app.jar # buildkit        19.7MB    buildkit.dockerfile.v0
<missing>      6 seconds ago   ARG JAR_FILE=build/libs/*.jar                   0B        buildkit.dockerfile.v0
<missing>      6 seconds ago   RUN /bin/sh -c rm -rf /var/lib/apt/lists/* #…   0B        buildkit.dockerfile.v0
<missing>      7 seconds ago   RUN /bin/sh -c apt install -y mysql-server #…   605MB     buildkit.dockerfile.v0
<missing>      8 minutes ago   RUN /bin/sh -c apt update -y # buildkit         45.7MB    buildkit.dockerfile.v0

Waah, what’s that? Even though we deleted the apt cache files, the 45.7MB layer is still there (in addition to the 605MB MySQL layer, btw).

That’s because layers are strictly additive and immutable. You can certainly delete those files in a later layer, but the earlier layers will still contain them.

How can you get around this? A simple workaround is to chain all three commands into a single RUN instruction, which results in a single layer:

FROM eclipse-temurin:17-jdk
RUN apt update -y &&  \
    apt install -y mysql-server &&  \
    rm -rf /var/lib/apt/lists/*
ARG JAR_FILE=build/libs/*.jar
COPY  ${JAR_FILE} app.jar
ENTRYPOINT ["java","-jar","/app.jar"]

Let’s look at the image’s history now:

IMAGE          CREATED          CREATED BY                                      SIZE      COMMENT
4b8c0f7f895a   14 seconds ago   ENTRYPOINT ["java" "-jar" "/app.jar"]           0B        buildkit.dockerfile.v0
<missing>      14 seconds ago   COPY build/libs/*.jar app.jar # buildkit        19.7MB    buildkit.dockerfile.v0
<missing>      14 seconds ago   ARG JAR_FILE=build/libs/*.jar                   0B        buildkit.dockerfile.v0
<missing>      14 seconds ago   RUN /bin/sh -c apt update -y &&      apt ins…   605MB     buildkit.dockerfile.v0

Ha! We at least saved the 45.7MB for now. What else is wrong with this, though?

Make it reproducible

You ideally want your builds to be reproducible (who would have thought). By running apt update and then installing whatever latest package there is in the repo, you effectively break that reproducibility, because package versions might change between builds.

The gist:

  • Install only specific versions of whatever you are trying to install

  • Avoid (package-manager-of-your-choice)'ing in your Dockerfiles for your application in the first place - instead, build a new base image and use that in your Dockerfile’s FROM. This will also be a lot faster!
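
As a rough sketch of the first point, pinning a package version could look like the Dockerfile below. The MYSQL_VERSION build argument is a name made up for this example, and it deliberately has no default: you would pass the exact version string that your base image’s package repository offers (apt-cache madison mysql-server lists the candidates).

```dockerfile
FROM eclipse-temurin:17-jdk

# Hypothetical build argument, supplied at build time, e.g.
#   docker build --build-arg MYSQL_VERSION=YOUR_VERSION -t myapp .
ARG MYSQL_VERSION

# package=version asks apt for exactly that version instead of the latest one
RUN apt update -y && \
    apt install -y mysql-server=${MYSQL_VERSION} && \
    rm -rf /var/lib/apt/lists/*

ARG JAR_FILE=build/libs/*.jar
COPY ${JAR_FILE} app.jar
ENTRYPOINT ["java","-jar","/app.jar"]
```

Keep in mind that the 17-jdk tag is itself a moving target. If you want to pin the base image as well, FROM also accepts a digest reference (image@sha256:...).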

Layer order matters

You’ll want to make sure to put layers that change a lot towards the bottom of your Dockerfile, whereas more stable layers should be ordered on top.

Why? Because when building an image, Docker has to rebuild the first layer that changed since the last build, plus every layer that comes after it.

A practical example: Imagine that you want to package an index.html file into your image, which changes a lot, i.e. more often than anything else.

FROM eclipse-temurin:17-jdk
COPY index.html index.html
RUN apt update -y &&  \
    apt install -y mysql-server &&  \
    rm -rf /var/lib/apt/lists/*
ARG JAR_FILE=build/libs/*.jar
COPY  ${JAR_FILE} app.jar
ENTRYPOINT ["java","-jar","/app.jar"]

You can see the COPY index.html index.html line added almost at the top of the Dockerfile. Now, every time the index.html file changes, Docker needs to rebuild all subsequent layers, i.e. the RUN apt update, ARG and COPY app.jar layers, which is a huge time sink. On my machine, all of the above takes roughly 17 seconds to finish.

If, however, you re-order the statement towards the bottom, Docker can re-use all previous layers, as they haven’t changed.

FROM eclipse-temurin:17-jdk
RUN apt update -y &&  \
    apt install -y mysql-server &&  \
    rm -rf /var/lib/apt/lists/*
ARG JAR_FILE=build/libs/*.jar
COPY  ${JAR_FILE} app.jar
COPY index.html index.html
ENTRYPOINT ["java","-jar","/app.jar"]

Now a new docker build only takes 0.5 seconds (on my machine). Much, much better!

Here are the golden layering rules:

  • Steps that rarely change or are time/network-intensive (e.g. installing new software) → Top

  • Files that change often (e.g. source code) → Very Low

  • ENV, CMD, etc. → Bottom
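
Applied to a Node project, these rules might look like the following sketch. The node:20 base image and the file names (package.json, package-lock.json, server.js) are assumptions made up for this illustration, not part of the Java example above.

```dockerfile
# Sketch only: base image and file names are illustrative assumptions
FROM node:20
WORKDIR /app

# Rarely changes and is network-intensive: keep near the top
COPY package.json package-lock.json ./
RUN npm ci

# Changes often: keep low, so an edit only invalidates layers from here on
# (assumes node_modules is excluded from the build context)
COPY . .

# Metadata: bottom
ENV NODE_ENV=production
CMD ["node", "server.js"]
```

With this ordering, editing a source file should leave the npm ci layer cached, while changing package.json or package-lock.json triggers a fresh install.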

When does Docker re-build layers?

Docker doesn’t rebuild all image layers every time you run docker build. There is a specific set of rules on when and how Docker will cache your layers, and you can read about them in the official documentation.

The gist is, whenever you run docker build, Docker checks two things for each instruction:

  • Did the command in the Dockerfile change (e.g. did you change RUN blah to RUN doh)?

  • In the case of ADD or COPY, did any of the involved files (or rather their checksums) change?

.dockerignore

When you run docker build -t <tag> ., the trailing . (your current directory) becomes your so-called build context. That means all the files inside your current directory will be tar’ed up and sent to your local or remote Docker daemon to perform the build.

If you want to make sure that some directories never make it to your build daemon, thus keeping things snappy and small, you can create a .dockerignore file, which has a similar syntax to .gitignore.

In general, you should put any files/directories that are not relevant to your build here (e.g. your .git folder), which is especially important when using commands like COPY . /somewhere, because then your entire project will end up in the resulting image.
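
A minimal .dockerignore might look like the sketch below. The entries are typical examples rather than a complete list, and which ones apply depends on your project. Note that build/libs is deliberately not excluded here, because the Dockerfiles above copy the jar file from there.

```dockerignore
# Version control metadata
.git

# Dependencies that get installed during the build (Node projects)
node_modules

# Local editor/IDE settings and logs
.idea
.vscode
*.log
```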

An npm example: You might want to run e.g. npm install at build time and let it download its dependencies, instead of (slowly) copying your node_modules folder in. That makes node_modules a good candidate for the .dockerignore file as well. However, if you do that, here’s another trick you’d want to know about: directory caching.

Directory Caching

Say you run npm install, pip install, gradlew build, etc. to build your image. This will lead to dependencies being downloaded and a new image layer being created. Now, if that image layer has to be rebuilt, all dependencies will be re-downloaded on the next build, because there won’t be a .npm, .cache or .gradle folder available with the already downloaded dependencies.

But you can change that! Let’s take pip as an example and change the following line:

FROM ...
RUN pip install -r requirements.txt
CMD ...

to:

RUN --mount=type=cache,target=/root/.cache pip install -r requirements.txt

This will tell Docker to mount a cache folder (/root/.cache) into the container during build time. In this case, that is the folder pip caches its dependencies in for the root user. The trick is: this folder will not end up in the resulting image, but it will be available to pip in all subsequent builds, and you’ll get a nice speed-up!

The same goes for NPM, Gradle, or any other package manager out there. Just make sure to specify the correct target folder.
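
For npm and Gradle, the same idea might look like the lines below. This is a sketch that assumes the build runs as root, so the caches live in /root/.npm and /root/.gradle by default; adjust the target if your image uses a different user or cache location. Cache mounts need BuildKit, which recent Docker versions use by default.

```dockerfile
# Alternative 1 (npm): the download cache lives in ~/.npm by default
RUN --mount=type=cache,target=/root/.npm npm ci

# Alternative 2 (Gradle): downloaded dependencies live in ~/.gradle by default
RUN --mount=type=cache,target=/root/.gradle ./gradlew build
```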

What are multistage builds?

Coming Soon.

Fin

This article should have given you a good grasp of Docker image fundamentals. If you have any questions or other comments, please post them in the comment section below.

Acknowledgments & References

Thanks to Maarten Balliauw, Andreas Eisele for comments/corrections/discussion.


let mut author = ?

I'm @MarcoBehler and I share everything I know about making awesome software through my guides, screencasts, talks and courses.

Follow me on Twitter to find out what I'm currently working on.