Speed Up That Build, now! – Part 2: Filesystem revisited

June 12, 2014 - Andreas Eisele

Last time in our "Speed Up That Build, now!" propaganda series we showed a specific approach that uses a ram disk to speed up a given build. This time we try to generalize our ideas and give you some additional options for the file system. As an added bonus we even try to do proper benchmarking this time. ;)

Benchmarking Approach

To give you some reproducible figures we will be using the tool fio. All tests will be executed on a VirtualBox VM. The reason for this is that your build server will usually also be a VM on some arbitrary host hardware that you don't have any control over. If that's not the case for you, please consider skipping the nasty file system details and go straight to our first advice here.

The host hardware in question is this:

Intel(R) Core(TM)2 Quad CPU Q9300 @ 2.50GHz (4 Cores/Threads)
8GB Generic DDR-2 RAM

The VM (Ubuntu Server 14.04 LTS, Linux Kernel 3.13.0-27-generic) is configured to see this hardware instead:

Intel(R) Core(TM)2 Quad CPU Q9300 @ 2.50GHz (1 Core/Thread)
4GB RAM
SATA controller: Intel Corporation 82801HM/HEM (ICH8M/ICH8M-E) SATA Controller [AHCI mode] (rev 02)
10GB HDD space on a pre-created disk image (non-dynamic)

We are aware that, combined with disk caching, kernel caching, etc., this adds a lot more moving layers of uncertainty. Sadly that is also the picture you will find in the real world. No amount of scientific rigor will save you from that.

Our test scenario for fio consists of random asynchronous reads and writes of 64MB files by 4 concurrent jobs. We also try to use direct IO to go through as few buffering layers as possible. The job file for that looks like this:

; -- start job file --
[random-read-write]
ioengine=libaio
iodepth=4
rw=randrw
size=64m
direct=1
numjobs=4
; -- end job file --

Have a look here for a detailed explanation about the various options.
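
If you want to run the same job against several file systems, it can help to generate the job file from a script. The following is only a minimal sketch: the function name `render_fio_job` and the optional `directory` argument are our own additions for illustration and were not part of the benchmark setup. It makes `direct=1` a switch because not every file system accepts direct IO.

```python
from typing import Optional


def render_fio_job(direct: bool = True, directory: Optional[str] = None) -> str:
    """Render the random read/write job from this post as fio job-file text.

    `directory` is an optional extra (an assumption of this sketch): when set,
    fio places its test files there instead of the current directory.
    """
    lines = [
        "[random-read-write]",
        "ioengine=libaio",
        "iodepth=4",
        "rw=randrw",
        "size=64m",
    ]
    if direct:
        # Bypass the page cache where the file system supports it.
        lines.append("direct=1")
    lines.append("numjobs=4")
    if directory is not None:
        lines.append(f"directory={directory}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print a variant without direct IO for a hypothetical mount point.
    print(render_fio_job(direct=False, directory="/mnt/ramdisk"), end="")
```

Writing the returned text to a file and passing that file to fio should then behave like the hand-written job file above, as long as the options are left unchanged.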

Starting Point: ext4 default

To get some first numbers we took what our clean install of Ubuntu Server 14.04 provided us as a default. That is ext4 as of Kernel 3.13.0 with no explicit mount options.

random-read-write: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=4
[...]
Run status group 0 (all jobs):
   READ: io=130880KB, aggrb=458KB/s, minb=114KB/s, maxb=116KB/s, mint=280636msec, maxt=285251msec
  WRITE: io=131264KB, aggrb=460KB/s, minb=115KB/s, maxb=116KB/s, mint=280636msec, maxt=285251msec

Disk stats (read/write):
  sda: ios=32641/32847, merge=1/115, ticks=4260404/316136, in_queue=4576544, util=100.00%

Phew, fio spits out a lot of information. Ignorant as we are, we focus on the summary only, and from that a single number is enough for a comparison: aggrb= tells us the aggregated throughput over all jobs. Apart from that we can see that the disk was utilized the whole time (util=100.00%). We also spent some time in the queue, but in the end reading and writing across all jobs each added up to roughly 460KB/s. That may not be much, but we are not measuring raw performance here. The question is whether we can improve on this.
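
If you collect several of these runs, pulling the numbers out by hand gets tedious. Here is a small sketch that parses a summary line in the format shown above. The names `parse_summary` and `derived_aggrb` are ours. The second function reflects an observation about the figures in this post, not a statement about fio's internals: aggrb matches the total io divided by the longest runtime (maxt), rounded down.

```python
import re

# Matches the READ/WRITE summary lines as printed in this post.
SUMMARY_RE = re.compile(
    r"^\s*(READ|WRITE): io=(\d+)KB, aggrb=(\d+)KB/s, .*maxt=(\d+)msec"
)


def parse_summary(line: str) -> dict:
    """Extract direction, total io, aggregated bandwidth and longest runtime."""
    match = SUMMARY_RE.match(line)
    if match is None:
        raise ValueError(f"not a fio summary line: {line!r}")
    kind, io_kb, aggrb, maxt = match.groups()
    return {
        "kind": kind,
        "io_kb": int(io_kb),
        "aggrb_kb_s": int(aggrb),
        "maxt_ms": int(maxt),
    }


def derived_aggrb(io_kb: int, maxt_ms: int) -> int:
    """Total io divided by the longest runtime, in KB/s, rounded down."""
    return io_kb * 1000 // maxt_ms


if __name__ == "__main__":
    line = ("   READ: io=130880KB, aggrb=458KB/s, minb=114KB/s, maxb=116KB/s, "
            "mint=280636msec, maxt=285251msec")
    summary = parse_summary(line)
    print(summary["aggrb_kb_s"], derived_aggrb(summary["io_kb"], summary["maxt_ms"]))
```

Other fio versions may print the summary in a different layout, so treat the regular expression as tied to the output shown here.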

Option 1: tune that ext4

Tuning a file system has much to do with your underlying use case. There is no solid "use that" advice option-wise. In our last post we tried to tune some settings like this:

# /etc/fstab
UUID=b0da9226-a954-45f3-bce0-e186b82f0d48 /build/real     ext4    noatime,data=writeback,nobh,barrier=0,commit=300        0       2

What does that mean?

noatime: advises the kernel not to update the access time every time a file is accessed
data=writeback: data ordering is not preserved, so data may be written to the main file system after its metadata has been committed to the file system's journal
nobh: try to avoid associating buffer heads with data pages (supported only in writeback mode)
barrier=0: don't use write barriers (sacrifice proper journal entry write ordering for performance)
commit=300: raise the sync interval (from buffer to disk) from the default of 5s to 300s
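
Before benchmarking it is worth checking that the options really are in effect for the mount point. A minimal sketch, assuming the usual `/proc/mounts` layout of whitespace-separated fields (device, mount point, type, options, ...): the function name `mount_options` and the sample line are made up for illustration, and the kernel may report some options with a different spelling than the one used in fstab.

```python
def mount_options(mounts_text: str, mount_point: str) -> list:
    """Return the option list for `mount_point` from /proc/mounts-style text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, comma-separated options, ...
        if len(fields) >= 4 and fields[1] == mount_point:
            return fields[3].split(",")
    raise KeyError(mount_point)


# Illustrative sample only; on a real system read the text from /proc/mounts.
SAMPLE = "/dev/sdb1 /build/real ext4 rw,noatime,data=writeback,commit=300 0 0\n"

if __name__ == "__main__":
    print(mount_options(SAMPLE, "/build/real"))
```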

Rerunning the benchmark with these mount options gives:

Run status group 0 (all jobs):
   READ: io=130880KB, aggrb=466KB/s, minb=116KB/s, maxb=118KB/s, mint=275510msec, maxt=280426msec
  WRITE: io=131264KB, aggrb=468KB/s, minb=117KB/s, maxb=119KB/s, mint=275510msec, maxt=280426msec

Disk stats (read/write):
  sda: ios=32682/32797, merge=0/6, ticks=4441360/13788, in_queue=4455544, util=100.00%

Well, that is not awfully different from the defaults. It looks like the throughput increased by 8KB/s, but this could very well just be an arbitrary fluctuation in our measurements. To be honest, the most likely reason is that we either already hit the upper bound of what our CPU is capable of, or that using direct IO already achieves what we tried to achieve with our tuning options. What will happen if we take the disk right out of the picture then?
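
To put the 8KB/s into proportion, here is the relative change worked out from the two summaries. This is plain arithmetic on the aggrb figures reported above; the helper name is ours. Since each configuration was run only once, there is no variance estimate to compare the gain against.

```python
def relative_change_percent(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# aggrb figures in KB/s from the ext4 default and ext4 tuned runs.
read_gain = relative_change_percent(458, 466)
write_gain = relative_change_percent(460, 468)

if __name__ == "__main__":
    print(f"read: {read_gain:.2f}%  write: {write_gain:.2f}%")
```

Both values come out below 2%, which fits the suspicion that the difference is within measurement noise.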

Option 2: ram disk

Much in the same way as we advocated the use of ram disks to get rid of some IO penalties last time, let's just blindly use tmpfs now and forget about storing the output of our build for the time being. What does the benchmark say about tmpfs? Nothing at first, because direct IO seems to be unavailable for tmpfs. After removing that option the result is this:

Run status group 0 (all jobs):
   READ: io=130880KB, aggrb=137478KB/s, minb=34369KB/s, maxb=34882KB/s, mint=938msec, maxt=952msec
  WRITE: io=131264KB, aggrb=137882KB/s, minb=34470KB/s, maxb=34985KB/s, mint=938msec, maxt=952msec

That looks very, very fast. Just as if we were accessing the RAM directly. Well, that's kinda what we did. Of course now we have some blob of data sitting in the virtual memory of our machine that may or may not be swapped out by the host system later on. There is no guarantee that accessing memory from a VM is fast all the time. But then there isn't one for normal processes on a real machine either. In the end it always boils down to how many page faults you encounter etc. On top of that, our results from the last post seem to suggest that the net gain from a fast file system does not translate one-to-one to the whole build. That leaves us with one last question: What if we don't want to throw away all our builds all the time?

Option 3: ram disk synced to real disk

If you think about it, a modern file system is doing just that already. You have various read and write buffers living in kernel space and at some point these are more or less synced to the underlying block storage. The kernel has to be very smart about that because one usually doesn't want to lose the work of the last hour if the laptop battery dies or some power outage in the data center occurs. So buffer sizes are kept at a sensibly small level and syncing happens very regularly.

On the other hand, our build scenario does not involve very valuable data. After all it's just artifacts built in a defined way from some already well-defined and persisted source. So keeping this output data serves only two purposes. It could speed up our build by reducing check-out times or by smartly re-using parts of the last build (if your build tool, unlike Maven, is capable of that logic). It could also serve as a starting point for your artifacts to be deployed automatically into your on-site repository (Nexus, Red Hat Satellite, PPA, etc.).

So our idea was to build a poor man's syncing file system setup that operates without all the safety and sanity a proper file system provides but may therefore perform a little better. The caveat is of course that data may easily be lost as well.

Our first naive approach was to just unionize a tmpfs and an ext4 partition via some version of UnionFS. That didn't work. Other people had the same ideas though. That is why a whole set of simple daemon implementations exists in the Linux open source community to sync ram disks to backing stores.
In our case we used the goanysync implementation. It takes one directory on a real disk and moves its contents to a provided tmpfs location. It then symlinks the original path to that location so that it can be used as a drop-in solution. One can then manually trigger syncs back to a backup directory on the original disk or even do that periodically. If goanysync is stopped later on, the contents are synced back one last time to that backup directory and this directory gets moved to the original path again. All in all that is no magic and in no way a very thorough solution, but it can work.
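
The move, symlink and sync-back steps described above can be sketched in a few lines. This is not goanysync's actual code, only an illustration of the idea under simplifying assumptions: the function names `start`, `sync` and `stop` are ours, there is no locking or error recovery, files deleted in the ram copy are not removed from the backup, and Python 3.8 or newer is needed for `dirs_exist_ok`.

```python
import os
import shutil
import tempfile


def start(original: str, ram_dir: str, backup: str) -> None:
    """Keep `original` on disk as `backup`, copy it to `ram_dir`, symlink it."""
    os.rename(original, backup)                      # on-disk copy becomes the backup
    shutil.copytree(backup, ram_dir, symlinks=True)  # ram_dir must not exist yet
    os.symlink(ram_dir, original)                    # old path now points at the ram copy


def sync(ram_dir: str, backup: str) -> None:
    """Copy the current ram contents over the backup directory."""
    shutil.copytree(ram_dir, backup, symlinks=True, dirs_exist_ok=True)


def stop(original: str, ram_dir: str, backup: str) -> None:
    """Sync one last time, then restore the backup to the original path."""
    sync(ram_dir, backup)
    os.unlink(original)          # remove the symlink only
    shutil.rmtree(ram_dir)
    os.rename(backup, original)


def demo() -> list:
    """Run the cycle in a scratch directory (no real tmpfs involved)."""
    base = tempfile.mkdtemp()
    original = os.path.join(base, "build")
    os.mkdir(original)
    start(original, os.path.join(base, "ram"), os.path.join(base, "build.bak"))
    with open(os.path.join(original, "artifact.txt"), "w") as handle:
        handle.write("built in ram\n")
    stop(original, os.path.join(base, "ram"), os.path.join(base, "build.bak"))
    files = sorted(os.listdir(original))
    shutil.rmtree(base)
    return files


if __name__ == "__main__":
    print(demo())
```

In a real setup `ram_dir` would live on a tmpfs mount and `sync` would be called periodically. Anything written between the last sync and a crash is gone, which is exactly the trade-off discussed above.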

Running the benchmark on the directory managed by goanysync gives:

Run status group 0 (all jobs):
   READ: io=130880KB, aggrb=357595KB/s, minb=89398KB/s, maxb=102893KB/s, mint=318msec, maxt=366msec
  WRITE: io=131264KB, aggrb=358644KB/s, minb=89661KB/s, maxb=103194KB/s, mint=318msec, maxt=366msec

Looks like we preserved our performance characteristics of the pure ram disk (at least it didn't go slower). Still after syncing back to the disk we have all our files backed by the hard disk so they will survive a reboot.

Summary

Once again we tried to compare various options regarding the IO portion of a build. There is no clear winner in this. All these numbers have to be measured for a given use case in a given environment. There are trade-offs especially when it comes to ram disks and tuning options. Still if you can't change anything about your hardware try at least to optimize the software side of your build to minimize IO wait times. We hope our ideas can help with that.

file system configuration	aggregated read	aggregated write
ext4 defaults	458 KB/s	460 KB/s
ext4 tuned	466 KB/s	468 KB/s
tmpfs	137478 KB/s	137882 KB/s
tmpfs + sync	357595 KB/s*	358644 KB/s*

*The numbers in this case were larger when measuring but the overall operation took roughly the same time. These ram disk numbers are insanely high anyways so this might just be some benchmarking variation. There is no reason why putting a sync daemon on top of tmpfs could speed up tmpfs.
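
For a rough sense of scale, the table can be turned into speedup factors relative to the ext4 defaults. This is a sketch using only the figures from the table; the names `RESULTS_KB_S` and `speedup_over` are ours. Keep in mind that the ext4 runs used direct IO while the tmpfs runs could not, so the factors are not a like-for-like comparison.

```python
# (aggregated read, aggregated write) in KB/s, copied from the table above.
RESULTS_KB_S = {
    "ext4 defaults": (458, 460),
    "ext4 tuned": (466, 468),
    "tmpfs": (137478, 137882),
    "tmpfs + sync": (357595, 358644),
}


def speedup_over(baseline: str, results: dict = RESULTS_KB_S) -> dict:
    """Return (read factor, write factor) of every configuration vs. `baseline`."""
    base_read, base_write = results[baseline]
    return {
        name: (read / base_read, write / base_write)
        for name, (read, write) in results.items()
    }


if __name__ == "__main__":
    for name, (read_x, write_x) in speedup_over("ext4 defaults").items():
        print(f"{name}: read x{read_x:.1f}, write x{write_x:.1f}")
```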