Performance discovery: IOPS vs. IOPS
RavenDB is a transactional database, and we care deeply about ACID. The D in ACID stands for durability, which means that to acknowledge a transaction, we must write it to a persistent medium. Writing to disk is expensive; writing to the disk and ensuring durability is even more expensive.
After seeing some weird performance numbers on a test machine, I decided to run an experiment to understand exactly how durable writes affect disk performance.
A few words about the term durable writes. Disks are slow, so we use buffering & caches to avoid going to the disk. But a write to a buffer isn’t durable. A failure could cause it to never hit a persistent medium. So we need to tell the disk in some way that we are willing to wait until it can ensure that this write is actually durable.
This is typically done using either fsync or O_DIRECT | O_DSYNC flags. So this is what we are testing in this post.
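As a rough illustration of those two approaches, here is a minimal Python sketch. It is not part of the benchmark. The function names (`write_all`, `write_with_fsync`, `write_with_dsync`) are made up for this example, and O_DIRECT is deliberately left out because it needs aligned buffers and is not supported on every filesystem.

```python
import os
import tempfile


def write_all(fd: int, data: bytes) -> None:
    """Keep calling write() until every byte has been handed to the kernel."""
    view = memoryview(data)
    while len(view) > 0:
        written = os.write(fd, view)
        view = view[written:]


def write_with_fsync(path: str, data: bytes) -> None:
    """Buffered write, then fsync() to wait until the data is persisted."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o666)
    try:
        write_all(fd, data)  # lands in the page cache, not durable yet
        os.fsync(fd)         # blocks until the device reports the data as written
    finally:
        os.close(fd)


def write_with_dsync(path: str, data: bytes) -> None:
    """Open with O_DSYNC so that each write() returns only after the data is persisted."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_DSYNC
    fd = os.open(path, flags, 0o666)
    try:
        write_all(fd, data)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        payload = b"\0" * (256 * 1024)
        write_with_fsync(os.path.join(directory, "fsync.bin"), payload)
        write_with_dsync(os.path.join(directory, "dsync.bin"), payload)
        print("wrote", len(payload), "bytes with each approach")
```

The first variant pays the durability cost once, at the fsync call. The second pays it on every write() call, which is closer to what the dd runs with oflag=direct,sync do.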
I wanted to test things out without any of my own code, so I ran the following benchmark.
I pre-allocated a file and then ran the following commands.
Normal writes (buffered) with different sizes (256 KB, 512 KB, etc.):
dd if=/dev/zero of=/data/test bs=256K count=1024
dd if=/dev/zero of=/data/test bs=512K count=1024
Durable writes (force the disk to acknowledge them) with different sizes:
dd if=/dev/zero of=/data/test bs=256k count=1024 oflag=direct,sync
dd if=/dev/zero of=/data/test bs=512k count=1024 oflag=direct,sync
The code above opens the file using:
openat(AT_FDCWD, "/data/test", O_WRONLY|O_CREAT|O_TRUNC|O_SYNC|O_DIRECT, 0666) = 3
I got myself an i4i.xlarge instance on AWS and started running some tests. That machine has a local NVMe drive of about 858 GB, 32 GB of RAM, and 4 cores. Let’s see what kind of performance I can get out of it.
Write size | Total writes | Buffered writes |
256 KB | 256 MB | 1.3 GB/s |
512 KB | 512 MB | 1.2 GB/s |
1 MB | 1 GB | 1.2 GB/s |
2 MB | 2 GB | 731 MB/s |
8 MB | 8 GB | 571 MB/s |
16 MB | 16 GB | 561 MB/s |
2 MB | 8 GB | 559 MB/s |
1 MB | 1 GB | 554 MB/s |
4 KB | 16 GB | 557 MB/s |
16 KB | 16 GB | 553 MB/s |
What you can see here is that writes are really fast when buffered. But when I hit a certain size (above 1 GB or so), we probably start having to write to the disk itself (which is NVMe, remember). Our top speed is about 550 MB/s at this point, regardless of the size of the buffers I’m passing to the write() syscall.
I’m writing here using cached I/O, which is something that as a database vendor, I don’t really care about. What happens when we run with direct & sync I/O, the way I would with a real database? Here are the numbers for the i4i.xlarge instance for durable writes.
Write size | Total writes | Durable writes |
256 KB | 256 MB | 1.3 GB/s |
256 KB | 1 GB | 1.1 GB/s |
16 MB | 16 GB | 584 MB/s |
64 KB | 16 GB | 394 MB/s |
32 KB | 16 GB | 237 MB/s |
16 KB | 16 GB | 126 MB/s |
In other words, when using direct I/O, the smaller the write, the lower the throughput. Remember that we are talking about forcing the disk to write the data, and we need to wait for each write to complete before moving on to the next one.
For 16 KB writes, buffered writes achieve a throughput of 553 MB/s vs. 126 MB/s for durable writes. This makes sense, since the buffered writes are cached, so the OS is probably sending big batches to the disk. The numbers we have here clearly show that bigger batches are better.
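One way to read the small-write rows of the durable table is to convert throughput into writes per second. The sketch below does that arithmetic. It assumes that MB/s means 10^6 bytes per second (the way GNU dd reports it), that KB in the write size means 1,024 bytes, and that only one write is in flight at a time. The function names are illustrative.

```python
KIB = 1024


def iops(throughput_mb_s: float, write_size_bytes: int) -> float:
    """Writes per second implied by a throughput figure and a write size."""
    return throughput_mb_s * 1_000_000 / write_size_bytes


def latency_us(throughput_mb_s: float, write_size_bytes: int) -> float:
    """Average time per write in microseconds, with one write in flight at a time."""
    return 1_000_000 / iops(throughput_mb_s, write_size_bytes)


# (write size in bytes, durable throughput in MB/s), taken from the table above
durable_rows = [(64 * KIB, 394), (32 * KIB, 237), (16 * KIB, 126)]

for size, mb_s in durable_rows:
    print(f"{size // KIB:3d} KB: {iops(mb_s, size):7.0f} writes/s, "
          f"{latency_us(mb_s, size):6.1f} us per write")
```

Under those assumptions, the three rows come out at roughly 6,000 to 7,700 writes per second, or about 130 to 170 microseconds per write. That suggests the wait for each write dominates here, not the bandwidth of the drive.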
My next test was to see what would happen when I try to write things in parallel. In this test, we run 4 processes that write to the disk using direct I/O and measure their output.
I assume that I’m maxing out the throughput on the drive, so the total rate across all commands should be equivalent to the rate I would get from a single command.
To run this in parallel I’m using a really simple mechanism - just spawn processes that would do the same work. Here is the command template I’m using:
parallel -j 4 --tagstring 'Task {}' dd if=/dev/zero of=/data/test bs=16M count=128 seek={} oflag=direct,sync ::: 0 1024 2048 3072
This would write to 4 different portions of the same file, but I also tested that on separate files. The idea is to generate a sufficient volume of writes to stress the disk drive.
Write size | Total writes | Durable & Parallel writes |
16 MB | 8 GB | 650 MB/s |
16 KB | 64 GB | 252 MB/s |
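For the parallel dd command above, it helps to remember that seek counts output blocks of size bs, not bytes. The following sketch works out which byte range each of the four tasks covers and checks that the ranges don't overlap. The helper names are made up for this example.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def dd_ranges(block_size: int, count: int, seeks: list[int]) -> list[tuple[int, int]]:
    """Byte range (start, end), end exclusive, written by each dd task."""
    return [(seek * block_size, (seek + count) * block_size) for seek in seeks]


def has_overlap(ranges: list[tuple[int, int]]) -> bool:
    """True if any two ranges share at least one byte."""
    ordered = sorted(ranges)
    return any(prev_end > start
               for (_, prev_end), (start, _) in zip(ordered, ordered[1:]))


# bs=16M count=128 seek={} ::: 0 1024 2048 3072
ranges = dd_ranges(16 * MIB, 128, [0, 1024, 2048, 3072])

for start, end in ranges:
    print(f"task writes {start // GIB:2d} GiB .. {end // GIB:2d} GiB")

total = sum(end - start for start, end in ranges)
print("total:", total // GIB, "GiB, overlap:", has_overlap(ranges))
```

Each task writes 2 GiB starting at a 16 GiB boundary, which gives the 8 GB total shown in the first row of the table.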
I also decided to write some low-level C code to test out how this works with threads and a single program. You can find the code here. I basically spawn NUM_THREADS threads, and each will open a file using O_SYNC | O_DIRECT and write to the file WRITE_COUNT times with a buffer of size BUFFER_SIZE.
This code just opens a lot of files and tries to write to them using direct I/O with 8 KB buffers. In total, I’m writing 16 GB (128 MB x 128 threads) to the disk. I’m getting a rate of about 320 MB/sec when using this approach.
As before, increasing the buffer size seems to help here. I also tested a version where we write using buffered I/O and call fsync every now and then, but I got similar results.
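The linked C program is not reproduced here. As a rough stand-in, this is a minimal Python sketch of the same idea: a pool of threads, each opening its own file with O_SYNC and writing a fixed number of buffers. The names `write_file` and `run` and the `direct` switch are assumptions made for this example. O_DIRECT is off by default because it is Linux-specific and some filesystems (tmpfs, for instance) reject it.

```python
import mmap
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def write_file(path, buffer_size, write_count, direct=False):
    """Write `write_count` zero-filled buffers to `path` and return the bytes written."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    if direct:
        flags |= os.O_DIRECT  # needs an aligned buffer and a block-multiple size
    # An anonymous mmap is page-aligned and zero-filled, which suits O_DIRECT.
    with mmap.mmap(-1, buffer_size) as buf:
        fd = os.open(path, flags, 0o666)
        written = 0
        try:
            for _ in range(write_count):
                written += os.write(fd, buf)  # the GIL is released during the syscall
        finally:
            os.close(fd)
    return written


def run(directory, num_threads, buffer_size, write_count, direct=False):
    """Run one writer per thread and return (total bytes, MB/s across all threads)."""
    paths = [os.path.join(directory, f"file-{i}.bin") for i in range(num_threads)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(write_file, path, buffer_size, write_count, direct)
                   for path in paths]
        total = sum(future.result() for future in futures)  # re-raises worker errors
    elapsed = time.perf_counter() - start
    return total, total / elapsed / 1_000_000


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        total, mb_s = run(directory, num_threads=4, buffer_size=8 * 1024, write_count=256)
        print(f"wrote {total} bytes at {mb_s:.1f} MB/s")
```

The run described above (128 threads, 128 MB each, 8 KB buffers) would correspond to something like `run(directory, 128, 8 * 1024, 16384, direct=True)` on a filesystem that supports O_DIRECT. The absolute numbers from a Python version should not be expected to match the C program.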
The interim conclusion that I can draw from this experiment is that NVMes are pretty cool, but once you hit their limits you can really feel it. There is another aspect to consider, though: I’m running this on a disk that is literally called ephemeral storage. I need to repeat those tests on real hardware to verify whether the cloud disk simply ignores the command to persist properly and always uses the cache.
That is supported by the fact that using direct I/O on small data sizes didn’t have a big impact (and I expected that it would). Given that the point of direct I/O in this case is to force the disk to properly persist (so it would be durable in the case of a crash), while at the same time an ephemeral disk is wiped if the host machine is restarted, that gives me good reason to believe that these numbers are because the hardware “lies” to me.
In fact, if I were in charge of those disks, lying about the durability of writes would be the first thing I would do. Those disks are local to the host machine, so we have two failure modes that we need to consider:
- The VM crashed - in which case the disk is perfectly fine and “durable”.
- The host crashed - in which case the disk is considered lost entirely.
Therefore, there is no point in trying to achieve durability, so we can’t trust those numbers.
The next step is to run it on a real machine. The economics of benchmarks on cloud instances are weird. For a one-off scenario, the cloud is a godsend. But if you want to run benchmarks on a regular basis, it is far more economical to just buy a physical machine. Within a month or two, you’ll already see a return on the money spent.
We got a machine in the office called Kaiju (a Japanese term for enormous monsters, think: Godzilla) that has:
- 32 cores
- 188 GB RAM
- 2 TB NVMe for the system disk
- 4 TB NVMe for the data disk
I ran the same commands on that machine as well and got really interesting results.
Write size | Total writes | Buffered writes |
4 KB | 16 GB | 1.4 GB/s |
256 KB | 256 MB | 1.4 GB/s |
2 MB | 2 GB | 1.6 GB/s |
2 MB | 16 GB | 1.7 GB/s |
4 MB | 32 GB | 1.8 GB/s |
4 MB | 64 GB | 1.8 GB/s |
We are faster than the cloud instance, and we don’t have a drop-off point when we hit a certain size. We are also seeing higher performance when we throw bigger buffers at the system.
But when we test with small buffers, the performance is also great. That is amazing, but what about durable writes with direct I/O?
I tested the same scenario with both buffered and durable writes:
Mode | Buffered | Durable |
1 MB buffers, 8 GB write | 1.6 GB/s | 1.0 GB/s |
2 MB buffers, 16 GB write | 1.7 GB/s | 1.7 GB/s |
Wow, that is an interesting result, because it means that when we use direct I/O with 1 MB buffers, we lose about 600 MB/sec compared to buffered I/O. Note that this is actually a pretty good result. 1 GB/sec is amazing.
And if you use big buffers, then the cost of direct I/O is basically gone. What about when we go the other way around and use smaller buffers?
Mode | Buffered | Durable |
128 KB buffers, 8 GB write | 1.7 GB/s | 169 MB/s |
32 KB buffers, 2 GB write | 1.6 GB/s | 49.9 MB/s |
Parallel: 8, 1 MB, 8 GB | 5.8 GB/s | 3.6 GB/s |
Parallel: 8, 128 KB, 8 GB | 6.0 GB/s | 550 MB/s |
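To put the gap in the table above into a single number per row, the sketch below divides the buffered rate by the durable rate. The figures are copied from the table, with GB/s taken as 1,000 MB/s. The `slowdown` helper is just a name chosen for this example.

```python
def slowdown(buffered_mb_s: float, durable_mb_s: float) -> float:
    """How many times slower the durable run is compared to the buffered run."""
    return buffered_mb_s / durable_mb_s


# (scenario, buffered MB/s, durable MB/s), taken from the table above
rows = [
    ("128 KB buffers, 8 GB write", 1700, 169),
    ("32 KB buffers, 2 GB write", 1600, 49.9),
    ("Parallel: 8, 1 MB, 8 GB", 5800, 3600),
    ("Parallel: 8, 128 KB, 8 GB", 6000, 550),
]

for name, buffered, durable in rows:
    print(f"{name:28s} durable is {slowdown(buffered, durable):5.1f}x slower")
```

On these numbers, the durable penalty is about 10x for 128 KB buffers and about 32x for 32 KB buffers, but only about 1.6x for the parallel 1 MB case.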
For buffered I/O - I’m getting simply dreamy numbers, pretty much regardless of what I do