Fundamentals of Fast Bulk IO

1. IO buffers

Synchronous access

The simplest way for a program to read (or to write) a chunk of a file is to allocate a buffer for the data, issue a request to the OS and then sit there and wait until the request is fulfilled. Once the data is in (or out), proceed to the next step.

Since the program effectively stops running while waiting for the request to complete, this is called synchronous IO.

It is very simple to implement, it keeps the code nice and tidy and it is widely used in software for reading/writing files.
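As an illustration, here is a minimal sketch of a synchronous read loop in Python. The function name read_sync and the 512KB default are our own choices for the example, not something prescribed by the OS.

```python
def read_sync(path, buf_size=512 * 1024):
    """Read a whole file one chunk at a time, blocking on every request."""
    total = 0
    buf = bytearray(buf_size)                  # one reusable IO buffer
    # buffering=0 skips Python's own buffer layer, so every readinto()
    # below turns into a single request to the OS.
    with open(path, "rb", buffering=0) as f:
        while True:
            n = f.readinto(buf)                # blocks until the request is fulfilled
            if not n:                          # 0 bytes read means end of file
                break
            total += n                         # the "next step": process buf[:n] here
    return total
```

The program does nothing at all between issuing a request and getting its result, which is exactly what makes this version both simple and slow.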

However, if we want to read/write fast, we can do significantly better.

Asynchronous access

Instead of waiting for a request to complete, a program can make a note that the request is pending and move on to doing other things. Then, it will periodically check if the request is done and when it is, it will deal with the result.

Since we are no longer blocking on an IO request, this is called asynchronous IO.

Note that in terms of IO performance, this is so far exactly the same as the synchronous case, because there is still only one request in flight at any given time.
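The pattern can be sketched as follows. This is an illustration only: it uses a Python worker thread as a stand-in for a true OS-level asynchronous request, and the names read_chunk and read_async_single are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def read_chunk(path, offset, size):
    """Blocking read of one chunk; runs on a worker thread."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)

def read_async_single(path, buf_size=512 * 1024):
    """Keep exactly one request pending and poll it for completion."""
    total = polls = offset = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        while True:
            pending = pool.submit(read_chunk, path, offset, buf_size)
            while not pending.done():   # periodically check if the request is done
                polls += 1              # stand-in for "doing other things"
                time.sleep(0)           # yield to other threads
            data = pending.result()     # deal with the result
            if not data:                # empty result means end of file
                break
            total += len(data)
            offset += len(data)
    return total, polls
```

Only one request is ever outstanding here, so the storage device still idles between requests, just as it does in the synchronous version.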

Asynchronous, multiple buffers

Now that we are free to do something else while our request is pending, what we can do is submit another request. And then, perhaps, a few more, all of which will be pending and queued somewhere in the guts of the OS.

This ensures that once the OS is done with one request, it immediately has another one to process.

This eliminates idling when reading/writing data from the storage device, so we have data flowing through the file stack continuously.
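A minimal sketch of the idea, again using Python worker threads as a stand-in for OS-level asynchronous requests. The names read_pipelined and depth are ours; depth is the number of buffers kept in flight.

```python
import os
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def read_chunk(path, offset, size):
    """Blocking read of one chunk; each request opens its own handle for simplicity."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)

def read_pipelined(path, buf_size=512 * 1024, depth=4):
    """Read a file while keeping up to `depth` requests pending at all times."""
    offsets = iter(range(0, os.path.getsize(path), buf_size))
    pending = deque()
    total = 0
    with ThreadPoolExecutor(max_workers=depth) as pool:
        # Prime the queue with the first `depth` requests.
        for off in islice(offsets, depth):
            pending.append(pool.submit(read_chunk, path, off, buf_size))
        while pending:
            data = pending.popleft().result()   # oldest first, so chunks arrive in file order
            total += len(data)                  # process the chunk here
            nxt = next(offsets, None)
            if nxt is not None:                 # top the queue back up right away
                pending.append(pool.submit(read_chunk, path, nxt, buf_size))
    return total
```

Every completed request is replaced with a new one before its data is processed, so the queue never runs dry until the end of the file is reached.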

Knowing when to stop

It may seem that if we just throw a boatload of requests at the OS, it should allow us to go through a file as quickly as possible.

However, there's really no point in having too many requests in the queue, because past a certain point it simply doesn't give us any faster processing.

All we need is to make sure the request queue is never empty, so if we can achieve that with as few requests as possible, we'll have the fastest processing rate and the lowest memory usage.
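One way to put this into practice is to measure the throughput for several queue depths and pick the smallest one that is close enough to the best. The sketch below assumes the measurements already exist; the function names and the 5% default tolerance are our own choices for the example.

```python
def buffer_memory(depth, buf_size):
    """Memory held by the IO buffers: one buffer per pending request."""
    return depth * buf_size

def smallest_sufficient_depth(rates, tolerance=0.05):
    """Pick the smallest queue depth whose rate is within `tolerance` of the best.

    `rates` maps a queue depth to its measured throughput (any consistent unit).
    Raises ValueError if `rates` is empty.
    """
    best = max(rates.values())
    for depth in sorted(rates):                  # try the shallowest queues first
        if rates[depth] >= best * (1 - tolerance):
            return depth
```

Since memory usage grows linearly with the queue depth, stopping at the first depth that gets close to the peak rate keeps the buffer footprint as small as possible.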

2. IO buffer size

Another question is how much data to request in one go.

If we ask for too little, the OS may end up reading more than asked for and then trimming the result to fit it into our buffer.

For example, NTFS defaults to a 4KB cluster size on desktop versions of Windows, so asking for a smaller chunk is going to be wasteful.

If we ask for too much, it may translate into several requests further down the file stack and it's not likely to get us any speed gains.
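To make the "too little" case concrete, here is a small sketch of the arithmetic. It assumes, as the NTFS example does, that data is fetched in whole 4KB units; the function names are hypothetical.

```python
def clusters_touched(offset, length, cluster_size=4096):
    """Number of whole clusters that a request of `length` bytes at `offset` spans."""
    if length <= 0:
        return 0
    first = offset // cluster_size
    last = (offset + length - 1) // cluster_size
    return last - first + 1

def read_efficiency(offset, length, cluster_size=4096):
    """Fraction of the fetched bytes that the caller actually asked for."""
    fetched = clusters_touched(offset, length, cluster_size) * cluster_size
    return length / fetched if fetched else 0.0
```

Under this assumption, a 512-byte request at offset 0 still pulls in a full 4096-byte cluster, so only one eighth of the fetched data ends up being used. A request that is a multiple of the cluster size and starts on a cluster boundary wastes nothing.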

3. IO mode

Buffered and unbuffered access

Windows has two principal modes for accessing files - the so-called buffered and unbuffered IO.

In buffered mode all requests pass through the Cache Manager, which does just what its name implies - it tries to fulfill read requests from the cache and aggregate/delay write requests when appropriate.

In unbuffered mode requests are always fulfilled with an actual disk read/write. They still pass through the Cache Manager to expire any cached copies of the same data, but this comes with less overhead than in buffered mode.

Sequential vs random access

Windows also allows programs to indicate if they are planning to work with a file in a sequential manner, which happens to be an exceedingly common pattern. This is what happens when files are saved, loaded, copied and when programs are launched.

* This is also a reason why CCSIO has the S in the middle - it is concerned with testing IO performance for this particular access type.

The other pattern is random access. This is when a program jumps around a file when reading or writing it. This is what database applications do, mostly.

Sequential access and the Cache Manager

Opening a file for sequential IO acts as a hint to the Cache Manager and allows it to apply a different caching strategy.

In particular, the Cache Manager will pre-fetch data when possible and it will also discard cached data more aggressively.

Conversely, using buffered non-sequential access for large files is a recipe for trashing the file cache. One of those "use with care" things.

In theory, this setting - declaring the intended access pattern - should only matter for buffered (cached) access.

In practice, it happens to have a noticeable effect on performance of the unbuffered access too, so we have to consider and test for this as well.
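For reference, in Win32 terms both choices are made when a file is opened, through the flags passed to CreateFile. The sketch below only composes the flag value and does not call the API. The constants are the documented Win32 ones, while the helper open_flags and its parameters are our own invention for the example.

```python
# Win32 CreateFile flag values (from winbase.h).
FILE_FLAG_OVERLAPPED      = 0x40000000   # asynchronous (overlapped) IO
FILE_FLAG_NO_BUFFERING    = 0x20000000   # unbuffered mode, bypasses the file cache
FILE_FLAG_RANDOM_ACCESS   = 0x10000000   # hint: random access
FILE_FLAG_SEQUENTIAL_SCAN = 0x08000000   # hint: sequential access

def open_flags(unbuffered=False, access="sequential", overlapped=True):
    """Compose a dwFlagsAndAttributes value for a given IO mode.

    `access` is "sequential", "random" or "none" (no hint given).
    """
    flags = 0
    if unbuffered:
        flags |= FILE_FLAG_NO_BUFFERING
    if access == "sequential":
        flags |= FILE_FLAG_SEQUENTIAL_SCAN
    elif access == "random":
        flags |= FILE_FLAG_RANDOM_ACCESS
    elif access != "none":
        raise ValueError(f"unknown access pattern: {access!r}")
    if overlapped:
        flags |= FILE_FLAG_OVERLAPPED
    return flags
```

Since the access hint has a measurable effect even in unbuffered mode, it makes sense to treat the two settings as independent and to test every combination of them.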

4. Summed up

Performance of any bulk IO operation depends on three principal variables - the size of IO buffers, their count and the IO mode.

Some combinations of these parameters will yield a subpar performance, while others will push the stack to its limit and deliver the best IO rate possible.
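Finding the good combinations means walking through the whole parameter space. A minimal sketch of building such a test matrix is below. The specific sizes and counts are placeholders picked for the example, not recommended values, and build_matrix is a hypothetical helper.

```python
from itertools import product

# Placeholder values for illustration only; adjust to the setup being tested.
BUFFER_SIZES = [64 * 1024, 256 * 1024, 512 * 1024, 1024 * 1024, 4 * 1024 * 1024]
BUFFER_COUNTS = [1, 2, 4, 8]
IO_MODES = [
    ("buffered", "sequential"),
    ("buffered", "random"),
    ("unbuffered", "sequential"),
    ("unbuffered", "random"),
]

def build_matrix(sizes=BUFFER_SIZES, counts=BUFFER_COUNTS, modes=IO_MODES):
    """Every combination of buffer size, buffer count and IO mode to be tested."""
    return [
        {"buf_size": s, "buf_count": c, "caching": caching, "access": access}
        for s, c, (caching, access) in product(sizes, counts, modes)
    ]
```

Each entry describes one test run. With the placeholder values above this gives 5 x 4 x 4 = 80 runs per device.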

Based on what we've seen so far, there's no universal recipe that works well across all device and volume types. Moreover, even looking at device classes - HDDs, SSDs, network shares, etc. - there don't appear to be any dominant combinations either. To each (device) its own.

That said, using larger buffers (from 512KB) and unbuffered mode seems to usually deliver rates that are within 5-10% of maximum achievable throughput.

If in doubt, test your setup with CCSIO Bench and derive your own conclusions.

□