Fundamentals of Fast Bulk IO
1. IO buffers
The simplest way for a program to read (or to write) a chunk
of file is to allocate a buffer for the data, issue a request
to the OS and then sit there and wait until the request is completed.
Once the data is in (or out), proceed to the next step.
Since the program effectively stops running while
waiting for the request to complete, this is called
synchronous (or blocking) IO.
It is very simple to implement, it keeps the code
nice and tidy and it is widely used in software
for reading/writing files.
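As a minimal sketch of the blocking approach, assuming Python and a
hypothetical helper named read_file_sync (neither is from the text above),
the loop below reuses one buffer and counts bytes as a stand-in for real
processing:

```python
def read_file_sync(path, buffer_size=64 * 1024):
    """Read a file chunk by chunk, blocking on every request.

    Hypothetical helper for illustration; returns the number of bytes read.
    """
    total = 0
    buf = bytearray(buffer_size)          # the one and only IO buffer
    # buffering=0 gives a raw file object, so each readinto() is one OS request
    with open(path, "rb", buffering=0) as f:
        while True:
            n = f.readinto(buf)           # blocks until the OS completes the read
            if not n:                     # 0 bytes means end of file
                break
            total += n                    # "process" the data: here we only count it
    return total
```

The program does nothing else while readinto() is in progress, which is
exactly the property the text describes.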
However, if we want to read/write fast
we can do significantly better.
Instead of waiting for a request to complete,
a program can make a note that the request is pending
and move on to doing other things. Then, it will
periodically check if the request is done and when it
is, it will deal with the result.
Since we are no longer blocking on an IO
request, this is called asynchronous IO.
Note that in terms of IO performance, this is
so far exactly the same as the synchronous case,
because there is still only one request in flight at a time.
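A rough sketch of the single-request asynchronous pattern, assuming Python
and using a worker thread as a stand-in for the OS-level asynchronous
request (on Windows a real implementation would more likely use overlapped
IO). The names read_chunk and read_file_async_single are hypothetical:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_chunk(path, offset, size):
    """Read up to `size` bytes at `offset`; returns b"" at end of file."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)


def read_file_async_single(path, buffer_size=64 * 1024):
    """Keep exactly one request pending and poll it instead of blocking.

    Returns (bytes_read, polls), where polls counts how many times we
    checked on a request that was not done yet.
    """
    total = 0
    polls = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(read_chunk, path, 0, buffer_size)
        while True:
            if not pending.done():
                polls += 1        # stand-in for "doing other things"
                time.sleep(0)     # yield so the worker thread can run
                continue
            data = pending.result()
            if not data:          # empty result means end of file
                break
            total += len(data)
            # Only after the previous request finished do we submit the next,
            # so there is never more than one request in flight.
            pending = pool.submit(read_chunk, path, total, buffer_size)
    return total, polls
```

Because the next request is submitted only after the previous one
completes, the storage device still idles between requests.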
Asynchronous, multiple buffers
Now that we are free to do something else while our
request is pending, what we can do is submit
another request. And then, perhaps, a few more,
all of which will be pending and queued somewhere
in the guts of the OS.
What this does is it ensures that once the OS
is done with one request, it will immediately
have another one to process.
This eliminates idling when reading/writing data
from the storage device, so we have data flowing
through the file stack continuously.
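The idea of keeping several requests queued can be sketched as follows.
This is an illustration under assumptions, not a definitive
implementation: it uses Python worker threads in place of OS-level
asynchronous requests, and read_chunk, read_file_queued and queue_depth
are hypothetical names.

```python
import os
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def read_chunk(path, offset, size):
    """Read up to `size` bytes starting at `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)


def read_file_queued(path, buffer_size=64 * 1024, queue_depth=4):
    """Read a file while keeping up to `queue_depth` requests pending."""
    file_size = os.path.getsize(path)
    next_offset = 0
    total = 0
    pending = deque()
    with ThreadPoolExecutor(max_workers=queue_depth) as pool:
        while pending or next_offset < file_size:
            # Top up the queue so there is always another request waiting
            # the moment one of them completes.
            while len(pending) < queue_depth and next_offset < file_size:
                pending.append(
                    pool.submit(read_chunk, path, next_offset, buffer_size))
                next_offset += buffer_size
            # Consume results in submission order (oldest request first).
            data = pending.popleft().result()
            total += len(data)
    return total
```

With queue_depth=1 this degenerates into the single-request case; larger
values trade memory (one buffer per pending request) for a queue that is
less likely to run dry.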
Knowing when to stop
It may seem that if we just throw a boatload
of requests at the OS, it should allow us to
go through a file as quickly as possible.
However, there's really no point in having
too many requests in a queue, because it simply
doesn't make the processing any faster.
What we need is merely to make sure the request queue
is never empty, so if we can achieve that with as
few requests as possible, we'll have the fastest
processing rate and the lowest memory usage.
2. IO buffer size
Another question is how much data to request in one
go.
If we ask for too little, the OS may end up reading
more than asked and then trimming the result to fit
it into our buffer.
For example, NTFS defaults to 4KB cluster size
on desktop versions of Windows, so asking for a
smaller chunk is going to be wasteful.
If we ask for too much, it may translate into
several requests further down the file stack
and it's not likely to get us any speed gains.
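One way to avoid asking for too little is to round request sizes up to a
whole number of clusters. The helper below is a small sketch; the name
round_up_to_cluster is hypothetical, and the 4096-byte default merely
mirrors the NTFS example above (the real cluster size should be queried
from the volume in question).

```python
def round_up_to_cluster(size, cluster_size=4096):
    """Round a request size up to the next multiple of the cluster size."""
    if size <= 0 or cluster_size <= 0:
        raise ValueError("size and cluster_size must be positive")
    # Integer ceiling division, then scale back up to bytes.
    clusters = (size + cluster_size - 1) // cluster_size
    return clusters * cluster_size
```

For instance, a 1000-byte request becomes 4096 bytes and a 4097-byte
request becomes 8192 bytes, while sizes that are already cluster multiples
are left unchanged.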
3. IO mode
Buffered and unbuffered access
Windows has two principal modes for accessing
files - the so-called buffered and unbuffered IO.
In buffered mode, all requests pass through
the Cache Manager, which does what it says on the
tin - it tries to fulfill read requests from the
cache and aggregate/delay write requests when
possible.
In unbuffered mode, requests are always
fulfilled with an actual disk read/write. They
still pass through the Cache Manager, but
presumably with far less overhead than in
buffered mode.
Sequential vs random access
Windows also allows programs to indicate if they
are planning to work with a file in a sequential
fashion, which happens to be an exceedingly common
access pattern.
This is what happens when files are saved,
loaded, copied and when programs are launched.
This is also the reason why CCSIO Bench has the S
in the middle - it is concerned with testing
IO performance for this particular access type.
The opposite pattern is random
access, whereby a program routinely jumps
all over the file when reading/writing it.
This is what database applications do, mostly.
Sequential access and the Cache Manager
Opening a file for sequential IO acts as a hint
to the Cache Manager and allows it to apply
a different caching strategy.
In particular, the Cache Manager will pre-fetch
data when possible and it will also discard
cached file bits more aggressively.
Conversely, using buffered non-sequential
access for large files is a recipe for trashing
the file cache. One of those "use with care" things.
In theory, this setting should only matter for
buffered (cached) access.
In practice, it has a noticeable effect on the
performance of unbuffered access as well,
so we have to test for it too.
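For illustration, here is how an access hint might be passed from Python.
This is a sketch under assumptions: os.O_SEQUENTIAL and os.O_RANDOM exist
only on Windows builds of Python, so the code falls back to no hint
elsewhere, and open_with_access_hint is a hypothetical helper name. It
covers buffered access only; unbuffered access is not shown.

```python
import os


def open_with_access_hint(path, sequential=True):
    """Open a file for reading with a sequential or random access hint.

    On platforms without these flags, getattr() yields 0 and the file is
    opened without any hint.
    """
    flags = os.O_RDONLY | getattr(os, "O_BINARY", 0)
    hint_name = "O_SEQUENTIAL" if sequential else "O_RANDOM"
    flags |= getattr(os, hint_name, 0)
    fd = os.open(path, flags)
    # Wrap the descriptor in a raw (unbuffered at the Python level) file object.
    return os.fdopen(fd, "rb", buffering=0)
```

The hint changes nothing about how the file is read afterwards; it only
tells the OS what access pattern to expect.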
4. Summed up
Performance of any bulk IO operation depends on
three principal variables - the size of IO buffers,
their count and the IO mode.
Some combinations of these parameters will yield
subpar performance, while others will push the
stack to its limit and deliver the best IO rate
possible.
Based on what we've seen so far, there's no universal
combination that works well across all device and
volume types. Moreover, even looking at distinct
device classes - HDDs, SSDs, network shares, etc.
- there don't appear to be any dominant
combinations either. To each (device) its own.
That said, using larger buffers
(from 512KB) and
seems to usually deliver
rates that are within 5-10% of the maximum achievable.
If in doubt, test your setup with CCSIO Bench
and derive your own conclusions.
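For a rough idea of what testing a setup involves, the sketch below sweeps
buffer sizes and request counts and reports a read rate for each
combination. It is an assumption-laden illustration in Python, not how
CCSIO Bench works: worker threads stand in for OS-level asynchronous
requests, IO mode is not varied, and read_chunk, read_file_queued and
sweep are hypothetical names. Repeated runs over the same file will mostly
be served from the file cache, so the numbers are only meaningful on data
that is not already cached.

```python
import os
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def read_chunk(path, offset, size):
    """Read up to `size` bytes starting at `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)


def read_file_queued(path, buffer_size, queue_depth):
    """Read the whole file keeping up to `queue_depth` requests pending."""
    file_size = os.path.getsize(path)
    next_offset, total, pending = 0, 0, deque()
    with ThreadPoolExecutor(max_workers=queue_depth) as pool:
        while pending or next_offset < file_size:
            while len(pending) < queue_depth and next_offset < file_size:
                pending.append(
                    pool.submit(read_chunk, path, next_offset, buffer_size))
                next_offset += buffer_size
            total += len(pending.popleft().result())
    return total


def sweep(path, buffer_sizes, queue_depths):
    """Return {(buffer_size, queue_depth): bytes_per_second}."""
    results = {}
    for size in buffer_sizes:
        for depth in queue_depths:
            start = time.perf_counter()
            nbytes = read_file_queued(path, size, depth)
            # Guard against a zero interval on very small files.
            elapsed = max(time.perf_counter() - start, 1e-9)
            results[(size, depth)] = nbytes / elapsed
    return results
```

A call such as sweep(path, [64 * 1024, 512 * 1024], [1, 2, 4]) returns one
rate per combination, which can then be compared against the best one
observed.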