A 25x speedup looks suspicious. Even 2.5x would be suspicious unless you're reading very small records.

I assume both cases have the file, a tiny 100MB, already fully cached in RAM. But the file-read-based version actually copies the data into a given buffer, which involves cache misses to get the data from RAM into L1 for the copy. The mmap version just returns the slice, which is discarded immediately; the actual data is never touched. Each record is two cache lines, and with random indices it isn't prefetched. For the AMD Ryzen 7 9800X3D mentioned in the repo, just reading 100 bytes from RAM into L1 should take ~100 ns.

The benchmark compares actually getting the data vs. getting the data's location. Single-digit nanoseconds is the scale of good hash table lookups with data already in CPU caches, not of actual IO. For fairness, both versions should use/touch the data, e.g. copy it.
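For illustration, a minimal Go benchmark sketch of the fairer comparison I mean; the names (f, data, offsets, recordSize, sink) are hypothetical stand-ins, not taken from the repo. Both variants copy the record into a caller-owned buffer, so both pay for actually touching the bytes:

```go
package bench

import (
	"os"
	"testing"
)

// Hypothetical harness state; the real benchmark would set these up in TestMain.
const recordSize = 100

var (
	f       *os.File // the data file
	data    []byte   // the same file mmapped (e.g. via golang.org/x/sys/unix.Mmap)
	offsets []int64  // pre-generated random record offsets
	sink    int      // accumulate results to defeat dead-code elimination
)

// Both variants copy the record into buf, so both pay the cost of touching the data.
func BenchmarkPreadCopy(b *testing.B) {
	buf := make([]byte, recordSize)
	for i := 0; i < b.N; i++ {
		off := offsets[i%len(offsets)]
		if _, err := f.ReadAt(buf, off); err != nil { // pread(2) under the hood
			b.Fatal(err)
		}
		sink += int(buf[0])
	}
}

func BenchmarkMmapCopy(b *testing.B) {
	buf := make([]byte, recordSize)
	for i := 0; i < b.N; i++ {
		off := offsets[i%len(offsets)]
		copy(buf, data[off:off+recordSize]) // touch the bytes, just like the read path
		sink += int(buf[0])
	}
}
```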
doing these sorts of benchmarks is actually quite tricky. you must clear the page cache by allocating >1x physical ram before each attempt.
moreover, mmap by default will load lazily, whereas mmap with MAP_POPULATE will prefetch. in the former case, reporting average operation times is not valid because the access time distributions are not gaussian (they have a one-time big hit at first touch). with MAP_POPULATE (linux only), there is a long loading delay when mmap is first called, but then the average access times will be very low. when pages are released is determined by the operating system's page cache eviction policy.
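as a concrete sketch (my own illustration, linux-only): mapping a file with MAP_POPULATE in go via golang.org/x/sys/unix, so the page-in cost is paid once at mmap time instead of at first touch. the file name is made up.

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// mapPopulated maps the whole file read-only and asks the kernel to fault the
// pages in up front (MAP_POPULATE is Linux-specific).
func mapPopulated(path string) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close() // the mapping outlives the fd

	st, err := f.Stat()
	if err != nil {
		return nil, err
	}
	return unix.Mmap(int(f.Fd()), 0, int(st.Size()),
		unix.PROT_READ, unix.MAP_SHARED|unix.MAP_POPULATE)
}

func main() {
	data, err := mapPopulated("records.dat") // hypothetical file name
	if err != nil {
		panic(err)
	}
	defer unix.Munmap(data)
	fmt.Println("mapped", len(data), "bytes, pages pre-faulted")
}
```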
the data structure on top is best chosen based on desired runtime characteristics. if it's all going in ram, go ahead and use a standard randomized hash table. if it's too big to fit in ram, designing a structure that is aware of lru style page eviction semantics may make sense (ie, a hash table or other layout that preserves locality for things that are expected to be accessed in a temporally local fashion.)
> For the CPU AMD Ryzen 7 9800X3D mentioned in the repo, just reading 100 bytes from RAM to L1 should take ~100 nanos.
I think this is the wrong order of magnitude. One core of my Ryzen 5 3500U seems to be able to run memcpy() at 10 gigabytes per second (0.1 nanoseconds per byte) and memset() at 31 gigabytes per second (0.03 nanoseconds per byte). I'd expect a sequential read of 100 bytes to take about 3 nanoseconds, not 100 nanoseconds.
However, I think random accesses do take close to 100 nanoseconds to transmit the starting row and column address and open the row. I haven't measured this on this hardware because I don't have a test I'm confident in.
100 nanoseconds from RAM is correct. Latency != bandwidth. 3 nanoseconds would be from cache or so on a Ryzen. You ain't gonna get the benefits of prefetching on the first 100 bytes.
Yes, my comment clearly specified that I was talking about sequential reads, which do get the benefits of prefetching, and said, "I think random accesses do take close to 100 nanoseconds".
If you're doing large amounts of sequential reads from a filesystem, the data is probably not in cache. You only get latency that low if you're doing nothing else that stresses the memory subsystem, which is rather unlikely. Real applications have overhead, which is why microbenchmarks like this are useless. Microbenchmarks are not the best first-order estimate for programmers to reason from.
Yes, I went into more detail on those issues in https://news.ycombinator.com/item?id=45689464, but overhead is irrelevant to the issue we were discussing, which is about how long it takes to read 100 bytes from memory. Microbenchmarks are generally exactly the right way to answer that question.
Memory subsystem bottlenecks are real, but even in real applications, it's common for the memory subsystem to not be the bottleneck. For example, in this case we're discussing system call overhead, which tends to move the system bottleneck inside the CPU (even though a significant part of that effect is due to L1I cache evictions).
Moreover, even if the memory subsystem is the bottleneck, on the system I was measuring, it will not push the sequential memory access time anywhere close to 1 nanosecond per byte. I just don't have enough cores to oversubscribe the memory bus 30×. (1.5×, I think.) Having such a large ratio of processor speed to RAM interconnect bandwidth is in fact very unusual, because it tends to perform very poorly in some workloads.
If microbenchmarks don't give you a pretty good first-order performance estimate, either you're doing the wrong microbenchmarks or you're completely mistaken about what your application's major bottlenecks are (plural, because in a sequential program you can have multiple "bottlenecks", colloquially, unlike in concurrent systems where you almost always have exactly one bottleneck). Both of these problems do happen often, but the good news is that they're fixable. But giving up on microbenchmarking will not fix them.
If you're bottlenecked on a 100 byte read, the app is probably doing something really stupid, like not using syscalls the way they're supposed to. Buffered I/O has existed from fairly early on in Unix history, and it exists because it is needed to deal with the mismatch between what stupid applications want to do versus the guarantees the kernel has to provide for file I/O.
The main benefit from the mmap approach is that the fast path then avoids all the code the kernel has to execute, the data structures the kernel has to touch, and everything needed to ensure the correctness of the system. In modern systems that means all kinds of synchronization and serialization of the CPU needed to deal with $randomCPUdataleakoftheweek (pipeline flushes ftw!).
However, real applications need to deal with correctness. For example, a real database is not just going to do 100-byte reads of records. It's going to have to take measures (locks) to ensure the data isn't being written to by another thread.
Rarely is it just a sequential read of the next 100 bytes from a file.
I'm firmly in the camp that focusing on microbenchmarks like this is frequently a waste of time in the general case. You have to look at the application as a whole first. I've implemented optimizations that looked great in a microbenchmark, but showed absolutely no difference whatsoever at the application level.
Moreover, my main hatred for mmap() as a file I/O mechanism is that it moves the context switches that happen when the data is not present in RAM from somewhere obvious (doing a read() or pread() system call) to somewhere implicit (reading 100 bytes from memory that happens to be mmap()ed and was passed as a pointer to a function written by some other poor unknowing programmer). Additionally, readahead for mmap()s when bringing data into RAM is quite a bit slower than for read()s, in large part because the application is no longer providing a hint (the size argument to the read() syscall) telling the kernel how much data to bring in (and if everything is sequential as you claim, your code really should know that ahead of time).
So, sure, your 100 byte read in the ideal case when everything is cached is faster, but warming up the cache is now significantly slower. Is shifting costs that way always the right thing to do? Rarely in my experience.
And if you don't think about it (as there's no obvious pread() syscall anymore), those microseconds and sometimes milliseconds to fault in the page for that 100 byte read will hurt you. It impacts your main event loop, the size of your pool of processes / threads, etc. The programmer needs to think about these things, and the article mentioned none of this. This makes me think that the author is actually quite naive and merely proud in thinking that he discovered the magic Go Faster button without having been burned by the downsides that arise in the Real World from possible overuse of mmap().
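To make the hint point concrete, here's a rough Go sketch (my own illustration, not something the article does) of the main hint mechanism mmap does offer, madvise(2) via golang.org/x/sys/unix; whether it's an adequate substitute for read()'s explicit size argument is debatable. `data` is assumed to be an mmapped []byte.

```go
package mmaphints

import "golang.org/x/sys/unix"

// adviseAccessPattern tells the kernel how the mapping will be accessed,
// which controls how aggressively it reads ahead.
func adviseAccessPattern(data []byte, sequential bool) error {
	if sequential {
		// Front-to-back scan: let the kernel read ahead aggressively.
		return unix.Madvise(data, unix.MADV_SEQUENTIAL)
	}
	// Random point lookups: avoid dragging in pages we won't use.
	return unix.Madvise(data, unix.MADV_RANDOM)
}

// prefault asks the kernel to start paging in a sub-range we know we'll need soon.
func prefault(region []byte) error {
	return unix.Madvise(region, unix.MADV_WILLNEED)
}
```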
Yeah, 3.3ns is about 12 CPU cycles. You can indeed create a pointer to a memory location that fast!
the downside is that the go runtime doesn't expect memory reads to page fault, so you may end up with stalls/latency/under-utilization if part of your dataset is paged out (like if you have a large cdb file w/ random access patterns). Using file IO, the Go runtime could be running a different goroutine if there is a disk read, but with mmap that thread is descheduled but holding an m & p. I'm also not sure if there would be increased stop the world pauses, or if the async preemption stuff would "just work".
Section 3.2 of this paper has more details: https://db.cs.cmu.edu/papers/2022/cidr2022-p13-crotty.pdf
To me this indicates a limitation of the API. Cause you do want to maintain that the kernel can page out that memory under pressure while userspace accesses that memory asynchronously while allowing the thread to do other asynchronous things. There’s no good programming model/OS api that can accomplish this today.
If C had exceptions a page fault could safely unwind the stack up to the main loop which could work on something else until the page arrives. This has the advantage that there's no cost for the common case of accessing resident pages. Exceptions seem to have fallen out of favor so this may trade one problem for another.
There is no sensible OS API that could support this, because fundamentally memory access is a hardware API. The OS isn't involved in normal memory reads, because that would be ludicrously inefficient: it would effectively require a syscall for every memory operation, which means a syscall for any operation involving data, i.e. all operations.
Memory operations are always synchronous because they're performed directly as a consequence of CPU instructions. Reading memory that's been paged out results in the CPU itself detecting that the virtual address isn't in RAM and raising a hardware-level fault: literally abandoning a CPU instruction mid-execution to start executing an entirely separate set of instructions which will hopefully sort out the page fault that just occurred, then kindly asking the CPU to go back and repeat the operation that caused the fault.
The OS is involved only because it's the thing that provided the handling instructions for the CPU to execute in the event of a page fault. But it's not in any way capable of changing how the CPU initially handles the page fault.
Also, the current model does allow other threads to continue executing other work while the page fault is handled. The fault is completely localised to the individual thread that triggered it. The CPU has no concept of the idea that multiple threads running on different physical cores are in any way related to each other. It also wouldn't make sense to allow the interrupted thread to somehow kick off a separate asynchronous operation, because where is it going to execute? The CPU core where the page fault happened is needed to handle the actual page fault and copy in the needed memory. So even if you could kick off an async operation, there wouldn't be any available CPU cycles to carry out the operation.
Fundamentally there aren’t any sensible ways to improve on this problem, because the problem only exists due to us pretending that our machines have vastly more memory than they actually do. Which comes with tradeoffs, such as having to pause the CPU and steal CPU time to maintain the illusion.
If people don’t like those tradeoffs, there’s a very simple solution. Put enough memory in your machine to keep your entire working set in memory all the time. Then page faults can never happen.
> There is no sensible OS API that could support this, because fundamentally memory access is a hardware API.
Not only is there a sensible OS API that could support this, Linux already implements it: the SIGSEGV signal. The default way to respond to a SIGSEGV is by exiting the process with an error, but Linux provides the signal handler with enough information to do something sensible with it. For example, it could map a fresh page at the faulting address, enqueue an asynchronous I/O to fill it, put the current green thread to sleep until the I/O completes, and context-switch to a different green thread.
Invoking a signal handler only has about the same inherent overhead as a system call. But then the signal handler needs another couple of system calls. So on Linux this is over a microsecond in all. That's probably acceptable, but it's slower than just calling pread() and having the kernel switch threads.
Some garbage-collected runtimes do use SIGSEGV handlers on Linux, but I don't know of anything using this technique for user-level virtual memory. It's not a very popular technique in part because, like inotify and epoll, it's nonportable; POSIX doesn't specify that the signal handler gets the arguments it would need, so running on other operating systems requires extra work.
im3w1l also mentions userfaultfd, which is a different nonportable Linux-only interface that can solve the same thing but is, I think, more efficient.
> There is no sensible OS API that could support this, because fundamentally memory access is a hardware API.
there's nothing magic about demand paging, faulting is one way it can be handled
another could be that the OS could expose the present bit on the PTE to userland, and it has to check it itself, and linux already has asynchronous "please back this virtual address" APIs
> Memory operations are always synchronous because they’re performed directly as a consequence of CPU instructions.
although most CPU instructions may look synchronous, they really aren't; the memory controller is quite sophisticated
> Fundamentally there aren’t any sensible ways to improve on this problem, because the problem only exists due to us pretending that our machines have vastly more memory than they actually do. Which comes with tradeoffs, such as having to pause the CPU and steal CPU time to maintain the illusion.
modern demand paging is one possible model that happens to be near universal amongst operating systems today
there are many, many other architectures that are possible...
I think you have a misunderstanding of how disk IO happens. The CPU core sends a command to the disk, "I want this and that data", then the CPU core can go do something else while the disk services that request. From what I read, the disk actually puts the data directly into memory using DMA, without needing to involve the CPU.

So far so good, but then the question is how to ensure that the CPU core has something more productive to do than just check "did the data arrive yet?" over and over, and coordinating that is where good APIs come in.
(Not the person you are replying to.)
There is nothing in the sense of Python async or JS async that the OS thread or OS process in question could usefully do on the CPU until the memory is paged into physical RAM. DMA or no DMA.
The OS process scheduler can run another process or thread. But your program instance will have to wait. That’s the point. It doesn’t matter whether waiting is handled by a busy loop a.k.a. polling or by a second interrupt that wakes the OS thread up again.
That is why Linux calls it uninterruptible sleep.
EDIT: io_uring would of course change your thread from blocking syscalls to non-blocking syscalls. Page faults are not a syscall, as GP pointed out. They are, however, a context-switch to an OS interrupt handler. That is why you have an OS. It provides the software drivers for your CPU, MMU, and disks/storage. Here this is the interrupt handler for a page fault.
What everyone forgets is just how expensive context switches are on modern x86 CPUs. Those 512 bit vector registers fill up a lot of cache lines. That's why async tends to win over processes / threads for many workloads.
It's hard to say on one hand "I use mmap because I don't want fancy APIs for every read" and on the other "I want to do something useful on page fault", because you don't want to make every memory read a possible interruption point.
I think you have a misunderstanding of how the OS is signaled about disk I/O being necessary. Most of the post above was discussing that aspect of it, before the OS even sends the command to the disk.
There are apis that sort of let you do it: mincore, madvise, userfaultfd.
None of those APIs are cheap enough to call in a fast path.
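For reference, a rough Go sketch (my own illustration, Linux-only, via golang.org/x/sys/unix) of the mincore check mentioned above: it reports which pages of a mapping are resident, but it costs a syscall per query, which is why it's too expensive to put on a per-read fast path.

```go
package residency

import (
	"os"

	"golang.org/x/sys/unix"
)

// residentPages reports, for each page of the mapped region, whether it is
// currently resident in memory. data must be an mmapped []byte.
func residentPages(data []byte) ([]bool, error) {
	pageSize := os.Getpagesize()
	n := (len(data) + pageSize - 1) / pageSize
	vec := make([]byte, n)
	if err := unix.Mincore(data, vec); err != nil {
		return nil, err
	}
	out := make([]bool, n)
	for i, b := range vec {
		out[i] = b&1 == 1 // low bit set => page resident
	}
	return out, nil
}
```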
When I adopted mmap in klevdb [1], I saw a dramatic performance improvement. So, even as klevdb completes a write segment, it will reopen, on demand, the segment for reading with mmap (segments are basically parts of a write-only log). With this, any random reads are super fast (though of course not as fast as sequential ones).
[1] https://github.com/klev-dev/klevdb
The MmapReader is not copying the requested byte range into the buf argument, so if ever the underlying file descriptor is closed (or the file truncated out of band) any subsequent slice access will throw SIGBUS, which is really unpleasant.
It also means the latency due to page faults is shifted from inside mmapReader.ReadRecord() (where it would be expected) to wherever in the application the bytes are first accessed, leading to spooky, unpredictable latency spikes in what are otherwise pure functions. That inevitably leads to wild arguments about how bad GC stalls are :-)
An apples to apples comparison should be copying the bytes from the mmap buffer and returning the resulting slice.
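For example, a sketch of what a copying variant might look like; the type and method here are hypothetical stand-ins echoing the names above, not code from the repo:

```go
package cdb

import "io"

// MmapReader is a stand-in for the article's reader type; data is assumed to
// hold the mmapped file contents.
type MmapReader struct {
	data []byte
}

// ReadRecordCopy copies the record out of the mapping into buf. The page fault
// (if any) happens here, inside the reader, and the returned slice stays valid
// even if the mapping is later unmapped or the file is truncated out of band.
func (r *MmapReader) ReadRecordCopy(off int64, buf []byte) ([]byte, error) {
	end := off + int64(len(buf))
	if off < 0 || end > int64(len(r.data)) {
		return nil, io.ErrUnexpectedEOF
	}
	n := copy(buf, r.data[off:end]) // touches the bytes now, like read() would
	return buf[:n], nil
}
```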
> so if ever the underlying file descriptor is closed
Nit: Mmap mapping lifetimes are not attached to the underlying fd. The file truncation and latency concerns are valid, though.
It’s not accessible until it is in user space. (Virtual memory addresses mapped to physical RAM holding the data.)
Good point.
mmap is a good crutch when you 1. don't have busy polling / async IO API available and want to do some quick & dirty preloading tricks; 2. don't want to manage the complexity of in-memory cache, especially cross-processes ones.
Obviously if you have kernel-backed async IO APIs (io_uring) and willing to dig into the deeper end (for better managed cache), you can get better performance than mmap. But in many cases, mmap is "good-enough".
This is a good article, but I'm wondering what the relationship is between this website/company and varnish-cache.org, since the article claims they released Varnish Cache, and it wasn't written by Poul-Henning Kamp.
Varnish hasn't been a solo project for many years. Also PHK's version is now called Vinyl Cache while the corporate fork is called Varnish.
Is mmap still faster than fread? That might have been true in the 90s but I was wondering about current improvements.
If you have enough free memory, the file will be cached in memory anyway instead of residing on disk. Therefore both will be reading from memory, albeit through different API.
Looking for recent benchmark or view from OS developers.
In our experience building a high performance database server: absolutely. If your line of thinking is "if you have enough free memory", then these types of optimizations aren't for you. One of the main benefits is eliminating an extra copy.

Additionally, mmap is heavily optimized for random access, so if that's what you're doing, then you'll have a much better time with it than with fread.
(I hope a plug is not frowned upon here: if you like this kind of stuff, we’re a fully remote company and hiring C++ devs: https://apply.workable.com/quasar/j/436B0BEE43/ )
If you can't post a salary, you shouldn't post a job opening.
(Not that you can afford me.)
Also, your company is breaking the law by false advertising. It suggests your current leadership is fucking stupid. Why do you work for a criminal enterprise?
I'd be shocked if anyone would hire you after seeing this behavior...
What’s the false advertising?
Yeah, I took a look at the posting and it’s a bog standard job posting.
I assume they’re referring to the no-salary aspect and (based on their speech style) are in the US. But, even in that case, it would only matter if the posting were targeted to one of the states that require salary information and the company operated or had a presence in said state. Since it’s an EU company, that’s almost definitely not the case.
read, or fread? fread is the buffered version that does an extra copy for no reason that would benefit this use case.
Even if the file is cached, fread has to do a memcpy. mmap doesn't.
fread is (usually) buffered io, so it actually does two additional mem copies (kernel to FILE buffer then to user buffer)
wowie. mmap also dramatically improved perf for LLaMA: https://justine.lol/mmap/
The simple answer to "How do memory maps (mmap) deliver faster file access?" is "sometimes", but the blog post does give some more details.
I was suspicious of the 25× speedup claim, but it's a lot more plausible than I thought.
On this Ryzen 5 3500U running mostly at 3.667GHz (poorly controlled), reading data from an already-memory-mapped page is as fast as memcpy (about 10 gigabytes per second when not cached on one core of my laptop, which works out to 0.1 nanoseconds per byte, plus about 20 nanoseconds of overhead) while lseek+read is two system calls (590ns each) plus copying bytes into userspace (26–30ps per byte for small calls, 120ps per byte for a few megabytes). Small memcpy (from, as it happens, an mmapped page) also costs about 25ps per byte, plus about 2800ps per loop iteration, probably much of which is incrementing the loop counter and passing arguments to the memcpy function (GCC is emitting an actual call to memcpy, via the PLT).
So mmap will always be faster than lseek+read on this machine, at least if it doesn't have a page fault, but the point at which memcpy from mmap would be 25× faster than lseek+read would be where 2×590 + .028n = 25×(2.8 + .025n) = 70 + .625n. Which is to say 1110 = .597n ∴ n = 1110/.597 = 1859 bytes. At that point, memcpy from mmap should be 49ns and lseek+read should be 1232ns, which is 25× as big. You can cut that size more than in half if you use pread() instead of lseek+read, and presumably io_uring would cut it even more. If we assume that we're also taking cache misses to bring in the data from main memory in both cases, we have 2×590 + .1n = 25×(2.8 + .1n) = 70 + 2.5n, so 1110 = 2.4n ∴ n = 1110/2.4 = 462 bytes.
On the other hand, mmap will be slow if it's hitting a page fault, which sort of corresponds to the case where you could have cached the result of lseek+read in private RAM, which you could do on a smaller-than-pagesize granularity, which potentially means you could hit the slow path much less often for a given working set. And lseek+read has several possible ways to make the I/O asynchronous, while the only way to make mmap page faults asynchronous is to hit the page faults in different threads, which is a pretty heavyweight mechanism.
On the other hand, lseek+read with a software cache is sort of using twice as much memory (one copy is in the kernel's buffer cache and another copy is in the application's software cache) so mmap could still win. And, if there are other processes writing to the data being queried, you need some way to invalidate the software cache, which can be expensive.
(On the gripping hand, if you're reading from shared memory while other processes are updating it, you're probably going to need some kind of locking or lock-free synchronization with those other processes.)
So I think a reasonably architected lseek+read (or pread) approach to the problem might be a little faster or a little slower than the mmap approach, but the gap definitely won't be 25×. But very simple applications or libraries, or libraries where many processes might be simultaneously accessing the same data, could indeed get 25× or even 256× performance improvements by letting the kernel manage the cache instead of trying to do it themselves.
Someone at a large user of Varnish told me they've mostly removed mmap from their Varnish fork for performance.
> lseek+read is two system calls
You'd never do that, though -- you'd use pread.
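In Go terms, a small sketch of the difference (the file name is made up): os.File.ReadAt issues a single pread(2), while Seek followed by Read costs two syscalls and races with anything else sharing the file offset.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

func main() {
	f, err := os.Open("records.dat") // hypothetical file
	if err != nil {
		panic(err)
	}
	defer f.Close()

	buf := make([]byte, 100)

	// Two syscalls: lseek(2) then read(2); also unsafe if anything else
	// shares the file offset.
	if _, err := f.Seek(4096, io.SeekStart); err != nil {
		panic(err)
	}
	if _, err := io.ReadFull(f, buf); err != nil {
		panic(err)
	}

	// One syscall: pread(2); no shared offset involved.
	if _, err := f.ReadAt(buf, 4096); err != nil {
		panic(err)
	}
	fmt.Printf("first byte at offset 4096: %#x\n", buf[0])
}
```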
Just this month, I've learned the hard way that some file systems do not play well with mmap: https://github.com/mattn/go-sqlite3/issues/1355
In my case, it seems that Mac's ExFAT driver is incompatible with sqlite's WAL mode because the driver returned a memory address that is misaligned on ARM64. Most bizarre error I've encountered in years.
So, uh, mind your file systems, kids!
I would be very careful about that conclusion. Reading that thread it sounds like you’re relying on Claude to make this conclusion but you haven’t actually verified what the address being returned actually is.
The reason I'm skeptical is threefold. The first is that it's generally impossible for a filesystem's mmap to return a pointer that's not page-boundary aligned. The second is that unaligned accesses are still fine on modern ARM and don't cause a SIGBUS. The third is that Claude's reasoning that the pointer must be 8-byte aligned and that this indicates a misaligned read is flawed: how do you know that SQLite isn't doing a 2-byte read at that address?
If you really think it’s a bad alignment it should be trivial to reproduce - mmap the file explicitly and print the address or modify the SQLite source to print the mmap location it gets.
I'd love to be wrong, but the address it's referring to is the correct address from the error / stack trace.
I honestly don't know anything about this. There are no search results for my error. ChatGPT and Claude and Grok all agreed one way or another, with various prompts.
Would be happy to have some help verifying any of this. I just know that disabling WAL mode, and not using Mac's ExFAT driver, both fixed the error reliably.
But is that the address being returned by mmap? Furthermore, what instruction is this crashing on? You should be able to look up the specific alignment requirements of that instruction to verify.
> ChatGPT and Claude and Grok all agreed one way or another, with various prompts.
This means less than you'd think: they're all trained on a similar corpus, and Grok in particular is probably at least partially distilled from Claude. So they tend to come to similar conclusions given similar data.
Sounds interesting. Why wouldn’t the OS itself default to this behavior? Could it fall apart under load, or is it just not important enough to replace the legacy code relying on it?
1. mmap was added to Unix later by Sun, it wasn't in the original Unix
2. As the article points out mmap is very fast for reading huge amounts of data but is a lot slower at other file operations. For reading smallish files, which is the majority of calls most software will make to the filesystem, the regular file syscalls are better.
3. If you're on a modern Linux you might be better off with io_uring than mmap.
All true, and it's not just performance either. The API is just different. mmap data can change at any time. In fact, if the file shrinks, access to a formerly valid region of memory has behavior that is unspecified by the Single Unix Specification. (On Linux, it causes a SIGBUS if you access a page that is entirely invalid; bytes within the last page after the last valid byte probably are zeros or something? unsure.)
In theory I suppose you could have a libc that mostly emulates read() and write() calls on files [1] with memcpy() on mmap()ed regions. But I don't think it'd be quite right. For one thing, that read() behavior after shrink would be a source of error.
Higher-level APIs might be more free to do things with either mmap or read/write.
[1] just on files; so it'd have to track which file descriptors are files as opposed to sockets/pipes/etc, maintaining the cached lengths and mmap()ed regions and such. libc doesn't normally do that, and it'd go badly if you bypass it with direct system calls.