Tuesday, January 5, 2010

ZFS, ARC, page cache, and 1970s buffer tuning

Some time ago Pawel Dawidek ported Sun's ZFS to FreeBSD. ZFS has many interesting features.

One of the more problematic features on general purpose systems is the disk block caching in the ARC (Adaptive Replacement Cache). FreeBSD's native file systems cache data by storing a file's pages at the appropriate offset in the vm object associated with the file's vnode. Typically, unless a page is actively being written to or read from disk, i.e. it is associated with a buf, it can be taken away from the vm object by the pagedaemon to be used somewhere else in the system. In ZFS' ARC, disk blocks are cached by their DVA (oversimplifying, but effectively their block offset on disk). I won't go into the rationale for this design choice, but it precludes us from using the existing mechanism for coupling file caching with the VM system. ARC buffers are allocated from "wired" memory, meaning that the pagedaemon cannot evict them, and it has no information about their usage relative to other pages in the system (i.e. about when a buffer could be released and its memory re-allocated to some other service).

As a result, the VM cannot determine how "hot" a ZFS buffer is vis-à-vis other ZFS buffers or other parts of memory. The VM can only influence the overall usage of the ARC by calling the lowmem handler to free some number of buffers. However, it has no way of determining the relative priority of ZFS memory usage versus user applications. A system administrator is left to decide a priori what the minimum ARC size is (the size below which memory pressure and the lowmem handler will not shrink it) and what the maximum size is (the size above which the ARC will not allocate further buffers). On a dedicated system one can set a large arc_max and arc_min. However, none of this yields graceful management of resources under mixed workloads.
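To make the min/max behaviour described above concrete, here is a minimal userspace sketch in C. It is not the real ARC code: the struct cache_limits type and the functions cache_try_grow() and cache_lowmem() are hypothetical names invented for illustration, and the sizes are plain byte counters rather than real buffers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the ARC's size accounting (not the real arc.c state). */
struct cache_limits {
    uint64_t min;   /* floor: the lowmem handler never shrinks the cache below this */
    uint64_t max;   /* ceiling: no further buffers are allocated above this */
    uint64_t size;  /* bytes currently held by the cache */
};

/* Try to grow the cache by len bytes. Returns 1 on success, 0 if max would be exceeded. */
int cache_try_grow(struct cache_limits *c, uint64_t len)
{
    assert(c->min <= c->max);
    if (c->size > c->max || len > c->max - c->size)
        return 0;
    c->size += len;
    return 1;
}

/* Lowmem handler: release up to want bytes, but never go below min. Returns bytes freed. */
uint64_t cache_lowmem(struct cache_limits *c, uint64_t want)
{
    uint64_t spare = c->size > c->min ? c->size - c->min : 0;
    uint64_t freed = want < spare ? want : spare;

    c->size -= freed;
    return freed;
}
```

Note that nothing in this model tells the caller of cache_lowmem() which buffers are worth keeping. It can only ask for a number of bytes back, which is the limitation described above.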

Ideally one could keep the default ARC settings small but still use all of memory for block caching if it isn't in use by other applications.

To this end I've added a caching layer between the memory allocation functions (zio_buf_alloc and zio_buf_free) and their consumers. Caching of buffer pages is limited by ZFS' rather idiosyncratic allocation patterns. Thus, the bulk of allocations, by allocation count, are still malloc backed (not eligible for page caching). Only the 128k buffer allocations tend to be size aligned (the worst alignment I've seen in practice is 32k) and are thus practical to track by their alignment on disk. Unfortunately, the block offset isn't usually available at allocation time, so I allocate 128k buffers using the anonymous allocation function, geteblk(), and then synchronize with the vm object once the I/O using the buffer is done.
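The size-based split described above can be sketched roughly as follows. This is a userspace illustration only, not the actual patch: struct cached_buf, choose_backing(), buf_alloc() and buf_free() are hypothetical names, and both paths simply call malloc() here, whereas in the kernel the 128k path would go through geteblk().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define PAGE_CACHED_SIZE ((size_t)128 * 1024)  /* only 128k buffers are page cached */

/* Which allocator backs a buffer (hypothetical tags for this sketch). */
enum buf_backing { BACKING_MALLOC, BACKING_ANON_PAGES };

struct cached_buf {
    enum buf_backing backing;
    size_t size;
    void *data;
};

/* 128k buffers take the page-cacheable path; everything else stays malloc backed. */
enum buf_backing choose_backing(size_t size)
{
    return size == PAGE_CACHED_SIZE ? BACKING_ANON_PAGES : BACKING_MALLOC;
}

/* Allocate a buffer and record its backing. Returns 1 on success, 0 on failure. */
int buf_alloc(struct cached_buf *b, size_t size)
{
    b->backing = choose_backing(size);
    b->size = size;
    /* Assumption: userspace stand-in. No disk offset is known yet at this point. */
    b->data = malloc(size);
    return b->data != NULL;
}

/* Release the buffer's memory and clear the pointer. */
void buf_free(struct cached_buf *b)
{
    free(b->data);
    b->data = NULL;
}
```

Recording the backing at allocation time means the free path and any later cache synchronization can tell the two kinds of buffer apart without knowing the disk offset up front.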

All I/O traverses zio_create. Thus, if a read or write is being done to a top level logical device, the I/O is synchronized with the cache in the new function zio_sync_cache(...).
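As a rough picture of what synchronizing I/O with a cache keyed by disk offset could look like, here is a toy userspace model in C. It is an assumption-laden sketch, not the body of zio_sync_cache: struct blk_cache, cache_lookup() and cache_update() are hypothetical, the block size is shrunk to 16 bytes, and a small direct-mapped table stands in for the vm object.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLK_SIZE    16  /* tiny stand-in for a 128k block */
#define CACHE_SLOTS 8   /* direct-mapped table standing in for the vm object */

struct cache_slot {
    int valid;
    uint64_t offset;               /* block offset on the device */
    unsigned char data[BLK_SIZE];
};

struct blk_cache {
    struct cache_slot slots[CACHE_SLOTS];
};

/* Map a device offset to its slot. */
static struct cache_slot *slot_for(struct blk_cache *c, uint64_t offset)
{
    return &c->slots[(offset / BLK_SIZE) % CACHE_SLOTS];
}

/* Before a read: copy the cached block into buf on a hit. Returns 1 on hit, 0 on miss. */
int cache_lookup(struct blk_cache *c, uint64_t offset, unsigned char *buf)
{
    struct cache_slot *s = slot_for(c, offset);

    if (!s->valid || s->offset != offset)
        return 0;
    memcpy(buf, s->data, BLK_SIZE);
    return 1;
}

/* After I/O completes: the offset is now known, so record the buffer contents. */
void cache_update(struct blk_cache *c, uint64_t offset, const unsigned char *buf)
{
    struct cache_slot *s = slot_for(c, offset);

    memcpy(s->data, buf, BLK_SIZE);
    s->offset = offset;
    s->valid = 1;
}
```

The point of the model is the ordering: the buffer exists before its offset is known, and it only becomes findable in the cache once the I/O has supplied that offset.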

more later ...

1 comment:

Richard Elling said...

For Solaris, it is rare that arc_max needs to be tuned.
The predominant case is when a memory-hungry workload is also present and can benefit from large pages. If left unchecked, the ARC (or any other user program) can steal enough large pages that the memory-hungry application is left with small pages and the resulting performance degradation. The second most common case is when memory-hungry applications start and stop causing sudden, dramatic changes in memory pressure. I can think of no good reason to tune arc_max as a matter of habit.
-- richard