ZFS really has some interesting quirks. One of them is that it is truly designed to deal with dumb-as-a-rock storage. If you have a box of SATA disks with firmware flakier than Paris Hilton on a coke binge, then ZFS has truly been designed for you.
As a result, ZFS doesn’t trust that anything it writes to the ZFS Intent Log (ZIL) made it to your storage until it flushes the storage cache. After every write to the ZIL, ZFS issues a cache-flush command (the storage-level equivalent of an fsync() call) to instruct the storage to flush its write cache to the disk. In fact, ZFS won’t return on a synchronous write operation until the ZIL write and flush have completed. If the devices making up your zpool are individual hard drives…particularly SATA ones…this is great behavior. If the power goes kaput during a write, you don’t have the problem that the write made it to drive cache but never to the disk.
The major problem with this strategy only occurs when you try to layer ZFS over an intelligent storage array with a decent battery-backed cache.
Most of these arrays have sizable caches of 2 GB or more, backed by 72-hour batteries. The cache gives a huge performance boost, particularly on writes. Since cache is so much faster than disk, the array can tell the writer really quickly, "I’ve got it from here, you can go back to what you were doing". Essentially, as fast as the data goes into the cache, the array can release the writer. Unlike drive-based caches, the array cache has that 72-hour battery attached to it. So, if the array loses power and dies, you don’t lose the writes in the cache. When the array boots back up, it flushes the writes in the cache to the disk.
However, ZFS doesn’t know that it’s talking to an array, so it assumes that the cache isn’t trustworthy and still issues a cache flush after every ZIL write. So every time a ZIL write occurs, the write goes into the array write cache, and then the array is immediately instructed to flush the cache contents to the disk. This means ZFS doesn’t get the benefit of a quick return from the array; instead, it has to wait as long as it takes to flush the write cache to the slow disks. If the array is under heavy load and the disks are thrashing away, your write return time (latency) can be awful with ZFS. Even when the array is idle, your latency with flushing is typically higher than the latency under heavy load with no flushing. With our array honoring ZFS ZIL flushes, we saw idle latencies of 54ms, and heavy load latencies of 224ms.
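Latency figures like the ones above can be approximated from the host side by timing small synchronous writes. The sketch below is only an illustration, not necessarily how our numbers were collected: the function name, block size, and write count are assumptions. It writes a block, calls os.fsync(), and records how long the pair takes. On a ZFS file system, each fsync() forces a ZIL commit, which is where the flush to the array comes in.

```python
import os
import tempfile
import time


def measure_sync_write_latency(directory, writes=20, block_size=8192):
    """Time `writes` write+fsync pairs and return each latency in milliseconds.

    `directory` should live on the file system under test. The function name
    and defaults are illustrative assumptions, not part of any ZFS tooling.
    """
    payload = b"\0" * block_size
    latencies_ms = []
    # Scratch file inside the directory being measured.
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(writes):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # block until the OS reports the data as stable
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    finally:
        # Always clean up the scratch file, even if a write fails.
        os.close(fd)
        os.unlink(path)
    return latencies_ms
```

Pointing it at a directory on the pool (say, a hypothetical /tank/scratch) and comparing the mean and maximum of the returned list before and after the array change gives a rough before-and-after picture. Treat the numbers as relative, since they include file system and driver overhead as well as the array itself.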
You have two options to rid yourself of the bane of existence known as write cache flushing:
* Disable the ZIL. The ZIL is the way ZFS maintains consistency until it can get the blocks written to their final place on the disk. That’s why the ZIL flushes the cache. If you don’t have the ZIL and a power outage occurs, your blocks may go poof in your server’s RAM…’cause they never made it to the disk, Kemosabe. See Dracko article #570 on how to disable the ZIL.
* Tell your array to ignore ZFS’ flush commands. This is pretty safe, and massively beneficial.
The first option is really a no-go because it opens you up to losing data. The second option works really well and is darn safe. It’s safe because if ZFS is waiting for the write to complete, the write has already made it to the array, and if it’s in the array’s battery-backed cache, you’re golden. Whether famine or flood or a loose power cable comes, your array will get that write to the disk eventually. So it’s OK to have the array lie to ZFS and release ZFS almost immediately after the ZIL flush command is issued.
So how do you get your array to ignore SCSI flush commands from ZFS? That differs depending on the array, but I can tell you how to do it on an Engenio array. If you’ve got any of the following arrays, it’s made by Engenio and this may work for you:
* Sun StorageTek FlexLine 200/300 series
* Sun StorEdge 6130
* Sun StorageTek 6140/6540
* IBM DS4x00
* many SGI InfiniteStorage arrays (you’ll need to check to make sure your array is actually OEM’d from Engenio)
On a StorageTek FLX210 with SANtricity 9.15, the following command script will instruct the array to ignore flush commands issued by Solaris hosts:
// Show the current Solaris ICS option on both controllers
show controller[a] HostNVSRAMbyte[0x2, 0x21];
show controller[b] HostNVSRAMbyte[0x2, 0x21];
// Set the option so flush commands from Solaris hosts are ignored
set controller[a] HostNVSRAMbyte[0x2, 0x21]=0x01;
set controller[b] HostNVSRAMbyte[0x2, 0x21]=0x01;
// Make changes effective by rebooting the controllers
show "Rebooting A controller.";
reset controller[a];
show "Rebooting B controller.";
reset controller[b];
If you read carefully, I said the script will cause the array to ignore flush commands from Solaris hosts. So all Solaris hosts attached to the array will have their flush commands ignored. You can’t turn this behavior on and off on a per-host basis.
To run this script, cut and paste it into the script editor of the "Enterprise Management Window" of the SANtricity management GUI. That’s it!
A key note here is that you should definitely have your server shut down, or at minimum your ZFS zpool exported, before you run this. Otherwise, when your array reboots, ZFS will kernel panic the server. In our experience, this will happen even if you only reboot one controller at a time, waiting for one controller to come back online before rebooting the other. For whatever reason, MPxIO, which normally works beautifully to keep a LUN available when losing a controller, fails miserably in this situation. It’s probably the array’s fault, but whatever the issue, that’s the reality. Plan for downtime when you do this.
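The export-before, import-after step could be wrapped in a small helper like the one below. This is a minimal sketch under stated assumptions: the pool name "tank" and both function names are hypothetical, and the helper only prints the commands unless dry_run is turned off. The underlying commands are the standard zpool export and zpool import.

```python
import subprocess


def zpool_maintenance_commands(pool):
    """Return the (before, after) zpool commands for an array maintenance window.

    `pool` is the zpool name; "tank" in the usage note is only a placeholder.
    """
    before = ["zpool", "export", pool]  # run before the controllers reboot
    after = ["zpool", "import", pool]   # run once both controllers are back
    return before, after


def run(command, dry_run=True):
    """Print the command, and execute it only when dry_run is False."""
    print(" ".join(command))
    if not dry_run:
        # check=True raises CalledProcessError if zpool reports a failure.
        subprocess.run(command, check=True)
```

In use, you would call run(before, dry_run=False) on the server, run the array script, wait for both controllers to come back, and then call run(after, dry_run=False). Leaving dry_run at its default just shows what would be executed.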