Microsoft Cluster Shared Volumes & NetApp DSM Performance Issue

Symptom: The Cluster Service may report that a Cluster Shared Volume is no longer available on this node because of ‘STATUS_VOLUME_DISMOUNTED(c000026e)’. All I/O will temporarily be queued until a path to the volume is reestablished. Other symptoms include slower disk performance, higher storage processor utilization, delays writing to disk, and a higher than normal average disk queue length.

Cause: When Microsoft Cluster Shared Volumes are used with the NetApp DSM v3.4, users may experience problems caused by the way the NetApp DSM issues SCSI-3 persistent reservations. The coordinator node, introduced by Microsoft Cluster Shared Volumes, uses SCSI-3 persistent reservations to schedule writes to disk; in conjunction with the NetApp DSM, this results in substantial delays in I/O processing while persistent reservations are processed. The delays become more apparent with multiple high-volume streams of small 4K writes, such as those generated by Microsoft SQL Server.

Resolution: There are two options: stop using Microsoft Cluster Shared Volumes, or stop using the NetApp DSM, as the stock Microsoft DSM doesn’t exhibit this behaviour. Based on the performance differences below, I would recommend not using Microsoft Cluster Shared Volumes, purely on performance grounds. Here is an example from a FAS2040 running Data ONTAP 7.3.3 with three aggregates, each with a RAID group of 28 x 300GB 15K FC disks:

Configuration            IOPS     Read      Write     Latency
NetApp DSM               93,212   1.1TBps   912GBps   4ms
Microsoft DSM with CSV   51,175   561GBps   312GBps   35ms
NetApp DSM with CSV      93,112   311GBps   281GBps   92ms
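To make the gap easier to see, here is a minimal Python sketch that expresses each configuration relative to the NetApp DSM baseline (the first row). The IOPS and latency figures are copied from the table above; the dictionary layout and the function name relative_to_baseline are my own, purely for illustration.

```python
# IOPS and latency (ms) copied from the table above.
# The structure and names below are illustrative, not part of any vendor tool.
results = {
    "NetApp DSM": {"iops": 93212, "latency_ms": 4},
    "Microsoft DSM with CSV": {"iops": 51175, "latency_ms": 35},
    "NetApp DSM with CSV": {"iops": 93112, "latency_ms": 92},
}

BASELINE = "NetApp DSM"


def relative_to_baseline(results, baseline):
    """Return latency as a multiple of, and IOPS as a percentage of, the baseline row."""
    base = results[baseline]
    relative = {}
    for name, row in results.items():
        relative[name] = {
            "latency_x": row["latency_ms"] / base["latency_ms"],
            "iops_pct": 100.0 * row["iops"] / base["iops"],
        }
    return relative


if __name__ == "__main__":
    for name, rel in relative_to_baseline(results, BASELINE).items():
        print(f"{name}: {rel['latency_x']:.2f}x latency, "
              f"{rel['iops_pct']:.1f}% of baseline IOPS")
```

On these figures, latency with the Microsoft DSM and CSV is 8.75 times the baseline, and with the NetApp DSM and CSV it is 23 times the baseline, even though the reported IOPS in that last case is almost unchanged.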

Analysis: NetApp has published an example of how this was designed to work. It is interesting to note that they tested Cluster Shared Volumes on a FAS 3000 series storage unit with a PAM (Flash Cache) card, one of their largest and fastest storage systems. In NetApp’s example, the Flash Cache changed the geometry calculations when performing SCSI-3 persistent reservations.