First, some ground rules:
Rule #1: There is nothing wrong with LVM Striping, or tweaking Queue Depth, or any other Linux I/O sub-system tuning
Rule #2: What I say here is based on the HANA implementation in which I'm involved
Rule #3: If someone interprets this article as a criticism of how they run their HANA environment, they should read Rules #1 and #2
With the preliminaries out of the way, it's time to discuss what I hope won't devolve into a Holy War.
In The Beginning
In the beginning, the Storage spun, the Read/Write Heads stepped, and the Operating System lived in harmony with the Actual Hardware.
Then the Operating System ate the fruit of the Tree of Virtualization...and all bets were off.
Much and more has been written over the past 10-12 years about Queue Depth, LUN striping, and other Linux OS techniques to optimize disk I/O, especially concerning storage for database applications (including HANA). Debates have raged over round-robin versus service-time, whether or not to stripe Logical Volumes across multiple LUNs, and the impact of HBA Queue Depth on very large (10TB+) filesystems.
I very much suspect that the people advancing their views were absolutely correct about what they found important in their environment at the time.
However, in 2021, I also suspect that advice from just 5 years ago may be all but worthless for brand-new hardware; I'm not sure how much weight to give to statements that were made with honest certainty even more recently.
Welcome to the Jungle
Once upon a time, and not really all that long ago, I was managing computing infrastructure where 9 Gigabyte disks (spinning at a phenomenal 4500RPM!) were considered state-of-the-art. Today, state-of-the-art means solid-state drives that don't spin at all, and 9 GB is the size of the on-drive cache (well, maybe not that much, but you get the idea).
Instead of inserting PCIe3 cards in a chassis, my LPARs have Virtual HBAs that exist as software. The relationship linking the HBA perceived by the Linux kernel to the actual chips on the PCB is most accurately described as a delusion shared between the kernel and the driver. There's nothing inherently wrong with that, of course, but it's important to keep those facts in mind when considering these topics.
While designing the SLES environment for which I'm responsible, I had the advantage of a "green field", and a hardware platform architected by a fellow who knew the value of getting the best available at the time: the latest Fibre Channel cards, 4-port 10Gb Ethernet NICs, an SSD-based SAN, Power9 and hex-core x86s.
So, in such an environment, does, say, Queue Depth actually matter? Is there really a benefit to striping the Logical Volume for /hana/data across LUNs? When weighing round-robin multipathing versus service-time, does it truly make a difference?
I propose that none of those are terribly important. One reason I think that is my experience with our new EMC/Dell PowerMAX SAN.
Please Pass the Stone Tablets
My organization brought in a number of contract resources to assist with unboxing all that shiny new gear, racking and cabling the lot, and providing guidance on initial configuration. As the lead (OK, only) Linux expert, it fell to me to ensure that the Linux LPARs could multipath the SAN-based storage.
The contractor who helped the storage guys with the initial SAN implementation provided me with configuration parameters for multipathd. Troublemaker that I am, I reviewed his recommendations line-by-line, digging through the Linux documentation and ensuring that they made sense.
They didn't.
The proposal included configuration items clearly documented as Deprecated. Worse, there were even settings I knew to be unsafe (in one case, the man page for multipathd explicitly said to avoid the configuration).
It wasn't the resource's fault - he'd based his suggestions on information published by EMC/Dell. However, further investigation revealed EMC/Dell had essentially not changed their configuration recommendations since Linux kernel v2.6 (SLES v11); SLES v15GA uses kernel v4.12.
That's about 10 years during which EMC/Dell didn't bother keeping their documentation current. It might not have been so bad if they hadn't charged customers so much for those stone tablets.
In early 2019, through the resource's contacts, I made EMC/Dell aware of the problems; the antiquated documentation abruptly vanished from their website. In response to our queries, the new guidance became "Use the defaults".
So much for all that tuning.
In the end, I tweaked two items and deployed a successful configuration. And it got me thinking as I looked across the rest of the stack.
On The Depths of Queues
If there's one aspect of Linux I/O sub-system configuration, as it pertains to SAN-based storage, that creates endless confusion, it's probably Queue Depth. Over the past decade, I've seen it generate more head-scratching, if not hair-pulling, than anything else - its only rival being multipathd's no_path_retry queue setting and that setting's deprecated cousin, features "1 queue_if_no_path".
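For the curious, here's roughly what the modern syntax looks like - a hypothetical /etc/multipath.conf fragment of my own devising, not the configuration from my environment (the vendor/product strings are placeholders you'd match to your array's actual SCSI INQUIRY data):

    devices {
        device {
            vendor   "EMC"      # placeholder
            product  ".*"       # placeholder
            # Current syntax: queue I/O indefinitely when all paths are down
            no_path_retry  queue
            # Deprecated equivalent you may still find in old vendor docs:
            # features  "1 queue_if_no_path"
        }
    }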
In the SLES-on-Power environment I designed, the ibmvfc driver defaults to a Queue Depth of 16 on the Virtual HBA. So, what does that mean?
The best explanation I have is this: Queue Depth is the number of pending I/O requests the HBA can hold before it starts immediately returning a "busy" reply to further requests - in effect, it functions as a FIFO buffer. Some HBAs set Queue Depth based on the number of paths, some use arbitrary values (I haven't yet discovered how the IBM driver goes about picking 16); certain drivers will let you change it, others won't.
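Curious what your own devices are using? sysfs will tell you. Here's a minimal Python sketch, assuming the usual sd* device naming and standard sysfs layout:

    #!/usr/bin/env python3
    # Minimal sketch: report the current queue depth of every SCSI disk
    # that exposes one via sysfs. This only reads; whether a driver
    # honors *writes* to queue_depth varies.
    from pathlib import Path

    for qd_file in sorted(Path("/sys/block").glob("sd*/device/queue_depth")):
        device = qd_file.parents[1].name        # e.g. "sda"
        depth = qd_file.read_text().strip()
        print(f"{device}: queue depth {depth}")

    # On drivers that allow it, the same file accepts a new value,
    # e.g. (as root):  echo 32 > /sys/block/sda/device/queue_depth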
But do you need to alter it? Current information is hard to find, and information specific to a particular hardware environment is rarer still.
In my experience, there is no one answer, no guaranteed way to optimize I/O across all possible OSes, platforms, and workloads. That said, there are things in Linux that will give you clues. In particular, a tool such as dstat, or data extracted from sysfs, may be useful. Analyzing /sys/block/<DEVICE>/stat and related files is a good place to start; look for high average times for disk I/O operations, long wait times, and persistently non-zero "in-flight" counters.
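To illustrate, here's a minimal Python sketch that pulls lifetime averages and the in-flight count from that stat file. The field layout follows the kernel's Documentation/block/stat (newer kernels append discard and flush fields, which the sketch ignores), and the default device name is just an example:

    #!/usr/bin/env python3
    # Minimal sketch: summarize /sys/block/<DEVICE>/stat.
    # The counters are cumulative since boot, so these are lifetime
    # averages; sample twice and take the difference for interval rates.
    import sys

    device = sys.argv[1] if len(sys.argv) > 1 else "sda"
    with open(f"/sys/block/{device}/stat") as f:
        fields = [int(n) for n in f.read().split()]

    read_ios, read_ticks_ms = fields[0], fields[3]
    write_ios, write_ticks_ms = fields[4], fields[7]
    in_flight = fields[8]

    if read_ios:
        print(f"avg read:  {read_ticks_ms / read_ios:.2f} ms")
    if write_ios:
        print(f"avg write: {write_ticks_ms / write_ios:.2f} ms")
    print(f"in flight: {in_flight}")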
In the HANA environment for which I'm responsible, I chose not to tinker with Queue Depth on the LPAR Virtual HBAs. But the filesystem sizes are relatively small (under 5TB), and the infrastructure fast, so there isn't a lot that performance tuning can accomplish. Other environments, perhaps those using iSCSI, or having larger filesystems, or running on older infrastructure, may arrive at different conclusions.
A final note: Queue Depth exists on the storage array as well. The extent to which you might be able to see and/or manipulate those settings varies from environment to environment. But don't make the mistake of only looking at the host side of the configuration.
Stripes Aren't Just For Candy Canes
Reversing direction to move up the stack from multipathing: what about LVM, and in particular, striping LVs across LUNs?
Early in the creation of the environment for which I designed the SLES implementation, I discussed LVM striping with the SAP consultants. Initially, at their recommendation, I striped the LVs for /hana/{data,log,shared}. For example, if /hana/data was 2TB, provisioned using eight (8) LUNs of 256GB each, then I created the LV underlying the filesystem with 8 stripes.
Striping an LV introduces a few management wrinkles, especially when it comes to adding storage. When the time came, post-Go Live, to grow a filesystem, the additional LUNs (in the example system) had to be allocated in sets of 8. If the DBAs asked for 512GB, that meant I needed 64GB LUNs, not 256GB ones. That made it hard to adopt a "standard" LUN size.
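The arithmetic is trivial, but it makes the constraint concrete (the function is mine, purely illustrative):

    # Minimal sketch of the constraint striping imposes on growth: an
    # extension must arrive as one new LUN per stripe, so the LUN size
    # is dictated by the request, not by a site-standard LUN size.
    def lun_size_gb(extension_gb: float, stripes: int) -> float:
        """Size each new LUN must be to grow a striped LV by extension_gb."""
        return extension_gb / stripes

    # The example above: growing an 8-stripe LV by 512GB forces 64GB LUNs.
    print(lun_size_gb(512, 8))    # 64.0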
After a year of operation, I've concluded that, at least in the environment I've built, striping doesn't improve anything. As I understand it, this SAP environment is on the small side of "medium-sized", so environments with much larger filesystems, or different storage delivery technologies (such as iSCSI), will doubtless have other factors to consider beyond those I've seen.
One question that I've been unable to answer is the performance impact of having multiple sets of LUNs for striping. Part of the difficulty lies in the many layers between the process writing data to the filesystem and the actual mechanics of placing bits on an SSD. Gone are the days when the OS had an understanding of "disk geometry" that matched the physical properties of the storage hardware. I can't say I miss them - I can still recall running hard-drive formatting utilities and configuring storage by calculating tracks, heads, sectors-per-track, and bytes-per-sector (woe betide the sysadmin who got his calculations wrong!) - but having thin-on-thin-on-thin storage still makes my teeth itch, and probably always will.