QoS, the next feature you’ll be asking for
by Enrico Signoretti on 03/12/2013 in Storage
Scale-out is today
Scale-out is something that is happening right now and you can find many scale-out solutions out there.
Looking at the SFD4 event alone, I can mention Coho Data, GridStore, Nimble Storage, Cleversafe, Avere Systems, CloudByte and Overland. All of them offer scale-out storage, with different implementations and different approaches but the same basic design: your storage infrastructure scales by adding nodes instead of adding resources in the same box (ok, ok, Nimble Storage has both scale-out and scale-up, they call it scale-to-fit, but that’s a detail in this context).
And it’s not only at SFD4 (a place where more innovation can usually be found, thanks to the number of startups presenting). If you take a look at the market there are plenty of solutions like, for example, Dell EqualLogic, HP StoreVirtual, EMC Isilon, IBM SONAS/GPFS, NetApp and so on. Not to mention the open source space, where scale-out file systems are gaining a lot of attention lately (e.g. Ceph and Gluster), or the object storage vendors (who isn’t scale-out in object storage?). Last but not least, VSAs are all scale-out solutions: Maxta, HP StoreVirtual VSA, VMware VSAN, you name it.
We could spend weeks talking about different scale-out generations, the real scalability of some solutions or even the efficiency and performance issues of others… but we can take for granted that scale-out is no longer the latest news! It’s ready for prime time, it’s becoming pervasive and it’s a great option for many types of modern infrastructures.
QoS is a different story: it’s relatively new in storage systems and we are just at the beginning.
What do scale-out and QoS have in common?
If you have a single server connected to a single disk you do not have a QoS problem. But, when the numbers grow, things change.
QoS is a very common concept in Ethernet networking today. It became very important a few years ago, when the number of connected devices grew to a level where the noise generated by second-tier applications was so loud that it compromised mission-critical workloads or specific vertical applications (VoIP, for example). Prioritization of IPs or protocols, bandwidth allocation, throughput throttling and similar techniques are the basis of the mechanisms designed to guarantee the right QoS and, consequently, meet the required SLAs.
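To make "throughput throttling" a bit more concrete, here is a minimal sketch of a token bucket, one common way to cap a rate while still allowing short bursts. It is only an illustration: the `TokenBucket` class and its `rate`, `capacity` and `allow` names are made up for this example and do not describe how any product mentioned here actually works.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Hypothetical rate limiter: one token = one unit of work (a packet, an I/O)."""
    rate: float          # tokens added per second (the sustained limit)
    capacity: float      # maximum tokens stored (the allowed burst)
    tokens: float = 0.0  # tokens currently available
    last: float = 0.0    # time of the last refill, in seconds

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Return True if a request of `cost` tokens may proceed at time `now`."""
        # Refill according to the time elapsed, never exceeding the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # over the limit: the caller should delay or drop the request


# A noisy second-tier application limited to 2 requests/s, with a burst of 2.
bucket = TokenBucket(rate=2.0, capacity=2.0, tokens=2.0)
decisions = [bucket.allow(now=t) for t in (0.0, 0.0, 0.0, 0.5, 0.5)]
```

The time is passed in explicitly instead of being read from a clock, which keeps the sketch deterministic; a real implementation would more likely use a monotonic clock and queue the rejected requests.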
Why is QoS so important for storage?
Lately, the same story has been happening to storage too. In the last couple of years QoS has come up more and more often in storage talks with vendors and end users, and it’s just the start.
Large enterprises were the first to notice this phenomenon when, at the end of the ’90s, they adopted Windows and Unix systems alongside mainframes. Not only because different protocols were involved, but also because of the huge difference in the workloads. Eventually vendors released hardware partitioning mechanisms. They were very rigid: users could slice their hardware and manage the “slices” as two or more separate arrays, but each domain/partition reconfiguration implied a service disruption!
When things got really complicated (due to virtualization), hardware partitioning showed all its limits and the first “virtual partitioning” mechanisms appeared. For example, 3PAR’s success with the ISPs partially came from the ability to create what they call Virtual Domains (you can easily and non-disruptively create up to 1024 domains in a single system).
And now, with the cloud, where the number of hosts/VMs and the variety of workloads have outpaced the abilities of traditional arrays, new kinds of scale-out, QoS-aware arrays are appearing. The most recent examples come from SolidFire and CloudByte, but others, like NexGen and IBM XIV, have shown interesting solutions too.
Not every company is a large ISP with loads of VMs, but all enterprises are facing, or will be facing, the same issues on a smaller scale. That’s why storage QoS is something you should care about.
Who does QoS?
QoS, like any other feature of a storage system, can’t easily be bolted on without limits and constraints that could seriously hamper its efficiency. It should be part of the initial design of a storage system. QoS is all about resource management and reaction time in adjusting the array’s behavior. If the QoS feature is implemented on top of other features, the result could be an excessive delay in the reaction and a misjudgment of the resources really needed, with direct consequences for the usability of the feature.
Most of the QoS-capable arrays come from new vendors. Startups like SolidFire and CloudByte are showing really interesting approaches: the latter is implementing a sort of virtual controller (each virtual controller/tenant has its own set of data volumes and QoS is controlled at that level), while the former has a very sophisticated and fine-grained mechanism at the LUN level. Both are clearly targeting the ISP market, but there is no doubt that large enterprises could find interesting solutions for their private clouds here as well.
At the same time, companies like GridStore and the server-side cache vendors are implementing QoS mechanisms at a higher level of the stack (closer to the physical server), and this has its advantages in virtualized infrastructures (as was already mentioned in this article).
Other vendors also have basic QoS mechanisms, but they are often less sophisticated, less granular, and sometimes useless in specific scenarios (due to what I described above).
Why it matters
Today, QoS is not for everyone. Current primary targets are ISPs/CSPs and large enterprises.
Now the question is whether or not it will really attract the proper attention in the ordinary enterprise too. From my point of view, QoS is a key factor for some SLA-focused cloud services and it’s a way to guarantee SLAs and predictable results at scale. The same is true for every kind of enterprise and its internal IT, even when we talk about scaled-down scenarios.
Another big advantage of QoS for ordinary enterprise IT is the better overall efficiency it adds to the storage system. In fact, the ability to correctly manage (in this case manage could simply mean “impose limits on”) IO-intensive workloads could dramatically change day-to-day operations like batch jobs, massive DB uploads and all similar activities that could severely hurt other production environments. And, as usual, better efficiency and resource management lead to a better overall TCO.
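As a rough illustration of what “impose limits on” could mean in practice, the sketch below splits a fixed IOPS budget between a production database and a batch job, giving each workload a guaranteed floor and a hard cap. Everything in it is an assumption made for the example: the `Workload` fields, the `allocate_iops` function and the numbers are hypothetical and are not taken from any array mentioned in this post.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical per-volume QoS settings."""
    name: str
    demand: int    # IOPS the workload is trying to issue right now
    min_iops: int  # guaranteed floor
    max_iops: int  # hard cap, even when the array is idle
    priority: int  # lower number = more important


def allocate_iops(total: int, workloads: list[Workload]) -> dict[str, int]:
    """Grant IOPS from a fixed budget: floors first, then leftovers by priority."""
    ordered = sorted(workloads, key=lambda w: w.priority)
    grants: dict[str, int] = {}
    remaining = total
    # Pass 1: every workload gets its guaranteed minimum (or less if it asks for less).
    for w in ordered:
        grant = min(w.demand, w.min_iops, remaining)
        grants[w.name] = grant
        remaining -= grant
    # Pass 2: hand out what is left in priority order, never beyond demand or cap.
    for w in ordered:
        ceiling = min(w.demand, w.max_iops)
        extra = min(max(0, ceiling - grants[w.name]), remaining)
        grants[w.name] += extra
        remaining -= extra
    return grants


# A 10,000 IOPS array shared by a production DB and a greedy nightly batch job.
grants = allocate_iops(10_000, [
    Workload("prod-db", demand=6_000, min_iops=4_000, max_iops=8_000, priority=0),
    Workload("nightly-batch", demand=20_000, min_iops=500, max_iops=3_000, priority=1),
])
```

With these made-up numbers the database gets everything it asks for, while the batch job is held at its cap instead of swallowing the whole array. A real array would have to re-evaluate this continuously and quickly, which is exactly why QoS needs to be part of the initial design.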