The migration to the VSAN cluster

What’s hot in the storage industry? The answer to that question has changed radically over the last couple of years. Hyper-converged storage is the new kid on the block!

At SecureLink and Raido, we work with various vendors such as Nutanix and VMware. VSAN is a fairly new VMware product that combines their already well-established hypervisor with hyper-converged storage features.

Does VSAN meet our business needs?

A VSAN implementation consists of a minimum of 3 hosts. Each host has at least 1 disk group, which consists of 1 performance device (SSD) and multiple capacity devices (SSD or HDD). These disk groups are committed to the Virtual SAN storage, and VSAN presents a shared datastore to all hypervisors in the cluster.
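To make that layout concrete, here is a minimal PowerCLI sketch of such a setup. The vCenter address, cluster name, host names and credentials are placeholders for illustration only, not our actual environment.

```powershell
# Minimal sketch (placeholder names and credentials): create a VSAN-enabled
# cluster and add three hosts. With automatic claim mode, VSAN builds a disk
# group per host from the eligible cache SSD and capacity devices it finds.
Connect-VIServer -Server vcenter.lab.local

$cluster = New-Cluster -Name "VSAN-Cluster" -Location (Get-Datacenter "Lab") -VsanEnabled -VsanDiskClaimMode Automatic

"esx01.lab.local", "esx02.lab.local", "esx03.lab.local" | ForEach-Object {
    Add-VMHost -Name $_ -Location $cluster -User root -Password "***" -Force
}

# The resulting shared datastore is presented to every host in the cluster
Get-Datastore -Name "vsanDatastore" | Select-Object Name, CapacityGB, FreeSpaceGB
```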

Since we needed a new hardware platform capable of running our ever-expanding lab, we took a closer look in order to see if this solution could meet our needs.

We chose a Supermicro 4-node 2U chassis in a hybrid configuration and made sure all components were listed on VMware’s HCL. The hardware had a few downsides, which I will elaborate on later in this article.

Throughout the migration from our previous lab platform to the new VSAN cluster, we encountered a number of issues. This was mainly because we couldn’t find any official documentation on how to migrate the VSAN cluster from our temporary vCenter (which we used for initial testing) to the vCenter appliance of our lab environment.

On the first attempt, we migrated the VSAN hosts in bulk to our lab vCenter. Unfortunately, this broke the existing VSAN configuration: the new vCenter tried to reconfigure the newly added nodes with a new VSAN cluster UUID, corrupting the existing configuration.

If these had been production nodes with critical business data, this would have resulted in 100% data loss.

[Figure vsan-1: VSAN partition errors reported after the bulk host migration]

Moving towards a healthy cluster

Obviously, we were not satisfied with these results, so we tried again using a different method:

First, we migrated a single host from the old cluster to the new VSAN cluster. That initial server registers with the new vCenter and applies the cluster-enabled features. Because that server still holds a VSAN partition from the old cluster, the new cluster reuses its UUID.

Subsequently, we moved the other 3 hosts to the cluster. This time, the cluster did not report any VSAN partition-related issues, as it did in the previous attempt (see the illustration above).

The cluster was healthy again. This approach is based on the only documentation I could find about moving a VSAN cluster across data centers / vCenter Servers:

http://www.virtuallyghetto.com/2014/09/how-to-move-a-vsan-cluster-from-one-vcenter-server-to-another.html

Note: If you try to add all of the ESXi hosts from the existing VSAN Cluster to the new VSAN Cluster at once, you will see an error regarding UUID mismatch. The trick is to add one host first and once that has been done, you bulk add the remaining ESXi hosts and you will not have an issue. This is handy if you are trying to automate this process.
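The same one-host-first trick can be expressed in PowerCLI. A hedged sketch, with placeholder cluster name, host names and credentials:

```powershell
# Sketch of the one-host-first approach (placeholder names and credentials).
# Add a single host first so the new cluster picks up the existing VSAN UUID...
$cluster = Get-Cluster "VSAN-Cluster"
Add-VMHost -Name "esx01.lab.local" -Location $cluster -User root -Password "***" -Force

# ...then bulk-add the remaining hosts without triggering a UUID mismatch
"esx02.lab.local", "esx03.lab.local", "esx04.lab.local" | ForEach-Object {
    Add-VMHost -Name $_ -Location $cluster -User root -Password "***" -Force
}
```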

After the host migration was complete, we migrated all our lab VMs (roughly 320 VMs) to the new VSAN cluster and declared it open for business!
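Moving the VMs themselves boils down to a combined host and storage vMotion per VM. A brief sketch, assuming placeholder cluster and datastore names:

```powershell
# Sketch: migrate all VMs from the old cluster onto the VSAN datastore
# (cluster and datastore names are placeholders)
$target    = Get-Cluster "VSAN-Cluster"
$datastore = Get-Datastore "vsanDatastore"

Get-VM -Location (Get-Cluster "Old-Lab-Cluster") |
    Move-VM -Destination $target -Datastore $datastore
```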

The true power of VSAN

As I said earlier, there were some drawbacks to the hardware we chose. We suffered an HDD failure in one of our nodes, and at that moment we saw the true power of VSAN.

The failed HDD was detected and reported by VSAN as a permanent failure. This resulted in the automatic rebuilding of the failed VSAN object on a new physical disk elsewhere in the cluster. A VSAN object is a logical volume that distributes its data and metadata across the entire cluster and grants access to that data from any host.

The only thing we lost at that moment was raw storage capacity for the VSAN. This was resolved by adding a new spare disk to the chassis, after which VSAN was completely healthy again.

Overall, I would definitely say that VSAN’s handling of a capacity device failure and the subsequent rebuild is very robust and intuitive. All I had to do manually was remove the failed device from the disk group and add the new HDD to the same disk group.

On the VMware blog, I found an interesting article about replacing disks and the associated procedures: https://blogs.vmware.com/storage/2014/12/02/vmware-virtual-san-operations-replacing-disk-devices/
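In PowerCLI terms, those manual steps look roughly like the sketch below. The host name and the naa.* canonical device names are placeholders; always follow the official procedure linked above for production systems.

```powershell
# Sketch: remove the failed capacity device from its disk group and
# add the replacement HDD (host and naa.* names are placeholders)
$vmhost    = Get-VMHost "esx02.lab.local"
$diskGroup = Get-VsanDiskGroup -VMHost $vmhost | Select-Object -First 1

# Drop the failed capacity disk from the disk group
$failed = Get-VsanDisk -VsanDiskGroup $diskGroup |
    Where-Object { $_.CanonicalName -eq "naa.FAILED_DEVICE" }
Remove-VsanDisk -VsanDisk $failed -Confirm:$false

# Claim the freshly inserted HDD into the same disk group
New-VsanDisk -CanonicalName "naa.NEW_DEVICE" -VsanDiskGroup $diskGroup
```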

Bear in mind that the cache device (SSD) in a disk group is a single point of failure in the current build of VSAN 6.0. If the SSD fails in a host with only 1 disk group, that host drops out of the VSAN cluster as well. This can be mitigated by adding more disk groups, as shown below. The possibility to add more than 1 SSD to the caching tier would be a very useful feature.
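Adding a second disk group to a host is straightforward once the extra devices are in place. A sketch with placeholder host and device names:

```powershell
# Sketch: give a host a second disk group (one cache SSD plus capacity HDDs)
# so losing a single cache device no longer takes the whole host out of VSAN.
# The host name and device canonical names are placeholders.
$vmhost = Get-VMHost "esx01.lab.local"

New-VsanDiskGroup -VMHost $vmhost -SsdCanonicalName "naa.SECOND_CACHE_SSD" -DataDiskCanonicalName "naa.CAPACITY_HDD_1", "naa.CAPACITY_HDD_2"
```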

Capacity optimization

With the overall lab running smoothly, everything looked good again. Unfortunately, features such as deduplication and compression are not available because we opted for a hybrid configuration (SSD for the cache tier, HDD for the capacity tier). This functionality is only available in VSAN all-flash configurations.

Since we had those features enabled on our former platform, and since we designed the new VSAN platform with about the same raw capacity, the lack of dedup, compression and erasure coding meant we nearly filled up our VSAN datastore: right now, our storage usage would hit 96% if a node were to fail. That is definitely something to keep in mind when you migrate from an old storage infrastructure to a new hybrid VSAN environment. Consider all-flash from the very beginning if you want to benefit from the capacity optimization functionality!
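To illustrate where a figure like that 96% comes from, a back-of-the-envelope calculation. The raw-capacity and usage numbers below are made up purely to show the four-node versus three-node ratio, not our actual sizing:

```powershell
# Rough capacity math for an N-node VSAN cluster (placeholder numbers):
# losing one node removes 1/N of the raw capacity, so usage jumps accordingly.
$nodes        = 4
$rawPerNodeTB = 20      # placeholder raw capacity per node
$usedTB       = 57.6    # placeholder consumed capacity

$usageHealthy     = $usedTB / ($nodes * $rawPerNodeTB)        # 0.72 -> 72 %
$usageOneNodeDown = $usedTB / (($nodes - 1) * $rawPerNodeTB)  # 0.96 -> 96 %
```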

Conclusion

To me, VSAN looks like a promising product with a bright future ahead. At SecureLink, we are convinced of this technology’s power, performance and flexibility. As I already mentioned, we are currently running our entire lab on a 4-node VSAN configuration, and the HDD failure we experienced was handled perfectly by VSAN. Unfortunately, we had more problems with our hardware vendor: despite a next-business-day SLA, it took 7 workdays to replace a simple magnetic disk.

I have to state that there are some drawbacks in the current build of VSAN 6.0 that we could not ignore:

The cross-vCenter migration puts your business-critical data and VMs at stake, which increases the risk of potential data loss. With a hybrid configuration, the lack of features like deduplication and compression needs to be taken into account when sizing your VSAN. Other hyper-converged storage vendors do offer these features, even in hybrid configurations.

With the price of SSDs dropping, hybrid VSAN configurations may well become obsolete in the near future. New features are added almost every quarter; version 6.5, for example, just came out with a load of new features. The VSAN iSCSI service allows you to create LUNs and iSCSI targets that can be exposed to external applications and servers. On top of that, the VSAN API and PowerCLI have been updated, which enables us to fully script a VSAN configuration and its management. I will write a new blog post on that topic later.
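As a small teaser of what that scripting looks like, here is a hedged inventory sketch using standard PowerCLI VSAN cmdlets; the cluster name is a placeholder:

```powershell
# Sketch: quick inventory of a VSAN cluster with standard PowerCLI cmdlets
# (the cluster name is a placeholder)
$cluster = Get-Cluster "VSAN-Cluster"

# Cluster-level VSAN settings
$cluster | Select-Object Name, VsanEnabled, VsanDiskClaimMode

# Disk groups and the number of claimed devices per host
foreach ($vmhost in ($cluster | Get-VMHost)) {
    foreach ($diskGroup in (Get-VsanDiskGroup -VMHost $vmhost)) {
        $disks = @(Get-VsanDisk -VsanDiskGroup $diskGroup)
        "{0}: disk group with {1} claimed devices" -f $vmhost.Name, $disks.Count
    }
}
```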

In short, I’m really excited about the future of VMware’s VSAN and we will continue to keep a close eye on the product’s development.

In the meantime, I will continue to experiment in our new lab.