Last week I spent some time talking to a partner, ByteLife, about a solution that they’ve created for a customer. The customer needed an automated failover solution for their VMware View environment, and this is the story of how they solved it.
First, some background on what the customer saw as their problem. Customers using VMware View sometimes struggle to handle fault tolerance in case of a site disaster. Even though VMware has a solution, Site Recovery Manager (SRM), for failing over virtual machines to a secondary site, if you are lucky enough to have one, it does not support virtual desktop infrastructures (see here for the Compatibility Matrix, no SRM support for View).
As a result, VMware View is often installed either separately in both datacenters, guaranteeing that at least half of the desktops would survive a site failure, or as a single environment that could go down entirely during a large outage.
Customers whose business and workers don’t like losing access to their applications and/or desktops during a site failure can choose a more complex setup that relies on specific manual failover tasks. The good thing is that this is possible using solutions such as this from VMware or this from EMC. On the other hand, during a site failover IT personnel are already under tremendous load and pressure to bring the site or the services back online, and any additional service to worry about just adds unnecessary complexity to the crisis. An automated failover that can be initiated with a few clicks in the remaining datacenter frees up the IT staff’s time when they need it the most.
So for this specific customer, ByteLife has developed a solution called “VMware View Failover Automation” with the following key functionalities:
- Fail over desktop pools and virtual machines in case of a site crash
- Migrate desktop pools and virtual machines for maintenance, tests, and load rebalancing between sites, or as failback after a disaster
- Restore storage synchronization between datacenters after the outage
- Integrate with the vSphere Web Client
But wait, there’s more!
For this, all you need is vCenter Orchestrator, no SRM. Yes, you read that right, no SRM. What’s even cooler is that you can actually use this for several sites; you’re not limited to just two! Imagine that, being able to fail over any VMware View site, without SRM, within minutes.
Failover of the VMware View environment takes only minutes, depending on the number and nature of the desktops and the components that are failed over. It’s been proven that the first users can resume their work in the new site less than 5 minutes after the failover is initiated, which I find pretty amazing compared to the other solutions I’ve seen on this subject. So how does this work, and what makes it so fast?
Looking at the picture above, let’s assume your current linked clone pool is called *-A-Pool and the pool that’ll be used in a failover scenario is called *-A-Pool-Recovery. The pools are exactly the same, use the same VM as base image, and some VMs are already pre-provisioned. So when failing over, all that’s done is registering the users in the *-A-Pool-Recovery pool and removing them from *-A-Pool, and then they can reconnect. Same desktop Pool ID, same everything, so it’s fully transparent to the users. Some other settings are automated as well, like the maximum number of desktops per pool. All pools are enabled all the time, so that changes and operations like recompose can be applied to all pools, keeping a consistent image version across the entire environment. All automated, and seeing it live is really impressive.
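To make the linked clone part a bit more concrete, here is a minimal Python sketch of the user move described above. It is not ByteLife’s code and it does not talk to View or vCO at all: the `Pool` class, the `fail_over_users` function and the `Sales-A-Pool` names are all made up for illustration, and the pools are just in-memory objects.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    """Hypothetical in-memory stand-in for a linked clone desktop pool."""
    name: str
    max_desktops: int
    users: set = field(default_factory=set)


def fail_over_users(primary: Pool, recovery: Pool) -> None:
    """Move every user from the primary pool to its recovery pool."""
    moved = set(primary.users)
    # Raise the recovery pool's desktop limit if it is too small
    # for the users that are about to arrive.
    needed = len(recovery.users | moved)
    if recovery.max_desktops < needed:
        recovery.max_desktops = needed
    # Register users in the recovery pool first, then remove them
    # from the primary pool, in the order described in the text.
    recovery.users |= moved
    primary.users -= moved


# Illustrative pool names following the *-A-Pool naming pattern.
primary = Pool("Sales-A-Pool", max_desktops=50, users={"alice", "bob"})
recovery = Pool("Sales-A-Pool-Recovery", max_desktops=1)
fail_over_users(primary, recovery)
print(sorted(recovery.users), recovery.max_desktops)
```

The point of the sketch is that nothing gets copied or provisioned at failover time: only the user registrations and a pool setting change, which is presumably why the first users can be back so quickly.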
But what about the manual pools? Well, they’re handled a bit differently. In case of a failure, the vCO workflow shuts down all manual VMs (if they are still reachable and running) and removes them from the vCenter inventory. The datastores are dismounted, the replication direction of the datastore is switched, and the now-primary datastore is attached to the secondary site. The VMs are then re-added to the inventory and powered on, and the manual pool is modified in the AD LDS database to move it from A to B. And of course, all user assignments are preserved. All automated, frickin awesome IMHO!
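The ordering of those steps matters, so here is a rough Python sketch that only builds the ordered list of actions for a manual pool. It is an illustration under my own assumptions, not the actual vCO workflow: the function name, the dictionary keys (`name`, `reachable`, `running`) and the VM and datastore names are hypothetical, and nothing here calls vCenter, the storage array or AD LDS.

```python
def manual_pool_failover_plan(vms, datastore, source="A", target="B"):
    """Return the ordered actions for failing over a manual pool.

    `vms` is a list of dicts with the hypothetical keys
    "name", "reachable" and "running".
    """
    plan = []
    # 1. Shut down only the VMs that can still be reached and are running.
    for vm in vms:
        if vm["reachable"] and vm["running"]:
            plan.append(f"shut down {vm['name']}")
    # 2. Remove every VM from the inventory of the failed site.
    for vm in vms:
        plan.append(f"remove {vm['name']} from inventory at site {source}")
    # 3. Storage: dismount, reverse replication, attach at the other site.
    plan.append(f"dismount {datastore} at site {source}")
    plan.append(f"reverse replication direction of {datastore}")
    plan.append(f"attach {datastore} at site {target}")
    # 4. Bring the VMs back at the target site.
    for vm in vms:
        plan.append(f"re-add {vm['name']} to inventory at site {target}")
    for vm in vms:
        plan.append(f"power on {vm['name']}")
    # 5. Finally, point the manual pool at the new site.
    plan.append(f"update manual pool in AD LDS from site {source} to {target}")
    return plan


vms = [
    {"name": "vm1", "reachable": True, "running": True},
    {"name": "vm2", "reachable": False, "running": True},
]
for step in manual_pool_failover_plan(vms, "datastore1"):
    print(step)
```

Note that the unreachable `vm2` is skipped in the shutdown step but still goes through every later step, which matches the "if they are still reachable and running" condition above.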
As this is based on vCO workflows, there’s no hardcoded input on pools or available sites; everything is collected by the Status Report, Migration, Failover and Restore Synchronization workflows. The vCO workflows list only the pools and sites that actually have entitlements and are active. Everything else is hidden, meaning you can focus on getting your stuff up and running quickly instead of having to trawl through all the possible environments that *might* be used.
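The filtering idea is simple enough to show in a few lines of Python. Again, this is just a sketch: the `selectable_pools` function and the `name`, `active` and `entitlements` keys are my own hypothetical names, not something taken from the real workflows.

```python
def selectable_pools(pools):
    """Return the names of pools that are active and have entitlements."""
    return [p["name"] for p in pools if p["active"] and p["entitlements"]]


# Illustrative data; pool and group names are invented.
pools = [
    {"name": "Sales-A-Pool", "active": True, "entitlements": ["sales-users"]},
    {"name": "Test-A-Pool", "active": True, "entitlements": []},
    {"name": "Old-A-Pool", "active": False, "entitlements": ["everyone"]},
]
print(selectable_pools(pools))
```

Only the first pool would be offered to the operator; the pool without entitlements and the inactive pool stay hidden.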
So, this can be used for failover, but also for planned migration of VMs from one site to another, for instance if you want to balance the workload between sites.
Another cool feature that came up during the discussion is that you could actually use this for recomposing large environments with very little downtime. Let’s say you’re currently using *-A-Pool as in our previous example. You could recompose the virtual desktops in *-A-Pool-Recovery and just migrate your users over there. Instead of recomposing all existing VMs, you’d move your users to already recomposed images with fresh patches and everything installed. How cool is that?!
I found it very refreshing to see a totally new take on the failover methods for VMware View environments, and I’m certain it would benefit your environment.
And lastly, some technical info:
The solution is based on VMware vCenter Orchestrator workflows. The current version of VMware View Failover Automation supports VMware View 5.1 and up, and EMC VNX with MirrorView. The network latency between two sites must not exceed 5 ms.
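If you want to sanity-check an environment against those two numbers, a tiny Python helper could look like the sketch below. The function and constant names are hypothetical, and the values are simply the View 5.1 minimum and the 5 ms latency limit stated above; how you measure the latency is up to you.

```python
MIN_VIEW_VERSION = (5, 1)  # VMware View 5.1 and up
MAX_LATENCY_MS = 5.0       # latency between sites must not exceed 5 ms


def meets_requirements(view_version: str, latency_ms: float) -> bool:
    """Check a View version string and a measured latency against the limits."""
    # Compare versions numerically so that "5.10" is newer than "5.9".
    version = tuple(int(part) for part in view_version.split("."))
    return version >= MIN_VIEW_VERSION and latency_ms <= MAX_LATENCY_MS


print(meets_requirements("5.1", 3.2))
print(meets_requirements("5.0", 3.2))
```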
Contact info for the solution:
Alar Kuuda (Project Manager) – firstname.lastname@example.org