This guide provides ideas and best practices for the availability and recovery of a Citrix App Layering solution. It defines the components of the infrastructure and suggests ways to provide availability as well as local and disaster recovery. The guide is intended only to provide ideas. The actual solution you design as a Workspace Engineer or Architect will be based on your organization's specific availability requirements, your chosen technologies, and your specific implementation.
The audience for this guide includes customers, field consultants, and sales engineers who want to deploy a Citrix App Layering solution into production.
Out of scope
While this guide describes availability of connection brokers, recovery and availability of your VDI broker infrastructure is outside the scope of this guide. Refer to the documentation and support teams for your desktop connection broker software for more detailed information about their recovery options.
Conventions and terminology
This guide discusses many workspace solution concepts. To make them easier to follow, the conventions and terms used throughout are defined here.
Application Layer
A shared layer that includes one or more applications that you assign to hosted virtual desktops. A desktop is a composite of an Operating System Layer, several Application Layers and a Platform Layer.
App Layering Management Console
The Web-based management console that allows you to manage all of the components in the App Layering environment. This console is located on the ELM.
Desktop
A hosted virtual machine that is a local composite of the layers assigned to it and the elastic layers assigned to its user. Desktops are deployed outside of Citrix App Layering using images created by Citrix App Layering.
Gold Image
A virtual machine that you use to create an Operating System Layer. The gold image contains a Guest Operating System, configuration settings for virtual hardware (disks, CPUs, network cards), and optionally, a set of applications.
Enterprise Layer Manager (ELM)
A virtual appliance that coordinates all of the communication in the Citrix App Layering environment. It includes the App Layering Console and the management infrastructure that controls the workflow of managing layered images and elastic layers.
Managed Machine (MM)
Refers to a VDI Desktop or a Session Host/Citrix Virtual Apps Server.
Operating System Layer
A shared layer that includes the Operating System that you assign to hosted virtual desktops. A desktop is a composite of an Operating System Layer, several Application Layers and a Platform Layer.
Packaging Machine
A special type of virtual machine that acts as a staging area for the creation of Application Layers as well as versions of Operating System and Application Layers.
User Layer
A writable Elastic Layer for a user. When the user layer is enabled, all files and registry settings written by the user go into this VHD file. The User Layer allows for saving user-installed applications, files and the entire user profile. User Layers are defined per user and per Operating System Layer.
Citrix App Layering Agent
An agent that runs on a PVS server or another Windows server and is used to download PVS vDisks from the ELM and process PowerShell scripts for uploaded images.
Understanding the App Layering infrastructure components
App Layering is used to provide image and application delivery to a shared Workspace solution like Citrix Virtual Apps, Citrix Virtual Apps and Desktops or VMware Horizon View. Any availability strategy for App Layering must fit into the overall availability and recovery design for the whole workspace solution.
The good news here is that everyone using App Layering is starting with a highly available local solution where each pool/delivery group is intrinsically highly available because it is spread across hosts and pools of storage.
Storage can be made highly available in different ways. Most organizations use a storage array with a high degree of redundancy, including multiple storage processors or heads and RAID technology. However, it is also possible to obtain higher availability using local storage based on a local RAID controller and flash disks on each host and spreading MM’s across hosts. If a host fails for any reason, other hosts are already running MM’s from the pool of available desktops and can pick up the user sessions. Of course, many vendors also have a solution to provide high levels of storage redundancy using local storage virtualized into a SAN. The key is to determine what works for your organization from a complexity, cost and availability standpoint while understanding the availability provided by your solution.
Networks can also be made highly available fairly easily because hypervisors are designed with a substantial amount of availability in mind. But it is important to take advantage of this to ensure you have at least two network paths available to each MM in your environment.
The App Layering appliance is a CentOS-based virtual appliance. This appliance hosts the App Layering console, all App Layering logic, and the App Layering database, which includes the definitions and settings for all connectors and layers. The appliance is also where the layer library is stored. The layer library is just a virtual disk broken into several folders to store OS, App and Platform Layers.
If you were to browse the layer repository it would look something like this:
Note: Platform layers are stored with App layers.
The great thing about this design is that everything about layers is stored in the appliance. If the appliance is backed up you have a significant part of the App Layering infrastructure available for recovery.
It is also important to understand that MM’s never communicate directly with the ELM appliance. Images created by the ELM are published into the format and location required by the provisioning system. Elastic layer assignment is controlled by several JSON files stored on the elastic share, and the user layer share is assigned by AD group, again based on JSON files in the elastic share.
The ELM appliance is used to create layers, layer versions, and layered images. It is also used to deploy elastic layers to the elastic layers share and to configure the JSON files used to assign layers to users. If the ELM appliance is down, these tasks are affected, but MM’s and existing elastic layers are not.
The ELM appliance is a normal virtual machine with a disk configured to boot the OS and store its component parts, such as the web server and the MySQL database. The appliance also has one or more large virtual disks used to store layers.
The ELM appliance should be backed up using some type of virtual machine backup to storage that is different from the storage that holds the appliance. Most organizations will use their normal VM backup solution. If no whole-VM backup is available, a good solution is to make a clone of the appliance after shutting it down. This can be done manually or scripted. The frequency should match the desired Recovery Point Objective (RPO) for this solution.
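The relationship between backup frequency and RPO can be checked with very little logic. The following is a minimal sketch in Python, not part of App Layering: the function name `backup_meets_rpo` and the example timestamps are assumptions made for illustration, and the timestamp of the newest backup or clone would have to come from your own backup tooling.

```python
from datetime import datetime, timedelta


def backup_meets_rpo(last_backup: datetime, now: datetime, rpo: timedelta) -> bool:
    """Return True if the newest backup is recent enough to satisfy the RPO."""
    return now - last_backup <= rpo


# Illustrative values only: a clone taken at 2 AM, checked at 9 AM, 24-hour RPO.
now = datetime(2024, 1, 10, 9, 0)
last_clone = datetime(2024, 1, 10, 2, 0)
print(backup_meets_rpo(last_clone, now, timedelta(hours=24)))  # True
```

A check like this could run after each scheduled backup or clone and raise an alert when it returns False.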
Backup products that support changed block tracking are preferred because of the size of the layer repository.
If the Recovery Time Objective for this solution is very short, you may have to consider using a SAN/NAS solution that supports snapshotting at the storage level. This will not help if the storage is damaged, but it will certainly help if the appliance VM files are damaged or a user error occurs, for example when many layers are deleted because of a miscommunication.
It is also possible to keep two ELM appliances in sync using the layer import/export functionality added in App Layering 4.3. This is currently a manual process, but layers can be exported to a share and imported to another appliance from that share. Connectors and image templates would have to be recreated manually if using this method to sync appliances.
Elastic Layers are layers mounted from an SMB share just before logon. Normal Elastic layers are mounted read-only by many desktops at once. Once mounted, an Elastic layer is never dismounted or remounted; the mount only happens at logon. For a Citrix Virtual Apps server or Session Host, the mount happens whenever a user who is assigned an elastic layer logs in, if the server has not already mounted that layer. If there is a network interruption, the Elastic Layer will be available for the end user at next login.
Therefore, to provide true high availability for these layers a highly available share is required. Highly available shares can be provided by File Server clusters or multiple head NAS devices.
We are often asked if DFS-R provides higher availability for elastic layer shares. The answer is not really. If one of the file servers sharing the DFS links fails, elastic layers will fail on all of the MM’s that have mounted them from that file server. If the users log off and back on, the MM’s will reboot and obtain a mount on the replicated share, but on their original machine the elastic layers will not work until that logoff and reboot.
In order to scale it may be necessary to provision multiple elastic shares for application layers. The App Layering console will only populate a single share with layers. Then these layers can be copied either manually, with a script or using DFS-R. To use an alternate location the following registry key on the clients can be manipulated:
Value = \\unideskfs1\unidesk
This setting can be defined using a GPO or GPP applied to different machine OU’s.
All layers are stored on the ELM in the layer repository. If the share ever has to be recreated, it is possible to re-publish all the elastic layers to a new file share, but it is not quick or easy. You would have to select each layer and perform the Update Assignment process. This will check for the layer and copy a new vhd to the share if there is not one already there.
Of course, the layers are just vhd files stored on the share. They are opened read-only, so it is fairly easy to back them up using a file system backup utility or a script. If your design includes two separate shares for elastic layers and you keep them in sync, then a backup is probably not necessary since you also have a copy in the ELM and a backup of the ELM.
One easy way to synchronize the elastic layer share is with a robocopy script using the /MIR directive. To run this on a schedule, create a cmd script similar to the following:
robocopy \\unideskfs1\Unidesk\Unidesk\Layers \\wem\LayerShare\Unidesk\Layers /MIR /MT /log:D:\layercopy_log.txt /np
This will keep the two folders in sync. Here we are just syncing elastic layers which are those stored in the Unidesk\Layers folder.
The /MIR directive copies new and changed files to the destination and removes destination files that no longer exist in the source folder, which is the first folder named on the command line.
The /MT switch is optional. It makes the copy multithreaded. The default number of threads is 8, which means it will copy 8 files at a time. In testing, this more than doubled throughput, reaching 800-900 Mbps on a 1 Gbps network link using SMB 3.0.
The /np switch suppresses the per-file copy progress, which makes the logs more readable.
In my test lab, I created this script and have it run at 2 AM every night.
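To make the effect of a mirror copy concrete, the following Python sketch computes which files a mirror would copy and which it would delete. It is an illustration of the idea only: robocopy's real change detection is more detailed than the simple size and timestamp comparison used here, and the function name `plan_mirror` and the file names are hypothetical.

```python
from typing import Dict, List, Tuple

# A file listing maps a relative path to (size_in_bytes, modified_timestamp).
Listing = Dict[str, Tuple[int, float]]


def plan_mirror(source: Listing, dest: Listing) -> Tuple[List[str], List[str]]:
    """Return (files_to_copy, files_to_delete) that would make dest match source."""
    # Copy anything that is missing from dest or whose size/timestamp differs.
    to_copy = sorted(p for p, meta in source.items() if dest.get(p) != meta)
    # Delete anything in dest that no longer exists in source.
    to_delete = sorted(p for p in dest if p not in source)
    return to_copy, to_delete


# Hypothetical listings of the two elastic layer shares.
source = {"app1.vhd": (100, 1.0), "app2.vhd": (200, 2.0)}
dest = {"app1.vhd": (100, 1.0), "old.vhd": (50, 0.5)}
print(plan_mirror(source, dest))  # (['app2.vhd'], ['old.vhd'])
```

The delete list is the reason a mirror should always be pointed in the right direction: swapping source and destination would remove layers from the primary share.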
By default, user layers are stored on the same share as normal elastic layers. However, the requirements for user layers are very different from those for normal elastic layers. User layers are write-intensive, whereas elastic layers are read-only. Most organizations will likely use a different file share or even file server for user layers, one that is optimized for writes. If the user layer share is different from the elastic layer share, user assignment will be defined by AD user groups.
User Layer assignment is defined in the System>Storage Locations tab within the App Layering Console. You enter the share and the group associated with the share.
The most important fact to know about user layers is that they are locked at the file level while in use. To back them up, it will be necessary to do it at the SAN/NAS level or when they are not in use.
Backing up User Layers is much harder than backing up Elastic Layers because the User Layer vhd file is opened for write whenever the user is logged on. User layers are also large and change constantly. A tool like robocopy is therefore not a great solution: even if you can lock the file to copy it, you would have to copy a very large file every time. You will be much better off using something like SAN replication or NetApp’s SnapMirror to replicate the user layers locally as a backup at the block level rather than copying the entire vhd file. If you don’t have one of these advanced technologies, it might work to spread the copy load over a couple of weeks so that there is less to copy every night. This could be scripted using PowerShell to ensure you get a backup at least once every x days.
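One way to spread the copy load is to give every user layer file a fixed slot in a repeating cycle and copy only the files whose slot matches the current night. The sketch below shows that selection logic in Python rather than PowerShell. The function names, the 14-day cycle and the file names are assumptions made for illustration, and the actual copy would still have to run when the file is not in use.

```python
import zlib
from typing import Iterable, List


def bucket_for(name: str, cycle_days: int) -> int:
    """Map a user layer file name to a stable day slot in the cycle."""
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(name.lower().encode("utf-8")) % cycle_days


def due_today(names: Iterable[str], day_number: int, cycle_days: int) -> List[str]:
    """Return the files whose slot matches today's position in the cycle."""
    today_slot = day_number % cycle_days
    return [n for n in names if bucket_for(n, cycle_days) == today_slot]


# Hypothetical user layer file names and a 14-day cycle.
layers = [f"user{i:03d}.vhd" for i in range(100)]
cycle = 14
# Every file is selected exactly once over one full cycle.
selected = [n for day in range(cycle) for n in due_today(layers, day, cycle)]
print(sorted(selected) == sorted(layers))  # True
```

In practice `day_number` could be derived from the calendar date, for example `datetime.date.today().toordinal()`, so that each file comes up for backup once every `cycle_days` nights.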
Component multisite disaster recovery
The approach to disaster recovery can be similar to local recovery. For the image side of things, the quickest way to keep images in sync is to use some type of replication process for the images. If you are using PVS, it may be as simple as using robocopy to copy your vDisks across to the secondary site. If you are using MCS or Horizon View on vSphere, you will need a process to replicate virtual machines. Many solutions are available, such as Veeam, Zerto, VMware vSphere Replication or Site Recovery Manager. This will also work to protect the ELM.
For the Elastic Layers, SAN replication or a scripted copy can both work. Of course, if you are using User Layers, then you will need something efficient at the SAN/NAS level so that changed blocks can be replicated underneath the clustered file system used for the share.
The reason this approach is better than having multiple connectors defined in the ELM and publishing directly to both sites is that when publishing we have to both compose the image and upload it to the store. If you use a process that will just replicate the already created image it will skip the composition process and therefore be more efficient. One caveat to this is that if you want a different configuration for the images deployed to DR then it would be better to publish directly to DR from the ELM because you could then have different layers defined in the Image Templates for DR. This is also a benefit of the Dual ELM Model.
It is also possible to use two ELM appliances, one in each site, and then use the import/export functionality added in App Layering 4.3 to keep those ELMs in sync from a layer perspective. Then you can treat DR separately and build images there from a local ELM.
If this option is chosen, the sync will transfer over the WAN to the SMB share defined in Settings and Configuration. The layers can then be synchronized to the SMB share used in the second site using something like Robocopy, again with the /MIR switch. It would also be possible to develop a solution that syncs only the desired layers rather than all of them. If this is of interest, contact your App Layering solution architect for more details. Currently the import and export process must be kicked off manually.
In the Dual ELM model, connectors and permissions for elastic shares must be created on each side. The only objects that get imported are the layers themselves. However, this model also makes it possible to have different layers in each site as needed, for example in a true active/active site scenario.