If there are two words that can cause even the most veteran AIX systems administrator to spontaneously develop ulcers, they are these: server migration. If you've ever experienced the joy of shipping a server to a new location, trying to clone a production system down to the minutiae onto beefier hardware, or timing how to shut down and bring up a new box when a customer allows only 10 minutes of downtime, you know exactly how much fun a server migration can be. It makes Indiana Jones’ job of grabbing the golden idol and outrunning the boulder seem like a cinch. All it takes is one dropped hard drive, one missing driver, or one undocumented “feature” to make the faces of executive management start melting.
Fortunately, with the advent of some of the newer IBM Power Systems hardware, there's a new tool in the arsenal that makes server migrations a breeze. It's now possible to take a running LPAR from one POWER6 or POWER7 server and move it completely to a different server without any downtime whatsoever. There's no hassle of scheduling downtime, no worries about porting the OS, and no possibility of damaging the hardware while you're moving things.
In this article, we'll be taking a look at Live Partition Mobility (LPM), the next generation of AIX virtualization. We'll go through an overview of how LPM works, the best way to architect and design your LPM environment, how to perform LPM migrations, and a special technique for making LPM more highly available in case of unplanned downtime. I recommend that you be familiar with the basics of PowerVM technology in order to use this information effectively and deploy it well.
How LPM Works
Live Partition Mobility (LPM) takes the contents of a running LPAR on a server, duplicates all of the features and the identity of the LPAR over to a destination system, and then swings everything over, cutting its ties with the original server. When complete, there's no trace left of the LPAR on the source system, and there are no dependencies on it.
The process of moving an LPAR between two Power Systems servers is called a migration, and it has five main steps. This process is initiated through the Hardware Management Console (HMC) and relayed to Virtual I/O (VIO) servers on each system. When running, the migration’s five steps play out as follows:
1. Creates the partition on the destination system within the hypervisor, including its CPU allocation, RAM settings, and virtual I/O devices. This also moves all of the LPAR's profiles along with it (Figure 1).
2. Instructs the VIO servers on the destination system on how to map out all of the virtual SCSI and virtual fiber channel devices (Figure 2).
3. Triggers specified VIO servers to act as Mover Service Partitions (MSPs) in transferring all of the CPU, memory, and I/O in use over to the destination system. At this point, the destination system starts taking the workload (Figure 3).
4. Updates the source VIO servers to remove any mappings to the storage devices.
5. Cleans up and removes the remnants of the LPAR from the source system (Figure 4).
In order for a migration to work successfully, you must meet a number of requirements before the HMC will let the migration take place; a quick way to verify them from the HMC command line is sketched after this list. The main requirements are:
- The destination system must have enough available CPU and RAM to support the LPAR.
- The storage must be mapped ahead of time to both Power Systems from a Storage Area Network (SAN).
- Any Virtual Local Area Networks (VLANs) must be accessible to all of the VIO servers.
- The HMCs and VIO servers must be at compatible OS and software levels and all able to communicate with each other.
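A quick way to sanity-check most of these requirements is from the HMC command line. The following is only a sketch, assuming a destination managed system named p750-B; substitute your own names, and note that the exact -F field names can vary by HMC release:

    # Confirm both managed systems are visible to the HMC and powered on
    lssyscfg -r sys -F name,state

    # Check free memory (MB) and free processing units on the destination
    lshwres -r mem -m p750-B --level sys -F curr_avail_sys_mem
    lshwres -r proc -m p750-B --level sys -F curr_avail_sys_proc_units

Storage and VLAN reachability still need to be confirmed on the SAN fabric and on the VIO servers themselves.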
Next, you’ll see how to architect an LPM environment to meet these requirements, provide redundancy, and avoid some of the mishaps that can hamper migrations from working correctly.
Architecting an LPM Environment
There are three keys to creating an effective and properly designed Live Partition Mobility (LPM) environment: planning, planning, and more planning. By taking the extra time to chart everything out to the last adapter, WWN, and VLAN, you'll save yourself headaches down the road. The first step is learning how to architect your servers to use LPM and make your life easier.
Hardware Planning Phase. The very first step to creating an LPM environment is to take a complete inventory of the IBM Power Systems servers you'll be using. To do this, I recommend creating a spreadsheet for documenting your hardware. You should include things like how many CPUs and how much RAM the server possesses, but you'll want to spend extra time on documenting all of the slots in the server. Break down the inventory on a drawer-by-drawer basis and itemize what's in each slot, including Ethernet and fiber channel cards, and note how many ports or connections are on each card.
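If you'd rather not walk the data center with a clipboard, the HMC command line can dump much of this slot inventory for you. This is a sketch assuming a managed system named p750-A; treat the field list as illustrative:

    # List the I/O drawers/units in the managed system
    lshwres -r io --rsubtype unit -m p750-A

    # List every physical I/O slot, what's in it, and which partition owns it
    lshwres -r io --rsubtype slot -m p750-A -F drc_name,description,lpar_name

The output maps neatly onto the drawer-by-drawer spreadsheet described above.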
With this information, the next step is to carve out what resources will go to which VIO servers. You'll want to balance redundancy with performance to ensure that there are no single points of failure, but make sure there's sufficient bandwidth to handle the I/O that the VIO clients will require. You should also note to what SAN fabric or network switches the adapter cards will be connected, as this will become useful when it's time to zone resources.
Where possible, I've found it best to use two or more servers of identical make and model, with the same layout and hardware configuration. I also always recommend using two VIO servers per Power Systems server for availability and redundancy. Although it's possible to use LPM on a server with limited resources and a single VIO server, or to mix and match two disparate hardware types such as a POWER6 570 and a POWER7 740, keeping the servers as closely matched as you can makes it much easier to understand which VIO clients are using which resources. Even naming things like shared processor pools identically and giving them the same CPU resource options on all the Power Systems servers will make resource management (and software license management for some products) simpler during migrations.
You'll also want to make sure that your HMC is at a level that supports managing LPM, that your servers are supported and running a supported firmware level, and that you have purchased and installed the necessary Capacity on Demand (CoD) enablement codes from IBM for Advanced POWER Virtualization Enterprise Edition, Inactive Partition Mobility, and Active Partition Mobility. The links at the end of this article will help guide you to download this software and contact IBM where needed.
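You can confirm from the HMC that the enablement codes took and that the HMC and firmware levels are where you expect them. A sketch follows; the lslic syntax in particular can vary between HMC releases, so treat it as a starting point rather than a definitive recipe:

    # HMC version and release
    lshmc -V

    # Managed system firmware (Licensed Internal Code) levels
    lslic -m p750-A -t sys

    # Verify that partition mobility now appears in the system capabilities
    lssyscfg -r sys -m p750-A -F name,capabilities | tr ',' '\n' | grep mobility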
Virtual Resource Planning Phase. After you've mapped out all of your hardware, the next step is to start laying out how your VIO servers will provide virtual resources including Ethernet, virtual SCSI, and virtual fiber channel devices to the VIO clients. I again recommend using a spreadsheet for tracking this information, because things can get messy after just a few VIO clients get defined.
The first thing I do when working on my virtual resource planning is to devise an enumeration scheme. This is what will govern what devices get assigned to which VIO clients. For example, I usually follow a system like this:
- Virtual devices 10-100: virtual Ethernet
- Virtual devices 200-400: virtual SCSI
- Virtual devices 600-800: virtual fiber channel
This way, I know exactly which devices are which just by looking at their device numbers. I also recommend keeping device numbers under 1,000 so that the numbers stay the same as LPARs are migrated between Power Systems servers; in some testing a while back, I found that LPM would re-enumerate virtual devices numbered over 1,000, throwing the organization completely out the window. (NOTE: This will also come in handy during the last section of this article.)
When enumerating resources, another rule I recommend is to never use the same LPAR ID on two different Power Systems servers if the LPAR is intended to be migratable. I'll usually give the two VIO servers on each Power System LPAR IDs of 1 and 2, respectively. Beyond that, I'll give each client its own unique LPAR ID for tracking purposes. If you build out everything on one Power System alone, it will help in tracking the numbers visually.
I also advocate using a virtual SCSI disk for the root disk of each VIO client and virtual fiber channel adapters for the non-rootvg disks. This makes it much easier to boot the server initially and to manage SAN storage. Plus, should you need to patch any SAN-related drivers such as SDDPCM, it's simpler to do so on a server that sees its root disk as a virtual SCSI device instead of a complete SAN-boot solution. Just remember to map any virtual SCSI disks to all of the VIO servers you will be using in your LPM solution, and try to keep them in the same order (i.e., hdisk1 is for server1 and hdisk2 is for server2 on all the VIO servers).
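To make that split concrete, here's roughly what the two kinds of mappings look like from the VIO server's padmin shell. The device names (hdisk2, vhost0, vfchost0, fcs0) and the virtual target device name are placeholders from my example, not anything LPM requires:

    # Map a SAN hdisk to the client's virtual SCSI adapter as its rootvg disk
    mkvdev -vdev hdisk2 -vadapter vhost0 -dev server1_rootvg

    # Tie the client's virtual fiber channel adapter to a physical FC port for the data disks
    vfcmap -vadapter vfchost0 -fcp fcs0

    # Confirm both mappings
    lsmap -vadapter vhost0
    lsmap -vadapter vfchost0 -npiv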
VIO Server Build-Out Phase. Once you have everything done on paper, it's time to start building out the VIO servers. Begin by patching the HMC and the firmware on all of the Power Systems servers to the same supported and compatible levels if needed. Next, load the VIO servers with the same level of the VIOS software. This way, everything is kept in sync, and it makes patching and upgrades consistent when you need to install new software.
After the VIO servers have been loaded, shut them down and create a new profile for each of them with all of the virtual resources that were planned out above. As you create any virtual fiber channel devices, copy the WWNs into your spreadsheet so you can provide them to the SAN administrator for mapping; this lets devices be zoned in advance, and sometimes SAN switches don't auto-detect the WWNs when they're brought online the first time. Also make sure that any virtual Ethernet devices have all of the necessary VLANs included at this time to make use of any trunked connections.
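You don't have to copy the WWNs out of the GUI by hand; the HMC CLI will list them from the partition profile. A sketch, assuming a client named server1 on managed system p750-A (the virtual_fc_adapters field includes the pair of WWPNs generated for each client adapter):

    # Show the virtual fiber channel adapter definitions, including the generated WWPNs
    lssyscfg -r prof -m p750-A --filter lpar_names=server1 -F virtual_fc_adapters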
Finally, boot up the VIO servers once more and create any Shared Ethernet Adapters (SEAs) that will be used by the VIO clients. I find it's also good to create a VLAN device on one of the SEAs so you can connect to the VIO servers remotely and administer them without a console.
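On the VIOS command line, the SEA and the administrative VLAN interface look something like the sketch below. The adapter numbers, VLAN ID, and IP details are placeholders (here I assume ent0 is the physical adapter, ent2 is the trunked virtual adapter, the SEA comes back as ent4, and the VLAN device as ent5/en5):

    # Create the Shared Ethernet Adapter bridging the physical and trunked virtual adapters
    mkvdev -sea ent0 -vadapter ent2 -default ent2 -defaultid 1

    # Create a VLAN device on top of the SEA for administering the VIO server remotely
    mkvdev -vlan ent4 -tagid 100

    # Put an IP address on the resulting interface
    mktcpip -hostname vio1 -inetaddr 192.168.100.11 -interface en5 -netmask 255.255.255.0 -gateway 192.168.100.1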
VIO Client Build-Out Phase. Now that your VIO servers have been built and configured, create all of your VIO clients and assign them the virtual resources from your spreadsheet. This is the most tedious part of the architecture process, so you might want to consider using the IBM System Planning Tool (SPT) to help facilitate the build-out.
When you have completed the build-out of the VIO clients, log onto the VIO servers and define all of the virtual SCSI and virtual fiber channel mappings. When creating virtual SCSI devices for any hdisks assigned to the VIO server from the SAN, you'll need to convert the client's LPAR ID into hexadecimal to determine which vhost belongs to which client. It's important to note, too, that although the vhost numbers may initially be sequential, these numbers can and will change after migrations take place. This is why it's so important to track all of the virtual device numbers early on.
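For example, a client with LPAR ID 23 shows up in the lsmap output as client partition ID 0x00000017. A quick sketch of tracking that down (the hex conversion can be done in any AIX shell; the lsmap pipeline runs in the padmin shell, where the -p flag to AIX grep prints the whole matching stanza):

    # Convert the decimal LPAR ID to the hex form that lsmap displays
    printf "0x%08x\n" 23        # prints 0x00000017

    # Find the vhost adapter(s) serving that client partition
    lsmap -all | grep -p 0x00000017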
Lastly, install the AIX OS onto your VIO clients and build them out as you would any LPAR. I strongly recommend using NIM to help facilitate these build outs and for cloning similar types of servers.
Completing the Planning. At this point, everything should be up, available, and ready to go. You should have two or more Power Systems servers with two VIO servers apiece, and a set of VIO clients running AIX on one of the servers awaiting migration. The planning, while tedious, is necessary to make LPM work as smoothly as possible.
Performing an LPM Migration
Now that you have everything set up and ready to go from the previous discussion, it's time to put Live Partition Mobility into action and perform a migration of a live AIX system. When this process is complete, you'll have an LPAR on your second Power Systems server and the system will have never skipped a beat.
Migration Prep Work. Before you start the LPM migration, open up a connection to all of your VIO servers, your VIO client that will be migrated, and your HMC. Take a moment to verify all of the device paths for your virtual SCSI and virtual fiber channel connections on the VIO servers with the lsmap command. You should see the paths for your VIO clients only on the VIO servers on the same Power Systems server; the other VIO servers should have no paths for the VIO clients. Examine the Power Systems servers on the HMC and confirm that the LPARs are laid out correctly. Also confirm that the Power Systems server to which you'll be migrating the LPAR has sufficient free resources (as mentioned earlier). Lastly, log onto your VIO client and run the topas command so you can watch the server performance as the migration takes place.
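Here's the same prep work sketched from the command line. The system and LPAR names are placeholders, and the migrlpar validation pass (-o v) runs the same checks as the GUI wizard without actually moving anything:

    # On each VIO server: list the current virtual SCSI and NPIV mappings
    lsmap -all
    lsmap -all -npiv

    # On the HMC: run a validation-only pass of the planned migration
    migrlpar -o v -m p750-A -t p750-B -p server1

    # On the AIX client: watch performance while the migration runs
    topas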
Starting the Migration. With the prep work complete, you're ready to kick off the migration itself.
The migration is driven through the HMC, and the interactive menu is similar to other administrative functions such as Dynamic Logical Partitioning. Select your LPAR through the HMC and perform the following steps (the HMC command-line equivalent is sketched after this list):
- Navigate through the pop-up menu button or the bottom frame in the HMC and select Operations -> Mobility -> Migrate. This will bring up the Partition Migration wizard screen.
- In the first menu for Migration Information, don't check any of the boxes to override virtual storage or network errors—you'll want to see any errors along the way.
- In the menu for Profile Name, you can leave the field blank. You would only want to specify a profile name if you don't want your current / default profile modified.
- In the menu for Remote HMC, you can leave the fields blank as well, unless you plan on switching things to another HMC for management.
- In the Destination System menu, pick the target Power Systems server onto which you would like to migrate the LPAR.
- In the Validation Errors / Warnings menu, you should expect to see potentially two messages—one to warn that as a part of the migration process, the HMC will create a new profile with the partition’s current state, and one message stating that if any virtual terminal windows are open at the time of migration, they might be closed or lose connectivity. If you encounter any other errors or warnings, you'll likely need to investigate and resolve them before you can progress further in the migration.
- In the Mover Service Partitions menu, you'll need to pick two of the VIO servers to act as Mover Service Partitions (MSPs). These servers will manage the handoff between the two servers. You should be able to pick any pair of servers with no problems. However, if your VIO servers are low on resources (in particular, RAM), you might want to pick two servers with the highest amount of resources available, temporarily adding them through Dynamic Logical Partitioning if needed. Otherwise, the migration could potentially affect VIO client servers that depend upon the VIO servers.
- In the VLAN Configuration menu, confirm that the VLAN IDs that you selected for your VIO client are present on both VIO servers on the target Power Systems Server. If not, you might not have redundancy and/or communications might be interrupted.
- In the Virtual Storage Adapters menu, you'll need to select which VIO servers will manage which Source Slot IDs for your VIO client. This is where cross-referencing with the information on your spreadsheets is important, because if your VIO servers haven't been zoned to the correct adapters, you'll lose connectivity. In my experience, I always find that I have to manually select the exact opposite of what the HMC chooses by default for some odd reason.
- In the Shared Processor Pools menu, select the shared processor pool into which your LPAR will go if you have multiple shared processor pools. The profile for the LPAR might specify a shared processor pool, or it might be actively a member of a specific pool, but it won't automatically select that pool number for you. Again, refer to your spreadsheet for help in picking the correct one.
- In the Wait Time menu, you can specify for how long the operating system should wait in minutes for applications to acknowledge the impending migration. I tend to leave this at the default of five minutes, although it could be increased for servers with large quantities of resources or high performance requirements.
- In the Summary menu, you'll get a complete synopsis of the plan for the LPAR migration. If everything looks correct, click the Finish button and the migration will begin.
- As the migration runs, a small window will pop up with a progress indicator. The percentage of completion can sometimes sit for several minutes at 0 percent before suddenly jumping to a number such as 47 percent complete. Likewise, as a migration completes, it can sit at the 100 percent mark before the Close button becomes active. During this time, I recommend watching the reference codes on the HMC to see exactly what is happening as the LPAR is cloned over to the other Power Systems server. Likewise, if you watch the topas command on the AIX system, you'll see a message flash at the bottom that says, “Migration in process,” and you might notice a change in the server’s resources, such as higher CPU utilization or increasing entitlement values.
- Once the migration is complete, the Close button will become active. Click the button, and the migration is complete.
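If you'd rather script the migration than click through the wizard, the HMC CLI can drive and monitor the same process. This is a sketch with placeholder names; leaving off the optional flags roughly corresponds to leaving the wizard fields blank, and the exact -F field names may vary by HMC release:

    # Validate first, then perform the active migration
    migrlpar -o v -m p750-A -t p750-B -p server1
    migrlpar -o m -m p750-A -t p750-B -p server1

    # From another HMC session: watch the migration state
    lslparmigr -r lpar -m p750-A -F name,migration_state

    # Watch the reference codes as the partition comes up on the destination
    lsrefcode -r lpar -m p750-B -F lpar_name,refcode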
Post-Migration Activities. When the migration is complete, take a look at the Power Systems servers through the HMC and run the lsmap commands on the VIO servers once more. You should see no trace of the LPAR on the original server, and all of the virtual SCSI and virtual fiber channel mappings should now be on the other system's VIO servers. You might notice that the vhost and vfchost numbers have changed along the way; this is completely normal. Although the enumeration for those devices may differ, the slot IDs you chose back when planning out the environment should have remained the same. I recommend performing several migrations to get used to the process.
Wrapping Up Migration. LPM is one of the best technologies to come along for AIX servers since the original introduction of LPAR technology at the turn of the century. But there's one thing that LPM lacks: high availability (HA). We'll now look at a very cool trick for using LPM to create a more highly available solution.
Leveraging LPM for High Availability
There's one big flaw with Live Partition Mobility (LPM): If the hypervisor on a Power Systems server is inaccessible, possibly due to a hardware failure, power loss, or the server being in a powered-off state, it's impossible to migrate any of its LPARs. The HMC has to be able to interact with the hypervisor and VIO servers in order to take the resources from that server and bring them over. At this time, there's no way to pull over an LPAR’s profile and activate it on another Power Systems server.
To get around this, the typical recommendation is to invest in high-availability software, such as PowerHA or Veritas Cluster Server. But these solutions come with two main drawbacks. First, they cost additional money to license and manage. Second, you'll need to have twice as many LPARs up and running with similar resource configurations in case of failure, which can take CPU, RAM, and I/O resources away from the larger computing environment. But in testing LPM in some different situations, I learned a very cool trick to make an LPM environment highly available with no additional software, cost, or use of resources.
The Stub LPAR Solution. To make your LPM environment highly available, I recommend creating what some have called “stub LPARs,” or “cold LPARs,” on your other Power Systems servers. These LPARs are exact copies of your active LPARs, with the same resource allocations, I/O configuration, and identities as your AIX systems, except for three things. First, these LPARs will have different slot numbers for the virtual SCSI and virtual fiber channel devices, using an enumeration model similar to the one described earlier in this article. Second, those devices will not be mapped on the VIO servers until absolutely needed. And third, the LPARs will remain inactive until they're needed.
For example, imagine that you have an AIX system, server1, on one of your p750 servers. This system has LPAR ID 10, 1 CPU, 2GB of RAM, a virtual Ethernet adapter for VLAN 400, two virtual SCSI adapters that go to each of the redundant VIO servers (slots 210 and 220, respectively), and two virtual fiber channel adapters that do the same (slots 610 and 620, respectively).
What you would do is create an LPAR on another p750 server and give it the same CPU, RAM, and virtual Ethernet resources. But you would alter it, naming it server1-stub, and change the enumeration. You would give it an LPAR ID of 110, two virtual SCSI adapters with slot numbers of 310 and 320, and two virtual fiber channel adapters with slot numbers of 710 and 720. The WWNs for these devices would be mapped identically to the original server’s zoning on your SAN, including all disks and LUNs.
In the event that the p750 that contains server1 dies, you would go to the other p750 and log onto its VIO servers. There, you would map the appropriate hdisks and fiber channel adapters to the vhosts and vfchosts that correspond to the server1-stub LPAR’s slots. By doing this, you're establishing a hardware configuration that allows you to access the storage just as it was set up on the original Power Systems server.
Once this is done, you can activate the LPAR and boot into SMS to select the root disks through the bootlist. After this point, the server1-stub LPAR will bring up the AIX system just as though it was on the original p750. The only difference is that it will be on different virtual SCSI and virtual fiber channel devices on the server (viewable with lsdev), and any commands such as lspath will list some of the paths as missing because it will have retained the original devices in the system’s ODM. But, functionally, it will be on the same IPs, have the same file systems mounted, and the same internals as if it were on the original p750.
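Pulled together, the recovery runbook for the stub looks roughly like the sketch below. All of the names and numbers are placeholders from the example above (server1-stub on the surviving p750, with made-up vhost/vfchost, hdisk, and profile names); the real adapter names come from lsmap on the surviving VIO servers and from your spreadsheet:

    # On each surviving VIO server: map the storage to the stub's virtual adapters
    mkvdev -vdev hdisk2 -vadapter vhost3 -dev server1_rootvg
    vfcmap -vadapter vfchost3 -fcp fcs0

    # On the HMC: activate the stub LPAR, stopping in SMS to pick the boot disk
    chsysstate -r lpar -m p750-B -o on -n server1-stub -f default -b sms

    # Once AIX is up: confirm and, if needed, reset the boot list
    bootlist -m normal -o
    bootlist -m normal hdisk0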
Considering the Stub Server Solution. In determining whether the stub server solution is worth pursuing, you should consider the needs of the business and the availability of the servers. Is it worth the time and planning to set up? Would it be better to have an automated HA solution? Do you have the resources to support a fully developed HA environment? To help you answer these questions, here are some pros and cons I've weighed for deploying this solution:
Pros
- No additional cost or licensing fees. It uses what you have already paid for in LPM.
- Resources can be used for other LPARs instead of sitting idle waiting for a failure.
- It eliminates the possibility of cluster panics, split-brain scenarios, and other occasions when HA software may inadvertently cause a failover of resources.
- Time-wise, it can often bring resources back online more quickly than HA software.
- Once a stub LPAR is active, the LPAR can be migrated elsewhere through LPM as desired.
Cons
- It requires manual intervention in order to bring the stub LPAR online. (But, then again, I don’t know of any systems administrator who doesn’t get involved when HA software would trigger a failover or failback.)
- Resources might have to be juggled to accommodate for the stub LPAR being activated. In a well-designed environment, this should be anticipated. Actions like shutting down non-essential development and test servers can alleviate this problem.
- It isn't a solution supported by any vendor. If you contact your support technician and try to explain what a stub LPAR is doing, they might not understand what you're trying to accomplish.
- There is some minor hypervisor overhead for storing the profiles for all of the LPARs, even if they remain inactive.