Sunday 17 March 2013

Practical AIX troubleshooting


Many of you probably remember the commercials from IBM that aired during Monday Night Football in the 1980s called, "You Make The Call." The spots would show an interesting play that had happened on the field. The narrator would explain what the players did, highlighting the questionable nature of the event, and then query the viewing audience as to what they would have ruled, saying, "You make the call." After a brief ad for IBM products or services, the narrator would come back and summarize the decision the referee made and what guidelines he used. It was a good way to learn about American football and have some conversations around the dinner table.

Frequently used acronyms

  • LPAR: Logical partition
  • SAN: Storage area network
Just like that ad, in this article, you have a chance to make the call in the realm of troubleshooting practical problems in IBM AIX®. You get the tools and knowledge to triage, test, and temper your skills to solve some of those particularly vexing problems you might encounter. The article provides a couple of real-world, interesting situations that I have come across, gives you the steps to detect the anomalies, and pauses to give you a moment to deduce what was wrong before giving the answer.
Let's set the groundwork with a couple of problems I have run into as a systems administrator.
I needed to migrate an AIX 5.3 LPAR from an older IBM pSeries® p670 server on POWER4™ to a brand new pSeries p570 server on POWER6®. The older server was short on resources and relied on Workload Manager to ration what it had among the main applications, so the dynamic processor resources available on the new hardware seemed like the perfect way to give me the power I needed. I took a mksysb of the LPAR, restored it to the new hardware by using Network Installation Manager, and mapped over the SAN disks.
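At a high level, the mechanics of that move look something like the sketch below. The image path, NIM resource names, and client name are hypothetical placeholders, and the exact NIM resources you need (SPOT, lpp_source, and so on) depend on how your NIM environment is set up.

    # On the source LPAR: build a bootable system backup, regenerating image.data first
    mksysb -i /backup/mylpar.mksysb

    # On the NIM master: register that image as a mksysb resource
    nim -o define -t mksysb -a server=master \
        -a location=/export/mksysb/mylpar.mksysb mylpar_mksysb

    # Kick off a BOS install of the mksysb to the client defined for the new LPAR
    nim -o bos_inst -a source=mksysb -a mksysb=mylpar_mksysb \
        -a spot=spot53 -a accept_licenses=yes mylpar

    # After the restore boots, confirm the SAN disks mapped to the new LPAR are visible
    lspv
    lsdev -Cc disk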
I booted up the LPAR, and all appeared well, until the applications were started. Immediately, users started calling in. They couldn't access their products at all. When I logged in, I found that the server was completely idle. None of the processes were taxing the server at all. Why were the users having problems?
I had a server with mirrored root disks. One day, the error report started logging problems about a bad block being unable to relocate itself on one of the disks. Knowing that this was symptomatic of an impending hardware failure, I began breaking the mirror. But the server said that it could not completely break the mirror, because the only good copy of one of the logical volumes was on the dying disk. How could I overcome this problem and replace the hardware?

With these two sample problems in mind, let's dive into the process for troubleshooting them.
At the first sign of trouble, the smartest thing to do is freeze. Much like Indiana Jones in "Raiders of The Lost Ark," if you get the idea that the floor might cause blow darts to come shooting in your direction, stop where you are, and don't continue merrily sprinting across the ground. Additional changes may only compound the problem and potentially worsen the situation. There is no sense in having to address multiple problems when a single good one can affect uptime all by itself.
For the first sample problem, I had the users log off immediately, and I halted the applications. Knowing that it was possible that the users' data could be compromised when poor performance halted their queries and inputs, I didn't want their environment altered any further without first taking a look at the situation. Although the users weren't happy to hear that they couldn't use their new, beefier server at that moment, they were grateful that I was exercising all due caution. Plus, this gave me time to start working my way down the rest of the troubleshooting steps.
When I studied kung fu, I heard a story of a second-degree black belt who had disabled someone trying to steal her purse at a bus stop. The class was curious what technique she had used to take down her attacker. Was it the golden tiger style? Did she use the circular motions of pa kua? Or maybe, we wondered, she got really exotic and used the eight drunken immortals to bring him down. It turned out to be none of the above: She used one of the very first techniques a white belt learns in class—a firm elbow to the chest followed by a punch to the nose.
AIX provides a plethora of commands for examining the most granular facets of servers—both hardware and software. Even the most basic of commands provides a great basis for analyzing problems. And when there isn't enough information or things still don't behave properly, you can work your way into more complex and powerful options. But start with the simplest of commands and ideas before breaking out the big guns.
For example, the AIX errpt command is one of the greatest basic tools found in any flavor of UNIX®. It is a one-stop shop for all sorts of information about hardware and software problems. Adding the -a flag, or the -j option with an error identifier, produces more detailed output describing the type of problem, which components were affected, and how the system reacted based on the type of error. And if that doesn't provide enough information, you can interrogate the system further with the diag command, running specific tests on various pieces of hardware and the operating system.
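A first pass at the error log usually looks something like this sketch; the error identifier and disk name are placeholders rather than values from either of these incidents:

    # Summary view of the error log, newest entries first
    errpt | more

    # Full detail for every entry: error class, resource name, probable causes
    errpt -a | more

    # Full detail for a single error identifier (the hex ID here is a placeholder)
    errpt -aj A1B2C3D4

    # Hand a suspect device to the diagnostics subsystem for targeted testing
    diag -d hdisk0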
In the case of the second sample problem, after I detected the hardware problem by looking at the errpt output, I used the unmirrorvg command, a simple but powerful utility, to try to break the mirror instead of running an rmlvcopy for every single logical volume on the disk. And when I found that I couldn't remove the one remaining logical volume, I went to other basic commands like lspv, lsvg, and migratepv to gain information. I tried extendvg and mirrorvg to create another copy of the volume group on another disk. That still left some stale partitions out there, so I went deeper with syncvg and synclvodm to reconcile the Object Data Manager with the server. Eventually, I went to migratelp to try to move the individual logical partitions off of the disk. Unfortunately, none of these tools worked, but they did give me tons of information.
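That escalation, condensed into a sketch, ran roughly as follows. The disk names (hdisk0 as the failing mirror, hdisk2 as the spare) and the hd4 logical volume are illustrative, not the actual devices from the incident:

    # Try to break the mirror away from the failing disk
    unmirrorvg rootvg hdisk0

    # Survey what is still living on the suspect disk and the state of the volume group
    lspv -l hdisk0        # logical volumes that still have partitions on hdisk0
    lsvg -l rootvg        # look for LV copies marked stale
    lslv -m hd4           # map one LV's logical partitions to physical partitions

    # Add a spare disk to the volume group, try to evacuate the failing disk,
    # and rebuild another copy of everything elsewhere
    extendvg rootvg hdisk2
    migratepv hdisk0 hdisk2
    mirrorvg rootvg hdisk2

    # Resynchronize stale partitions and reconcile the ODM with what is on disk
    syncvg -v rootvg
    synclvodm -v rootvg

    # Last resort: move an individual logical partition off the bad disk
    migratelp hd4/1 hdisk2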
In the scientific method, one critical point of any hypothesis and testing is the ability to re-create and reproduce the process with the same outcome. Failure to do so makes for inconclusive results at best. At worst, it can ruin ideas and tarnish reputations, like the physicists who claimed to have produced cold fusion at room temperature in 1989.
Or, as I jest: If at first you don't succeed, see if you can break it somewhere else.
When working on AIX servers, if something goes wrong along the way and you have the resources to duplicate the problem, try to see whether the same actions yield the same results on another, similar type of LPAR. If changing the same attributes on another server causes the same effect, it is reasonable to deduce that action was the source of the problem. But if a totally opposite effect is produced, then examine the subtle differences between the servers and try to deduce what could have contributed to the problem.
For the LPAR I had in the first sample problem, I saw that when I swung the SAN disks back to the old p670 server and booted it up, the problem was not present. Users were able to access their application, and the CPU incurred a decent load, going above 80% CPU utilization (10% kernel + 70% user). So, I was able to determine that there was something unique to running things on the p570 machine that was causing the problem rather than something introduced in the migration process.
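Verifying that load came down to a few basic utilization commands; something like this quick check on each server:

    # Kernel vs. user CPU breakdown over three 5-second samples
    sar -u 5 3

    # Run queue depth plus the us/sy/id/wa columns for the same interval
    vmstat 5 3

    # On a micro-partitioned LPAR, lparstat also shows entitlement consumed (%entc)
    lparstat 5 3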
In the Information Age, a wealth of knowledge is accessible with just a few keystrokes and mouse clicks. Fortunately for systems administrators, we tend to be a part of a greater community that has documented centuries of experience and libraries of syntax online.
One good first place to start is with the manufacturers and vendors themselves. Companies like IBM have put all of their manuals, Redbooks, technical papers, and even their man pages on the web for research purposes. Just putting a simple keyword into the main site's search bar can provide thousands of possible suggestions for information that might help.
Other places I recommend are the various newsgroups, forums, and sites that other systems administrators frequent. People who work all day on servers tend to keep up with reading tech sites and commenting on things they see in the course of their work. Most systems administrators are happy to lend some pointers or shoot a few e-mails back and forth in response to a public cry for help. And you can often find information going back decades that pertains to other versions of operating systems and software, which can serve as jumping-off platforms for more information.
The main trick in any of these circumstances is choosing the right set of keywords. If I use a general search engine like Google to get started on an AIX problem, I make sure I start the search string explicitly with AIX to avoid other flavors of UNIX. Then, I might include something like the output from the command or the label from the errpt. I also make sure to use double quotation marks ("") around specific phrases to limit the search to those specific issues and not bring in extraneous information, especially for common words like Logical Volume Manager.
For my problem with the disk that would not work around the bad block relocation, using the phrase AIX "bad block relocation" failure got me a few hundred results on Google, but no one seemed to have come across the exact same circumstance that fell in my lap.
Sometimes, the wisest thing you can do in working on a problem is undo anything you've put in place and go back to where you were originally. This step isn't always available in all circumstances. Sometimes, it's forced upon you by overzealous C-level executives who need their servers back up. Or, it might be necessitated by getting crunched for time. But the option of rolling back is one of the best tactics to keep in a hip pocket.
I included this option at the midpoint in my list of troubleshooting steps, because sometimes it has to be done earlier and sometimes falls later in your triage. But in my experience, I have found it wisest to do the previous four steps before considering backing out any changes, because if the changes get rolled back immediately in the process, it is possible that the problem won't be resolved, and you will merely set yourself up for the same headache the next time you attempt the same work. If the changes get rolled back too late in the process, you could affect uptime or complicate the problem to the point where no back-out is possible.
I actually did have to roll back the server migration from my first example because of time. The users and company would have lost money if this production server were down any longer. The week it took to reschedule the work afforded me the ability to do some more research, but when I attempted the migration again, the beast reared its ugly head. In the second example, there was no rolling back from a hardware problem. There was no way to tell the server, "Take back that bad block relocation error!" I had to continue to try to overcome the disk's reticence.
If all the steps above haven't worked and you decide that it is time to start altering major components or getting more intrusive with the server, there is one important rule to remember above all else: Change things one piece at a time.
Multiple alterations will do one of two things. First, if the problem is resolved along the way, you won't know which change was the effective action. If you don't care what fixed the issue, it may not be a big deal, but good systems administrators like expanding their knowledge base, because they know that problems tend to strike the same place twice or more. Second, if the problem doesn't get fixed, it is possible to introduce more complications. Then, you won't know which one to back out. Go far enough, and the next thing you know, you'll be confused while your system sits in shambles. (Refer to xkcd for a funny joke about this.)
If the problem doesn't get resolved after one delta, you'll generally want to put it back and try something else. This was the case in my first example: When I compared the two servers' Hardware Management Console profiles, I noticed that the older POWER4 hardware used dedicated CPUs, while the newer POWER6 hardware had shared CPU pools with no caps. Curious about how this difference could affect CPU performance, I changed the profile on the POWER6 machine to use dedicated CPUs. Strangely, the server then performed "correctly" according to the users, and I saw a load on the processors. So, I knew that the problem had to be related to CPU resources, but I needed to find out why.
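If you have command-line access to the HMC, the profile differences and the LPAR's own view of its processors can be compared with something like the sketch below. The managed system and partition names are hypothetical, and the attribute names can vary a little between HMC releases:

    # On the HMC: list the processor settings of the partition profile on each frame
    lssyscfg -r prof -m p670-frame --filter "lpar_names=mylpar" \
        -F name,proc_mode,min_procs,desired_procs,max_procs
    lssyscfg -r prof -m p570-frame --filter "lpar_names=mylpar" \
        -F name,proc_mode,sharing_mode,desired_proc_units,desired_procs

    # On the AIX LPAR: confirm dedicated versus shared, capped versus uncapped,
    # and the current entitled capacity
    lparstat -i | egrep "Type|Mode|Entitled Capacity"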
When you've exhausted all reasonable steps and need to bring in a second opinion, it's usually time to contact IBM Support. They have advanced troubleshooting tools and various specialists who cover every facet of the operating system and related products (such as VIO and PowerHA), and they can pull up related case numbers to corroborate and collaborate on similar problems. But if you've never called 800-IBM-SERV, here are the things you will need to know.
First, you should have a contract number with IBM. There are various levels of support, from the highest levels of 24x7x365 coverage with dedicated personnel down to casual 8:00 am to 5:00 pm support for non-critical servers. These support packages can be purchased directly from IBM or by contacting a value-added reseller.
You will also need to provide some information so that IBM Support can pull up your accounts—typically, a phone number where the machine is located, a serial number, a contract number, or a physical location. This information largely depends on whether you are opening a hardware or a software case.
You must also let the Support person know the severity or priority of the case. Priorities range from 1, which typically means a system is down or production is affected and results in a live call transfer to a technician, down to 4, which carries a longer turnaround time and is usually used for more general administration questions.
After you provide a description of the problem and the case is opened, you will be issued a tracking number, typically called a PMR (problem management record). This number identifies the case to any other Support people with whom you work. Hardware and software PMRs are unique, and if your problem crosses boundaries, you will need to be issued a new number.
I had to contact IBM for both of my sample problems. For the first problem, IBM engaged everyone—from VIO support to the kernel team—to try to get resolution. For the second problem, I remained on the hardware side of their house, providing information from the snap command for analysis.
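The data gathering on my side was mostly a matter of running the snap command and attaching the resulting archive to the PMR; roughly:

    # Clear out any previous snap output, then gather everything into one compressed archive
    snap -r
    snap -ac

    # The archive lands under /tmp/ibmsupt by default; send it to the support team
    ls -l /tmp/ibmsupt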
Occasionally, there is no other choice to fix a problem but to try something so unorthodox and outlandish that most people would call it crazy. This typically happens at a point of desperation, where a job or life might even be on the line. It's usually when even IBM would say, "If you do this, you will be in an unsupported state and will have to start over before we will support it." But the trade-off is that if your solution works, you might be able to save the day.
For my second example, after I had called IBM Support, they said that my only option was to go to a mksysb image to restore the server. With nothing else to lose, after speaking with my team of administrators, we made a plan to try to physically yank the disk from the server after triple-mirroring the root disk. The known risk was that the removal of the disk could cause the server to be unable to boot. But the latent risk was that the physical removal could panic the larger server and crash all of the LPARs on it. Did we decide to dare it?
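For what it's worth, the triple-mirror step of that plan would have looked roughly like the sketch below, with illustrative disk names (hdisk0 as the dying disk, hdisk1 as its healthy mirror, hdisk2 as the added third copy):

    # Add a third disk to rootvg and lay a third copy of every logical volume on it
    extendvg rootvg hdisk2
    mirrorvg -c 3 rootvg hdisk2

    # Make the new copy bootable and push the healthy disks to the front of the boot list
    bosboot -ad /dev/hdisk2
    bootlist -m normal hdisk1 hdisk2 hdisk0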

Now that I've provided the background on these tickets, it's time for you to make the call. To summarize:
  • Why would moving a Workload Manager-enabled server to faster hardware only work correctly if the LPAR profile was set to dedicated CPUs instead of dynamic CPUs?
  • How could I recover a server from a disk that could not be deconfigured or have the data from a failed physical partition moved off of it?
When you think you have an idea, move ahead.
The culprit of the first example was the Workload Manager. The applications that used it had been throttled back to 50% of the CPU. So, when the hypervisor polling cycle probed the LPAR, it asked, "How much CPU do you need?" The server replied, "I'm only using half of what I have assigned." So, the hypervisor would dynamically reduce the CPU entitlement by half. After this cycle repeated a few times, the CPU horsepower had effectively halved itself down to almost nothing. To fix the problem, the Workload Manager pool was adjusted to allow up to 100% of the CPU, and then the dynamic CPU entitlement would throttle itself appropriately.
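A sketch of how that inspection and adjustment might look follows. The class name is a hypothetical stand-in for the real one, and the chclass keyword shown is from memory, so verify it against the documentation for your AIX level before relying on it:

    # See how much CPU each WLM class is actually consuming
    wlmstat 5 3

    # List the classes and limits in the active configuration
    lsclass

    # Raise the hard CPU maximum on the throttled class to 100% (class name is illustrative),
    # then refresh the running Workload Manager with the updated configuration
    chclass -c hardmax=100 app_class
    wlmcntrl -u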
For the second example, ultimately, we had to go to backup and restore. There was no way around the failed block relocation that the business was willing to risk. According to IBM Support, this is an infrequently encountered problem but one where there is no other choice but to lay a mksysb onto a good disk and recover the box that way. And after I recovered the operating system, I could then hot-swap the bad disk out in a safe manner and get it replaced without compromising the other LPARs on the hardware.
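Once the operating system was back on a good disk, the swap itself followed the standard hot-plug flow, sketched below with illustrative disk names (the diag menu names can differ slightly by hardware type):

    # Remove the failed disk's device definition so its slot can be serviced
    rmdev -dl hdisk0

    # In diag, use Task Selection -> Hot Plug Task to power off the slot,
    # physically replace the drive, and then let cfgmgr discover the new one
    diag
    cfgmgr

    # Fold the replacement disk back into rootvg and remake the boot image
    extendvg rootvg hdisk0
    mirrorvg rootvg hdisk0
    bosboot -ad /dev/hdisk0
    bootlist -m normal hdisk1 hdisk0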

Hopefully, you've gained some practical insight into how systems administrators troubleshoot AIX servers, what strategies you can use, some cautionary areas to avoid, and where you can look for advice on fixing your problems. These steps don't cover every situation perfectly, and there are other choices you can make, but these steps can point you down the right road.
