Tuesday 12 March 2013

AIX: sysdump analysis


How do I analyze a system dump?

1.    System Dump
1.1    Overview
During a system crash (a flashing 888 on the LCD), the dump process is invoked.
The dump process stores the entire kernel segment that resides in real memory (the kernel segment is segment 0) to disk for later debugging. In effect, it creates a snapshot core image of the machine at the moment of the crash and saves it to disk.
The core also includes memory-resident user data (such as u-blocks).
A successful dump is indicated by 0c0 on the LCD.

1.2    Dump Status Codes

The following dump progress indicators, or dump status codes, are part of a Type 102 message.
Note: When a lowercase c is listed, it displays in the lower half of the character position. Some systems produce 4-digit codes; the two leftmost positions can be blanks or zeros. Use the two rightmost digits.
0c0 The dump completed successfully.
0c1 The dump failed due to an I/O error.
0c2 A dump, requested by the user, is started.
0c3 The dump is inhibited.
0c4 The dump device is not large enough.
0c5 The dump did not start, or the dump crashed.
0c6 Dumping to a secondary dump device.
0c7 Reserved.
0c8 The dump function is disabled.
0c9 A dump is in progress.
0cc Unknown dump failure.
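
The list above can be turned into a small lookup helper. This is only a sketch: the function name describe_dump_status is made up for illustration, and it assumes the code is typed in as text (for example "0c4" or "00c4"). It follows the note above by looking only at the two rightmost characters.

```python
# Dump status codes from the list above, keyed by their two rightmost characters.
DUMP_STATUS = {
    "c0": "The dump completed successfully.",
    "c1": "The dump failed due to an I/O error.",
    "c2": "A dump, requested by the user, is started.",
    "c3": "The dump is inhibited.",
    "c4": "The dump device is not large enough.",
    "c5": "The dump did not start, or the dump crashed.",
    "c6": "Dumping to a secondary dump device.",
    "c7": "Reserved.",
    "c8": "The dump function is disabled.",
    "c9": "A dump is in progress.",
    "cc": "Unknown dump failure.",
}


def describe_dump_status(led_code):
    """Return the meaning of a 3- or 4-character dump status code (hypothetical helper)."""
    # Leading blanks or zeros are ignored: only the two rightmost characters matter.
    key = led_code.strip().lower()[-2:]
    return DUMP_STATUS.get(key, "Not a known dump status code.")


print(describe_dump_status("0c4"))
print(describe_dump_status("00c0"))
```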

1.3    Crash Codes

Note: Some systems may produce 4-digit codes. If the leftmost digit of a 4-digit code is 0, use the three rightmost digits.
The following crash codes are part of a Type 102 message.

000 Unexpected system interrupt.
200 Machine check because of a memory bus error.
201 Machine check because of a memory timeout.
202 Machine check because of a memory card failure.
203 Machine check because of an out-of-range address.
204 Machine check because of an attempt to write to ROS.
205 Machine check because of an uncorrectable address parity.
206 Machine check because of an uncorrectable ECC error.
207 Machine check because of an unidentified error.
208 Machine check due to an L2 uncorrectable ECC.
300 Data storage interrupt from the processor.
32x Data storage interrupt because of an I/O exception from IOCC.
38x Data storage interrupt because of an I/O exception from SLA.
400 Instruction storage interrupt.
500 External interrupt because of a scrub memory bus error.
501 External interrupt because of an unidentified error.
51x External interrupt because of a DMA memory bus error.
52x External interrupt because of an IOCC channel check.
53x External interrupt from an IOCC bus timeout; x represents the IOCC number.
54x External interrupt because of an IOCC keyboard check.
558 There is not enough memory to continue the IPL.
700 Program interrupt.
800 Floating point is not available.
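
The crash code table differs from the dump status list in that some entries end in a placeholder x. A possible lookup, again just a sketch with a made-up function name (describe_crash_code), tries an exact match first and then falls back to the "x" form of the code. It also applies the rule from the note above for 4-digit codes with a leading 0.

```python
# Crash codes from the table above. A trailing "x" stands for a variable digit.
CRASH_CODES = {
    "000": "Unexpected system interrupt.",
    "200": "Machine check because of a memory bus error.",
    "201": "Machine check because of a memory timeout.",
    "202": "Machine check because of a memory card failure.",
    "203": "Machine check because of an out-of-range address.",
    "204": "Machine check because of an attempt to write to ROS.",
    "205": "Machine check because of an uncorrectable address parity.",
    "206": "Machine check because of an uncorrectable ECC error.",
    "207": "Machine check because of an unidentified error.",
    "208": "Machine check due to an L2 uncorrectable ECC.",
    "300": "Data storage interrupt from the processor.",
    "32x": "Data storage interrupt because of an I/O exception from IOCC.",
    "38x": "Data storage interrupt because of an I/O exception from SLA.",
    "400": "Instruction storage interrupt.",
    "500": "External interrupt because of a scrub memory bus error.",
    "501": "External interrupt because of an unidentified error.",
    "51x": "External interrupt because of a DMA memory bus error.",
    "52x": "External interrupt because of an IOCC channel check.",
    "53x": "External interrupt from an IOCC bus timeout.",
    "54x": "External interrupt because of an IOCC keyboard check.",
    "558": "There is not enough memory to continue the IPL.",
    "700": "Program interrupt.",
    "800": "Floating point is not available.",
}


def describe_crash_code(code):
    """Return the meaning of a 3- or 4-digit crash code (hypothetical helper)."""
    code = code.strip().lower()
    # 4-digit codes with a leading 0: use the three rightmost digits.
    if len(code) == 4 and code[0] == "0":
        code = code[1:]
    if code in CRASH_CODES:
        return CRASH_CODES[code]
    # Fall back to the wildcard form, e.g. "532" -> "53x".
    pattern = code[:2] + "x"
    if len(code) == 3 and pattern in CRASH_CODES:
        return CRASH_CODES[pattern] + " (x = " + code[2] + ")"
    return "Not a known crash code."


print(describe_crash_code("0300"))
print(describe_crash_code("532"))
```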

1.4    Enabling system dump
The current dump settings can be checked via "smitty dump" or as follows:

# sysdumpdev -l

primary              /dev/hd6
secondary            /dev/sysdumpnull
copy directory       /var/adm/ras
forced copy flag     TRUE
always allow dump    TRUE
dump compression     OFF

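This means that the system dump is enabled (always allow dump = TRUE) and that it will be copied to /var/adm/ras.
If always allow dump = FALSE, a dump cannot be forced with the reset button or the dump key sequences (on machines with a key mode switch, unless the switch is in the Service position).
To change this, change the setting in smit: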
Always ALLOW System Dump = true
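
When many machines have to be checked, the output of sysdumpdev -l can be parsed by a script. The following is a minimal sketch, assuming the output looks like the sample above, with each label separated from its value by at least two spaces. The function name parse_sysdumpdev_l is made up for illustration, and the sample text is pasted in rather than read from a live system.

```python
import re

# Sample output of "sysdumpdev -l", copied from the listing above.
SAMPLE = """\
primary              /dev/hd6
secondary            /dev/sysdumpnull
copy directory       /var/adm/ras
forced copy flag     TRUE
always allow dump    TRUE
dump compression     OFF
"""


def parse_sysdumpdev_l(text):
    """Map each setting label to its value (assumes 2+ spaces between them)."""
    settings = {}
    for line in text.splitlines():
        parts = re.split(r"\s{2,}", line.strip(), maxsplit=1)
        if len(parts) == 2:
            settings[parts[0]] = parts[1]
    return settings


settings = parse_sysdumpdev_l(SAMPLE)
if settings.get("always allow dump") != "TRUE":
    print("always allow dump is not TRUE")
print("dump device:", settings["primary"])
print("copy directory:", settings["copy directory"])
```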
The /var filesystem must have enough space to accommodate a couple of system dumps.
Its size can be increased by, say, about 100 MB (200000 blocks of 512 bytes) as follows:

# chfs  -a  size=+200000   /var
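
Without a unit suffix, the size value given to chfs is counted in 512-byte blocks, which is why +200000 corresponds to roughly 100 MB. The arithmetic can be sketched as follows (the helper names are made up for illustration):

```python
BLOCK_SIZE = 512  # bytes per block for a chfs size value without a unit suffix


def mb_to_blocks(megabytes):
    """Number of 512-byte blocks in the given number of megabytes (1 MB = 1024 * 1024 bytes)."""
    return megabytes * 1024 * 1024 // BLOCK_SIZE


def blocks_to_mb(blocks):
    """Size in megabytes of the given number of 512-byte blocks."""
    return blocks * BLOCK_SIZE / (1024 * 1024)


print(mb_to_blocks(100))                # 204800
print(round(blocks_to_mb(200000), 1))   # 97.7
```

So size=+200000 adds a little under 100 MB, and size=+204800 would add exactly 100 MB.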

1.5    Analyzing system dump
If the customer complains that the system froze with 888 on the display, check errpt for an entry like this:
C0AA5338   0614145601 U S SYSDUMP        SYSTEM DUMP

This means that a system dump occurred on 14 June 2001 at 14:56 (the errpt timestamp has the format MMDDhhmmyy).
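
The errpt timestamp is easy to misread, so here is a short sketch of how the entry above can be decoded. It assumes the usual errpt column order (identifier, timestamp, type, class, resource name, description) and the MMDDhhmmyy timestamp format. The function name parse_errpt_line is made up for illustration.

```python
from datetime import datetime


def parse_errpt_line(line):
    """Split one errpt summary line into its columns (hypothetical helper)."""
    identifier, timestamp, err_type, err_class, resource, description = line.split(None, 5)
    return {
        "identifier": identifier,
        # MMDDhhmmyy: month, day, hour, minute, two-digit year
        "when": datetime.strptime(timestamp, "%m%d%H%M%y"),
        "type": err_type,
        "class": err_class,
        "resource": resource,
        "description": description.strip(),
    }


entry = parse_errpt_line("C0AA5338   0614145601 U S SYSDUMP        SYSTEM DUMP")
print(entry["when"])         # 2001-06-14 14:56:00
print(entry["description"])  # SYSTEM DUMP
```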

Run the following command to verify the status of the last system dump:

# sysdumpdev -L

0453-039

Device name:         /dev/hd6
Major device number: 10
Minor device number: 2
Size:                63952384 bytes
Date/Time:           Thu Jun 14 14:43:11 CST 2001
Dump status:         0
dump completed successfully
Dump copy filename: /var/adm/ras/vmcore.0
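
The same check can be scripted. The sketch below assumes the output of sysdumpdev -L has the "Label: value" layout shown above and takes the sample text as pasted input. The function name parse_sysdumpdev_L is made up for illustration.

```python
# Sample output of "sysdumpdev -L", copied from the listing above.
SAMPLE = """\
0453-039

Device name:         /dev/hd6
Major device number: 10
Minor device number: 2
Size:                63952384 bytes
Date/Time:           Thu Jun 14 14:43:11 CST 2001
Dump status:         0
dump completed successfully
Dump copy filename: /var/adm/ras/vmcore.0
"""


def parse_sysdumpdev_L(text):
    """Map each 'Label: value' line to a dictionary entry; other lines are skipped."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so the time in Date/Time stays intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


info = parse_sysdumpdev_L(SAMPLE)
dump_ok = info["Dump status"] == "0"
size_mb = int(info["Size"].split()[0]) / (1024 * 1024)
print("dump ok:", dump_ok)
print("dump size in MB:", round(size_mb, 1))
print("copied to:", info["Dump copy filename"])
```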

Run the crash command to get a basic idea of the possible reasons for the system dump.
The crash subcommands trace -k, status 0 and thread -r provide hints about the origin of the problem:

# cd /var/adm/ras
# crash vmcore.0

Using /unix as the default namelist file.

> trace -k
STACK TRACE:
0x2ff3b400 (excpt=edffff54:40000000:00001004:edffff54:00000106) (intpri=0)
        IAR:      .remove_e_list+38 (00032888):   tweqi   r7,0x0
        LR:       .e_block_thread+40c (00034424)
        2ff3b010: .e_sleep_thread+4c (0003497c)
        2ff3b060: .[nspdd]+4144 (016ba4e4)
        2ff3b100: .[nspdd]+2de4 (016b9184)
        2ff3b170: .[nspdd]+7e8 (016b6b88)
        2ff3b1f0: .rdevioctl+140 (001b4344)
        2ff3b260: .vnop_ioctl+1c (001c01d4)
        2ff3b2a0: .vno_ioctl+144 (001d81d8)
        2ff3b360: .common_ioctl+b0 (001e7894)
        2ff3b3c0: .sys_call_ret+0 (00003a90)
IAR not in kernel segment.

> status 0

CPU     TID  TSLOT     PID  PSLOT  STOPPED  PROC_NAME
  0    700f    112    6db0    109      yes  pltDc

> thread -r

SLT ST    TID      PID    CPUID  POLICY PRI CPU    EVENT  PROCNAME     FLAGS
  2 r     205      204        0    FIFO  7f  78               wait
        t_flags:  sig_avail funnel kthread
  3 r     307      306        1    FIFO  7f  78               wait
        t_flags:  sig_avail funnel kthread
  4 r     409      408        2    FIFO  7f  78               wait
        t_flags:  sig_avail funnel kthread
  5 r     50b      50a        3    FIFO  7f  78               wait
        t_flags:  sig_avail funnel kthread
112 r    700f     6db0        0      RR  40   0              pltDc
        t_flags:  local cdefer funnel


> proc -r

SLT ST    PID   PPID   PGRP   UID  EUID  TCNT  NAME
  2 a     204      0      0     0     0     1  wait
        FLAGS: swapped_in no_swap fixed_pri kproc
  3 a     306      0      0     0     0     1  wait
        FLAGS: swapped_in no_swap fixed_pri kproc
 55 a    37b8   2282   2282   200   200     1  X
        FLAGS: swapped_in execed
112 a    7054   571a   25c8   200   200     1  expose
        FLAGS: swapped_in no_swap fixed_pri ppnocldstop execed
122 a    7a14      1   744c   200   200     1  plateExp_dlg35
        FLAGS: swapped_in orphanpgrp ppnocldstop execed

>q                ;quits the crash command
=================================================================
In this case trace -k points to nspdd, which is part of the TSP driver: its frames sit directly below the kernel sleep routines in which the trap occurred.
Both thread -r and status 0 hint that the application process pltDc triggered the system dump (it was the last process running).
The dump file can be copied to a CD and sent to IBM for further analysis.
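
Picking the suspect module out of a long stack trace can also be done mechanically. The sketch below is based on one assumption that the source does not state explicitly: that names in square brackets, such as [nspdd], mark frames inside a loaded kernel extension, while plain names are kernel routines. The function names are made up for illustration, and the trace text is pasted from the session above.

```python
import re

# Stack frames from the "trace -k" output above.
TRACE = """\
        IAR:      .remove_e_list+38 (00032888):   tweqi   r7,0x0
        LR:       .e_block_thread+40c (00034424)
        2ff3b010: .e_sleep_thread+4c (0003497c)
        2ff3b060: .[nspdd]+4144 (016ba4e4)
        2ff3b100: .[nspdd]+2de4 (016b9184)
        2ff3b170: .[nspdd]+7e8 (016b6b88)
        2ff3b1f0: .rdevioctl+140 (001b4344)
        2ff3b260: .vnop_ioctl+1c (001c01d4)
        2ff3b2a0: .vno_ioctl+144 (001d81d8)
        2ff3b360: .common_ioctl+b0 (001e7894)
        2ff3b3c0: .sys_call_ret+0 (00003a90)
"""

# Matches ".name+offset (address)" where name may be wrapped in [ ].
FRAME = re.compile(r"\.(\[?\w+\]?)\+([0-9a-f]+) \(([0-9a-f]+)\)")


def frames(trace):
    """Return (name, address) for every frame, top of the stack first."""
    return [(m.group(1), m.group(3)) for m in FRAME.finditer(trace)]


def extension_frames(trace):
    """Return the bracketed names, assumed here to be kernel extensions."""
    return [name.strip("[]") for name, _ in frames(trace) if name.startswith("[")]


print(len(frames(TRACE)))
print(extension_frames(TRACE))
```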
