How do I analyze a system dump?
1. System Dump
1.1 Overview
When the system crashes (indicated by a flashing 888 on the LCD), the dump process is invoked.
The dump process writes the entire kernel segment that resides in real memory (the kernel segment is segment 0) to disk for later debugging. In effect, it creates a snapshot core image of the machine at the moment of the crash and saves it to disk.
The core image also includes memory-resident user data (such as u-blocks).
A successful dump is indicated by 0c0 on the LCD.
1.2 Dump Status Codes
The following dump progress indicators, or dump status codes, are part of a Type 102
message.
Note: When a lowercase c is listed, it displays in the lower half of the character
position. Some systems produce 4-digit codes, in which the two leftmost positions can
contain blanks or zeros. Use the two rightmost digits.
0c0 The dump completed successfully.
0c1 The dump failed due to an I/O error.
0c2 A dump, requested by the user, is started.
0c3 The dump is inhibited.
0c4 The dump device is not large enough.
0c5 The dump did not start, or the dump crashed.
0c6 Dumping to a secondary dump device.
0c7 Reserved.
0c8 The dump function is disabled.
0c9 A dump is in progress.
0cc Unknown dump failure.
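The rule in the note above (use only the two rightmost characters) can be sketched as a small lookup. This is an illustrative helper, not an AIX tool; the names DUMP_STATUS and describe_dump_code are made up for this example.

```python
# Dump status codes from the list above, keyed by the two rightmost characters.
DUMP_STATUS = {
    "c0": "The dump completed successfully.",
    "c1": "The dump failed due to an I/O error.",
    "c2": "A dump, requested by the user, is started.",
    "c3": "The dump is inhibited.",
    "c4": "The dump device is not large enough.",
    "c5": "The dump did not start, or the dump crashed.",
    "c6": "Dumping to a secondary dump device.",
    "c7": "Reserved.",
    "c8": "The dump function is disabled.",
    "c9": "A dump is in progress.",
    "cc": "Unknown dump failure.",
}


def describe_dump_code(led_value):
    """Return the meaning of a 3- or 4-character LED dump code (hypothetical helper)."""
    # Leading blanks or zeros are ignored: only the two rightmost characters matter.
    key = led_value.strip().lower()[-2:]
    return DUMP_STATUS.get(key, "Not a known dump status code.")


if __name__ == "__main__":
    for value in ("0c0", "00c4", "888"):
        print(value, "->", describe_dump_code(value))
```

A value such as 888 is not a dump status code, so the helper reports it as unknown rather than guessing.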
1.3 Crash Codes
Note: Some systems may produce 4-digit codes. If the leftmost digit of a 4-digit code is
0, use the three rightmost digits.
The following crash codes are part of a Type 102 message.
000 Unexpected system interrupt.
200 Machine check because of a memory bus error.
201 Machine check because of a memory timeout.
202 Machine check because of a memory card failure.
203 Machine check because of an out-of-range address.
204 Machine check because of an attempt to write to ROS.
205 Machine check because of an uncorrectable address parity.
206 Machine check because of an uncorrectable ECC error.
207 Machine check because of an unidentified error.
208 Machine check due to an L2 uncorrectable ECC.
300 Data storage interrupt from the processor.
32x Data storage interrupt because of an I/O exception from IOCC.
38x Data storage interrupt because of an I/O exception from SLA.
400 Instruction storage interrupt.
500 External interrupt because of a scrub memory bus error.
501 External interrupt because of an unidentified error.
51x External interrupt because of a DMA memory bus error.
52x External interrupt because of an IOCC channel check.
53x External interrupt from an IOCC bus timeout; x represents the IOCC number.
54x External interrupt because of an IOCC keyboard check.
558 There is not enough memory to continue the IPL.
700 Program interrupt.
800 Floating point is not available.
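Codes written with a trailing x (32x, 53x and so on) match any last digit, and a 4-digit code with a leading 0 is read from its three rightmost digits. A minimal sketch of that matching, using only a subset of the table above; CRASH_CODES and describe_crash_code are hypothetical names chosen for illustration.

```python
# A subset of the crash codes listed above; "x" stands for any final digit.
CRASH_CODES = {
    "000": "Unexpected system interrupt.",
    "200": "Machine check because of a memory bus error.",
    "300": "Data storage interrupt from the processor.",
    "32x": "Data storage interrupt because of an I/O exception from IOCC.",
    "400": "Instruction storage interrupt.",
    "53x": "External interrupt from an IOCC bus timeout; x represents the IOCC number.",
    "558": "There is not enough memory to continue the IPL.",
    "700": "Program interrupt.",
}


def describe_crash_code(code):
    """Look up a crash code, or return None if it is not in the table."""
    code = code.strip().lower()
    # 4-digit form: if the leftmost digit is 0, use the three rightmost digits.
    if len(code) == 4 and code.startswith("0"):
        code = code[1:]
    if len(code) != 3:
        return None
    # Exact entries (for example 558) take priority over wildcard entries.
    if code in CRASH_CODES:
        return CRASH_CODES[code]
    return CRASH_CODES.get(code[:2] + "x")


if __name__ == "__main__":
    for value in ("0300", "532", "558", "999"):
        print(value, "->", describe_crash_code(value))
```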
1.4 Enabling system dump
We can check the current dump settings with "smitty dump" or as follows:
# sysdumpdev -l
primary              /dev/hd6
secondary            /dev/sysdumpnull
copy directory       /var/adm/ras
forced copy flag     TRUE
always allow dump    TRUE
dump compression     OFF
This means that system dump is enabled (always allow dump = TRUE) and that the dump will be copied to /var/adm/ras.
If always allow dump = FALSE, the system dump will not be generated.
To enable it, change the following setting in smit:
Always ALLOW System Dump = true
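When the same check has to be repeated on several machines, the text printed by sysdumpdev -l can be parsed with a few lines of code. The sketch below assumes the output has the shape shown above (a label followed by a single value with no spaces); parse_dump_settings and SAMPLE_SETTINGS are hypothetical names, and the sample text is copied from this article rather than captured live.

```python
# Output of "sysdumpdev -l" as shown above (without the command line itself).
SAMPLE_SETTINGS = """\
primary /dev/hd6
secondary /dev/sysdumpnull
copy directory /var/adm/ras
forced copy flag TRUE
always allow dump TRUE
dump compression OFF
"""


def parse_dump_settings(text):
    """Split each line into (label, value), assuming the value is the last word."""
    settings = {}
    for line in text.splitlines():
        parts = line.strip().rsplit(None, 1)
        if len(parts) == 2:
            settings[parts[0].strip()] = parts[1]
    return settings


if __name__ == "__main__":
    settings = parse_dump_settings(SAMPLE_SETTINGS)
    if settings.get("always allow dump") == "TRUE":
        print("System dump is enabled; copy directory:", settings["copy directory"])
    else:
        print("System dump is not enabled")
```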
The /var filesystem must have enough free space to hold a couple of system dumps.
Its size can be increased by roughly 100 MB as follows (the size value is given in 512-byte blocks):
# chfs -a size=+200000 /var
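The arithmetic behind the value 200000 is worth spelling out, since chfs counts 512-byte blocks here. A small sketch of the conversion in both directions; the function names are invented for this example.

```python
BLOCK_SIZE = 512  # bytes per block, the unit used by "chfs -a size=" in the command above


def blocks_to_bytes(blocks):
    """Convert a count of 512-byte blocks to bytes."""
    return blocks * BLOCK_SIZE


def mib_to_blocks(mebibytes):
    """Convert mebibytes (1024 * 1024 bytes) to 512-byte blocks."""
    return mebibytes * 1024 * 1024 // BLOCK_SIZE


if __name__ == "__main__":
    added = blocks_to_bytes(200000)
    print(added, "bytes, about", round(added / 1_000_000, 1), "MB")
    print("Exactly 100 MiB would be", mib_to_blocks(100), "blocks")
```

So +200000 blocks adds 102,400,000 bytes, which is why the text calls it about 100 MB.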
1.5 Analyzing system dump
If a customer reports that the system froze with 888 on the display, check errpt for an entry like this:
C0AA5338 0614145601 U S SYSDUMP SYSTEM DUMP
This means that a system dump occurred on June 14, 2001 at 14:56 (the errpt timestamp has the form MMDDhhmmYY).
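The second column of the errpt line packs the date as MMDDhhmmYY. A minimal sketch of decoding it; parse_errpt_timestamp is a hypothetical helper, and the choice of century for the two-digit year (00 to 68 read as 20xx, 69 to 99 as 19xx) is an assumption borrowed from the usual POSIX convention, not something errpt states.

```python
from datetime import datetime


def parse_errpt_timestamp(stamp):
    """Decode an errpt timestamp of the form MMDDhhmmYY into a datetime."""
    if len(stamp) != 10 or not stamp.isdigit():
        raise ValueError("expected 10 digits in MMDDhhmmYY form: %r" % stamp)
    # Cut the string into five 2-digit fields.
    month, day, hour, minute, yy = (int(stamp[i:i + 2]) for i in range(0, 10, 2))
    # Assumption: two-digit years below 69 belong to the 2000s.
    year = 2000 + yy if yy < 69 else 1900 + yy
    return datetime(year, month, day, hour, minute)


if __name__ == "__main__":
    print(parse_errpt_timestamp("0614145601"))
```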
Run the following command to verify the status of the last system dump:
# sysdumpdev -L
0453-039
Device name: /dev/hd6
Major device number: 10
Minor device number: 2
Size: 63952384 bytes
Date/Time: Thu Jun 14 14:43:11 CST 2001
Dump status: 0
dump completed successfully
Dump copy filename: /var/adm/ras/vmcore.0
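The fields that matter in this report are the dump status, the size, and the copy filename. As a sketch, they can be pulled out of the text like this; the function names and SAMPLE_LAST_DUMP are hypothetical, and the sample is the output shown above, not a live capture.

```python
# Output of "sysdumpdev -L" as shown above.
SAMPLE_LAST_DUMP = """\
0453-039
Device name: /dev/hd6
Major device number: 10
Minor device number: 2
Size: 63952384 bytes
Date/Time: Thu Jun 14 14:43:11 CST 2001
Dump status: 0
dump completed successfully
Dump copy filename: /var/adm/ras/vmcore.0
"""


def parse_last_dump(text):
    """Collect the "label: value" lines; lines without a colon are skipped."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            info[key.strip()] = value.strip()
    return info


def last_dump_ok(info):
    """A dump status of 0 corresponds to "dump completed successfully" above."""
    return info.get("Dump status") == "0"


if __name__ == "__main__":
    info = parse_last_dump(SAMPLE_LAST_DUMP)
    size_bytes = int(info["Size"].split()[0])
    print("ok:", last_dump_ok(info), "size:", size_bytes, "file:", info["Dump copy filename"])
```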
Run the crash command to get a basic idea of the possible causes of the system dump.
The crash subcommands trace -k, thread -r, and status 0 give hints about the origin of the problem:
# cd /var/adm/ras
# crash vmcore.0
Using /unix as the default namelist file.
> trace -k
STACK TRACE:
0x2ff3b400 (excpt=edffff54:40000000:00001004:edffff54:00000106) (intpri=0)
IAR: .remove_e_list+38 (00032888): tweqi r7,0x0
LR: .e_block_thread+40c (00034424)
2ff3b010: .e_sleep_thread+4c (0003497c)
2ff3b060: .[nspdd]+4144 (016ba4e4)
2ff3b100: .[nspdd]+2de4 (016b9184)
2ff3b170: .[nspdd]+7e8 (016b6b88)
2ff3b1f0: .rdevioctl+140 (001b4344)
2ff3b260: .vnop_ioctl+1c (001c01d4)
2ff3b2a0: .vno_ioctl+144 (001d81d8)
2ff3b360: .common_ioctl+b0 (001e7894)
2ff3b3c0: .sys_call_ret+0 (00003a90)
IAR not in kernel segment.
> status 0
CPU TID TSLOT PID PSLOT STOPPED PROC_NAME
0 700f 112 6db0 109 yes pltDc
> thread -r
SLT ST TID PID CPUID POLICY PRI CPU EVENT PROCNAME FLAGS
2 r 205 204 0 FIFO 7f 78 wait
t_flags: sig_avail funnel kthread
3 r 307 306 1 FIFO 7f 78 wait
t_flags: sig_avail funnel kthread
4 r 409 408 2 FIFO 7f 78 wait
t_flags: sig_avail funnel kthread
5 r 50b 50a 3 FIFO 7f 78 wait
t_flags: sig_avail funnel kthread
112 r 700f 6db0 0 RR 40 0 pltDc
t_flags: local cdefer funnel
> proc -r
SLT ST PID PPID PGRP UID EUID TCNT NAME
2 a 204 0 0 0 0 1 wait
FLAGS: swapped_in no_swap fixed_pri kproc
3 a 306 0 0 0 0 1 wait
FLAGS: swapped_in no_swap fixed_pri kproc
55 a 37b8 2282 2282 200 200 1 X
FLAGS: swapped_in execed
112 a 7054 571a 25c8 200 200 1 expose
FLAGS: swapped_in no_swap fixed_pri ppnocldstop execed
122 a 7a14 1 744c 200 200 1 plateExp_dlg35
FLAGS: swapped_in orphanpgrp ppnocldstop execed
> q     ; quits the crash command
=================================================================
In this case, trace -k shows the failing stack passing through nspdd, which is part of the TSP driver.
Both thread -r and status 0 point to the application process pltDc as the one responsible for the system dump (it was the last process to run).
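In the stack trace, frames written as .[name]+offset are the ones that stand out from the plain kernel symbols, which is how nspdd was spotted. A rough sketch of picking those names out of a saved trace; the regular expression and the function name are assumptions based only on the frame format visible in the trace above.

```python
import re

# Matches frames such as ".[nspdd]+4144" and captures the bracketed name.
BRACKETED_FRAME = re.compile(r"\.\[(\w+)\]\+[0-9a-f]+")

# A few frames copied from the trace -k output above.
SAMPLE_TRACE = """\
2ff3b010: .e_sleep_thread+4c (0003497c)
2ff3b060: .[nspdd]+4144 (016ba4e4)
2ff3b100: .[nspdd]+2de4 (016b9184)
2ff3b170: .[nspdd]+7e8 (016b6b88)
2ff3b1f0: .rdevioctl+140 (001b4344)
"""


def bracketed_names_in_trace(text):
    """Return the bracketed names in a stack trace, in first-seen order."""
    names = []
    for match in BRACKETED_FRAME.finditer(text):
        name = match.group(1)
        if name not in names:
            names.append(name)
    return names


if __name__ == "__main__":
    print(bracketed_names_in_trace(SAMPLE_TRACE))
```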
The dump file can be copied to a CD and sent to IBM for further analysis.