UnixPedia : HPUX / LINUX / SOLARIS: HPUX :What Are the Steps to Process an HP-UX 11i Crash Dump?

Friday, September 26, 2014

What are the steps to process an 11i crash dump?
ANSWER:
PROCESSING HPUX DUMPS (11.11 - 11.31)
(Please read completely before using)
===============================================================================
WHAT ARE DUMP PROCESSING TOOLS?
===============================================================================
If HP-UX crashes, the kernel saves critical O/S state data from RAM to the swap LVOL or a dedicated dump device, then reboots the system, where savecrash copies the dump into a file system directory.
Depending on the version of HP-UX, q4, crashinfo, and crashlite can be used to process the dump and provide details about the cause of the crash.
== STEP 1 ===== WHERE IS THE DUMP? ==========================================
1.1 /var/adm/crash/ is the default destination directory for saved dumps. If no directory is specified in the boot-time dump configuration file, set one now:

/etc/rc.config.d/savecrash :   SAVECRASH_DIR=/var/adm/crash (or preferred location) 
If "/var: file system full" occurs while the dump is being saved, uncompressed, or processed, free up space in /var or point SAVECRASH_DIR= at a file system large enough to hold all of the physical memory installed in the system. When done, run the savecrash command again as follows:
# /sbin/savecrash -rvf <TARGET_DIRECTORY>
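Before re-running savecrash, it can help to confirm the target file system really can hold a full-memory dump. The sketch below is plain POSIX shell with hard-coded, hypothetical sizes; on a real HP-UX system MEM_KB would come from the boot messages and FREE_KB from bdf output for the dump directory.

```shell
# Sketch: verify SAVECRASH_DIR has room for a full-memory dump.
# Both values are hard-coded here for illustration; substitute the
# real figures from your system.
MEM_KB=8388608                      # 8 GB of RAM (hypothetical)
FREE_KB=4194304                     # 4 GB free in the dump directory (hypothetical)

if [ "$FREE_KB" -lt "$MEM_KB" ]; then
    VERDICT="too small: point SAVECRASH_DIR at a larger file system"
else
    VERDICT="ok"
fi
echo "$VERDICT"
```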
1.2 Determine if a recent crash.N (11.X) directory exists in the dump directory as follows:
# ll /var/adm/crash/c* (dump directory)
"N" increments with each new dump.
1.3 If the system dump is not at the expected path, try to save it using the following command:
# savecrash -rvf <directory>
If this returns "invalid dump header", no valid dump exists in the swap/dump device. (Swap activity may have overwritten it.)
1.4 /etc/shutdownlog and /var/adm/crash/c*/INDEX contain a useful crash "panic" statement. If shutdownlog does not exist, issue the following command:
# touch /etc/shutdownlog
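The panic statement in /etc/shutdownlog is usually the quickest clue to why the system went down. A minimal sketch, using a fabricated log in the same general format as the real file, shows how to pull out the most recent panic reason:

```shell
# Sketch: pull the most recent panic reason out of /etc/shutdownlog.
# The log content below is an invented example, not real output.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
12:01  Wed Sep 24, 2014.  Reboot:
09:30  Fri Sep 26, 2014.  Reboot after panic: Data page fault
EOF

PANIC=$(grep -i 'panic' "$LOG" | tail -1)   # last panic entry wins
echo "$PANIC"
rm -f "$LOG"
```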
== STEP 2 ===== CD TO THE DUMP DIRECTORY ====================================
2.1 cd to the dump directory (IMPORTANT!)
Example:
# cd /var/adm/crash/crash.0
2.2 gunzip the kernel file if it is zipped: # gunzip vmunix.gz
== STEP 3 ===== USE THE LATEST TOOL TO READ THE DUMP =========================
3.1 Download the latest version of crashinfo via FTP:
ftp 15.192.32.78 login: hpcu password: Toolbox1
Once you get an FTP prompt, type:
bin
get crashinfo.shar
quit
3.2 Unpack and run crashinfo:
# sh crashinfo.shar
# ./crashinfo.exe
# ./crashinfo . > crash.txt
== STEP 4 ===== REVIEW AND SEND DATA =======================================
4.1 HPUX uses the acronyms HPMC and MCA to denote a hardware failure.
Was the crash due to a hardware failure?
Type:

# grep -e MCA -e HPMC crash.txt | grep -i event 
If any of the following lines result from the grep, open a hardware repair case for the system:

"crash event was an MCA"   "crash event was an HPMC"   "Crash Event 0 (HPMC, struct crash_event_table_struct..."  
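To see what a hardware-failure match looks like, the same grep from 4.1 can be run against fabricated data; the crash.txt content below is invented for illustration only.

```shell
# Sketch: the step 4.1 grep against a fabricated crash.txt.
# -i makes the "event" match case-insensitive; -c counts matches.
CRASH=$(mktemp)
cat > "$CRASH" <<'EOF'
analyzing dump...
Crash Event 0 (HPMC, struct crash_event_table_struct 0x41e9b8)
panic: hardware checkstop
EOF

HWFAIL=$(grep -e MCA -e HPMC "$CRASH" | grep -ic event)
echo "hardware-event lines: $HWFAIL"
rm -f "$CRASH"
```

A nonzero count means the crash was hardware-related and a repair case should be opened.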
The OnlineDiag software bundle captures HPMC and MCA details under /var/tombstones/: ts* files for HPMCs and mca* files for MCAs.
Check the 'dumptime' in the INDEX file: # grep dumptime INDEX
If an HPMC or MCA has occurred, locate the 'ts' or 'mca' file (usually ts99 or the latest mca* file) created after the "dumptime". Email the file per instructions below.
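Locating the tombstone file written after the crash amounts to a timestamp comparison against the dumptime. The sketch below fakes this with file mtimes and a marker file standing in for the INDEX dumptime; the file names and times are invented.

```shell
# Sketch: find tombstone files written after the crash, using file
# timestamps as a stand-in for the INDEX "dumptime" comparison.
TOMB=$(mktemp -d)
touch -t 201409250900 "$TOMB/ts98"      # older than the crash
MARK="$TOMB/dumptime.mark"
touch -t 201409260930 "$MARK"           # crash time from INDEX
touch -t 201409260931 "$TOMB/ts99"      # written just after the crash

NEWER=$(find "$TOMB" -name 'ts*' -newer "$MARK")
echo "$NEWER"
rm -rf "$TOMB"
```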
If an HPMC did not occur, proceed to the next step.
4.2 Skip the remainder of this step if the following lines are not found in crash.txt .
MC/ServiceGuard: Unable to maintain contact with cmcld daemon.
Performing TOC to ensure data integrity.
If these statements occur, determine the NODE_TIMEOUT value as follows:
*      For Serviceguard 11.18 and older use: # cmviewconf | grep node_timeout
RETURNS: node_timeout=16000000
*      For Serviceguard 11.19 and newer use: # cmviewcl -v -f line -s config | grep member_timeout
RETURNS: member_timeout=16000000
The value is reported in microseconds. If the value returned is 2000000 (the 2-second default), then this probably caused the crash.
When the kernel is too busy to send a Serviceguard heartbeat packet to the other nodes within the NODE_TIMEOUT period, the other nodes reform a cluster and 'orphan' this node, causing a TOC/reboot.
Update the cluster NODE_TIMEOUT to 8 seconds and stop here.
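The microsecond values reported by cmviewconf/cmviewcl convert to seconds by dividing by 1,000,000; a small sketch, using the 16000000 example value from above:

```shell
# Sketch: convert a Serviceguard timeout from microseconds to seconds.
RAW="member_timeout=16000000"   # example output from cmviewcl above
USEC=${RAW#*=}                  # strip everything up to the "="
SEC=$((USEC / 1000000))
echo "timeout: ${SEC}s"
```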
If the NODE_TIMEOUT is not the problem then include the following files when you send in email for the case:
output from the command "cmviewcl -v -f line -s config"
/var/adm/syslog/syslog.log
(from all of the nodes in the cluster that did not crash.)
/var/adm/syslog/OLDsyslog.log
(from the node that crashed.)
output from command: "netfmt -nNlf /var/adm/nettl.LOG000 "
/var/adm/cmcluster/frdump.cmcld.*
(Serviceguard 'flight recorder' logs)
4.3 Generate a list of installed patches as follows:

# /usr/sbin/swlist -l product > patchlist.txt 
4.4 Zip and Email the following files as requested: crash.txt patchlist.txt /etc/shutdownlog
If an HPMC was detected: /var/tombstones/ts99
If an MCA was detected: /var/tombstones/mca*
If dump was the result of a hang: /var/adm/syslog/OLDsyslog.log
If dump was a Serviceguard TOC: files listed in step 4.2
EMAIL REQUIREMENTS:
To: HPSupport_global@hp.com
Cc: hpcu@atl.hp.com
Subject:<CASE:YOUR_CASE_NUMBER> (Note: there must be no spaces between the ":" and your case ID, or between the ID and the closing ">".)
Example Subject:
Subject:<CASE:3601123456>
EMAIL RECOMMENDATIONS:
Unless requested, DO NOT send this data to the engineer's personal email address.
Send files as attachments when possible.
Send fresh messages, not replies.
Mail size must be under 2 MB; anything larger will be rejected.
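Checking attachment sizes against the 2 MB limit before sending avoids a silent bounce. The sketch below uses a throwaway 3 MB file; on the real system you would check crash.txt, patchlist.txt, and the other files listed above.

```shell
# Sketch: flag attachments over the 2 MB mail limit before sending.
# A throwaway 3 MB file stands in for the real attachments.
FILE=$(mktemp)
dd if=/dev/zero of="$FILE" bs=1024 count=3072 2>/dev/null   # 3 MB

SIZE=$(wc -c < "$FILE")
LIMIT=$((2 * 1024 * 1024))
if [ "$SIZE" -gt "$LIMIT" ]; then
    ACTION="split or compress before mailing"
else
    ACTION="ok to mail"
fi
echo "$ACTION"
rm -f "$FILE"
```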
After emailing the data, please notify HP that dump email has been sent for action (via callback or ITRC note).
If e-mail or FTP is not available, create a tar tape/CD of the crash.N files (use relative paths) and send the media to:
