Document Text
Title : HP-UX Data Corruption Cookbook
Date : 970515
Type : EN
Document ID : KNC050897002
PROBLEM TEXT
I suspect that there is a data corruption problem on my HP system. How
can I troubleshoot the problem, and what information should I gather to
provide HP support?
DETAIL TEXT
Operating System - HP-UX
Version -
Hardware System - 9000
Series -
RESOLUTION TEXT
Data Corruption High Availability Cookbook
Last revised: February 20, 1997
This document is intended to help you troubleshoot a data corruption
problem on HP-UX.
In addition, it covers some general troubleshooting techniques and the
data collection steps to perform before you contact HP support.
Contents:
I. What is Data Corruption?
II. How Do I Find Where the Corruption Is?
III. How Do I Recover From the Corruption?
IV. Preventative Steps to Data Corruption
V. Data Corruption Checklist
VI. Appendices
I. What is Data Corruption?
Corruption can be defined as the state of an object such that
it contains alterations, errors or contamination.
With regard to data, corruption means the integrity of the
data is no longer valid. This can exhibit itself in many
different ways:
o file system panics
- freeing free inode
- alloccgblk
- can't find blk in cyl
o exec format errors
o bad magic number for a shared library
o application aborts
o database errors or internal check error logs
- Oracle internal errors
- ORA-00600 error 3339
- ORA-00600 error 3398
- ORA-01578
- unable to rollback errors
o inconsistency between data in memory and on disk
o application data check failures
What causes data to be corrupted? There are a multitude of causes.
Typical causes are:
o interruptible power (ie. no UPS)
o excessive RFI
o improper temperature maintenance in the computer room
o improper humidity maintenance in the computer room
o improper training of operator staff
o improper access mode settings on applications
(ie. devices, diagnostics)
o force mounting of disks whose integrity is unknown or
known to be bad
o faulty disk mechanism
o faulty disk controller
o faulty interface card
(ie. F/W SCSI, SE SCSI, LAN, FDDI, X.25)
o faulty connectors
o faulty cables
o defective kernel code
o defective networking code
o defective application/database code
o use of shared disks in a multi-initiator environment without
MC/ServiceGuard
II. How Do I Find Where the Corruption Is?
Typically, the first sign of corruption occurs soon after
something has changed on the system:
o operating system release update
o new hardware is added to the system
o patch installation
o application revision update
o large increase in users
o change of operator staff
o change of disk reporting schemes (ie. Immediate Reporting)
Once some sort of data corruption is found (whether it be
via a panic, application abort, or a user complaining that
their data doesn't look right), a thorough check of the
system must be made to determine the extent of the corruption
as well as what changed that may have triggered the corruption.
Various logs and data are available to help you locate the
corruption and identify its cause:
o circular system message buffer (obtained via
/etc/dmesg at HP-UX 9.x or /sbin/dmesg at HP-UX 10.x)
o system diagnostics (LOGTOOL) logs (see Appendix A)
o syslog output (in files /usr/adm/syslog* at HP-UX 9.x
or /var/adm/syslog/* at HP-UX 10.x)
o application log files
o application alert logs (see Appendix B)
o application trace logs (see Appendix B)
o application transaction logs
o system console messages (see Appendix C)
o filesystem panic message (see Appendix D)
With the above information, you should have a better handle
on isolating the file system device experiencing corruption.
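As a rough sketch of a first pass through these logs, the helper below
searches saved log files for a few of the message strings that appear in
this document. The function name scan_logs and the pattern list are
illustrative assumptions, not an HP tool; extend the patterns to cover
your own applications. A POSIX shell is assumed.

```shell
# scan_logs FILE...: print lines from the given log files that match
# message strings mentioned in this document.
# Hypothetical helper, not an HP-supplied tool.
scan_logs() {
    # /dev/null is added so grep always prefixes a match with its file name.
    # -n adds the line number of each match.
    grep -n -e 'SCSI:' -e 'panic' -e 'freeing free inode' \
         -e 'ORA-00600' -e 'ORA-01578' "$@" /dev/null
}
```

On an HP-UX 10.x system, for example, you could save the message buffer
with "/sbin/dmesg > /tmp/dmesg.out" and then run
"scan_logs /tmp/dmesg.out /var/adm/syslog/*". The path /tmp/dmesg.out is
only an example. A match tells you where to start reading; it does not
replace a full review of the logs.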
III. How Do I Recover From the Corruption?
Once you discover you have data corruption in your data
files, or file system, the first step is to stop continued
activity. This means notifying all those on the system to
cease activity and going to single-user mode so that no
further corruption can occur.
In the case of a data file corruption, be prepared to reload
from the latest backup. Beware - the latest backup may
contain a corrupted file also, so you may have to go back
through numerous backups. Once the files are restored,
a full backup should be taken (unless the backup utility
is the cause of the corruption - at which time, another
backup utility should be used).
If the corruption is found in an application file or database
table/tablespace, take the damaged file "offline". In a
generic application, it may mean shutting down the whole
application. In a database application, it typically means
shutting off full access to the table/tablespace via selective
offline commands. Once offline, the file or table/tablespace
can be restored from logs (ie. redo logs, journals) or rebuilt.
It is important to know that, once you have had data
corruption occur on your disk, it will continue to exist
until someone fixes it (ie. correct the database data
files/index, fsck the file systems). Note that fsck will only
detect/correct problems with the file system infrastructure,
and not the contents of files - meaning data files (raw or
file system files) will not be examined nor corrected
by fsck. So we recommend running application consistency
checks after resolving the cause of the corruption (ie.
install patches, replace hardware) to ensure that corruption
is not encountered later.
** If the application in question is not an HP supported
** application, you are advised to contact your application
** support vendor for specific troubleshooting and recovery
** steps.
IV. Preventative Steps to Data Corruption
Numerous steps can be taken to minimize the impact of one
or multiple instances of data corruption:
o regular backups
- full backups
- daily incremental backups
- disk copying
o do not bypass fsck checks at boot
o do not set access mode permissions on devices to
be writable by all
o keep copies of important files or information that
can be used in reconstructing your disks/files. Here
is some of that information:
- /etc/checklist or /etc/fstab
- /etc/passwd
- bdf list
- LVM information
- vgcfgbackup
- "vgdisplay -v" output
- "lvdisplay -v" output
- LVM map
- database/application configuration files
o install HP-UX and firmware patches for data corruption problems.
You can search for applicable patches by using the HP Electronic
Support Center:
For customers in the Americas and Asia-Pacific, access:
http://us-support.external.hp.com
Click on "Patch Database".
Follow the instructions provided on the website.
For customers in Europe, access:
http://europe-support.external.hp.com
Click on "Patch Database".
Follow the instructions provided on the website.
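The suggestion earlier in this section to keep copies of important files
can be sketched as a small shell function. The name snapshot_files and
the destination directory used in the example are assumptions made for
illustration. A POSIX shell is assumed.

```shell
# snapshot_files DEST FILE...: copy each readable FILE into DEST, keeping
# its base name and its modification time. Hypothetical helper; DEST
# should be on a different disk (or be written to tape afterwards) so it
# survives the loss of the original.
snapshot_files() {
    dest=$1
    shift
    mkdir -p "$dest" || return 1
    for f in "$@"; do
        if [ -r "$f" ]; then
            cp -p "$f" "$dest/" || return 1
        else
            # Report, but do not fail on, files this system does not have.
            echo "skipped (not readable): $f" >&2
        fi
    done
}
```

For example, "snapshot_files /backup/cfg /etc/fstab /etc/passwd" copies
the two files into the example directory /backup/cfg. The output of bdf
and "vgdisplay -v" can be redirected into files in the same directory.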
V. Data Corruption Checklist
If you need to contact HP support for help in resolving the
data corruption problem, please qualify the type of data
corruption and be prepared to answer some questions regarding
the configuration of your system and the nature of the corruption.
Here are the types of data corruption:
o file system corruption
o database corruption
o application data corruption
The checklist below can assist you and HP support in gathering
the necessary information to analyze the data corruption. It is
not guaranteed to be an all-inclusive list of every item of information
needed to resolve your data corruption problem. You may not be able
to execute every step without HP support assistance.
DATA CORRUPTION CHECKLIST:
1. HP-UX revision and system model number, obtained from the following
command:
uname -a
2. Number of CPUs in use. If this differs from the number installed,
give the number installed, too.
3. Application Revision (use full number format [ie. Netscape 2.04]).
If database corruption is involved, please provide Database Server
Application Revision (please use full number format [ie. Oracle
7.1.4.10, Oracle 7.1.6.2, Informix 7.13.UC1, Sybase 11.0.1]).
4. Include what(1) of kernel, typically:
what /stand/vmunix (HP-UX 10.x)
what /hp-ux (HP-UX 9.x)
as well as a list of the products installed:
swlist -l product (HP-UX 10.x)
ls /etc/filesets (HP-UX 9.x)
5. List of application patches and Database Application/Server
patches installed.
6. Describe the nature of the data corruption in detail.
Include the "bad data" and, if possible, how it is bad (specifically,
what it should really look like) and any logs documenting the corruption.
This includes database/application trace logs, if applicable.
Also, determine if the corruption is on disk or if the corruption is
an application aborting with an error about a corruption.
7. Have there been any other types of failures on the system?
If so, please give details for each, even if they seem unrelated to
the corruption problems.
8. Please give a history for each of the problems, including the data
corruption, stated above. Include dates, if possible (estimate, if
necessary).
9. Disk information: List disk hardware path, device file, model name,
and firmware revision (if known) for each disk where corruption has
been seen. All except the firmware revision can be gathered using
the command:
ioscan -knf -C disk
Example listing:
path        devicefile   model      firmware revision
2/12.0.4    c0t5d4       C2490WD    5193
8.0.0       c2d0s2       C2430D     0305
2/16.0.3    c0t2d2       C2300WD    HP08
2/16.0.4    c0t2d2       C2300WD    HP03
2/12.0.3    c0t2d2       C2300WD    8.61
2/12.0.3    c0t2d2       C2300WD    8.61
10. Are any of the busses on this system shared with another system?
Which busses?
Are any of these busses active on the other systems? If so, list
which ones.
11. Is the application using raw or filesystem I/O? Which
filesystem type is involved (i.e. hfs, or vxfs)?
If filesystem is used, what are the block and fragment sizes?
Here is an example of obtaining this information for an HFS
filesystem corresponding to raw device file /dev/vg02/rlvol1:
tunefs -v /dev/vg02/rlvol1
bsize is the block size; fsize is the fragment size.
12. Is the async driver being used by the application? If yes, what bits
are set in the minor number for /dev/async:
ll /dev/async (minor number is the field starting with "0x")
13. Are the disks used by the application under LVM? If so, provide
vgdisplay -v output for each volume group (shows mirroring).
14. Please be prepared to provide dial-in or internet access, with
root passwords, for any system that's failing.
15. If an application corruption failure occurs, does a retry of the
application command/action/transaction succeed?
16. Where is the corruption:
- application/database data file
- application file system
- database index/table file
- database transaction logs
- application logs
- some other location
Please specify what type of file. Please specify where the
corruption is for each instance of corruption.
17. Documentation from multiple instances of corruption errors:
include logs for each.
(For example, for an Oracle database corruption, an instance can be
characterized by obtaining a trace file using level 10 logging and
dd'ing the block directly from disk. See Appendix B)
18. Are you able to reproduce this data corruption failure at will?
If not, what is the frequency of failure?
19. Dates of application changes, date of last database rebuild, if
applicable.
20. Application or database application/server errors experienced.
21. LOGTOOL errors occurring at or most recently before the "corruption"
errors for the system and for the I/O card with the error. Please
see Appendix A for the steps for using LOGTOOL.
22. For each interface card used for disk I/O, list the
firmware revision used on the card, if known (this value can be
obtained from diagnostic output). For example:
HP-PB F/W SCSI Firmware Rev 3636
23. What type of networking are you using? FDDI? LAN?
Please describe and include the firmware revision on the
network interface, if known (for FDDI, also obtain the output
of "what /usr/lib/fddi_dnld").
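Several checklist items above are the output of a single command. As a
minimal sketch, the helper below runs one command, saves its output, and
records any command that failed so that you can tell HP support which
steps could not be executed. The name capture, the FAILED file, and the
directory /tmp/hpdata are assumptions made for illustration. A POSIX
shell is assumed.

```shell
# capture OUTDIR NAME COMMAND [ARGS...]: run COMMAND, save its stdout and
# stderr in OUTDIR/NAME.out, and note a failure in OUTDIR/FAILED.
# Hypothetical helper, not an HP-supplied tool.
capture() {
    outdir=$1 name=$2
    shift 2
    mkdir -p "$outdir" || return 1
    if "$@" > "$outdir/$name.out" 2>&1; then
        return 0
    fi
    echo "$name: $*" >> "$outdir/FAILED"
    return 1
}

# Example use on HP-UX 10.x, with commands taken from the checklist:
#   capture /tmp/hpdata uname     uname -a
#   capture /tmp/hpdata what      what /stand/vmunix
#   capture /tmp/hpdata swlist    swlist -l product
#   capture /tmp/hpdata ioscan    ioscan -knf -C disk
#   capture /tmp/hpdata vgdisplay vgdisplay -v
```

Keeping each command's output in its own file makes it easier to send
only the items HP support asks for.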
VI. APPENDICES
Appendix A: How to use LOGTOOL (an example)
_________________________________________________________________
1. Log onto system as root.
2. Type "sysdiag".
# sysdiag
3. At the DUI prompt type "logtool".
DUI> logtool
4. At the LOGTOOL prompt type "status detail".
LOGTOOL> status detail
This will list the number of error logs on the system,
including time stamp of the first log entry and size of the log.
For Example:
LOGTOOL> status detail
Log File Rec #1 Rec #1 Total
Name Date Time #Records
========== ============ ======== ========
LOG0000 02/20/97 9:03 AM 1
LOG0001 03/24/97 10:10 AM 1
LOG0002* 0
5. From the time stamp of the first record, determine which log
you are interested in viewing. If it is the most recent log
(the one with the * next to it), then issue the LOGTOOL "switchlog"
command to force the log file in question to be closed and a new
one to be opened.
6. Display the contents of the log and redirect the output to a file
by using the following command:
LOGTOOL> list log=xxxx outfile=<filename>
where xxxx is the integer of the log file name and <filename> is a
standard file name. You can drop the leading zeroes from the
integer number (note: use only alphanumeric characters in the
filename and limit the filename to eight characters).
For example, if you want to display the contents of LOG0001, type:
LOGTOOL> list log=1 outfile=log1
In the above example, the contents of LOG0001 will be redirected
to a file called log1. It will be automatically placed under the
/usr/diag/bin directory on an HP-UX 9.x system or /usr/sbin/diag
on an HP-UX 10.x system.
7. Now, to view this log, exit out of LOGTOOL and DUI and cd to the
directory containing log1. Then do a 'more' or 'vi' on the file.
For example:
LOGTOOL> exit
DUI> exit
# cd /usr/sbin/diag (on HP-UX 10.x)
# more log1
The fields of interest are PRODUCT NAME, PDEV, HARDWARE STATUS, and
DATE and TIME. This will give you an idea of what device is causing
the error and possibly what kind of error. HP support assistance may
be needed to interpret the fields.
Appendix B: Oracle7 Alert Log Example
_________________________________________________________________
Oracle Alert Log Excerpt
========================
Starting up ORACLE RDBMS Version: 7.1.6.2.0.
System parameters with non-default values:
processes = 120
timed_statistics = TRUE
shared_pool_size = 125829120
control_files = ?/dbs/cntrlPRD.dbf,
?/sapdata1/cntrl/cntrlPRD.dbf, ?/sapdata2/cntrl/cntrlPRD.dbf
db_block_buffers = 24576
log_archive_start = TRUE
log_archive_dest = ?/saparch/PRDarch
log_buffer = 245760
log_checkpoint_interval = 3000000000
db_files = 254
checkpoint_process = TRUE
row_locking = always
rollback_segments = PRS_1, PRS_2, PRS_3, PRS_4, PRS_5,
PRS_6, PRS_7, PRS_8, PRS_9, PRS_10
row_cache_cursors = 300
remote_login_passwordfile= NONE
mts_service = PRD
mts_servers = 0
mts_max_servers = 0
mts_max_dispatchers = 0
audit_trail = TRUE
sort_area_size = 2097152
sort_area_retained_size = 262144
db_name = PRD
open_cursors = 800
optimizer_mode = rule
background_dump_dest = ?/saptrace/background
user_dump_dest = ?/saptrace/usertrace
core_dump_dest = ?/saptrace/background
PMON started
DBWR started
ARCH started
LGWR started
CKPT started
Thu Feb 6 09:50:23 1997
alter database mount exclusive
Thu Feb 6 09:50:23 1997
Completed: alter database mount exclusive
Thu Feb 6 09:50:23 1997
alter database open
Thu Feb 6 09:50:28 1997
Thread 1 opened at log sequence 1172
Current log# 4 seq# 1172 mem# 0: /oracle/PRD/origlogB/log_g4_m1.dbf
Thu Feb 6 09:50:28 1997
SMON: enabling cache recovery
SMON: enabling tx recovery
Thu Feb 6 09:50:29 1997
Completed: alter database open
...
Errors in file /oracle/PRD/saptrace/usertrace/ora_24982.trc:
ORA-00600: internal error code, arguments: [3339], [85082112],
[3154128492], [], [], [], [], []
======================================================================
The above information tells you that an internal error was
experienced by the database application and the details are
in the trace log file: /oracle/PRD/saptrace/usertrace/ora_24982.trc
So the next step is to look at the corresponding trace log file.
Oracle Trace Log Excerpt
========================
Oracle7 Server Release 7.1.6.2.0 - Production Release
PL/SQL Release 2.1.6.2.0 - Production
ORACLE_HOME = /oracle/PRD
ORACLE_SID = PRD
Oracle process number: 42 Unix process id: 11799
System name: HP-UX
Node name: pddbs
Release: B.10.01
Version: E
Machine: 9000/892
Sun Dec 8 16:37:09 1996
*** SESSION ID:(39.19)
***
Corrupt block=8c00dc70 file=35. blocknum=56432. found during buffer read
on disk type=2. ver=1. dba=6403cb1d inc=cc9803 seq=2 incseq=15b0105
Sun Dec 8 16:37:09 1996
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [3339], [1677970205],
[2348866672], [], [], [], [], []
...
---- Argument/Register Address Dump -----
Argument/Register addr=34bb4. Dump of memory from 34B74 to 34CB4
34B60 20617272 61790000 00000000
34B80 5244424D 53000000 4F524100 6572726F 72206D65 73736167 65206669
6C65206E
34BA0 616D6500 4F52415F 4E50495F 4552524F 52000000 6B736564 6D703A20
696E7465
34BC0 726E616C 206F7220 66617461 6C206572 726F720A 00000000 6B736564
6D703A20
34BE0 6E6F2063 75727265 6E742063 6F6E7465 78742061 7265610A 00000000
76657273
34C00 696F6E20 6E756D62 65720000 6B736574 73746400 4B534554 53544400
25252D25
34C20 64732025 252D2564 73202525 2D256473 20000000 25732525 2D256473
0A000000
34C40 0A0A2D2D 2D2D2D20 43616C6C 20537461 636B2054 72616365 202D2D2D
2D2D0A00
34C60 63616C6C 696E6700 63616C6C 00000000 656E7472 79000000 61726775
6D656E74
34C80 2076616C 75657320 696E2068 65780000 6C6F6361 74696F6E 00000000
74797065
34CA0 00000000 706F696E 74000000 283F206D 65616E73
...
7B03A0A0 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
Repeat 1 times
7B03A0E0 00000000 00000000 00000000 00000000 00000000 00000000 00000000
000E416F
----- End of Call Stack Trace -----
======================================================================
The above information tells you which block (block=8c00dc70) in which
file (file=35), and in which blocknum (blocknum=56432) the corruption
resides. You will need to find out which file this number corresponds to.
If the system is running sqlplus, the following procedure can be used:
sqlplus sys/change_on_install <<!
column file_name format A60
column blocks format 9999999
column file_id format 999
select file_id,file_name,blocks from dba_data_files;
exit;
!
When you figure out which file this corresponds with (let's say
/oracle/DB/datafile1), you can get a copy of the block from disk
by doing a dd of the disk block (note you need to multiply the
file block number by 8 since we're doing 1K blocks and Oracle
tends to use 8K blocks [this may be changed by the database
administrator, so modify the block number multiplier accordingly]):
BLOCKNUM=`expr 56432 \* 8`
dd if=/oracle/DB/datafile1 of=block.56432 bs=1k \
skip=$BLOCKNUM count=8
This gives you the datablock from disk. The other block is
in the Oracle trace file itself (reference the Argument/Register
area above).
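The numbers in this appendix can be cross-checked with two small shell
functions. In the trace excerpt, the hexadecimal fields dba=6403cb1d and
block=8c00dc70 are the same values as the second and third ORA-00600
arguments, which are printed in decimal. The function names hex_to_dec
and dd_skip_1k are assumptions made for illustration, and the sketch
assumes a shell whose printf handles values larger than 32 bits.

```shell
# hex_to_dec HEX: print the decimal value of a hexadecimal string given
# without a 0x prefix, e.g. the dba= and block= fields of the trace file.
# Assumes printf accepts values that do not fit in a signed 32-bit integer.
hex_to_dec() {
    printf '%d\n' "0x$1"
}

# dd_skip_1k BLOCKNUM BLOCKSIZE: print the dd "skip" value, in 1K units,
# for database block BLOCKNUM when the database block size is BLOCKSIZE
# bytes (BLOCKSIZE is assumed to be a multiple of 1024).
dd_skip_1k() {
    expr "$1" \* \( "$2" / 1024 \)
}

# With the values from the excerpt above:
#   hex_to_dec 6403cb1d     prints 1677970205
#   hex_to_dec 8c00dc70     prints 2348866672
#   dd_skip_1k 56432 8192   prints 451456
```

If your database uses a different block size, pass that size to
dd_skip_1k and change the dd count to the same number of 1K blocks.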
Appendix C: System Console Messages
_________________________________________________________________
Example #1:
SCSI: unrecovered deferred error (dev = 0x1f011000)
This tells you that the SCSI device in question is device
0x1f011000. This type of error typically means that we have
a faulty disk mechanism, faulty disk controller, or faulty cable.
Example #2:
SCSI: Timeout -- bus: 5 -- request timeout -- dev: 1f050000
SCSI: Timeout -- bus: 5 -- bus hang -- dev: 1f050000, offset: 448
This tells you that the SCSI device in question is
device 0x1f050000 and this type of error shows a timeout and
bus problem. This can be caused by numerous things: faulty disk
mechanism, faulty disk controller, faulty termination, faulty
cable and more.
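To relate a device number from a console message to a device file, the
number can be split into fields. The sketch below ASSUMES the layout
commonly seen in HP-UX 10.x disk device files: an 8-bit major number
followed by an 8-bit interface card instance, a 4-bit target and a 4-bit
LUN. This layout is not stated in the messages themselves, so confirm
the result against the output of "ioscan -fn" and "ll" on the device
files before acting on it. The function name decode_dev is hypothetical.

```shell
# decode_dev DEV: split a device number such as 0x1f011000 into fields.
# ASSUMED layout: bits 31-24 major, 23-16 card instance, 15-12 target,
# 11-8 LUN. Verify against ioscan output on your system.
decode_dev() {
    dev=$(($1))
    echo "major=$(( (dev >> 24) & 0xff ))" \
         "instance=$(( (dev >> 16) & 0xff ))" \
         "target=$(( (dev >> 12) & 0xf ))" \
         "lun=$(( (dev >> 8) & 0xf ))"
}

# decode_dev 0x1f011000   prints: major=31 instance=1 target=1 lun=0
# decode_dev 0x1f050000   prints: major=31 instance=5 target=0 lun=0
```

Under this assumption the second value decodes to instance 5, which
agrees with "bus: 5" in Example #2. The device number in that message is
printed without the 0x prefix, so add it when calling the function.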
Appendix D: File System Panic Message Example
_________________________________________________________________
Example file system panic string:
dev = 0x4000000a, ino = 21295, fs = /usr
@(#)9245XA HP-UX (A.09.04) #0: Mon Nov 8 16:46:17 PST 1993
panic: (display==0xbf00, flags==0x0) ifree: freeing free inode
This tells you that the file system in question is /usr and
the inode number for the file is 21295. A search of that file
system can be done via the "find" command with the "-inum n"
option where n is the inode number in question. The file (if
still available on the disk) can then be examined to investigate
the corruption. The next step is to validate the hardware by
checking diagnostics and reading the LOGTOOL logs (see Appendix A).
The next step is to see what file system and related driver patches
(ie. all the drivers in the I/O path to that specific disk) are
missing from the system.
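The inode search described above can be sketched as a one-line shell
function. The name find_by_inode is an assumption made for illustration.

```shell
# find_by_inode FS INO: list the path(s) under mount point FS whose inode
# number is INO. -xdev keeps find from descending into other file systems
# mounted below FS, where the same inode number is an unrelated file.
find_by_inode() {
    find "$1" -xdev -inum "$2" -print
}

# For the panic message above:
#   find_by_inode /usr 21295
```

More than one path may be printed if the file has several hard links.
No output means no directory entry on that file system currently refers
to the inode.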
(c) Copyright 1996-1998 Hewlett-Packard Company.