Document Text

Title       : HP-UX Data Corruption Cookbook
Date        : 970515
Type        : EN
Document ID : KNC050897002

PROBLEM TEXT

I suspect that there is a data corruption problem on my HP system.  How
can I troubleshoot the problem, and what information should I gather to
provide HP support?

DETAIL TEXT

Operating System -  HP-UX
    Version -
Hardware System -   9000
    Series -

RESOLUTION TEXT


           Data Corruption High Availability Cookbook

Last revised: February 20, 1997

This document is intended to help you troubleshoot a data corruption
problem on HP-UX.

In addition, it covers some general troubleshooting techniques
and the data to collect in the event that you contact HP support.


Contents:

     I. What is Data Corruption?
    II. How Do I Find Where the Corruption Is?
   III. How Do I Recover From the Corruption?
    IV. Preventative Steps to Data Corruption
     V. Data Corruption Checklist
    VI. Appendices


I. What is Data Corruption?

    Corruption can be defined as the state of an object such that
    it contains alterations, errors or contamination.

    With regard to data, corruption means the integrity of the
    data is no longer valid.  This can exhibit itself in many
    different ways:
       o file system panics
             - freeing free inode
             - alloccgblk
             - can't find blk in cyl
       o exec format errors
       o bad magic number for a shared library
       o application aborts
       o database errors or internal check errors in logs
             - Oracle internal errors
               - ORA-00600 error 3339
               - ORA-00600 error 3398
               - ORA-01578
             - unable to rollback errors
       o inconsistency between data in memory and on disk
       o application data check failures

    What causes data to be corrupted?  There are a multitude of causes.
    Typical causes are:
       o interruptible power (i.e. no UPS)
       o excessive RFI
       o improper temperature maintenance in the computer room
       o improper humidity maintenance in the computer room
       o improper training of operator staff
       o improper access mode settings on applications
         (e.g. devices, diagnostics)
       o force mounting of disks whose integrity is unknown or
         known to be bad
       o faulty disk mechanism
       o faulty disk controller
       o faulty interface card
         (e.g. F/W SCSI, SE SCSI, LAN, FDDI, X.25)
       o faulty connectors
       o faulty cables
       o defective kernel code
       o defective networking code
       o defective application/database code
       o use of shared disks in a multi-initiator environment without
         MC/ServiceGuard

II.  How Do I Find Where the Corruption Is?

    Typically, the first sign of corruption occurs soon after
    something has changed on the system:
        o operating system release update
        o new hardware is added to the system
        o patch installation
        o application revision update
        o large increase in users
        o change of operator staff
        o change of disk reporting schemes (e.g. Immediate Reporting)

    Once some sort of data corruption is found (whether it be
    via a panic, application abort, or a user complaining that
    their data doesn't look right), a thorough check of the
    system must be made to determine the extent of the corruption
    as well as what changed that may have triggered the corruption.

    Various logs and data are available to assist you in finding
    where the corruption is, as well as the cause of the
    corruption:
        o circular system message buffer (obtained via
          /etc/dmesg at HP-UX 9.x or /sbin/dmesg at HP-UX 10.x)
        o system diagnostics (LOGTOOL) logs (see Appendix A)
        o syslog output (in files /usr/adm/syslog* at HP-UX 9.x
          or /var/adm/syslog/* at HP-UX 10.x)
        o application log files
        o application alert logs (see Appendix B)
        o application trace logs (see Appendix B)
        o application transaction logs
        o system console messages (see Appendix C)
        o filesystem panic message (see Appendix D)

    With the above information, you should have a better handle
    on isolating the file system device experiencing corruption.
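
    The dmesg and syslog locations listed above differ between HP-UX
    9.x and 10.x.  The following is a minimal sketch of selecting them
    from the release string reported by "uname -r".  The function name
    log_paths_for_release is hypothetical, and the sketch assumes
    release strings of the form A.09.04 or B.10.01, as seen in the
    examples in Appendices B and D.

```shell
#!/bin/sh
# Hypothetical helper: print the dmesg command and syslog location
# that this document lists for a given HP-UX release string.
# Assumes release strings such as "A.09.04" (9.x) or "B.10.01" (10.x).
log_paths_for_release() {
    case "$1" in
        *.09.*) echo "/etc/dmesg /usr/adm/syslog*" ;;
        *.10.*) echo "/sbin/dmesg /var/adm/syslog/*" ;;
        *)      echo "unknown release: $1" >&2; return 1 ;;
    esac
}

# Usage on the affected system:
#   log_paths_for_release "`uname -r`"
```

    Any release the function does not recognize is reported as unknown
    rather than guessed, so the paths can be checked by hand.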


III.  How Do I Recover From the Corruption?

    Once you discover you have data corruption in your data
    files, or file system, the first step is to stop continued
    activity.  This means notifying all those on the system to
    cease activity and going to single-user mode so that no
    further corruption can occur.

    In the case of a data file corruption, be prepared to reload
    from the latest backup.  Beware - the latest backup may
    contain a corrupted file also, so you may have to go back
    through numerous backups.  Once the files are restored,
    a full backup should be taken (unless the backup utility
    is the cause of the corruption - at which time, another
    backup utility should be used).

    If the corruption is found in an application file or database
    table/tablespace, take the damaged file "offline".  In a
    generic application, it may mean shutting down the whole
    application.  In a database application, it typically means
    shutting off full access to the table/tablespace via selective
    offline commands.  Once offline, the file or table/tablespace
    can be restored from logs (e.g. redo logs, journals) or rebuilt.

    It is important to know that, once data corruption has
    occurred on your disk, it will continue to exist until
    someone fixes it (e.g. corrects the database data
    files/index, runs fsck on the file systems).  Note that fsck
    will only detect/correct problems with the file system
    infrastructure, and not the contents of files - meaning data
    files (raw or file system files) will be neither examined nor
    corrected by fsck.  So we recommend running application
    consistency checks after resolving the cause of the corruption
    (e.g. installing patches, replacing hardware) to ensure that
    corruption is not encountered later.
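
    Because fsck does not look at file contents, a content-level check
    has to come from somewhere else.  The sketch below is one generic
    possibility, not a replacement for the application's own consistency
    checks: it records cksum values for files that are not expected to
    change (for example, files restored while the application is
    offline) and reports any that differ later.  The function names
    make_manifest and verify_manifest are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers: record and re-verify checksums of data files.
# cksum only shows that contents changed; it cannot say which copy is good.
make_manifest() {   # usage: make_manifest manifest_file file...
    manifest=$1; shift
    cksum "$@" > "$manifest"
}

verify_manifest() { # usage: verify_manifest manifest_file
    status=0
    # Each manifest line is "checksum size filename", as written by cksum.
    while read -r sum size name; do
        now=`cksum "$name" 2>/dev/null | awk '{print $1 " " $2}'`
        if [ "$now" != "$sum $size" ]; then
            echo "MISMATCH: $name"
            status=1
        fi
    done < "$1"
    return $status
}
```

    A mismatch only means the file differs from what was recorded; a
    file that the application legitimately rewrote will also be flagged.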

    ** If the application in question is not an HP supported
    ** application, you are advised to contact your application
    ** support vendor for specific troubleshooting and recovery
    ** steps.

IV.  Preventative Steps to Data Corruption

     Numerous steps can be taken to minimize the impact of one
     or multiple instances of data corruption:
       o regular backups
         - full backups
         - daily incremental backups
         - disk copying
       o do not bypass fsck checks at boot
       o do not set access mode permissions on devices to
         be writable by all
       o keep copies of important files or information that
         can be used in reconstructing your disks/files.  Here
         is some of that information:
         - /etc/checklist or /etc/fstab
         - /etc/passwd
         - bdf list
         - LVM information
               - vgcfgbackup
               - "vgdisplay -v" output
               - "lvdisplay -v" output
               - LVM map
         - database/application configuration files
       o install HP-UX and firmware patches for data corruption problems.
         You can search for applicable patches by using the HP Electronic
         Support Center:

            For customers in the Americas and Asia-Pacific, access:
            http://us-support.external.hp.com
            Click on "Patch Database".
            Follow the instructions provided on the website.

            For customers in Europe, access:
            http://europe-support.external.hp.com
            Click on "Patch Database".
            Follow the instructions provided on the website.
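
     The reconstruction information listed above can be saved with a
     short script.  The following is a minimal sketch; the function name
     save_config_snapshot and the output file names are hypothetical.
     It skips any file or command that is not present, and it does not
     cover "lvdisplay -v", which needs a logical volume path for each
     volume, or the database/application configuration files.

```shell
#!/bin/sh
# Hypothetical helper: save reconstruction information into one directory.
# Files and commands that do not exist on this release are skipped.
save_config_snapshot() {   # usage: save_config_snapshot output_dir
    outdir=$1
    mkdir -p "$outdir" || return 1
    for f in /etc/checklist /etc/fstab /etc/passwd; do
        if [ -f "$f" ]; then cp "$f" "$outdir/"; fi
    done
    if command -v bdf >/dev/null 2>&1; then
        bdf > "$outdir/bdf.out" 2>&1
    fi
    if command -v vgdisplay >/dev/null 2>&1; then
        vgdisplay -v > "$outdir/vgdisplay.out" 2>&1
    fi
    return 0
}

# Usage: save_config_snapshot /var/tmp/config.snapshot
```

     Keep the resulting directory somewhere other than the disks it
     describes, for example on the backup media.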


V. Data Corruption Checklist

    If you need to contact HP support for help in resolving the
    data corruption problem, please qualify the type of data
    corruption and be prepared to answer some questions regarding
    the configuration of your system and the nature of the corruption.
    Here are the types of data corruption:
        o file system corruption
        o database corruption
        o application data corruption
    The checklist below can assist you and HP support in gathering
    the necessary information to analyze the data corruption.  It is
    not guaranteed to be an all-inclusive list of every item of information
    needed to resolve your data corruption problem.  You may not be able
    to execute every step without HP support assistance.

    DATA CORRUPTION CHECKLIST:

   1. HP-UX revision and system model number, obtained from the following
      command:
           uname -a

   2. Number of CPUs in use.  If different than the number installed, give
      this number, too.

   3. Application Revision (use full number format [e.g. Netscape 2.04]).
      If database corruption is involved, please provide Database Server
      Application Revision (please use full number format [e.g. Oracle
      7.1.4.10, Oracle 7.1.6.2, Informix 7.13.UC1, Sybase 11.0.1]).

   4. Include what(1) output for the kernel, typically:
          what /stand/vmunix    (HP-UX 10.x)
          what /hp-ux           (HP-UX 9.x)
      as well as a list of the products installed:
          swlist -l product     (HP-UX 10.x)
          ls /etc/filesets      (HP-UX 9.x)

   5. List of application patches and Database Application/Server
      patches installed.

   6. Describe the nature of the data corruption in detail.  Include the
      "bad data" and, if possible, how it is bad (specifically, what it
      should really look like), along with any logs documenting the
      corruption.  This includes database/application trace logs, if
      applicable.  Also, determine whether the corruption is on disk or
      whether an application is aborting with an error about a corruption.

   7. Have there been any other types of failures on the system?
      If so, please give details for each, even if they seem unrelated to
      the corruption problems.

   8. Please give a history for each of the problems, including the data
      corruption, stated above.  Include dates, if possible (estimate, if
      necessary).

   9. Disk information: List disk hardware path, device file, model name,
      and firmware revision (if known) for each disk where corruption has
      been seen. All except the firmware revision can be gathered using
      the command:
           ioscan -knf -C disk

      Example listing:
           path      devicefile  model    firmware revision
           2/12.0.4  c0t5d4      C2490WD  5193
           8.0.0     c2d0s2      C2430D   0305
           2/16.0.3  c0t2d2      C2300WD  HP08
           2/16.0.4  c0t2d2      C2300WD  HP03
           2/12.0.3  c0t2d2      C2300WD  8.61
           2/12.0.3  c0t2d2      C2300WD  8.61

   10. Are any of the busses on this system shared with another system?
       Which busses?

       Are any of these busses active on the other systems?  If so, list
       which ones.

   11. Is the application using raw or filesystem I/O?  Which
       filesystem type is involved (e.g. hfs or vxfs)?

       If filesystem is used, what are the block and fragment sizes?
       Here is an example of obtaining this information for an HFS
       filesystem corresponding to raw device file /dev/vg02/rlvol1:
            tunefs -v /dev/vg02/rlvol1
       bsize is the block size; fsize is the fragment size.

   12. Is the async driver being used by the application? If yes, what bits
       are set in the minor number for /dev/async:
            ll /dev/async    (minor number is the field starting with "0x")

   13. Are the disks used by the application under LVM? If so, provide
       vgdisplay -v output for each volume group (shows mirroring).

   14. Please be prepared to provide dial-in or internet access, with
       root passwords, for any system that's failing.

   15. If an application corruption failure occurs, does a retry of the
       application command/action/transaction succeed?

   16. Where is the corruption:
        - application/database data file
        - application file system
        - database index/table file
        - database transaction logs
        - application logs
        - some other location
       Please specify what type of file.  Please specify where the
       corruption is for each instance of corruption.

   17. Documentation from multiple instances of corruption errors:
       include logs for each.
       (For example, for an Oracle database corruption, an instance can be
       characterized by obtaining a trace file using level 10 logging and
       dd'ing the block directly from disk.  See Appendix B)

   18. Are you able to reproduce this data corruption failure at will?
       If not, what is the frequency of failure?

   19. Dates of application changes, date of last database rebuild, if
       applicable.

   20. Application or database application/server errors experienced.

   21. LOGTOOL errors occurring at or most recently before the "corruption"
       errors for the system and for the I/O card with the error.  Please
       see Appendix A for the steps for using LOGTOOL.

   22. For each interface card used for disk I/O, list the firmware
       revision used on the card, if known (this value can be
       obtained from diagnostic output).  For example:

            HP-PB F/W SCSI Firmware Rev  3636

   23. What type of networking are you using? FDDI? LAN?
       Please describe and include the firmware revision on the
       network interface, if known (for FDDI, also obtain the output
       of "what /usr/lib/fddi_dnld").
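
    Several checklist items are plain command output (items 1, 4, 9
    and 12).  The following is a minimal sketch of collecting them into
    one report file.  The names run_item, gather_checklist and the
    default report path are hypothetical, and "ls -l" stands in for
    "ll".  Commands that are not installed are noted in the report
    rather than treated as errors.

```shell
#!/bin/sh
# Hypothetical helpers: append checklist command output to a report file.
REPORT=${REPORT:-/tmp/corruption_checklist.txt}

run_item() {   # usage: run_item "label" command [args...]
    label=$1; shift
    echo "=== $label: $*" >> "$REPORT"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" >> "$REPORT" 2>&1
    else
        echo "(command not available on this system)" >> "$REPORT"
    fi
    return 0
}

gather_checklist() {
    run_item "item 1" uname -a
    if [ -f /stand/vmunix ]; then        # HP-UX 10.x
        run_item "item 4" what /stand/vmunix
        run_item "item 4" swlist -l product
    elif [ -f /hp-ux ]; then             # HP-UX 9.x
        run_item "item 4" what /hp-ux
        run_item "item 4" ls /etc/filesets
    fi
    run_item "item 9" ioscan -knf -C disk
    run_item "item 12" ls -l /dev/async
}

# Usage: gather_checklist, then send the file named by $REPORT.
```

    The remaining items need answers from the people who run the
    system and the application, so the report is a starting point only.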


VI.  APPENDICES

Appendix A:  How to use LOGTOOL (an example)
_________________________________________________________________

  1. Log onto system as root.

  2. Type "sysdiag".
     # sysdiag

  3. At the DUI prompt type "logtool".
     DUI> logtool

  4. At the LOGTOOL prompt type "status detail".
     LOGTOOL> status detail

     This will list the number of error logs on the system,
     including time stamp of the first log entry and size of the log.

     For Example:

     LOGTOOL> status detail

     Log File    Rec #1        Rec #1    Total
     Name        Date          Time      #Records
     ==========  ============  ========  ========
     LOG0000     02/20/97       9:03 AM        1
     LOG0001     03/24/97      10:10 AM        1
     LOG0002*                                  0

  5. From the time stamp of the first record, determine which log
     you are interested in viewing.  If it is the most recent log
     (the one with the * next to it), then issue the LOGTOOL "switchlog"
     command to force the log file in question to be closed and a new
     one to be opened.

  6. Display the contents of the log and redirect the output to a file
     by using the following command:

     LOGTOOL> list log=xxxx outfile=<filename>

     where xxxx is the integer of the log file name and <filename> is a
     standard file name. You can drop the leading zeroes from the
     integer number (note: use only alphanumeric characters in the
     filename and limit the filename to eight characters).

     For example, if you want to display the contents of LOG0001, type:

     LOGTOOL> list log=1 outfile=log1

     In the above example, the contents of LOG0001 will be redirected
     to a file called log1. It will be automatically placed under the
     /usr/diag/bin directory on an HP-UX 9.x system or /usr/sbin/diag
     on an HP-UX 10.x system.

  7. Now, to view this log, exit out of LOGTOOL and DUI and cd to the
     directory containing log1. Then do a 'more' or 'vi' on the file.
     For example:

     LOGTOOL> exit
     DUI> exit
     # cd /usr/sbin/diag  (on HP-UX 10.x)
     # more log1

     The fields of interest are PRODUCT NAME, PDEV, HARDWARE STATUS, and
     DATE and TIME. This will give you an idea of what device is causing
     the error and possibly what kind of error.  HP support assistance may
     be needed to interpret the fields.
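
     To skim a long listing for the fields of interest, the saved file
     can be filtered with grep.  This is a minimal sketch that assumes
     the field names appear literally in the listing, as they are named
     above; the exact layout of a LOGTOOL listing is not shown in this
     document, and the function name logtool_fields is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: show only the lines of a saved LOGTOOL listing
# that mention the fields of interest named in step 7.
logtool_fields() {   # usage: logtool_fields listing_file
    grep -e 'PRODUCT NAME' -e 'PDEV' -e 'HARDWARE STATUS' \
         -e 'DATE' -e 'TIME' "$1"
}

# Usage (HP-UX 10.x location from step 6):
#   logtool_fields /usr/sbin/diag/log1
```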

Appendix B: Oracle7 Alert Log Example
_________________________________________________________________

     Oracle Alert Log Excerpt
     ========================
     Starting up ORACLE RDBMS Version: 7.1.6.2.0.
     System parameters with non-default values:
     processes                = 120
     timed_statistics         = TRUE
     shared_pool_size         = 125829120
     control_files            = ?/dbs/cntrlPRD.dbf,
             ?/sapdata1/cntrl/cntrlPRD.dbf, ?/sapdata2/cntrl/cntrlPRD.dbf
     db_block_buffers         = 24576
     log_archive_start        = TRUE
     log_archive_dest         = ?/saparch/PRDarch
     log_buffer               = 245760
     log_checkpoint_interval  = 3000000000
     db_files                 = 254
     checkpoint_process       = TRUE
     row_locking              = always
     rollback_segments        = PRS_1, PRS_2, PRS_3, PRS_4, PRS_5,
                                PRS_6, PRS_7, PRS_8, PRS_9, PRS_10
     row_cache_cursors        = 300
     remote_login_passwordfile= NONE
     mts_service              = PRD
     mts_servers              = 0
     mts_max_servers          = 0
     mts_max_dispatchers      = 0
     audit_trail              = TRUE
     sort_area_size           = 2097152
     sort_area_retained_size  = 262144
     db_name                  = PRD
     open_cursors             = 800
     optimizer_mode           = rule
     background_dump_dest     = ?/saptrace/background
     user_dump_dest           = ?/saptrace/usertrace
     core_dump_dest           = ?/saptrace/background
     PMON started
     DBWR started
     ARCH started
     LGWR started
     CKPT started
     Thu Feb  6 09:50:23 1997
     alter database  mount exclusive
     Thu Feb  6 09:50:23 1997
     Completed: alter database  mount exclusive
     Thu Feb  6 09:50:23 1997
     alter database  open
     Thu Feb  6 09:50:28 1997
     Thread 1 opened at log sequence 1172
     Current log# 4 seq# 1172 mem# 0: /oracle/PRD/origlogB/log_g4_m1.dbf
     Thu Feb  6 09:50:28 1997
     SMON: enabling cache recovery
     SMON: enabling tx recovery
     Thu Feb  6 09:50:29 1997
     Completed: alter database  open
     ...
     Errors in file /oracle/PRD/saptrace/usertrace/ora_24982.trc:
     ORA-00600: internal error code, arguments: [3339], [85082112],
             [3154128492], [], [], [], [], []

     ======================================================================

     The above information tells you that an internal error was
     experienced by the database application and the details are
     in the trace log file: /oracle/PRD/saptrace/usertrace/ora_24982.trc

     So the next step is to look at the corresponding trace log file.

     Oracle Trace Log Excerpt
     ========================
     Oracle7 Server Release 7.1.6.2.0 - Production Release
     PL/SQL Release 2.1.6.2.0 - Production
     ORACLE_HOME = /oracle/PRD
     ORACLE_SID = PRD
     Oracle process number: 42        Unix process id: 11799
     System name:    HP-UX
     Node name:      pddbs
     Release:        B.10.01
     Version:        E
     Machine:        9000/892

     Sun Dec  8 16:37:09 1996
     *** SESSION ID:(39.19)
     ***
     Corrupt block=8c00dc70 file=35. blocknum=56432. found during buffer read
     on disk type=2. ver=1. dba=6403cb1d inc=cc9803 seq=2 incseq=15b0105
     Sun Dec  8 16:37:09 1996
     ksedmp: internal or fatal error
     ORA-00600: internal error code, arguments: [3339], [1677970205],
                [2348866672], [], [], [], [], []
     ...
     ---- Argument/Register Address Dump -----
     Argument/Register addr=34bb4.  Dump of memory from 34B74 to 34CB4
     34B60                                    20617272 61790000 00000000
     34B80 5244424D 53000000 4F524100 6572726F 72206D65 73736167 65206669
                                                  6C65206E
     34BA0 616D6500 4F52415F 4E50495F 4552524F 52000000 6B736564 6D703A20
                                                  696E7465
     34BC0 726E616C 206F7220 66617461 6C206572 726F720A 00000000 6B736564
                                                  6D703A20
     34BE0 6E6F2063 75727265 6E742063 6F6E7465 78742061 7265610A 00000000
                                                  76657273
     34C00 696F6E20 6E756D62 65720000 6B736574 73746400 4B534554 53544400
                                                  25252D25
     34C20 64732025 252D2564 73202525 2D256473 20000000 25732525 2D256473
                                                  0A000000
     34C40 0A0A2D2D 2D2D2D20 43616C6C 20537461 636B2054 72616365 202D2D2D
                                                  2D2D0A00
     34C60 63616C6C 696E6700 63616C6C 00000000 656E7472 79000000 61726775
                                                  6D656E74
     34C80 2076616C 75657320 696E2068 65780000 6C6F6361 74696F6E 00000000
                                                  74797065
     34CA0 00000000 706F696E 74000000 283F206D 65616E73
     ...
     7B03A0A0 00000000 00000000 00000000 00000000 00000000 00000000 00000000
                                                  00000000
     Repeat 1 times
     7B03A0E0 00000000 00000000 00000000 00000000 00000000 00000000 00000000
                                                  000E416F
     ----- End of Call Stack Trace -----

     ======================================================================

     The above information tells you which block (block=8c00dc70) in which
     file (file=35), and in which blocknum (blocknum=56432) the corruption
     resides.  You will need to find out which file this number corresponds to.
     If the system is running sqlplus, the following procedure can be used:

     sqlplus sys/change_on_install <<!
     column file_name format A60
     column blocks format 9999999
     column file_id format 999
     select file_id,file_name,blocks  from dba_data_files;
     exit;
     !
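
     When correlating entries, note that the trace excerpt above prints
     the same values in two notations.  In this excerpt, the decimal
     ORA-00600 arguments are the dba= and block= values written in
     decimal, and the low-order hex digits of block= equal blocknum=.
     This is an observation about the excerpt, not a documented Oracle
     rule.  The sketch below does the conversion; hex_to_dec is a
     hypothetical name, and it assumes a shell that accepts 0x constants
     in arithmetic and uses integers wider than 32 bits, so it may need
     to be run on a different machine than the affected one.

```shell
#!/bin/sh
# Hypothetical helper: convert a hexadecimal value from the trace file
# to decimal using shell arithmetic.
hex_to_dec() {   # usage: hex_to_dec hexdigits   (no 0x prefix)
    echo $((0x$1))
}

hex_to_dec 6403cb1d   # dba=   in the trace; prints 1677970205
hex_to_dec 8c00dc70   # block= in the trace; prints 2348866672
hex_to_dec dc70       # low-order digits of block=; prints 56432
```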

     When you figure out which data file the file number corresponds to
     (let's say /oracle/DB/datafile1), you can get a copy of the block
     from disk by doing a dd of the disk block.  Note that you need to
     multiply the file block number by 8, since dd is reading 1K blocks
     and Oracle tends to use 8K blocks (this may be changed by the
     database administrator, so modify the block number multiplier
     accordingly):

      BLOCKNUM=`expr 56432 \* 8`
      dd if=/oracle/DB/datafile1 of=block.56432 bs=1k \
           skip=$BLOCKNUM count=8

     This gives you the datablock from disk.  The other block is
     in the Oracle trace file itself (reference the Argument/Register
     area above).
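
     The dd step can be wrapped so that the multiplier follows the
     database block size instead of being fixed at 8.  The following is
     a minimal sketch; extract_block is a hypothetical name, and it
     follows the same calculation as above, assuming blocknum counts
     from the start of the data file in units of the database block
     size.

```shell
#!/bin/sh
# Hypothetical helper: copy one database block out of a data file with dd.
# usage: extract_block datafile blocknum block_size_in_kb outfile
extract_block() {
    skip=`expr "$2" \* "$3"`     # offset of the block in 1K units
    dd if="$1" of="$4" bs=1k skip="$skip" count="$3" 2>/dev/null
}

# Example matching the text above (8K database blocks):
#   extract_block /oracle/DB/datafile1 56432 8 block.56432
```

     With 8K blocks this reads 8 units of 1K starting at unit 451456,
     which is 56432 multiplied by 8.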

Appendix C: System Console Messages
_________________________________________________________________

     Example #1:
     SCSI: unrecovered deferred error (dev = 0x1f011000)

     This tells you that the SCSI device in question is device
     0x1f011000.  This type of error typically means that we have
     a faulty disk mechanism, faulty disk controller, or faulty cable.

     Example #2:
     SCSI: Timeout -- bus: 5 -- request timeout -- dev: 1f050000
     SCSI: Timeout -- bus: 5 -- bus hang -- dev: 1f050000, offset: 448

     This tells you that the SCSI device in question is
     device 0x1f050000 and this type of error shows a timeout and
     bus problem.  This can be caused by numerous things: faulty disk
     mechanism, faulty disk controller, faulty termination, faulty
     cable and more.
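
     The dev value in these messages can be split into fields to help
     match it against ioscan output.  The sketch below rests on an
     assumption that this document does not state: that the value is an
     8-bit major number followed by a 24-bit minor number laid out as
     interface instance (8 bits), target (4 bits), LUN (4 bits) and
     options (8 bits).  Example #2 is consistent with that layout, since
     the decoded instance is 5 and the message reports "bus: 5", but
     confirm the result with ioscan or HP support.  The function name
     decode_scsi_dev is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: split a SCSI dev value from a console message.
# ASSUMED layout (not stated in this document): 8-bit major, then a
# 24-bit minor of instance (8 bits), target (4), LUN (4), options (8).
decode_scsi_dev() {   # usage: decode_scsi_dev 0x1f050000
    dev=$(($1))
    echo "major=$(( ($dev >> 24) & 0xff ))" \
         "instance=$(( ($dev >> 16) & 0xff ))" \
         "target=$(( ($dev >> 12) & 0xf ))" \
         "lun=$(( ($dev >> 8) & 0xf ))"
}

decode_scsi_dev 0x1f011000   # prints major=31 instance=1 target=1 lun=0
decode_scsi_dev 0x1f050000   # prints major=31 instance=5 target=0 lun=0
```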


Appendix D: File System Panic Message Example
_________________________________________________________________

     Example file system panic string:
     dev = 0x4000000a, ino = 21295, fs = /usr
     @(#)9245XA HP-UX (A.09.04) #0: Mon Nov  8 16:46:17 PST 1993
     panic: (display==0xbf00, flags==0x0) ifree: freeing free inode

     This tells you that the file system in question is /usr and
     the inode number for the file is 21295.  A search of that file
     system can be done via the "find" command with the "-inum n"
     option, where n is the inode number in question.  The file (if
     still available on the disk) can then be examined to investigate
     the corruption.  The next step is to validate the hardware by
     checking diagnostics and reading the LOGTOOL logs (see Appendix A).
     After that, determine which file system and related driver patches
     (i.e. for all the drivers in the I/O path to that specific disk)
     are missing from the system.
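
     The inode search described above can be sketched as follows.  The
     function name find_by_inode is hypothetical.  The -xdev option is
     added here so that find stays on the one file system; inode numbers
     are only unique within a file system, so a match on another mounted
     file system would be an unrelated file.

```shell
#!/bin/sh
# Hypothetical helper: list path names on one file system that use a
# given inode number.
find_by_inode() {   # usage: find_by_inode mount_point inode_number
    find "$1" -xdev -inum "$2" -print
}

# Example from the panic string above:
#   find_by_inode /usr 21295
```

     More than one path may be printed if the file has several hard
     links.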

(c) Copyright 1996-1998 Hewlett-Packard Company.