HPUX : HP Serviceguard NFS Toolkit – "Stale NFS file handle" when Package Failed Over

Wednesday, April 9, 2014

Overview

After an HP Serviceguard NFS Toolkit package fails over to another node, NFS clients that reach the package's exports through automount can receive "Stale NFS file handle" errors. This post reproduces the problem and describes the fix.
Procedures
Issue
After a package failover, accessing the automounted NFS mount points returns "Stale NFS file handle" errors.

We had already upgraded ONCplus from version B.11.31.10 to B.11.31.11, which ruled out the possibility that this issue was related to QXCR1001067886, titled "After an NFS package re-start, NFS client may not be able to access the NFS mount due to stale file handle error (ESTALE)". Also make sure all network patches are up to date.
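
To confirm which ONCplus revision is actually installed on each node, swlist can be used; a minimal check (the revision shown matches this case, other output is trimmed):

[formatted]
  # swlist ONCplus
  ...
  ONCplus    B.11.31.11    ...
[unformatted]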

Scenario:

1) Log in to NODE1 and make sure the package is up:
[formatted]
  # date
  Fri May 27 16:56:50 wib 2011

  # cmviewcl
  CLUSTER  STATUS      
  cluster_ecc    up  

    NODE      STATUS   STATE
    NODE1     up       running     

      PACKAGE         STATUS       STATE        AUTO_RUN    NODE
      dbciRP1         up           running      enabled     NODE1

    NODE      STATUS   STATE
    NODE2     up       running     
[unformatted]

2) Make sure the automounted NFS mount points are accessible:

[formatted]
  # date
  Fri May 27 16:56:58 wib 2011

  # bdf
  Filesystem           kbytes used   avail %used Mounted on
  ..snap..
  NODE1:/export/usr/sap/trans
                42958848     77173 40201578      0% /usr/sap/trans
  NODE1:/export/sapmnt/RP1
                14647296 1327971 12487061 10% /sapmnt/RP1
[unformatted]

3) Log in to the other node:
[formatted]
  # date
  Fri May 27 16:57:13 wib 2011

  # rlogin NODE2
  # uname -a
  HP-UX NODE2 B.11.31 U ia64 1843445582 unlimited-user license
[unformatted]

4) Make sure the automounted NFS mount points are also accessible on this host:
[formatted]
  # bdf
  Filesystem           kbytes used   avail %used Mounted on
  ..snap..
  NODE1:/export/usr/sap/trans
                42958848     77173 40201578      0% /usr/sap/trans
  NODE1:/export/sapmnt/RP1
                14647296 1327971 12487061 10% /sapmnt/RP1
[unformatted]

5) Halt the package from NODE2:

[formatted]
  # date
  Fri May 27 16:57:36 wib 2011
  # uname -a
  HP-UX NODE2 B.11.31 U ia64 1843445582 unlimited-user license
  # cmhaltpkg dbciRP1
  Disabling automatic failover for failover packages to be halted.
  Halting package dbciRP1
  Successfully halted package dbciRP1
  One or more packages or package instances have been halted.
  The failover packages have AUTO_RUN disabled and no new instance can
  start automatically. To allow automatic start, enable AUTO_RUN via
  cmmodpkg -e <package_name>
  cmhaltpkg: Completed successfully on all packages specified

  * syslog NODE2
  May 27 16:57:41 syslog: cmhaltpkg dbciRP1
  May 27 16:57:42 syslog: Request from root on node NODE2
  to halt package dbciRP1
  May 27 16:57:42 cmcld[9295]: Request from root on node
NODE2 to halt package dbciRP1
  May 27 16:57:42 cmcld[9295]: Request from node NODE1 to
  disable global switching for package dbciRP1.
  May 27 16:57:45 cmcld[9295]: (NODE1) Halted package
  dbciRP1 on node NODE1.

  * syslog NODE1
  May 27 16:57:43 cmcld[8177]: Request from node NODE1 to
  disable global switching for package dbciRP1.
  May 27 16:57:43 cmcld[8177]: Disabled switching for package
  dbciRP1.
  May 27 16:57:43 cmcld[8177]: Request from node NODE1 to
  begin the halting process for package dbciRP1 on node NODE1.
  May 27 16:57:43 cmcld[8177]: Halting package dbciRP1 on
node NODE1 as requested by user.
  May 27 16:57:43 cmcld[8177]: Request from node NODE1 to
  halt package dbciRP1 on node NODE1.
  May 27 16:57:43 cmcld[8177]: Executing
  '/etc/cmcluster/RP1/dbciRP1.control.script stop' for package dbciRP1,
  as service PKG*78849.
  May 27 16:57:43 cmserviced[8181]: Request to stop package
  dbciRP1
  May 27 16:57:43 syslog: cmmodnet -r -i 10.170.12.3
  10.170.12.0
  May 27 16:57:47 LVM[8907]: vgchange -a n vg11
  May 27 16:57:47 LVM[8911]: vgchange -a n vg12
  May 27 16:57:47 LVM[8915]: vgchange -a n vg19
  May 27 16:57:47 LVM[8919]: vgchange -a n vg02
  May 27 16:57:47 LVM[8923]: vgchange -a n vg07
  May 27 16:57:47 LVM[8927]: vgchange -a n vg13
  May 27 16:57:47 LVM[8931]: vgchange -a n vg14
  May 27 16:57:47 LVM[8935]: vgchange -a n vg08
  May 27 16:57:47 LVM[8939]: vgchange -a n vg17
  May 27 16:57:47 LVM[8943]: vgchange -a n vg15
  May 27 16:57:47 LVM[8947]: vgchange -a n vg09
  May 27 16:57:47 LVM[8951]: vgchange -a n vg16
  May 27 16:57:47 LVM[8955]: vgchange -a n vg18
  May 27 16:57:47 LVM[8959]: vgchange -a n vg10
  May 27 16:57:47 LVM[8963]: vgchange -a n vg03
  May 27 16:57:47 cmserviced[8181]: Package Script for
  dbciRP1 completed successfully with an exit(0).
  May 27 16:57:47 cmcld[8177]: Halted package dbciRP1 on node
NODE1.

  * dbciRP1.control.script.log of NODE2
       ########### Node "NODE2": Halting package at Fri May 27
  16:56:10 wib 2011 ###########
  May 27 16:56:10 - Node "NODE2": Remove IP address 10.170.12.3
  from subnet 10.170.12.0
  HANFS -- May 27 16:56:10 - Node "NODE2": Unexporting filesystem
  on /export/sapmnt/RP1
  HANFS -- May 27 16:56:10 - Node "NODE2": Unexporting filesystem
  on /export/usr/sap/trans
  HANFS -- May 27 16:56:10 - Node "NODE2": Killing rpc.statd
  HANFS -- May 27 16:56:10 - Node "NODE2": Killing rpc.lockd
  HANFS -- May 27 16:56:10 - Node "NODE2": Restarting rpc.statd
  HANFS -- May 27 16:56:11 - Node "NODE2": Restarting rpc.lockd
  May 27 16:56:12 - Node "NODE2": Unmounting filesystem on
  /dev/vg10/lvol1
  ...
  May 27 16:56:14 - Node "NODE2": Deactivating volume group vg03
  Deactivated volume group in Exclusive Mode.
  Volume group "vg03" has been successfully changed.

       ########### Node "NODE2": Package halt completed at Fri
  May 27 16:56:14 wib 2011 ###########
  /etc/cmcluster/RP1/dbciRP1.control.script[369]: - o
  largefiles,delaylog, nodatainlog:  not found.
  ...
  /etc/cmcluster/RP1/dbciRP1.control.script[383]: - o
  largefiles,delaylog, nodatainlog:  not found.
[unformatted]

6) Now run the package on NODE2:

[formatted]
  # date
  Fri May 27 16:57:47 wib 2011

  # uname -a
  HP-UX NODE2 B.11.31 U ia64 1843445582 unlimited-user license

  # cmviewcl

  CLUSTER  STATUS      
  cluster_ecc    up  

    NODE      STATUS   STATE
    NODE1     up       running     
    NODE2     up       running     

  UNOWNED_PACKAGES

      PACKAGE         STATUS       STATE        AUTO_RUN    NODE
      dbciRP1         down         halted       disabled    unowned

  # cmrunpkg dbciRP1
  Running package dbciRP1 on node NODE2
  Successfully started package dbciRP1 on node NODE2
  cmrunpkg: All specified packages are running


  * dbciRP1.control.script.log of NODE2
       ########### Node "NODE2": Starting package at Fri May 27
  16:58:04 wib 2011 ###########
  May 27 16:58:04 - Node "NODE2": Activating volume group vg11 with
  exclusive option.
  Activated volume group in Exclusive Mode.
  Volume group "vg11" has been successfully changed.
  ...
  May 27 16:58:04 - Node "NODE2": Checking filesystems:
     /dev/vg19/lvol1
     /dev/vg02/lvol1
     /dev/vg07/lvol1
     /dev/vg11/lvol1
     /dev/vg12/lvol1
     /dev/vg13/lvol1
     /dev/vg14/lvol1
     /dev/vg08/lvol1
     /dev/vg17/lvol1
     /dev/vg15/lvol1
     /dev/vg09/lvol1
     /dev/vg16/lvol1
     /dev/vg18/lvol1
     /dev/vg10/lvol1
     /dev/vg03/lvol1
  /dev/vg19/rlvol1:file system is clean - log replay is not required
  ...
  May 27 16:58:05 - Node "NODE2": Mounting /dev/vg19/lvol1 at
  /export/usr/sap/trans
  ...


       ########### Node "NODE2": Package start completed at Fri
  May 27 16:58:05 wib 2011 ###########
  /etc/cmcluster/RP1/dbciRP1.control.script[369]: - o
  largefiles,delaylog, nodatainlog:  not found.
  ...
  /etc/cmcluster/RP1/dbciRP1.control.script[383]: - o
  largefiles,delaylog, nodatainlog:  not found.
[unformatted]

7) Check bdf on NODE2 - we get the Stale NFS error:
[formatted]
  # date
  Fri May 27 16:58:06 wib 2011

  # uname -a
  HP-UX NODE2 B.11.31 U ia64 1843445582 unlimited-user license

  # bdf
  Filesystem           kbytes used   avail %used Mounted on
  ..snap..
  bdf: /usr/sap/trans: Stale NFS file handle
  bdf: /sapmnt/RP1: Stale NFS file handle
[unformatted]

8) Go back to the first node and check bdf - the Stale NFS message is also seen there:

[formatted]
  # uname -a
  HP-UX NODE1 B.11.31 U ia64 2515239678 unlimited-user license

  # bdf
  Filesystem           kbytes used   avail %used Mounted on
  ..snap..
  bdf: /usr/sap/trans: Stale NFS file handle
  bdf: /sapmnt/RP1: Stale NFS file handle

  # date
  Fri May 27 16:58:23 wib 2011
[unformatted]
Solution
Running ll /dev/vg*/* on both nodes reveals that the major and minor numbers of the shared volume groups' device files are not the same on both nodes.
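
For illustration, a mismatched pair of group files might look like this (the minor numbers 0x0b0000 and 0x0e0000 are hypothetical):

[formatted]
  NODE1# ll /dev/vg11/group
  crw-r--r--   1 root   sys   64 0x0b0000 May 27  2011 /dev/vg11/group

  NODE2# ll /dev/vg11/group
  crw-r--r--   1 root   sys   64 0x0e0000 May 27  2011 /dev/vg11/group
[unformatted]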

The names of the volume groups must be unique within the cluster, and the major and minor numbers associated with the volume groups must be the same on all nodes. In addition, the mount points and exported file system names must be the same on all nodes.
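
One way to check all the group files at once is to compare them from a single node (a sketch; the volume group list comes from the package logs above, and remsh access from NODE1 to NODE2 is assumed):

[formatted]
  # for vg in vg02 vg03 vg07 vg08 vg09 vg10 vg11 vg12 vg13 vg14 vg15 vg16 vg17 vg18 vg19
  > do
  >   echo "== $vg =="
  >   ll /dev/$vg/group
  >   remsh NODE2 ll /dev/$vg/group
  > done
[unformatted]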

The preceding requirements exist because NFS uses the major number, minor number, inode number, and exported directory as part of a file handle to uniquely identify each NFS file. If differences exist between the primary and adoptive nodes, the client's file handle would no longer point to the correct file location after movement of the package to a different node.

It is recommended that filesystems used for NFS be created as journaled file systems (FStype vxfs). This ensures the fastest recovery time in the event of a package switch to another node.
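
As a sketch, creating one of this package's shared filesystems as VxFS could look like the following (device and mount point names are taken from the logs above; the mount options mirror the ones the control script uses):

[formatted]
  # newfs -F vxfs -o largefiles /dev/vg19/rlvol1
  # mount -F vxfs -o delaylog,nodatainlog /dev/vg19/lvol1 /export/usr/sap/trans
[unformatted]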

We fixed the issue by re-creating the volume groups on the adoptive node with vgexport/vgimport so that the device numbers match on both nodes.
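
A minimal sketch of that procedure for one volume group, assuming NODE1 holds the correct device numbers and that 0x0b0000 is the minor number reported by ll /dev/vg11/group there (halt the package first, then repeat per volume group):

[formatted]
  NODE1# vgexport -p -s -m /tmp/vg11.map vg11   # preview only: writes the map file
  NODE1# rcp /tmp/vg11.map NODE2:/tmp/vg11.map

  NODE2# vgexport vg11                          # drop the mismatched VG definition
  NODE2# mkdir /dev/vg11
  NODE2# mknod /dev/vg11/group c 64 0x0b0000    # same major/minor as on NODE1
  NODE2# vgimport -s -m /tmp/vg11.map vg11      # rebuild the VG from the map file
[unformatted]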

Labels: vgexport, vgimport, cmviewcl, NFS stale handle
