UnixPedia : HPUX / LINUX / SOLARIS: HPUX : HP Serviceguard NFS Toolkit - "Stale NFS file handle" when Package Failed Over

HP Serviceguard NFS Toolkit - "Stale NFS file handle" when Package Failed Over

Overview

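After an HP Serviceguard NFS Toolkit package fails over to another node, NFS clients get "Stale NFS file handle" errors on the automounted file systems. This note walks through the scenario and the fix.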

Procedures

Issue

After a package failover, accessing the automounted NFS mount points returns "Stale NFS file handle".

We had already upgraded ONCplus from version B.11.31.10 to B.11.31.11, which ruled out the possibility that this issue is related to QXCR1001067886, titled "After an NFS package re-start, NFS client may not be able to access the NFS mount due to stale file handle error (ESTALE)". Also make sure all network patches are at the latest level.

Scenario:

1) Log in to NODE1 and make sure the package is up:

[formatted]

# date

Fri May 27 16:56:50 wib 2011

# cmviewcl

CLUSTER STATUS

cluster_ecc up

NODE STATUS STATE

NODE1 up running

PACKAGE STATUS STATE AUTO_RUN NODE

dbciRP1 up running enabled NODE1

NODE STATUS STATE

NODE2 up running

[unformatted]
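When the same check has to be repeated before and after each failover test, it can help to pull the owning node out of the cmviewcl listing instead of reading it by eye. The following is a minimal sketch, assuming the package line has the five columns shown above (PACKAGE, STATUS, STATE, AUTO_RUN, NODE); the function name pkg_node is made up for this example.

```shell
#!/bin/sh
# pkg_node PACKAGE: read `cmviewcl` output on stdin and print the NODE
# column of the line whose first column is PACKAGE.
# Assumption: the package line has at least five whitespace-separated
# columns, as in the listing above. Prints nothing if the package is absent.
pkg_node() {
    awk -v pkg="$1" '$1 == pkg && NF >= 5 { print $5 }'
}

# On a cluster node this would be used as:  cmviewcl | pkg_node dbciRP1
# Demo against the package line from the listing above (prints NODE1):
printf '%s\n' 'dbciRP1 up running enabled NODE1' | pkg_node dbciRP1
```

For a halted package the same function prints "unowned", because that is what cmviewcl shows in the NODE column.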

2) Make sure the automount NFS mountpoints are accessible:

[formatted]

# date

Fri May 27 16:56:58 wib 2011

# bdf

Filesystem kbytes used avail %used Mounted on

..snap..

NODE1:/export/usr/sap/trans

42958848 77173 40201578 0% /usr/sap/trans

NODE1:/export/sapmnt/RP1

14647296 1327971 12487061 10% /sapmnt/RP1

[unformatted]

3) Log in to the other host:

[formatted]

# date

Fri May 27 16:57:13 wib 2011

# rlogin NODE2

# uname -a

HP-UX NODE2 B.11.31 U ia64 1843445582 unlimited-user license

[unformatted]

3b) Make sure the automount NFS mountpoints are also accessible on this host:

[formatted]

# bdf

Filesystem kbytes used avail %used Mounted on

..snap..

NODE1:/export/usr/sap/trans

42958848 77173 40201578 0% /usr/sap/trans

NODE1:/export/sapmnt/RP1

14647296 1327971 12487061 10% /sapmnt/RP1

[unformatted]

4) Halt the package from NODE2:

[formatted]

# date

Fri May 27 16:57:36 wib 2011

# uname -a

HP-UX NODE2 B.11.31 U ia64 1843445582 unlimited-user license

# cmhaltpkg dbciRP1

Disabling automatic failover for failover packages to be halted.

Halting package dbciRP1

Successfully halted package dbciRP1

One or more packages or package instances have been halted.

The failover packages have AUTO_RUN disabled and no new instance can

start automatically. To allow automatic start, enable AUTO_RUN via

cmmodpkg -e <package_name>

cmhaltpkg: Completed successfully on all packages specified

* syslog NODE2

May 27 16:57:41 syslog: cmhaltpkg dbciRP1

May 27 16:57:42 syslog: Request from root on node NODE2

to halt package dbciRP1

May 27 16:57:42 cmcld[9295]: Request from root on node

NODE2 to halt package dbciRP1

May 27 16:57:42 cmcld[9295]: Request from node NODE1 to

disable global switching for package dbciRP1.

May 27 16:57:45 cmcld[9295]: (NODE1) Halted package

dbciRP1 on node NODE1.

* syslog NODE1

May 27 16:57:43 cmcld[8177]: Request from node NODE1 to

disable global switching for package dbciRP1.

May 27 16:57:43 cmcld[8177]: Disabled switching for package

dbciRP1.

May 27 16:57:43 cmcld[8177]: Request from node NODE1 to

begin the halting process for package dbciRP1 on node NODE1.

May 27 16:57:43 cmcld[8177]: Halting package dbciRP1 on

node NODE1 as requested by user.

May 27 16:57:43 cmcld[8177]: Request from node NODE1 to

halt package dbciRP1 on node NODE1.

May 27 16:57:43 cmcld[8177]: Executing

'/etc/cmcluster/RP1/dbciRP1.control.script stop' for package dbciRP1,

as service PKG*78849.

May 27 16:57:43 cmserviced[8181]: Request to stop package

dbciRP1

May 27 16:57:43 syslog: cmmodnet -r -i 10.170.12.3

10.170.12.0

May 27 16:57:47 LVM[8907]: vgchange -a n vg11

May 27 16:57:47 LVM[8911]: vgchange -a n vg12

May 27 16:57:47 LVM[8915]: vgchange -a n vg19

May 27 16:57:47 LVM[8919]: vgchange -a n vg02

May 27 16:57:47 LVM[8923]: vgchange -a n vg07

May 27 16:57:47 LVM[8927]: vgchange -a n vg13

May 27 16:57:47 LVM[8931]: vgchange -a n vg14

May 27 16:57:47 LVM[8935]: vgchange -a n vg08

May 27 16:57:47 LVM[8939]: vgchange -a n vg17

May 27 16:57:47 LVM[8943]: vgchange -a n vg15

May 27 16:57:47 LVM[8947]: vgchange -a n vg09

May 27 16:57:47 LVM[8951]: vgchange -a n vg16

May 27 16:57:47 LVM[8955]: vgchange -a n vg18

May 27 16:57:47 LVM[8959]: vgchange -a n vg10

May 27 16:57:47 LVM[8963]: vgchange -a n vg03

May 27 16:57:47 cmserviced[8181]: Package Script for

dbciRP1 completed successfully with an exit(0).

May 27 16:57:47 cmcld[8177]: Halted package dbciRP1 on node

NODE1.

* dbciRP1.control.script.log of NODE2

########### Node "NODE2": Halting package at Fri May 27

16:56:10 wib 2011 ###########

May 27 16:56:10 - Node "NODE2": Remove IP address 10.170.12.3

from subnet 10.170.12.0

HANFS -- May 27 16:56:10 - Node "NODE2": Unexporting filesystem

on /export/sapmnt/RP1

HANFS -- May 27 16:56:10 - Node "NODE2": Unexporting filesystem

on /export/usr/sap/trans

HANFS -- May 27 16:56:10 - Node "NODE2": Killing rpc.statd

HANFS -- May 27 16:56:10 - Node "NODE2": Killing rpc.lockd

HANFS -- May 27 16:56:10 - Node "NODE2": Restarting rpc.statd

HANFS -- May 27 16:56:11 - Node "NODE2": Restarting rpc.lockd

May 27 16:56:12 - Node "NODE2": Unmounting filesystem on

/dev/vg10/lvol1

...

May 27 16:56:14 - Node "NODE2": Deactivating volume group vg03

Deactivated volume group in Exclusive Mode.

Volume group "vg03" has been successfully changed.

########### Node "NODE2": Package halt completed at Fri

May 27 16:56:14 wib 2011 ###########

/etc/cmcluster/RP1/dbciRP1.control.script[369]: - o

largefiles,delaylog, nodatainlog: not found.

...

/etc/cmcluster/RP1/dbciRP1.control.script[383]: - o

largefiles,delaylog, nodatainlog: not found.

[unformatted]

5) Now run the package on NODE2:

[formatted]

# date

Fri May 27 16:57:47 wib 2011

# uname -a

HP-UX NODE2 B.11.31 U ia64 1843445582 unlimited-user license

# cmviewcl

CLUSTER STATUS

cluster_ecc up

NODE STATUS STATE

NODE1 up running

NODE2 up running

UNOWNED_PACKAGES

PACKAGE STATUS STATE AUTO_RUN NODE

dbciRP1 down halted disabled unowned

# cmrunpkg dbciRP1

Running package dbciRP1 on node NODE2

Successfully started package dbciRP1 on node NODE2

cmrunpkg: All specified packages are running

* dbciRP1.control.script.log of NODE2

########### Node "NODE2": Starting package at Fri May 27

16:58:04 wib 2011 ###########

May 27 16:58:04 - Node "NODE2": Activating volume group vg11 with

exclusive option.

Activated volume group in Exclusive Mode.

Volume group "vg11" has been successfully changed.

...

May 27 16:58:04 - Node "NODE2": Checking filesystems:

/dev/vg19/lvol1

/dev/vg02/lvol1

/dev/vg07/lvol1

/dev/vg11/lvol1

/dev/vg12/lvol1

/dev/vg13/lvol1

/dev/vg14/lvol1

/dev/vg08/lvol1

/dev/vg17/lvol1

/dev/vg15/lvol1

/dev/vg09/lvol1

/dev/vg16/lvol1

/dev/vg18/lvol1

/dev/vg10/lvol1

/dev/vg03/lvol1

/dev/vg19/rlvol1:file system is clean - log replay is not required

...

May 27 16:58:05 - Node "NODE2": Mounting /dev/vg19/lvol1 at

/export/usr/sap/trans

...

########### Node "NODE2": Package start completed at Fri

May 27 16:58:05 wib 2011 ###########

/etc/cmcluster/RP1/dbciRP1.control.script[369]: - o

largefiles,delaylog, nodatainlog: not found.

...

/etc/cmcluster/RP1/dbciRP1.control.script[383]: - o

largefiles,delaylog, nodatainlog: not found.

[unformatted]

6) Check bdf; it now reports the Stale NFS file handle error:

[formatted]

# date

Fri May 27 16:58:06 wib 2011

# uname -a

HP-UX NODE2 B.11.31 U ia64 1843445582 unlimited-user license

# bdf

Filesystem kbytes used avail %used Mounted on

..snap..

bdf: /usr/sap/trans: Stale NFS file handle

bdf: /sapmnt/RP1: Stale NFS file handle

[unformatted]

7) Go back to the first node and check bdf; the Stale NFS file handle message is seen there as well:

[formatted]

# uname -a

HP-UX NODE1 B.11.31 U ia64 2515239678 unlimited-user license

# bdf

Filesystem kbytes used avail %used Mounted on

..snap..

bdf: /usr/sap/trans: Stale NFS file handle

bdf: /sapmnt/RP1: Stale NFS file handle

# date

Fri May 27 16:58:23 wib 2011

[unformatted]
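To check several clients quickly after a failover test, the affected mount points can be filtered out of the bdf output. This is a minimal sketch, assuming the error lines look exactly like the ones in the listings above and that stderr is merged into stdout; the function name stale_mounts is made up for this example.

```shell
#!/bin/sh
# stale_mounts: read `bdf` output on stdin and print every mount point
# that bdf reported with "Stale NFS file handle", one per line.
# Assumption: error lines have the form "bdf: <mount point>: Stale NFS file handle".
stale_mounts() {
    sed -n 's/^bdf: \(.*\): Stale NFS file handle$/\1/p'
}

# On a client this would be used as:  bdf 2>&1 | stale_mounts
# Demo against the two error lines from the listing above:
printf '%s\n' \
    'bdf: /usr/sap/trans: Stale NFS file handle' \
    'bdf: /sapmnt/RP1: Stale NFS file handle' | stale_mounts
```

No output means bdf reported no stale handles.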

Solution

Running ll /dev/vg*/* on both nodes reveals that the major and minor numbers of the shared volumes are not the same on the two nodes.

The names of the volume groups must be unique within the cluster, and the major and minor numbers associated with the volume groups must be the same on all nodes. In addition, the mount points and exported file system names must be the same on all nodes.

The preceding requirements exist because NFS uses the major number, minor number, inode number, and exported directory as part of a file handle to uniquely identify each NFS file. If differences exist between the primary and adoptive nodes, the client's file handle would no longer point to the correct file location after movement of the package to a different node.
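The comparison of device numbers between the nodes can be scripted once the ll output of each node has been saved to a file. The following is a minimal sketch, assuming the usual HP-UX long-listing layout in which device files show the major number in field 5 and the minor number in field 6. The function name and the sample listing lines are hypothetical, not taken from the affected systems.

```shell
#!/bin/sh
# devnum_mismatches FILE1 FILE2: compare two saved `ll /dev/vg*/*`
# listings (one per node) and print every device file whose major/minor
# pair differs or that exists in only one listing.
# Assumption: device lines start with b or c, field 5 is the major
# number, field 6 the minor number, and the last field is the path.
devnum_mismatches() {
    awk '
        /^[bc]/ {
            dev = $NF; nums = $5 " " $6
            if (NR == FNR) { first[dev] = nums; next }   # first listing
            seen[dev] = 1
            if (!(dev in first))         print dev ": missing from first listing"
            else if (first[dev] != nums) print dev ": " first[dev] " vs " nums
        }
        END {
            for (dev in first)
                if (!(dev in seen)) print dev ": missing from second listing"
        }' "$1" "$2"
}

# Demo with hypothetical listings saved from each node:
n1=$(mktemp); n2=$(mktemp)
printf '%s\n' 'crw-r----- 1 root sys 64 0x0b0000 May 20 2011 /dev/vg11/group' > "$n1"
printf '%s\n' 'crw-r----- 1 root sys 64 0x050000 May 20 2011 /dev/vg11/group' > "$n2"
devnum_mismatches "$n1" "$n2"
rm -f "$n1" "$n2"
```

No output means the two listings agree. Any line printed points at a volume group that would produce different file handles on the two nodes.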

It is recommended that filesystems used for NFS be created as journaled file systems (FStype vxfs). This ensures the fastest recovery time in the event of a package switch to another node.

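The issue was fixed by re-importing the affected volume groups with vgexport/vgimport so that their major and minor numbers match on both nodes.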
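The post does not list the exact commands used, so the following is only a sketch of the usual vgexport/vgimport sequence for giving a volume group the same group-file minor number on the adoptive node. It prints the commands rather than running them. The volume group name, the minor number 0x0b0000, the map file path and the function name are all hypothetical, and major number 64 assumes an LVM 1.0 volume group. The package must not be running on the node where the volume group is re-imported.

```shell
#!/bin/sh
# reimport_plan VG MINOR: print (do not execute) a typical command
# sequence for re-creating volume group VG on the adoptive node with
# group-file minor number MINOR.
# Assumptions: LVM 1.0 group file (major number 64); map file kept in /tmp.
reimport_plan() {
    vg=$1
    minor=$2
    map=/tmp/$vg.map
    cat <<EOF
# On the primary node: preview export, only writes the map file
vgexport -p -s -m $map /dev/$vg
# Copy $map to the adoptive node, then on the adoptive node:
vgexport /dev/$vg
mkdir /dev/$vg
mknod /dev/$vg/group c 64 $minor
vgimport -s -m $map /dev/$vg
EOF
}

# Demo with a hypothetical volume group and minor number:
reimport_plan vg11 0x0b0000
```

After the re-import, ll /dev/vg*/* should show the same major and minor numbers on both nodes. Review the printed commands against the actual configuration before running any of them.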


Wednesday, April 9, 2014
