UnixPedia : HPUX / LINUX / SOLARIS: February 2014

Saturday, February 22, 2014

UX:vxfs fsck: WARNING: V-3-20837: FILE SYSTEM HAD I/O ERROR(S) ON USER DATA.

On system tiger (roar), a file system was reported with I/O errors.
As a result the DB crashed, since the corrupted FS holds the database binaries.
The UNIX team ran fsck with a full check on the FS; the results were negative. It looked like
there was no way it could be fixed.

Question: what/how/why caused this situation (corruption) of the FS on the server?
Answer: At first look it seemed that either someone had damaged the FS, or a SAN disk carrying
the user data had been removed (we were in real trouble that day).

In the hope of fixing the issue I tried running the fsck commands below (the two runs differ only by the -y option), hoping one of them would repair the FS. If everything always worked as expected, the world would be a different place.

The production DB was down for more than 3 hours; my comment about FS corruption hit the panic
button, and the team escalated it to a critical issue.

[root@tiger:/.root]#
#-> fsck -F vxfs -o full /dev/bc_vg_odd/lv_m01oracle
UX:vxfs fsck: WARNING: V-3-20837: file system had I/O error(s) on user data.
log replay in progress
UX:vxfs fsck: ERROR: V-3-25433: fsck write failure devid = 0, bno = 135153328, off = 0, len = 8192
full file system check required, exiting ...
[root@tiger:/.root]#
#-> fsck -F vxfs -y -o full /dev/bc_vg_odd/lv_m01oracle
UX:vxfs fsck: WARNING: V-3-20837: file system had I/O error(s) on user data.
log replay in progress
UX:vxfs fsck: ERROR: V-3-25433: fsck write failure devid = 0, bno = 135153328, off = 0, len = 8192
full file system check required, exiting ...


The VG shows one disk fewer in the active disk count than in the current disk count. I named the missing disk the
"ghost disk" (ghosts exist on servers too - if zombie processes can, why not ghosts?). The ghost disk's logical extents (LEs) are part of the corrupted FS, and it seems today's data growth landed on those extents.

Recommendation: there is no way the data on the LV is going to be recovered, so the decision was made to recreate the LV and restore the data from the BCV copy (when every option is closed, look for the light). The server runs a midnight BCV backup script, which has kept the data in good shape.
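
A minimal sketch of that rebuild, assuming the FS is already unmounted and the faulty PV has been dealt with; the mount point and LV size below are placeholders, not taken from the incident:

#-> umount /m01oracle                                  # placeholder mount point
#-> lvremove /dev/bc_vg_odd/lv_m01oracle               # drop the corrupted LV
#-> lvcreate -L 102400 -n lv_m01oracle bc_vg_odd       # recreate it (size in MB is a placeholder)
#-> newfs -F vxfs /dev/bc_vg_odd/rlv_m01oracle         # build a fresh VxFS on the raw LV device
#-> mount -F vxfs /dev/bc_vg_odd/lv_m01oracle /m01oracle
# finally, restore the Oracle binaries/data from the BCV copy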

What is the hard part of the solution? Explaining the story of the corruption. Yes.

Thursday, February 13, 2014

BUSINESS COPY ISSUE: I/O ERROR

Business Copy issue: the client reported that file systems were missing from the system and that backups had not run for 2 days.
At first look it was clear that a resync had happened without the VG being exported, leaving the disks in an unavailable
state.
Since a resync overwrites the secondary copy with the metadata structure of the primary copy, it is understandable that the kernel
marked the PV status as unavailable on the SVOL server.
And the file systems were reported with I/O errors.
Below is screen output :
[FP9](Mercury)/u01/app/oracle/admin/FP9/diag/rdbms/FP9mop/FP9/trace-> bdf
ftp.htp.com:/var/opt/ignite/clients
                   56426496 16764759 37186635   31% /var/opt/ignite/recovery/client_mnt
bdf: /oracle/admin/FP9: I/O error
bdf: /oracle/oradata02/FP9: I/O error
/dev/vg_FP6_BKP/lv_oracleFP6_sapdata1
                   523239424 97319432 422593120   19% /oracle/FP6/sapdata1
/dev/vg_FP6_BKP/lv_oracleFP6_sapdata2

[FP9](Mercury)/u01/app/oracle/admin/FP9/diag/rdbms/FP9mop/FP9/trace-> df -P | grep FP9
df: cannot determine file system statistics for /dev/vg_FP9_BKP/lv_oracleoradata02_FP9
df: cannot determine file system statistics for /dev/vg_FP9_BKP/lv_oracleadmin_FP9

Solution :
To resolve the issue (see the command sketch below):
# umount the file systems that are showing I/O errors (in some cases a reboot may be required)
# export the VG manually
# resync the PVOL and SVOL
# split the pair and vgimport again to bring the file systems back
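
A hedged sketch of that sequence on the SVOL (backup) host, assuming RAID Manager (HORCM) controls the BCV pairs; the device group name dg_FP9, the map file path, the group-file minor number and the PV paths are placeholders, not taken from the incident:

#-> umount /oracle/admin/FP9
#-> umount /oracle/oradata02/FP9
#-> vgexport -v -m /etc/lvmconf/vg_FP9_BKP.map vg_FP9_BKP               # export the stale VG definition, keeping a map file
#-> pairresync -g dg_FP9                                                # resync PVOL -> SVOL (placeholder device group)
#-> pairsplit -g dg_FP9                                                 # split again once the pair reaches PAIR state
#-> mkdir /dev/vg_FP9_BKP; mknod /dev/vg_FP9_BKP/group c 64 0x010000    # recreate the group file (minor number is a placeholder)
#-> vgimport -v -m /etc/lvmconf/vg_FP9_BKP.map vg_FP9_BKP /dev/dsk/cXtYdZ   # PV path is a placeholder
#-> vgchange -a y vg_FP9_BKP
#-> mount /oracle/admin/FP9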

Wednesday, February 12, 2014

VCS : dg-fail-policy: dgdisable in Storage Foundation Cluster Volume Manager (SFRAC/SFCFS)



When using Storage Foundation Cluster Volume Manager (SFRAC/SFCFS) with shared disk groups of version 120 or higher, disk groups carry an attribute called dgfailpolicy. This attribute determines how a node should react if it loses access to the disks in the corresponding disk group. If shared disk groups are left at the default dgfailpolicy of dgdisable, a cluster-wide panic can ensue and/or the database can halt cluster-wide, should the Cluster Volume Manager (CVM) master lose connectivity to storage. To avoid this behavior, dgfailpolicy should be set to leave for shared disk groups.

From your node
#-> uname -a
HP-UX mickey B.11.23 U ia64 2937989941 unlimited-user license
[root@mickey:/.root]#
#->
[root@mickey:/.root]#
#-> vxdg list
NAME         STATE           ID
localdg02    enabled,cds          1152277302.85.mickey
csrcppdg01   enabled,shared,cds   1152539575.189.donald
csrlindg01   enabled,shared,cds   1232465597.394.donald
racdg01      enabled,shared,cds   1152739766.65.mickey
sptrpdg01    enabled,shared,cds   1152290440.161.donald
tempdg       enabled,shared,cds   1193249181.293.donald
totalpdg01   enabled,shared,cds   1177157791.111.mickey
[root@mickey:/.root]#
#-> vxdg list csrcppdg01
Group:     csrcppdg01
dgid:      1152539575.189.donald
import-id: 33792.478
flags:     shared cds
version:   120
alignment: 8192 (bytes)
local-activation: shared-write
cluster-actv-modes: donald=sw mickey=sw
ssb:            on
detach-policy: global
dg-fail-policy: dgdisable   <--- currently set to the default, i.e. dgdisable
copies:    nconfig=default nlog=default
config:    seqno=0.1774 permlen=0 free=0 templen=0 loglen=0
[root@mickey:/.root]#
#-> vxdg list csrlindg01
Group:     csrlindg01
dgid:      1232465597.394.donald
import-id: 33792.480
flags:     shared cds
version:   120
alignment: 8192 (bytes)
local-activation: shared-write
cluster-actv-modes: donald=sw mickey=sw
ssb:            on
detach-policy: global
dg-fail-policy: dgdisable   <--- currently set to the default, i.e. dgdisable
copies:    nconfig=default nlog=default
config:    seqno=0.5260 permlen=0 free=0 templen=0 loglen=0
[root@mickey:/.root]#
#-> vxdg list racdg01
Group:     racdg01
dgid:      1152739766.65.mickey
import-id: 33792.484
flags:     shared cds
version:   120
alignment: 8192 (bytes)
local-activation: shared-write
cluster-actv-modes: donald=sw mickey=sw
ssb:            on
detach-policy: global
dg-fail-policy: dgdisable   <--- currently set to the default, i.e. dgdisable
copies:    nconfig=default nlog=default
config:    seqno=0.1201 permlen=0 free=0 templen=0 loglen=0
[root@mickey:/.root]#
#-> vxdg list sptrpdg01
Group:     sptrpdg01
dgid:      1152290440.161.donald
import-id: 33792.474
flags:     shared cds
version:   120
alignment: 8192 (bytes)
local-activation: shared-write
cluster-actv-modes: donald=sw mickey=sw
ssb:            on
detach-policy: global
dg-fail-policy: dgdisable   <--- currently set to the default, i.e. dgdisable
copies:    nconfig=default nlog=default
config:    seqno=0.22839 permlen=0 free=0 templen=0 loglen=0
[root@mickey:/.root]#
#-> vxdg list tempdg
Group:     tempdg
dgid:      1193249181.293.donald
import-id: 33792.482
flags:     shared cds
version:   120
alignment: 8192 (bytes)
local-activation: off
cluster-actv-modes: donald=sw mickey=off
ssb:            on
detach-policy: global
dg-fail-policy: dgdisable   <--- currently set to the default, i.e. dgdisable
copies:    nconfig=default nlog=default
config:    seqno=0.1153 permlen=0 free=0 templen=0 loglen=0
[root@mickey:/.root]#
#-> vxdg list totalpdg01
Group:     totalpdg01
dgid:      1177157791.111.mickey
import-id: 33792.476
flags:     shared cds
version:   120
alignment: 8192 (bytes)
local-activation: shared-write
cluster-actv-modes: donald=sw mickey=sw
ssb:            on
detach-policy: global
dg-fail-policy: dgdisable   <--- currently set to the default, i.e. dgdisable
copies:    nconfig=default nlog=default
config:    seqno=0.4742 permlen=0 free=0 templen=0 loglen=0
[root@mickey:/.root]#

Cause:

In a CVM RAC environment where a shared disk group is using a dgfailpolicy of dgdisable, should the master lose connectivity to all disks in the disk group, the master will disable the disk group (dgdisable). As this is a CVM environment the disk group is also disabled across all slave nodes (as all nodes must have a consistent view of the configuration as seen by the master).

Once a disk group is dgdisabled any new opens against volumes in that disk group will fail. Some examples of when opens are attempted are:

- When a volume containing a file system is mounted

- When an I/O is attempted against a raw volume device

This scenario can have potentially severe implications. For example, if using Oracle RAC with vote devices on raw volumes, as soon as the corresponding disk group is dgdisabled cluster-wide, all nodes become unable to perform I/O to the vote disks, meaning they can no longer heartbeat. As a result, all nodes will be panicked by Oracle Cluster Ready Services (CRS), causing a cluster-wide loss of service.


Solution:

To avoid this issue, all shared disk groups of version 120 and higher should be set to use a dgfailpolicy of leave. Once set, should the master lose connectivity to disks in the disk group, it will panic and leave the cluster rather than disabling the disk group cluster-wide. This allows one of the surviving slave nodes to take over the master role and, assuming the new master has no issues with connectivity to storage, allows the surviving members of the cluster to continue to function as normal.

vxdg -g <diskgroup> set dgfailpolicy=leave

This policy setting persists across reboots.
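
A hedged example of applying and verifying the change for one of the disk groups listed above (racdg01), typically run from the CVM master node; the second line shows the expected result:

#-> vxdg -g racdg01 set dgfailpolicy=leave
#-> vxdg list racdg01 | grep dg-fail-policy
dg-fail-policy: leave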

=================

In SFCFS 6.0 and later, the dgfail_policy is obsolete. From the SFCFS 6.0 release notes:

Availability of shared disk group configuration copies

If the Cluster Volume Manager (CVM) master node loses access to a configuration
copy, CVM redirects the read or write requests over the network to another node
that has connectivity to the configuration copy. This behavior ensures that the
disk group stays available.
In previous releases, CVM handled disconnectivity according to the disk group
failure policy (dgfail_policy). This behavior still applies if the disk group version
is less than 170. The dgfail_policy is not applicable to disk groups with a version
of 170 or later.
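
A hedged way to check whether dgfail_policy still applies to a given disk group is to look at its version; anything below 170 is still governed by the policy. vxdg upgrade is the standard VxVM command to raise a disk group version, though it should only be run once all cluster nodes are on a release that supports the target version:

#-> vxdg list racdg01 | grep version      # below 170: dgfail_policy still applies
version:   120
#-> vxdg upgrade racdg01                  # raise the disk group to the highest supported version (not reversible)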