HP Serviceguard for HP-UX - LAN Interfaces: Serviceguard IP Address Not Failing Back to the Primary LAN when the LAN Card Is Up and cmmodnet -e Shows the LAN Already Enabled
Overview
The Serviceguard IP address does not fail back to the primary LAN even though the LAN card is up, and cmmodnet -e reports that the LAN is already enabled.
Procedures
The cmviewcl command shows the status of lan3 in down state:
# cmviewcl -v
CLUSTER STATUS
Cluster1 up
NODE STATUS STATE
node1 up running
Cluster_Lock_LVM:
VOLUME_GROUP PHYSICAL_VOLUME STATUS
/dev/vglockdisk /dev/disk/disk50 up
Network_Parameters:
INTERFACE STATUS PATH NAME
PRIMARY down 7/0/9/1/0/6/0 lan3 <--
PRIMARY up 4/0/6/1/0/6/0 lan0
PRIMARY up 7/0/0/1/0/6/0 lan2
STANDBY up 6/0/14/1/0/6/0 lan1
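A quick way to spot such interfaces is to filter the cmviewcl output for a "down" status. A minimal sketch; the sample variable stands in for live "cmviewcl -v" output, and the awk only assumes the Network_Parameters column layout shown above:

```shell
# Filter Serviceguard's view for down interfaces. The sample variable
# stands in for live "cmviewcl -v" output (Network_Parameters section);
# on a real node you would pipe cmviewcl -v into the same awk.
cmviewcl_output='PRIMARY down 7/0/9/1/0/6/0 lan3
PRIMARY up 4/0/6/1/0/6/0 lan0
PRIMARY up 7/0/0/1/0/6/0 lan2
STANDBY up 6/0/14/1/0/6/0 lan1'

printf '%s\n' "$cmviewcl_output" |
  awk '$2 == "down" { print $4 " (" $1 ") is down" }'
# prints: lan3 (PRIMARY) is down
```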
The nwmgr command, however, shows the interface state as UP:
# /usr/sbin/nwmgr
Name/          Interface Station        Sub-    Interface      Related
ClassInstance  State     Address        system  Type           Interface
============== ========= ============== ======= ============== =========
lan0           UP        0x0016353E5444 igelan  1000Base-T
lan1           UP        0x0016353E353C igelan  1000Base-T
lan2           UP        0x0016353E249D igelan  1000Base-T
lan3           UP        0x0016353E04CC igelan  1000Base-T     <-- is up
The lanadmin display likewise shows both the Administration and Operational Status as up:
LAN INTERFACE STATUS DISPLAY
Wed, Jan 23, 2013 13:04:59
PPA Number = 3 <--
Description = lan3 HP PCI-X 1000Base-T Release B.11.31.1112
Type (value) = ethernet-csmacd(6)
MTU Size = 1500
Speed = 1000000000
Station Address = 0x16353e04cc
Administration Status (value) = up(1) <--
Operation Status (value) = up(1)
Cause
Known defect. See below.
Answer/Solution
After a failure of some sort, lan3 switched to the standby lan0 but then did not switch back automatically even when it was up again, although the Serviceguard configuration was set to auto-failback. When the user noticed that lan3 appeared to be up, they attempted to move it back manually using cmmodnet -e lan3, which failed.
This is the general pattern: the primary LAN fails in a Serviceguard environment, and Serviceguard fails it over to the standby LAN. The card then appears to be UP, so the user attempts a manual cmmodnet -e, which reports that the LAN is already enabled. Here is a snippet from syslog.log:
Feb 14 15:42:22 node4 cmnetd[2479]: lan1 is down at the data link layer.
Feb 14 15:42:22 node4 cmnetd[2479]: lan1 failed.
Feb 14 15:42:22 node4 cmnetd[2479]: lan1 switching to lan0
Feb 14 15:42:22 node4 cmnetd[2479]: Subnet 192.168.168.0 switching from lan1 to lan0
Feb 14 15:42:22 node4 cmnetd[2479]: Subnet 192.168.168.0 switched from lan1 to lan0
Feb 14 15:42:22 node4 cmnetd[2479]: lan1 switched to lan0
Feb 14 15:50:00 node4 syslog: cmmodnet -e lan1
Feb 14 15:50:00 node4 cmnetd[2479]: Request to enable interface lan1
Feb 14 15:50:00 node4 cmnetd[2479]: Attempt to enable network interface lan1 when it is already enabled.
Feb 14 15:50:49 node4 su: + 1 sa367701-root
Feb 14 15:52:18 node4 cmnetd[2479]: DLPI error 4, unix error 6 sending to 0017a4a4b552aa080009167e
Feb 14 15:52:18 node4 cmnetd[2479]: Failed to send on lan0 (1).
Feb 14 15:52:18 node4 cmnetd[2479]: DLPI error 4, unix error 6 sending to 001f296e6ee0aa080009167e
Feb 14 15:52:18 node4 cmnetd[2479]: DLPI error 4, unix error 6 sending to 001f296e6e84aa080009167e
Feb 14 15:52:18 node4 cmnetd[2479]: Failed to send on lan0 (1).
Feb 14 15:52:20 node4 cmnetd[2479]: lan0 is down at the data link layer.
Feb 14 15:52:20 node4 cmnetd[2479]: lan0 failed.
Feb 14 15:52:20 node4 cmnetd[2479]: Subnet 192.168.168.0 down
Feb 14 15:55:40 node4 cmcld[2469]: Request from root on node node4 to halt the cluster on this node
Feb 14 15:55:40 node4 cmcld[2469]: Request from node node4 to disable node switching for package package1 on node node4.
Feb 14 15:55:40 node4 cmcld[2469]: Disabled package package1 on node node4.
Feb 14 15:55:43 node4 cmcld[2469]: Member node4 halting.
Feb 14 15:55:43 node4 cmcld[2469]: Closing route 192.168.170.228:5300 on fd39 to package1n2: Software caused connection abort
Feb 14 15:55:43 node4 cmcld[2469]: Closing route 10.10.50.42:5300 on fd 40 to package1n2: Software caused connection abort
Feb 14 15:55:43 node4 cmcld[2469]: Setting gmsg transport state to ERROR (from READY)
Feb 14 15:55:43 node4 cmcld[2469]: Membership: membership at 1 is HALTED (coordinator 1) includes: 1 2 excludes:
Feb 14 15:55:43 node4 cmnetd[2479]: Subnet 192.168.168.0 switching from lan0 to lan1
Feb 14 15:55:43 node4 cmnetd[2479]: Subnet 192.168.168.0 switched from lan0 to lan1
Feb 14 15:55:43 node4 cmnetd[2479]: lan0 switched to lan1
Feb 14 15:55:43 node4 cmserviced[2474]: Service cmnetd completed successfully with an exit(0).
Feb 14 15:55:48 node4 cmcld[2469]: This node (node4) has ceased cluster activities.
Feb 14 15:55:48 node4 cmcld[2469]: Daemon exiting
Feb 14 15:55:48 node4 cmclconfd[2463]: The Serviceguard daemon, cmcld[2469], exited normally.
Halting and restarting the node fixed the problem on the second restart.
SG A.11.20.00 Date: 07/26/12 Patch: PHSS_43094 (node4)
SG A.11.20.00 Date: 07/26/12 Patch: PHSS_43094 (node3)
lan1 bridged_net 1
lan2 bridged_net 2
lan0 bridged_net 1 standby
The system is an SD2:
lan1 11/0/12/1/0/6/0 igelan HP AB465-60001 PCI/PCI-X 1000Base-T 2-port 2Gb FC/2-port 1000B-T Combo Adapter
lan0 10/0/2/1/0/6/0 igelan HP A9784-60002 PCI/PCI-X 1000Base-T FC/GigE Combo Adapter
SG configuration:
network_polling_interval=2000000
network_failure_detection=inout
network_auto_failback=yes
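For reference, an annotated copy of those settings; my reading of the parameters, with the comments not part of the configuration itself:

```
network_polling_interval=2000000   # poll interval in microseconds: check interfaces every 2 seconds
network_failure_detection=inout    # declare failure only when both inbound and outbound traffic stop
network_auto_failback=yes          # move the subnet back to the primary once it recovers
```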
The problem was seen to exist on multiple nodes. The following test was tried:
"lanadmin -r 1" was run to do a hard reset of lan1, with this result:
Feb 17 02:46:45 node4 cmnetd[10107]: Auto Failback is enabled.
Feb 19 11:42:31 node4 cmnetd[10107]: DLPI error 4, unix error 6 sending to 0016353e4478aa080009167e
Feb 19 11:42:31 node4 cmnetd[10107]: Failed to send on lan1 (2).
Feb 19 11:42:33 node4 cmnetd[10107]: lan1 is down at the data link layer.
Feb 19 11:42:33 node4 cmnetd[10107]: lan1 failed.
Feb 19 11:42:33 node4 cmnetd[10107]: lan1 switching to lan0
Feb 19 11:42:33 node4 cmnetd[10107]: Subnet 192.168.168.0 switching from lan1 to lan0
Feb 19 11:42:33 node4 cmnetd[10107]: Subnet 192.168.168.0 switched from lan1 to lan0
Feb 19 11:42:33 node4 cmnetd[10107]: lan1 switched to lan0
It never switches back to lan1 even though lan1 appears to be up after the test:
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
lan1* 1500 none none 3836582 0 6857057 0 0
lan0 1500 192.168.168.0 192.168.170.227 2037 0 3135 0 0
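Note the asterisk after lan1: netstat -in marks an interface that is down with a trailing "*". A small filter can pick those out; the sample variable stands in for live netstat -in output:

```shell
# List interfaces that netstat -in flags with a trailing "*" (down).
# The sample data mimics the capture above; on a real node you would
# pipe "netstat -in" into the same awk.
netstat_output='lan1* 1500 none          none            3836582 0 6857057 0 0
lan0  1500 192.168.168.0 192.168.170.227 2037    0 3135    0 0'

printf '%s\n' "$netstat_output" |
  awk '$1 ~ /\*$/ { sub(/\*$/, "", $1); print $1 }'
# prints: lan1
```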
For comparison, here is the behavior one would expect (and gets on a healthy node):
Feb 20 21:31:02 testnode cmnetd[8922]: lan4 is down at the data link layer.
Feb 20 21:31:02 testnode cmnetd[8922]: lan4 failed.
Feb 20 21:31:02 testnode cmnetd[8922]: lan4 switching to lan3
Feb 20 21:31:02 testnode cmnetd[8922]: Subnet 10.9.1.0 switching from lan4 to lan3
Feb 20 21:31:02 testnode cmnetd[8922]: Subnet 10.9.1.0 switched from lan4 to lan3
Feb 20 21:31:02 testnode cmnetd[8922]: lan4 switched to lan3
Feb 20 21:31:16 testnode cmnetd[8922]: lan4 is up at the data link layer.
Feb 20 21:31:16 testnode cmnetd[8922]: lan4 recovered.
The cdb_dump.11i and counters.ia64 tools were downloaded and sent to the user, who was also asked to turn on cmnetd logging.
Here the counters are cleared:
# /tmp/sg_tools>date; lanadmin -c 1; lanadmin -c 0; sleep 4
Tue Feb 26 13:34:35 IST 2013
The counters show activity on both LANs. lan1 was then reset at Tue Feb 26 13:36:34 IST 2013.
lan1 counters:
13:36:31 301 286
13:36:33 304 289
13:36:35 0 289
13:36:37 0 289
13:36:39 1 289
...
The outbound counters do not change but the inbound counters do, so the user is definitely hitting a known issue (QXCR1001247823).
The lan0 in and out counters continue to increment the whole time.
Here is where the outbound counters on lan1 start to increment again:
13:41:45 308 289
13:41:47 354 328
13:41:49 357 331
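The sampling above can be turned into a simple stall detector: flag the interface when the outbound counter stops moving while the inbound counter keeps incrementing. A sketch assuming the same three-column samples (time, inbound, outbound) gathered above:

```shell
# Flag an interface whose outbound packet counter has stalled while the
# inbound counter keeps moving -- the signature of this defect. The sample
# lines mimic the capture above (columns: time, inbound, outbound); a real
# collector would poll the driver statistics instead.
samples='13:36:31 301 286
13:36:33 304 289
13:36:35 0 289
13:36:37 0 289
13:36:39 1 289'

printf '%s\n' "$samples" |
  awk 'NR > 1 { if ($3 == prev_out && $2 != prev_in) stalled++ }
       { prev_in = $2; prev_out = $3 }
       END { if (stalled >= 2) print "outbound counter stalled" }'
# prints: outbound counter stalled
```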
The cmnetd.log ends at 13:39:49; presumably the node was then halted and restarted in the cluster at 13:41:45.
From syslog.log:
Feb 26 13:41:13 node4 syslog: cmhaltnode -v node4
Feb 26 13:41:45 node4 syslog: cmrunnode -v node4
OK, so what is cmnetd doing at the time of the reset?
Feb 26 13:36:35.703 [3610] DLPI error 4, unix error 57 sending to 0016353e4478aa080009167e
Feb 26 13:36:35.703 [3610] Delivering link error callback
Feb 26 13:36:35.703 [3610] lan1 got card error 57.
Feb 26 13:36:35.703 [3610] lan1 is down at the data link layer.
Feb 26 13:36:35.704 [3610] lan1 failed.
Feb 26 13:36:35.704 [3610] intf_name lan1 status is 1, failure_type is 2, disabled 0.
#define ENOLINK 57 /* the link has been severed */
The user is hitting the defect in which the outbound counters do not increment after the LAN reset, so the interface can never be seen as recovered.
QXCR1001266987 - an IGELAN problem only.
Problem description: outbound statistics are not incremented, and a "lanadmin -r" does not reset the outbound statistics.
Current planned schedule, GR fix: June Web Release.
Note that the defect impacts the outbound statistics only:
1. Serviceguard checks both outbound and inbound statistics. Only if both statistics stop changing is the interface considered down. This means that this defect did not cause the Serviceguard switch (failover) to the secondary card.
2. If an interface has been marked down, it will only be set back to up and failed back if both the inbound AND outbound statistics change.
NOTE: QXCR1001266987 is fixed by the driver version GigEther-01 B.11.31.1307
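The two rules can be modeled as a small pair of predicates; a hypothetical sketch of the documented behavior, not Serviceguard's actual implementation:

```shell
# Hypothetical model of the detection/failback rules described above
# (a sketch of the documented behavior, not Serviceguard source code).
# Arguments: prev_in prev_out cur_in cur_out (packet counters two polls apart)

# Declared down only if BOTH counters stopped changing.
link_down() {
    [ "$3" -eq "$1" ] && [ "$4" -eq "$2" ]
}

# Declared recovered only if BOTH counters changed again.
link_recovered() {
    [ "$3" -ne "$1" ] && [ "$4" -ne "$2" ]
}

# With the defect, inbound moves (100 -> 150) but outbound is frozen (289):
if link_down 100 289 150 289; then echo "declared down"; else echo "stays up"; fi
# prints: stays up       (the defect alone does not trigger a failover)
if link_recovered 100 289 150 289; then echo "failback"; else echo "no failback"; fi
# prints: no failback    (the frozen outbound counter blocks recovery forever)
```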
Keywords
cmmodnet, cmviewcl, syslog