UnixPedia : HPUX / LINUX / SOLARIS: HPUX :LAN Interfaces: ServiceGuard IP Address not Failing Back to Primary lan

Friday, May 9, 2014

HP Serviceguard for HP-UX - LAN Interfaces: ServiceGuard IP Address not Failing Back to Primary lan when lan Card is up and cmmodnet -e Shows lan Already Enabled.
Overview
On HP Serviceguard for HP-UX, the Serviceguard IP address does not fail back to the primary 
LAN even though the LAN card is up, and cmmodnet -e reports that the LAN is already enabled.
Procedures
The cmviewcl command shows lan3 in the down state:
 
# cmviewcl -v
 
CLUSTER        STATUS       
Cluster1      up           
 
  NODE           STATUS       STATE        
  node1        up           running      
 
    Cluster_Lock_LVM:
    VOLUME_GROUP          PHYSICAL_VOLUME       STATUS              
    /dev/vglockdisk       /dev/disk/disk50      up                  
 
    Network_Parameters:
    INTERFACE    STATUS           PATH                NAME         
    PRIMARY      down             7/0/9/1/0/6/0       lan3  <--     
    PRIMARY      up               4/0/6/1/0/6/0       lan0         
    PRIMARY      up               7/0/0/1/0/6/0       lan2         
    STANDBY      up               6/0/14/1/0/6/0      lan1 
 
The nwmgr command, however, shows lan3 as up in both Administration Status and Operation Status.
 
# /usr/sbin/nwmgr
 
Name/          Interface Station          Sub-   Interface      Related
ClassInstance  State     Address        system   Type           Interface
============== ========= ============== ======== ============== =========
lan0           UP        0x0016353E5444 igelan   1000Base-T     
lan1           UP        0x0016353E353C igelan   1000Base-T     
lan2           UP        0x0016353E249D igelan   1000Base-T     
lan3           UP        0x0016353E04CC igelan   1000Base-T   <-- is up
 
                      LAN INTERFACE STATUS DISPLAY
                       Wed, Jan 23,2013  13:04:59
 
PPA Number                      = 3  <--
Description                     = lan3 HP PCI-X 1000Base-T Release B.11.31.1112
Type (value)                    = ethernet-csmacd(6)
MTU Size                        = 1500
Speed                           = 1000000000
Station Address                 = 0x16353e04cc
Administration Status (value)   = up(1)  <--
Operation Status (value)        = up(1)
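 
The comparison above (Serviceguard reports lan3 down while the driver reports it up) can be automated. The following is a minimal sketch, assuming the two command outputs have been saved as text and keep the column layout shown above; the function names and sample constants are hypothetical and not part of any HP tool.

```python
import re

# Rows copied from the cmviewcl -v and nwmgr outputs shown above.
CMVIEWCL_SAMPLE = """\
    PRIMARY      down             7/0/9/1/0/6/0       lan3
    PRIMARY      up               4/0/6/1/0/6/0       lan0
    PRIMARY      up               7/0/0/1/0/6/0       lan2
    STANDBY      up               6/0/14/1/0/6/0      lan1
"""

NWMGR_SAMPLE = """\
lan0           UP        0x0016353E5444 igelan   1000Base-T
lan1           UP        0x0016353E353C igelan   1000Base-T
lan2           UP        0x0016353E249D igelan   1000Base-T
lan3           UP        0x0016353E04CC igelan   1000Base-T
"""


def parse_cmviewcl_interfaces(text):
    """Return {lan_name: status} from the Network_Parameters rows."""
    status = {}
    for line in text.splitlines():
        # Layout assumed: INTERFACE  STATUS  PATH  NAME
        m = re.match(r"\s*(PRIMARY|STANDBY)\s+(\S+)\s+\S+\s+(lan\d+)", line)
        if m:
            status[m.group(3)] = m.group(2).lower()
    return status


def parse_nwmgr_states(text):
    """Return {lan_name: state} from the nwmgr summary table."""
    states = {}
    for line in text.splitlines():
        # Layout assumed: ClassInstance  State  ...
        m = re.match(r"\s*(lan\d+)\s+(\S+)", line)
        if m:
            states[m.group(1)] = m.group(2).lower()
    return states


def find_mismatches(cmviewcl_text, nwmgr_text):
    """List interfaces Serviceguard sees as down but the driver sees as up."""
    sg = parse_cmviewcl_interfaces(cmviewcl_text)
    drv = parse_nwmgr_states(nwmgr_text)
    return sorted(
        name for name, st in sg.items()
        if st == "down" and drv.get(name) == "up"
    )


if __name__ == "__main__":
    print(find_mismatches(CMVIEWCL_SAMPLE, NWMGR_SAMPLE))  # ['lan3']
```

An interface that appears in this list is a candidate for the defect described in this article, although a mismatch alone does not prove it.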
Cause
CAUSE: Known Defect. See below.
Answer/Solution
After a failure, the IP address on lan3 switched to the standby LAN but did not switch back
 automatically, even though lan3 was up and the SG configuration was set to Auto_Failback. 
When the user noticed that lan3 appeared to be up, they tried to move it back manually using
 cmmodnet -e lan3, which failed.
 
This issue occurs when the primary LAN fails in an SG environment and SG fails it over to the 
standby LAN. The card then appears to be UP, so the user attempts a manual failback with 
cmmodnet -e, which reports that the LAN is already enabled. Here is a snippet from syslog.log:
 
Feb 14 15:42:22 node4 cmnetd[2479]: lan1 is down at the data link layer.
Feb 14 15:42:22 node4 cmnetd[2479]: lan1 failed.
Feb 14 15:42:22 node4 cmnetd[2479]: lan1 switching to lan0
Feb 14 15:42:22 node4 cmnetd[2479]: Subnet 192.168.168.0 switching from lan1 to lan0
Feb 14 15:42:22 node4 cmnetd[2479]: Subnet 192.168.168.0 switched from lan1 to lan0
Feb 14 15:42:22 node4 cmnetd[2479]: lan1 switched to lan0
Feb 14 15:50:00 node4 syslog: cmmodnet -e lan1
Feb 14 15:50:00 node4 cmnetd[2479]: Request to enable interface lan1
Feb 14 15:50:00 node4 cmnetd[2479]: Attempt to enable network interface lan1 when it is already enabled.
Feb 14 15:50:49 node4 su: + 1 sa367701-root
Feb 14 15:52:18 node4 cmnetd[2479]: DLPI error 4, unix error 6 sending to 0017a4a4b552aa080009167e
Feb 14 15:52:18 node4 cmnetd[2479]: Failed to send on lan0 (1).
Feb 14 15:52:18 node4 cmnetd[2479]: DLPI error 4, unix error 6 sending to 001f296e6ee0aa080009167e
Feb 14 15:52:18 node4 cmnetd[2479]: DLPI error 4, unix error 6 sending to 001f296e6e84aa080009167e
Feb 14 15:52:18 node4 cmnetd[2479]: Failed to send on lan0 (1).
Feb 14 15:52:20 node4 cmnetd[2479]: lan0 is down at the data link layer.
Feb 14 15:52:20 node4 cmnetd[2479]: lan0 failed.
Feb 14 15:52:20 node4 cmnetd[2479]: Subnet 192.168.168.0 down
Feb 14 15:55:40 node4 cmcld[2469]: Request from root on node node4 to halt the cluster on this node
Feb 14 15:55:40 node4 cmcld[2469]: Request from node node4 to disable node switching for package package1 on node node4.
Feb 14 15:55:40 node4 cmcld[2469]: Disabled package package1 on node node4.
Feb 14 15:55:43 node4 cmcld[2469]: Member node4 halting.
Feb 14 15:55:43 node4 cmcld[2469]: Closing route 192.168.170.228:5300 on fd39 to package1n2: Software caused connection abort
Feb 14 15:55:43 node4 cmcld[2469]: Closing route 10.10.50.42:5300 on fd 40 to package1n2: Software caused connection abort
Feb 14 15:55:43 node4 cmcld[2469]: Setting gmsg transport state to ERROR (from READY)
Feb 14 15:55:43 node4 cmcld[2469]: Membership: membership at 1 is HALTED (coordinator 1) includes: 1 2 excludes:
Feb 14 15:55:43 node4 cmnetd[2479]: Subnet 192.168.168.0 switching from lan0 to lan1
Feb 14 15:55:43 node4 cmnetd[2479]: Subnet 192.168.168.0 switched from lan0 to lan1
Feb 14 15:55:43 node4 cmnetd[2479]: lan0 switched to lan1
Feb 14 15:55:43 node4 cmserviced[2474]: Service cmnetd completed successfully with an exit(0).
Feb 14 15:55:48 node4 cmcld[2469]: This node (node4) has ceased cluster activities.
Feb 14 15:55:48 node4 cmcld[2469]: Daemon exiting
Feb 14 15:55:48 node4 cmclconfd[2463]: The Serviceguard daemon, cmcld[2469], exited normally.
 
Halting and restarting the node fixed the problem on the 2nd restart.
 
    SG  A.11.20.00 Date: 07/26/12 Patch: PHSS_43094 (node4)
    SG  A.11.20.00 Date: 07/26/12 Patch: PHSS_43094 (node3)
 
lan1 bridged_net 1
lan2 bridged_net 2
lan0 bridged_net 1 standby
 
System is a SD2:
 
lan1 11/0/12/1/0/6/0 igelan HP AB465-60001 PCI/PCI-X 1000Base-T 2-port 2Gb FC/2-port 1000B-T Combo Adapter 
lan0 10/0/2/1/0/6/0  igelan HP A9784-60002 PCI/PCI-X 1000Base-T FC/GigE Combo Adapter
 
SG configuration:
 
network_polling_interval=2000000
network_failure_detection=inout
network_auto_failback=yes
 
The problem was seen on multiple nodes. The following tests were tried:
 
Ran "lanadmin -r 1" to do a hard reset on lan1 and the result was this:
 
Feb 17 02:46:45 node4 cmnetd[10107]: Auto Failback is enabled.
Feb 19 11:42:31 node4 cmnetd[10107]: DLPI error 4, unix error 6 sending to 0016353e4478aa080009167e
Feb 19 11:42:31 node4 cmnetd[10107]: Failed to send on lan1 (2).
Feb 19 11:42:33 node4 cmnetd[10107]: lan1 is down at the data link layer.
Feb 19 11:42:33 node4 cmnetd[10107]: lan1 failed.
Feb 19 11:42:33 node4 cmnetd[10107]: lan1 switching to lan0
Feb 19 11:42:33 node4 cmnetd[10107]: Subnet 192.168.168.0 switching from lan1 to lan0
Feb 19 11:42:33 node4 cmnetd[10107]: Subnet 192.168.168.0 switched from lan1 to lan0
Feb 19 11:42:33 node4 cmnetd[10107]: lan1 switched to lan0
 
It never switches back to lan1 even though lan1 appears to be up after the test:
 
Name   Mtu  Network       Address         Ipkts     Ierrs  Opkts    Oerrs Coll
lan1*  1500 none          none            3836582   0     6857057   0     0
lan0   1500 192.168.168.0 192.168.170.227 2037      0     3135      0     0
 
 
Here is the behavior one would expect, as seen on a test node:
 
Feb 20 21:31:02 testnode cmnetd[8922]: lan4 is down at the data link layer.
Feb 20 21:31:02 testnode cmnetd[8922]: lan4 failed.
Feb 20 21:31:02 testnode cmnetd[8922]: lan4 switching to lan3
Feb 20 21:31:02 testnode cmnetd[8922]: Subnet 10.9.1.0 switching from lan4 to lan3
Feb 20 21:31:02 testnode cmnetd[8922]: Subnet 10.9.1.0 switched from lan4 to lan3
Feb 20 21:31:02 testnode cmnetd[8922]: lan4 switched to lan3
Feb 20 21:31:16 testnode cmnetd[8922]: lan4 is up at the data link layer.
Feb 20 21:31:16 testnode cmnetd[8922]: lan4 recovered.
 
The cdb_dump.11i and counters.ia64 tools were downloaded and sent to the user, who was also asked to turn on cmnetd logging.
Here the counters are cleared:
 
# /tmp/sg_tools>date; lanadmin -c 1; lanadmin -c 0; sleep 4
Tue Feb 26 13:34:35 IST 2013
 
The counters show this event on both LANs.
 
Now reset lan1 at Tue Feb 26 13:36:34 IST 2013
 
lan1 counters (time, inbound, outbound):
13:36:31 301 286
13:36:33 304 289
13:36:35 0 289
13:36:37 0 289
13:36:39 1 289
...
The outbound counter does not change but the inbound counter does, so the user is definitely hitting a known issue, QXCR1001247823.
 
lan0 in and out counters continue to increment the whole time.
 
Now here is where outbound counters start to increment again.
13:41:45 308 289
13:41:47 354 328
13:41:49 357 331
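 
The counter samples can be checked mechanically for the stalled outbound counter. The sketch below is illustrative only: it assumes the three columns are time, inbound and outbound (which matches the observation that outbound stops while inbound continues), and the function names are hypothetical.

```python
# Samples copied from the lan1 counter listings above.
AFTER_RESET = """\
13:36:31 301 286
13:36:33 304 289
13:36:35 0 289
13:36:37 0 289
13:36:39 1 289
"""

AFTER_RESTART = """\
13:41:45 308 289
13:41:47 354 328
13:41:49 357 331
"""


def parse_samples(text):
    """Return a list of (time, inbound, outbound) tuples."""
    samples = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[1].isdigit() and parts[2].isdigit():
            samples.append((parts[0], int(parts[1]), int(parts[2])))
    return samples


def deltas(samples):
    """For each consecutive pair, report whether each counter changed."""
    result = []
    for (_, in0, out0), (t1, in1, out1) in zip(samples, samples[1:]):
        result.append((t1, in1 != in0, out1 != out0))
    return result


def outbound_stalls(samples):
    """Return the sample times at which the outbound counter did not move."""
    return [t for t, _, out_changed in deltas(samples) if not out_changed]


if __name__ == "__main__":
    print(outbound_stalls(parse_samples(AFTER_RESET)))
    # ['13:36:35', '13:36:37', '13:36:39']
    print(outbound_stalls(parse_samples(AFTER_RESTART)))
    # []
```

The two listings are parsed separately because the samples between 13:36:39 and 13:41:45 are elided in the article.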
 
The cmnetd.log ends at 13:39:49. My guess is that after this they halted the node and restarted it in the cluster at 13:41:45.
 
From syslog.log:
Feb 26 13:41:13 node4 syslog: cmhaltnode -v node4
Feb 26 13:41:45 node4 syslog: cmrunnode -v node4
 
OK, so what is cmnetd doing at the time of the reset?
Feb 26 13:36:35.703 [3610] DLPI error 4, unix error 57 sending to 0016353e4478aa080009167e
Feb 26 13:36:35.703 [3610] Delivering link error callback
Feb 26 13:36:35.703 [3610] lan1 got card error 57.
Feb 26 13:36:35.703 [3610] lan1 is down at the data link layer.
Feb 26 13:36:35.704 [3610] lan1 failed.
Feb 26 13:36:35.704 [3610] intf_name lan1 status is 1, failure_type is 2, disabled 0.
 
#define ENOLINK         57      /* the link has been severed */
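 
When reading many cmnetd lines it can help to label the "unix error" number with its symbolic name. A small sketch follows; the table holds only the two codes that appear in this article (57 is taken from the define above, and 6 is assumed to be ENXIO, its conventional UNIX value), and the function name is hypothetical.

```python
import re

# Only the two codes that appear in this article.
# 57 = ENOLINK comes from the define quoted above.
# 6 = ENXIO is an assumption based on the conventional UNIX errno value.
ERRNO_NAMES = {6: "ENXIO", 57: "ENOLINK"}


def annotate_unix_error(line):
    """Append the symbolic errno name to a cmnetd 'unix error N' line."""
    m = re.search(r"unix error (\d+)", line)
    if not m:
        return line  # nothing to annotate
    code = int(m.group(1))
    return f"{line}  [{ERRNO_NAMES.get(code, 'unknown')}]"


if __name__ == "__main__":
    sample = ("Feb 26 13:36:35.703 [3610] DLPI error 4, unix error 57 "
              "sending to 0016353e4478aa080009167e")
    print(annotate_unix_error(sample))
```

Codes outside the table are labeled "unknown" rather than guessed, since errno numbering differs between operating systems.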
 
The user is hitting the defect where the outbound counters do not increment after the LAN 
reset, and therefore the interface can never recover. 
 
QXCR1001266987 - an IGELAN problem, only. 
Problem description: Outbound statistics are not incremented and a "lanadmin -r" does not 
reset the outbound statistics.
 
Current planned schedule, GR-fix: June Web Release.
 
Note that the defect impacts the outbound statistics only:
 
1. Serviceguard checks both outbound and inbound statistics. The interface is considered down 
only if both statistics do not change. This means that this defect did not cause the SG 
switch (failover) to the secondary card.
 
2. If an interface has been marked down, it will be set to up and failed back only if both
 the inbound AND outbound statistics change.
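 
The two rules above explain why the interface stays down once it has failed for any other reason. The sketch below models only those two rules as stated in this article; it is a simplification for illustration, not Serviceguard's actual implementation, and the function names are hypothetical.

```python
def next_link_state(is_up, inbound_changed, outbound_changed):
    """Apply the two rules to one polling interval.

    Rule 1: an up interface is marked down only if neither counter changed.
    Rule 2: a down interface is marked up only if both counters changed.
    """
    if is_up:
        return inbound_changed or outbound_changed
    return inbound_changed and outbound_changed


def replay(initial_up, observations):
    """Return the state after each (inbound_changed, outbound_changed) pair."""
    state = initial_up
    history = []
    for inbound_changed, outbound_changed in observations:
        state = next_link_state(state, inbound_changed, outbound_changed)
        history.append(state)
    return history


if __name__ == "__main__":
    # Interface already marked down, outbound counter stuck (the defect):
    print(replay(False, [(True, False)] * 3))            # [False, False, False]
    # Outbound counter moves again, so the interface can fail back:
    print(replay(False, [(True, False), (True, True)]))  # [False, True]
    # A stuck outbound counter alone does not take an up interface down:
    print(replay(True, [(True, False)]))                 # [True]
```

The last case corresponds to point 1: the stuck outbound counter does not cause the failover by itself, but per point 2 it blocks the failback.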
NOTE: QXCR1001266987 is fixed by the driver version GigEther-01 B.11.31.1307
Keywords.
cmmodnet, cmviewcl, syslog
