HP Serviceguard for HP-UX - LAN Interfaces: Serviceguard IP Address Not Failing Back to the Primary LAN when the LAN Card Is Up and cmmodnet -e Shows the LAN Already Enabled
Overview
The Serviceguard IP address does not fail back to the primary LAN even though the LAN card is up, and cmmodnet -e reports that the LAN is already enabled.
Procedures
The cmviewcl command shows the status of lan3 in down state:
# cmviewcl -v
CLUSTER STATUS
Cluster1 up
NODE STATUS STATE
node1 up running
Cluster_Lock_LVM:
VOLUME_GROUP PHYSICAL_VOLUME STATUS
/dev/vglockdisk /dev/disk/disk50 up
Network_Parameters:
INTERFACE STATUS PATH NAME
PRIMARY down 7/0/9/1/0/6/0 lan3 <--
PRIMARY up 4/0/6/1/0/6/0 lan0
PRIMARY up 7/0/0/1/0/6/0 lan2
STANDBY up 6/0/14/1/0/6/0 lan1
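A quick way to spot such interfaces is to filter the cmviewcl output for a "down" status. A minimal sketch; the sample variable stands in for live "cmviewcl -v" output, and the awk only assumes the Network_Parameters column layout shown above:

```shell
# Filter Serviceguard's view for down interfaces. The sample variable
# stands in for live "cmviewcl -v" output (Network_Parameters section);
# on a real node you would pipe cmviewcl -v into the same awk.
cmviewcl_output='PRIMARY down 7/0/9/1/0/6/0 lan3
PRIMARY up 4/0/6/1/0/6/0 lan0
PRIMARY up 7/0/0/1/0/6/0 lan2
STANDBY up 6/0/14/1/0/6/0 lan1'

printf '%s\n' "$cmviewcl_output" |
  awk '$2 == "down" { print $4 " (" $1 ") is down" }'
# prints: lan3 (PRIMARY) is down
```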
The nwmgr command, however, shows the interface state as UP:
# /usr/sbin/nwmgr
Name/          Interface Station        Sub-    Interface      Related
ClassInstance  State     Address        system  Type           Interface
============== ========= ============== ======= ============== =========
lan0           UP        0x0016353E5444 igelan  1000Base-T
lan1           UP        0x0016353E353C igelan  1000Base-T
lan2           UP        0x0016353E249D igelan  1000Base-T
lan3           UP        0x0016353E04CC igelan  1000Base-T     <-- is up
The lanadmin display likewise shows both the Administration and Operational Status as up:
LAN INTERFACE STATUS DISPLAY
Wed, Jan 23, 2013 13:04:59
PPA Number = 3 <--
Description = lan3 HP PCI-X 1000Base-T Release B.11.31.1112
Type (value) = ethernet-csmacd(6)
MTU Size = 1500
Speed = 1000000000
Station Address = 0x16353e04cc
Administration Status (value) = up(1) <--
Operation Status (value) = up(1)
Cause
Known defect. See below.
Answer/Solution
After a failure of some sort, lan3 switched to the standby lan0 but then did not switch back automatically even when it was up again, although the Serviceguard configuration was set to auto-failback. When the user noticed that lan3 appeared to be up, they attempted to move it back manually using cmmodnet -e lan3, which failed.
This is the general pattern: the primary LAN fails in a Serviceguard environment, and Serviceguard fails it over to the standby LAN. The card then appears to be UP, so the user attempts a manual cmmodnet -e, which reports that the LAN is already enabled. Here is a snippet from syslog.log:
Feb 14 15:42:22 node4 cmnetd[2479]: lan1 is down at the data link layer.
Feb 14 15:42:22 node4 cmnetd[2479]: lan1 failed.
Feb 14 15:42:22 node4 cmnetd[2479]: lan1 switching to lan0
Feb 14 15:42:22 node4 cmnetd[2479]: Subnet 192.168.168.0 switching from lan1 to lan0
Feb 14 15:42:22 node4 cmnetd[2479]: Subnet 192.168.168.0 switched from lan1 to lan0
Feb 14 15:42:22 node4 cmnetd[2479]: lan1 switched to lan0
Feb 14 15:50:00 node4 syslog: cmmodnet -e lan1
Feb 14 15:50:00 node4 cmnetd[2479]: Request to enable interface lan1
Feb 14 15:50:00 node4 cmnetd[2479]: Attempt to enable network interface lan1 when it is already enabled.
Feb 14 15:50:49 node4 su: + 1 sa367701-root
Feb 14 15:52:18 node4 cmnetd[2479]: DLPI error 4, unix error 6 sending to 0017a4a4b552aa080009167e
Feb 14 15:52:18 node4 cmnetd[2479]: Failed to send on lan0 (1).
Feb 14 15:52:18 node4 cmnetd[2479]: DLPI error 4, unix error 6 sending to 001f296e6ee0aa080009167e
Feb 14 15:52:18 node4 cmnetd[2479]: DLPI error 4, unix error 6 sending to 001f296e6e84aa080009167e
Feb 14 15:52:18 node4 cmnetd[2479]: Failed to send on lan0 (1).
Feb 14 15:52:20 node4 cmnetd[2479]: lan0 is down at the data link layer.
Feb 14 15:52:20 node4 cmnetd[2479]: lan0 failed.
Feb 14 15:52:20 node4 cmnetd[2479]: Subnet 192.168.168.0 down
Feb 14 15:55:40 node4 cmcld[2469]: Request from root on node node4 to halt the cluster on this node
Feb 14 15:55:40 node4 cmcld[2469]: Request from node node4 to disable node switching for package package1 on node node4.
Feb 14 15:55:40 node4 cmcld[2469]: Disabled package package1 on node node4.
Feb 14 15:55:43 node4 cmcld[2469]: Member node4 halting.
Feb 14 15:55:43 node4 cmcld[2469]: Closing route 192.168.170.228:5300 on fd39 to package1n2: Software caused connection abort
Feb 14 15:55:43 node4 cmcld[2469]: Closing route 10.10.50.42:5300 on fd 40 to package1n2: Software caused connection abort
Feb 14 15:55:43 node4 cmcld[2469]: Setting gmsg transport state to ERROR (from READY)
Feb 14 15:55:43 node4 cmcld[2469]: Membership: membership at 1 is HALTED (coordinator 1) includes: 1 2 excludes:
Feb 14 15:55:43 node4 cmnetd[2479]: Subnet 192.168.168.0 switching from lan0 to lan1
Feb 14 15:55:43 node4 cmnetd[2479]: Subnet 192.168.168.0 switched from lan0 to lan1
Feb 14 15:55:43 node4 cmnetd[2479]: lan0 switched to lan1
Feb 14 15:55:43 node4 cmserviced[2474]: Service cmnetd completed successfully with an exit(0).
Feb 14 15:55:48 node4 cmcld[2469]: This node (node4) has ceased cluster activities.
Feb 14 15:55:48 node4 cmcld[2469]: Daemon exiting
Feb 14 15:55:48 node4 cmclconfd[2463]: The Serviceguard daemon, cmcld[2469], exited normally.
Halting and restarting the node fixed the problem on the second restart.
SG A.11.20.00 Date: 07/26/12 Patch: PHSS_43094 (node4)
SG A.11.20.00 Date: 07/26/12 Patch: PHSS_43094 (node3)
lan1 bridged_net 1
lan2 bridged_net 2
lan0 bridged_net 1 standby
The system is an SD2:
lan1 11/0/12/1/0/6/0 igelan HP AB465-60001 PCI/PCI-X 1000Base-T 2-port 2Gb FC/2-port 1000B-T Combo Adapter
lan0 10/0/2/1/0/6/0 igelan HP A9784-60002 PCI/PCI-X 1000Base-T FC/GigE Combo Adapter
SG configuration:
network_polling_interval=2000000
network_failure_detection=inout
network_auto_failback=yes
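For reference, an annotated copy of those settings; my reading of the parameters, with the comments not part of the configuration itself:

```
network_polling_interval=2000000   # poll interval in microseconds: check interfaces every 2 seconds
network_failure_detection=inout    # declare failure only when both inbound and outbound traffic stop
network_auto_failback=yes          # move the subnet back to the primary once it recovers
```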
The problem was seen to exist on multiple nodes. The following test was tried:
"lanadmin -r 1" was run to do a hard reset of lan1, with this result:
Feb 17 02:46:45 node4 cmnetd[10107]: Auto Failback is enabled.
Feb 19 11:42:31 node4 cmnetd[10107]: DLPI error 4, unix error 6 sending to 0016353e4478aa080009167e
Feb 19 11:42:31 node4 cmnetd[10107]: Failed to send on lan1 (2).
Feb 19 11:42:33 node4 cmnetd[10107]: lan1 is down at the data link layer.
Feb 19 11:42:33 node4 cmnetd[10107]: lan1 failed.
Feb 19 11:42:33 node4 cmnetd[10107]: lan1 switching to lan0
Feb 19 11:42:33 node4 cmnetd[10107]: Subnet 192.168.168.0 switching from lan1 to lan0
Feb 19 11:42:33 node4 cmnetd[10107]: Subnet 192.168.168.0 switched from lan1 to lan0
Feb 19 11:42:33 node4 cmnetd[10107]: lan1 switched to lan0
It never switches back to lan1 even though lan1 appears to be up after the test:
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
lan1* 1500 none none 3836582 0 6857057 0 0
lan0 1500 192.168.168.0 192.168.170.227 2037 0 3135 0 0
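Note the asterisk after lan1: netstat -in marks an interface that is down with a trailing "*". A small filter can pick those out; the sample variable stands in for live netstat -in output:

```shell
# List interfaces that netstat -in flags with a trailing "*" (down).
# The sample data mimics the capture above; on a real node you would
# pipe "netstat -in" into the same awk.
netstat_output='lan1* 1500 none          none            3836582 0 6857057 0 0
lan0  1500 192.168.168.0 192.168.170.227 2037    0 3135    0 0'

printf '%s\n' "$netstat_output" |
  awk '$1 ~ /\*$/ { sub(/\*$/, "", $1); print $1 }'
# prints: lan1
```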
For comparison, here is the behavior one would expect (and gets on a healthy node):
Feb 20 21:31:02 testnode cmnetd[8922]: lan4 is down at the data link layer.
Feb 20 21:31:02 testnode cmnetd[8922]: lan4 failed.
Feb 20 21:31:02 testnode cmnetd[8922]: lan4 switching to lan3
Feb 20 21:31:02 testnode cmnetd[8922]: Subnet 10.9.1.0 switching from lan4 to lan3
Feb 20 21:31:02 testnode cmnetd[8922]: Subnet 10.9.1.0 switched from lan4 to lan3
Feb 20 21:31:02 testnode cmnetd[8922]: lan4 switched to lan3
Feb 20 21:31:16 testnode cmnetd[8922]: lan4 is up at the data link layer.
Feb 20 21:31:16 testnode cmnetd[8922]: lan4 recovered.
The cdb_dump.11i and counters.ia64 tools were downloaded and sent to the user, who was also asked to turn on cmnetd logging.
Here the counters are cleared:
# /tmp/sg_tools>date; lanadmin -c 1; lanadmin -c 0; sleep 4
Tue Feb 26 13:34:35 IST 2013
The counters show activity on both LANs. lan1 was then reset at Tue Feb 26 13:36:34 IST 2013.
lan1 counters:
13:36:31 301 286
13:36:33 304 289
13:36:35 0 289
13:36:37 0 289
13:36:39 1 289
...
The outbound counters do not change but the inbound counters do, so the user is definitely hitting a known issue (QXCR1001247823).
The lan0 in and out counters continue to increment the whole time.
Here is where the outbound counters on lan1 start to increment again:
13:41:45 308 289
13:41:47 354 328
13:41:49 357 331
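The sampling above can be turned into a simple stall detector: flag the interface when the outbound counter stops moving while the inbound counter keeps incrementing. A sketch assuming the same three-column samples (time, inbound, outbound) gathered above:

```shell
# Flag an interface whose outbound packet counter has stalled while the
# inbound counter keeps moving -- the signature of this defect. The sample
# lines mimic the capture above (columns: time, inbound, outbound); a real
# collector would poll the driver statistics instead.
samples='13:36:31 301 286
13:36:33 304 289
13:36:35 0 289
13:36:37 0 289
13:36:39 1 289'

printf '%s\n' "$samples" |
  awk 'NR > 1 { if ($3 == prev_out && $2 != prev_in) stalled++ }
       { prev_in = $2; prev_out = $3 }
       END { if (stalled >= 2) print "outbound counter stalled" }'
# prints: outbound counter stalled
```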
The cmnetd.log ends at 13:39:49; presumably the node was then halted and restarted in the cluster at 13:41:45.
From syslog.log:
Feb 26 13:41:13 node4 syslog: cmhaltnode -v node4
Feb 26 13:41:45 node4 syslog: cmrunnode -v node4
OK, so what is cmnetd doing at the time of the reset?
Feb 26 13:36:35.703 [3610] DLPI error 4, unix error 57 sending to 0016353e4478aa080009167e
Feb 26 13:36:35.703 [3610] Delivering link error callback
Feb 26 13:36:35.703 [3610] lan1 got card error 57.
Feb 26 13:36:35.703 [3610] lan1 is down at the data link layer.
Feb 26 13:36:35.704 [3610] lan1 failed.
Feb 26 13:36:35.704 [3610] intf_name lan1 status is 1, failure_type is 2, disabled 0.
#define ENOLINK 57 /* the link has been severed */
The user is hitting the defect in which the outbound counters do not increment after the LAN reset, so the interface can never be seen as recovered.
QXCR1001266987 - an IGELAN problem only.
Problem description: outbound statistics are not incremented, and a "lanadmin -r" does not reset the outbound statistics.
Current planned schedule, GR fix: June Web Release.
Note that the defect impacts the outbound statistics only:
1. Serviceguard checks both outbound and inbound statistics. Only if both statistics stop changing is the interface considered down. This means that this defect did not cause the Serviceguard switch (failover) to the secondary card.
2. If an interface has been marked down, it will only be set back to up and failed back if both the inbound AND outbound statistics change.
NOTE: QXCR1001266987 is fixed by the driver version GigEther-01 B.11.31.1307
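The two rules can be modeled as a small pair of predicates; a hypothetical sketch of the documented behavior, not Serviceguard's actual implementation:

```shell
# Hypothetical model of the detection/failback rules described above
# (a sketch of the documented behavior, not Serviceguard source code).
# Arguments: prev_in prev_out cur_in cur_out (packet counters two polls apart)

# Declared down only if BOTH counters stopped changing.
link_down() {
    [ "$3" -eq "$1" ] && [ "$4" -eq "$2" ]
}

# Declared recovered only if BOTH counters changed again.
link_recovered() {
    [ "$3" -ne "$1" ] && [ "$4" -ne "$2" ]
}

# With the defect, inbound moves (100 -> 150) but outbound is frozen (289):
if link_down 100 289 150 289; then echo "declared down"; else echo "stays up"; fi
# prints: stays up       (the defect alone does not trigger a failover)
if link_recovered 100 289 150 289; then echo "failback"; else echo "no failback"; fi
# prints: no failback    (the frozen outbound counter blocks recovery forever)
```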
Keywords
cmmodnet, cmviewcl, syslog