UnixPedia : HPUX / LINUX / SOLARIS: January 2015

Sunday, January 11, 2015

MC/ServiceGuard: How to Disable the IP Monitoring on Running Cluster

Issue:
During a switch port maintenance activity, lan1 failed and was failed over to lan16 automatically, but the reverse did not happen: we had to switch from lan16 back to lan1 manually.

Expected behaviour: lan1 should fail over to lan16 and vice versa.
The cmgetconf output shows that the standby interface for lan1 is lan16,
that is, lan1 is the primary and lan16 is the standby.

NODE_NAME               tdsxdbp03
  NETWORK_INTERFACE     lan1
    HEARTBEAT_IP        172.98.212.10
  NETWORK_INTERFACE     lan8
    HEARTBEAT_IP        192.168.112.140
  NETWORK_INTERFACE     lan0
    HEARTBEAT_IP        172.98.213.10
  NETWORK_INTERFACE     lan16
#  CLUSTER_LOCK_LUN
  FIRST_CLUSTER_LOCK_PV /dev/disk/disk5
# Primary Network Interfaces on Bridged Net 1: lan1.
#   Possible standby Network Interfaces on Bridged Net 1: lan16.
# Primary Network Interfaces on Bridged Net 2: lan8.
#   Warning: There are no standby network interfaces on bridged net 2.
# Primary Network Interfaces on Bridged Net 3: lan0.
#   Warning: There are no standby network interfaces on bridged net 3.
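
The primary/standby pairing can be read straight from the comment lines that cmgetconf writes into the ASCII file. A minimal sketch, assuming the comments keep the wording shown above; the function name show_bridged_nets is made up for this example:

```shell
# Print the bridged-net summary comments from a cluster ASCII file,
# with the leading "#" and indentation stripped.
# Usage: show_bridged_nets /etc/cmcluster/<clustername_date>.ascii
show_bridged_nets() {
    grep -E 'Network Interfaces on Bridged Net|no standby network interfaces' "$1" |
        sed 's/^#[[:space:]]*//'
}
```

A bridged net that prints only a "Primary" line followed by the "no standby" warning has no interface to fail over to.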

Network auto-failback is also enabled in the configuration:

# NETWORK_AUTO_FAILBACK
# When set to YES a recovery of the primary LAN interface will cause failback
# from the standby LAN interface to the primary.
# When set to NO a recovery of the primary LAN interface will do nothing and
# the standby LAN interface will continue to be used until cmmodnet -e lanX
# is issued for the primary LAN interface.

NETWORK_AUTO_FAILBACK           YES

The following messages were logged in syslog.log during the switch migration activity. They clearly show lan1 switching to lan16 successfully on failover.
However, they also show that both LANs failed at the IP layer.

Jan 10 05:28:20 tdsxdbp03 cmnetd[27697]: 172.98.212.10 failed.
Jan 10 05:28:20 tdsxdbp03 cmnetd[27697]: lan1 is down at the IP layer.
Jan 10 05:28:20 tdsxdbp03 cmnetd[27697]: lan1 failed.
Jan 10 05:28:00 tdsxdbp03 su: + tty?? root-sentrigo
Jan 10 05:28:20 tdsxdbp03  above message repeats 3 times
Jan 10 05:28:20 tdsxdbp03 cmnetd[27697]: lan1 switching to lan16              <-- Switching to standby network
Jan 10 05:28:20 tdsxdbp03 cmnetd[27697]: Subnet 172.98.212.0 switching from lan1 to lan16
Jan 10 05:28:20 tdsxdbp03 cmnetd[27697]: Subnet 172.98.212.0 switched from lan1 to lan16
Jan 10 05:28:20 tdsxdbp03 cmnetd[27697]: lan1 switched to lan16
Jan 10 05:28:28 tdsxdbp03 cmnetd[27697]: 172.98.212.10 failed.
Jan 10 05:28:28 tdsxdbp03 cmnetd[27697]: lan16 is down at the IP layer.
Jan 10 05:28:28 tdsxdbp03 cmnetd[27697]: lan16 failed.
Jan 10 05:28:28 tdsxdbp03 cmnetd[27697]: Subnet 172.98.212.0 down
Jan 10 05:28:40 tdsxdbp03 cimserver[25010]: PGS10405: Failed to deliver an indication: PGS08001: CIM HTTP or HTTPS connector cannot connect to 10.36.218.66:50004. Connection failed.
Jan 10 05:28:40 tdsxdbp03 cimserver[25010]: PGS10405: Failed to deliver an indication: PGS08001: CIM HTTP or HTTPS connector cannot connect to 10.36.218.67:50004. Connection failed.
Jan 10 05:28:40 tdsxdbp03 cimserver[25010]: PGS10405: Failed to deliver an indication: PGS08001: CIM HTTP or HTTPS connector cannot connect to 10.36.152.168:50004. Connection failed.
Jan 10 05:28:50 tdsxdbp03 cmnetd[27697]: 172.98.212.10 recovered.
Jan 10 05:28:50 tdsxdbp03 cmnetd[27697]: Subnet 172.98.212.0 up
Jan 10 05:28:40 tdsxdbp03 cimserver[25010]: PGS10405: Failed to deliver an indication: PGS08001: CIM HTTP or HTTPS connector cannot connect to 10.36.218.66:50004. Connection failed.
Jan 10 05:28:50 tdsxdbp03 cmnetd[27697]: lan16 is up at the IP layer.
Jan 10 05:28:40 tdsxdbp03 cimserver[25010]: PGS10405: Failed to deliver an indication: PGS08001: CIM HTTP or HTTPS connector cannot connect to 10.36.152.168:50004. Connection failed.
Jan 10 05:28:50 tdsxdbp03 cmnetd[27697]: lan16 recovered.
Jan 10 05:29:00 tdsxdbp03 su: + tty?? root-conclusr

Jan 10 06:20:52 tdsxdbp03 cmnetd[27697]: 172.98.212.10 failed.
Jan 10 06:20:21 tdsxdbp03 su: + tty?? root-conclusr
Jan 10 06:20:52 tdsxdbp03 cmnetd[27697]: lan16 is down at the IP layer.
Jan 10 06:20:52 tdsxdbp03 cmnetd[27697]: lan16 failed.
Jan 10 06:20:52 tdsxdbp03 cmnetd[27697]: Subnet 172.98.212.0 down
Jan 10 06:21:00 tdsxdbp03 su: + tty?? root-sentrigo
Jan 10 06:21:02 tdsxdbp03 cmnetd[27697]: 172.98.212.10 recovered.
Jan 10 06:21:02 tdsxdbp03 cmnetd[27697]: Subnet 172.98.212.0 up
Jan 10 06:21:02 tdsxdbp03 cmnetd[27697]: lan16 is up at the IP layer.
Jan 10 06:21:02 tdsxdbp03 cmnetd[27697]: lan16 recovered.
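
The excerpts above are interleaved with unrelated su and cimserver entries. A small sketch for pulling out only the cmnetd lines; the function name cmnetd_events is made up for this example, and /var/adm/syslog/syslog.log is the usual HP-UX syslog location:

```shell
# Print only the entries logged by the Serviceguard network daemon (cmnetd),
# dropping everything else in the file.
# Usage: cmnetd_events /var/adm/syslog/syslog.log
cmnetd_events() {
    grep 'cmnetd\[' "$1"
}
```

Piping the result through grep for a subnet or interface name (for example lan16) narrows it further.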

The following log entries show the customer enabling the LAN card manually, which was successful:
Jan 10 12:27:30 tdsxdbp03 syslog: cmmodnet -e lan1
Jan 10 12:27:30 tdsxdbp03 cmnetd[27697]: Request to enable interface lan1
Jan 10 12:27:14 tdsxdbp03 su: + tty?? root-conclusr
Jan 10 12:27:30 tdsxdbp03 cmnetd[27697]: Subnet 172.98.212.0 switching from lan16 to lan1
Jan 10 12:27:30 tdsxdbp03 cmnetd[27697]: Subnet 172.98.212.0 switched from lan16 to lan1
Jan 10 12:27:30 tdsxdbp03 cmnetd[27697]: lan16 switched to lan1

The issue is caused by IP_MONITOR being enabled for subnet 172.98.212.0; that is why the "down at the IP layer" messages appear in syslog.

SUBNET 172.98.212.0
  IP_MONITOR ON
  POLLING_TARGET 172.98.212.1

SUBNET 192.168.112.0
  IP_MONITOR OFF

SUBNET 172.98.213.0
  IP_MONITOR OFF
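
To see at a glance which subnets are monitored, the SUBNET entries can be summarised with awk. A minimal sketch, assuming the file layout shown above (a SUBNET line followed by its IP_MONITOR line); the function name ip_monitor_status is made up for this example:

```shell
# Print "<subnet> <ON|OFF>" for every SUBNET entry that has an IP_MONITOR line.
# Lines that start with "#" are skipped.
# Usage: ip_monitor_status /etc/cmcluster/<clustername_date>.ascii
ip_monitor_status() {
    awk '
        /^[ \t]*#/          { next }
        $1 == "SUBNET"      { subnet = $2 }
        $1 == "IP_MONITOR"  { print subnet, $2 }
    ' "$1"
}
```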

Here is why the LAN failback from lan16 to lan1 did not work. When IP Monitor is configured, we can choose either the Target Polling method or the Peer Polling method. With either method, IP Monitor uses Internet Control Message Protocol (ICMP) or ICMPv6 to send polling messages (ECHO messages) to target IP addresses and verifies that responses are received. When IP Monitor detects a failure, it marks the network interface down at the IP level.

If primary (PRI) and standby (STBY) cards are configured and IP Monitor is polling from the IP address configured on the PRI card, then when such a failure occurs the LAN interface is marked down at the IP level and the IP address is moved to the STBY card. By design, the polls (the ECHO messages) are from then on sent from the IP address on the STBY card.

If the cause of the failure is then fixed, Serviceguard has no way to fail the IP address back from the STBY to the PRI card: there is no IP address on the PRI from which to send the polls and verify the replies, so it cannot tell that the PRI is healthy again. Even after the problem is resolved, the IP address remains on the STBY card.

If the failure persists, the polls sent from the STBY card also go unanswered once the IP address has moved there, so the STBY card is marked down at the IP level and the subnet goes down. The IP address is NOT removed from the STBY card, however (there is nowhere else to place it, and since the subnet is down anyway it is kept there to check for possible replies). When the cause is fixed and the replies to the ECHO messages are verified, the subnet comes back up on the STBY card, because that is where the IP address is configured.
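
The sequence described above can be followed with a toy model. This is only a sketch of the behaviour as explained in this post, not Serviceguard code: the event names (poll_fail, poll_ok, manual_enable) and the variables are invented for illustration.

```shell
# Toy model: which interface holds the monitored IP, and is the subnet up?
ip_on=lan1     # interface currently holding the IP address
subnet=up      # subnet state as the cluster sees it

ipmon_event() {
    case "$1" in
        poll_fail)
            if [ "$ip_on" = lan1 ]; then
                ip_on=lan16      # primary marked down, IP moves to the standby
            else
                subnet=down      # standby marked down too, IP stays on lan16
            fi ;;
        poll_ok)
            # Polls are sent from the interface holding the IP, so a good
            # reply only brings that interface (and the subnet) back up.
            subnet=up ;;
        manual_enable)
            ip_on=lan1 ;;        # the effect of "cmmodnet -e lan1" in the log
    esac
    echo "$1: ip_on=$ip_on subnet=$subnet"
}
```

Feeding it poll_fail, poll_fail, poll_ok reproduces the 05:28 sequence in the syslog: the IP ends up on lan16 with the subnet up, and nothing in the model moves it back to lan1 until manual_enable is issued.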

The solution is therefore to disable IP monitoring for subnet 172.98.212.0, so that Serviceguard no longer marks the primary interface down at the IP level.

 Issue Setup



MC/ServiceGuard version: A.11.19.00
IP Monitoring was configured on the existing cluster.
SUBNET 172.17.1.0
IP_MONITOR ON
POLLING_TARGET 172.17.1.3
How can IP monitoring be disabled, and can it be done online while the cluster is running?

Solution


IP monitoring can be disabled online while the cluster is running. The relevant cluster parameters can be changed as follows:
SUBNET
Can be changed while the cluster is running; must be removed, with its accompanying IP_MONITOR and POLLING_TARGET entries, if the subnet in question is removed from the cluster configuration.
IP_MONITOR
Can be changed while the cluster is running; must be removed if the preceding SUBNET entry is removed.
POLLING_TARGET
Can be changed while the cluster is running; must be removed if the preceding SUBNET entry is removed.
To temporarily disable IP monitoring for the subnet, modify the cluster ASCII file as shown below, then check and apply the configuration:
SUBNET 172.17.1.0
IP_MONITOR OFF
# POLLING_TARGET 172.17.1.3
NOTE: The POLLING_TARGET 172.17.1.3 entry should be removed or commented out when IP_MONITOR is set to OFF.
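
The edit above can also be scripted rather than made by hand. A minimal sketch, assuming each SUBNET line is followed by its own IP_MONITOR and POLLING_TARGET lines as in the example; the function name ip_monitor_off is made up for this post. It writes the modified file to standard output and leaves the input untouched, so the result can be reviewed before it is checked and applied:

```shell
# Set IP_MONITOR to OFF and comment out POLLING_TARGET for one subnet.
# Entries belonging to other subnets are printed unchanged.
# Usage: ip_monitor_off 172.17.1.0 cluster.ascii > cluster_new.ascii
ip_monitor_off() {
    awk -v target="$1" '
        $1 == "SUBNET"                              { current = $2 }
        current == target && $1 == "IP_MONITOR"     { sub(/ON[ \t]*$/, "OFF") }
        current == target && $1 == "POLLING_TARGET" { $0 = "# " $0 }
        { print }
    ' "$2"
}
```

The substitution is anchored to the end of the line because the keyword IP_MONITOR itself contains the letters "ON".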
To permanently remove IP monitoring for the subnet from the cluster configuration, remove the following entries (shown commented out here) from the cluster ASCII file, then check and apply the configuration:
# SUBNET 172.17.1.0
# IP_MONITOR ON
# POLLING_TARGET 172.17.1.3
Steps:
  1. Get the running cluster configuration with cmgetconf:
       # cmgetconf /etc/cmcluster/<clustername_date>.ascii
  2. Modify /etc/cmcluster/<clustername_date>.ascii as required, as described above.
  3. Run cmcheckconf to check for errors:
       # cmcheckconf -v -C /etc/cmcluster/<clustername_date>.ascii
  4. If cmcheckconf reports no errors, run cmapplyconf:
       # cmapplyconf -v -C /etc/cmcluster/<clustername_date>.ascii
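
The steps above can be wrapped in a small helper so that cmapplyconf only runs when cmcheckconf succeeds. This is a sketch, not a tested production script: the function names and the DRY_RUN switch are made up for this example, and only the command options already shown in the steps are used. With DRY_RUN=1 (the default here) the commands are printed instead of executed.

```shell
# DRY_RUN=1 prints the commands; set DRY_RUN=0 on a cluster node to run them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Step 1: save the running cluster configuration to the given file.
fetch_config() {
    run cmgetconf "$1"
}

# Steps 3 and 4: check the edited file and apply it only if the check passes.
check_and_apply() {
    run cmcheckconf -v -C "$1" && run cmapplyconf -v -C "$1"
}

# Example (the file name is a placeholder):
#   fetch_config /etc/cmcluster/<clustername_date>.ascii
#   ... edit the file as in step 2 ...
#   check_and_apply /etc/cmcluster/<clustername_date>.ascii
```

Step 2, the edit itself, is deliberately left as a manual step between the two calls so the change can be reviewed before it is applied.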
NOTE: When disabling or deleting IP monitoring for the subnet, cmcheckconf and cmapplyconf display the following messages:
Setting IP_MONITOR to OFF for SUBNET 172.17.1.0 while cluster is running.
Removing POLLING_TARGET 172.17.1.3 from SUBNET 172.17.1.0 while cluster is running.