UnixPedia : HPUX / LINUX / SOLARIS

Monday, May 28, 2018

WHAT HAPPENS WHEN A NODE TIMES OUT

Each node sends a heartbeat message to all other nodes at an interval equal to one-fourth of the
value of the configured MEMBER_TIMEOUT or 1 second, whichever is less.
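
As a minimal sketch of the rule above, the heartbeat interval can be computed as follows. The function name and the sample values are hypothetical, chosen only for illustration; the sketch assumes MEMBER_TIMEOUT is expressed in microseconds, as in the text below.

```python
ONE_SECOND_US = 1_000_000  # 1 second expressed in microseconds


def heartbeat_interval_us(member_timeout_us: int) -> int:
    """Return the heartbeat interval in microseconds.

    The interval is one-fourth of MEMBER_TIMEOUT or 1 second,
    whichever is less.
    """
    return min(member_timeout_us // 4, ONE_SECOND_US)


# Hypothetical sample values:
print(heartbeat_interval_us(14_000_000))  # one-fourth is 3.5 s, so the 1 s cap applies
print(heartbeat_interval_us(2_000_000))   # one-fourth is 0.5 s, below the cap
```

With a large MEMBER_TIMEOUT the 1-second cap applies; with a small one, the one-fourth value is used.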

When a node detects that another node has failed (that is, no heartbeat message has arrived
within MEMBER_TIMEOUT microseconds), the following sequence of events occurs:

1. The node contacts the other nodes and tries to re-form the cluster without the failed node.
2. If the remaining nodes are a majority or can obtain the cluster lock, they form a new cluster
without the failed node.
3. If the remaining nodes are not a majority or cannot get the cluster lock, they halt (system reset).
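
The decision in steps 2 and 3 can be sketched as a small function. This is only an illustration of the rule as stated above, not Serviceguard code; the function and parameter names are hypothetical, and "majority" is assumed to mean more than half of the nodes in the previous cluster.

```python
def reformation_decision(remaining_nodes: int,
                         previous_cluster_size: int,
                         lock_obtained: bool) -> str:
    """Decide what the remaining nodes do after a member times out."""
    # Assumption: a majority is strictly more than half of the previous cluster.
    has_majority = remaining_nodes * 2 > previous_cluster_size
    if has_majority or lock_obtained:
        return "form new cluster"
    # Neither a majority nor the cluster lock: halt (system reset).
    return "halt"


print(reformation_decision(3, 4, lock_obtained=False))  # majority, no lock needed
print(reformation_decision(1, 2, lock_obtained=True))   # exactly half, lock breaks the tie
print(reformation_decision(1, 2, lock_obtained=False))  # exactly half, lock lost
```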

HEALTHY NODE STATUS:


IN CASE OF FAILURE:

EXAMPLE
SITUATION.
Assume a two-node cluster, with Package1 running on JUPITOR and Package2 running
on EARTH. Volume group vg01 is exclusively activated on JUPITOR; volume group vg02 is
exclusively activated on EARTH. Package IP addresses are assigned to JUPITOR and EARTH
respectively.

FAILURE.
Only one LAN has been configured for both heartbeat and data traffic. During the course
of operations, heavy application traffic monopolizes the bandwidth of the network, preventing
heartbeat packets from getting through.

Since JUPITOR does not receive heartbeat messages from EARTH, JUPITOR attempts to re-form
as a one-node cluster. Likewise, since EARTH does not receive heartbeat messages from
JUPITOR, EARTH also attempts to re-form as a one-node cluster.

ELECTION PROCESS:
During the election protocol, each node votes for itself, giving both nodes 50 percent of the vote.
Because both nodes have 50 percent of the vote, both nodes now vie for the cluster lock.
Only one node will get the lock.

OUTCOME.
Assume JUPITOR gets the cluster lock. JUPITOR re-forms as a one-node cluster. After
re-formation, JUPITOR will make sure all applications configured to run on an existing cluster
node are running. When JUPITOR discovers that Package2 is not running in the cluster, it will try to
start Package2 if Package2 is configured to run on JUPITOR.
EARTH recognizes that it has failed to get the cluster lock and so cannot re-form the cluster. To
release all resources related to Package2 (such as exclusive access to volume group vg02 and
the Package2 IP address) as quickly as possible, EARTH halts (system reset).
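
The election and its outcome can be simulated with a short sketch. Everything here is hypothetical and for illustration only: the ClusterLock class, the run_election function, and the arrival_order parameter (which stands in for whichever node happens to reach the lock first) are not part of any real cluster software.

```python
class ClusterLock:
    """A tie-breaker that only the first requesting node can obtain."""

    def __init__(self):
        self.owner = None

    def try_acquire(self, node: str) -> bool:
        if self.owner is None:
            self.owner = node
            return True
        return self.owner == node


def run_election(nodes, arrival_order):
    """Each node votes only for itself, then races for the lock if needed."""
    lock = ClusterLock()
    vote_share = 1 / len(nodes)  # every node holds an equal share of the vote
    outcome = {}
    for node in arrival_order:
        if vote_share > 0.5 or lock.try_acquire(node):
            outcome[node] = "re-form"
        else:
            outcome[node] = "halt"  # lost the lock: system reset
    return outcome


# Assumption: JUPITOR reaches the cluster lock first, as in the example.
print(run_election(["JUPITOR", "EARTH"], arrival_order=["JUPITOR", "EARTH"]))
# {'JUPITOR': 're-form', 'EARTH': 'halt'}
```

Reversing arrival_order gives the opposite result: EARTH re-forms the cluster and JUPITOR halts.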
