WHAT HAPPENS WHEN A NODE TIMES OUT
Each node sends a heartbeat message to all other nodes at an interval equal to one-fourth of the configured MEMBER_TIMEOUT value or 1 second, whichever is less.
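The interval rule can be written out as a small calculation. This is only an illustrative sketch, not Serviceguard code: the function name and the constant are hypothetical, and it assumes MEMBER_TIMEOUT is expressed in microseconds, as in the failure-detection description in this post.

```python
ONE_SECOND_US = 1_000_000  # 1 second in microseconds (illustrative constant)


def heartbeat_interval_us(member_timeout_us: int) -> int:
    """Return the heartbeat interval in microseconds.

    Rule from the text: one-fourth of MEMBER_TIMEOUT or 1 second,
    whichever is less. The function name is hypothetical.
    """
    return min(member_timeout_us // 4, ONE_SECOND_US)


if __name__ == "__main__":
    # A 14-second timeout is capped at 1 second between heartbeats.
    print(heartbeat_interval_us(14_000_000))
    # A 2-second timeout gives a heartbeat every half second.
    print(heartbeat_interval_us(2_000_000))
```

With a short timeout the quarter rule applies; with a long timeout the 1-second cap takes over.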
When a node detects that another node has failed (that is, no heartbeat message has arrived within MEMBER_TIMEOUT microseconds), the following sequence of events occurs:
1. The node contacts the other nodes and tries to re-form the cluster without the failed node.
2. If the remaining nodes are a majority or can obtain the cluster lock, they form a new cluster without the failed node.
3. If the remaining nodes are not a majority and cannot get the cluster lock, they halt (system reset).
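The decision in the sequence above comes down to one test: the surviving group re-forms if it holds a majority or wins the cluster lock, and halts otherwise. The sketch below models only that test. The function and parameter names are hypothetical, and the real product's membership protocol is more involved than this.

```python
def reformation_outcome(surviving_nodes: int,
                        previous_cluster_size: int,
                        lock_acquired: bool) -> str:
    """Decide what a group of surviving nodes does after a timeout.

    Returns "re-form" if the group may form a new cluster and "halt"
    if it must reset. All names here are illustrative assumptions.
    """
    # Strict majority: more than half of the previous membership.
    has_majority = surviving_nodes * 2 > previous_cluster_size
    if has_majority or lock_acquired:
        return "re-form"
    return "halt"


if __name__ == "__main__":
    print(reformation_outcome(2, 3, lock_acquired=False))  # majority
    print(reformation_outcome(1, 2, lock_acquired=True))   # tie, lock won
    print(reformation_outcome(1, 2, lock_acquired=False))  # tie, lock lost
```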
[Figure: healthy node status]
[Figure: node status in case of failure]
EXAMPLE SITUATION.
Assume a two-node cluster, with Package1 running on JUPITOR and Package2 running on EARTH. Volume group vg01 is exclusively activated on JUPITOR; volume group vg02 is exclusively activated on EARTH. Package IP addresses are assigned to JUPITOR and EARTH respectively.
FAILURE.
Only one LAN has been configured for both heartbeat and data traffic. During the course of operations, heavy application traffic monopolizes the bandwidth of the network, preventing heartbeat packets from getting through.
Since JUPITOR does not receive heartbeat messages from EARTH, JUPITOR attempts to re-form as a one-node cluster. Likewise, since EARTH does not receive heartbeat messages from JUPITOR, EARTH also attempts to re-form as a one-node cluster.
ELECTION PROCESS:
During the election protocol, each node votes for itself, giving both nodes 50 percent of the vote. Because both nodes have 50 percent of the vote, both nodes now vie for the cluster lock. Only one node will get the lock.
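The election for this two-node example can be simulated in a few lines. This is a toy model under stated assumptions: the ClusterLock class, the run_election function, and the order in which nodes reach the lock are all hypothetical, and the real cluster lock is an external resource rather than an in-memory object.

```python
from typing import Dict, List, Optional


class ClusterLock:
    """Toy tie-breaker: the first node to ask becomes the owner."""

    def __init__(self) -> None:
        self.owner: Optional[str] = None

    def try_acquire(self, node: str) -> bool:
        # Only one node can ever hold the lock.
        if self.owner is None:
            self.owner = node
            return True
        return False


def run_election(nodes_in_arrival_order: List[str],
                 previous_cluster_size: int,
                 lock: ClusterLock) -> Dict[str, str]:
    """Each isolated node votes only for itself, then may try the lock."""
    results: Dict[str, str] = {}
    for node in nodes_in_arrival_order:
        votes = 1  # the node can only hear itself
        if votes * 2 > previous_cluster_size:
            results[node] = "re-form"   # outright majority
        elif lock.try_acquire(node):
            results[node] = "re-form"   # no majority, but won the lock
        else:
            results[node] = "halt"      # no majority and no lock
    return results


if __name__ == "__main__":
    # Assumption: JUPITOR happens to reach the lock first.
    print(run_election(["JUPITOR", "EARTH"], 2, ClusterLock()))
```

Swapping the arrival order swaps the result, which is the point: with a 50/50 vote, the lock alone decides which node survives.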
OUTCOME.
Assume JUPITOR gets the cluster lock. JUPITOR re-forms as a one-node cluster. After re-formation, JUPITOR will make sure all applications configured to run on an existing cluster node are running. When JUPITOR discovers that Package2 is not running in the cluster, it will try to start Package2 if Package2 is configured to run on JUPITOR.
EARTH recognizes that it has failed to get the cluster lock and so cannot re-form the cluster. To release all resources related to Package2 (such as exclusive access to volume group vg02 and the Package2 IP address) as quickly as possible, EARTH halts (system reset).
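The outcome can be sketched as a pass over the package list after re-formation: any package whose node is no longer a cluster member is started on the survivor, provided the package is configured to run there. The Package class, its fields, and recover_packages are hypothetical names used for illustration, not part of any real Serviceguard interface.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Package:
    name: str
    configured_nodes: List[str]        # nodes the package may run on (assumed field)
    running_on: Optional[str] = None   # node currently running the package


def recover_packages(survivor: str,
                     members: Set[str],
                     packages: List[Package]) -> List[str]:
    """Start orphaned packages on the survivor; return the names started."""
    started: List[str] = []
    for pkg in packages:
        if pkg.running_on in members:
            continue  # still running on a live cluster member
        # The old node halted, so its hold on the package is gone.
        pkg.running_on = None
        if survivor in pkg.configured_nodes:
            pkg.running_on = survivor
            started.append(pkg.name)
    return started


if __name__ == "__main__":
    packages = [
        Package("Package1", ["JUPITOR", "EARTH"], running_on="JUPITOR"),
        Package("Package2", ["EARTH", "JUPITOR"], running_on="EARTH"),
    ]
    # JUPITOR won the lock and is now the only member.
    print(recover_packages("JUPITOR", {"JUPITOR"}, packages))
```

The model takes it for granted that EARTH has already reset. That is why the halt matters: JUPITOR can only activate vg02 and take over the Package2 IP address once EARTH has released them.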