In this section, you consider active and passive
nodes, the shared disk array, the quorum, public and private networks, and the
cluster server. Then, you learn how a failover works.
Active Nodes Versus
Passive Nodes
A Windows Failover Cluster can support up to
sixteen nodes; however, most clustering deployment is only two nodes. A single
SQL Server 2012 instance can run on only a single node at a time; and should a
failover occur, the failed instance can failover to another node. Clusters of
three or more physical nodes should be considered when you need to cluster many
SQL Server instances.
In a two-node Windows Failover Cluster with SQL
Server, one of the physical nodes is considered the active node, and the second
one is the passive node for that single SQL Server instance. It doesn’t matter
which of the physical servers in the cluster is designated as active or
passive, but you should specifically assign one node as the active and the
other as the passive. This way, there is no confusion about which physical
server is performing which role at the current time.
When referring to an active node, this
particular node is currently running a SQL Server instance accessing that
instance’s databases, which are located on a shared disk array.
When referring to a passive node, this
particular node is not currently running the SQL Server. When a node is
passive, it is not running the production databases, but it is in a state of
readiness. If the active node fails and a failover occurs, the passive node
automatically runs production databases and begins serving user requests. In
this case, the passive node has become active, and the formerly active node
becomes the passive node (or the failed node, if a failure occurs that prevents
it from operating).
Shared Disk Array
Standalone SQL Server instances usually store
their databases on local disk storage or non-shared disk storage; clustered SQL
Server instances store data on a shared disk array. Shared means that all nodes
of the Windows Failover Cluster are physically connected to the shared disk
array, but only the active node can access that instance’s databases. To ensure
the integrity of the databases, both nodes of a cluster never access the shared
disk at the same time.
Generally speaking, a shared disk array can be
an iSCSI, a fiber-channel, SAS connected, a RAID 1, a RAID 5, or a RAID 10 disk
array housed in a standalone unit, or a SAN. This shared disk array must have
at least two logical disk partitions. One partition is used for storing the
clustered instance’s SQL Server databases, and the other is used for the quorum
drive, if a quorum drive is used. Additionally, you need a third logical
partition if you choose to cluster MSDTC.
The Quorum
When both cluster nodes are up and running and
participating in their respective active and passive roles, they communicate
with each other over the network. For example, if you change a configuration
setting on the active node, this configuration is propagated automatically, and
quickly, to the passive node, thereby ensuring synchronization.
As you might imagine, though, you can make a
change on the active node and have it fail before the change is sent over the
network and made on the passive node. In this scenario, the change is never
applied to the passive node. Depending on the nature of the change, this could
cause problems, even causing both nodes of the cluster to fail.
To prevent this change from happening, a Windows
Failover Cluster employs a quorum. A quorum is essentially a log file, similar
in concept to database logs. Its purpose is to record any change made on the
active node. This way, should any recorded change not get to the passive node
because the active node has failed and cannot send the change to the passive
node over the network, the passive node, when it becomes the active node, can
read the quorum log file to find out what the change was. The passive node can
then make the change before it becomes the new active node. If the state of
this drive is compromised, your cluster may become inoperable.
In effect, each cluster quorum can cast one
“vote,” where the majority of total votes (based on the number of these cluster
quorums that are online) determine whether the cluster continues running on the
cluster node. This prevents more than one cluster node attempting to take
ownership of the same SQL Server instance. The voting quorums are cluster nodes
or, in some cases, a disk witness or file share witness. Each voting cluster
quorum (with the exception of a file share witness) contains a copy of the
cluster configuration. The cluster service works to keep all copies
synchronized at all times.
Following are the four supported Windows
Failover Cluster quorum modes:
·
Node Majority: Each node
that is available and in communication can vote. The cluster functions only
with a majority of the votes.
·
Node and Disk Majority:
Each node plus a designated disk in the cluster storage (the “disk witness”)
can vote, whenever they are available and in communication. The cluster
functions only with a majority of the votes.
·
Node and File Share
Majority: Each node plus a designated file share created by the administrator
(the “file share witness”) can vote, whenever they are available and in
communication. The cluster functions only with a majority of the votes.
·
No Majority: Disk Only:
The cluster has a quorum if one node is available and in communication with a
specific disk in the cluster storage. Only the nodes that are also in
communication with that disk can join the cluster. The disk is the single point
of failure, so use highly reliable storage. A quorum drive is a logical drive
on the shared disk array dedicated to storing the quorum and as a best practice
should be around 1GB of fault tolerant disk storage.
With two-node clusters
Disk only is the most often used quorum configuration, commonly known as the
quorum disk. The quorum configuration can be switched after the cluster has
been deployed based on the number of clustered nodes and user requirements.
While in clusters with greater than two nodes, the other three quorum modes are
more commonly used.
Public and Private
Networks
Each node of a cluster must have at least two
network cards to be a fully supported installation. One network card is
connected to the public network, and the other network card will be connected
to a private cluster network.
The public network is the network to which the
client applications connect. This is how they communicate to a clustered SQL
Server instance using the clustered IP address and clustered SQL Server name.
It is recommended to have two teamed network cards for the public network for
redundancy and to improve availability.
·
The private network is
used solely for communications between the clustered nodes. It is used mainly
for the heartbeat communication. Two forms of communications are executed:
·
LooksAlive: Verifies
that the SQL Server service runs on the online node every 5 seconds by default
·
IsAlive: Verifies that
SQL Server accepts connections by executing sp_server_diagnostics.
This health detection logic determines if a node
is down and the passive node then takes over the production workload.
The SQL Server Instance
Surprisingly, SQL Server client applications
don’t need to know how to switch communicating from a failed cluster node to
the new active node or anything else about specific cluster nodes (such as the
NETBIOS name or IP address of individual cluster nodes). This is because each
clustered SQL Server instance is assigned a Network name and IP address, which
client applications use to connect to the clustered SQL Server. In other words,
client applications don’t connect to a node’s specific name or IP address but
instead to the cluster SQL network name or cluster SQL IP address that stays
consistent and fails over. Each clustered SQL Server will belong to a Failover
Cluster Resource Group that contains the following resources that will fail
together:
·
SQL Server Network Name
·
IP Address
·
One or more shared disks
·
SQL Server Database
Engine service
·
SQL Server Agent
·
SQL Server Analysis
Services, if installed in the same group
·
One file share resource,
if the FILESTREAM feature is installed
How a Failover Works
Assume that a single SQL Server 2012 instance
runs on the active node of a cluster and that a passive node is available to
take over when needed. At this time, the active node communicates with both the
database and the quorum on the shared disk array. Because only a single node at
a time can access the shared disk array, the passive node does not access the
database or the quorum. In addition, the active node sends out heartbeat
signals over the private network, and the passive node monitors them, so it can
take over if a failover occurs. Clients are also interacting with the active
node via the clustered SQL Server name and IP address while running production
workloads.
Now assume that the active node stops working
because of a power failure. The passive node, which is monitoring the
heartbeats from the active node, notices that the heartbeats stopped. After a
predetermined delay, the passive node assumes that the active node has failed
and initiates a failover. As part of the failover process, the passive node
(now the active node) takes over control of the shared disk array and reads the
quorum, looking for any unsynchronized configuration changes. It also takes
over control of the clustered SQL Server name and IP address. In addition, as
the node takes over the databases, it has to perform a SQL Server startup and
recover the databases.
The time this takes depends on many factors,
including the performance of the hardware and the number of transactions that
might have to be rolled forward or back during the database recovery process.
When the recovery process is complete, the new active node announces itself on
the network with the clustered SQL Server name and IP address, which enables
the client applications to reconnect and begin using the SQL Server 2012
instance after this minimal interruption.