Cluster does not restart SQL group after network failure

All,
We have just rebuilt a SQL 7.0/NT cluster as a Windows 2003/SQL 2000 cluster
in an active/passive configuration using 2 nodes. While testing it, we had a
general network failure. The virtual SQL and Windows IP address resources
went down and did not come back up automatically once the network was
available again. The nodes are configured for automatic failback.
I can't imagine that in the 2 1/2 years the original cluster was running
we never once had the network go down, but I do know that during that
time I never had an outage where I had to manually move the cluster group
(which causes the cluster to re-initialize both resources and brings
everything back to normal).
I'm thinking that maybe I'm missing a dependency somewhere or something's
changed between NT and 2003 that I'm not accounting for. Anyone seen this or
have any tips? Thanks in advance!
-Dan
Nope, that is pretty much expected behavior. The cluster manager will try
to restart the resources on each possible node until the retry count is
exhausted. Unfortunately, until the network resource is restored, no node
has the ability to run the SQL group. With the physical network port
offline, the IP address(es) will not come online. Nothing dependent on them
will come online, including the Network Name and the SQL Server. If the
network comes back before the retry timeout and count are exhausted, the
cluster will bring the system online. Otherwise it stays down.
Geoff N. Hiten
Microsoft SQL Server MVP
Senior Database Administrator
Careerbuilder.com
I support the Professional Association for SQL Server
www.sqlpass.org
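The retry behavior described above can be illustrated with a small sketch. This is a deliberately simplified model, not the actual cluster service logic: the function name and the parameters `network_back_at`, `retry_interval`, and `max_attempts` are hypothetical and do not correspond to real cluster property names. It assumes one restart attempt per interval and that an attempt succeeds only if the network is already back.

```python
from typing import Optional


def time_group_comes_online(network_back_at: int,
                            retry_interval: int,
                            max_attempts: int) -> Optional[int]:
    """Return the time (in seconds) at which the group comes online,
    or None if every restart attempt is used up first.

    Simplified, hypothetical model: attempts happen at
    retry_interval, 2 * retry_interval, ..., max_attempts * retry_interval.
    """
    for attempt in range(1, max_attempts + 1):
        t = attempt * retry_interval
        # The IP address (and everything depending on it) can only
        # come online once the network itself is available again.
        if t >= network_back_at:
            return t
    # Retries exhausted: the group stays offline until someone intervenes.
    return None


print(time_group_comes_online(100, 60, 3))  # 120
print(time_group_comes_online(600, 60, 3))  # None
```

In the first call the network returns after 100 seconds, so the second attempt succeeds. In the second call the outage outlasts all three attempts, which matches the situation in the original post: the group stays down and has to be brought online manually.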
Geoff,
Thanks for the post! I guess I'll just have to make sure the retry count and
timeout are set high.
-Dan
Be careful adjusting those numbers. Too high can cause just as many
problems as too low. Given how infrequent such network outages are and the
fact that something like that will NEVER go unnoticed, I would not change
anything. The cluster failover is designed to reduce the typical 30-45
minute human response time for a down server. You shouldn't expect the
clustering software to deal with anything beyond that scope. Adjusting the
parameters to try to expand that coverage will only expose a gap somewhere
else. Just document a cluster check as part of your network failure
recovery procedure and you will be fine.
Geoff N. Hiten

Tuesday, March 20, 2012
