Failover clustering doesn’t mean 100% uptime

Published on : Mar 8, 2012

Category : General

Author : Saravana

When we discuss high availability scenarios, we typically think of Windows failover clustering. In the BizTalk world we commonly use failover clustering in a few places. The first and foremost is SQL Server clustering, followed by clustering of other resources such as host instances, Enterprise Single Sign-On, etc.

One common misconception about failover clustering is that 100% uptime is guaranteed and that failover is seamless. In reality, a failover cluster simply reduces the time it takes to bring the service back up. There will still be a brief period during which the dependent service is unavailable. The big advantage of failover clustering is that this period can be just a few seconds, instead of the minutes or hours it might take to bring the resource online manually.

Let's dive a bit deeper into the issue with a SQL Server clustering scenario. When a SQL Server instance is clustered, sessions do not stay connected to SQL Server during failover. Although failing over is quicker than a reboot, the instance must shut down and start again, dropping all connections. All the normal recovery processes that SQL Server goes through, rolling back uncommitted transactions and writing committed transactions to disk, also happen during failover. Any scheduled jobs that are running when failover occurs do not start back up when SQL Server restarts. The time SQL Server takes to fail over is similar to restarting it on the same server without a reboot.

The good thing about clustering is that in case of a hardware failure, downtime is only seconds or minutes. The application can be up and running quickly after a failover. But any transactions that were uncommitted at the time of failover are rolled back, and that work is lost unless the application handles it properly.

BizTalk Server is designed with this in mind, and the majority of the time you will be able to recover from this failure, either automatically or by instances getting suspended and someone manually resuming them. But if you are dealing with the database directly, you need to keep this restriction in mind. For the same reason, you should not fail over your cluster manually during peak business hours. Failover clustering is there either for a controlled failover during scheduled outage hours or for a disaster such as a hardware failure.
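For applications that talk to the database directly, one common way to cope with connections dropped during failover is to run each unit of work in its own transaction and retry the whole unit on a fresh connection. The Python sketch below only illustrates that idea and is not taken from BizTalk or from any SQL Server driver. `TransientConnectionError`, `run_in_transaction_with_retry`, `connect` and `work` are hypothetical names; a real driver raises its own exception types, which you would catch instead. It also assumes the unit of work is safe to repeat, because a connection that drops during commit leaves it unclear whether the commit was applied.

```python
import time


class TransientConnectionError(Exception):
    """Hypothetical stand-in for the error a real driver raises
    when the connection is dropped (for example, during failover)."""


def run_in_transaction_with_retry(connect, work, max_attempts=5,
                                  base_delay=1.0, sleep=time.sleep):
    """Run work(conn) in a transaction, retrying on dropped connections.

    connect: callable returning a new connection with commit() and close()
             (an assumed, DB-API-like shape).
    work:    callable that performs the whole unit of work on that connection.
             It should be safe to repeat, since it may run more than once.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        conn = None
        try:
            # A fresh connection per attempt: the old session does not
            # survive the failover, so it cannot be reused.
            conn = connect()
            result = work(conn)
            conn.commit()
            return result
        except TransientConnectionError as exc:
            # Uncommitted work was rolled back by the server during
            # recovery, so the whole unit is retried from the start.
            last_error = exc
        finally:
            if conn is not None:
                try:
                    conn.close()
                except Exception:
                    pass  # the connection may already be dead
        if attempt < max_attempts:
            # Exponential backoff gives the instance time to come online.
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(
        f"unit of work failed after {max_attempts} attempts"
    ) from last_error
```

With the defaults, the waits between attempts are 1, 2, 4 and 8 seconds, which covers a failover that completes within a few seconds. The right values depend on how long failover actually takes in your environment.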