SQL Server Always On Availability Groups More Resilient to Transient Network Issues
Always On health detection relies on a few mechanisms:
Resource DLL (RHS), which determines the IsAlive value at the cluster heartbeat interval, and is controlled by CrossSubnetDelay and SameSubnetDelay cluster properties
sp_server_diagnostics, which reports the component health on an interval controlled by the HealthCheckTimeout property
Lease mechanism, which is used as a Looks-Alive between the cluster resource host and the SQL processes
Session Timeout, which detected the “soft” errors / small timeouts or insufficient resources
This also affects the automatic seeding timeout
The mechanisms above are controlled by properties that can be adjusted at the cluster level and/or the AG level:
Getting cluster properties values
To view the current cluster values, open an elevated Powershell terminal and run the following:
#display current cluster properties
Get-Cluster | fl CrossSubnetDelay, CrossSubnetThreshold, SameSubnetDelay , SameSubnetThreshold
#display current AG role properties
Get-ClusterResource <yourAG>| Get-ClusterParameter HealthCheckTimeout, LeaseTimeout
To view the current AG role properties, execute the below SQL statement in SSMS
select ag.name, arcn.replica_server_name, arcn.node_name, ars.role, ars.role_desc, ars.connected_state_desc, ars.synchronization_health_desc, ar.availability_mode_desc, ag.failure_condition_level,ar.failover_mode_desc, ar.session_timeout
from sys.availability_replicas ar with (nolock)
inner join sys.dm_hadr_availability_replica_states ars with (nolock) on ars.replica_id=ar.replica_id and ars.group_id=ar.group_id
inner join sys.availability_groups ag with (nolock) on ag.group_id=ar.group_id
inner join sys.dm_hadr_availability_replica_cluster_nodes arcn with (nolock) on arcn.group_name=a
Making the cluster / AG more resilient
The default/maximum network downtime that AG can absorb without being impacted is 10 seconds (1/2 of LeaseTimeout and 1/3 of HealthCheckTimeout).
In order to make the cluster/AG roles more resilient and able to absorb more network downtime without a failover, you can increase the cluster and AG properties, taking into account the relationship between the properties’ values:
For example, to increase the supported network downtime from 10 to 20 seconds:
LeaseTimeout – change from 20000 to 40000 (40 seconds)
HealthCheckTimeout – change from 30000 to 60000 (60 seconds)
SameSubnetThreshold – change from 10 to 20
Session Timeout – change from 10 to 20 seconds
You may use the below Powershell script to change the cluster level properties:
$crossSubnetDelayOptimal = 1000;
$crossSubnetThresholdOptimal = 20
$sameSubnetDelayOptimal = 1000
$sameSubnetThresholdOptimal = 20
$healthCheckTimeoutOptimal = 60000
$leaseTimeoutOptimal = 40000
Write-Host "Check cluster heartbeat timeouts"
$cluster = get-cluster
$crossSubnetDelay = $cluster.CrossSubnetDelay
$crossSubnetThreshold = $cluster.CrossSubnetThreshold
$sameSubnetDelay = $cluster.SameSubnetDelay
$sameSubnetThreshold = $cluster.SameSubnetThreshold
if($crossSubnetDelay -ne $crossSubnetDelayOptimal)
{
Write-Host "Cluster option CrossSubnetDelay changed from $crossSubnetDelay to $crossSubnetDelayOptimal"
$cluster.CrossSubnetDelay = $crossSubnetDelayOptimal
}
if($crossSubnetThreshold -ne $crossSubnetThresholdOptimal)
{
Write-Host "Cluster option CrossSubnetThreshold changed from $crossSubnetThreshold to $crossSubnetThresholdOptimal"
$cluster.CrossSubnetThreshold = $crossSubnetThresholdOptimal
}
if($sameSubnetDelay -ne $sameSubnetDelayOptimal)
{
Write-Host "Cluster option SameSubnetDelay changed from $sameSubnetDelay to $sameSubnetDelayOptimal"
$cluster.SameSubnetDelay = $sameSubnetDelayOptimal
}
if($sameSubnetThreshold -ne $sameSubnetThresholdOptimal)
{
Write-Host "Cluster option SameSubnetThreshold changed from $sameSubnetThreshold to $sameSubnetThresholdOptimal"
$cluster.SameSubnetThreshold = $sameSubnetThresholdOptimal
}
Write-Host "Check cluster resource properties"
$resources = Get-ClusterResource
ForEach($resource in $resources)
{
if($resource.ResourceType -eq "SQL Server Availability Group")
{
$name = $resource.Name
$healthCheckTimeout = (Get-ClusterParameter -Name HealthCheckTimeout -InputObject $resource).Value
$leaseTimeout = (Get-ClusterParameter -Name LeaseTimeout -InputObject $resource).Value
if($healthCheckTimeout -ne $healthCheckTimeoutOptimal)
{
Write-Host "$name - HealthCheckTimeout - Changed from $healthCheckTimeout to $healthCheckTimeoutOptimal"
Set-ClusterParameter -Name HealthCheckTimeout -InputObject $resource -Value $healthCheckTimeoutOptimal
}
if($leaseTimeout -ne $leaseTimeoutOptimal)
{
Write-Host "$name - LeaseTimeout - Changed from $leaseTimeout to $leaseTimeoutOptimal"
Set-ClusterParameter -Name LeaseTimeout -InputObject $resource -Value $leaseTimeoutOptimal
}
}
}
To change the Session Timeout value, connect to the SQL instance and execute the below statement on all the AG replicas:
ALTER AVAILABILITY GROUP <AG name> MODIFY REPLICA ON '<replica name>' WITH (SESSION_TIMEOUT = 20);
Microsoft guidelines
Microsoft compiled guidelines for timeout values, their root causes and outcomes.
Resources: