SQL Server 2025 High Availability has reached a new frontier with Microsoft's most advanced release to date. Whether you are a database administrator architecting enterprise workloads, a cloud engineer designing multi-region deployments, or a DevOps practitioner integrating SQL Server into Kubernetes, understanding
SQL Server 2025 Maximum Availability is no longer optional — it is a business imperative. Downtime costs enterprises an average of $5,600 per minute, and SQL Server 2025 introduces a portfolio of high availability technologies designed to bring that number as close to zero as possible.
In this comprehensive guide, we explore every pillar of
SQL Server 2025 High Availability architecture — from Always On Availability Groups and Failover Cluster Instances to Distributed Availability Groups, Contained Availability Groups, and Kubernetes-native HA. We walk through real-world T-SQL configurations, best practices for RPO/RTO optimization, and architectural patterns used by MinervaDB engineers in production environments.
What Is SQL Server 2025 Maximum Availability?
Maximum Availability in SQL Server 2025 refers to the comprehensive set of features, configurations, and architectural patterns that together eliminate or minimize unplanned and planned downtime. Microsoft defines
Maximum Availability across four dimensions:
- High Availability (HA): Automated local failover within a datacenter — primarily via Always On Availability Groups (AAG) and Failover Cluster Instances (FCI).
- Disaster Recovery (DR): Geo-redundant failover across datacenters and cloud regions using Distributed Availability Groups and Log Shipping.
- Business Continuity (BC): Keeping reads and secondary workloads alive during maintenance using readable secondaries and Availability Group offloading.
- Online Operations: Zero-downtime schema changes, online index rebuilds, and Accelerated Database Recovery (ADR) — greatly enhanced in SQL Server 2025.
SQL Server 2025 High Availability: New Features Overview
SQL Server 2025 builds on the solid HA foundation of SQL Server 2022 and introduces game-changing capabilities that redefine what is possible for enterprise database availability:
- Contained Availability Groups with Full msdb Integration: System databases (msdb, SQL Agent jobs, linked servers, logins) are now fully contained inside the AG, eliminating the manual synchronization that plagued earlier releases.
- Automatic Seeding Enhancements: Faster parallel seeding with bandwidth throttling and progress monitoring via
sys.dm_hadr_automatic_seeding.
- Intelligent Failover Policy: Machine-learning-based health detection that reduces spurious failovers caused by transient I/O spikes.
- Kubernetes HA Operator (GA): The SQL Server Operator for Kubernetes reaches general availability in 2025, enabling StatefulSet-based AG deployments with automated failover on pod and node failure.
- Cross-Region Azure Availability Zones Support: Native multi-zone synchronous commit replicas that survive AZ-level outages with zero data loss.
- Accelerated Database Recovery (ADR) v3: Transaction log-based version store improvements reduce recovery time after a crash to under 10 seconds for most workloads.
Always On Availability Groups in SQL Server 2025
Always On Availability Groups (AAG) remain the cornerstone of SQL Server 2025 High Availability. An AG is a set of user databases — the availability databases — that fail over together. Each AG has one
primary replica that accepts read-write connections, and up to eight
secondary replicas (nine total) that can serve read-only queries.
Synchronous vs. Asynchronous Commit Mode
The choice of commit mode directly determines your
Recovery Point Objective (RPO):
- Synchronous Commit: The primary waits for the secondary to harden the log before committing. RPO = 0 (zero data loss). Ideal for replicas within a low-latency LAN or the same Azure region.
- Asynchronous Commit: The primary commits without waiting. RPO equals some data loss corresponding to the replication lag. Ideal for cross-region DR replicas where latency is high.
Configuring an Always On Availability Group in SQL Server 2025
The following T-SQL demonstrates creating an AG with one synchronous and one asynchronous replica, using the new SQL Server 2025
CONTAINED option and
SEEDING_MODE = AUTOMATIC:
-- Step 1: Enable Always On on each node (run on each replica)
-- Requires Windows Server Failover Cluster (WSFC) or Pacemaker on Linux
-- Step 2: Create the Availability Group (run on the primary replica)
CREATE AVAILABILITY GROUP [AG_Production_2025]
WITH (
AUTOMATED_BACKUP_PREFERENCE = SECONDARY,
DB_FAILOVER = ON,
DTC_SUPPORT = PER_DB,
CLUSTER_TYPE = WSFC,
REQUIRED_SYNCHRONIZED_SECONDARIES_TO_COMMIT = 1,
CONTAINED = YES,
REUSE_SYSTEM_DATABASES = YES
)
FOR
DATABASE [SalesDB],
DATABASE [OrdersDB],
DATABASE [InventoryDB]
REPLICA ON
N'SQLNODE01' WITH (
ENDPOINT_URL = N'TCP://SQLNODE01.corp.local:5022',
FAILOVER_MODE = AUTOMATIC,
AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
BACKUP_PRIORITY = 50,
SECONDARY_ROLE (
ALLOW_CONNECTIONS = READ_ONLY,
READ_ONLY_ROUTING_URL = N'TCP://SQLNODE01.corp.local:1433'
),
SEEDING_MODE = AUTOMATIC
),
N'SQLNODE02' WITH (
ENDPOINT_URL = N'TCP://SQLNODE02.corp.local:5022',
FAILOVER_MODE = AUTOMATIC,
AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
BACKUP_PRIORITY = 50,
SECONDARY_ROLE (
ALLOW_CONNECTIONS = READ_ONLY,
READ_ONLY_ROUTING_URL = N'TCP://SQLNODE02.corp.local:1433'
),
SEEDING_MODE = AUTOMATIC
),
N'SQLNODE03-DR' WITH (
ENDPOINT_URL = N'TCP://SQLNODE03.corp.local:5022',
FAILOVER_MODE = MANUAL,
AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
BACKUP_PRIORITY = 80,
SECONDARY_ROLE (
ALLOW_CONNECTIONS = READ_ONLY,
READ_ONLY_ROUTING_URL = N'TCP://SQLNODE03.corp.local:1433'
),
SEEDING_MODE = AUTOMATIC
)
LISTENER N'AG_Listener_Prod' (
WITH IP (
(N'10.0.1.100', N'255.255.255.0'),
(N'10.0.2.100', N'255.255.255.0')
),
PORT = 1433
);
GO
-- Grant CREATE ANY DATABASE on secondaries for automatic seeding
ALTER AVAILABILITY GROUP [AG_Production_2025]
GRANT CREATE ANY DATABASE;
GO
Monitoring Always On Availability Groups
After deploying your AG, continuous monitoring is essential. SQL Server 2025 introduces expanded DMVs for deeper observability into replication lag and estimated data loss:
-- Monitor AG health, synchronization state, and estimated data loss
SELECT
ag.name AS availability_group_name,
ar.replica_server_name AS replica_server,
ar.availability_mode_desc AS commit_mode,
ar.failover_mode_desc AS failover_mode,
ars.role_desc AS current_role,
ars.synchronization_health_desc AS sync_health,
ars.connected_state_desc AS connection_state,
adc.database_name AS database_name,
drs.synchronization_state_desc AS db_sync_state,
drs.log_send_queue_size AS log_send_queue_kb,
drs.log_send_rate AS log_send_rate_kbps,
drs.redo_queue_size AS redo_queue_kb,
drs.redo_rate AS redo_rate_kbps,
drs.secondary_lag_seconds AS estimated_lag_seconds,
drs.estimated_data_loss_time AS estimated_data_loss_sec
FROM sys.availability_groups ag
JOIN sys.availability_replicas ar ON ag.group_id = ar.group_id
JOIN sys.dm_hadr_availability_replica_states ars ON ar.replica_id = ars.replica_id
JOIN sys.availability_databases_cluster adc ON ag.group_id = adc.group_id
JOIN sys.dm_hadr_database_replica_states drs
ON adc.group_database_id = drs.group_database_id
AND ar.replica_id = drs.replica_id
ORDER BY ag.name, ars.role_desc, adc.database_name;
GO
Windows Server Failover Clustering (WSFC) with SQL Server 2025
SQL Server 2025 Failover Cluster Instances (FCI) use Windows Server Failover Clustering to provide instance-level high availability. Unlike AGs (which protect specific databases), an FCI protects the entire SQL Server instance — all databases, logins, SQL Agent jobs, linked servers, and instance-level configuration.
Key FCI Architecture Decisions
- Shared Storage: All nodes access the same storage (SAN, Azure Shared Disks, or Storage Spaces Direct). The storage is only mounted on the active node at any given time.
- Virtual Network Name (VNN): Clients always connect to the VNN, which floats to the active node after a failover event.
- Quorum Configuration: SQL Server 2025 on WSFC supports Disk Witness, File Share Witness, and Cloud Witness (Azure Blob Storage) for quorum voting.
Checking FCI Health with T-SQL
-- Check Failover Cluster Instance node status
SELECT
NodeName,
Status,
StatusDescription,
IsDrainStopped
FROM sys.dm_os_cluster_nodes;
GO
-- Check cluster network health
SELECT
network_subnet_ip,
network_subnet_ipv4_mask,
is_public,
is_ipv4
FROM sys.dm_hadr_cluster_networks;
GO
-- Verify FCI quorum state
SELECT
quorum_type_desc,
quorum_state_desc
FROM sys.dm_hadr_cluster;
GO
Distributed Availability Groups for Multi-Site Disaster Recovery
A
Distributed Availability Group (DAG) connects two independent AGs — potentially on different WSFC clusters, different operating systems (Windows and Linux), or even in different Azure regions. This is the premier SQL Server 2025 architecture for active-active and active-passive multi-datacenter disaster recovery deployments.
Creating a Distributed Availability Group
-- Run on the PRIMARY replica of AG1 (primary datacenter)
CREATE AVAILABILITY GROUP [DAG_GlobalProd]
WITH (DISTRIBUTED)
AVAILABILITY GROUP ON
N'AG_Datacenter1' WITH (
LISTENER_URL = N'TCP://AG_DC1_Listener.corp.local:5022',
AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
FAILOVER_MODE = MANUAL,
SEEDING_MODE = AUTOMATIC
),
N'AG_Datacenter2' WITH (
LISTENER_URL = N'TCP://AG_DC2_Listener.corp.local:5022',
AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
FAILOVER_MODE = MANUAL,
SEEDING_MODE = AUTOMATIC
);
GO
-- Run on the PRIMARY replica of AG2 (DR datacenter) to join the DAG
ALTER AVAILABILITY GROUP [DAG_GlobalProd]
JOIN AVAILABILITY GROUP ON
N'AG_Datacenter1' WITH (
LISTENER_URL = N'TCP://AG_DC1_Listener.corp.local:5022',
AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
FAILOVER_MODE = MANUAL,
SEEDING_MODE = AUTOMATIC
);
GO
-- Monitor DAG synchronization and lag
SELECT
ag.name AS dag_name,
ar.replica_server_name AS replica_server,
drs.synchronization_state_desc AS sync_state,
drs.secondary_lag_seconds AS lag_seconds,
drs.log_send_queue_size AS send_queue_kb,
drs.estimated_data_loss_time AS est_data_loss_sec
FROM sys.availability_groups ag
JOIN sys.availability_replicas ar ON ag.group_id = ar.group_id
JOIN sys.dm_hadr_database_replica_states drs ON ar.replica_id = drs.replica_id
WHERE ag.is_distributed = 1;
GO
Contained Availability Groups: The SQL Server 2025 Game Changer
The most significant SQL Server 2025 High Availability innovation is the production-hardening of
Contained Availability Groups. In a Contained AG, the system metadata — SQL Agent jobs, credentials, linked servers, database mail, and logins — is replicated inside the AG itself rather than living only at the Windows or OS level.
This solves the number one operational headache with traditional AGs: the manual work of keeping SQL Agent jobs and logins synchronized across replicas. With Contained AGs in SQL Server 2025, failover is truly seamless with no post-failover manual intervention required.
Managing Contained AG System Objects
-- Verify Contained AG configuration
SELECT
ag.name AS ag_name,
ag.is_contained AS is_contained,
ag.reuse_system_databases AS reuse_system_dbs
FROM sys.availability_groups ag
WHERE ag.is_contained = 1;
GO
-- Create a SQL Agent job inside a Contained AG
-- (This job automatically replicates to all secondary replicas)
USE msdb;
GO
EXEC sp_add_job
@job_name = N'AG_Nightly_Index_Rebuild',
@description = N'Nightly index rebuild - executes on AG primary only';
EXEC sp_add_jobstep
@job_name = N'AG_Nightly_Index_Rebuild',
@step_name = N'Rebuild Indexes on SalesDB',
@subsystem = N'TSQL',
@command = N'EXEC SalesDB.dbo.sp_RebuildIndexes;',
@database_name = N'SalesDB';
EXEC sp_add_schedule
@schedule_name = N'Nightly_2AM',
@freq_type = 4,
@active_start_time = 020000;
EXEC sp_attach_schedule
@job_name = N'AG_Nightly_Index_Rebuild',
@schedule_name = N'Nightly_2AM';
EXEC sp_add_jobserver
@job_name = N'AG_Nightly_Index_Rebuild';
GO
Accelerated Database Recovery (ADR) in SQL Server 2025
Accelerated Database Recovery (ADR) is a transformative SQL Server feature that separates the version store from the transaction log, enabling near-instant database recovery regardless of the number of active transactions at the time of a crash.
In SQL Server 2025, ADR v3 introduces several key improvements for high availability workloads:
- Persistent Version Store (PVS) Auto-Cleanup: Background cleanup is more aggressive and configurable, preventing PVS bloat even under heavy OLTP workloads with millions of short transactions per second.
- ADR with In-Memory OLTP: Full compatibility with memory-optimized tables (Hekaton), enabling sub-second recovery for in-memory workloads.
- Cross-Database Transaction Recovery: Recovery of distributed transactions spanning multiple databases within a Contained AG is handled atomically, ensuring consistency after failover.
-- Enable ADR on production databases
ALTER DATABASE SalesDB
SET ACCELERATED_DATABASE_RECOVERY = ON
(PERSISTENT_VERSION_STORE_FILEGROUP = [PRIMARY]);
GO
ALTER DATABASE OrdersDB
SET ACCELERATED_DATABASE_RECOVERY = ON
(PERSISTENT_VERSION_STORE_FILEGROUP = [PRIMARY]);
GO
-- Monitor ADR Persistent Version Store usage (SQL Server 2025)
SELECT
DB_NAME(pvs.database_id) AS database_name,
pvs.pvs_filegroup_name AS pvs_filegroup,
pvs.persistent_version_store_size_kb / 1024.0 AS pvs_size_mb,
pvs.online_index_version_store_size_kb / 1024.0 AS online_idx_store_mb,
pvs.current_aborted_transaction_count AS active_aborted_txns,
pvs.oldest_active_transaction_id AS oldest_active_txn_id,
pvs.min_transaction_timestamp AS oldest_txn_start_time
FROM sys.dm_tran_persistent_version_store_stats pvs
ORDER BY pvs.persistent_version_store_size_kb DESC;
GO
SQL Server 2025 High Availability on Linux and Kubernetes
SQL Server 2025 expands its
Linux HA capabilities significantly. Pacemaker combined with Corosync remains the standard for Failover Cluster Instances on Linux, but the SQL Server Operator for Kubernetes — now generally available in 2025 — provides a cloud-native alternative optimized for containerized database environments.
Pacemaker-based AG on Linux (RHEL 9 / Ubuntu 24.04)
#!/bin/bash
# Configure Pacemaker for SQL Server 2025 AG on RHEL 9
# Install required packages
sudo dnf install -y pacemaker pcs corosync fence-agents-all
# Authenticate cluster nodes
sudo pcs host auth sqllinux01 sqllinux02 sqllinux03 -u hacluster -p 'SecureClusterPass!'
# Create and start the Pacemaker cluster
sudo pcs cluster setup SQLCluster2025 sqllinux01 sqllinux02 sqllinux03 --start --enable
# Create the AG resource agent
sudo pcs resource create AG_Production_2025 ocf:mssql:ag ag_name="AG_Production_2025" meta failure-timeout=60s op start timeout=60s op stop timeout=60s op promote timeout=60s op monitor interval=10s timeout=60s on-fail=restart op monitor interval=11s timeout=60s role="Master" on-fail=restart op monitor interval=12s timeout=60s role="Slave" on-fail=restart
# Make AG resource promotable (primary/secondary)
sudo pcs resource promotable AG_Production_2025 promoted-max=1 promoted-node-max=1 clone-max=3 clone-node-max=1
# Verify cluster status
sudo pcs status
SQL Server 2025 AG on Kubernetes (Operator GA)
apiVersion: sql.microsoft.com/v1
kind: SqlManagedInstance
metadata:
name: sql-ag-primary
namespace: sql-server-ha
spec:
replicas: 3
scheduler:
defaultSchedulerName: default-scheduler
settings:
sqlinstance:
licenseType: LicenseIncluded
edition: Enterprise
storage:
datalogs:
volumes:
- volume:
persistentVolumeClaim:
claimName: sql-data-pvc
name: sql-data
logs:
volumes:
- volume:
persistentVolumeClaim:
claimName: sql-logs-pvc
name: sql-logs
services:
primary:
type: LoadBalancer
secondary:
type: LoadBalancer
haServices:
primaryEndpoint: 5022
backup:
retentionPeriodInDays: 7
Read-Scale Secondaries and Workload Offloading
One of the most powerful SQL Server 2025 High Availability features for enterprise environments is the ability to offload read-intensive workloads to secondary replicas. This improves both HA (primaries are less loaded) and analytical performance by routing reporting queries to secondaries.
Read-Only Routing Configuration
-- Configure read-only routing so read-intent connections route
-- automatically to the best available secondary replica
-- Step 1: Configure read-only routing URLs on each replica
ALTER AVAILABILITY GROUP [AG_Production_2025]
MODIFY REPLICA ON N'SQLNODE01' WITH (
SECONDARY_ROLE (
READ_ONLY_ROUTING_URL = N'TCP://SQLNODE01.corp.local:1433'
)
);
GO
ALTER AVAILABILITY GROUP [AG_Production_2025]
MODIFY REPLICA ON N'SQLNODE02' WITH (
SECONDARY_ROLE (
READ_ONLY_ROUTING_URL = N'TCP://SQLNODE02.corp.local:1433'
)
);
GO
-- Step 2: Configure read-only routing list on the primary
ALTER AVAILABILITY GROUP [AG_Production_2025]
MODIFY REPLICA ON N'SQLNODE01' WITH (
PRIMARY_ROLE (
READ_ONLY_ROUTING_LIST = (
(N'SQLNODE02', N'SQLNODE03-DR'), -- Load-balanced group
N'SQLNODE02' -- Fallback
)
)
);
GO
-- Step 3: Connect with read-intent from application
-- Connection string: ApplicationIntent=ReadOnly;MultiSubnetFailover=True
-- Verify routing is working
SELECT
SERVERPROPERTY('ServerName') AS connected_server,
SERVERPROPERTY('IsHadrEnabled') AS hadr_enabled,
sys.fn_hadr_is_primary_replica(DB_NAME()) AS is_primary_replica;
GO
SQL Server 2025 High Availability Best Practices
Based on MinervaDB's experience managing SQL Server high availability across hundreds of enterprise deployments, here are the most impactful best practices for SQL Server 2025 Maximum Availability:
1. RPO and RTO Planning
Define your Recovery Point Objective (RPO) and Recovery Time Objective (RTO) before selecting your HA technology. Use synchronous commit (RPO=0) for mission-critical OLTP, and asynchronous commit for DR replicas where network latency makes synchronous commit impractical.
-- Measure current replication latency to determine feasibility
-- of synchronous commit for each replica
SELECT
ar.replica_server_name,
ar.availability_mode_desc,
AVG(drs.secondary_lag_seconds) AS avg_lag_seconds,
MAX(drs.secondary_lag_seconds) AS max_lag_seconds,
AVG(drs.log_send_rate) AS avg_send_rate_kbps,
AVG(drs.redo_rate) AS avg_redo_rate_kbps
FROM sys.availability_replicas ar
JOIN sys.dm_hadr_database_replica_states drs ON ar.replica_id = drs.replica_id
WHERE drs.is_local = 0
GROUP BY ar.replica_server_name, ar.availability_mode_desc
ORDER BY avg_lag_seconds DESC;
GO
2. Health Check Tuning to Prevent Spurious Failovers
The SQL Server 2025 Intelligent Failover Policy uses machine learning, but you can also tune the traditional health check timeout and failure conditions to avoid unnecessary failovers:
-- Tune health check timeout and failure condition level
-- Failure condition level 3 (default): failover on critical server errors
-- Failure condition level 4: also failover on moderate server errors
ALTER AVAILABILITY GROUP [AG_Production_2025]
SET (
HEALTH_CHECK_TIMEOUT = 30000, -- 30 seconds (default: 30000ms)
FAILURE_CONDITION_LEVEL = 3 -- Balanced: critical errors only
);
GO
-- For Pacemaker on Linux, adjust timeout values accordingly
-- (in the Pacemaker resource agent configuration)
3. Backup Strategy with Availability Groups
-- Check which replica is preferred for backups
SELECT
ag.name AS availability_group,
ar.replica_server_name AS replica,
ar.backup_priority AS backup_priority,
sys.fn_hadr_backup_is_preferred_replica(DB_NAME()) AS is_preferred_backup_replica
FROM sys.availability_groups ag
JOIN sys.availability_replicas ar ON ag.group_id = ar.group_id
ORDER BY ar.backup_priority DESC;
GO
-- Use COPY_ONLY for ad-hoc backups on the primary to avoid breaking the log chain
BACKUP DATABASE [SalesDB]
TO DISK = N'/backup/SalesDB_adhoc.bak'
WITH COPY_ONLY, COMPRESSION, STATS = 10;
GO
-- Automate backup selection: run this on all replicas,
-- backup only executes on the preferred replica
IF sys.fn_hadr_backup_is_preferred_replica(N'SalesDB') = 1
BEGIN
BACKUP DATABASE [SalesDB]
TO DISK = N'/backup/SalesDB_full.bak'
WITH COMPRESSION, CHECKSUM, STATS = 10;
PRINT 'Backup completed on preferred replica.';
END
ELSE
BEGIN
PRINT 'Skipping backup: this replica is not preferred for SalesDB.';
END
GO
SQL Server 2025 Maximum Availability Architecture: Reference Topology
A production-ready SQL Server 2025 Maximum Availability architecture for a large enterprise typically combines multiple HA technologies in layers:
- Layer 1 — Local HA (Primary Datacenter): Two-node synchronous Always On AG with automatic failover and a Cloud Witness for quorum. Both nodes participate in a Contained AG with SQL Server 2025.
- Layer 2 — Regional DR: A third asynchronous replica in a secondary datacenter, configured as a Distributed AG forwarding node for cross-region failover.
- Layer 3 — Cloud DR: An Azure SQL Managed Instance or SQL Server on Azure VM participating as a DAG endpoint, providing cloud-based DR with native Azure Availability Zone support.
- Layer 4 — Read Scale-Out: Read-only routing configured to distribute reporting and analytics workloads across secondary replicas, reducing load on the primary.
- Layer 5 — Observability: SQL Server 2025 Extended Events, Intelligent Query Processing telemetry, and third-party monitoring (Prometheus + Grafana or Datadog) for real-time AG health dashboards.
Performing a Planned Manual Failover in SQL Server 2025
Planned failovers are used for maintenance windows, OS patching, and hardware upgrades. A manual failover with synchronous commit results in zero data loss:
-- Step 1: Verify AG is in SYNCHRONIZED state before failover
SELECT
ar.replica_server_name,
ars.role_desc,
ars.synchronization_health_desc,
drs.synchronization_state_desc,
drs.is_suspended
FROM sys.availability_replicas ar
JOIN sys.dm_hadr_availability_replica_states ars ON ar.replica_id = ars.replica_id
JOIN sys.dm_hadr_database_replica_states drs ON ar.replica_id = drs.replica_id
WHERE drs.is_local = 0;
GO
-- Step 2: Connect to the TARGET secondary and initiate failover
-- (Run on SQLNODE02 — the intended new primary)
ALTER AVAILABILITY GROUP [AG_Production_2025] FAILOVER;
GO
-- Step 3: Verify the new primary after failover
SELECT
SERVERPROPERTY('ServerName') AS server_name,
sys.fn_hadr_is_primary_replica('SalesDB') AS is_primary_for_SalesDB,
sys.fn_hadr_is_primary_replica('OrdersDB') AS is_primary_for_OrdersDB
GO
Troubleshooting SQL Server 2025 High Availability Issues
Even the best-configured HA environments encounter issues. Here are the most common SQL Server 2025 AG problems and their diagnostic queries:
Issue 1: AG Replica Stuck in SYNCHRONIZING State
-- Identify redo thread bottlenecks on secondary replicas
SELECT
DB_NAME(drs.database_id) AS database_name,
drs.redo_queue_size AS redo_queue_kb,
drs.redo_rate AS redo_rate_kbps,
drs.log_send_queue_size AS send_queue_kb,
drs.secondary_lag_seconds AS lag_seconds,
drs.last_redone_lsn,
drs.last_received_lsn,
drs.last_hardened_lsn
FROM sys.dm_hadr_database_replica_states drs
WHERE drs.is_local = 1
ORDER BY drs.redo_queue_size DESC;
GO
Issue 2: Automatic Failover Not Triggering
-- Check failover eligibility: both replicas must be SYNCHRONIZED
-- and the AG must have AUTOMATIC failover configured
SELECT
ag.name AS ag_name,
ar.replica_server_name AS replica,
ar.failover_mode_desc AS failover_mode,
ars.synchronization_health_desc AS sync_health,
ars.operational_state_desc AS op_state,
ars.recovery_health_desc AS recovery_health
FROM sys.availability_groups ag
JOIN sys.availability_replicas ar ON ag.group_id = ar.group_id
JOIN sys.dm_hadr_availability_replica_states ars ON ar.replica_id = ars.replica_id
ORDER BY ag.name, ar.replica_server_name;
GO
-- Check Windows Cluster health (run in PowerShell / CMD)
-- Get-ClusterNode | Select Name, State
-- Get-ClusterGroup | Where-Object {$_.GroupType -eq "AvailabilityGroup"}
SQL Server 2025 High Availability on Azure: Azure SQL Managed Instance
For organizations running SQL Server 2025 workloads in Microsoft Azure,
Azure SQL Managed Instance (SQL MI) provides a fully managed HA layer built directly into the platform:
- Business Critical Tier: Built on Always On AG technology with 4 replicas (3 synchronous), SLA of 99.99% uptime, automatic failover in under 30 seconds.
- General Purpose Tier: Uses Azure Service Fabric with replicated remote storage, 99.99% SLA with RTO under 30 minutes.
- Auto-Failover Groups: Cross-region active geo-replication built on Distributed AGs, enabling seamless failover between Azure regions.
- Zone Redundant Configuration: Synchronous replicas spread across Azure Availability Zones within a region for zone-level resilience.
Conclusion: Building a Maximum Availability SQL Server 2025 Environment
SQL Server 2025 Maximum Availability is a multi-layered strategy, not a single feature. The combination of Contained Always On Availability Groups, Accelerated Database Recovery v3, Distributed Availability Groups, Pacemaker on Linux, and the new Kubernetes Operator provides unprecedented flexibility for achieving sub-second failover and zero data loss across on-premises, hybrid, and cloud environments.
For organizations serious about
SQL Server 2025 High Availability, MinervaDB provides expert database consulting services. Our team of SQL Server DBAs has designed and operated Maximum Availability architectures for global enterprises across financial services, healthcare, retail, and SaaS industries.
Key Takeaways:
- Always On Availability Groups with Contained AG is the recommended HA architecture for SQL Server 2025 enterprise deployments.
- Use synchronous commit for RPO=0 within the same datacenter or Azure region; use asynchronous commit for geographically distributed DR replicas.
- Accelerated Database Recovery (ADR) v3 dramatically reduces crash recovery time and should be enabled on all production databases.
- Distributed Availability Groups extend your HA topology across datacenters and cloud regions without requiring a shared WSFC cluster.
- Contained AGs eliminate the manual synchronization of SQL Agent jobs and logins, making post-failover operations seamless.
- For Kubernetes and containerized deployments, the SQL Server 2025 Operator (GA) brings enterprise-grade HA to cloud-native environments.
Need help architecting your SQL Server 2025 High Availability environment? Contact MinervaDB for a complimentary architecture review with our SQL Server experts.