
High Availability for MySQL using Orchestrator and ProxySQL

Automatic replica promotion without any DBA intervention, using Orchestrator and ProxySQL. If you are looking for an HA solution that avoids synchronous replication or AWS RDS, Orchestrator with ProxySQL is a great choice.

sukan June 09, 2026

MySQL High Availability with Orchestrator and ProxySQL: Auto Failover Without Synchronous Replication

Most MySQL deployments rely on standard asynchronous replication — one primary, one or more replicas — with no automated path for handling a primary failure. When that primary goes down, someone gets paged, SSHes in, and manually promotes a replica. At 2 AM. Mafiree's MySQL team has seen this pattern across dozens of production environments, and in nearly every case the RTO was measured in tens of minutes rather than seconds.

There's a better approach that doesn't require abandoning asynchronous replication or absorbing the write latency overhead of Galera or InnoDB Cluster. Orchestrator — an open-source MySQL topology manager — paired with ProxySQL as an intelligent load balancer delivers automatic primary promotion, replica re-pointing, and transparent traffic rerouting, all without a single line of application-level change.

This guide covers the full architecture, how the three Orchestrator phases work, ProxySQL integration patterns, and the critical configuration details that determine whether failover takes 20 seconds or 3 minutes.

What You'll Learn

Why cluster-based HA (Galera, InnoDB Cluster) isn't always the right trade-off
How Orchestrator's Discovery, Refactoring, and Recovery phases work end-to-end
How ProxySQL routes read/write traffic and reacts to topology changes
Step-by-step Orchestrator + ProxySQL setup with real configuration snippets
Promotion rules, hooks, and the configurations that cause silent failover failures

Why Not Synchronous Replication?

The instinct when building a highly available MySQL tier is to reach for a synchronous cluster. Galera Cluster and MySQL InnoDB Cluster provide multi-master or group replication with automatic failover baked in. For some workloads, that's the right call. But synchronous replication carries real costs that make it unsuitable for many production environments.

Write latency overhead: Every commit must be acknowledged by a quorum before returning to the client. On a multi-AZ setup with 5–10ms cross-zone RTT, every write absorbs that penalty. Under contention, this compounds.
Certification conflicts and deadlocks: Galera's optimistic concurrency control generates certification failures when concurrent transactions touch overlapping rows across nodes. These surface as application-layer deadlock errors and require application-side retry logic.
Operational complexity: Cluster state management — node eviction, SST (full state transfer), IST (incremental) — adds a meaningful operational surface area. Recovering a desynchronised node from SST on a large dataset blocks the donor for the duration.
Not always necessary: If your workload has a single clear write path and your primary RPO tolerance is a few seconds of replication lag, asynchronous replication with automated failover achieves the same availability target at a fraction of the overhead.

Orchestrator + ProxySQL is designed exactly for this: asynchronous replication topologies that need automated, reliable failover without sacrificing write performance.

Architecture Overview

A typical Mafiree-deployed HA stack using this pattern looks like this:

Reference Architecture

  Application Tier
        |
  ┌─────────────┐
  │   ProxySQL  │  (x2, HA pair)  — Listens on port 6033
  │  (Read/Write│  — Hostgroup 10: writer (primary only)
  │   Routing)  │  — Hostgroup 20: readers (replicas)
  └──────┬──────┘
         |
  ┌──────┴──────────────────────┐
  |                             |
  ▼                             ▼
MySQL Primary (3306)     Replica 1 / Replica 2 (3306)
  |
  ├── Replica 1
  └── Replica 2

  Orchestrator Tier (x3, Raft consensus)
  ┌────────────────────────────┐
  │  Orch-1  Orch-2  Orch-3   │  — Raft leader elected automatically
  │          (Raft)            │  — Each polls all MySQL nodes
  └────────────────────────────┘

Three components carry all the weight:

MySQL replication topology: Standard asynchronous replication — one primary, one or more replicas. GTID-based is strongly recommended; Orchestrator supports binary log file+position but GTID makes replica re-pointing after failover far cleaner.
Orchestrator (Raft mode): Three Orchestrator nodes running in Raft consensus. The elected leader handles topology polling and recovery actions. This eliminates single points of failure in the HA manager itself — a critical detail that's often missed in single-Orchestrator setups.
ProxySQL: Deployed as an HA pair or behind a VIP. Maintains hostgroup definitions mapping writers and readers, executes health checks, and can be notified of topology changes via Orchestrator hooks to update its internal routing tables instantly.
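
The choice of three Orchestrator nodes follows from majority-quorum arithmetic. As a minimal sketch (the function names are illustrative and not part of Orchestrator), the quorum size and the number of node failures a Raft group can absorb work out as follows:

```bash
#!/usr/bin/env bash
# Majority quorum for an n-node Raft group: floor(n/2) + 1.
raft_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Node failures the group can survive while still holding a majority.
raft_tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 5; do
  echo "nodes=$n quorum=$(raft_quorum "$n") tolerated_failures=$(raft_tolerated_failures "$n")"
done
```

Two nodes tolerate no failures at all, which is why three is the smallest Raft group that survives the loss of one Orchestrator node.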

How Orchestrator Works: Three Phases

Orchestrator's operation breaks into three distinct phases. Understanding each is essential to configuring it correctly and diagnosing failover behaviour in production.

Phase 1: Discovery

Orchestrator continuously polls the MySQL topology, starting from seed nodes you define in its configuration. For each node it discovers, it reads SHOW SLAVE STATUS, SHOW MASTER STATUS, and a handful of performance-related queries to understand the complete topology graph.

Discovery captures replication positions (GTID executed sets, binary log file/position), read-only status, heartbeat lag, and the full upstream/downstream relationships between every node. This topology map is stored in Orchestrator's backend database (MySQL or SQLite) and refreshed on a configurable interval (default: 5 seconds).

Configuration note: The InstancePollSeconds setting controls how frequently Orchestrator polls each instance. Lower values detect failures faster but increase MySQL-side query load. Mafiree's standard configuration is 5 seconds for production and 10 seconds for non-critical environments.

Phase 2: Refactoring

Refactoring is Orchestrator's ability to restructure the replication topology without a failure event. This includes:

Moving a replica from one primary to another (relocate)
Changing the replication position of a replica (match)
Re-pointing all replicas under a new primary after manual changes
Splitting or merging replica subtrees

During recovery, refactoring is what Orchestrator uses to re-attach surviving replicas under the newly promoted primary. With GTID enabled, this is straightforward — Orchestrator simply issues CHANGE MASTER TO with the new primary's address and lets GTID auto-positioning handle the rest. Without GTID, Orchestrator performs binary log file/position calculations to find the right resume point, which works but is more fragile under complex topologies.
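
The same relocate operation can be run by hand outside a failure event. The sketch below wraps `orchestrator-client -c relocate` in a small helper that checks its arguments first. The helper names and the `ORCH_CLIENT` override are assumptions made for illustration; only the `relocate` command with `-i` and `-d` comes from the Orchestrator CLI.

```bash
#!/usr/bin/env bash
# ORCH_CLIENT is a hypothetical override so the helper can be dry-run in tests.
ORCH_CLIENT="${ORCH_CLIENT:-orchestrator-client}"

# Succeed only if the argument looks like host:port with a numeric port.
is_host_port() {
  local host port
  case "$1" in
    *:*) ;;
    *) return 1 ;;
  esac
  host="${1%:*}"
  port="${1##*:}"
  [ -n "$host" ] || return 1
  case "$port" in
    ''|*[!0-9]*) return 1 ;;
  esac
  return 0
}

# Move a replica beneath a different source using Orchestrator's relocate.
relocate_replica() {
  local replica="$1" target="$2"
  is_host_port "$replica" || { echo "bad replica: $replica" >&2; return 2; }
  is_host_port "$target"  || { echo "bad target: $target" >&2; return 2; }
  $ORCH_CLIENT -c relocate -i "$replica" -d "$target"
}
```

For example, `relocate_replica mysql-replica-2:3306 mysql-replica-1:3306` would ask Orchestrator to make replica 2 replicate from replica 1, choosing GTID or file/position logic as appropriate.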

Phase 3: Recovery

Recovery triggers when Orchestrator detects a primary failure — specifically when the primary becomes unreachable to Orchestrator and all replicas simultaneously report that replication is broken. The recovery sequence is:

Failure confirmation: Orchestrator waits for RecoveryPollSeconds (default: 1s) and re-checks from multiple Raft members to eliminate false positives from transient network issues.
Pre-failover hooks: Orchestrator runs OnFailureDetectionProcesses as soon as the failure is detected and PreFailoverProcesses immediately before it takes recovery action. This is where ProxySQL integration first fires: either hook can call a script that updates ProxySQL's hostgroups to temporarily stop routing writes, preventing split-brain during the transition.
Candidate selection: Orchestrator evaluates replicas against promotion rules: replica lag, binary log enabled, data centre preference, and any explicitly defined priority tags. The replica with the most recent GTID executed set and no blocking conditions wins.
Promotion: Orchestrator promotes the candidate to primary by setting read_only=OFF (and super_read_only=OFF if applicable), then updates its internal topology map.
Replica re-pointing: All surviving replicas are issued CHANGE MASTER TO pointing at the new primary. GTID auto-position handles the sync catch-up automatically.
Post-failover hook: PostMasterFailoverProcesses fires. A ProxySQL hook here updates the writer hostgroup to point at the new primary, restoring write traffic routing.

Common failure mode: Orchestrator correctly promotes a replica but ProxySQL continues routing writes to the old primary IP because no hook updates it. This causes write failures until someone manually updates ProxySQL's mysql_servers table. Always verify that your post-failover hooks are tested end-to-end before relying on them in production.
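
One way to catch this failure mode is to compare what Orchestrator reports as the primary with what ProxySQL actually holds in its writer hostgroup. The following is a sketch under assumptions: the admin credentials are placeholders, the function names and the `ORCH_CLIENT` and `PROXYSQL_ADMIN` overrides are hypothetical, and both tools are assumed to identify servers by the same hostname (an IP on one side and a DNS name on the other would report a false mismatch).

```bash
#!/usr/bin/env bash
ORCH_CLIENT="${ORCH_CLIENT:-orchestrator-client}"
# Placeholder admin credentials; -N -B gives tab-separated rows without headers.
PROXYSQL_ADMIN="${PROXYSQL_ADMIN:-mysql -h 127.0.0.1 -P 6032 -u admin -padmin_password -N -B}"

# ONLINE servers in the given writer hostgroup, printed as host:port.
proxysql_writers() {
  $PROXYSQL_ADMIN -e "SELECT hostname, port FROM runtime_mysql_servers WHERE hostgroup_id = ${1:-10} AND status = 'ONLINE'" \
    | awk '{ print $1 ":" $2 }'
}

# The source that Orchestrator records for a surviving replica.
orchestrator_primary() {
  $ORCH_CLIENT -c which-master -i "$1"
}

# Compare the two views; $1 is any surviving replica as host:port.
writer_matches_orchestrator() {
  local expected writers
  expected="$(orchestrator_primary "$1")"
  writers="$(proxysql_writers 10)"
  if [ -z "$expected" ]; then
    echo "orchestrator returned no primary" >&2
    return 2
  fi
  if [ "$writers" = "$expected" ]; then
    echo "OK: ProxySQL writer is $expected"
  else
    echo "MISMATCH: orchestrator=$expected proxysql=${writers:-none}" >&2
    return 1
  fi
}
```

Running `writer_matches_orchestrator mysql-replica-2:3306` after a failover test gives a quick pass/fail signal that the hook (or the read_only monitor) actually moved the writer.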

ProxySQL Integration: Traffic Routing and Hostgroups

ProxySQL's role in this architecture is to sit between the application and MySQL, routing connections to the correct backend without the application needing to know anything about the current topology.

Hostgroup Design

The standard pattern uses two hostgroups:

Hostgroup	Purpose	Members
`HG 10`	Writer group — read/write connections	Primary only (`weight=1000`)
`HG 20`	Reader group — read-only connections	All replicas, optionally primary

Query rules map traffic based on patterns. A typical minimal ruleset:

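-- Route locking reads to the writer (HG 10) and all other SELECTs to the readers (HG 20)
INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
VALUES
  (10, 1, '^SELECT.*FOR UPDATE', 10, 1),
  (20, 1, '^SELECT',             20, 1);

-- Default: everything else (writes, DDL, transactions) goes to the writer HG.
-- The default hostgroup is a per-user setting in mysql_users, not a global variable.
-- Assumes app_user has already been added to mysql_users.
UPDATE mysql_users SET default_hostgroup = 10
  WHERE username = 'app_user';

LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;
LOAD MYSQL USERS TO RUNTIME;
SAVE MYSQL USERS TO DISK;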
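
To make the rule ordering concrete, here is a small shell sketch that mimics how the two rules above would classify a statement. It is a mental model, not ProxySQL itself: it assumes rules are evaluated in rule_id order, that the first match with apply=1 wins, that matching is case-insensitive, and that unmatched statements fall through to a default hostgroup of 10. The function name `route_query` is illustrative.

```bash
#!/usr/bin/env bash
# Print the hostgroup a statement would reach under the two rules above.
route_query() {
  if printf '%s\n' "$1" | grep -Eiq '^SELECT.*FOR UPDATE'; then
    echo 10   # rule 10: locking reads go to the writer
  elif printf '%s\n' "$1" | grep -Eiq '^SELECT'; then
    echo 20   # rule 20: plain reads go to the readers
  else
    echo 10   # no rule matched: default hostgroup (assumed to be 10)
  fi
}

for q in "SELECT * FROM t WHERE id = ?" \
         "SELECT * FROM t WHERE id = ? FOR UPDATE" \
         "UPDATE t SET c = ? WHERE id = ?" \
         "select c from t"; do
  printf 'HG %s  %s\n' "$(route_query "$q")" "$q"
done
```

The order matters: if the plain `^SELECT` rule had the lower rule_id, locking reads would match it first and land on a replica.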

Health Checks and Auto-Eviction

ProxySQL runs internal health checks against every server in its mysql_servers table. Connect and ping checks are bounded by mysql-monitor_connect_timeout and mysql-monitor_ping_timeout; a backend that misses mysql-monitor_ping_max_failures consecutive pings is marked SHUNNED and stops receiving traffic until it responds again. OFFLINE_SOFT, by contrast, is a status an administrator sets in this kind of setup to drain a server gracefully.

The native ProxySQL monitor polls @@read_only on every backend at the interval set by mysql-monitor_read_only_interval (default: 1500ms). When Orchestrator promotes a replica and sets read_only=OFF on it, the monitor detects the change within one or two poll cycles and, provided mysql_replication_hostgroups is configured, moves that server into the writer hostgroup automatically, with no hook required.

-- Tell ProxySQL which HGs are writer/reader pairs
INSERT INTO mysql_replication_hostgroups (writer_hostgroup, reader_hostgroup, comment)
VALUES (10, 20, 'mysql-ha');

LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;

With mysql_replication_hostgroups configured, ProxySQL moves servers between HG 10 and HG 20 based on read_only state automatically. This means Orchestrator's promotion action — which sets read_only=OFF on the new primary — directly triggers ProxySQL to route writes there without needing a separate hook call.
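
To see what the monitor itself observed, the read_only probes can be inspected in ProxySQL's monitor schema. The sketch below reads recent rows from `monitor.mysql_server_read_only_log` and lists the backends that last reported read_only=0. The credentials are placeholders, and the function names and the `PROXYSQL_ADMIN` override are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Placeholder admin credentials; -N -B gives tab-separated rows without headers.
PROXYSQL_ADMIN="${PROXYSQL_ADMIN:-mysql -h 127.0.0.1 -P 6032 -u admin -padmin_password -N -B}"

# Most recent read_only probes recorded by the ProxySQL monitor.
recent_read_only_checks() {
  $PROXYSQL_ADMIN -e "SELECT hostname, port, read_only, error FROM monitor.mysql_server_read_only_log ORDER BY time_start_us DESC LIMIT ${1:-6}"
}

# From those probes (hostname, port, read_only, error), list the writable hosts.
writable_backends() {
  recent_read_only_checks "$@" | awk -F '\t' '$3 == "0" { print $1 ":" $2 }' | sort -u
}
```

After a healthy failover, `writable_backends` should print exactly one host: the newly promoted primary. Two writable hosts suggest the old primary came back without read_only set.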


Setting Up Orchestrator + ProxySQL: Step-by-Step

Enable GTID on all MySQL nodes
GTID is not mandatory but strongly recommended. Without it, Orchestrator must calculate binary log file/position coordinates to re-point replicas, which is more error-prone.
```
# my.cnf on all nodes
[mysqld]
gtid_mode                = ON
enforce_gtid_consistency = ON
log_slave_updates        = ON
binlog_format            = ROW
```
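
Before seeding Orchestrator, it can help to confirm that every node actually runs with these values. The following sketch reads variable name and value pairs on stdin and reports anything that deviates from the my.cnf fragment above. The function name is illustrative, and the input is assumed to be the tab-separated output of the mysql client.

```bash
#!/usr/bin/env bash
# stdin: "name<whitespace>value" lines; reports settings that differ from the expected values.
check_gtid_settings() {
  awk '
    BEGIN {
      bad = 0
      want["gtid_mode"] = "ON"
      want["enforce_gtid_consistency"] = "ON"
      want["log_slave_updates"] = "ON"
      want["binlog_format"] = "ROW"
    }
    ($1 in want) {
      seen[$1] = 1
      if (toupper($2) != want[$1]) {
        printf "MISMATCH %s=%s (want %s)\n", $1, $2, want[$1]
        bad = 1
      }
    }
    END {
      for (k in want) if (!(k in seen)) { printf "MISSING %s\n", k; bad = 1 }
      exit bad
    }'
}
```

A typical invocation on each node might be `mysql -N -B -e "SHOW GLOBAL VARIABLES WHERE Variable_name IN ('gtid_mode','enforce_gtid_consistency','log_slave_updates','binlog_format')" | check_gtid_settings`, which exits non-zero if any setting is missing or wrong.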

Create the Orchestrator monitoring user on MySQL

CREATE USER 'orchestrator'@'%' IDENTIFIED BY 'strong_password';
GRANT SUPER, PROCESS, REPLICATION SLAVE, RELOAD ON *.* TO 'orchestrator'@'%';
GRANT SELECT ON mysql.slave_master_info TO 'orchestrator'@'%';

Orchestrator needs SUPER to execute CHANGE MASTER TO and toggle read_only during recovery.

Install and configure Orchestrator

Deploy three Orchestrator nodes, one per Raft member. In Raft mode each node keeps its own dedicated backend database (MySQL or SQLite) and Raft replicates state changes between them; a single shared backend belongs to the non-Raft deployment model. Key orchestrator.conf.json settings:

{
  "MySQLTopologyUser":               "orchestrator",
  "MySQLTopologyPassword":           "strong_password",
  "DiscoverByShowSlaveHosts":        true,
  "InstancePollSeconds":             5,
  "RecoveryPollSeconds":             1,
  "RecoveryPeriodBlockSeconds":      300,
  "RecoverMasterClusterFilters":     ["*"],
  "FailMasterPromotionIfSQLThreadNotUpToDate": true,
  "DetachLostReplicasAfterMasterFailover": true,
  "ApplyMySQLPromotionAfterMasterFailover": true,
  "PreventCrossDataCenterMasterFailover": false,
  "RaftEnabled":                     true,
  "RaftDataDir":                     "/var/lib/orchestrator/raft",
  "RaftBind":                        "ORCH_NODE_IP",
  "RaftNodes":                       ["ORCH_1_IP:10008","ORCH_2_IP:10008","ORCH_3_IP:10008"],
  "PostMasterFailoverProcesses": [
    "/usr/local/bin/proxysql-failover.sh {failedHost} {failedPort} {successorHost} {successorPort}"
  ]
}

Configure ProxySQL backends

-- Add all MySQL nodes to ProxySQL (run in the ProxySQL admin interface, port 6032)
INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight, max_connections)
VALUES
  (10, 'mysql-primary',   3306, 1000, 500),
  (20, 'mysql-replica-1', 3306, 100,  500),
  (20, 'mysql-replica-2', 3306, 100,  500);

-- Enable read_only-based automatic routing.
-- Skip this INSERT if a (10, 20) pair already exists: writer_hostgroup is the primary key.
INSERT INTO mysql_replication_hostgroups
  (writer_hostgroup, reader_hostgroup, comment)
VALUES (10, 20, 'mysql-ha-pair');

-- ProxySQL monitoring user: run these two statements on the MySQL primary, not in ProxySQL.
-- REPLICATION CLIENT lets the monitor read replication status for lag checks.
CREATE USER 'monitor'@'%' IDENTIFIED BY 'monitor_password';
GRANT USAGE, REPLICATION CLIENT ON *.* TO 'monitor'@'%';

-- Back in the ProxySQL admin interface: tell the monitor which credentials to use
UPDATE global_variables SET variable_value='monitor' WHERE variable_name='mysql-monitor_username';
UPDATE global_variables SET variable_value='monitor_password' WHERE variable_name='mysql-monitor_password';

LOAD MYSQL SERVERS TO RUNTIME;
LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
SAVE MYSQL VARIABLES TO DISK;

Write the post-failover hook script

Even with mysql_replication_hostgroups handling automatic routing, an explicit hook gives you control over logging, alerting, and edge cases. A minimal proxysql-failover.sh:

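#!/bin/bash
# Called by Orchestrator's PostMasterFailoverProcesses with:
#   {failedHost} {failedPort} {successorHost} {successorPort}
set -eu

FAILED_HOST="$1"
FAILED_PORT="$2"
NEW_PRIMARY="$3"
NEW_PORT="$4"

# Placeholder credentials. In production prefer --defaults-extra-file so the
# password does not appear in the process list.
PROXYSQL_ADMIN="mysql -h 127.0.0.1 -P 6032 -u admin -padmin_password"

# Move failed primary offline in ProxySQL
$PROXYSQL_ADMIN -e "UPDATE mysql_servers SET status='OFFLINE_HARD'
  WHERE hostname='$FAILED_HOST' AND port=$FAILED_PORT;"

# Ensure new primary is in writer HG. The DELETE guards against a primary-key
# conflict if a writer-HG row for the new primary already exists.
$PROXYSQL_ADMIN -e "DELETE FROM mysql_servers
  WHERE hostgroup_id=10 AND hostname='$NEW_PRIMARY' AND port=$NEW_PORT;
  UPDATE mysql_servers SET hostgroup_id=10, status='ONLINE'
  WHERE hostname='$NEW_PRIMARY' AND port=$NEW_PORT;"

$PROXYSQL_ADMIN -e "LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;"

logger "Orchestrator failover: $FAILED_HOST:$FAILED_PORT -> $NEW_PRIMARY:$NEW_PORT"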
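
Because the hook interpolates its arguments straight into SQL, it is worth validating them before any ProxySQL change is attempted. The function below is a sketch of such a guard; its name and exit codes are illustrative, and the host pattern is an assumption that would need `:` added to accept IPv6 literals.

```bash
#!/usr/bin/env bash
# Validate the four hook arguments: failedHost failedPort successorHost successorPort.
validate_failover_args() {
  local value
  if [ "$#" -ne 4 ]; then
    echo "usage: <failedHost> <failedPort> <successorHost> <successorPort>" >&2
    return 64
  fi
  for value in "$2" "$4"; do
    case "$value" in
      ''|*[!0-9]*) echo "port is not numeric: '$value'" >&2; return 65 ;;
    esac
  done
  for value in "$1" "$3"; do
    case "$value" in
      ''|*[!A-Za-z0-9._-]*) echo "unexpected characters in host: '$value'" >&2; return 65 ;;
    esac
  done
  if [ "$1:$2" = "$3:$4" ]; then
    echo "successor equals failed host: $1:$2" >&2
    return 65
  fi
  return 0
}
```

Calling `validate_failover_args "$@" || exit $?` as the first step of the hook makes a misconfigured placeholder list fail loudly in Orchestrator's log instead of issuing a malformed UPDATE.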

Discover and verify the topology

# Seed the first node — Orchestrator will follow replication links to find replicas
orchestrator-client -c discover -i mysql-primary:3306

# Verify topology is detected correctly
orchestrator-client -c topology -i mysql-primary:3306

The topology output should show the primary with all replicas nested beneath it. If any node is missing, check firewall rules between Orchestrator nodes and MySQL nodes.
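
The "is any node missing" check can be scripted. This sketch reads the topology output on stdin and reports any expected node that does not appear in it. It assumes each instance is printed as host:port somewhere on its own line; the function name is illustrative.

```bash
#!/usr/bin/env bash
# stdin: output of `orchestrator-client -c topology`; args: expected host:port entries.
missing_from_topology() {
  local topology status node
  topology="$(cat)"
  status=0
  for node in "$@"; do
    if ! printf '%s\n' "$topology" | grep -Fq -- "$node"; then
      echo "missing: $node"
      status=1
    fi
  done
  return "$status"
}
```

For example: `orchestrator-client -c topology -i mysql-primary:3306 | missing_from_topology mysql-primary:3306 mysql-replica-1:3306 mysql-replica-2:3306`. A non-zero exit status means at least one node was not discovered.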

Promotion Rules and What Breaks Failover

Orchestrator evaluates candidates against a set of configurable promotion rules before selecting a new primary. Understanding these prevents scenarios where Orchestrator detects a failure but refuses to promote any candidate.

Config Parameter	Effect	Recommendation
`FailMasterPromotionIfSQLThreadNotUpToDate`	Blocks promotion if SQL thread has unapplied relay logs	Set `true` for data safety; accept slightly longer failover time
`DelayMasterPromotionIfSQLThreadNotUpToDate`	Waits for SQL thread to catch up instead of blocking outright	Use if tolerable; capped by `ReasonableReplicationLagSeconds`
`RecoveryPeriodBlockSeconds`	Blocks a second recovery for N seconds after one completes	300s prevents cascade promotions; too high extends downtime on back-to-back failures
`PreventCrossDataCenterMasterFailover`	Refuses to promote a replica in a different DC	Enable only if cross-DC write latency is unacceptable
`DetachLostReplicasAfterMasterFailover`	Stops replication on replicas that were unreachable during failover	Enable — prevents replicas from reattaching to a potentially outdated source

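Silent failure pattern: Orchestrator logs a recovery attempt but no writable primary results. One common cause is ApplyMySQLPromotionAfterMasterFailover set to false: every replica runs with read_only=ON, so without this setting the promoted server stays read-only and rejects writes. Another is an empty RecoverMasterClusterFilters, in which case Orchestrator detects the failure but never starts automated recovery. Verify that ApplyMySQLPromotionAfterMasterFailover is true, that RecoverMasterClusterFilters matches your cluster, and that the Orchestrator user has SUPER privilege on all backends.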
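
Beyond the configuration file, per-instance promotion preferences can be registered at runtime with `orchestrator-client -c register-candidate`. The wrapper below is a sketch that rejects unknown rule names before calling the CLI; the function name and the `ORCH_CLIENT` override are assumptions for illustration. Registered preferences are understood to expire after a while, so in practice such a command is usually scheduled (for example from cron) on each MySQL host.

```bash
#!/usr/bin/env bash
# ORCH_CLIENT is a hypothetical override so the helper can be dry-run in tests.
ORCH_CLIENT="${ORCH_CLIENT:-orchestrator-client}"

# Register how willing Orchestrator should be to promote the given instance.
set_promotion_rule() {
  local instance="$1" rule="$2"
  case "$rule" in
    prefer|neutral|prefer_not|must_not) ;;
    *) echo "unknown promotion rule: $rule" >&2; return 2 ;;
  esac
  $ORCH_CLIENT -c register-candidate -i "$instance" --promotion-rule "$rule"
}
```

A typical split is `prefer` for a replica in the primary's data centre and `must_not` for a delayed or backup replica that should never take writes.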

Testing Your HA Setup Before It Matters

A failover stack you've never tested is not a failover stack. These are the three tests Mafiree runs on every new Orchestrator + ProxySQL deployment before signing off:

Test 1: Controlled Primary Failure

# Simulate primary loss — stop MySQL on the primary node
systemctl stop mysql   # on primary

# Watch Orchestrator detect and respond
orchestrator-client -c topology -i mysql-replica-1:3306

# Verify ProxySQL has routed writes to the new primary.
# SHOW is not matched by the SELECT rules, so it is sent to the writer hostgroup.
mysql -h proxysql-vip -P 6033 -u app_user -p -e "SHOW VARIABLES LIKE 'hostname';"

Test 2: Measure Actual RTO

Run a continuous write loop through ProxySQL and timestamp any connection errors. Total downtime from primary failure to first successful write through ProxySQL is your actual RTO. With Orchestrator's defaults, expect 15–45 seconds depending on InstancePollSeconds and SQL thread catch-up time.
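
A minimal sketch of that measurement follows. `write_probe_loop` emits one timestamped OK or FAIL line per second, and `measure_rto` computes the gap between the first failed write and the first successful write after it. The `ha_test.heartbeat` table, the `MYSQL_APP` variable, and both function names are hypothetical; the client password is assumed to come from an option file.

```bash
#!/usr/bin/env bash
# Hypothetical client command; proxysql-vip and app_user follow the test above.
MYSQL_APP="${MYSQL_APP:-mysql -h proxysql-vip -P 6033 -u app_user --connect-timeout=2}"

# Emit "<epoch_seconds> OK|FAIL" once per second (runs until interrupted).
write_probe_loop() {
  while true; do
    if $MYSQL_APP -e "INSERT INTO ha_test.heartbeat (ts) VALUES (NOW(6))" >/dev/null 2>&1; then
      echo "$(date +%s) OK"
    else
      echo "$(date +%s) FAIL"
    fi
    sleep 1
  done
}

# stdin: probe lines in time order.
# Prints seconds from the first failed write to the first successful write after it.
measure_rto() {
  awk '
    $2 == "FAIL" && !down { down = 1; start = $1 }
    $2 == "OK" && down && !recovered { print $1 - start; recovered = 1 }
    END { if (!down) print 0; else if (!recovered) print "unrecovered" }
  '
}
```

Run `write_probe_loop | tee writes.log` in one terminal, stop the primary, then run `measure_rto < writes.log` once writes succeed again. The result has one-second resolution and starts counting at the first failed write rather than at the instant the primary died, so treat it as a close lower bound.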

Test 3: Verify Replica Re-pointing

After failover, confirm all replicas are replicating from the new primary:

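-- On each replica after failover
-- (on MySQL 8.0.22+ use SHOW REPLICA STATUS\G; the fields are Source_Host and Seconds_Behind_Source)
SHOW SLAVE STATUS\G
-- Master_Host should show the new primary's IP
-- Slave_IO_Running and Slave_SQL_Running should both be Yes
-- Seconds_Behind_Master should be 0 or near 0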
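
With more than a couple of replicas, checking this by eye gets tedious. The sketch below parses the vertical (`\G`) status output from stdin and verifies the source host, both replication threads, and the lag. The function name and the default lag threshold of 5 seconds are assumptions; both the older and the newer field names are accepted.

```bash
#!/usr/bin/env bash
# stdin: SHOW SLAVE STATUS\G or SHOW REPLICA STATUS\G output.
# $1: expected source host; $2: maximum acceptable lag in seconds (default 5).
check_replica_status() {
  awk -v want="$1" -v max_lag="${2:-5}" '
    $1 == "Master_Host:" || $1 == "Source_Host:"                      { host = $2 }
    $1 == "Slave_IO_Running:" || $1 == "Replica_IO_Running:"          { io = $2 }
    $1 == "Slave_SQL_Running:" || $1 == "Replica_SQL_Running:"        { sql = $2 }
    $1 == "Seconds_Behind_Master:" || $1 == "Seconds_Behind_Source:"  { lag = $2 }
    END {
      ok = 1
      if (host != want) { print "wrong source: " host; ok = 0 }
      if (io != "Yes" || sql != "Yes") { print "threads not running: IO=" io " SQL=" sql; ok = 0 }
      if (lag == "" || lag == "NULL" || lag + 0 > max_lag + 0) { print "lag too high or unknown: " lag; ok = 0 }
      exit !ok
    }'
}
```

Usage on each replica might look like `mysql -e 'SHOW SLAVE STATUS\G' | check_replica_status mysql-replica-1`, where the argument must match the source exactly as the replica reports it (hostname or IP).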

Conclusion

Orchestrator + ProxySQL gives MySQL deployments the automated failover they need without forcing a migration to synchronous replication. The combination handles the full recovery cycle — detection, candidate selection, promotion, replica re-pointing, and traffic rerouting — in under a minute on a well-configured stack.

The non-obvious details are what separate a functional setup from one that works reliably under production conditions: running Orchestrator in Raft mode rather than single-node, configuring mysql_replication_hostgroups in ProxySQL for automatic read-only-based routing, testing the full failover path before relying on it, and understanding exactly which promotion rules can silently block recovery.

Mafiree's team manages MySQL high availability environments for clients across financial services, logistics, and e-commerce — environments where a 20-minute manual failover is not acceptable. If your MySQL tier doesn't yet have automated failover, or if you've set up Orchestrator but haven't validated it end-to-end, our MySQL HA services can help you get there.

FAQ

Orchestrator + ProxySQL is an automated failover architecture for MySQL using asynchronous replication. Orchestrator monitors the MySQL topology and automatically promotes a replica to primary when the current primary fails. ProxySQL routes application traffic to the correct backend, updating its routing tables when the topology changes. Together, they deliver sub-minute RTO without synchronous replication overhead.

Galera and Orchestrator solve different problems. Galera provides synchronous, multi-master replication with no data loss on failover (RPO = 0) but adds write latency and certification conflict overhead. Orchestrator with asynchronous replication has a small RPO window (equal to replication lag at failure time) but zero write latency overhead and no certification conflicts. For most OLTP workloads where a few seconds of potential data loss is acceptable, Orchestrator + ProxySQL is the better trade-off.

Detection time depends on InstancePollSeconds (default: 5s) and RecoveryPollSeconds (default: 1s). At Mafiree's standard 5-second poll interval, primary failure is typically confirmed within 10–15 seconds. Total failover time (detection + candidate selection + promotion + ProxySQL update) ranges from 20 to 60 seconds on a well-configured stack.

Author Bio

sukan

Sukan is Database Team Lead at Mafiree with over a decade of experience in database systems, architecture, and performance optimization. He specializes in MySQL, MongoDB, TiDB, and ClickHouse, developing architectural improvements that make data platforms faster, more efficient, and cost-effective. Sukan writes about practical database engineering topics, real-world performance tuning, data replication, and high-scale system design, drawing from extensive hands-on experience solving complex technical challenges.
