This post explores the downtime that occurs during Amazon RDS Aurora MySQL failover, the resulting application errors when combined with Rails' connection pool, and potential solutions to mitigate these issues.
When setting up a cluster with Amazon RDS Aurora MySQL (hereafter: Aurora MySQL), there are two replication strategies available:
Single-Master Replication
This is the most commonly used replication method: a single node is designated as the "Writer" (also known as Primary/Leader/Master, etc.) and handles all write operations. The remaining nodes act as "Readers" (also referred to as Secondary/Follower/Slave/Replica, etc.), asynchronously replicating data from the Writer node through mechanisms such as triggers or redo logs and serving read operations. While write operations can become a bottleneck at the single Writer node, read-heavy workloads can be scaled horizontally by adding more Reader nodes as demand grows.
Multi-Master Replication
In this strategy, multiple (or all) nodes can accept write operations. Writes are distributed using hash algorithms or similar methods, and changes are asynchronously propagated to the other nodes. However, Multi-Master Replication is not a silver bullet: it is effective only for specific workloads and is not necessarily superior to Single-Master Replication.
NOTE: In general, there are other replication strategies, such as Leaderless Replication, but they are not relevant to Aurora MySQL.
That said, Single-Master Replication has an unavoidable downside: failover.
In a Single-Master cluster, if the Writer node suffers a hardware failure or needs to be restarted for a MySQL engine version update, there is a temporary window in which no Writer node exists. In other words, downtime occurs. More precisely, one of the Reader nodes is selected and promoted to Writer, but a brief period of downtime during this transition is unavoidable.
To address this, Aurora MySQL provides a cluster endpoint and a reader endpoint, which let applications connect without being coupled to specific instances. When a failover occurs, Aurora MySQL handles the transition transparently by switching the DNS records behind these endpoints to point at the new Writer and Reader nodes.
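For example, a Rails application typically points ActiveRecord at the cluster endpoint rather than at an individual instance endpoint. Here is a minimal sketch; the hostname, credentials, and database name are placeholders, not values from any real setup:

```ruby
require "active_record"

# Connect through the cluster (writer) endpoint. On failover, Aurora repoints
# this DNS name at the newly promoted Writer, so the application configuration
# itself does not have to change.
ActiveRecord::Base.establish_connection(
  adapter:  "mysql2",
  host:     "my-cluster.cluster-xxxxxxxx.ap-northeast-1.rds.amazonaws.com", # placeholder
  username: "app",
  password: ENV["DB_PASSWORD"],
  database: "app_production"
)
```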
Most modern applications, including those built with Rails, implement optimizations such as Connection Pools to efficiently reuse database connections.
In Rails (ActiveRecord), once a connection is established, it is reused within the Connection Pool until a new thread or process is created. This optimization improves performance in most cases, as it eliminates the need for repeated DNS lookups when connecting to the database.
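As a rough illustration of this reuse (the exact internals vary by Rails version), the connection object handed to a given thread stays the same across calls:

```ruby
# Within a single thread, ActiveRecord keeps returning the same checked-out
# connection from the pool, so no new TCP connection (and no new DNS lookup)
# is made for each query.
first  = ActiveRecord::Base.connection
second = ActiveRecord::Base.connection

first.equal?(second) # => true: the same physical connection is reused
```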
However, this optimization becomes problematic during a failover of an Aurora MySQL Single-Master cluster. Internally, Aurora MySQL promotes a new Writer node and updates the IP addresses returned by the cluster and reader endpoints, but because the connection to the old Writer node is still being reused from the Connection Pool, the application keeps trying to talk to the outdated node, resulting in application errors.
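The symptom typically looks like a write query failing on a pooled connection that now points at a demoted, read-only node. The following is an illustrative example, not a captured log:

```ruby
# The pool hands back a connection whose TCP session still targets the old
# Writer, which has become a read-only Reader, so write statements are rejected.
ActiveRecord::Base.connection.execute("UPDATE posts SET title = 'hi' WHERE id = 1")
# => ActiveRecord::StatementInvalid (Mysql2::Error:
#    The MySQL server is running with the --read-only option so it cannot execute this statement)
```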
There are several ways to resolve this issue. Let's examine each approach:
winebarrel/active_record_mysql_xverify
In my team, we use the active_record_mysql_xverify gem, developed and maintained by our SRE team. (Other available options include mysql2-aurora and activerecord-refresh_connection.)
This gem overrides ActiveRecord::ConnectionAdapters::Mysql2Adapter#active?. When a query on a connection in the Connection Pool fails, it checks the @@innodb_read_only variable to determine whether the connected node is a Writer or a Reader. This check is the method recommended in the Aurora MySQL documentation:
To determine which DB instance in an Aurora MySQL DB cluster a connection is connected to, check the innodb_read_only global variable, as shown in the following example.
SHOW GLOBAL VARIABLES LIKE 'innodb_read_only';
The innodb_read_only variable is set to ON if you are connected to a reader DB instance. This setting is OFF if you are connected to a writer DB instance, such as the primary instance in a provisioned cluster.
If my understanding is correct, this approach should not only handle failover events triggered by engine updates (vertical scaling) but also resolve connection errors caused by DNS caching when scaling in (horizontal scaling).
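As a conceptual sketch of that idea (this is not the gem's actual implementation; the module name and wiring below are invented for illustration), the check can be folded into active? so that a connection pointing at a node that is no longer a Writer is treated as dead and replaced by the pool:

```ruby
# Conceptual sketch only: mark a pooled connection as inactive when the node
# it points at is no longer a Writer, so ActiveRecord's verify/reconnect logic
# replaces it (and re-resolves the endpoint's DNS) instead of reusing it.
module WriterAwareActive
  def active?
    return false unless super

    row = raw_connection.query("SELECT @@innodb_read_only AS ro").first
    row.nil? || row["ro"].to_i == 0 # only a Writer connection counts as alive
  rescue Mysql2::Error
    false
  end
end

# Hypothetical wiring; a real implementation would apply this check only to
# connections that are expected to be Writers (e.g. via the cluster endpoint):
# ActiveRecord::ConnectionAdapters::Mysql2Adapter.prepend(WriterAwareActive)
```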