Amazon RDS Aurora MySQL Failover and Rails Connection Pool

This post explores the downtime that occurs during Amazon RDS Aurora MySQL failover, the resulting application errors when combined with Rails' connection pool, and potential solutions to mitigate these issues.

Aurora MySQL Replication Strategy

When setting up a cluster with Amazon RDS Aurora MySQL (hereafter: Aurora MySQL), there are two replication strategies available:

  1. Single-Master Replication
    This is a commonly used replication method where a single node is designated as the "Writer" (also known as Primary/Leader/Master, etc.), handling all writable operations. Other nodes function as "Readers" (also referred to as Secondary/Follower/Slave/Replica, etc.), asynchronously replicating data from the Writer node using mechanisms like triggers or redo logs and handling read operations. While writable operations can become a bottleneck at the single Writer node, read-heavy workloads can be scaled horizontally by adding more Reader nodes to accommodate increasing demand.

  2. Multi-Master Replication
    In this strategy, multiple or all nodes can handle writable operations. Data writes are distributed using hash algorithms or other methods, and changes are asynchronously propagated to other nodes. However, Multi-Master Replication is not a silver bullet and is effective only for specific workloads. It is not necessarily superior to Single-Master Replication in all cases.

NOTE: In general, there are other replication strategies, such as Leaderless Replication, but they are not relevant to Aurora MySQL.

That said, Single-Master Replication has an unavoidable downside: failover.

In a Single-Master cluster, if the Writer node experiences hardware failure or needs to be restarted for a MySQL engine version update, there will be a temporary moment where no Writer node exists. In other words, downtime occurs. More precisely, another selected Reader node will be promoted to Writer, but a brief period of downtime is inevitable during this transition.

To address this, Aurora MySQL provides cluster endpoints and reader endpoints, which allow applications to connect in a decoupled manner. When a failover occurs, Aurora MySQL transparently handles the transition by switching the DNS records of Writer and Reader nodes.

Rails Connection Pool

Most modern applications, including those built with Rails, implement optimizations such as Connection Pools to efficiently reuse database connections.

In Rails (ActiveRecord), once a connection is established, it is reused within the Connection Pool until a new thread or process is created. This optimization improves performance in most cases, as it eliminates the need for repeated DNS lookups when connecting to the database.

However, this optimization becomes problematic during an Aurora MySQL Single-Master cluster failover. Internally, Aurora MySQL transitions the old Writer node to a new Writer node, updating the IP addresses returned by the cluster or reader endpoints. However, because the old Writer node's connection is still being reused from the Connection Pool, the application continues trying to connect to the outdated node, leading to application errors.

Solution

There are several ways to resolve this issue:

  1. Restart the application (e.g., via deployment) to force connections to the new Writer node.
  2. Disable the Connection Pool.
  3. Migrate to a Multi-Master Replication Cluster.
  4. Implement application-layer logic to reconnect when a connection to the Writer node fails.

Let's examine each approach:

  • Restarting the application every time a failover occurs is unrealistic. Failovers can happen unexpectedly due to hardware failures, not just during planned maintenance, making this an unreliable solution.
  • Disabling the Connection Pool eliminates errors during failover but sacrifices performance optimizations that are beneficial in normal operation. This would be a case of missing the forest for the trees.
  • Migrating to Multi-Master Replication is not always the best solution. The decision between Single-Master and Multi-Master replication should be based on business requirements and workload characteristics, not just failover behavior.
  • Implementing re-connection logic at the application layer is the most practical approach. In fact, this is the approach taken by my team.

Using winebarrel/active_record_mysql_xverify

In my team, we use the active_record_mysql_xverify gem, developed and maintained by our SRE team. (Other available options include mysql2-aurora and activerecord-refresh_connection).

This gem overrides ActiveRecord::ConnectionAdapters::Mysql2Adapter#active?. When a query to a connection in the Connection Pool fails, it checks the @@innodb_read_only variable to determine whether the connected node is a Writer or Reader. This method is recommended in the Aurora MySQL documentation.

To determine which DB instance in an Aurora MySQL DB cluster a connection is connected to, check the innodb_read_only global variable, as shown in the following example. SHOW GLOBAL VARIABLES LIKE 'innodb_read_only'; The innodb_read_only variable is set to ON if you are connected to a reader DB instance. This setting is OFF if you are connected to a writer DB instance, such as primary instance in a provisioned cluster.

If my understanding is correct, this approach should not only handle failover events triggered by engine updates (vertical scaling) but also resolve connection errors caused by DNS caching when scaling in (horizontal scaling).

2021-01-23