pt-osc --alter-foreign-keys-method's drop_swap fallback issue

pt-online-schema-change (hereafter: pt-osc) is a tool that enables online schema changes in MySQL and is included in the Percona Toolkit. Our current team has been using it for a while, and I often deal with performance tuning through pt-osc configuration changes or debugging errors that occur during online migrations.

Going back to last summer, we received reports from the developer team that requests from the application to the database were occasionally failing during online migrations using pt-osc. It appeared that some mysterious downtime was occurring.

I was assigned to investigate the issue, and in conclusion, we discovered that when migrating tables with foreign keys, pt-osc's design in certain cases inevitably caused downtime.

This blog post explains the details of this issue and provides background knowledge on pt-osc's internal behavior to help understand the root cause.

Link: soure code

How pt-osc Achieves Online Schema Changes

pt-osc’s basic strategy involves creating a temporary copy of the target table, using triggers to capture ongoing changes, and gradually copying data from the original table to the new table based on row counts calculated statically or dynamically using --chunk-time or --chunk-size. This approach is relatively straightforward:

  1. Create an empty copy of the table to be altered.
  2. Create triggers on the original table.
  3. Copy rows from the original table to the new table in small chunks.
  4. RENAME the table.
  5. Drop the original table.

Handling Foreign Keys

Issues arise when the original table is referenced by foreign keys from child tables. A key challenge is determining the right moment to switch these references from the old table to the new table.

Let’s examine how pt-osc behaves when handling foreign keys. Suppose we have a foo table referenced by a child table bar. In this case, pt-osc follows a specific approach to modify the foo table while preserving foreign key constraints.

pt-osc Internal Behaviour

The method for updating the foreign key references in child tables is determined by the --alter-foreign-keys-method option:

  • rebuild_constraints
  • drop_swap
  • auto ... Dynamically chooses between rebuild_constraints and drop_swap
  • none ... Leaves foreign key references pointing to the dropped table, which no longer exists

Using rebuild_constraints is generally recommended because it updates the schema while preserving constraints by applying ALTER statements to child tables.

On the other hand, the drop_swap strategy simply drops the old table and renames the new one, bypassing ALTER operations. While this might work under ideal conditions, it does not guarantee data integrity if updates or losses occur during the process.

Furthermore, as stated in the documentation, drop_swap introduces a brief downtime between dropping the old table and renaming the new one:

This method is faster and does not block, but it is riskier for two reasons. First, for a short time between dropping the original table and renaming the temporary table, the table to be altered simply does not exist, and queries against it will result in an error.

This behavior is expected, given the nature of the process. If the application sends queries during this short window, they will fail, making it necessary to implement appropriate retry mechanisms.

rebuild_constraints Fallbacks to drop_swap

The issue is that even when rebuild_constraints is explicitly specified, there are cases where pt-osc falls back to drop_swap.

I first discovered this behavior while reading this blog post, which contained the following statement:

However, if the table t2 is too large, the alter operation may take too long, and drop_swap may be forced. The main methods involved are determine_alter_fk_method, rebuild_constraints, and swap_tables of pt-online-schema-change file.

The logic for determining --alter-foreign-keys-method is found in sub determine_alter_fk_method():

Upon examining the source code, I confirmed that if the total number of rows ($n_rows) exceeds a dynamically calculated threshold ($max_rows), pt-osc will always fall back to drop_swap, regardless of the specified --alter-foreign-keys-method.

The reasoning is that altering large tables takes time, and drop_swap provides a faster table switch. However, in our case, despite explicitly setting rebuild_constraints, our application experienced unexplained downtime, which we eventually traced back to this unexpected fallback behavior.

Alternatives

Handling online schema migrations for large tables with foreign keys remains a challenging problem.

Similar tools from GitHub, Facebook, and SoundCloud exist for online schema migrations, but a quick review suggests that many do not support tables with foreign keys at all:

If you know of any tools that support foreign keys or if this information is outdated due to recent developments, please let me know!

2021-01-14