pt-online-schema-change
(hereafter: pt-osc) is a tool that enables online schema changes in MySQL and is included in the Percona Toolkit. Our current team has been using it for a while, and I often deal with performance tuning through pt-osc configuration changes or debugging errors that occur during online migrations.
Going back to last summer, we received reports from the developer team that requests from the application to the database were occasionally failing during online migrations using pt-osc. It appeared that some mysterious downtime was occurring.
I was assigned to investigate the issue, and in conclusion, we discovered that when migrating tables with foreign keys, pt-osc's design in certain cases inevitably caused downtime.
This blog post explains the details of this issue and provides background knowledge on pt-osc's internal behavior to help understand the root cause.
Link: soure code
pt-osc’s basic strategy involves creating a temporary copy of the target table, using triggers to capture ongoing changes, and gradually copying data from the original table to the new table based on row counts calculated statically or dynamically using --chunk-time
or --chunk-size
. This approach is relatively straightforward:
RENAME
the table.Issues arise when the original table is referenced by foreign keys from child tables. A key challenge is determining the right moment to switch these references from the old table to the new table.
Let’s examine how pt-osc behaves when handling foreign keys. Suppose we have a foo
table referenced by a child table bar
. In this case, pt-osc follows a specific approach to modify the foo
table while preserving foreign key constraints.
The method for updating the foreign key references in child tables is determined by the --alter-foreign-keys-method
option:
rebuild_constraints
drop_swap
auto
... Dynamically chooses between rebuild_constraints
and drop_swap
none
... Leaves foreign key references pointing to the dropped table, which no longer existsUsing rebuild_constraints
is generally recommended because it updates the schema while preserving constraints by applying ALTER
statements to child tables.
On the other hand, the drop_swap
strategy simply drops the old table and renames the new one, bypassing ALTER
operations. While this might work under ideal conditions, it does not guarantee data integrity if updates or losses occur during the process.
Furthermore, as stated in the documentation, drop_swap
introduces a brief downtime between dropping the old table and renaming the new one:
This method is faster and does not block, but it is riskier for two reasons. First, for a short time between dropping the original table and renaming the temporary table, the table to be altered simply does not exist, and queries against it will result in an error.
This behavior is expected, given the nature of the process. If the application sends queries during this short window, they will fail, making it necessary to implement appropriate retry mechanisms.
The issue is that even when rebuild_constraints
is explicitly specified, there are cases where pt-osc falls back to drop_swap
.
I first discovered this behavior while reading this blog post, which contained the following statement:
However, if the table t2 is too large, the alter operation may take too long, and drop_swap may be forced. The main methods involved are determine_alter_fk_method, rebuild_constraints, and swap_tables of pt-online-schema-change file.
The logic for determining --alter-foreign-keys-method
is found in sub determine_alter_fk_method()
:
Upon examining the source code, I confirmed that if the total number of rows ($n_rows
) exceeds a dynamically calculated threshold ($max_rows
), pt-osc will always fall back to drop_swap
, regardless of the specified --alter-foreign-keys-method
.
rebuild_constraints
is the default behavior.$n_rows > $max_rows
, it falls back to drop_swap
.The reasoning is that altering large tables takes time, and drop_swap
provides a faster table switch. However, in our case, despite explicitly setting rebuild_constraints
, our application experienced unexplained downtime, which we eventually traced back to this unexpected fallback behavior.
Handling online schema migrations for large tables with foreign keys remains a challenging problem.
Similar tools from GitHub, Facebook, and SoundCloud exist for online schema migrations, but a quick review suggests that many do not support tables with foreign keys at all:
If you know of any tools that support foreign keys or if this information is outdated due to recent developments, please let me know!