SQLite non-GVL-blocking, fair retry interval busy handler #51958
Conversation
Force-pushed from e09a702 to 800069b.
Also, some of the CI failures are legit (RuboCop & Active Record); you can ignore the Action Text failure, though.
Force-pushed from 800069b to 1a1b94c.
... Copilot and those damned hanging parens.
Force-pushed from 1a1b94c to dec9f1c.
I'm guessing that the guides failures are related, but the errors don't really have any messages, so I'm not sure what the problem is and thus how to possibly fix it. But the other issues are fixed.

Edit: oh, you already got that one... Maybe this one?

`>=` is not in the error message...

OH, the bug report template uses …

Ok. I'll simplify the changes for the guides to simply use …
Force-pushed from 432da1b to 111d8a3.
So releasing the GVL when waiting for the write lock is a bit of a double-edged sword, in my opinion. On one hand, it does allow other threads to run, which is certainly a good thing for the throughput of an individual process in most cases. On the other hand, it can cause a situation where you have acquired the lock but then have to wait a very long time to reacquire the GVL. So potentially, you are occasionally locking the database for much longer than before; e.g., if you are running 10 threads, you can potentially wait up to 900ms to get the GVL back, and all this time you are holding the lock.

In other words, I'm in favor of this change, but have some small reservations.

@fractaledmind is the code you used for your benchmark available somewhere? I'd like to experiment with it if I find the time (apologies if it's in your blog post and I missed the link).
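As an aside for readers: the reason a sleeping handler releases the GVL is that the wait happens in Ruby rather than inside SQLite's C-level busy_timeout. A minimal sketch using the `sqlite3` gem's `busy_handler` API (the interval and cutoff are illustrative, not the gem's actual implementation):

require "sqlite3"

db = SQLite3::Database.new("app.sqlite3")

# The block is invoked with the number of prior retries for this lock event.
# Kernel#sleep releases the GVL, so other threads can run while this one waits.
db.busy_handler do |count|
  sleep(0.001)        # wait 1ms between attempts
  count * 0.001 < 5.0 # return falsy after ~5s so SQLite gives up with SQLITE_BUSY
end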
I have been using the …
@fractaledmind - I've just tried this out on Campfire and it looks good. We are currently using a retry with backoff, but I'm not seeing any performance differences with our load tests with this setting. Did you try out the effects of a small amount of retry backoff in your tests? I wonder if we could get higher throughput without affecting the high percentiles too negatively.
+1 on these changes, so that we can make use of the new version of the `sqlite3` gem. Will this likely be slated for Rails 7.2.0?
@djmb: In the original PR, we benchmarked a number of different implementations of the busy handler. I don't immediately follow how adding retry backoff could increase throughput, though? Fill me in on your thinking here?

One thing I needed some time to wrap my head around is how this "queue" behaves differently than the type of queue I was familiar with (a background job queue), and what that difference implies when designing a busy handler. In a typical background job queue, when you say "retry in 1 second", that job will be executed in 1 second. The "retry" is a command that will be honored. When you can command that a job is retried after a particular interval, you don't need to worry about "older" jobs (jobs that have been retried more times) having any sort of penalty relative to "newer" jobs, and so having some kind of backoff steps makes sense to distribute execution load.

But with this "queue", we aren't telling SQLite "retry this query in 1ms" as a command; we are more so saying "you can attempt to run this query again in 1ms". There is no guarantee that the query will actually run, since it has to attempt to acquire the write lock before running. When you switch from commanding a job to run in the future to requesting that a query try to run in the future, you now have to deal with how a backoff starts to penalize "older" queries.

I wrote an in-depth article on this topic, but the TL;DR is that having backoff has a noticeable impact on long-tail latency (p99.99 from my load testing), while the "fair" retry interval from the new implementation does not. [latency charts omitted]

And since we are retrying all queries within 1ms, I don't immediately see how we could increase throughput by retrying older queries after some increased duration. When there are fewer queries, this will have no impact on throughput but will only increase latency a bit, since there won't be enough contention on the write lock to prevent the older query from running; it will simply run after N ms instead of 1ms. But when there are more queries, this will increase latency a lot, and I still don't see how it would increase throughput, as the same number of queries will be executed. Since writes are executed serially, throughput is always constant, right?

Again, I'm probably missing something, but this is my reasoning as I understand things. Throughput is fixed, but we can improve latency by ensuring that the busy handler [1] doesn't block the GVL and [2] has a fair retry interval.
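To make the two retry policies concrete, here is a hedged Ruby sketch (simplified, not the `sqlite3` gem's actual implementation) of a backoff handler versus a constant "fair" interval handler:

require "sqlite3"

db = SQLite3::Database.new("app.sqlite3")

# Backoff: each retry waits longer than the last, so a query that has already
# retried many times polls less often and keeps losing the race for the write
# lock to fresh queries polling every 1ms; this is the long-tail latency penalty.
backoff_handler = proc do |count|
  sleep([0.001 * (2**count), 0.128].min) # 1ms, 2ms, 4ms, ... capped at 128ms
  count < 30                             # give up eventually (falsy => SQLITE_BUSY)
end

# Fair: every waiting query retries on the same 1ms cadence regardless of how
# long it has been waiting, so old and new queries have equal odds on each poll.
fair_handler = proc do |count|
  sleep(0.001)
  count * 0.001 < 5.0 # stop retrying after roughly the configured timeout
end

db.busy_handler(&fair_handler)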
Force-pushed from fb7a813 to 999062d.
@fractaledmind - my concern would be where you have a lot of threads/processes all waiting to get a lock and a single thread that is holding on to the transaction for a long time. Could the polling eat enough resources to cause the thread with the transaction to slow down (especially for polling by threads on the same process, as they will need the GVL)?
@djmb: I will try to create a test case to benchmark this.
To consider the impact of polling for the lock, we (myself and @oldmoe) wanted to explore the following three scenarios:

1. the parent connection runs no slow queries, so all contention comes from the fast queries run by the child connections;
2. the parent connection runs one slow query while the child connections run fast queries;
3. the parent connection runs two slow queries with a 1ms gap between them.
In order to explore a wide range of situations, we had the contentious fast queries run from a composition of processes and threads. The slow queries were always run from the single parent process in succession. You can find the code used to run the various scenarios in this gist. Both modes of the busy handler (constant interval and backoff) were exercised. We play with 3 variables to create the various contention situations (a simplified sketch of this kind of harness follows the list):

1. the number of child processes,
2. the number of threads per process, and
3. the number of slow queries run by the parent.
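Here is a simplified sketch of that kind of harness (the table name, file name, and counts are made up for illustration; the real scripts are in the linked gist):

require "sqlite3"

DB_PATH  = "bench.sqlite3" # hypothetical
CHILDREN = 4               # variable 1: child processes
THREADS  = 2               # variable 2: threads per process
SLOW     = 2               # variable 3: slow queries run by the parent

def connect
  db = SQLite3::Database.new(DB_PATH)
  db.busy_handler_timeout = 5_000 # the new fair, GVL-releasing handler (ms)
  db
end

connect.execute("CREATE TABLE IF NOT EXISTS t (n INTEGER)")

pids = CHILDREN.times.map do
  fork do
    sleep 0.2 # give the parent a head start at taking the write lock
    Array.new(THREADS) do
      Thread.new do
        db = connect
        100.times { db.execute("INSERT INTO t (n) VALUES (1)") } # fast queries
      end
    end.each(&:join)
  end
end

parent = connect
SLOW.times do
  # An IMMEDIATE transaction holds the write lock for the whole block.
  parent.transaction(:immediate) do
    parent.execute("UPDATE t SET n = n + 1")
    sleep 0.5 # simulate a slow query while holding the lock
  end
  sleep 0.001 # the 1ms gap between slow queries (scenario 3)
end

pids.each { |pid| Process.wait(pid) }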
The first situation has the parent process run zero slow queries, so all contention comes from the fast queries that the child connections are running. The ideal case, regardless of the number of child processes and/or threads, is that the script, parent connection, fastest child connection, and slowest child connection all finish in very short durations. This is because the only queries being run inside of the transactions are fast queries.
For the … mode, these are the results: [results output omitted]

In the worst case, …

The second scenario has the parent process run one slow query, so all contention comes from the fast queries that the child connections are running. The ideal case is that the gap between the duration of the parent connection and the script duration is minimal, and that the gap between the fastest and slowest child connections is minimal. This is because the total script execution time should not be majorly impacted by the child processes (thus the small first gap), and the child connections should not have any unfairly punished executions, since they are all running only fast queries. These are the results we see for the … mode: [results output omitted]
What we observe is that the gap between … While the gap between …

In the third scenario, where the parent connection is running two slow queries with a 1ms gap between executions, we hope to see that the fast child queries are able to acquire the lock between the two slow queries running in the parent. We then similarly want to see that the gap between the fastest and slowest child connections is minimal (though it will be larger than in the previous scenarios, because the slowest child connection will be the one that has to wait for both slow parent queries to finish). Here are the results for the … mode: [results output omitted]
We see that indeed the fastest child connections remain fast, while the slowest child connections are only slightly faster than the parent connection (by the 200ms that we have the child processes sleep before beginning). For the … mode: [results output omitted]
So, I believe it is fair to say that across all three situations, the … If we find a "mode" for the … If you want, you can view all of the raw results in the full output file.
I feel like upgrading sqlite should be a separate, worthwhile PR.
@fractaledmind - thanks for that! I've tried out your script with some adjustments: 1. do everything in one process so all the threads are contending for the GVL; 2. get the thread running the slow query to do some CPU-intensive work while holding the transaction. The idea is to see if the CPU-intensive work is slowed down by the other threads waking to retry. I'm not seeing any significant differences in timings between constant/backoff modes doing that, so looks great to me 👍
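For reference, a hedged sketch of those two tweaks (the CPU work, counts, and names are illustrative; it assumes the table from the harness sketch above exists):

require "sqlite3"

def connect
  db = SQLite3::Database.new("bench.sqlite3") # same hypothetical DB as above
  db.busy_handler_timeout = 5_000
  db
end

def fib(n) = n < 2 ? n : fib(n - 1) + fib(n - 2) # cheap CPU burner

# Tweak 1: everything in one process, so all threads contend for the GVL.
# Tweak 2: the transaction-holding thread does CPU-intensive work, to check
# whether the other threads' retry wake-ups steal enough GVL time to slow it.
writer = Thread.new do
  db = connect
  db.transaction(:immediate) do
    fib(30) # CPU work while holding the write lock
    db.execute("UPDATE t SET n = n + 1")
  end
end

pollers = Array.new(9) do
  Thread.new do
    db = connect
    db.execute("INSERT INTO t (n) VALUES (1)") # spins in the busy handler meanwhile
  end
end

([writer] + pollers).each(&:join)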
Lovely. If you want to share those tweaks with me, I'd love to incorporate them into the sandbox so that future explorations of busy handler implementations can be tested against a wide and useful set of scenarios. And I see a conflict now in the PR, so I will get that cleaned up and push up a clean PR here shortly.
Force-pushed from f36aacc to 4348c8b.
Force-pushed from 4348c8b to 03b668b.
Fixed the CHANGELOG conflict, and now I see a different test failure. This looks like another flaky test.
@byroot could we get this merged?
Force-pushed from 03b668b to 5f72e7f.
require "active_record/connection_adapters/sqlite3/schema_statements" | ||
|
||
gem "sqlite3", ">= 1.4" | ||
gem "sqlite3", ">= 2.0" |
This was still missing.
…GVL-blocking, fair retry interval busy handler implementation

Co-Authored-By: Jean Boussier <byroot@ruby-lang.org>
Force-pushed from 5f72e7f to 2976d37.
Alright, I don't really have the time/energy to review the benchmark scripts, so I chose to take them at face value. Anyway, that's a feature provided by the underlying driver, so performance issues with this should be handled upstream.
ActiveRecord.deprecator.warn(<<~MSG)
  The retries option is deprecated and will be removed in Rails 8.0. Use timeout instead.
MSG

Is this message OK? I believe `main` is already 8.0…
Yes, main was already renamed in 8.0.alpha. |
👍 target moved to 8.1 in 7d282b4. |
* Add /gemfiles/vendor/ to .gitignore since the directory is used when installing with the Gemfiles under the gemfiles/ dir
* Rails 6.1 or above should use rspec-rails 6.x to comply with its policy https://github.com/rspec/rspec-rails/blob/d0e322b20bb713b200b66d4a7cc21a272d0b4374/README.md#L11-L16
* Rails 8.0 requires sqlite3 2.0 or above rails/rails#51958
* Add Ruby 3.3 x Rails 7.2 to test matrix
Rails main requires `sqlite3` version 2: rails/rails#51958
Motivation / Background
As detailed in #50370, the current use of SQLite's `busy_timeout` functionality leads to notable increases in tail latency as well as occasional `SQLite3::BusyException: database is locked` errors, since it does not release the GVL while waiting to retry acquiring the SQLite database lock. Version 2.0 of the `sqlite3` gem introduces a new `SQLite3::Database#busy_handler_timeout=` method, which provides a non-GVL-blocking, fair retry interval busy handler implementation.

cc: @flavorjones @tenderlove @byroot
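Roughly, the difference at the driver level (a sketch of the behavior described above, not adapter source):

require "sqlite3"

db = SQLite3::Database.new("storage/production.sqlite3") # hypothetical path

# Old approach: SQLite's C-level busy_timeout waits without releasing the GVL,
# so every other Ruby thread in the process stalls while this one retries.
db.busy_timeout = 5000

# New approach (sqlite3 >= 2.0): a Ruby-level handler that sleeps between
# attempts, releasing the GVL, and retries on a constant, fair interval.
db.busy_handler_timeout = 5000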
Detail
This PR points the `timeout` option in the `database.yml` config file at this new busy handler implementation. In order to ensure that this new method is present, I have also opened #51957 so that Rails 8 defaults to version >= 2.0 for the `sqlite3` gem.
As detailed in my deep dive on this topic, having such a busy handler is essential for running production-grade Rails application using SQLite. In my benchmarks, ensuring that the busy handler doesn't hold the GVL brought p95 latency down from >5s to 0.19s. Then, ensuring that the retry interval was fair for both "old" and "new" queries brought p99.99 latency down from >1.5s to 0.5s.
Checklist
Before submitting the PR make sure the following are checked:
[Fix #issue-number]