ActiveRecord connection pool fairness


25 May 2012

ActiveRecord maintains a pool of database connections that may be used by multiple threads to perform database work. If many threads are contending for a smaller number of connections, the behavior of the connection pool becomes important.

See also the activerecord pull request.

In a normal Rails app, Rails takes care of checking out a connection before an HTTP request, and checking the connection back into the pool after the request. Rails is commonly run by servers such as Unicorn that run multiple worker processes to service HTTP requests in parallel. Each Ruby process has its own connection pool not shared with other worker processes. Each process handles at most one HTTP request at any given time, and may have background threads using database connections as well.
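For threads outside the request cycle, the pool's with_connection helper pairs the checkout with the checkin; a minimal sketch (the SELECT 1 is placeholder work, not from the original post):

Thread.new do
  # with_connection checks a connection out of the pool, yields it,
  # and checks it back in when the block finishes.
  ActiveRecord::Base.connection_pool.with_connection do |conn|
    conn.execute("SELECT 1")  # any database work
  end
end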

In JRuby, where there is no global interpreter lock, it is desirable to run multiple worker threads, rather than processes, to handle HTTP requests in parallel. All the worker threads now share the same connection pool.

A simple simulation spawns N worker threads that perform simulated HTTP requests, 100,000 requests in total. For each request, a worker acquires a database connection, sleeps 10 ms, then releases the connection:

conn = @pool.checkout   # block until a connection is available
sleep 0.01              # simulate 10 ms of database work
@pool.checkin conn      # return the connection to the pool
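A fuller sketch of the driver might look like the following; the constants and the timing collection are assumptions about the harness, not the author's exact script.

require "thread"

N_THREADS  = 50
N_REQUESTS = 100_000

jobs = Queue.new                 # one token per simulated request
N_REQUESTS.times { jobs << true }
times = Queue.new                # per-request elapsed times

workers = N_THREADS.times.map do
  Thread.new do
    loop do
      begin
        jobs.pop(true)           # non-blocking; raises when drained
      rescue ThreadError
        break
      end
      start = Time.now
      conn = @pool.checkout      # may block waiting for a free connection
      sleep 0.01                 # the simulated 10 ms request
      @pool.checkin conn
      times << Time.now - start
    end
  end
end
workers.each(&:join)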

The connection pool size is set to five, which is the ActiveRecord default. With five worker threads, there is no contention for connections, and all requests complete in nearly 10 ms.

With 50 worker threads, there is contention for connections. Threads must wait until another thread releases a connection to the pool. ActiveRecord versions 3.0.12 and 3.2.3 have a couple of thread-safety bugs as well as a “fairness” problem. A connection pool is fair if the thread that has been waiting longest acquires the next available connection. With an unfair connection pool, a new thread that has not been waiting can “steal” a newly available connection, jumping ahead in line.

Unfair vs Fair queue

These tests were run with a patched ActiveRecord in which a queue implementation underlies the connection pool. The queue can run in “fair” or “unfair” mode.

The key to the fair queue:

def can_remove_no_wait?
  # A thread may skip the wait only when there are more queued
  # connections than threads already waiting for one.
  @queue.size > @num_waiting
end

And the change to make the queue unfair:

def can_remove_no_wait?
  true  # any thread may take an available connection immediately
end
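For context, here is a minimal sketch of how a queue can use this predicate; it is simplified from the actual patch (no timeout handling, and the names are approximate):

require "monitor"

class ConnectionQueue
  def initialize
    @lock        = Monitor.new
    @cond        = @lock.new_cond
    @queue       = []   # available connections
    @num_waiting = 0    # threads currently blocked in poll
  end

  def add(conn)
    @lock.synchronize do
      @queue.push(conn)
      @cond.signal      # wake one waiting thread, if any
    end
  end

  def poll
    @lock.synchronize do
      if can_remove_no_wait? && (conn = @queue.shift)
        return conn
      end
      @num_waiting += 1
      begin
        # Wait for a signal even if connections are queued: in fair
        # mode those are earmarked for longer-waiting threads.
        loop do
          @cond.wait
          return @queue.shift unless @queue.empty?
        end
      ensure
        @num_waiting -= 1
      end
    end
  end

  private

  def can_remove_no_wait?
    @queue.size > @num_waiting
  end
end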

With 50 worker threads, a clear difference in behavior appears.

With the fair queue, predictably, each thread must wait until other threads have finished with their connections. Each simulated request holds a connection for 10 ms, so with five connections checked out, one is released every 2 ms on average. A thread that begins waiting fifth in line thus receives a connection after about 5 x 2 ms = 10 ms. With 50 workers, a thread re-queues behind roughly 45 others, waiting about 90 ms before each 10 ms request, which accounts for the roughly 100 ms per request seen above. The histogram shows a very tight concentration right around 100 ms.

On the other hand, with the unfair queue some threads acquire a connection immediately, another group waits around 100 ms, and some are left waiting a whole second or more.

Why would one ever want more worker threads than connections? In a typical Java servlet environment, the pool of worker threads is shared by multiple web applications running in the servlet container. Each application may have its own database and corresponding connection pool. Even if the number of concurrent requests to a given app does not normally exceed its connection pool size, a burst of requests can cause contention for database connections, and possibly timeouts.

The same test with 100,000 simulated requests was run with the unfair and fair queues and a varying number of worker threads. The fair queue maintains a consistent wait time across threads, even as that time marches higher and higher. The unfair queue shows much less consistent per-request behavior, and can cause some threads to fail with a timeout error.
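When a thread does wait past the limit, checkout raises ActiveRecord::ConnectionTimeoutError; a worker would observe something like the following (the handling is illustrative, not from the original harness):

begin
  conn = @pool.checkout   # raises once the wait timeout expires
  sleep 0.01
  @pool.checkin conn
rescue ActiveRecord::ConnectionTimeoutError => e
  # With an unfair queue and many workers, an unlucky thread can
  # starve past the 5 second default and land here.
  warn "could not acquire a connection: #{e.message}"
end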

JRuby behaves differently from Ruby 1.9.3. With the unfair queue, the latter manages to clump most requests right at 10 ms, but the requests that do wait take much longer, and timeouts (with the 5 second default) are observed with 50 worker threads.

(All tests were run on a dual-core 1 GHz Athlon running Linux kernel 3.0.0.)