deadlock with multiple threads on getConnection()

Bug #1152815 reported by heheman3000
This bug affects 5 people
Affects   Status         Importance  Assigned to  Milestone
BoneCP    Fix Committed  Undecided   Unassigned
Ubuntu    Invalid        Undecided   Unassigned

Bug Description

I'm using BoneCP (0.7.1-release) in a setting where multiple threads share database connections through a single BoneCP object. The threads use the Java 7 try-with-resources construct:

  try (Connection conn = pbds.getConnection()) {
      // run queries on conn; it is closed automatically on exit
  } catch (SQLException e) {
      e.printStackTrace();
  }

Sometimes my application (mainly during load testing) hangs completely and stops doing anything. In the debugger, I find that all the threads are blocked on the getConnection() call. The queries are quick and shouldn't take long to run, so I'm fairly sure the database calls themselves are not the problem.

I've discovered that increasing maxConnectionsPerPartition to a larger number alleviates the problem to some extent. Here is my configuration code:

  pbds = new BoneCPDataSource();
  pbds.setDatasourceBean(ds); // I'm using a MySQLConnectionPoolDataSource
  pbds.setMinConnectionsPerPartition(1);
  pbds.setMaxConnectionsPerPartition(10);
  pbds.setIdleConnectionTestPeriodInMinutes(60);
  pbds.setIdleMaxAgeInMinutes(240);
  pbds.setPartitionCount(1);

On certain machines, the problem is easily reproducible with maxConnectionsPerPartition = 5 but goes away when it is increased to 10. On other (faster) machines, I never observe it, even across many test runs.

I'm not sure how best to debug this problem. Please let me know what additional information I can provide that would be helpful.

Thanks,
Andrew

Revision history for this message
heheman3000 (mizzao) wrote :

This might help. I can reproduce the deadlock and then suspend the threads that are blocked on it; here is the call stack of one such thread.

Revision history for this message
heheman3000 (mizzao) wrote :

It seems like connections are somehow not being returned to the pool, so maybe some other part of BoneCP is causing this. But the problem goes away when I increase the max number of connections.

Revision history for this message
Stig Runar Vangen (stig-i) wrote :

For me this behaviour occurred all the time when I ran version 0.8.0-RC1, mostly after long idle periods. Since downgrading to version 0.7.1 I haven't seen this error.

Revision history for this message
heheman3000 (mizzao) wrote :

I am using 0.7.1-RELEASE, and this behavior occurs as soon as I start my tests, with no idle period involved. We may be talking about different problems (although it's possible it's the same concurrency issue).

Revision history for this message
Wallace Wadge (wwadge) wrote :

It would be great if you could provide a test case. Meanwhile, enable connection tracking in the config to make sure the connection is actually being released somewhere.

Revision history for this message
heheman3000 (mizzao) wrote :

Hi Wallace,

Thanks for your response. I'm not sure how to enable connection tracking; I don't see a method on BoneCP that does that. Could you elaborate?

Here's how to replicate the bug. The changes in the latest revision of my software were pushed just for you:

https://github.com/HarvardEconCS/TurkServer/commit/a645e8cd1734e57144f6dc2987dd71a80a9abb9a

Change the line that sets the max connections from 10 down to 5. Configure your database using the "turkserver.properties" file. Run the test file (ConcurrentGroupsTest) to reproduce the deadlock (nothing will happen in the GUI and the test will just time out). If the test completes normally, try reducing the number of connections a bit further.

I run from Eclipse. Let me know if you have problems, because I don't really use Maven from the command line. You will also need to add this project to your dependencies to get it to compile:
https://github.com/mizzao/libmao

Thanks!

Revision history for this message
Wallace Wadge (wwadge) wrote :

https://github.com/wwadge/bonecp/blob/0.7.1.RELEASE/bonecp/src/main/java/com/jolbox/bonecp/BoneCPConfig.java#L649

Try enabling that feature (for debug only!) and let's see if it spits anything out.
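For reference, the option Wallace links to is exposed as a setter on BoneCPConfig, which BoneCPDataSource extends, so in the configuration code from the bug description it could be enabled roughly like this (a sketch against the 0.7.1-era API; verify the setter name against the javadoc for your version):

```java
// Debug only: watch for connections that are obtained but never closed,
// logging a stack trace for the offending acquisition site. This has a
// significant performance cost, so don't leave it on in production.
pbds.setCloseConnectionWatch(true);
```

With this enabled, a genuine connection leak should show up in the logs as an unclosed-connection warning with the stack trace of the thread that borrowed it.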

Revision history for this message
heheman3000 (mizzao) wrote :

This is the only thing I see when I search for "BoneCP" in the output.

340 [Thread-0] WARN com.jolbox.bonecp.BoneCP - Thread close connection monitoring has been enabled. This will negatively impact on your performance. Only enable this option for debugging purposes!

There are a bunch of SQLExceptions when the test is killed off by JUnit due to timeout, and these are all at the hanging getConnection() calls. But nothing else...

Revision history for this message
Larry Streepy (lstreepy) wrote :

I'm having a very similar issue with 0.8.0-rc1 and Hibernate 4.2.2. When I start my application and let it sit idle for over an hour, the next connection I try to acquire hangs with the same stack trace as reported (included below). I haven't narrowed down exactly how long things have to be idle, but I know that at 45 minutes things work fine and at 90 minutes they hang. I've added a connection hook to my system; I'll see if it provides any useful information and report back.

Here's the (partial) stack:

   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x00000000c2009228> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
        at com.jolbox.bonecp.DefaultConnectionStrategy.getConnectionInternal(DefaultConnectionStrategy.java:81)
        at com.jolbox.bonecp.AbstractConnectionStrategy.getConnection(AbstractConnectionStrategy.java:90)
        at com.jolbox.bonecp.BoneCP.getConnection(BoneCP.java:540)
        at com.jolbox.bonecp.provider.BoneCPConnectionProvider.getConnection(BoneCPConnectionProvider.java:178)
        at org.hibernate.internal.AbstractSessionImpl$NonContextualJdbcConnectionAccess.obtainConnection(AbstractSessionImpl.java:292)

Revision history for this message
Larry Streepy (lstreepy) wrote :

I tracked this down and can provide a fix. For me, the problem results from configuring more than one connection partition and having all the connections destroyed by an idle timeout. In that case, DefaultConnectionStrategy has a bug in pollConnection() when it checks the other partitions (after finding that the original partition has no free connections).

At line 55 of DefaultConnectionStrategy.java, you have this:

        connectionPartition = this.pool.partitions[i];

Because connectionPartition is changed from the one determined earlier at line 44, the code at the bottom of the routine ends up signalling the wrong partition to wake up and create more connections. The simple fix is to comment out this line of code. I would also change the break on line 57 to a "return result", since there is no reason to signal for more connections when you have already found one.
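The failure mode Larry describes can be illustrated with a self-contained sketch (this is NOT BoneCP's source; queue, flag, and class names here are all hypothetical). A borrower whose own partition is empty steals from another partition's free queue, but because the partition variable is reassigned along the way, the "create more connections" signal goes to the fallback partition instead of the one where threads are actually parked waiting:

```java
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical model of the bug pattern: two partitions, each with a free-
// connection queue and a refill flag that a filler thread would act on.
public class PartitionSignalSketch {
    static final int PARTITIONS = 2;
    final LinkedBlockingQueue<String>[] free;
    final boolean[] refillRequested = new boolean[PARTITIONS];

    @SuppressWarnings("unchecked")
    PartitionSignalSketch() {
        free = new LinkedBlockingQueue[PARTITIONS];
        for (int i = 0; i < PARTITIONS; i++) free[i] = new LinkedBlockingQueue<>();
    }

    String pollConnection(int myPartition) {
        int signalPartition = myPartition;
        String conn = free[myPartition].poll();
        if (conn == null) {
            // Own partition is empty: try the other partitions.
            for (int i = 0; i < PARTITIONS; i++) {
                conn = free[i].poll();
                if (conn != null) {
                    // BUG (analogous to the reassignment at line 55): the
                    // signal below now targets partition i, not the caller's
                    // partition. Dropping this reassignment is the fix.
                    signalPartition = i;
                    break;
                }
            }
        }
        refillRequested[signalPartition] = true; // "wake up, create more"
        return conn;
    }

    public static void main(String[] args) {
        PartitionSignalSketch pool = new PartitionSignalSketch();
        pool.free[1].add("conn-B");   // partition 0 empty, partition 1 has one
        pool.pollConnection(0);       // a thread assigned to partition 0
        // With the bug, partition 1 is asked to refill while partition 0,
        // where the blocked threads wait, is never replenished.
        System.out.println("refill p0=" + pool.refillRequested[0]
                + " p1=" + pool.refillRequested[1]);
    }
}
```

Running this prints `refill p0=false p1=true`: the empty partition that waiters park on never gets a refill request, matching the observed hang in getConnection().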

Revision history for this message
Wallace Wadge (wwadge) wrote :

Good catch Larry,

Fixed.

I'm keeping the break+notify there, since that signal is meant to let a different thread create new connections if need be.

Changed in bonecp:
status: New → Fix Committed
Revision history for this message
Paul White (paulw2u) wrote :

Belatedly invalidating superfluous Ubuntu task.

Changed in ubuntu:
status: New → Invalid