VMs can't progress through state changes because Neutron is deadlocking on its database queries, and thus leaving networks in inconsistent states

Bug #1230407 reported by ZhiQiang Fan
This bug affects 11 people
Affects: neutron
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: none

Bug Description

This is most often seen as the "State change timeout exceeded" error in the tempest logs.

2013-09-25 16:03:28.319 | FAIL: tempest.thirdparty.boto.test_ec2_instance_run.InstanceRunTest.test_run_stop_terminate_instance_with_tags[gate,smoke]
2013-09-25 16:03:28.319 | tempest.thirdparty.boto.test_ec2_instance_run.InstanceRunTest.test_run_stop_terminate_instance_with_tags[gate,smoke]
2013-09-25 16:03:28.319 | ----------------------------------------------------------------------
2013-09-25 16:03:28.319 | _StringException: Empty attachments:
2013-09-25 16:03:28.319 | stderr
2013-09-25 16:03:28.320 | stdout
2013-09-25 16:03:28.320 |
2013-09-25 16:03:28.320 | pythonlogging:'': {{{2013-09-25 15:49:34,792 state: pending}}}
2013-09-25 16:03:28.320 |
2013-09-25 16:03:28.320 | Traceback (most recent call last):
2013-09-25 16:03:28.320 | File "tempest/thirdparty/boto/test_ec2_instance_run.py", line 175, in test_run_stop_terminate_instance_with_tags
2013-09-25 16:03:28.320 | self.assertInstanceStateWait(instance, "running")
2013-09-25 16:03:28.321 | File "tempest/thirdparty/boto/test.py", line 356, in assertInstanceStateWait
2013-09-25 16:03:28.321 | state = self.waitInstanceState(lfunction, wait_for)
2013-09-25 16:03:28.321 | File "tempest/thirdparty/boto/test.py", line 341, in waitInstanceState
2013-09-25 16:03:28.321 | self.valid_instance_state)
2013-09-25 16:03:28.321 | File "tempest/thirdparty/boto/test.py", line 331, in state_wait_gone
2013-09-25 16:03:28.321 | state = state_wait(lfunction, final_set, valid_set)
2013-09-25 16:03:28.322 | File "tempest/thirdparty/boto/utils/wait.py", line 57, in state_wait
2013-09-25 16:03:28.322 | (dtime, final_set, status))
2013-09-25 16:03:28.322 | AssertionError: State change timeout exceeded!(400s) While waitingfor set(['running', '_GONE']) at "pending"

full log: http://logs.openstack.org/38/47438/1/gate/gate-tempest-devstack-vm-neutron/93db162/

Revision history for this message
Joe Gordon (jogo) wrote :

according to logstash, this appears to only happen with neutron.

Revision history for this message
Joe Gordon (jogo) wrote :
Changed in neutron:
status: New → Confirmed
Revision history for this message
Joe Gordon (jogo) wrote :

logstash query: @message:"AssertionError: State change timeout exceeded!" AND @fields.build_status:"FAILURE" AND @fields.filename:"console.html"

Revision history for this message
Joe Gordon (jogo) wrote :

100 hits in last 2 days

Joe Gordon (jogo)
tags: added: havana-rc-potential
Revision history for this message
Sean Dague (sdague) wrote :

This looks like it's basically a neutron database deadlock issue - http://logs.openstack.org/87/47487/4/check/gate-tempest-devstack-vm-neutron/4128a28/logs/screen-q-svc.txt.gz?level=TRACE

This is really an RC bug, and I can't imagine cutting an RC without fixing this.

Changed in neutron:
importance: Undecided → Critical
milestone: none → havana-rc1
Sean Dague (sdague)
description: updated
summary: - State change timeout exceeded
+ VMs can't progress through state changes because Neutron is deadlocking
+ on its database queries, and thus leaving networks in inconsistent
+ states
Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

triaging

Changed in neutron:
assignee: nobody → Salvatore Orlando (salvatore-orlando)
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Here's a breakdown of the SQL statements that run into this trouble

Logstash Queries
----------------
"Lock wait timeout exceeded; try restarting transaction" = 2268 hits

"Lock wait timeout exceeded; try restarting transaction" and NOT (@message:"UPDATE agents") = 808 hits

"Lock wait timeout exceeded; try restarting transaction" and NOT (@message:"UPDATE agents") and NOT (@message:"INSERT INTO routerl3agentbindings") = 558 hits

"Lock wait timeout exceeded; try restarting transaction" and NOT (@message:"UPDATE agents") and NOT (@message:"INSERT INTO routerl3agentbindings") and NOT (@message:"UPDATE ports") = 3 hits

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Number of hits: 1460
2013-09-27 16:44:32.559 2778 TRACE neutron.openstack.common.rpc.amqp OperationalError: (OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') 'UPDATE agents SET heartbeat_timestamp=%s, configurations=%s WHERE agents.id = %s' (datetime.datetime(2013, 9, 27, 16, 43, 41, 432257), '{"router_id": "", "gateway_external_network_id": "", "handle_internal_only_routers": true, "use_namespaces": true, "routers": 5, "interfaces": 5, "floating_ips": 0, "interface_driver": "neutron.agent.linux.interface.OVSInterfaceDriver", "ex_gw_ports": 5}', 'e4b2e255-ee2a-40e1-a3c3-938269a03b28')
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Number of hits: 250
2013-09-18 13:15:32.212 31226 TRACE neutron.openstack.common.rpc.amqp OperationalError: (OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') 'UPDATE ports SET admin_state_up=%s, device_id=%s, device_owner=%s WHERE ports.id = %s' (1, '08d955df-9810-5417-81bf-ae8c785d3ac4', 'neutron:LOADBALANCER', '14e47f8d-6515-421c-94f9-f472f1a030e4')
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Number of hits: 555
2013-09-27T01:51:42.000 [-] Returning exception (OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') 'INSERT INTO routerl3agentbindings (id, router_id, l3_agent_id) VALUES (%s, %s, %s)' ('b87db1d2-35f6-467a-8743-930a2992a24a', '0ea9e7b5-13ba-4cfb-af7...


Revision history for this message
Joe Gordon (jogo) wrote :

The query @message:"Lock wait timeout exceeded; try restarting transaction" AND @fields.filename:"logs/screen-q-svc.txt" AND @fields.build_status:"FAILURE" is too narrow; here is another case of the same bug (I think):

http://logs.openstack.org/70/44670/3/gate/gate-tempest-devstack-vm-neutron/02d68e3/logs/screen-q-svc.txt.gz?level=TRACE

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

It appears that it is not the db queries which deadlock, but the rpc calls which trigger these database queries.
The lock wait timeout is probably just a manifestation of the eventlet deadlock.
This gets so bad at times that it appears all the connections in the pool are taken by deadlocked threads, leaving the server without available connections and making it totally unresponsive.

This last manifestation should be the cause of the observed failure.

I think this problem is not new in Neutron; these deadlocks have been sporadically observed in the past.
There was a similar bug (https://bugs.launchpad.net/tripleo/+bug/1184484), but with some improvements in Havana-1 the issue apparently went away.

No change happened in quantum in the last 3 days that might justify this. However, recently - not sure when - VPN support was added to devstack-gate. As the VPN support adds more RPC calls, which might increase the chance of deadlock, I would first check whether removing VPN support from devstack-gate removes the issue.

If that is successful, I will then work on a solution that prevents this issue altogether.
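(Editor's note: a minimal sketch, not Neutron code, of the pool-exhaustion failure mode described above: once every pooled connection is held by a stuck thread, any further checkout blocks and eventually raises a pool timeout, which from the outside looks like a completely unresponsive server.)

from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

# tiny pool so exhaustion is easy to trigger; SQLite here is just a stand-in for MySQL
engine = create_engine("sqlite:///demo.db", poolclass=QueuePool,
                       pool_size=2, max_overflow=0, pool_timeout=1)

held = [engine.connect(), engine.connect()]  # simulate two "deadlocked" threads holding connections
try:
    engine.connect()                         # pool is empty: waits pool_timeout, then gives up
except Exception as exc:
    print("pool exhausted: %s" % exc)        # sqlalchemy.exc.TimeoutError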

Revision history for this message
Sean Dague (sdague) wrote :

Salvatore, nice find!

I've proposed the revert here - https://review.openstack.org/#/c/48793/ (not sure why it didn't link in from gerrit)

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

It does not seem, however, that it solved anything.
One thing that puzzles me is that I am often unable to correlate the failure (which always happens either in test_network_basic_ops or in the boto ec2 tests, due to the instance not coming up) with the timeout exception.
There are cases where the timeout exception is not reported in the neutron logs.

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

To add more context, the gate-tempest-devstack-vm-neutron job is showing two failure modes:
1) behaviour which is exactly this bug - as can be seen in http://logs.openstack.org/91/48591/2/gate/gate-tempest-devstack-vm-neutron/4e2582a/ - in this case the timeout, being caused by agent RPC calls, can happen at any time - not just at VM boot - and indeed we have failures in tempest.api.network tests
2) behaviour without any related traceback in neutron (the tracebacks I do see are related to negative tests, IMHO), as can be seen in http://logs.openstack.org/46/47546/2/gate/gate-tempest-devstack-vm-neutron/c8da791/

Hopefully removing VPN from the gate will reduce the occurrence of #1; however, I am more concerned about manifestations of type #2, since they are more frequent in the tests I have been running on my machines.

Changed in neutron:
assignee: Salvatore Orlando (salvatore-orlando) → Mark McClain (markmcclain)
Revision history for this message
Joe Gordon (jogo) wrote :

I think there are two bugs here:

'lock wait timeout exceeded; try restarting transaction':

 @message:"Lock wait timeout exceeded; try restarting transaction" AND @fields.filename:"logs/screen-q-svc.txt" AND @fields.build_status:"FAILURE"

And

http://logs.openstack.org/98/49198/1/check/check-tempest-devstack-vm-neutron/e3cd6d8/logs/screen-n-cpu.txt.gz#_2013-10-01_16_57_22_043

That instance / req-id is missing a 'Got semaphore / lock "update_usage"' log line, so the DB doesn't know the instance has booted.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/49257

Revision history for this message
Thierry Carrez (ttx) wrote :

RC1 was signed off; moving to havana-rc-potential.

Changed in neutron:
milestone: havana-rc1 → none
Thierry Carrez (ttx)
tags: added: havana-backport-potential
removed: havana-rc-potential
Revision history for this message
Bhuvan Arumugam (bhuvan) wrote :

We also face this issue.

The workaround we use is to increase the InnoDB lock wait timeout to 100 seconds; the MySQL default is 50 seconds:
innodb_lock_wait_timeout=100
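(Editor's note: a minimal sketch of applying the same workaround per connection from SQLAlchemy instead of editing my.cnf, assuming the service reaches MySQL through SQLAlchemy; the connection URL is a placeholder.)

from sqlalchemy import create_engine, event

engine = create_engine("mysql://neutron:secret@127.0.0.1/neutron")  # placeholder URL

@event.listens_for(engine, "connect")
def raise_lock_wait_timeout(dbapi_conn, conn_record):
    # bump the InnoDB lock wait timeout for this session only (server default is 50s)
    cursor = dbapi_conn.cursor()
    cursor.execute("SET SESSION innodb_lock_wait_timeout = 100")
    cursor.close()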

Revision history for this message
Anita Kuno (anteaya) wrote :

salv-orlando said that Mark McClain is going to split RPC-over-AMQP processing from the REST API. This will solve the contention problem.
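(Editor's note: an illustration only, not the actual Neutron change, of the idea being described: run the RPC-over-AMQP consumer and the REST API in separate worker processes so that stalled database transactions on one side cannot starve request handling on the other.)

import multiprocessing

def serve_api():
    # placeholder: start the WSGI/REST API server here
    pass

def serve_rpc():
    # placeholder: start the AMQP/RPC consumer here
    pass

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=serve_api),
             multiprocessing.Process(target=serve_rpc)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()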

Thierry Carrez (ttx)
Changed in neutron:
milestone: none → icehouse-1
Thierry Carrez (ttx)
Changed in neutron:
milestone: icehouse-1 → icehouse-2
Sean Dague (sdague)
no longer affects: tempest
no longer affects: nova
Revision history for this message
Bhuvan Arumugam (bhuvan) wrote :

As per https://bugs.launchpad.net/nova/+bug/1262154, the UPDATE statement is retried if it fails due to a deadlock. That would presumably resolve this issue as well.
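(Editor's note: a minimal sketch of the retry-on-deadlock idea referenced above, not the actual nova/neutron patch: re-run the database operation a few times when MySQL reports a lock wait timeout or deadlock instead of failing the whole request.)

import time
from sqlalchemy.exc import OperationalError

RETRIABLE = ("Lock wait timeout exceeded", "Deadlock found")

def retry_on_deadlock(func, retries=3, delay=0.5):
    def wrapper(*args, **kwargs):
        for attempt in range(1, retries + 1):
            try:
                return func(*args, **kwargs)
            except OperationalError as exc:
                if attempt == retries or not any(m in str(exc) for m in RETRIABLE):
                    raise
                time.sleep(delay * attempt)  # brief backoff before retrying
    return wrapper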

Revision history for this message
Mark McClain (markmcclain) wrote :

Reducing severity as other changes seem to have reduced this problem. Most hits in logstash are related to Grizzly cells tests.

Thierry Carrez (ttx)
Changed in neutron:
milestone: icehouse-2 → icehouse-3
Thierry Carrez (ttx)
Changed in neutron:
milestone: icehouse-3 → icehouse-rc1
Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

Does this bug still make sense?

Revision history for this message
Mark McClain (markmcclain) wrote :

Closing this bug as it is non-specific. Instead we should open bugs for specific instances of this error.

Changed in neutron:
milestone: icehouse-rc1 → none
status: Confirmed → Invalid
assignee: Mark McClain (markmcclain) → nobody
importance: Critical → Undecided