VMs can't progress through state changes because Neutron is deadlocking on its database queries, and thus leaving networks in inconsistent states

Bug #1230407 reported by ZhiQiang Fan
This bug affects 11 people
Affects: neutron
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: none

Bug Description

This is most often seen as the "State change timeout exceeded" error in the tempest logs.

2013-09-25 16:03:28.319 | FAIL: tempest.thirdparty.boto.test_ec2_instance_run.InstanceRunTest.test_run_stop_terminate_instance_with_tags[gate,smoke]
2013-09-25 16:03:28.319 | tempest.thirdparty.boto.test_ec2_instance_run.InstanceRunTest.test_run_stop_terminate_instance_with_tags[gate,smoke]
2013-09-25 16:03:28.319 | ----------------------------------------------------------------------
2013-09-25 16:03:28.319 | _StringException: Empty attachments:
2013-09-25 16:03:28.319 | stderr
2013-09-25 16:03:28.320 | stdout
2013-09-25 16:03:28.320 |
2013-09-25 16:03:28.320 | pythonlogging:'': {{{2013-09-25 15:49:34,792 state: pending}}}
2013-09-25 16:03:28.320 |
2013-09-25 16:03:28.320 | Traceback (most recent call last):
2013-09-25 16:03:28.320 | File "tempest/thirdparty/boto/test_ec2_instance_run.py", line 175, in test_run_stop_terminate_instance_with_tags
2013-09-25 16:03:28.320 | self.assertInstanceStateWait(instance, "running")
2013-09-25 16:03:28.321 | File "tempest/thirdparty/boto/test.py", line 356, in assertInstanceStateWait
2013-09-25 16:03:28.321 | state = self.waitInstanceState(lfunction, wait_for)
2013-09-25 16:03:28.321 | File "tempest/thirdparty/boto/test.py", line 341, in waitInstanceState
2013-09-25 16:03:28.321 | self.valid_instance_state)
2013-09-25 16:03:28.321 | File "tempest/thirdparty/boto/test.py", line 331, in state_wait_gone
2013-09-25 16:03:28.321 | state = state_wait(lfunction, final_set, valid_set)
2013-09-25 16:03:28.322 | File "tempest/thirdparty/boto/utils/wait.py", line 57, in state_wait
2013-09-25 16:03:28.322 | (dtime, final_set, status))
2013-09-25 16:03:28.322 | AssertionError: State change timeout exceeded!(400s) While waitingfor set(['running', '_GONE']) at "pending"

full log: http://logs.openstack.org/38/47438/1/gate/gate-tempest-devstack-vm-neutron/93db162/

Revision history for this message
Joe Gordon (jogo) wrote :

according to logstash, this appears to only happen with neutron.

Revision history for this message
Joe Gordon (jogo) wrote :
Changed in neutron:
status: New → Confirmed
Revision history for this message
Joe Gordon (jogo) wrote :

logstash query: @message:"AssertionError: State change timeout exceeded!" AND @fields.build_status:"FAILURE" AND @fields.filename:"console.html"

Revision history for this message
Joe Gordon (jogo) wrote :

100 hits in last 2 days

Joe Gordon (jogo)
tags: added: havana-rc-potential
Revision history for this message
Sean Dague (sdague) wrote :

This looks like it's basically a neutron database deadlock issue - http://logs.openstack.org/87/47487/4/check/gate-tempest-devstack-vm-neutron/4128a28/logs/screen-q-svc.txt.gz?level=TRACE

This is really an RC bug, and I can't imagine cutting an RC without fixing this.

Changed in neutron:
importance: Undecided → Critical
milestone: none → havana-rc1
Sean Dague (sdague)
description: updated
summary: - State change timeout exceeded
+ VMs can't progress through state changes because Neutron is deadlocking
+ on its database queries, and thus leaving networks in inconsistent
+ states
Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

triaging

Changed in neutron:
assignee: nobody → Salvatore Orlando (salvatore-orlando)
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Here's a breakdown of the SQL statements that run into this trouble

Logstash Queries
----------------
"Lock wait timeout exceeded; try restarting transaction" = 2268 hits

"Lock wait timeout exceeded; try restarting transaction" and NOT (@message:"UPDATE agents") = 808 hits

"Lock wait timeout exceeded; try restarting transaction" and NOT (@message:"UPDATE agents") and NOT (@message:"INSERT INTO routerl3agentbindings") = 558 hits

"Lock wait timeout exceeded; try restarting transaction" and NOT (@message:"UPDATE agents") and NOT (@message:"INSERT INTO routerl3agentbindings") and NOT (@message:"UPDATE ports") = 3 hits

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Number of hits: 1460
2013-09-27 16:44:32.559 2778 TRACE neutron.openstack.common.rpc.amqp OperationalError: (OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') 'UPDATE agents SET heartbeat_timestamp=%s, configurations=%s WHERE agents.id = %s' (datetime.datetime(2013, 9, 27, 16, 43, 41, 432257), '{"router_id": "", "gateway_external_network_id": "", "handle_internal_only_routers": true, "use_namespaces": true, "routers": 5, "interfaces": 5, "floating_ips": 0, "interface_driver": "neutron.agent.linux.interface.OVSInterfaceDriver", "ex_gw_ports": 5}', 'e4b2e255-ee2a-40e1-a3c3-938269a03b28')
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Number of hits: 250
2013-09-18 13:15:32.212 31226 TRACE neutron.openstack.common.rpc.amqp OperationalError: (OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') 'UPDATE ports SET admin_state_up=%s, device_id=%s, device_owner=%s WHERE ports.id = %s' (1, '08d955df-9810-5417-81bf-ae8c785d3ac4', 'neutron:LOADBALANCER', '14e47f8d-6515-421c-94f9-f472f1a030e4')
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Number of hits: 555
2013-09-27T01:51:42.000 [-] Returning exception (OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') 'INSERT INTO routerl3agentbindings (id, router_id, l3_agent_id) VALUES (%s, %s, %s)' ('b87db1d2-35f6-467a-8743-930a2992a24a', '0ea9e7b5-13ba-4cfb-af7...


Revision history for this message
Joe Gordon (jogo) wrote :

The query @message:"Lock wait timeout exceeded; try restarting transaction" AND @fields.filename:"logs/screen-q-svc.txt" AND @fields.build_status:"FAILURE" is too narrow; here is another case of the same bug (I think):

http://logs.openstack.org/70/44670/3/gate/gate-tempest-devstack-vm-neutron/02d68e3/logs/screen-q-svc.txt.gz?level=TRACE

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

It appears that it is not the db queries which deadlock, but the rpc calls which trigger these database queries.
The lock wait timeout is probably just a manifestation of the eventlet deadlock.
This gets so bad at times that it appears all the connections in the pool are taken by deadlocked threads, leaving the server without available connections and making it totally unresponsive.

This last manifestation should be the cause of the observed failure.

I think this problem is not new in Neutron; these deadlocks have been sporadically observed in the past.
There was a similar bug (https://bugs.launchpad.net/tripleo/+bug/1184484), but with some improvements in Havana-1 the issue apparently went away.

No change happened in quantum in the last 3 days that might justify this. However, recently - not sure when - VPN support was added to devstack-gate. As the VPN support adds more RPC calls, which might increase the chance of deadlock, I would first check whether removing VPN support from devstack-gate removes the issue.

If that is successful, I will then work on a solution that prevents this issue altogether.
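(Editor's note: a minimal sketch, not Neutron code, of the pool-exhaustion failure mode described above: once every pooled connection is held by a stuck thread, any further checkout blocks and eventually raises a pool timeout, which from the outside looks like a completely unresponsive server.)

from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

# tiny pool so exhaustion is easy to trigger; SQLite here is just a stand-in for MySQL
engine = create_engine("sqlite:///demo.db", poolclass=QueuePool,
                       pool_size=2, max_overflow=0, pool_timeout=1)

held = [engine.connect(), engine.connect()]  # simulate two "deadlocked" threads holding connections
try:
    engine.connect()                         # pool is empty: waits pool_timeout, then gives up
except Exception as exc:
    print("pool exhausted: %s" % exc)        # sqlalchemy.exc.TimeoutError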

Revision history for this message
Sean Dague (sdague) wrote :

Salvatore, nice find!

I've proposed the revert here - https://review.openstack.org/#/c/48793/ (not sure why it didn't link in from gerrit)

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

It does not seem, however, that it solved anything.
One thing that puzzles me is that I am often unable to correlate the failure (which always happens either in test_network_basic_ops or in the boto ec2 tests, due to the instance not coming up) with the timeout exception.
There are cases where the timeout exception is not reported in the neutron logs.

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

To add more context, the gate-tempest-devstack-vm-neutron job is showing two failure modes:
1) behaviour which is exactly this bug - as can be seen in http://logs.openstack.org/91/48591/2/gate/gate-tempest-devstack-vm-neutron/4e2582a/ - in this case the timeout, being caused by agent RPC calls, can happen at any time - not just at VM boot - and indeed we have failures in tempest.api.network tests
2) behaviour without any related traceback in neutron (the tracebacks I do see are related to negative tests, IMHO), as can be seen in http://logs.openstack.org/46/47546/2/gate/gate-tempest-devstack-vm-neutron/c8da791/

Hopefully removing VPN from the gate will reduce the occurrence of #1; however, I am more concerned about manifestations of type #2, since they are more frequent in the tests I have been running on my machines.

Changed in neutron:
assignee: Salvatore Orlando (salvatore-orlando) → Mark McClain (markmcclain)
Revision history for this message
Joe Gordon (jogo) wrote :

I think there are two bugs here:

'lock wait timeout exceeded; try restarting transaction':

 @message:"Lock wait timeout exceeded; try restarting transaction" AND @fields.filename:"logs/screen-q-svc.txt" AND @fields.build_status:"FAILURE"

And

http://logs.openstack.org/98/49198/1/check/check-tempest-devstack-vm-neutron/e3cd6d8/logs/screen-n-cpu.txt.gz#_2013-10-01_16_57_22_043

That instance / req-id is missing a 'Got semaphore / lock "update_usage"' log line, so the DB doesn't know the instance has booted.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/49257

Revision history for this message
Thierry Carrez (ttx) wrote :

RC1 was signed off; moving to havana-rc-potential.

Changed in neutron:
milestone: havana-rc1 → none
Thierry Carrez (ttx)
tags: added: havana-backport-potential
removed: havana-rc-potential
Revision history for this message
Bhuvan Arumugam (bhuvan) wrote :

We also face this issue.

The workaround we use is to increase the InnoDB lock wait timeout to 100 seconds; the MySQL default is 50 seconds:
innodb_lock_wait_timeout=100
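(Editor's note: a minimal sketch of applying the same workaround per connection from SQLAlchemy instead of editing my.cnf, assuming the service reaches MySQL through SQLAlchemy; the connection URL is a placeholder.)

from sqlalchemy import create_engine, event

engine = create_engine("mysql://neutron:secret@127.0.0.1/neutron")  # placeholder URL

@event.listens_for(engine, "connect")
def raise_lock_wait_timeout(dbapi_conn, conn_record):
    # bump the InnoDB lock wait timeout for this session only (server default is 50s)
    cursor = dbapi_conn.cursor()
    cursor.execute("SET SESSION innodb_lock_wait_timeout = 100")
    cursor.close()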

Revision history for this message
Anita Kuno (anteaya) wrote :

salv-orlando said that Mark McClain is going to split RPC-over-AMQP processing from the REST API. This will solve the contention problem.
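(Editor's note: an illustration only, not the actual Neutron change, of the idea being described: run the RPC-over-AMQP consumer and the REST API in separate worker processes so that stalled database transactions on one side cannot starve request handling on the other.)

import multiprocessing

def serve_api():
    # placeholder: start the WSGI/REST API server here
    pass

def serve_rpc():
    # placeholder: start the AMQP/RPC consumer here
    pass

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=serve_api),
             multiprocessing.Process(target=serve_rpc)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()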

Thierry Carrez (ttx)
Changed in neutron:
milestone: none → icehouse-1
Thierry Carrez (ttx)
Changed in neutron:
milestone: icehouse-1 → icehouse-2
Sean Dague (sdague)
no longer affects: tempest
no longer affects: nova
Revision history for this message
Bhuvan Arumugam (bhuvan) wrote :

As per https://bugs.launchpad.net/nova/+bug/1262154, the UPDATE statement is retried if it fails due to a deadlock. That would presumably resolve this issue as well.
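(Editor's note: a minimal sketch of the retry-on-deadlock idea referenced above, not the actual nova/neutron patch: re-run the database operation a few times when MySQL reports a lock wait timeout or deadlock instead of failing the whole request.)

import time
from sqlalchemy.exc import OperationalError

RETRIABLE = ("Lock wait timeout exceeded", "Deadlock found")

def retry_on_deadlock(func, retries=3, delay=0.5):
    def wrapper(*args, **kwargs):
        for attempt in range(1, retries + 1):
            try:
                return func(*args, **kwargs)
            except OperationalError as exc:
                if attempt == retries or not any(m in str(exc) for m in RETRIABLE):
                    raise
                time.sleep(delay * attempt)  # brief backoff before retrying
    return wrapper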

Revision history for this message
Mark McClain (markmcclain) wrote :

Reducing severity as other changes seem to have reduced this problem. Most hits in logstash are related to Grizzly cells tests.

Thierry Carrez (ttx)
Changed in neutron:
milestone: icehouse-2 → icehouse-3
Thierry Carrez (ttx)
Changed in neutron:
milestone: icehouse-3 → icehouse-rc1
Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

Does this bug still make sense?

Revision history for this message
Mark McClain (markmcclain) wrote :

Closing this bug as it is non-specific. Instead we should open bugs for specific instances of this error.

Changed in neutron:
milestone: icehouse-rc1 → none
status: Confirmed → Invalid
assignee: Mark McClain (markmcclain) → nobody
importance: Critical → Undecided