Nova compute failed to delete VM port with DVR

Bug #1398446 reported by Zengfa Gao
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Invalid
Medium
Zengfa Gao

Bug Description

This defect is hard to reproduce, only happens when I have more than 3 compute node with DVR enabled.

With the following script, run several times, I can see one VM in ERROR state.

neutron net-create demo-net
netdemoid=$(neutron net-list | awk '{if($4=="'demo-net'"){print $2;}}')
neutron subnet-create demo-net 10.100.100.0/24 --name demo-subnet
subnetdemoid=$(neutron subnet-list | awk '{if($4=="'demo-subnet'"){print $2;}}')
neutron router-create demo-router
routerdemoid=$(neutron router-list | awk '{if($4=="'demo-router'"){print $2;}}')

exnetid=$(neutron net-list | awk '{if($4=="'ext-net'"){print $2;}}')
for i in `seq 1 10`; do
    #boot vm, and create floating ip
    nova boot --image cirros --flavor m1.tiny --nic net-id=$netdemoid cirrosdemo${i}
    cirrosdemoid[i]=$(nova list | awk '{if($4=="'cirrosdemo${i}'"){print $2;}}')
    output=$(neutron floatingip-create $exnetid)
    echo $output
    floatipid[i]=$(echo "$output" | awk '{if($2=="id"){print $4;}}')
    floatip[i]=$(echo "$output" | awk '{if($2=="floating_ip_address"){print $4;}}')a
done

# Setup router
neutron router-gateway-set $routerdemoid $exnetid
neutron router-interface-add demo-router $subnetdemoid
#wait for VM to be running
sleep 30

for i in `seq 1 10`; do
    cirrosfix=$(nova list | awk '{if($4=="'cirrosdemo${i}'"){print $12;}}')
    cirrosfixip=${cirrosfix#*=}
    output=$(neutron port-list | grep ${cirrosfixip})
    echo $output
    portid=$(echo "$output" | awk '{print $2;}')
    neutron floatingip-associate --fixed-ip-address $cirrosfixip ${floatipid[i]} $portid
    neutron floatingip-delete ${floatipid[i]}
    nova delete ${cirrosdemoid[i]}
done

neutron router-interface-delete demo-router $subnetdemoid
neutron router-gateway-clear demo-router $netdemoid
neutron router-delete demo-router
neutron subnet-delete $subnetdemoid
neutron net-delete $netdemoid

Looking at log file:
2014-11-20 17:25:56.258 31042 DEBUG neutron.openstack.common.lockutils [req-6eabf07e-2fe4-4960-89ca-f0ac3f04f7a5 None] Got semaphore "db-access" lock /opt/stack/venvs/openstack/local/lib/python2.7/site-packages/neutron/openstack/common/lockutils.py:168
2014-11-20 17:25:56.424 31042 ERROR neutron.api.v2.resource [req-6eabf07e-2fe4-4960-89ca-f0ac3f04f7a5 None] delete failed
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource Traceback (most recent call last):
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource File "/opt/stack/venvs/openstack/local/lib/python2.7/site-packages/neutron/api/v2/resource.py", line 87, in resource
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource result = method(request=request, **args)
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource File "/opt/stack/venvs/openstack/local/lib/python2.7/site-packages/neutron/api/v2/base.py", line 476, in delete
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource obj_deleter(request.context, id, **kwargs)
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource File "/opt/stack/venvs/openstack/local/lib/python2.7/site-packages/neutron/plugins/ml2/plugin.py", line 1036, in delete_port
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource router_info = l3plugin.dvr_deletens_if_no_vm(context, id)
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource File "/opt/stack/venvs/openstack/local/lib/python2.7/site-packages/neutron/db/l3_dvrscheduler_db.py", line 195, in dvr_deletens_if_no_vm
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource port_host)
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource File "/opt/stack/venvs/openstack/local/lib/python2.7/site-packages/neutron/db/agents_db.py", line 136, in _get_agent_by_type_and_host
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource Agent.host == host).one()
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource File "/opt/stack/venvs/openstack/local/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2369, in one
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource ret = list(self)
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource File "/opt/stack/venvs/openstack/local/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2411, in _iter_
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource self.session._autoflush()
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource File "/opt/stack/venvs/openstack/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 1198, in _autoflush
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource self.flush()
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource File "/opt/stack/venvs/openstack/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 1919, in flush
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource self._flush(objects)
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource File "/opt/stack/venvs/openstack/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 2037, in _flush
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource transaction.rollback(_capture_exception=True)
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource File "/opt/stack/venvs/openstack/local/lib/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 60, in _exit_
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource compat.reraise(exc_type, exc_value, exc_tb)
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource File "/opt/stack/venvs/openstack/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 2001, in _flush
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource flush_context.execute()
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource File "/opt/stack/venvs/openstack/local/lib/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 372, in execute
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource rec.execute(self)
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource File "/opt/stack/venvs/openstack/local/lib/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 526, in execute
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource uow
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource File "/opt/stack/venvs/openstack/local/lib/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 60, in save_obj
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource mapper, table, update)
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource File "/opt/stack/venvs/openstack/local/lib/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 536, in _emit_update_statements
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource (table.description, len(update), rows))
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource StaleDataError: UPDATE statement on table 'ml2_dvr_port_bindings' expected to update 1 row(s); 0 were matched.
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource
2014-11-20 17:25:56.427 31042 INFO neutron.wsgi [req-6eabf07e-2fe4-4960-89ca-f0ac3f04f7a5 None] 10.23.32.135 - - [20/Nov/2014 17:25:56] "DELETE /v2.0/ports/83f2d987-0ab7-4479-8c97-7ab7b758a2dc.json HTTP/1.1" 500 296 0.235463

The problem could be some race condition. The code l3_devscheduler_db.py
From line 158, got_dvr_routers_by_portid to line 193 get_agent_by_type_and_host, there are several calls.
If the port was there when code on 158. Another thread comes in the middle, delete the port. Line 193 will fail because the entry was not there anymore.
Both "neutron floatingip-delete" and "nova delete" can delete the port.

I think if we move l3_devscheduler_db.py, line 193:
            agent = self._get_agent_by_type_and_host(context,
                                                     q_const.AGENT_TYPE_L3,
                                                     port_host)
to line 164

It should resolve the problem.

tags: added: l3-dvr-backlog
summary: - Novacompute failed to detel VM port with DVR
+ Novacompute failed to delete VM port with DVR
summary: - Novacompute failed to delete VM port with DVR
+ Nova compute failed to delete VM port with DVR
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/138822

Changed in neutron:
assignee: nobody → Zengfa Gao (zfgao)
status: New → In Progress
Revision history for this message
Zengfa Gao (zfgao) wrote :

The key error is: 2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource StaleDataError: UPDATE statement on table 'ml2_dvr_port_bindings' expected to update 1 row(s); 0 were matched.
2014-11-20 17:25:56.424 31042 TRACE neutron.api.v2.resource

exception sqlalchemy.orm.exc.StaleDataError
Bases: sqlalchemy.exc.SQLAlchemyError

An operation encountered database state that is unaccounted for.

Conditions which cause this to happen include:

A flush may have attempted to update or delete rows and an unexpected number of rows were matched during the UPDATE or DELETE statement. Note that when version_id_col is used, rows in UPDATE or DELETE statements are also matched against the current known version identifier.

A mapped object with version_id_col was refreshed, and the version number coming back from the database does not match that of the object itself.

A object is detached from its parent object, however the object was previously attached to a different parent identity which was garbage collected, and a decision cannot be made if the new parent was really the most recent “parent”.

Revision history for this message
Carl Baldwin (carl-baldwin) wrote :

Could we also get stack traces in a paste bin or something. These stack traces are so hard to read.

Revision history for this message
Zengfa Gao (zfgao) wrote :
Changed in neutron:
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Zengfa Gao (<email address hidden>) on branch: master
Review: https://review.openstack.org/138822
Reason: I cannot reproduce the same StaleDataError with stable/juno or master. There is no enough reasons to push this into upstream.

Revision history for this message
Zengfa Gao (zfgao) wrote :

This defect cannot be reproduced anymore with master or stable/juno.

Changed in neutron:
status: In Progress → Invalid
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.