Lots of problems with deleting a server immediately after create

Bug #937554 reported by justinsb
Affects: OpenStack Compute (nova)
Status: Fix Committed
Importance: High
Assigned to: Russell Bryant
Milestone: essex-4

Bug Description

Filing as a security vulnerability because I can crash the compute service (and it consumes all memory). Not sure if this is worthy of being classed as a vulnerability, but I presume we can easily mark it as "not security" if need be.

In an integration test, I create a server, and then immediately delete it (I don't wait for it to come up).

This causes all sorts of problems and errors to be reported, mostly on the compute node.

It's definitely timing dependent:

* For a while I was seeing errors because the directory structure was being deleted underneath the instance during construction; this manifested itself as file not found errors on console.log (as I recall)

* I'm currently seeing a problem where the compute process goes to 100% CPU (I believe on all cores) and rapidly eats all memory until the process is killed by the Linux OOM killer. The last few messages are at the bottom of this bug report.

I think that we need to make sure that a destroy can't proceed concurrently with a create on the same instance. Also, adding a test case would probably be useful here.
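
To make that concrete, the serialization I have in mind is roughly the following. This is a minimal sketch in plain Python, not nova's actual locking code; get_instance_lock, _build_instance and _destroy_instance are placeholder names:

import threading

_locks_guard = threading.Lock()
_instance_locks = {}

def get_instance_lock(instance_uuid):
    # One lock per instance UUID, created on first use.
    with _locks_guard:
        return _instance_locks.setdefault(instance_uuid, threading.Lock())

def _build_instance(context, instance_uuid):
    pass  # placeholder for the real spawn logic

def _destroy_instance(context, instance_uuid):
    pass  # placeholder for the real teardown logic

def run_instance(context, instance_uuid):
    # Hold the per-instance lock for the whole build...
    with get_instance_lock(instance_uuid):
        _build_instance(context, instance_uuid)

def terminate_instance(context, instance_uuid):
    # ...so a delete arriving mid-build waits instead of racing the create.
    with get_instance_lock(instance_uuid):
        _destroy_instance(context, instance_uuid)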

...
2012-02-20 23:22:02 DEBUG nova.rpc.common [-] Making asynchronous cast on network... from (pid=1177) cast /opt/stack/nova/nova/rpc/amqp.py:343
2012-02-20 23:22:02 DEBUG nova.compute.manager [-] [instance: cdcdd06c-2030-4826-bf99-b09dd96b2373] Checking state from (pid=1177) _get_power_state /opt/stack/nova/nova/compute/manager.py:253
libvir: QEMU error : Domain not found: no domain with matching name 'instance-0000001f'
libvir: QEMU error : Domain not found: no domain with matching name 'instance-0000001f'
2012-02-20 23:22:03 INFO nova.virt.libvirt.firewall [-] [instance: cdcdd06c-2030-4826-bf99-b09dd96b2373] Attempted to unfilter instance which is not filtered
2012-02-20 23:22:03 INFO nova.virt.libvirt.connection [-] [instance: cdcdd06c-2030-4826-bf99-b09dd96b2373] Deleting instance files /opt/stack/nova/nova/..//instances/instance-0000001f
2012-02-20 23:22:03 DEBUG nova.utils [-] Running cmd (subprocess): sudo mount /dev/nbd14 /tmp/tmp1D8j0M from (pid=1177) execute /opt/stack/nova/nova/utils.py:207
libvir: QEMU error : Domain not found: no domain with matching name 'instance-0000001f'
2012-02-20 23:22:04 INFO nova.virt.libvirt.connection [-] [instance: cdcdd06c-2030-4826-bf99-b09dd96b2373] Instance destroyed successfully.
2012-02-20 23:22:04 DEBUG nova.compute.manager [-] [instance: cdcdd06c-2030-4826-bf99-b09dd96b2373] Instance network_info: |[VIF({'network': Network({'bridge': u'br100', 'subnets': [Subnet({'ips': [FixedIP({'meta': {}, 'version': 4, 'type': u'fixed', 'floating_ips': [], 'address': u'192.168.71.2'})], 'version': 4, 'meta': {}, 'dns': [IP({'meta': {}, 'version': 4, 'type': u'dns', 'address': u'8.8.4.4'})], 'routes': [], 'cidr': u'192.168.0.0/16', 'gateway': IP({'meta': {}, 'version': 4, 'type': u'gateway', 'address': u'192.168.1.1'})}), Subnet({'ips': [], 'version': None, 'meta': {}, 'dns': [], 'routes': [], 'cidr': None, 'gateway': IP({'meta': {}, 'version': None, 'type': u'gateway', 'address': None})})], 'meta': {u'injected': True, u'tenant_id': None}, 'id': u'e800dca0-b76d-4309-8cb9-c5a3eb1f3105', 'label': u'private'}), 'meta': {}, 'id': u'b1625dc2-f9c2-4c70-914e-0eb554d43fc6', 'address': u'02:16:3e:66:5e:24'})]| from (pid=1177) _allocate_network /opt/stack/nova/nova/compute/manager.py:561
2012-02-20 23:22:04 DEBUG nova.compute.manager [-] [instance: cdcdd06c-2030-4826-bf99-b09dd96b2373] Deallocating network for instance from (pid=1177) _deallocate_network /opt/stack/nova/nova/compute/manager.py:611
2012-02-20 23:22:04 DEBUG nova.rpc.common [-] Making asynchronous cast on network... from (pid=1177) cast /opt/stack/nova/nova/rpc/amqp.py:343
2012-02-20 23:22:04 ERROR nova.compute.manager [-] Instance cdcdd06c-2030-4826-bf99-b09dd96b2373 not found.
(nova.compute.manager): TRACE: Traceback (most recent call last):
(nova.compute.manager): TRACE: File "/opt/stack/nova/nova/compute/manager.py", line 427, in _run_instance
(nova.compute.manager): TRACE: self._deallocate_network(context, instance)
(nova.compute.manager): TRACE: File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
(nova.compute.manager): TRACE: self.gen.next()
(nova.compute.manager): TRACE: File "/opt/stack/nova/nova/compute/manager.py", line 421, in _run_instance
(nova.compute.manager): TRACE: block_device_info = self._prep_block_device(context, instance)
(nova.compute.manager): TRACE: File "/opt/stack/nova/nova/compute/manager.py", line 569, in _prep_block_device
(nova.compute.manager): TRACE: task_state=task_states.BLOCK_DEVICE_MAPPING)
(nova.compute.manager): TRACE: File "/opt/stack/nova/nova/compute/manager.py", line 213, in _instance_update
(nova.compute.manager): TRACE: return self.db.instance_update(context, instance_id, kwargs)
(nova.compute.manager): TRACE: File "/opt/stack/nova/nova/db/api.py", line 661, in instance_update
(nova.compute.manager): TRACE: return IMPL.instance_update(context, instance_id, values)
(nova.compute.manager): TRACE: File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 119, in wrapper
(nova.compute.manager): TRACE: return f(*args, **kwargs)
(nova.compute.manager): TRACE: File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 1710, in instance_update
(nova.compute.manager): TRACE: session=session)
(nova.compute.manager): TRACE: File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 119, in wrapper
(nova.compute.manager): TRACE: return f(*args, **kwargs)
(nova.compute.manager): TRACE: File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 1452, in instance_get_by_uuid
(nova.compute.manager): TRACE: raise exception.InstanceNotFound(instance_id=uuid)
(nova.compute.manager): TRACE: InstanceNotFound: Instance cdcdd06c-2030-4826-bf99-b09dd96b2373 could not be found.
(nova.compute.manager): TRACE:
2012-02-20 23:22:04 INFO nova.compute.manager [-] Going to force the deletion of the vm cdcdd06c-2030-4826-bf99-b09dd96b2373, even if it is deleted
2012-02-20 23:22:04 INFO nova.compute.manager [req-e12699ae-b49b-47d6-8416-3ee51f1b7802 47c273884edd402d9c6cba469d6dbf0f b560b4f9ab4a4941b3bafbb44e3e100b] check_instance_lock: decorating: |<function terminate_instance at 0x2bef848>|
2012-02-20 23:22:04 INFO nova.compute.manager [req-e12699ae-b49b-47d6-8416-3ee51f1b7802 47c273884edd402d9c6cba469d6dbf0f b560b4f9ab4a4941b3bafbb44e3e100b] check_instance_lock: arguments: |<nova.compute.manager.ComputeManager object at 0x7fda66825990>| |<nova.rpc.amqp.RpcContext object at 0x46b7b90>| |cdcdd06c-2030-4826-bf99-b09dd96b2373|
2012-02-20 23:22:04 DEBUG nova.compute.manager [req-e12699ae-b49b-47d6-8416-3ee51f1b7802 47c273884edd402d9c6cba469d6dbf0f b560b4f9ab4a4941b3bafbb44e3e100b] instance cdcdd06c-2030-4826-bf99-b09dd96b2373: getting locked state from (pid=1177) get_lock /opt/stack/nova/nova/compute/manager.py:1544
<Mega crash>

Russell Bryant (russellb) wrote:

Regarding the security classification, I'm leaning toward not calling it a vulnerability. It could be a DoS, but only by an authenticated user. There is also no privilege escalation or information leakage.

Thoughts from other VMT members?

Thierry Carrez (ttx) wrote:

This looks a lot like bug 893582, which was supposedly fixed in E3. Justin, what version are you reproducing this on?

I'd consider this a vulnerability if it's quite easy to DoS the whole thing. Requiring an authenticated user is not really a high bar of entry :) We have a bunch of those action concurrency issues (tagged "concurrency" for those playing at home), but most (all?) of them were just triggering bugs, not crashes...

Changed in nova:
status: New → Incomplete
justinsb (justin-fathomdb) wrote:

This also has similar symptoms to bug 938699, which is public.

I was very close to HEAD on trunk, with a few of my own patches that have since merged (but none that should be relevant). I've gone to HEAD today (00f8caaeff15b03062db99290b8ebbe96a6712c5 - Fix weighted host), re-run devstack, and the problem is still there.

Running on Oneiric.

uname -a
Linux openstack1 3.0.0-12-server #20-Ubuntu SMP Fri Oct 7 16:36:30 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

I'm attaching bouncer.py, which starts an instance and then immediately terminates it. It crashes devstack pretty reliably. Config is hard-coded at the top; most likely you just need to change the OS_AUTH_URL variable (I don't run it against the same machine).
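
The script boils down to roughly this (not the actual bouncer.py; a sketch assuming the python-novaclient v1_1 API of the time, with placeholder credentials, endpoint, image and flavor names):

from novaclient.v1_1 import client

OS_AUTH_URL = 'http://192.168.1.10:5000/v2.0/'   # change this to your devstack host

nova = client.Client('demo', 'password', 'demo', OS_AUTH_URL)
image = nova.images.find(name='cirros-0.3.0-x86_64-uec')
flavor = nova.flavors.find(name='m1.tiny')

# Boot a server and delete it immediately, without waiting for it to go ACTIVE.
server = nova.servers.create(name='bouncer-test', image=image, flavor=flavor)
server.delete()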

justinsb (justin-fathomdb) wrote:

I managed to attach gdb and got a backtrace. Unfortunately it isn't terribly informative because of the way Python is compiled on Oneiric, but it does look like we're in Python code (not in some C module):

#0 0x00007f6cdbe1170c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f6cdbe1377d in malloc () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x000000000044a2eb in fill_free_list () at ../Objects/intobject.c:52
#3 0x000000000044b3e1 in PyInt_FromLong (ival=<optimized out>) at ../Objects/intobject.c:104
#4 PyInt_FromLong (ival=<optimized out>) at ../Objects/intobject.c:124
#5 PyInt_FromSsize_t (ival=<optimized out>) at ../Objects/intobject.c:127
#6 0x000000000044e6f6 in listextend (self=0x41396c8, b=<optimized out>) at ../Objects/listobject.c:872
#7 0x000000000044ece6 in list_init (self=0x41396c8, args=<optimized out>, kw=<optimized out>) at ../Objects/listobject.c:2458
#8 0x000000000047c1d1 in type_call (type=<optimized out>, args=(<itertools.count at remote 0x3b8bf38>,), kwds=0x0) at ../Objects/typeobject.c:737
#9 0x000000000041ad2a in PyObject_Call (func=<type at remote 0x844a00>, arg=<optimized out>, kw=<optimized out>) at ../Objects/abstract.c:2529
#10 0x00000000004b6b9e in do_call (nk=<optimized out>, na=<optimized out>, pp_stack=0x7fffeac59340, func=<type at remote 0x844a00>) at ../Python/ceval.c:4239
#11 call_function (oparg=<optimized out>, pp_stack=0x7fffeac59340) at ../Python/ceval.c:4044
#12 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at ../Python/ceval.c:2666
#13 0x00000000004bcd2d in PyEval_EvalCodeEx (co=0x7f6cdae7fe30, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kws=<optimized out>, kwcount=2, defs=0x7f6cda53fde8,
    defcount=2, closure=0x0) at ../Python/ceval.c:3253
#14 0x00000000004b6a5b in fast_function (nk=<optimized out>, na=1, n=<optimized out>, pp_stack=0x7fffeac59530, func=<function at remote 0x7f6cda2c5f50>) at ../Python/ceval.c:4117
#15 call_function (oparg=<optimized out>, pp_stack=0x7fffeac59530) at ../Python/ceval.c:4042
#16 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at ../Python/ceval.c:2666
#17 0x00000000004bcd2d in PyEval_EvalCodeEx (co=0x7f6cdae7fe30, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kws=<optimized out>, kwcount=2, defs=0x7f6cda53fde8,
    defcount=2, closure=0x0) at ../Python/ceval.c:3253
#18 0x00000000004b6a5b in fast_function (nk=<optimized out>, na=1, n=<optimized out>, pp_stack=0x7fffeac59720, func=<function at remote 0x7f6cda2c5f50>) at ../Python/ceval.c:4117
#19 call_function (oparg=<optimized out>, pp_stack=0x7fffeac59720) at ../Python/ceval.c:4042
#20 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at ../Python/ceval.c:2666
#21 0x00000000004bcd2d in PyEval_EvalCodeEx (co=0x7f6cdae7fe30, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kws=<optimized out>, kwcount=2, defs=0x7f6cda53fde8,
    defcount=2, closure=0x0) at ../Python/ceval.c:3253
#22 0x00000000004b6a5b in fast_function (nk=<optimized out>, na=1, n=<optimized out>, pp_stack=0x7fffeac59910, func=<func...

Russell Bryant (russellb) wrote:

I confirmed that I can reproduce this when running your script against devstack.

Russell Bryant (russellb) wrote:

From bug 938699, can you try setting the use_usb_tablet option to False? That seems to fix the crash for me.
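
For reference, assuming the essex-era option=value nova.conf format, the change would look like this (restart nova-compute afterwards):

[DEFAULT]
use_usb_tablet=False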

justinsb (justin-fathomdb) wrote:

Thanks for reproducing and for the suggestion. I'm glad to hear I'm not completely crazy :-)

Setting use_usb_tablet to False didn't fix it for me - same symptoms as before.

Changing use_usb_tablet doesn't really fit my mental model here :-) Did this fix your devstack against the above script, or did it fix bug 938699? If it fixed your devstack, can you please try running it a few times, just to check that you didn't 'get lucky' (as I'm sure this is timing dependent)?

Russell Bryant (russellb) wrote:

Sure. I was using your script against devstack and after I changed this option, it didn't happen, but I only tried it once after making the change. I'll try a few more times and see what happens ...

Russell Bryant (russellb) wrote:

It appears that I got lucky the first time. I can still reproduce it, so I'll keep looking!

justinsb (justin-fathomdb) wrote:

I'm trying to recompile Python with the recommended gdb compiler flags so we can get a real backtrace. Hopefully everything will be clear with a nice(r) stack trace.

Russell Bryant (russellb) wrote:

Ok, now I really can't make it happen anymore ... I switched devstack over to my branch that includes the fix for bug 934575 and can no longer make it happen. That patch just went into master. Can you try that?

justinsb (justin-fathomdb) wrote:

Ah.... bug 934575 seems likely to be the cause. I'm going to try to see if I can get that stack trace; hopefully that will confirm it, and then I'll merge and verify that all is well.

justinsb (justin-fathomdb) wrote:

I now have a Python built with gdb support, and a nice stack trace:

(gdb) pystack
/opt/stack/nova/nova/utils.py (745): to_primitive
/opt/stack/nova/nova/utils.py (736): to_primitive
/opt/stack/nova/nova/utils.py (751): to_primitive
/opt/stack/nova/nova/utils.py (730): to_primitive
/opt/stack/nova/nova/utils.py (745): to_primitive
/opt/stack/nova/nova/utils.py (736): to_primitive
/opt/stack/nova/nova/utils.py (751): to_primitive
/opt/stack/nova/nova/utils.py (736): to_primitive
/opt/stack/nova/nova/utils.py (751): to_primitive
/opt/stack/nova/nova/utils.py (730): to_primitive
/opt/stack/nova/nova/utils.py (736): to_primitive
/opt/stack/nova/nova/notifier/api.py (118): notify
/opt/stack/nova/nova/exception.py (136): wrapped
/opt/stack/nova/nova/compute/manager.py (141): decorated_function
/opt/stack/nova/nova/exception.py (112): wrapped
/opt/stack/nova/nova/compute/manager.py (394): _shutdown_instance_even_if_deleted
/opt/stack/nova/nova/compute/manager.py (443): _run_instance
/opt/stack/nova/nova/compute/manager.py (644): run_instance
/opt/stack/nova/nova/compute/manager.py (170): decorated_function
/opt/stack/nova/nova/exception.py (112): wrapped
/opt/stack/nova/nova/rpc/amqp.py (250): _process_data
/opt/stack/nova/nova/exception.py (112): wrapped
/usr/lib/python2.7/dist-packages/eventlet/greenpool.py (80): _spawn_n_impl
/usr/lib/python2.7/dist-packages/eventlet/hubs/epolls.py (59): do_poll
/usr/lib/python2.7/dist-packages/eventlet/hubs/poll.py (97): wait
/usr/lib/python2.7/dist-packages/eventlet/hubs/hub.py (226): run
/usr/lib/python2.7/dist-packages/eventlet/hubs/hub.py (177): switch
/usr/lib/python2.7/dist-packages/eventlet/event.py (116): wait
/usr/lib/python2.7/dist-packages/eventlet/greenthread.py (166): wait
/opt/stack/nova/nova/service.py (131): wait
/opt/stack/nova/nova/service.py (411): wait
/opt/stack/nova/bin/nova-compute (54): <module>

So this does look like bug 934575. I merged trunk to get the patch, and there are no more crashes. Thanks, Russell, for the help and the fix.
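
For anyone hitting this on an older tree: the pystack above shows to_primitive in nova/utils.py recursing through nested objects without bound, which is what eats the CPU and memory. The fix for bug 934575 addresses this; a simplified sketch of a depth cap that stops such runaway recursion (not the merged patch itself) looks like:

def to_primitive(value, level=0, max_depth=3):
    # Give up past a fixed depth instead of recursing forever through
    # objects that (directly or indirectly) reference each other.
    if level > max_depth:
        return '?'
    if isinstance(value, (list, tuple)):
        return [to_primitive(v, level + 1, max_depth) for v in value]
    if isinstance(value, dict):
        return dict((k, to_primitive(v, level + 1, max_depth))
                    for k, v in value.items())
    if hasattr(value, '__dict__'):
        # Model-like objects: convert their attributes.
        return to_primitive(value.__dict__, level + 1, max_depth)
    return value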

Here are my notes on how to get a stack trace with Python & GDB:

# I'm presuming root, I think...
sudo bash

# You probably want gdb :-)
sudo apt-get install gdb

# Useful symbols
sudo apt-get install python2.7-dbg
sudo apt-get install libc6-dbg
sudo apt-get install python-greenlet-dbg

# Grrr... from http://wiki.python.org/moin/DebuggingWithGdb:
# Also, with gcc 4.5.2 on Ubuntu (at least) the macros fail because the call_function routine appears to be between PyEval_EvalFrameEx and PyEval_EvalCodeEx, so the macro fails with: No symbol "co" in current context.
# Recompiling python with make "CFLAGS=-g -fno-inline -fno-strict-aliasing" solves this problem.

# So let's build our own python...
sudo apt-get install dpkg-dev
sudo apt-get build-dep python2.7

mkdir -p ~/apts
pushd ~/apts
apt-get source python2.7
cd python2.7-2.7.2/
# Edit configure.in; add -fno-inline -g after -fno-strict-aliasing
vim configure.in
Change:
      BASECFLAGS="$BASECFLAGS -fno-strict-aliasing"
To:
      BASECFLAGS="$BASECFLAGS -fno-strict-aliasing -fno-inline -g"

# Build the package, skip tests
export DEB_BUILD_OPTIONS="nocheck notest nobench parallel=8"
# That should work, but I haven't actua...


Russell Bryant (russellb) wrote:

Thierry, this bug is already fixed. It was only in master for about a week, maybe a little less. It wasn't in essex-3 and doesn't affect stable/diablo. Given all that, I think we can pass on an advisory. Thoughts?

Changed in nova:
assignee: nobody → Russell Bryant (russellb)
importance: Undecided → High
milestone: none → essex-4
status: Incomplete → Fix Committed
justinsb (justin-fathomdb) wrote:

I agree that there's no need to classify this as a security bug, as I don't think the problem ever made it to a release.

Thierry Carrez (ttx) wrote:

Agreed, and opening to public. Thanks Justin! As we like to say in security land: "better safe than sorry" :)

visibility: private → public