Cannot attach volume to server: AssertionError: PID check failed

Bug #1341739 reported by Pavlo Shchelokovskyy
This bug affects 3 people
Affects: Cinder
Status: Invalid
Importance: High
Assigned to: Unassigned

Bug Description

Using master DevStack with the master branch of all services.

Steps to reproduce:

1) create nova server and cinder volume
2) nova volume-attach <server_id> <volume_id> <mount>

Expected: the volume is attached to the server, confirmed by both "cinder list" and "nova show <server_id>".

Actual: the volume is not attached (at least those two commands do not show it as such). If I quickly issue "nova show .." several times after "nova volume-attach ..", I can see the volume first appear as attached as far as Nova knows, but then it disappears from the server's list of volume attachments. Cinder itself never reports the volume as attached.

Below is what I believe to be the relevant trace; it appears in the "c-vol" screen several times (at least 4) during an attempt to attach the volume:

2014-07-14 19:47:18.796 DEBUG cinder.volume.manager [req-aef1ab98-5c9b-420a-a095-0849b57ce04c 4a9d5904306a4888b63fc55685687df3 63f92ae55e7b429b9c1f6907cd58b129] Volume 6efdd070-e97c-4d42-a338-84a9308fcd74: creating export from (pid=13760) initialize_connection /opt/stack/cinder/cinder/volume/manager.py:795
2014-07-14 19:47:18.797 ERROR oslo.messaging.rpc.dispatcher [req-aef1ab98-5c9b-420a-a095-0849b57ce04c 4a9d5904306a4888b63fc55685687df3 63f92ae55e7b429b9c1f6907cd58b129] Exception during message handling: PID check failed. RNG must be re-initialized after fork(). Hint: Try Random.atfork()
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 134, in _dispatch_and_reply
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher incoming.message))
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 177, in _dispatch
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher return self._do_dispatch(endpoint, method, ctxt, args)
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 123, in _do_dispatch
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher result = getattr(endpoint, method)(ctxt, **new_args)
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/cinder/cinder/volume/manager.py", line 797, in initialize_connection
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher volume)
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/cinder/cinder/volume/drivers/lvm.py", line 527, in create_export
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher return self._create_export(context, volume)
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/cinder/cinder/volume/drivers/lvm.py", line 539, in _create_export
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher self.configuration)
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/cinder/cinder/volume/iscsi.py", line 44, in create_export
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher chap_username = utils.generate_username()
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/cinder/cinder/utils.py", line 408, in generate_username
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher return generate_password(length, symbolgroups)
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/cinder/cinder/utils.py", line 387, in generate_password
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher password = [random.choice(s) for s in symbolgroups]
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/Crypto/Random/random.py", line 102, in choice
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher return seq[self.randrange(len(seq))]
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/Crypto/Random/random.py", line 83, in randrange
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher r = self.getrandbits(size(num_choices))
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/Crypto/Random/random.py", line 51, in getrandbits
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher return mask & bytes_to_long(self._randfunc(ceil_div(k, 8)))
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/Crypto/Random/_UserFriendlyRNG.py", line 202, in read
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher return self._singleton.read(bytes)
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/Crypto/Random/_UserFriendlyRNG.py", line 178, in read
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher return _UserFriendlyRNG.read(self, bytes)
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/Crypto/Random/_UserFriendlyRNG.py", line 137, in read
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher self._check_pid()
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/Crypto/Random/_UserFriendlyRNG.py", line 153, in _check_pid
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher raise AssertionError("PID check failed. RNG must be re-initialized after fork(). Hint: Try Random.atfork()")
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher AssertionError: PID check failed. RNG must be re-initialized after fork(). Hint: Try Random.atfork()
2014-07-14 19:47:18.797 TRACE oslo.messaging.rpc.dispatcher
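
The assertion comes from pycrypto's per-process PID check: its shared RNG singleton refuses to hand out bytes in a process other than the one that initialized it. A minimal sketch of the bare mechanism, assuming pycrypto 2.x on a POSIX system (not Cinder code, just an illustration):

    import os

    from Crypto.Random import random  # pycrypto 2.x

    # Touch the RNG once so the parent process initializes the
    # shared _UserFriendlyRNG singleton.
    random.choice("abcdef")

    pid = os.fork()
    if pid == 0:
        # Child: the singleton still remembers the parent's PID, so the
        # next read raises "AssertionError: PID check failed ..." unless
        # Crypto.Random.atfork() is called first.
        try:
            random.choice("abcdef")
        except AssertionError as exc:
            print(exc)
        os._exit(0)
    os.waitpid(pid, 0)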

Revision history for this message
John Griffith (john-griffith) wrote :

Hi Pavlo,
Could you provide your devstack local.conf file as well as your cinder.conf, please? Also, what platform are you running on (OS and version)?

Thanks,
John

Revision history for this message
Pavlo Shchelokovskyy (pshchelo) wrote :

I am running master DevStack on an Ubuntu 14.04 x64 VM under qemu-kvm. The cinder.conf file is attached. local.conf has nothing Cinder-specific, so Cinder is probably using the DevStack defaults.

Revision history for this message
Pavlo Shchelokovskyy (pshchelo) wrote :

local.conf is attached

Revision history for this message
Pavlo Shchelokovskyy (pshchelo) wrote :

Just a side comment: there is a review (https://review.openstack.org/#/c/90143) of a Tempest test that tries to exercise volume operations through Heat. At some point Jenkins started consistently failing on it, with Heat unable to attach a volume to an instance. Judging from the time of the first failure, something relevant changed in Cinder between June 19 and June 24... But of course I might be completely wrong about this, as the error here seems to come from a third-party dependency.

Revision history for this message
git-harry (git-harry) wrote :

I wasn't able to reproduce this using a default devstack install. The trace suggests that https://review.openstack.org/#/c/105779/ may be the cause.

Revision history for this message
Ivan Kolodyazhny (e0ne) wrote :

I also can't reproduce this on a fresh devstack install from master.

Revision history for this message
John Griffith (john-griffith) wrote :

I agree with git-harry: what seems to be happening here is that the change to use pycrypto for CHAP credential generation didn't take into account random number generator initialization for the threads. There are quite a few bugs logged against pycrypto for this, which come up pretty quickly in a Google search.

I'm looking at options to fix this, but we may want to consider reverting the patch.

Changed in cinder:
status: New → Triaged
importance: Undecided → High
Revision history for this message
John Griffith (john-griffith) wrote :

By the way, I've never been able to get a clear explanation of *why* pycrypto's random is better than, say, sysrandom in this case. IIRC sysrandom will by default init the PRNG using the system time and wouldn't exhibit this problem for threads/forks.

Revision history for this message
Pavlo Shchelokovskyy (pshchelo) wrote :

So I decided to test it in a somewhat cleaner env:
Same DevStack VM,
- killed everything in /opt
- killed /etc/cinder/
- killed and recloned devstack

I started devstack with the same local.conf as attached above. The bug is reproduced.

Revision history for this message
Ivan Kolodyazhny (e0ne) wrote :

John, we could use random.SystemRandom to generate passwords instead of Crypto.Random.

Using the Crypto.Random.atfork() method as a fix looks like a hack, but it resolves the issue too.
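
For reference, a minimal sketch of the SystemRandom approach (the helper below is illustrative, not Cinder's actual generate_password()):

    import random
    import string

    # os.urandom-backed generator: it keeps no userspace pool, so it
    # is safe across fork() with no re-initialization step.
    _rng = random.SystemRandom()

    def generate_password(length=20,
                          symbols=string.ascii_letters + string.digits):
        return ''.join(_rng.choice(symbols) for _ in range(length))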

Revision history for this message
John Griffith (john-griffith) wrote :

I'm trying to set up an env that can repro this and having issues. One thing I was wondering: what if we explicitly reinitialized the RNG in the method? Would you be willing to try this out, since you have an env that exhibits the problem?
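
For illustration, explicitly re-initializing in the method would amount to something like this sketch (a hypothetical simplification of cinder.utils.generate_password(), not the actual patch):

    from Crypto import Random
    from Crypto.Random import random

    def generate_password(length=20,
                          symbols="abcdefghjkmnpqrstuvwxyz23456789"):
        # Reset pycrypto's shared RNG to the current PID before use,
        # so a process that forked after import passes the PID check.
        Random.atfork()
        return ''.join(random.choice(symbols) for _ in range(length))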

Revision history for this message
Ivan Kolodyazhny (e0ne) wrote :

A patch like http://paste.openstack.org/show/86601/ fixes the problem on Pavlo's environment, but it doesn't look good.

Revision history for this message
John Griffith (john-griffith) wrote :

I don't overly object to the proposed patch so long as it fixes the issue. I am concerned that I still can't seem to reproduce this on my setup: a fresh Trusty install using the provided local.conf (I just added the HOST_IP field).

I'd consider either the patch you provided or, honestly, even reverting the change in question, as I don't know that I agree our usage of random here was really a 'bad thing' to begin with.

Revision history for this message
Ivan Kolodyazhny (e0ne) wrote :

I agree with you. I'm also trying to reproduce it in my environment.

Revision history for this message
Matt Riedemann (mriedem) wrote :

We're seeing this in our internal CI on Juno; it started on 7/14, the same day this bug was opened.

We're on RHEL 6.5 and python-crypto 2.6.0. I updated to python-crypto 2.6.1 but that didn't help. The bug fix in 2.6.1 looked interesting/related though:

https://github.com/dlitz/pycrypto/blob/7fd528d03b5eae58eef6fd219af5d9ac9c83fa50/ChangeLog

I queried for the error in logstash but that doesn't show anything, so I was thinking maybe it's a py26 or a particular gcc version issue, but comment 2 says it's happening on Ubuntu 14.04. It's odd that we're not seeing this in the gate, but these are the tests we're seeing fail internally:

tempest.api.compute.servers.test_server_rescue_negative.ServerRescueNegativeTestJSON.test_rescued_vm_detach_volume[gate,negative]
tempest.api.compute.servers.test_server_rescue_negative.ServerRescueNegativeTestXML.test_rescued_vm_detach_volume[gate,negative]
tempest.api.compute.servers.test_delete_server.DeleteServersTestJSON.test_delete_server_while_in_attached_volume[gate]
tempest.api.compute.servers.test_delete_server.DeleteServersTestXML.test_delete_server_while_in_attached_volume[gate]
tempest.scenario.test_encrypted_cinder_volumes.TestEncryptedCinderVolumes.test_encrypted_cinder_volumes_cryptsetup[compute,image,volume]
tempest.scenario.test_encrypted_cinder_volumes.TestEncryptedCinderVolumes.test_encrypted_cinder_volumes_luks[compute,image,volume]

Revision history for this message
Matt Riedemann (mriedem) wrote :

Note that the gate doesn't run with multi_backend=True in Tempest; it's not even run in a periodic-qa job from what I can tell. So we have no multi-backend testing upstream, but we are testing it internally, and that might also be part of the issue here.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Test patch here: https://review.openstack.org/#/c/107827/

I'd like to see if running with cinder multi-backend will reproduce the failure in the check queue; then we'll know whether it's related to the problem, since we don't run multi-backend testing right now.

Revision history for this message
Ivan Kolodyazhny (e0ne) wrote :

This issue is reproducible inside KVM: OpenStack (devstack) is running inside a KVM instance.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Note that oslo-incubator has a crypt utils module; maybe that can be used instead. Otherwise, Glance and Heat have their own crypto modules as well, so maybe something can be reused from there if using pycrypto directly isn't working.

Revision history for this message
Ivan Kolodyazhny (e0ne) wrote :

For the 'default' Python random generator we run random.seed() [1]. pycrypto's Random.atfork() should be invoked for every child too. I'm not sure that the patch proposed in comment #12 is a good solution, so reverting [2] looks like a good option.

Also, I'm not sure why this bug is reproducible only inside KVM for me. I tested on bare metal, KVM, and VirtualBox.

[1] https://github.com/openstack/cinder/blob/master/cinder/openstack/common/service.py#L295
[2] https://review.openstack.org/#/c/105779/
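
For illustration, the per-child re-initialization Ivan describes would mirror the stdlib reseed in [1] with an extra atfork call; the hook name here is made up:

    import random

    from Crypto import Random

    def _child_process_setup():
        # Runs in each worker right after fork(), analogous to the
        # random.seed() call in cinder/openstack/common/service.py.
        random.seed()     # reseed the stdlib PRNG
        Random.atfork()   # reset pycrypto's RNG singleton to the new PID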

Revision history for this message
Ivan Kolodyazhny (e0ne) wrote :

I can't reproduce it on my env after redeployment.

Revision history for this message
Pavlo Shchelokovskyy (pshchelo) wrote :

I cannot reproduce it any more either. Using Ubuntu 14.04, fully updated from the official repos, inside KVM (host: Ubuntu 14.04, fully updated), master DevStack + masters of all the components. The bug is gone.

Since I noticed that qemu, kvm, and libvirt were updated during the latest "apt-get upgrade", I believe it might have been a bug in the hypervisor.

Closing as Invalid.

Changed in cinder:
status: Triaged → Invalid