"Invalid block device mapping" error on a creating >200 instances

Bug #1533197 reported by Sergey Arkhipov
Affects                 Status      Importance   Assigned to        Milestone
Mirantis OpenStack (status tracked in 10.0.x)
    10.0.x              Confirmed   High         Ivan Kolodyazhny
    8.0.x               Won't Fix   High         Ivan Kolodyazhny
    9.x                 Won't Fix   High         Ivan Kolodyazhny

Bug Description

Problem discovered on MOS 8.0 ISO #361

When I start a large number of instances (in my case, 250), a large portion of them fail with an error like "Build of instance 070995e4-66f4-4acd-87c7-92815a489edd aborted: Block Device Mapping is Invalid."

Message: Build of instance 070995e4-66f4-4acd-87c7-92815a489edd aborted: Block Device Mapping is Invalid.
Code: 500
Details: File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 1905, in _do_build_and_run_instance filter_properties) File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2025, in _build_and_run_instance 'create.error', fault=e) File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 195, in __exit__ six.reraise(self.type_, self.value, self.tb) File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 1996, in _build_and_run_instance block_device_mapping) as resources: File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__ return self.gen.next() File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2143, in _build_resources reason=e.format_message())

Environment:
    * 50 baremetal nodes
    * Ceph *is not* used
        - Cinder LVM over iSCSI for volumes is enabled; the rest of the checkboxes are disabled
    * Ceilometer, Murano and Sahara *are* installed
    * Each node has 32142 MB RAM and 12 CPUs

Steps to reproduce (a python-novaclient equivalent is sketched after this list):
    * Upload an image with Ubuntu Trusty
    * Proceed to Horizon
    * Proceed to 'Instances', click on 'Launch instance'
    * Select m1.small flavour (1 vCPU, root disk - 20GB, RAM - 2048 MB)
    * Set 'Boot from image (creates a new volume)'
    * Choose uploaded image
    * Set device size to 20 GB
    * Ensure that 'Delete on Terminate' checkbox is active
    * Set 250 instances
    * Choose network
    * Click on 'Launch'
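
For reference, a rough python-novaclient equivalent of the Horizon steps above (a sketch only; the client constructor arguments, IMAGE_ID and NET_ID are placeholders, not values from this environment):

    # Hypothetical sketch of the same scenario via python-novaclient.
    from novaclient import client as nova_client

    nova = nova_client.Client('2', USERNAME, PASSWORD, TENANT_NAME, AUTH_URL)

    flavor = nova.flavors.find(name='m1.small')
    bdm_v2 = [{
        'uuid': IMAGE_ID,               # the uploaded Ubuntu Trusty image
        'source_type': 'image',
        'destination_type': 'volume',   # "Boot from image (creates a new volume)"
        'volume_size': 20,              # 20 GB root volume
        'delete_on_termination': True,  # "Delete on Terminate" checkbox
        'boot_index': 0,
    }]

    nova.servers.create(name='scale-test',
                        image=None,     # boot source comes from the BDM
                        flavor=flavor,
                        block_device_mapping_v2=bdm_v2,
                        nics=[{'net-id': NET_ID}],
                        min_count=250,
                        max_count=250)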

Expected results:
    * All instances are up and running

Real result:
    * Only 81 instances started; 169 failed with the error above
    * Cinder created 235 volumes that are in the available or in-use state

Due to bug https://bugs.launchpad.net/bugs/1529182 I cannot create a diagnostic snapshot, so I've gathered all logs after the error happened:
https://drive.google.com/a/mirantis.com/file/d/0B9tzODpFABxkWk05Ykl2V1NZWE0/view?usp=sharing

Please find cinder and nova output attached.

Related upstream bug: https://bugs.launchpad.net/cinder/+bug/1556902

Sergey Arkhipov (sarkhipov) wrote :
Roman Podoliaka (rpodolyaka) wrote :

We don't have such a large environment, but we are going to give it a try at a smaller scale, keeping the same number of VMs and volumes.

tags: added: area-nova
tags: removed: long-haul-testing
Alexander Gubanov (ogubanov) wrote :

In the attached logs I found that nova got a 504 from haproxy.
The haproxy logs for cinder contain a lot of 504 responses:

ogubanov@ogubanov-pc:~/Downloads/logs$ find ./docker-logs/ -name haproxy.log | xargs grep 'cinder' | grep -c ' 504 '
161

and it looks like haproxy terminates POST requests that take >= 60 seconds (note the 'sH--' termination state and the ~60000 ms total times in the entries below):

ogubanov@ogubanov-pc:~/Downloads/logs$ find ./docker-logs/ -name haproxy.log | xargs grep 'cinder' | grep ' 504 ' | tail -n 3
./docker-logs/node-20.domain.tld/haproxy.log:2016-01-12T09:50:37.451451+00:00 info: 192.168.0.35:49739 [12/Jan/2016:09:49:37.453] cinder-api cinder-api/node-20 0/0/0/-1/60001 504 194 - - sH-- 1848/24/24/24/0 0/0 "POST /v1/fcf23cb570d640ffb4c93d4dcca9c138/volumes/20d52a22-efdd-4848-b82e-c9a347510adb/action HTTP/1.1"
./docker-logs/node-20.domain.tld/haproxy.log:2016-01-12T09:51:06.319330+00:00 info: 192.168.0.23:47543 [12/Jan/2016:09:50:06.300] cinder-api cinder-api/node-20 0/0/0/-1/60011 504 194 - - sH-- 1862/23/23/23/0 0/0 "POST /v1/fcf23cb570d640ffb4c93d4dcca9c138/volumes/4f453d0f-6602-476c-8330-1ef1caa91aa1/action HTTP/1.1"
./docker-logs/node-20.domain.tld/haproxy.log:2016-01-12T09:51:06.330370+00:00 info: 192.168.0.36:41157 [12/Jan/2016:09:50:06.302] cinder-api cinder-api/node-20 0/0/0/-1/60021 504 194 - - sH-- 1861/22/22/22/0 0/0 "POST /v1/fcf23cb570d640ffb4c93d4dcca9c138/volumes/f19c61b7-ebc8-40cc-af57-1840a59ae428/action HTTP/1.1"

Details here http://pastebin.com/MjWu9sMr
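
A small helper along these lines (not part of the report; it just automates the grep above) can pull the volume UUIDs out of the 504'd "POST .../volumes/<uuid>/action" lines, so they can be matched against the block devices of the failed instances:

    # Extract volume UUIDs from cinder-api requests that haproxy answered with 504.
    import re
    import sys

    UUID = r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'
    PATTERN = re.compile(r' 504 .* "POST /v1/[0-9a-f]+/volumes/(%s)/action' % UUID)

    volumes = set()
    for path in sys.argv[1:]:          # e.g. docker-logs/*/haproxy.log
        with open(path) as log:
            for line in log:
                match = PATTERN.search(line)
                if match:
                    volumes.add(match.group(1))

    print('\n'.join(sorted(volumes)))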

Roman Podoliaka (rpodolyaka) wrote :

We deploy cinder-api in active/backup mode, which means only one cinder-api instance processes API requests at any moment in time.

According to what Alexander found in the logs, it's either:

1) there was a problem with node-20 (e.g. we ran out of memory and started swapping, etc.), which made things really slow and increased cinder-api response times to more than 60 s

2) cinder-api simply can't process so many requests in time (i.e. we reached the limit of a single cinder-api)

Unfortunately, the atop reports are missing, so we can't know for sure which one it was.

Roman Podoliaka (rpodolyaka) wrote :

Cinder team, could you please take a look at the logs and provide your feedback on my previous comment?

tags: added: area-cinder
removed: area-nova
Ivan Kolodyazhny (e0ne) wrote :

@Sergey, please provide atop reports and access to the env. We can't reproduce it or find the root cause without them.

Roman Podoliaka (rpodolyaka) wrote :

Re-assigning the bug to Sergey, as the ball is in his court now.

Roman Podoliaka (rpodolyaka) wrote :

Evgeny, please do not re-assign bugs without leaving a comment first.

Ivan made a valid request.

Evgeny Ivanov (eivanov) wrote :

*The problem has been reproduced, and the env was provided for further analysis. @rpodolyaka created a diagnostic snapshot.*

Roman Podoliaka (rpodolyaka) wrote :
Roman Podoliaka (rpodolyaka) wrote :

So I took a look at the environment and uploaded a diagnostic snapshot.

Errors in the nova-compute logs are caused by the fact that haproxy terminates the HTTP requests to Cinder (initialization of the volume connection) after a 60 s timeout:

http://paste.openstack.org/show/485411/
http://paste.openstack.org/show/485417/

I checked atop logs to figure out if we are CPU bound or not (cinder-api runs in active-backup mode):

http://paste.openstack.org/show/485413/

The CPU usage is rather high; still, it should be OK for a 12-core server.

The cinder-volume / cinder-api interaction is interesting:

http://paste.openstack.org/show/485428/

The HTTP POST to cinder-api is processed *synchronously*, as the API waits for the return of an RPC call to cinder-volume. The latter is for some reason slow at processing the RPC call - it looks like it gets stuck somewhere in Ceph calls - and by the time it's ready to send an RPC reply, rabbitmq has closed the connection (missed heartbeats?).

Eventually, cinder-volume retries and delivers a message back to cinder-api, but it's too late: haproxy has already closed the connection on its side.
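
To illustrate the pattern (a minimal oslo.messaging sketch, not Cinder's actual code; the topic, method name and timeout value are assumptions): the API worker blocks on call() for the whole duration of the HTTP request, so anything that makes cinder-volume slower than haproxy's 60 s server timeout turns into a 504 for nova-compute.

    # Minimal sketch of a synchronous RPC round-trip, oslo.messaging style.
    # Not Cinder's real code; names and values are illustrative only.
    from oslo_config import cfg
    import oslo_messaging as messaging

    transport = messaging.get_transport(cfg.CONF)
    target = messaging.Target(topic='cinder-volume', version='1.0')
    client = messaging.RPCClient(transport, target, timeout=60)

    def initialize_connection(ctxt, volume_id, connector):
        # call() blocks the API worker until cinder-volume replies.  If the
        # reply (or its redelivery after missed rabbitmq heartbeats) arrives
        # later than haproxy's 60 s server timeout, haproxy has already
        # returned 504 to the caller.
        return client.call(ctxt, 'initialize_connection',
                           volume_id=volume_id, connector=connector)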

Roman Podoliaka (rpodolyaka) wrote :

MOS Cinder, please take a look at my comment and the diagnostic snapshot and tell us what you think.

tags: added: move-to-mu
tags: added: release-notes
Roman Podoliaka (rpodolyaka) wrote :

We'll need more time to find the root cause of this. The user impact is limited. We believe it's safe to move this to an MU and document it as a current performance/scale limitation in the release notes.

Ivan Kolodyazhny (e0ne) wrote :

The issue is reproducible with both the Ceph and LVM backends.

Ivan Kolodyazhny (e0ne) wrote :

The reason for the failures is:
We get a lot of concurrent requests between cinder-api and cinder-volume. Some of them take more than one minute to execute, and HAProxy terminates the connection. That's why Nova raises the "Block Device Mapping is Invalid" exception. I'm still investigating why cinder-volume is slow in such cases.

Sergey Arkhipov (sarkhipov) wrote :

The problem was reproduced on build #562 on scale env 14 at an even smaller scale (by an order of magnitude!).

Environment:
    * 3 controllers
        - Sahara, Murano activated
        - Ceph is used for everything
    * 2 LMA nodes
    * 5 computes

I've reproduced the steps above with different numbers of booting VMs:
  * During the boot of 15 instances (3 per compute), the following instance failed:
    - e960617e-041f-4348-8d3c-1a38b39cec11
  * During the boot of 20 instances (4 per compute), the following instances failed:
    - 04e0746b-5ba6-4622-bd8a-acf68f6c77e1
    - 528a9bc1-5374-494b-b405-3ac35d33e3a4

Please find diagnostic snapshot here:
https://drive.google.com/a/mirantis.com/file/d/0B9tzODpFABxkdk9JeDNEZ0l1Y0k/view?usp=sharing

tags: added: 8.0 release-notes-done
removed: release-notes
Ivan Kolodyazhny (e0ne) wrote :
Ivan Kolodyazhny (e0ne)
description: updated
Ivan Kolodyazhny (e0ne) wrote :

Actually, there are two different issues: one with the Ceph backend and one with the LVM backend.

1. Ceph:
Using the 'raw' format for images increases the number of volumes that can be built concurrently. This is now in our guidelines, so I hope it should not be an issue any more.

2. LVM:
We can't fix it in 9.0 and earlier. The fix requires a lot of code changes in Cinder and should not be done in the scope of this bug. I'll propose a spec for it for Newton. There is some feedback on my proposal in both the openstack-dev and operators mailing lists [1], [2].
This solution could also be helpful for the Ceph backend.

Unfortunately, the only workaround we can provide is a recommendation to use the 'Generic image cache' feature, which was introduced in Liberty/MOS 8.0.

[1] http://lists.openstack.org/pipermail/openstack-dev/2016-May/095668.html
[2] http://lists.openstack.org/pipermail/openstack-operators/2016-May/010527.html
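
For reference, enabling that workaround is a cinder.conf change along these lines (a sketch using the Liberty image-volume-cache options; the backend section name and the internal tenant IDs are placeholders, not values from this environment):

    [DEFAULT]
    cinder_internal_tenant_project_id = <internal project uuid>
    cinder_internal_tenant_user_id = <internal user uuid>

    [lvm]                                  # the configured backend section
    image_volume_cache_enabled = True
    image_volume_cache_max_size_gb = 200   # 0 means unlimited
    image_volume_cache_max_count = 50      # 0 means unlimited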

Ivan Kolodyazhny (e0ne) wrote :

The root cause of this issue is the same as for #1550192
