In multinode setup VM fails to launch due to cinder not checking with all glance api servers

Bug #1571211 reported by Serguei Bezverkhi
This bug affects 5 people
Affects   Status    Importance   Assigned to      Milestone
Cinder    Invalid   Medium       Sean McGinnis
Glance    New       Undecided    Unassigned

Bug Description

In multinode setup every other instance fails to launch due to cinder not checking with all configured glance api servers for the requested image.

Steven Dake (sdake)
affects: kolla → cinder
Changed in cinder:
status: New → Confirmed
assignee: nobody → Steven Dake (sdake)
Changed in cinder:
status: Confirmed → In Progress
Revision history for this message
Steven Dake (sdake) wrote :

Kolla sets glance_api_servers to the list of all glance API services and also sets glance_num_retries. These values are not honored when glance returns a None type, which happens when the call() operation results in a missing image. Serguei to add logs.
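For illustration, a minimal sketch of the retry behavior being described, assuming a hypothetical make_client() factory and that a missing image comes back as None; this is not the actual cinder code:

def fetch_image_meta(api_servers, image_id, num_retries, make_client):
    """Ask every configured glance API server before giving up."""
    last_error = None
    for _attempt in range(num_retries + 1):      # honor glance_num_retries
        for server in api_servers:               # honor glance_api_servers
            try:
                image = make_client(server).call('get', image_id)
            except Exception as exc:             # e.g. connection errors
                last_error = exc
                continue
            if image is not None:                # found on this server
                return image
            # A None result means "not on this server"; keep walking the list.
    raise last_error or LookupError('image %s not found on any configured '
                                    'glance API server' % image_id)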

Revision history for this message
Steven Dake (sdake) wrote :

I have confirmed this bug on a multinode setup using LVM iSCSI. I have also confirmed it via inspection of the code base.

Revision history for this message
Steven Dake (sdake) wrote :

The review tracking this bug is:
https://review.openstack.org/306756

Changed in cinder:
importance: Undecided → Medium
Revision history for this message
Steven Dake (sdake) wrote :

Sheel suggested using this exception:
[10:08:22] <sheel> sdake: nopes, exception.ImageNotFound :)
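A hedged sketch of how that suggestion might fit: a None result is translated into a typed exception so callers can catch it and move on to the next API server. The helper name is made up, and the ImageNotFound constructor argument is an assumption:

from cinder import exception

def get_image_or_raise(client, image_id):
    # Sketch: raise exception.ImageNotFound instead of returning None so the
    # caller can distinguish "not found on this server" from other failures.
    image = client.call('get', image_id)
    if image is None:
        raise exception.ImageNotFound(image_id=image_id)
    return image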

Revision history for this message
Serguei Bezverkhi (sbezverk) wrote :

Here are the steps to reproduce it:

Due to the nature of this bug, at least 2 (better 3) VMs need to be launched.

1. Add an image to glance; it can be Ubuntu or CentOS (these two were tested).
2. Create 3 volumes and make them bootable; these volumes will be used as the VMs' hard disks.

cinder create --name centos7-1-disk 10
cinder create --name ubuntu-1-disk 10
cinder create --name ubuntu-2-disk 10
cinder set-bootable $(cinder list | grep centos7-1-disk | awk '{print $2}') true
cinder set-bootable $(cinder list | grep ubuntu-1-disk | awk '{print $2}') true
cinder set-bootable $(cinder list | grep ubuntu-2-disk | awk '{print $2}') true

3. Launch 3 instances one right after another. Since the issue is between cinder and glance, it is very important to use the exact command structure: the 1st disk is a cinder volume that will be created automatically from the glance image, and the 2nd disk is the one prepared during step 2.

nova boot --flavor m1.small-10g \
--nic net-id=$(neutron net-list | grep net-1710 | awk '{print $2}') \
--block-device id=$(glance image-list | grep CentOS-7-x86_64 | awk '{print $2}'),source=image,dest=volume,bus=ide,device=/dev/hdc,size=5,type=cdrom,bootindex=1 \
--block-device source=volume,id=$(cinder list | grep centos7-1-disk | awk '{print $2}'),dest=volume,size=10,bootindex=0 centos-1

nova boot --flavor m1.small-10g \
--nic net-id=$(neutron net-list | grep net-1710 | awk '{print $2}') \
--block-device id=$(glance image-list | grep ubuntu | awk '{print $2}'),source=image,dest=volume,bus=ide,device=/dev/hdc,size=1,type=cdrom,bootindex=1 \
--block-device source=volume,id=$(cinder list | grep ubuntu-1-disk | awk '{print $2}'),dest=volume,size=10,bootindex=0 ubuntu-1

nova boot --flavor m1.small-10g \
--nic net-id=$(neutron net-list | grep net-1710 | awk '{print $2}') \
--block-device id=$(glance image-list | grep ubuntu | awk '{print $2}'),source=image,dest=volume,bus=ide,device=/dev/hdc,size=1,type=cdrom,bootindex=1 \
--block-device source=volume,id=$(cinder list | grep ubuntu-2-disk | awk '{print $2}'),dest=volume,size=10,bootindex=0 ubuntu-2

When the bug is triggered, the 1st and 3rd instances will be in the active state, but the 2nd instance will be in the error state.

[root@deployment-1 tools]# nova list
+--------------------------------------+----------+--------+------------+-------------+----------------------+
| ID                                   | Name     | Status | Task State | Power State | Networks             |
+--------------------------------------+----------+--------+------------+-------------+----------------------+
| 334f41d6-2a24-4641-8c75-dee091ed018c | centos-1 | ACTIVE | -          | Running     | net-1710=10.57.10.11 |
| e1c57483-7232-4aae-8175-6ea9cc6b661e | ubuntu-1 | ERROR  | -          | NOSTATE     |                      |
| 5dd944ad-578d-45f0-9404-0e7bb6e43933 | ubuntu-2 | ACTIVE | -          | Running     | net-1710=10.57.10.13 |
+--------------------------------------+----------+--------+------------+-------------+----------------------+

Revision history for this message
Serguei Bezverkhi (sbezverk) wrote :

cinder-volume.log collected on the server before applying the proposed fix.

Changed in cinder:
assignee: Steven Dake (sdake) → John Griffith (john-griffith)
Revision history for this message
Sean McGinnis (sean-mcginnis) wrote :

The actual issue is that the glance backends are not synced, correct? So this isn't so much a bug in Cinder as a proposal to work around configuration issues with retries on the Cinder side.

Revision history for this message
Serguei Bezverkhi (sbezverk) wrote :

Hi Sean,

Glance allows a configuration in which an image exists on one api server but not on another. The client's logic (nova, cinder, etc.) should then be to try to get the image from each of the configured api servers and to fail only after all api servers have been contacted.

Serguei

Revision history for this message
Michal Dulko (michal-dulko-f) wrote :

I disagree with Serguei here. Having non-replicated Glance instances in a single OpenStack deployment seems very non-HA, and I believe it is an abuse of the retries functionality. The correct way to configure Glance for HA is to fully replicate the stores of each instance.

Revision history for this message
Michal Dulko (michal-dulko-f) wrote :

And why exactly aren't you using a single backend for all Glance instances in Kolla? It seems *very* inefficient to always query all the g-api's to find a certain image. The consistent hash ring and its implementations (it's Swift here, I guess?) were invented exactly to avoid that.

Revision history for this message
Serguei Bezverkhi (sbezverk) wrote :

Michal, do not get me wrong, I agree with what you are saying. But the fact is that using the file backend with an api server list is still a VALID and documented configuration, and despite all its inefficiencies it should work. If something does not work as described by the docs, it is a bug and must be fixed.

Revision history for this message
Michal Dulko (michal-dulko-f) wrote :

Serguei, can you point to Glance docs stating that this is a supported configuration? Using glance_api_servers for that purpose renders Keystone's service catalog useless to me, so I'm surprised to hear this is in the official docs.

Revision history for this message
Serguei Bezverkhi (sbezverk) wrote :

Here are the configuration lines from cinder.conf (Liberty). I do not think they would be here if this were not official, right? And the plural "servers" indicates that there may be more than just one.

# A list of the glance API servers available to cinder
# ([hostname|ip]:port) (list value)
#glance_api_servers=$glance_host:$glance_port

I posted a question on the glance IRC channel asking them to confirm and will let you know as soon as I hear back.
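For illustration only, a multi-server value in cinder.conf might look like this (hostnames are made up):

[DEFAULT]
# Comma-separated list of glance API endpoints cinder may contact.
glance_api_servers = http://controller1:9292,http://controller2:9292,http://controller3:9292
# Number of retries when contacting glance.
glance_num_retries = 2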

Revision history for this message
Serguei Bezverkhi (sbezverk) wrote :

I got confirmation from glance that, in this admittedly weird scenario, each controller using file as the glance backend has no knowledge of the other controllers and which images they have. It is the responsibility of a client process to walk the provided api server list and check each one for the image's presence. Basically, if this fix is not done, we will have to publish a caveat somewhere that the glance file backend is not supported in a multinode setup.

Revision history for this message
Serguei Bezverkhi (sbezverk) wrote :

All, after some additional discussions with the glance folks, the agreement was to publish a doc patch stating that the multi-node scenario does not support file as a back-end for glance.

Revision history for this message
Michal Dulko (michal-dulko-f) wrote :

Serguei: Does this render this bug invalid?

Changed in cinder:
status: In Progress → Won't Fix
Revision history for this message
Steven Dake (sdake) wrote :

Michal,

The bug is still valid. The code as it stands is defective: if an image is not found for whatever reason, only the first api server is examined. Look at the review:
https://review.openstack.org/306756

Changed in cinder:
status: Won't Fix → In Progress
Revision history for this message
GrzegorzKoper (grzegorz-koper) wrote :

Hello,
This is exactly the situation I faced when I deployed a multinode topology. Whenever you try to boot an instance from an image, the client uses round robin to connect to the glance API and fails to deploy if the image is not found on local storage.
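A rough sketch of the round-robin selection being described (illustrative only; the real client code may differ in detail):

import itertools
import random

def api_server_cycle(glance_api_servers):
    # Shuffle once, then hand out endpoints in a rotating (round-robin) order,
    # so each request may land on a different glance API server.
    servers = list(glance_api_servers)
    random.shuffle(servers)
    return itertools.cycle(servers)

servers = api_server_cycle(['http://controller1:9292', 'http://controller2:9292'])
endpoint = next(servers)   # the server used for this particular request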

Revision history for this message
GrzegorzKoper (grzegorz-koper) wrote :

Sorry for the double post, but is there a related bug opened in glance?
Even if we move the glance backend to external swift (to eliminate this issue), the situation forces us to use glance-cache, and then we hit the same problems with the round-robin mechanism: glance-cache-manage and glance-cache-prefetcher do not find the images :/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on cinder (master)

Change abandoned by John Griffith (<email address hidden>) on branch: master
Review: https://review.openstack.org/306756

Revision history for this message
Vladislav Belogrudov (vlad-belogrudov) wrote :

I doubt this is a problem in Cinder. Glance will fail on its own in a multinode setup if you don't specify an HA configuration with a proper shared storage backend like Ceph or NFS. Just try to store/fetch images to get the same error.

Revision history for this message
Sean McGinnis (sean-mcginnis) wrote : Bug Assignee Expired

Unassigning due to no activity for > 6 months.

Changed in cinder:
assignee: John Griffith (john-griffith) → nobody
Changed in cinder:
status: In Progress → New
Changed in cinder:
assignee: nobody → Sean McGinnis (sean-mcginnis)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on cinder (master)

Change abandoned by Sean McGinnis (<email address hidden>) on branch: master
Review: https://review.openstack.org/306756

Revision history for this message
Sean McGinnis (sean-mcginnis) wrote :

Consensus appears to be that this is something Glance, or more accurately python-glanceclient, should handle, not every consuming project.

Changed in cinder:
status: In Progress → Invalid
Revision history for this message
Viorel-Cosmin Miron (uhl-hosting) wrote :

What happened to this? Is it solved, or is it still affecting users?
