In a multinode setup, VMs fail to launch due to cinder not checking all glance api servers

Bug #1571211 reported by Serguei Bezverkhi on 2016-04-16
This bug affects 4 people
Affects    Importance    Assigned to
Cinder     Medium        Sean McGinnis
Glance     Undecided     Unassigned

Bug Description

In a multinode setup, every other instance fails to launch because cinder does not check all of the configured glance api servers for the requested image.

Steven Dake (sdake) on 2016-04-16
affects: kolla → cinder
Changed in cinder:
status: New → Confirmed
assignee: nobody → Steven Dake (sdake)
Changed in cinder:
status: Confirmed → In Progress
Steven Dake (sdake) wrote :

Kolla sets glance_api_servers to the list of all glance API services and also sets glance_num_retries. These values are not honored when glance returns None, which happens when the call() operation hits a missing image. Serguei to add logs.
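
A minimal sketch of the intended behavior (illustrative names only, not the actual cinder code): walk the whole configured server list instead of giving up after the first None result.

def find_image(glance_clients, image_id):
    # glance_clients: one client per entry in glance_api_servers (hypothetical setup).
    # A None result from one server means "not on this server", so fall
    # through to the next one; fail only after every server has been tried.
    for client in glance_clients:
        image = client.images.get(image_id)
        if image is not None:
            return image
    return None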

Steven Dake (sdake) wrote :

I have confirmed this bug on a multinode setup using LVM SCSI. I have also confirmed it via inspection of the code base.

Steven Dake (sdake) wrote :

The review tracking this bug is:
https://review.openstack.org/306756

Changed in cinder:
importance: Undecided → Medium
Steven Dake (sdake) wrote :

Sheel suggested using this exception:
[10:08:22] <sheel> sdake: nopes, exception.ImageNotFound :)
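
For illustration, the suggestion amounts to raising a typed exception instead of letting None propagate. A sketch (the helper name is hypothetical; cinder.exception.ImageNotFound is the real exception class):

from cinder import exception

def get_image_or_raise(client, image_id):
    # Translate a missing image into ImageNotFound rather than returning None,
    # so callers can catch one well-defined exception and retry elsewhere.
    image = client.images.get(image_id)
    if image is None:
        raise exception.ImageNotFound(image_id=image_id)
    return image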

Serguei Bezverkhi (sbezverk) wrote :

Here are the steps to reproduce it:

Due to the nature of this bug, at least two (preferably three) VMs need to be launched.

1. Add an image to glance; it can be ubuntu or centos (these two were tested).
2. Create three volumes and make them bootable; these volumes will be used as the VMs' hard disks.

cinder create --name centos7-1-disk 10
cinder create --name ubuntu-1-disk 10
cinder create --name ubuntu-2-disk 10
cinder set-bootable $(cinder list | grep centos7-1-disk | awk '{print $2}') true
cinder set-bootable $(cinder list | grep ubuntu-1-disk | awk '{print $2}') true
cinder set-bootable $(cinder list | grep ubuntu-2-disk | awk '{print $2}') true

3. Launch three instances one right after another. Since the issue is between cinder and glance, it is very important to use the exact command structure: the 1st disk is a cinder volume that will be created automatically from the glance image; the 2nd disk is the one prepared during step 2.

nova boot --flavor m1.small-10g \
--nic net-id=$(neutron net-list | grep net-1710 | awk '{print $2}') \
--block-device id=$(glance image-list | grep CentOS-7-x86_64 | awk '{print $2}'),source=image,dest=volume,bus=ide,device=/dev/hdc,size=5,type=cdrom,bootindex=1 \
--block-device source=volume,id=$(cinder list | grep centos7-1-disk | awk '{print $2}'),dest=volume,size=10,bootindex=0 centos-1

nova boot --flavor m1.small-10g \
--nic net-id=$(neutron net-list | grep net-1710 | awk '{print $2}') \
--block-device id=$(glance image-list | grep ubuntu | awk '{print $2}'),source=image,dest=volume,bus=ide,device=/dev/hdc,size=1,type=cdrom,bootindex=1 \
--block-device source=volume,id=$(cinder list | grep ubuntu-1-disk | awk '{print $2}'),dest=volume,size=10,bootindex=0 ubuntu-1

nova boot --flavor m1.small-10g \
--nic net-id=$(neutron net-list | grep net-1710 | awk '{print $2}') \
--block-device id=$(glance image-list | grep ubuntu | awk '{print $2}'),source=image,dest=volume,bus=ide,device=/dev/hdc,size=1,type=cdrom,bootindex=1 \
--block-device source=volume,id=$(cinder list | grep ubuntu-2-disk | awk '{print $2}'),dest=volume,size=10,bootindex=0 ubuntu-2

When the bug is triggered, the 1st and 3rd instances will be in the active state, but the 2nd instance will be in the error state.

[root@deployment-1 tools]# nova list
+--------------------------------------+----------+--------+------------+-------------+----------------------+
| ID                                   | Name     | Status | Task State | Power State | Networks             |
+--------------------------------------+----------+--------+------------+-------------+----------------------+
| 334f41d6-2a24-4641-8c75-dee091ed018c | centos-1 | ACTIVE | -          | Running     | net-1710=10.57.10.11 |
| e1c57483-7232-4aae-8175-6ea9cc6b661e | ubuntu-1 | ERROR  | -          | NOSTATE     |                      |
| 5dd944ad-578d-45f0-9404-0e7bb6e43933 | ubuntu-2 | ACTIVE | -          | Running     | net-1710=10.57.10.13 |
+--------------------------------------+----------+--------+------------+-------------+----------------------+
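
The alternating pattern above matches a round-robin pick of the glance API server: with two servers and the image present on only one of them, every other request lands on the server that does not have it. A toy illustration (purely hypothetical data, not cinder code):

import itertools

servers = ["glance-a", "glance-b"]                       # round-robin pool
images_on = {"glance-a": {"img-1"}, "glance-b": set()}   # image on one server only

picker = itertools.cycle(servers)
for boot in range(1, 4):
    server = next(picker)
    state = "ACTIVE" if "img-1" in images_on[server] else "ERROR"
    print(f"boot {boot}: asked {server} -> {state}")
# boot 1: asked glance-a -> ACTIVE
# boot 2: asked glance-b -> ERROR
# boot 3: asked glance-a -> ACTIVE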

Serguei Bezverkhi (sbezverk) wrote :

cinder-volume.log collected on the server before applying the proposed fix.

Changed in cinder:
assignee: Steven Dake (sdake) → John Griffith (john-griffith)
Sean McGinnis (sean-mcginnis) wrote :

The actual issue is that the glance backends are not synced, correct? So this isn't so much a bug in Cinder as a proposal to work around configuration issues with retries on the Cinder side.

Serguei Bezverkhi (sbezverk) wrote :

Hi Sean,

Glance allows a configuration in which an image exists on one api server but not on another. The client's logic (nova, cinder, etc.) should then be to try to get the image from each of the configured api servers and to fail only after all of them have been contacted.

Serguei

Michal Dulko (michal-dulko-f) wrote :

I disagree with Serguei here. Having non-replicated Glance instances in a single OpenStack deployment seems very non-HA, and I believe it is an abuse of the retries functionality. The correct way to configure Glance for HA is to fully replicate the stores of all instances.

Michal Dulko (michal-dulko-f) wrote :

And why exactly aren't you using a single backend for all Glance instances in Kolla? Always querying all the g-api's to find a certain image seems *very* inefficient. The consistent hash ring and its implementations (it's Swift here, I guess?) were invented exactly to avoid that.

Serguei Bezverkhi (sbezverk) wrote :

Michal, do not get me wrong, I agree with what you are saying. But using the file backend with an api server list is still a VALID and documented configuration, and despite all its inefficiencies it should work. If something does not work as described by the docs, it is a bug and must be fixed.

Michal Dulko (michal-dulko-f) wrote :

Serguei, can you point to Glance docs stating that this is a supported configuration? To me, using glance_api_servers for that purpose renders Keystone's service catalog useless, so I'm surprised to hear this is in the official docs.

Serguei Bezverkhi (sbezverk) wrote :

Here are the configuration lines from cinder.conf (liberty). I do not think they would be here if this were not official, right? And the plural "servers" indicates that there can be more than one.

# A list of the glance API servers available to cinder
# ([hostname|ip]:port) (list value)
#glance_api_servers=$glance_host:$glance_port
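
For illustration only, a multi-server value might look like this (hostnames hypothetical):

glance_api_servers = controller1:9292,controller2:9292,controller3:9292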

I posted a question on glance IRC asking them to confirm and will let you know as soon as I hear back.

Serguei Bezverkhi (sbezverk) wrote :

I got confirmation from the glance folks that in this admittedly weird scenario, each controller using file as the glance backend has no knowledge of the other controllers or of which images they hold. It is the responsibility of a client process to walk the provided api server list and check each one for the image's presence. Basically, if this fix is not done, we will have to publish a caveat somewhere that the glance file backend is not supported in a multinode setup.

Serguei Bezverkhi (sbezverk) wrote :

All, after some additional discussions with the glance folks, the agreement was to publish a doc patch stating that the multi-node scenario does not support file as a back-end for glance.

Michal Dulko (michal-dulko-f) wrote :

Serguei: Does this render this bug invalid?

Changed in cinder:
status: In Progress → Won't Fix
Steven Dake (sdake) wrote :

Michal,

The bug is still valid. The code as it stands is defective: if an image is not found for whatever reason, only the first api server is examined. See the review:
https://review.openstack.org/306756

Changed in cinder:
status: Won't Fix → In Progress
GrzegorzKoper (grzegorz-koper) wrote :

Hello,
This is exactly the situation I faced when I deployed a multinode topology. Whenever you try to boot an instance from an image, it uses round robin to connect to the glance API and fails to deploy if the image is not found on local storage.

GrzegorzKoper (grzegorz-koper) wrote :

Sorry for the double post, but is there a related bug open in glance?
Even if we move the glance backend to external swift (to eliminate this issue), the situation forces us to use glance-cache, and then we hit the same problems with the round-robin mechanism: glance-cache-manage and glance-cache-prefetcher not finding the images :/

Change abandoned by John Griffith (<email address hidden>) on branch: master
Review: https://review.openstack.org/306756

I doubt this is a problem in Cinder. Glance will fail on its own in a multinode setup if you don't specify an HA configuration with a proper shared storage backend like ceph or nfs. Just try to store/fetch images to get the same error.

Unassigning due to no activity for > 6 months.

Changed in cinder:
assignee: John Griffith (john-griffith) → nobody
Changed in cinder:
status: In Progress → New
Changed in cinder:
assignee: nobody → Sean McGinnis (sean-mcginnis)
status: New → In Progress

Change abandoned by Sean McGinnis (<email address hidden>) on branch: master
Review: https://review.openstack.org/306756

Sean McGinnis (sean-mcginnis) wrote :

Consensus appears to be that this is something Glance, or more accurately python-glanceclient, should handle, not every consuming project.

Changed in cinder:
status: In Progress → Invalid