Glance HA Kilo: clustered glance image downloads fail 2/3 of the time

Bug #1456379 reported by Chad Smith
This bug affects 2 people
Affects: glance (Juju Charms Collection)
Status: Fix Released
Importance: Critical
Assigned to: James Page

Bug Description

In an HA OpenStack (Kilo) Autopilot deployment with 4 physical machines and ceph object and ceph block storage under glance, we have 3 glance units deployed, a VIP allocated for glance, and the glance units related via hacluster.

There is a very high incidence of glance being deployed in a broken state that results in 404s for image downloads roughly 2/3 of the time, whether using the glance command-line client via "glance image-download <UUID>" or during instance creation in Horizon. Horizon fails with a generic "No host found" error, which can be mapped to a glance "Image not found <UUID>" entry in nova-cloud-controller:/var/log/nova/nova-conductor.log.

python-glanceclient responds with:
glance image-download a6a1271d-5a5b-4c6d-852f-226d4d9b9e4c > download
<html>
 <head>
  <title>404 Not Found</title>
 </head>
 <body>
  <h1>404 Not Found</h1>
  Image /var/lib/glance/images/a6a1271d-5a5b-4c6d-852f-226d4d9b9e4c not found<br /><br />

Since our HA cluster sets up the 3 units behind a VIP, failing 2/3 of the time leads me to believe we have a split-brain problem: either the images were uploaded to the default local filestore before the HA cluster was set up, or some units are not participating in the cluster and are not writing object data to ceph.

OSA uploads two images to the OpenStack cloud, and I have found the image files on either 1 or 2 separate glance units in the default filestore /var/lib/glance/images/, depending on the timing of the glance interaction and which unit responded first during the deployment.

juju run --service glance 'ls /var/lib/glance/images'
- MachineId: 0/lxc/2
  Stdout: ""
  UnitId: glance/0
- MachineId: 2/lxc/7
  Stdout: |
    ef873813-4c86-45c8-b6db-5eeca2de2cd7
  UnitId: glance/1
- MachineId: 3/lxc/1
  Stdout: |
    a6a1271d-5a5b-4c6d-852f-226d4d9b9e4c
  UnitId: glance/2

Connections to the shared mysql VIP are defined identically in each unit's /etc/glance/glance-registry.conf, so "glance image-show" always succeeds (the image metadata is visible). Only the download attempts fail with a 404.
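One way to confirm the split-brain theory is to check which backend each unit's glance-api.conf actually configures. A minimal sketch, assuming the Kilo-era layout where glance reads default_store from the [DEFAULT] section ("rbd" means ceph, "file" means local disk); the helper name is mine, not part of the charm:

```python
import configparser

def glance_backend(conf_text: str) -> str:
    """Return the configured default store: 'rbd' means ceph, 'file' means local disk."""
    cp = configparser.ConfigParser()
    cp.read_string(conf_text)
    # Kilo-era glance reads default_store from [DEFAULT]; 'file' is the built-in fallback.
    return cp.get("DEFAULT", "default_store", fallback="file")

# Example: a unit still on the local filestore vs. one switched over to ceph.
local_unit = "[DEFAULT]\ndefault_store = file\n"
ceph_unit = "[DEFAULT]\ndefault_store = rbd\nrbd_store_pool = glance\n"
print(glance_backend(local_unit))  # file
print(glance_backend(ceph_unit))   # rbd
```

Run against each unit's /etc/glance/glance-api.conf (e.g. via "juju run"), any unit still reporting "file" would explain images landing in /var/lib/glance/images/ instead of ceph.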

Below are the debug logs from glance-api.log and glance-registry.log for one of the 404'd glance image-download attempts.

# Glance-api.log
https://pastebin.canonical.com/131636/

# glance-registry
https://pastebin.canonical.com/131637/

Related branches

Revision history for this message
Chad Smith (chad.smith) wrote :
description: updated
Revision history for this message
Chad Smith (chad.smith) wrote :

Making pastes more widely visible

# /var/log/glance/glance-api.log
http://paste.ubuntu.com/11227463/

# /var/log/glance/glance-registry.log

http://paste.ubuntu.com/11227493/

Revision history for this message
James Page (james-page) wrote :

In a ceph backend deployment, it's possible that the glance service will go through a phase of using local disk storage before the ceph configuration is complete.

Any images uploaded during this period will only be accessible on the unit that received them, not generally across the cluster.

The answer is to determine when glance is using ceph - looking into how that might work.

Revision history for this message
Adam Collard (adam-collard) wrote :

> In a ceph backend deployment, its possible that the glance service will go through a phase of using local disk storage prior to
> ceph configuration being complete.

I adjusted Landscape to make it wait for relations to finish settling down before uploading to Glance, and still had the issue.

Further, after Landscape had "finished" deploying the cloud and I had seen the failure above, I uploaded cirros to Glance (via Horizon) and started an instance, but hit the same intermittent failures when attempting to download the image from the glance CLI.

Revision history for this message
Adam Collard (adam-collard) wrote :

Hmm, so it seems like glance-api.conf was edited some time after the glance-api service was last started:

root@juju-machine-0-lxc-0:~# stat /etc/glance/glance-api.conf | grep Modify
Modify: 2015-06-02 12:26:53.930066434 +0000
root@juju-machine-0-lxc-0:~# for pid in $(pgrep -f glance-api); do stat /proc/$pid | grep Modify; done
Modify: 2015-06-02 12:25:57.682069074 +0000
Modify: 2015-06-02 12:26:03.066068825 +0000
Modify: 2015-06-02 12:26:03.066068825 +0000
Modify: 2015-06-02 12:26:03.066068825 +0000
Modify: 2015-06-02 12:26:03.066068825 +0000
Modify: 2015-06-02 12:26:03.066068825 +0000
Modify: 2015-06-02 12:26:03.066068825 +0000
Modify: 2015-06-02 12:26:03.066068825 +0000
Modify: 2015-06-02 12:26:03.066068825 +0000
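The comparison above can be made mechanical: the service is running stale config if the file's mtime is later than every worker's start time (approximated here, as in the shell commands above, by the /proc/<pid> mtime). A minimal illustration using the timestamps from the stat output; the helper is hypothetical, not part of the charm:

```python
from datetime import datetime, timezone

def needs_restart(conf_mtime: datetime, proc_mtimes: list) -> bool:
    """True if the config file changed after every worker process started,
    i.e. the running service has never read the current config."""
    return bool(proc_mtimes) and conf_mtime > max(proc_mtimes)

def ts(s: str) -> datetime:
    return datetime.strptime(s, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)

# Timestamps from the stat output above:
conf = ts("2015-06-02 12:26:53")
workers = [ts("2015-06-02 12:25:57")] + [ts("2015-06-02 12:26:03")] * 8
print(needs_restart(conf, workers))  # True: glance-api is running with stale config
```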

Revision history for this message
Adam Collard (adam-collard) wrote :

Diving into this further, I see that the faulty units all sent a request to the Ceph broker /after/ receiving a successful response. glance-api is restarted when a request succeeds.

$ juju run --service glance "grep 'Ceph broker' /var/log/juju/unit-glance*"
- MachineId: 0/lxc/0
  Stdout: |
    2015-06-02 12:20:03 INFO unit.glance/0.juju-log cmd.go:247 ceph:51: Request(s) sent to Ceph broker (rid=ceph:51)
    2015-06-02 12:22:58 INFO unit.glance/0.juju-log cmd.go:247 ceph:51: Request(s) sent to Ceph broker (rid=ceph:51)
    2015-06-02 12:24:10 INFO unit.glance/0.juju-log cmd.go:247 ceph:51: Ceph broker request succeeded (rc=0, msg=None)
    2015-06-02 12:24:59 INFO unit.glance/0.juju-log cmd.go:247 ceph:51: Request(s) sent to Ceph broker (rid=ceph:51)
  UnitId: glance/0
- MachineId: 2/lxc/4
  Stdout: |
    2015-06-02 12:22:51 INFO unit.glance/1.juju-log cmd.go:247 ceph:51: Request(s) sent to Ceph broker (rid=ceph:51)
    2015-06-02 12:24:37 INFO unit.glance/1.juju-log cmd.go:247 ceph:51: Ceph broker request succeeded (rc=0, msg=None)
    2015-06-02 12:25:15 INFO unit.glance/1.juju-log cmd.go:247 ceph:51: Request(s) sent to Ceph broker (rid=ceph:51)
  UnitId: glance/1
- MachineId: 3/lxc/6
  Stdout: |
    2015-06-02 12:20:25 INFO unit.glance/2.juju-log cmd.go:247 ceph:51: Request(s) sent to Ceph broker (rid=ceph:51)
    2015-06-02 12:22:31 INFO unit.glance/2.juju-log cmd.go:247 ceph:51: Request(s) sent to Ceph broker (rid=ceph:51)
    2015-06-02 12:24:12 INFO unit.glance/2.juju-log cmd.go:247 ceph:51: Request(s) sent to Ceph broker (rid=ceph:51)
    2015-06-02 12:24:32 INFO unit.glance/2.juju-log cmd.go:247 ceph:51: Ceph broker request succeeded (rc=0, msg=None)
  UnitId: glance/2
$ juju run --service glance 'ls /var/lib/glance/images'
- MachineId: 0/lxc/0
  Stdout: |
    32497170-748d-4a85-ba3d-f1c5f34eb3ed
    ccd9d8fe-72e1-4f03-95c5-81629a5e5472
  UnitId: glance/0
- MachineId: 2/lxc/4
  Stdout: |
    e73fa89b-bb54-46a7-b347-b7d04344fe0d
  UnitId: glance/1
- MachineId: 3/lxc/6
  Stdout: ""
  UnitId: glance/2
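The pattern described above can be checked mechanically: a unit is suspect if a "Request(s) sent" line appears after its last "request succeeded" line, since the restart triggered by the success would then predate a further broker round-trip. A minimal sketch over the log lines shown above (the function name is mine; only the two marker strings from the actual log are assumed):

```python
def sent_after_success(log: str) -> bool:
    """True if a 'Request(s) sent' line appears after the last
    'request succeeded' line in a unit's Ceph broker log excerpt."""
    success_at = sent_at = None
    for i, line in enumerate(log.splitlines()):
        if "Ceph broker request succeeded" in line:
            success_at = i
        elif "Request(s) sent to Ceph broker" in line:
            sent_at = i
    return success_at is not None and sent_at is not None and sent_at > success_at

# Excerpts matching glance/0 (faulty) and glance/2 (healthy) above:
glance0 = """\
2015-06-02 12:20:03 ... Request(s) sent to Ceph broker (rid=ceph:51)
2015-06-02 12:24:10 ... Ceph broker request succeeded (rc=0, msg=None)
2015-06-02 12:24:59 ... Request(s) sent to Ceph broker (rid=ceph:51)"""
glance2 = """\
2015-06-02 12:24:12 ... Request(s) sent to Ceph broker (rid=ceph:51)
2015-06-02 12:24:32 ... Ceph broker request succeeded (rc=0, msg=None)"""
print(sent_after_success(glance0))  # True  -> faulty unit
print(sent_after_success(glance2))  # False -> healthy unit
```

Applied to the three excerpts above, this flags glance/0 and glance/1 (the two units holding images in the local filestore) and clears glance/2.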

Changed in glance (Juju Charms Collection):
importance: Undecided → High
Revision history for this message
James Page (james-page) wrote :

OK - so glance backends are bust under kilo - not captured during testing - fixing now.

Changed in glance (Juju Charms Collection):
importance: High → Critical
milestone: none → 15.07
Changed in glance (Juju Charms Collection):
status: New → In Progress
David Britton (dpb)
Changed in glance (Juju Charms Collection):
assignee: nobody → James Page (james-page)
Revision history for this message
James Page (james-page) wrote :

Fix pushed to next and stable branches

Changed in glance (Juju Charms Collection):
status: In Progress → Fix Released