Glance HA Kilo: clustered glance image downloads fail 2/3 of the time

Bug #1456379 reported by Chad Smith
This bug affects 2 people
Affects: glance (Juju Charms Collection)
Status: Fix Released
Importance: Critical
Assigned to: James Page

Bug Description

In an HA OpenStack (Kilo) Autopilot deployment with 4 physical machines and ceph object and ceph block storage under glance, we have 3 glance units deployed, a VIP allocated for glance, and the glance units related via hacluster.

There is a very high incidence of glance being deployed in a broken state that results in 404s for image downloads roughly 2/3 of the time, whether using the glance command-line client via "glance image-download <UUID>" or during instance creation in Horizon. Horizon fails with a generic "No host found" error, which can be mapped to a glance "Image not found <UUID>" entry in nova-cloud-controller:/var/log/nova/nova-conductor.log.

python-glanceclient responds with:
glance image-download a6a1271d-5a5b-4c6d-852f-226d4d9b9e4c > download
<html>
 <head>
  <title>404 Not Found</title>
 </head>
 <body>
  <h1>404 Not Found</h1>
  Image /var/lib/glance/images/a6a1271d-5a5b-4c6d-852f-226d4d9b9e4c not found<br /><br />

Since our HA cluster sets up the 3 units behind a VIP, failing 2/3 of the time leads me to believe we have a split-brain problem: either the images were uploaded to the default local filestore before the HA cluster was set up, or some units are not participating in the cluster and are not writing object data to ceph.

OSA uploads two images to the OpenStack cloud, and I have found the image files on either 1 or 2 separate glance units in the default filestore /var/lib/glance/images/, depending on the timing of the glance interaction and which unit responded first during the deployment.

juju run --service glance 'ls /var/lib/glance/images'
- MachineId: 0/lxc/2
  Stdout: ""
  UnitId: glance/0
- MachineId: 2/lxc/7
  Stdout: |
    ef873813-4c86-45c8-b6db-5eeca2de2cd7
  UnitId: glance/1
- MachineId: 3/lxc/1
  Stdout: |
    a6a1271d-5a5b-4c6d-852f-226d4d9b9e4c
  UnitId: glance/2

Connections to the shared mysql VIP are defined identically in each unit's /etc/glance/glance-registry.conf, so "glance image-show" always succeeds (the image metadata is visible). Only the download attempts fail with a 404.
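One way to confirm the split-brain theory is to check which backend each unit's glance-api.conf actually configures. A minimal sketch, assuming the Kilo-era layout where glance reads default_store from the [DEFAULT] section ("rbd" means ceph, "file" means local disk); the helper name is mine, not part of the charm:

```python
import configparser

def glance_backend(conf_text: str) -> str:
    """Return the configured default store: 'rbd' means ceph, 'file' means local disk."""
    cp = configparser.ConfigParser()
    cp.read_string(conf_text)
    # Kilo-era glance reads default_store from [DEFAULT]; 'file' is the built-in fallback.
    return cp.get("DEFAULT", "default_store", fallback="file")

# Example: a unit still on the local filestore vs. one switched over to ceph.
local_unit = "[DEFAULT]\ndefault_store = file\n"
ceph_unit = "[DEFAULT]\ndefault_store = rbd\nrbd_store_pool = glance\n"
print(glance_backend(local_unit))  # file
print(glance_backend(ceph_unit))   # rbd
```

Run against each unit's /etc/glance/glance-api.conf (e.g. via "juju run"), any unit still reporting "file" would explain images landing in /var/lib/glance/images/ instead of ceph.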

Below are the debug logs from glance-api.log and glance-registry.log for one of the 404'd glance image-download attempts.

# Glance-api.log
https://pastebin.canonical.com/131636/

# glance-registry
https://pastebin.canonical.com/131637/

Related branches

Revision history for this message
Chad Smith (chad.smith) wrote :
description: updated
Revision history for this message
Chad Smith (chad.smith) wrote :

Making pastes more widely visible

# /var/log/glance/glance-api.log
http://paste.ubuntu.com/11227463/

# /var/log/glance/glance-registry.log

http://paste.ubuntu.com/11227493/

Revision history for this message
James Page (james-page) wrote :

In a ceph backend deployment, it's possible that the glance service will go through a phase of using local disk storage before the ceph configuration is complete.

Any images uploaded during this period will only be accessible on the unit that received them, not generally across the cluster.

The answer is to determine when glance is using ceph - looking into how that might work.

Revision history for this message
Adam Collard (adam-collard) wrote :

> In a ceph backend deployment, its possible that the glance service will go through a phase of using local disk storage prior to
> ceph configuration being complete.

I adjusted Landscape to make it wait for relations to finish settling down before uploading to Glance, and still had the issue.

Further, after Landscape had "finished" deploying the cloud and I had seen the failure above, I uploaded cirros to Glance (via Horizon) and started an instance, but hit the same intermittent failures when attempting to download the image from the glance CLI.

Revision history for this message
Adam Collard (adam-collard) wrote :

Hmm, so it seems like glance-api.conf was edited some time after the glance-api service was last started:

root@juju-machine-0-lxc-0:~# stat /etc/glance/glance-api.conf | grep Modify
Modify: 2015-06-02 12:26:53.930066434 +0000
root@juju-machine-0-lxc-0:~# for pid in $(pgrep -f glance-api); do stat /proc/$pid | grep Modify; done
Modify: 2015-06-02 12:25:57.682069074 +0000
Modify: 2015-06-02 12:26:03.066068825 +0000
Modify: 2015-06-02 12:26:03.066068825 +0000
Modify: 2015-06-02 12:26:03.066068825 +0000
Modify: 2015-06-02 12:26:03.066068825 +0000
Modify: 2015-06-02 12:26:03.066068825 +0000
Modify: 2015-06-02 12:26:03.066068825 +0000
Modify: 2015-06-02 12:26:03.066068825 +0000
Modify: 2015-06-02 12:26:03.066068825 +0000
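The comparison above can be made mechanical: the service is running stale config if the file's mtime is later than every worker's start time (approximated here, as in the shell commands above, by the /proc/<pid> mtime). A minimal illustration using the timestamps from the stat output; the helper is hypothetical, not part of the charm:

```python
from datetime import datetime, timezone

def needs_restart(conf_mtime: datetime, proc_mtimes: list) -> bool:
    """True if the config file changed after every worker process started,
    i.e. the running service has never read the current config."""
    return bool(proc_mtimes) and conf_mtime > max(proc_mtimes)

def ts(s: str) -> datetime:
    return datetime.strptime(s, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)

# Timestamps from the stat output above:
conf = ts("2015-06-02 12:26:53")
workers = [ts("2015-06-02 12:25:57")] + [ts("2015-06-02 12:26:03")] * 8
print(needs_restart(conf, workers))  # True: glance-api is running with stale config
```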

Revision history for this message
Adam Collard (adam-collard) wrote :

Diving into this further, I see that the faulty units all sent a request to the Ceph broker /after/ receiving a successful response. glance-api is restarted when a request succeeds.

$ juju run --service glance "grep 'Ceph broker' /var/log/juju/unit-glance*"
- MachineId: 0/lxc/0
  Stdout: |
    2015-06-02 12:20:03 INFO unit.glance/0.juju-log cmd.go:247 ceph:51: Request(s) sent to Ceph broker (rid=ceph:51)
    2015-06-02 12:22:58 INFO unit.glance/0.juju-log cmd.go:247 ceph:51: Request(s) sent to Ceph broker (rid=ceph:51)
    2015-06-02 12:24:10 INFO unit.glance/0.juju-log cmd.go:247 ceph:51: Ceph broker request succeeded (rc=0, msg=None)
    2015-06-02 12:24:59 INFO unit.glance/0.juju-log cmd.go:247 ceph:51: Request(s) sent to Ceph broker (rid=ceph:51)
  UnitId: glance/0
- MachineId: 2/lxc/4
  Stdout: |
    2015-06-02 12:22:51 INFO unit.glance/1.juju-log cmd.go:247 ceph:51: Request(s) sent to Ceph broker (rid=ceph:51)
    2015-06-02 12:24:37 INFO unit.glance/1.juju-log cmd.go:247 ceph:51: Ceph broker request succeeded (rc=0, msg=None)
    2015-06-02 12:25:15 INFO unit.glance/1.juju-log cmd.go:247 ceph:51: Request(s) sent to Ceph broker (rid=ceph:51)
  UnitId: glance/1
- MachineId: 3/lxc/6
  Stdout: |
    2015-06-02 12:20:25 INFO unit.glance/2.juju-log cmd.go:247 ceph:51: Request(s) sent to Ceph broker (rid=ceph:51)
    2015-06-02 12:22:31 INFO unit.glance/2.juju-log cmd.go:247 ceph:51: Request(s) sent to Ceph broker (rid=ceph:51)
    2015-06-02 12:24:12 INFO unit.glance/2.juju-log cmd.go:247 ceph:51: Request(s) sent to Ceph broker (rid=ceph:51)
    2015-06-02 12:24:32 INFO unit.glance/2.juju-log cmd.go:247 ceph:51: Ceph broker request succeeded (rc=0, msg=None)
  UnitId: glance/2
$ juju run --service glance 'ls /var/lib/glance/images'
- MachineId: 0/lxc/0
  Stdout: |
    32497170-748d-4a85-ba3d-f1c5f34eb3ed
    ccd9d8fe-72e1-4f03-95c5-81629a5e5472
  UnitId: glance/0
- MachineId: 2/lxc/4
  Stdout: |
    e73fa89b-bb54-46a7-b347-b7d04344fe0d
  UnitId: glance/1
- MachineId: 3/lxc/6
  Stdout: ""
  UnitId: glance/2
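The pattern described above can be checked mechanically: a unit is suspect if a "Request(s) sent" line appears after its last "request succeeded" line, since the restart triggered by the success would then predate a further broker round-trip. A minimal sketch over the log lines shown above (the function name is mine; only the two marker strings from the actual log are assumed):

```python
def sent_after_success(log: str) -> bool:
    """True if a 'Request(s) sent' line appears after the last
    'request succeeded' line in a unit's Ceph broker log excerpt."""
    success_at = sent_at = None
    for i, line in enumerate(log.splitlines()):
        if "Ceph broker request succeeded" in line:
            success_at = i
        elif "Request(s) sent to Ceph broker" in line:
            sent_at = i
    return success_at is not None and sent_at is not None and sent_at > success_at

# Excerpts matching glance/0 (faulty) and glance/2 (healthy) above:
glance0 = """\
2015-06-02 12:20:03 ... Request(s) sent to Ceph broker (rid=ceph:51)
2015-06-02 12:24:10 ... Ceph broker request succeeded (rc=0, msg=None)
2015-06-02 12:24:59 ... Request(s) sent to Ceph broker (rid=ceph:51)"""
glance2 = """\
2015-06-02 12:24:12 ... Request(s) sent to Ceph broker (rid=ceph:51)
2015-06-02 12:24:32 ... Ceph broker request succeeded (rc=0, msg=None)"""
print(sent_after_success(glance0))  # True  -> faulty unit
print(sent_after_success(glance2))  # False -> healthy unit
```

Applied to the three excerpts above, this flags glance/0 and glance/1 (the two units holding images in the local filestore) and clears glance/2.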

Changed in glance (Juju Charms Collection):
importance: Undecided → High
Revision history for this message
James Page (james-page) wrote :

OK - so glance backends are bust under kilo - not captured during testing - fixing now.

Changed in glance (Juju Charms Collection):
importance: High → Critical
milestone: none → 15.07
Changed in glance (Juju Charms Collection):
status: New → In Progress
David Britton (dpb)
Changed in glance (Juju Charms Collection):
assignee: nobody → James Page (james-page)
Revision history for this message
James Page (james-page) wrote :

Fix pushed to next and stable branches

Changed in glance (Juju Charms Collection):
status: In Progress → Fix Released