glance restarted during image upload, image stuck in "saving" state

Bug #1583276 reported by Andreas Hasenack on 2016-05-18
This bug affects 1 person
Affects                            Importance   Assigned to     Milestone
Glance - Simplestreams Sync Charm  Undecided    Данило Шеган
Landscape Server                   Medium       Данило Шеган    16.06
simplestreams                      Medium       Данило Шеган
simplestreams (Ubuntu)             Undecided    Unassigned
  Xenial                           Undecided    Unassigned
  Bionic                           Undecided    Unassigned
  Disco                            Undecided    Unassigned

Bug Description

[Impact]

CI run: https://ci.lscape.net/job/landscape-system-tests/2451/

The gs3 log is attached. Some excerpts below:

It looks like the problem started when gs3 uploaded the first image (14.04) to glance. That upload failed:
WARNING * 05-18 13:36:48 [PID:10259] * root * No rabbitmq connection available for msg{'status': 'Syncing', 'message': 'ubuntu-trusty-14.04-amd64-server-20160516 99%\n
DEBUG * 05-18 13:36:50 [PID:10259] * glanceclient.common.http * curl -i -X POST -H 'x-image-meta-property-source_content_id: com.ubuntu.cloud:released:download' -H '
INFO * 05-18 13:37:01 [PID:10364] * root * glance-simplestreams-sync started.
INFO * 05-18 13:37:01 [PID:10364] * root * /var/run/glance-simplestreams-sync.pid is locked, exiting
ERROR * 05-18 13:37:07 [PID:10259] * root * Glance Client exception during do_sync:
Error communicating with http://10.96.10.146:9292 [Errno 32] Broken pipe
Will continue polling.
Traceback (most recent call last):
  File "/usr/share/glance-simplestreams-sync/glance-simplestreams-sync.py", line 444, in main
    do_sync(charm_conf, status_exchange)
  File "/usr/share/glance-simplestreams-sync/glance-simplestreams-sync.py", line 215, in do_sync
    tmirror.sync(smirror, path=initial_path)
  File "/usr/lib/python2.7/dist-packages/simplestreams/mirrors/__init__.py", line 91, in sync
    return self.sync_index(reader, path, data, content)
  File "/usr/lib/python2.7/dist-packages/simplestreams/mirrors/__init__.py", line 254, in sync_index
    self.sync(reader, path=epath)
  File "/usr/lib/python2.7/dist-packages/simplestreams/mirrors/__init__.py", line 89, in sync
    return self.sync_products(reader, path, data, content)
  File "/usr/lib/python2.7/dist-packages/simplestreams/mirrors/__init__.py", line 341, in sync_products
    self.insert_item(item, src, target, pgree, ipath_cs)
  File "/usr/lib/python2.7/dist-packages/simplestreams/mirrors/glance.py", line 303, in insert_item
    ret = self.gclient.images.create(**create_kwargs)
  File "/usr/lib/python2.7/dist-packages/glanceclient/v1/images.py", line 253, in create
    'POST', '/v1/images', headers=hdrs, body=image_data)
  File "/usr/lib/python2.7/dist-packages/glanceclient/common/http.py", line 289, in raw_request
    return self._http_request(url, method, **kwargs)
  File "/usr/lib/python2.7/dist-packages/glanceclient/common/http.py", line 235, in _http_request
    raise exc.CommunicationError(message=message)
CommunicationError: Error communicating with http://10.96.10.146:9292 [Errno 32] Broken pipe
INFO * 05-18 13:37:07 [PID:10259] * root * sync done.
INFO * 05-18 13:38:01 [PID:10372] * root * glance-simplestreams-sync started.

[Test Case]

Found evidence in the juju logs for glance/0 that it restarted right at 13:37:07:
2016-05-18 13:37:07 INFO ceph-relation-changed Reading state information... Done
2016-05-18 13:37:07 INFO juju-log ceph:50: Loaded template from templates/ceph.conf
2016-05-18 13:37:07 INFO juju-log ceph:50: Rendering from template: ceph.conf
2016-05-18 13:37:07 INFO juju-log ceph:50: Wrote template /var/lib/charm/glance/ceph.conf.
2016-05-18 13:37:07 INFO ceph-relation-changed glance-api stop/waiting
2016-05-18 13:37:07 INFO ceph-relation-changed glance-api start/running, process 62839

glance/1's last restart was earlier, at 13:32:01:
2016-05-18 13:32:01 INFO ceph-relation-changed glance-api start/running, process 31045

glance/2's last restart was around that same time:
2016-05-18 13:32:00 INFO ceph-relation-changed glance-api start/running, process 30584

In gs3, a few log entries later, we can see that 14.04 is in state "saving" in glance:
(...)
{"images": [{"status": "saving", "deleted_at": null, "name": "auto-sync/ubuntu-trusty-14.04-amd64-server-20160516-disk1.img", "deleted": false, "container_format": "bare
(...)

It remains in this state throughout the logs.

gs3 then proceeds to download 16.04, upload it to glance, and finally publish the streams. It is unknown whether 14.04 is part of this publication.

[Regression Potential]

This section was added specifically for the Xenial backport, which includes:

- 433-glance-ignore-inactive-images.patch (LP: #1583276)

The chances of regression are small, based on the MR feedback from SEG and on the fact that this bug already states the issue was fixed.

[Other Info]

Landscape, however, fails because it doesn't find enough "active" images:
May 18 14:33:29 job-handler-1 INFO RetryingCall for '_wait_for_active_image' failed, will try again.
May 18 14:33:29 job-handler-1 INFO Traceback: <class 'canonical.landscape.model.openstack.jobs.errors.NoActiveImageAvailable'>:
/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:393:callback
/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:501:_startRunCallbacks
/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:588:_runCallbacks
/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:1184:gotResult
--- <exception caught here> ---
/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:1128:_inlineCallbacks
/opt/canonical/landscape/canonical/landscape/model/openstack/jobs/images.py:48:_wait_for_active_image

Andreas Hasenack (ahasenack) wrote :

Log showing the glance restart

tags: removed: kanban
description: updated

Note that the upstream charm doesn't exactly match our version here: it doesn't catch ConnectionError, and instead removes the per-minute cronjob on that failure. Chad fixed this in our version of the gs3 charm.

This would not change the eventual failure when the script is re-run as part of the daily cronjob: it would still never move the 14.04 image from "saving" to "active", just a bit later.

My high-level take on this is that something other than the charm is misbehaving. Most likely glance needs to allow "atomic" image uploads, which would not get into these inconsistent states, or python-simplestreams needs to handle this better and not treat such an image as uploaded. Perhaps we can overwrite any non-active images on every run; I'll look into that approach.

Changed in glance-simplestreams-sync-charm:
assignee: nobody → Данило Шеган (danilo)

FWIW, the script in glance-simplestreams-sync charm only does the following:

  tmirror.sync(smirror, path=initial_path)

with tmirror being a glance.GlanceMirror() from simplestreams.mirrors package. So, the fix will likely have to go there.

trusty is using python-glanceclient 0.12, whereas xenial has 2.0.0. The relevant code has changed significantly between versions; I'll try reproducing this with the latest version, just to exclude any potential fixes there, before I delve deeper.

(We are using python-simplestreams almost-trunk now, since we pull it from our PPA.)

This is easily reproducible, and it behaves the same with python-glanceclient 2.0.0 from xenial.

If I kick off the glance-simplestreams-sync.py script and restart glance-api right after a message along the lines of

  DEBUG * 05-19 09:16:22 [PID:36034] * glanceclient.common.http * curl -g -i -X POST -H 'x-image-meta-property-source_content_id: com.ubuntu.cloud:released:download' -H 'x-image-meta-property-version_name: 20160516' -H 'x-image-meta-property-os_version: 14.04' -H 'x-image-meta-checksum: e903d04033149e800fc90400dee1abb8' -H 'x-image-meta-container_format: bare' -H 'x-image-meta-property-content_id: auto.sync' -H 'x-image-meta-property-architecture: x86_64' -H 'Accept: */*' -H 'X-Auth-Token: {SHA1}906818b8a051308fa657b016479038d65e086b8c' -H 'x-image-meta-property-os_distro: ubuntu' -H 'x-image-meta-property-item_name: disk1.img' -H 'x-image-meta-property-product_name: com.ubuntu.cloud:server:14.04:amd64' -H 'Connection: keep-alive' -H 'Accept-Encoding: gzip, deflate' -H 'x-image-meta-is_public: True' -H 'User-Agent: python-glanceclient' -H 'x-image-meta-size: 259850752' -H 'Content-Type: application/octet-stream' -H 'x-image-meta-disk_format: qcow2' -H 'x-image-meta-name: auto-sync/ubuntu-trusty-14.04-amd64-server-20160516-disk1.img' http://10.44.203.31:9292/v1/images

shows up in /var/log/glance-simplestreams-sync.log, I get the above CommunicationError, and the image is left over in a "saving" state (also visible in the "Images" tab of the OpenStack Dashboard).

Re-running the glance-simplestreams-sync script doesn't "fix" this particular image.

Now, let's see what could be done:

 1. Ideally, glance itself would drop an image that has not been fully uploaded (the _upload_and_activate method in glance/api/v1/images.py attempts to do that, but I am guessing the service restart is not that graceful and just outright kills the process).
 2. python-simplestreams could treat images with "status" != "active" as incomplete, and re-download and re-publish them.
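The idea in item 2 can be sketched in a few lines: when the mirror lists the existing glance images, drop any whose status is not "active", so the sync logic treats them as missing and re-uploads them. This is only an illustration of the approach, not the actual simplestreams patch; the image dicts mirror the JSON visible in the logs above.

```python
def filter_active(images):
    """Keep only fully uploaded ("active") images; anything still in
    "saving" or "queued" is treated as if it were never uploaded, so
    the sync will re-download and re-publish it."""
    return [img for img in images if img.get("status") == "active"]


# Example: the state seen in this bug, one stuck image and one good one.
existing = [
    {"name": "auto-sync/ubuntu-trusty-14.04-amd64-server-20160516-disk1.img",
     "status": "saving"},
    {"name": "auto-sync/ubuntu-xenial-16.04-amd64-server-20160516.1-disk1.img",
     "status": "active"},
]

active = filter_active(existing)
# Only the xenial image survives the filter; trusty would be re-synced.
```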

I've tested a patch implementing option 2 (https://pastebin.canonical.com/156799/): when re-running the gs3 script, the "saving" image remains there, but a correct image matching the "saving" one appears as an "active" one (http://people.canonical.com/~danilo/tmp/images-saving+active.png). I'll propose this for inclusion in python-simplestreams, but not before I do another round of testing and ensure juju bootstrap still works even with such incomplete images potentially listed in the index file (I hope they aren't, though).

Non-active images are not listed in the generated index.json (well, auto.sync.json that's referenced from index.json), so juju bootstrap works fine.

Andreas Hasenack (ahasenack) wrote :

Is it risky to delete an image that is in the "saving" state? I guess one problem would be how to be sure it's "our" image, from our interrupted upload?
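One way to gain confidence that a stale non-active image is "ours" before deleting it would be to match the simplestreams-set properties and the expected checksum. The property names below (content_id, source_content_id) come from the x-image-meta-* headers visible in the curl lines in this log; the helper itself is hypothetical, and the actual delete call is deliberately left out.

```python
def is_our_stale_image(image, expected_checksum):
    """Return True if this non-active image carries our sync markers
    and matches the checksum of the item we tried to upload; only such
    images would be candidates for deletion."""
    props = image.get("properties", {})
    return (
        image.get("status") != "active"
        and props.get("content_id") == "auto.sync"
        and props.get("source_content_id") == "com.ubuntu.cloud:released:download"
        and image.get("checksum") == expected_checksum
    )


# Example: the stuck trusty image from the log excerpts above.
stale = {
    "status": "saving",
    "checksum": "e903d04033149e800fc90400dee1abb8",
    "properties": {
        "content_id": "auto.sync",
        "source_content_id": "com.ubuntu.cloud:released:download",
    },
}
print(is_our_stale_image(stale, "e903d04033149e800fc90400dee1abb8"))  # True
```

An image with a different checksum, or without the auto.sync markers, would be left alone.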

Francis Ginther (fginther) wrote :

Hit what I think is another case of this, but this time the first image was left in a "queued" state:

[from glance-simplestreams-sync-0/var/log/glance-simplestreams-sync.log]
{"images": [{"status": "queued", "deleted_at": null, "name": "auto-sync/ubuntu-trusty-14.04-amd64-server-20160516-disk1.img", "deleted": false, "container_format": "bare", "created_at": "2016-05-18T19:12:02.000000", "disk_format": "qcow2", "updated_at": "2016-05-18T19:12:02.000000", "min_disk": 0, "protected": false, "id": "0b69de8b-b024-47e2-aecd-a84e90489562", "min_ram": 0, "checksum": "e903d04033149e800fc90400dee1abb8", "owner": "23b1d952ab4746b4b3990f10142f8a8e", "is_public": true, "virtual_size": null, "properties": {"os_distro": "ubuntu", "item_name": "disk1.img", "os_version": "14.04", "architecture": "x86_64", "version_name": "20160516", "content_id": "auto.sync", "product_name": "com.ubuntu.cloud:server:14.04:amd64", "source_content_id": "com.ubuntu.cloud:released:download"}, "size": 259850752}]}

The log indicates that 16.04 was successfully uploaded and the sync completed with 14.04 in a bad state:

{"image": {"status": "active", "deleted": false, "container_format": "bare", "min_ram": 0, "updated_at": "2016-05-18T19:14:58.000000", "owner": "23b1d952ab4746b4b3990f10142f8a8e", "min_disk": 0, "is_public": true, "deleted_at": null, "id": "03f1dc12-fe60-499c-a582-c65f8cb5a9e8", "size": 303824896, "virtual_size": null, "name": "auto-sync/ubuntu-xenial-16.04-amd64-server-20160516.1-disk1.img", "checksum": "29c4d152758268383323b86208105960", "created_at": "2016-05-18T19:14:43.000000", "disk_format": "qcow2", "properties": {"os_distro": "ubuntu", "item_name": "disk1.img", "os_version": "16.04", "architecture": "x86_64", "version_name": "20160516.1", "content_id": "auto.sync", "product_name": "com.ubuntu.cloud:server:16.04:amd64", "source_content_id": "com.ubuntu.cloud:released:download"}, "protected": false}}
...
INFO * 05-18 19:14:59 [PID:16923] * root * Initial sync attempt done. every-minute cronjob removed.
INFO * 05-18 19:14:59 [PID:16923] * root * sync done.

I've attached the full logs from the glance and gs3 units.

Changed in glance-simplestreams-sync-charm:
status: New → In Progress
Changed in landscape:
status: New → In Progress
assignee: nobody → Данило Шеган (danilo)
Changed in glance-simplestreams-sync-charm:
status: In Progress → Invalid
Changed in simplestreams:
assignee: nobody → Данило Шеган (danilo)
Scott Moser (smoser) on 2016-05-19
Changed in simplestreams:
status: New → Confirmed
importance: Undecided → Medium
Francis Ginther (fginther) wrote :

Another variant of this? This time I'm not seeing a partially uploaded image; instead, it looks like glance restarted during an API call from gs3, which led to an exception without a retry.

[From glance-simplestreams-sync-0/var/log/glance-simplestreams-sync.log]
ClientException: Object GET failed: http://10.245.201.183:80:80/swift/v1/simplestreams/data/streams/v1/auto.sync.json 404 Not Found NoSuchKey
DEBUG * 05-19 16:37:52 [PID:16871] * glanceclient.common.http * curl -i -X GET -H 'X-Auth-Token: 51228ff97043447b94f3f2db09e0c45f' -H 'Content-Type: application/json' -H 'User-Agent: python-glanceclient' http://10.245.201.207:9292/v1/images/detail?limit=20
ERROR * 05-19 16:37:53 [PID:16871] * root * Exception during do_sync
Traceback (most recent call last):
  File "/usr/share/glance-simplestreams-sync/glance-simplestreams-sync.py", line 444, in main
    do_sync(charm_conf, status_exchange)
  File "/usr/share/glance-simplestreams-sync/glance-simplestreams-sync.py", line 215, in do_sync
    tmirror.sync(smirror, path=initial_path)
  File "/usr/lib/python2.7/dist-packages/simplestreams/mirrors/__init__.py", line 91, in sync
    return self.sync_index(reader, path, data, content)
  File "/usr/lib/python2.7/dist-packages/simplestreams/mirrors/__init__.py", line 254, in sync_index
    self.sync(reader, path=epath)
  File "/usr/lib/python2.7/dist-packages/simplestreams/mirrors/__init__.py", line 89, in sync
    return self.sync_products(reader, path, data, content)
  File "/usr/lib/python2.7/dist-packages/simplestreams/mirrors/__init__.py", line 269, in sync_products
    target = self.load_products(path, content_id)
  File "/usr/lib/python2.7/dist-packages/simplestreams/mirrors/glance.py", line 172, in load_products
    for image in images:
  File "/usr/lib/python2.7/dist-packages/glanceclient/v1/images.py", line 174, in paginate
    images = self._list(url, "images")
  File "/usr/lib/python2.7/dist-packages/glanceclient/common/base.py", line 53, in _list
    resp, body = self.api.json_request('GET', url)
  File "/usr/lib/python2.7/dist-packages/glanceclient/common/http.py", line 266, in json_request
    resp, body_iter = self._http_request(url, method, **kwargs)
  File "/usr/lib/python2.7/dist-packages/glanceclient/common/http.py", line 226, in _http_request
    resp = conn.getresponse()
  File "/usr/lib/python2.7/httplib.py", line 1051, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 415, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 379, in _read_status
    raise BadStatusLine(line)
BadStatusLine: ''

And on glance:
[from glance-0/var/log/juju/unit-glance-0.log]
2016-05-19 16:37:54 INFO ceph-relation-changed glance-api stop/waiting
2016-05-19 16:37:54 INFO ceph-relation-changed glance-api start/running, process 27074
2016-05-19 16:37:54 INFO ceph-relation-changed glance-api stop/waiting
2016-05-19 16:37:54 INFO ceph-relation-changed glance-api start/running, process 27093
2016-05-19 16:37:54 INFO ceph-relation-changed glance-registry stop/waiting
2016-05-19 16:37:55 INFO ceph-relation-changed glance-registry start/running, process 27113

That one is slightly different, since it's an exception that is not handled yet. I guess I'll have to hurry landing https://code.launchpad.net/~danilo/charms/trusty/glance-simplestreams-sync/invert-failure-mode/+merge/294635, and even push it against 16.05 (that branch makes the charm script retry on *all* failures, instead of only on a known subset that we have already grown by a couple of exceptions).
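The "retry on all failures" behaviour can be sketched roughly as below. The function name sync_with_retry and the retry loop are illustrative only, not the code in that merge proposal; the point is catching Exception broadly rather than enumerating CommunicationError, BadStatusLine, and friends.

```python
import logging
import time


def sync_with_retry(do_sync, max_attempts=5, delay=60):
    """Keep retrying the sync on *any* exception, instead of only on a
    known subset of error types; re-raise after max_attempts failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return do_sync()
        except Exception:
            # Log the full traceback but keep going until attempts run out.
            logging.exception("sync attempt %d/%d failed, will retry",
                              attempt, max_attempts)
            if attempt == max_attempts:
                raise
            time.sleep(delay)
```

Restart-induced errors like the BadStatusLine above would then just cost one retry instead of aborting the whole sync.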

Changed in simplestreams:
status: Confirmed → Fix Committed

A fixed version of python-simplestreams has been pushed to our PPA, which Autopilot will use on the glance-simplestreams-sync unit, making this Fix Committed (see https://launchpad.net/~landscape/+archive/ubuntu/glance-simplestreams-sync).

Changed in landscape:
importance: Undecided → Medium
status: In Progress → Fix Committed

This bug is believed to be fixed in simplestreams version 0.1.0. If this is still a problem for you, please add a comment and set the state back to New.

Thank you.

Changed in simplestreams:
status: Fix Committed → Fix Released
description: updated
Changed in simplestreams (Ubuntu):
status: New → Fix Released
Changed in simplestreams (Ubuntu Disco):
status: New → Fix Released
Changed in simplestreams (Ubuntu Bionic):
status: New → Fix Released
Changed in simplestreams (Ubuntu Xenial):
status: New → In Progress
assignee: nobody → Rafael David Tinoco (rafaeldtinoco)

An upload of simplestreams to xenial-proposed has been rejected from the upload queue for the following reason: "See https://bugs.launchpad.net/charms/+source/glance-simplestreams-sync/+bug/1611987/comments/21".

Changed in simplestreams (Ubuntu Xenial):
status: In Progress → Won't Fix
Changed in simplestreams (Ubuntu Bionic):
status: Fix Released → In Progress
Changed in simplestreams (Ubuntu Xenial):
assignee: Rafael David Tinoco (rafaeldtinoco) → nobody
Changed in simplestreams (Ubuntu Bionic):
assignee: nobody → Rafael David Tinoco (rafaeldtinoco)

TL;DR version:

We are going to fix the Bionic simplestreams package and provide that package in the Ubuntu Cloud Archive for Xenial. With that, we fix the Xenial simplestreams behavior (to work with keystone v3), but only if the end user enables the Ubuntu Cloud Archive.

Conversation about this topic:

https://code.launchpad.net/~rafaeldtinoco/ubuntu/+source/simplestreams/+git/simplestreams/+merge/373030/comments/980111

Changed in simplestreams (Ubuntu Bionic):
status: In Progress → Fix Released
assignee: Rafael David Tinoco (rafaeldtinoco) → nobody
Simon Poirier (simpoir) on 2020-09-03
Changed in landscape:
status: Fix Committed → Fix Released