Image creation failed: Error finding address for http://xx.xx.xx.xx:9292/v2/images/x-x-x-x-x/file: Unable to establish connection to http://xx.xx.xx.xx:9292/v2/images/x-x-x-x-x/file

Bug #1751842 reported by Jason Hobbs on 2018-02-26
This bug affects 2 people
Affects                     Importance  Assigned to
OpenStack ceph-mon charm    Undecided   Chris MacNaughton
OpenStack ceph-osd charm    Undecided   Chris MacNaughton
OpenStack glance charm      Medium      David Ames

Bug Description

We have hit several failures of this pattern. This is using HA glance with ceph as the back end.

Sometimes this occurs right after we finish the openstack deploy and are uploading images:
http://paste.ubuntu.com/p/hm7K7FfYsB/

At least once it's happened during a rally test.

Here is the bundle and overlay:
http://paste.ubuntu.com/p/Rhf4FN2kJ3/

Despite the error message, I think it can connect, because glance-api.log has these log messages:
http://paste.ubuntu.com/p/zRrR8QczXN/

10.244.40.215 is the IP of the glance 0 unit, and the above log comes from the glance 1 unit. I think it's haproxy on glance 0 forwarding the connection to glance 1.

I think this is probably a timeout problem; maybe haproxy needs to be adjusted?

This is also suspicious, regarding upload speed:
http://paste.ubuntu.com/p/g5f6fhyc2M/

Some example failures from rally:
http://paste.ubuntu.com/p/ZPbFD4WNRZ/

description: updated
Jason Hobbs (jason-hobbs) wrote :

Speaking with David Ames on IRC, it sounds like we may be able to solve or work around this by increasing the server/client haproxy timeouts for glance.

Jason Hobbs (jason-hobbs) wrote :

I've changed our timeout values from 90 seconds to 180 seconds; we will report if we continue to hit this.

Christian Reis (kiko) wrote :

If that works, should we change the standard charm timeouts?

Jason Hobbs (jason-hobbs) wrote :

We are still hitting this with:

haproxy-server-timeout: 180000 # XXX: 1751842
haproxy-client-timeout: 180000
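
For reference, the equivalent change can also be applied to a running deployment via the juju CLI (option names as in the bundle snippet above; verify them against `juju config glance` for your charm revision):

```shell
# Raise the glance haproxy timeouts to 180 seconds (values are in ms).
juju config glance \
    haproxy-server-timeout=180000 \
    haproxy-client-timeout=180000
```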

Chris Gregan (cgregan) wrote :

Bumped to field-high due to repeated blocking failures in CI.

Jason Hobbs (jason-hobbs) wrote :

Some ceph errors from a run today that hit this:

ceph-mon_1/var/log/ceph/ceph-mon.juju-16afa6-1-lxd-0.log:2018-02-27 02:00:00.000125 7f479cf4a700 0 log_channel(cluster) log [WRN] : overall HEALTH_WARN no active mgr; application not enabled on 2 pool(s)

Jason Hobbs (jason-hobbs) wrote :

Here is the bundle from comment #7: http://paste.ubuntu.com/p/j8dfT4qk9n/

Jason Hobbs (jason-hobbs) wrote :

We need to add logging of when we start the request, and when the request finishes. This may be in the rally logs; we are missing some due to a bug in our log collection. We'll get that fixed, and add logging to the other place this is happening.

David Ames (thedac) wrote :

Found this warning in the glance logs:

2018-02-25 10:01:09.975 189710 WARNING glance_store._drivers.cinder [-] Cinder store is considered experimental. Current deployers should be aware that the use of it in production right now may be risky.

The bundle does create the relation that uses cinder store:
 - [ "glance:cinder-volume-service", "cinder:cinder-volume-service" ]

I have a similar deployment with the same details. I'll test without that relation.

Changed in charm-glance:
status: New → Confirmed
importance: Undecided → Medium
assignee: nobody → David Ames (thedac)
milestone: none → 18.02
David Ames (thedac) wrote :

The ceph-mon relation also exists, which may mean competing storage backends.

Jason Hobbs (jason-hobbs) wrote :

We've updated our bundle to remove the cinder-volume-service relation and will re-test.

David Ames (thedac) wrote :

This turns out to be a ceph-mon/ceph-osd problem. When customize-failure-domain=True, ceph comes up degraded. This may be a bug in ceph-mon/ceph-osd, or it may be a misconfiguration of Juju availability zones. I'll discuss with my team.

@Jason can you test with your bundle and customize-failure-domain=False?

Replication:
The problem can be recreated by running the ceph-osd amulet test with customize-failure-domain=True set.

Further information:
When ceph has a status with pgs undersized (and degraded) it is effectively non-functional. All services that rely on it will also break. Both glance and ceph-radosgw were broken in my testing environment.

ceph status
  cluster:
    id: 6547bd3e-1397-11e2-82e5-53567c8d32dc
    health: HEALTH_WARN
            Reduced data availability: 132 pgs inactive
            Degraded data redundancy: 132 pgs unclean, 132 pgs degraded, 132 pgs undersized
            too few PGs per OSD (22 < min 30)

  services:
    mon: 3 daemons, quorum juju-b3f97f-ceph-1,juju-b3f97f-ceph-0,juju-b3f97f-ceph-2
    mgr: juju-b3f97f-ceph-1(active), standbys: juju-b3f97f-ceph-0, juju-b3f97f-ceph-2
    osd: 6 osds: 6 up, 6 in

  data:
    pools: 17 pools, 132 pgs
    objects: 0 objects, 0 bytes
    usage: 8967 MB used, 78483 MB / 87450 MB avail
    pgs: 100.000% pgs not active
             132 undersized+degraded+peered

root@juju-b3f97f-ceph-0:/home/ubuntu# ceph health detail
HEALTH_WARN Reduced data availability: 132 pgs inactive; Degraded data redundancy: 132 pgs unclean, 132 pgs degraded, 132 pgs undersized; too few PGs per OSD (22 < min 30)
PG_AVAILABILITY Reduced data availability: 132 pgs inactive
    pg 2.16 is stuck inactive for 729.804093, current state undersized+degraded+peered, last acting [5]
    pg 2.17 is stuck inactive for 729.804093, current state undersized+degraded+peered, last acting [3]
    pg 2.18 is stuck inactive for 729.804093, current state undersized+degraded+peered, last acting [5]
...
PG_DEGRADED Degraded data redundancy: 132 pgs unclean, 132 pgs degraded, 132 pgs undersized
    pg 2.16 is stuck unclean for 729.804093, current state undersized+degraded+peered, last acting [5]
    pg 2.17 is stuck undersized for 728.762611, current state undersized+degraded+peered, last acting [3]
    pg 2.18 is stuck undersized for 728.756797, current state undersized+degraded+peered, last acting [5]
...
TOO_FEW_PGS too few PGs per OSD (22 < min 30)
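
The TOO_FEW_PGS figure is itself consistent with the degraded state above: every PG is peered on a single acting OSD, so only one PG instance per placement group is counted. A rough back-of-the-envelope check (my own arithmetic, not the exact mon health-check implementation):

```python
# Hypothetical sketch of why this cluster reports 22 PGs per OSD.
pgs = 132            # total placement groups across the 17 pools
osds = 6             # OSDs up/in
acting_replicas = 1  # every PG is undersized+peered on one acting OSD

per_osd = pgs * acting_replicas // osds
print(per_osd)  # 22, below Ceph's default minimum of 30

# With healthy 3-way replication the same pools would map
# 132 * 3 / 6 = 66 PG instances per OSD and the warning would clear.
healthy_per_osd = pgs * 3 // osds
print(healthy_per_osd)  # 66
```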

Changed in charm-glance:
status: Confirmed → Invalid

If customize-failure-domain is enabled, then the Juju deployment *must* have multiple availability zones available, either through the Juju Availability Zone configuration, or through juju configuration on the ceph-osd charms (where you would need three+ application deployments).

When customize-failure-domain is enabled, Ceph is instructed to spread replicas in the default pools across racks/rows, which is not possible if there are not enough zones configured, so the cluster never becomes healthy.
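
For illustration, a rack-level replicated rule of the kind implied by customize-failure-domain would look roughly like this (a hand-written hypothetical CRUSH rule, not the one the charm generates):

```
rule replicated_rack {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    # Pick each replica from a different rack; with fewer racks than
    # the pool size, PGs stay undersized exactly as described above.
    step chooseleaf firstn 0 type rack
    step emit
}
```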

David Ames (thedac) on 2018-03-02
Changed in charm-ceph-mon:
status: New → Invalid
Changed in charm-ceph-osd:
status: New → Invalid
Jason Hobbs (jason-hobbs) wrote :

We do have multiple availability zones in our deployment; we have 3 of them. I'm not sure which charm this should be against at this point, but it shouldn't be invalid across the board.

Changed in charm-glance:
status: Invalid → New
Changed in charm-ceph-osd:
status: Invalid → New
Changed in charm-ceph-mon:
status: Invalid → New

@jason-hobbs, looking a bit into the charms, and the juju status of one of the deployments, it is clear to me that there is a bug in the ceph* charms, in that we never update crush to actually use a configuration-based failure domain (ceph-osd: https://github.com/openstack/charm-ceph-osd/blob/master/hooks/ceph_hooks.py#L193 ; ceph-mon: https://github.com/openstack/charm-ceph-mon/blob/master/hooks/ceph_hooks.py#L370); however, you should be using the availability zone support, so you should not be affected by that bug.

Looking at the juju status output included in the initial crashdump pasted into this bug, I can see that the machines used for the ceph-osd deployment are on (0, 1, 2) are in three separate availability zones, so I will continue to dig into both these logs, and thedac's reproducer to try to identify the cause of the unhealthy ceph cluster.

Would it be possible to get the output of `ceph status` on any of the ceph-mon units in a deployment that has failed in a similar fashion? While I will be trying to dig in on this in the meantime, knowing how healthy ceph thinks it is may help reduce the number of rabbit holes I have to go down.

Changed in charm-ceph-mon:
assignee: nobody → Chris MacNaughton (chris.macnaughton)
Changed in charm-ceph-osd:
assignee: nobody → Chris MacNaughton (chris.macnaughton)
David Ames (thedac) wrote :

This bug is invalid for glance, it is a ceph bug.

Changed in charm-glance:
status: New → Invalid
Jason Hobbs (jason-hobbs) wrote :

Chris, why not build ceph status into the charm's update status hook?

We don't pause our CI on failures, so it's hard to get stuff manually like that.
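
A minimal sketch of what that could look like from the charm's update-status hook (hypothetical helper names; assumes `ceph status --format json` is available on the unit, and is not the charm's actual code):

```python
import json
import subprocess


def summarize(status_json):
    """Reduce `ceph status --format json` output to a one-line summary."""
    status = json.loads(status_json)
    # Luminous reports overall health under health.status
    # (HEALTH_OK / HEALTH_WARN / HEALTH_ERR).
    return status.get('health', {}).get('status', 'UNKNOWN')


def ceph_health_for_workload_status():
    """Hypothetical helper a mon charm could call from update-status."""
    try:
        out = subprocess.check_output(
            ['ceph', 'status', '--format', 'json'])
    except (OSError, subprocess.CalledProcessError):
        return 'ceph status unavailable'
    return summarize(out.decode('utf-8'))
```

The one-line summary could go into the unit's workload status, with the full `ceph status` output logged to the unit log as suggested below.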

I've filed https://bugs.launchpad.net/charm-ceph-mon/+bug/1752942 to track the separately identified bug in failure domain configuration, which should not be affecting this situation.

It's a bit big for juju status output:

 $ juju ssh ceph-mon/0 sudo ceph status
sudo: unable to resolve host juju-adc4f7-icey-0
    cluster dd921caa-1d6c-11e8-893f-fa163e127e47
     health HEALTH_OK
     monmap e1: 3 mons at {juju-adc4f7-icey-0=10.5.0.24:6789/0,juju-adc4f7-icey-1=10.5.0.16:6789/0,juju-adc4f7-icey-2=10.5.0.12:6789/0}
            election epoch 10, quorum 0,1,2 juju-adc4f7-icey-2,juju-adc4f7-icey-1,juju-adc4f7-icey-0
     osdmap e13: 3 osds: 3 up, 3 in
            flags sortbitwise,require_jewel_osds
      pgmap v1021: 64 pgs, 1 pools, 0 bytes data, 0 objects
            7964 MB used, 36463 MB / 44475 MB avail
                  64 active+clean

I'm not saying make it part of the juju status output, except maybe to summarize, i.e. just show the health status there. You could capture the rest in the unit log.

Ryan Beisner (1chb1n) wrote :

@jhobbs I think incorporating ceph health summary into workload status is a good idea. Please raise a separate wishlist bug to represent that feature request. Thank you.

Jason Hobbs (jason-hobbs) wrote :

Looks like this might be an haproxy timeout after all: http://paste.ubuntu.com/p/bKynKjcJGt/

180 seconds is our haproxy timeout.

It seems odd to me that the haproxy timeout would apply if we're actively uploading. As long as we're still pushing data and the server is responding, it doesn't seem like it should time out.

James Page (james-page) wrote :

shade.exc.OpenStackCloudException: Image creation failed: Error finding address for http://10.244.40.85:9292/v2/images/2c1e2ceb-3142-46f8-93ac-ea0eac00509b/file: Unable to establish connection to http://10.244.40.85:9292/v2/images/2c1e2ceb-3142-46f8-93ac-ea0eac00509b/file

I'm not sure you are pushing anything - that's an 'Unable to establish connection' error; haproxy won't terminate a connection if client and server are actively communicating - the timeouts relate to periods of inactivity, I think; i.e. the server taking a long time to respond to the client or vice versa.
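
For context, the relevant haproxy directives are indeed inactivity timeouts (my reading of the haproxy documentation; an illustrative fragment, not the charm's rendered config):

```
defaults
    # Inactivity timeout on the client side of an established
    # connection (ms); it is not a cap on total transfer time.
    timeout client  180000
    # Same, for the server side of the proxied connection.
    timeout server  180000
    # Time allowed to establish the TCP connection to a backend;
    # failures here look like "Unable to establish connection".
    timeout connect 5000
```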

James Page (james-page) wrote :

Snip from glanceclient (source of "Error finding address for"):

        except requests.exceptions.ConnectionError as e:
            message = ("Error finding address for %(url)s: %(e)s" %
                       dict(url=conn_url, e=e))
            raise exc.CommunicationError(message=message)
        except socket.gaierror as e:
            message = "Error finding address for %s: %s" % (
                self.endpoint_hostname, e)
            raise exc.InvalidEndpoint(message=message)

James Page (james-page) wrote :

I think this is raised from a requests.exceptions.ConnectionError exception in keystoneclient which handles the actual URL call.

James Page (james-page) wrote :

Unfortunately the actual underlying error is hidden here; re the original bug report - yes, there are successful requests to the glance API, but they don't appear to correspond timestamp-wise to the 180-second timeout detailed (the good requests are some seconds later).

James Page (james-page) wrote :

Marking Incomplete for now as I still don't think we understand the root cause of the error, and we don't have a reproducer as a result.

Changed in charm-ceph-osd:
status: New → Incomplete
Changed in charm-ceph-mon:
status: New → Incomplete
Jason Hobbs (jason-hobbs) wrote :

James,

What additional information/logs do you need to help identify the root cause?

Changed in charm-ceph-mon:
status: Incomplete → New
Changed in charm-ceph-osd:
status: Incomplete → New
James Page (james-page) wrote :

Logs won't help - I need a reproducer or access to a system which is exhibiting this issue.

Changed in charm-ceph-mon:
status: New → Incomplete
Changed in charm-ceph-osd:
status: New → Incomplete
Chris Gregan (cgregan) wrote :

Looks like it is not just glance:
shade.exc.OpenStackCloudException: Error fetching keypair list (Inner Exception: Unable to establish connection to http://10.244.40.91:8774/v2.1/os-keypairs: HTTPConnectionPool(host='10.244.40.91', port=8774): Max retries exceeded with url: /v2.1/os-keypairs (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f599ff4d110>: Failed to establish a new connection: [Errno 113] No route to host',)))
Exception in thread Thread-3 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
