Image creation failed: Error finding address for http://xx.xx.xx.xx:9292/v2/images/x-x-x-x-x/file: Unable to establish connection to http://xx.xx.xx.xx:9292/v2/images/x-x-x-x-x/file

Bug #1751842 reported by Jason Hobbs on 2018-02-26
This bug affects 2 people
Affects                     Importance  Assigned to
OpenStack ceph-mon charm    Undecided   Chris MacNaughton
OpenStack ceph-osd charm    Undecided   Chris MacNaughton
OpenStack glance charm      Medium      David Ames

Bug Description

We have hit several failures of this pattern. This is using HA glance with ceph as the back end.

Sometimes this occurs right after we finish the openstack deploy and are uploading images:
http://paste.ubuntu.com/p/hm7K7FfYsB/

At least once it's happened during a rally test.

Here is the bundle and overlay:
http://paste.ubuntu.com/p/Rhf4FN2kJ3/

Despite the error message, I think it can connect, because glance-api.log has these log messages:
http://paste.ubuntu.com/p/zRrR8QczXN/

10.244.40.215 is the IP of the glance 0 unit, and the above log comes from the glance 1 unit. I think it's haproxy on glance 0 forwarding the connection to glance 1.

I think this is probably a timeout problem; maybe haproxy needs to be adjusted?

This is also suspicious, regarding upload speed:
http://paste.ubuntu.com/p/g5f6fhyc2M/

Some example failures from rally:
http://paste.ubuntu.com/p/ZPbFD4WNRZ/

description: updated
Jason Hobbs (jason-hobbs) wrote :

Speaking with David Ames on IRC, it sounds like we may be able to solve or work around this by increasing the server/client haproxy timeouts for glance.

Jason Hobbs (jason-hobbs) wrote :

I've changed our timeout values from 90 seconds to 180 seconds; we will report if we continue to hit this.

Christian Reis (kiko) wrote :

If that works, should we change the standard charm timeouts?

Jason Hobbs (jason-hobbs) wrote :

We are still hitting this with:

haproxy-server-timeout: 180000 # XXX: 1751842
haproxy-client-timeout: 180000
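
For reference, the equivalent change can also be applied to a running deployment via the juju CLI (option names as in the bundle snippet above; verify them against `juju config glance` for your charm revision):

```shell
# Raise the glance haproxy timeouts to 180 seconds (values are in ms).
juju config glance \
    haproxy-server-timeout=180000 \
    haproxy-client-timeout=180000
```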

Chris Gregan (cgregan) wrote :

Bumped to field-high due to repeated blocking failures in CI.

Jason Hobbs (jason-hobbs) wrote :

Some ceph errors from a run today that hit this:

ceph-mon_1/var/log/ceph/ceph-mon.juju-16afa6-1-lxd-0.log:2018-02-27 02:00:00.000125 7f479cf4a700 0 log_channel(cluster) log [WRN] : overall HEALTH_WARN no active mgr; application not enabled on 2 pool(s)

Jason Hobbs (jason-hobbs) wrote :

Here is the bundle from comment #7: http://paste.ubuntu.com/p/j8dfT4qk9n/

Jason Hobbs (jason-hobbs) wrote :

We need to add logging of when we start the request, and when the request finishes. This may be in the rally logs; we are missing some due to a bug in our log collection. We'll get that fixed, and add logging to the other place this is happening.

David Ames (thedac) wrote :

Found this warning in the glance logs:

2018-02-25 10:01:09.975 189710 WARNING glance_store._drivers.cinder [-] Cinder store is considered experimental. Current deployers should be aware that the use of it in production right now may be risky.

The bundle does create the relation that uses cinder store:
 - [ "glance:cinder-volume-service", "cinder:cinder-volume-service" ]

I have a similar deployment with the same details. I'll test without that relation.

Changed in charm-glance:
status: New → Confirmed
importance: Undecided → Medium
assignee: nobody → David Ames (thedac)
milestone: none → 18.02
David Ames (thedac) wrote :

The ceph-mon relation also exists, which may mean competing storage backends.

Jason Hobbs (jason-hobbs) wrote :

We've updated our bundle to remove the cinder-volume-service relation and will re-test.

David Ames (thedac) wrote :

This turns out to be a ceph-mon/ceph-osd problem. When customize-failure-domain=True, ceph comes up degraded. This may be a bug in ceph-mon/ceph-osd, or it may be a misconfiguration of Juju availability zones. I'll discuss with my team.

@Jason can you test with your bundle and customize-failure-domain=False?

Replication:
The problem can be recreated by running the ceph-osd amulet test with customize-failure-domain=True set.

Further information:
When ceph has a status with pgs undersized (and degraded) it is effectively non-functional. All services that rely on it will also break. Both glance and ceph-radosgw were broken in my testing environment.

ceph status
  cluster:
    id: 6547bd3e-1397-11e2-82e5-53567c8d32dc
    health: HEALTH_WARN
            Reduced data availability: 132 pgs inactive
            Degraded data redundancy: 132 pgs unclean, 132 pgs degraded, 132 pgs undersized
            too few PGs per OSD (22 < min 30)

  services:
    mon: 3 daemons, quorum juju-b3f97f-ceph-1,juju-b3f97f-ceph-0,juju-b3f97f-ceph-2
    mgr: juju-b3f97f-ceph-1(active), standbys: juju-b3f97f-ceph-0, juju-b3f97f-ceph-2
    osd: 6 osds: 6 up, 6 in

  data:
    pools: 17 pools, 132 pgs
    objects: 0 objects, 0 bytes
    usage: 8967 MB used, 78483 MB / 87450 MB avail
    pgs: 100.000% pgs not active
             132 undersized+degraded+peered

root@juju-b3f97f-ceph-0:/home/ubuntu# ceph health detail
HEALTH_WARN Reduced data availability: 132 pgs inactive; Degraded data redundancy: 132 pgs unclean, 132 pgs degraded, 132 pgs undersized; too few PGs per OSD (22 < min 30)
PG_AVAILABILITY Reduced data availability: 132 pgs inactive
    pg 2.16 is stuck inactive for 729.804093, current state undersized+degraded+peered, last acting [5]
    pg 2.17 is stuck inactive for 729.804093, current state undersized+degraded+peered, last acting [3]
    pg 2.18 is stuck inactive for 729.804093, current state undersized+degraded+peered, last acting [5]
...
PG_DEGRADED Degraded data redundancy: 132 pgs unclean, 132 pgs degraded, 132 pgs undersized
    pg 2.16 is stuck unclean for 729.804093, current state undersized+degraded+peered, last acting [5]
    pg 2.17 is stuck undersized for 728.762611, current state undersized+degraded+peered, last acting [3]
    pg 2.18 is stuck undersized for 728.756797, current state undersized+degraded+peered, last acting [5]
...
TOO_FEW_PGS too few PGs per OSD (22 < min 30)
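
The TOO_FEW_PGS figure is itself consistent with the degraded state above: every PG is peered on a single acting OSD, so only one PG instance per placement group is counted. A rough back-of-the-envelope check (my own arithmetic, not the exact mon health-check implementation):

```python
# Hypothetical sketch of why this cluster reports 22 PGs per OSD.
pgs = 132            # total placement groups across the 17 pools
osds = 6             # OSDs up/in
acting_replicas = 1  # every PG is undersized+peered on one acting OSD

per_osd = pgs * acting_replicas // osds
print(per_osd)  # 22, below Ceph's default minimum of 30

# With healthy 3-way replication the same pools would map
# 132 * 3 / 6 = 66 PG instances per OSD and the warning would clear.
healthy_per_osd = pgs * 3 // osds
print(healthy_per_osd)  # 66
```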

Changed in charm-glance:
status: Confirmed → Invalid

If customize-failure-domain is enabled, then the Juju deployment *must* have multiple availability zones available, either through the Juju Availability Zone configuration, or through juju configuration on the ceph-osd charms (where you would need three+ application deployments).

When customize-failure-domain is enabled, Ceph is instructed to spread replicas in the default pools across racks/rows, which is not possible if there are not enough zones configured, so the cluster never becomes healthy.
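
For illustration, a rack-level replicated rule of the kind implied by customize-failure-domain would look roughly like this (a hand-written hypothetical CRUSH rule, not the one the charm generates):

```
rule replicated_rack {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    # Pick each replica from a different rack; with fewer racks than
    # the pool size, PGs stay undersized exactly as described above.
    step chooseleaf firstn 0 type rack
    step emit
}
```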

David Ames (thedac) on 2018-03-02
Changed in charm-ceph-mon:
status: New → Invalid
Changed in charm-ceph-osd:
status: New → Invalid
Jason Hobbs (jason-hobbs) wrote :

We do have multiple availability zones in our deployment; we have 3 of them. I'm not sure which charm this should be against at this point, but it shouldn't be invalid across the board.

Changed in charm-glance:
status: Invalid → New
Changed in charm-ceph-osd:
status: Invalid → New
Changed in charm-ceph-mon:
status: Invalid → New

@jason-hobbs, looking a bit into the charms, and the juju status of one of the deployments, it is clear to me that there is a bug in the ceph* charms, in that we never update crush to actually use a configuration-based failure domain (ceph-osd: https://github.com/openstack/charm-ceph-osd/blob/master/hooks/ceph_hooks.py#L193 ; ceph-mon: https://github.com/openstack/charm-ceph-mon/blob/master/hooks/ceph_hooks.py#L370); however, you should be using the availability zone support, so you should not be affected by that bug.

Looking at the juju status output included in the initial crashdump pasted into this bug, I can see that the machines used for the ceph-osd deployment are on (0, 1, 2) are in three separate availability zones, so I will continue to dig into both these logs, and thedac's reproducer to try to identify the cause of the unhealthy ceph cluster.

Would it be possible to get the output of `ceph status` on any of the ceph-mon units in a deployment that has failed in a similar fashion? While I will be trying to dig in on this in the meantime, knowing how healthy ceph thinks it is may help reduce the number of rabbit holes I have to go down.

Changed in charm-ceph-mon:
assignee: nobody → Chris MacNaughton (chris.macnaughton)
Changed in charm-ceph-osd:
assignee: nobody → Chris MacNaughton (chris.macnaughton)
David Ames (thedac) wrote :

This bug is invalid for glance, it is a ceph bug.

Changed in charm-glance:
status: New → Invalid
Jason Hobbs (jason-hobbs) wrote :

Chris, why not build ceph status into the charm's update status hook?

We don't pause our CI on failures, so it's hard to get stuff manually like that.
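
A minimal sketch of what that could look like from the charm's update-status hook (hypothetical helper names; assumes `ceph status --format json` is available on the unit, and is not the charm's actual code):

```python
import json
import subprocess


def summarize(status_json):
    """Reduce `ceph status --format json` output to a one-line summary."""
    status = json.loads(status_json)
    # Luminous reports overall health under health.status
    # (HEALTH_OK / HEALTH_WARN / HEALTH_ERR).
    return status.get('health', {}).get('status', 'UNKNOWN')


def ceph_health_for_workload_status():
    """Hypothetical helper a mon charm could call from update-status."""
    try:
        out = subprocess.check_output(
            ['ceph', 'status', '--format', 'json'])
    except (OSError, subprocess.CalledProcessError):
        return 'ceph status unavailable'
    return summarize(out.decode('utf-8'))
```

The one-line summary could go into the unit's workload status, with the full `ceph status` output logged to the unit log as suggested below.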

I've filed https://bugs.launchpad.net/charm-ceph-mon/+bug/1752942 to track the separately identified bug in failure domain configuration, which should not be affecting this situation.

It's a bit big for juju status output:

 $ juju ssh ceph-mon/0 sudo ceph status
sudo: unable to resolve host juju-adc4f7-icey-0
    cluster dd921caa-1d6c-11e8-893f-fa163e127e47
     health HEALTH_OK
     monmap e1: 3 mons at {juju-adc4f7-icey-0=10.5.0.24:6789/0,juju-adc4f7-icey-1=10.5.0.16:6789/0,juju-adc4f7-icey-2=10.5.0.12:6789/0}
            election epoch 10, quorum 0,1,2 juju-adc4f7-icey-2,juju-adc4f7-icey-1,juju-adc4f7-icey-0
     osdmap e13: 3 osds: 3 up, 3 in
            flags sortbitwise,require_jewel_osds
      pgmap v1021: 64 pgs, 1 pools, 0 bytes data, 0 objects
            7964 MB used, 36463 MB / 44475 MB avail
                  64 active+clean

I'm not saying make it part of the juju status output, except maybe to summarize, i.e. just show the health status there. You could capture the rest in the unit log.

Ryan Beisner (1chb1n) wrote :

@jhobbs I think incorporating ceph health summary into workload status is a good idea. Please raise a separate wishlist bug to represent that feature request. Thank you.

Jason Hobbs (jason-hobbs) wrote :

Looks like this might be an haproxy timeout after all: http://paste.ubuntu.com/p/bKynKjcJGt/

180 seconds is our haproxy timeout.

It seems odd to me that the haproxy timeout would apply if we're actively uploading. As long as we're still pushing data and the server is responding, it doesn't seem like it should time out.

James Page (james-page) wrote :

shade.exc.OpenStackCloudException: Image creation failed: Error finding address for http://10.244.40.85:9292/v2/images/2c1e2ceb-3142-46f8-93ac-ea0eac00509b/file: Unable to establish connection to http://10.244.40.85:9292/v2/images/2c1e2ceb-3142-46f8-93ac-ea0eac00509b/file

I'm not sure you are pushing anything - that's an 'Unable to establish connection' error; haproxy won't terminate a connection if client and server are actively communicating - the timeouts relate to periods of inactivity, I think; i.e. the server taking a long time to respond to the client or vice versa.
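
For context, the relevant haproxy directives are indeed inactivity timeouts (my reading of the haproxy documentation; an illustrative fragment, not the charm's rendered config):

```
defaults
    # Inactivity timeout on the client side of an established
    # connection (ms); it is not a cap on total transfer time.
    timeout client  180000
    # Same, for the server side of the proxied connection.
    timeout server  180000
    # Time allowed to establish the TCP connection to a backend;
    # failures here look like "Unable to establish connection".
    timeout connect 5000
```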

James Page (james-page) wrote :

Snip from glanceclient (source of "Error finding address for"):

        except requests.exceptions.ConnectionError as e:
            message = ("Error finding address for %(url)s: %(e)s" %
                       dict(url=conn_url, e=e))
            raise exc.CommunicationError(message=message)
        except socket.gaierror as e:
            message = "Error finding address for %s: %s" % (
                self.endpoint_hostname, e)
            raise exc.InvalidEndpoint(message=message)

James Page (james-page) wrote :

I think this is raised from a requests.exceptions.ConnectionError exception in keystoneclient which handles the actual URL call.

James Page (james-page) wrote :

Unfortunately the actual underlying error is hidden here; re the original bug report - yes, there are successful requests to the glance API, but they don't appear to correspond timestamp-wise to the 180-second timeout detailed (the good requests are some seconds later).

James Page (james-page) wrote :

Marking Incomplete for now as I still don't think we understand the root cause of the error, and we don't have a reproducer as a result.

Changed in charm-ceph-osd:
status: New → Incomplete
Changed in charm-ceph-mon:
status: New → Incomplete
Jason Hobbs (jason-hobbs) wrote :

James,

What additional information/logs do you need to help identify the root cause?

Changed in charm-ceph-mon:
status: Incomplete → New
Changed in charm-ceph-osd:
status: Incomplete → New
James Page (james-page) wrote :

Logs won't help - I need a reproducer or access to a system which is exhibiting this issue.

Changed in charm-ceph-mon:
status: New → Incomplete
Changed in charm-ceph-osd:
status: New → Incomplete
Chris Gregan (cgregan) wrote :

Looks like it is not just glance:
shade.exc.OpenStackCloudException: Error fetching keypair list (Inner Exception: Unable to establish connection to http://10.244.40.91:8774/v2.1/os-keypairs: HTTPConnectionPool(host='10.244.40.91', port=8774): Max retries exceeded with url: /v2.1/os-keypairs (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f599ff4d110>: Failed to establish a new connection: [Errno 113] No route to host',)))
Exception in thread Thread-3 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
