ceph-radosgw "Initialization timeout, failed to initialize"

Bug #1577519 reported by Francis Ginther on 2016-05-02
54
This bug affects 8 people
Affects Status Importance Assigned to Milestone
Landscape Server
High
Unassigned
16.06
High
Andreas Hasenack
OpenStack ceph-radosgw charm
High
Unassigned
ceph-radosgw (Juju Charms Collection)
High
Unassigned

Bug Description

Had a landscape OSA deployment fail (ceph object and ceph block storage). Juju status indicated that radosgw was not running. Looking at the logs:

[/var/log/ceph/radosgw.log]
2016-05-02 15:57:58.635872 7fabea0d37c0 0 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403), process radosgw, pid 14235
2016-05-02 15:57:58.644866 7fabea0d37c0 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
2016-05-02 15:57:58.644943 7fabea0d37c0 0 librados: client.radosgw.gateway initialization error (2) No such file or directory
2016-05-02 15:57:58.645304 7fabea0d37c0 -1 Couldn't init storage provider (RADOS)
2016-05-02 15:57:58.914016 7f02f61517c0 0 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403), process radosgw, pid 14299
2016-05-02 15:57:58.926124 7f02f61517c0 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
2016-05-02 15:57:58.926126 7f02f61517c0 0 librados: client.radosgw.gateway initialization error (2) No such file or directory
2016-05-02 15:57:58.928791 7f02f61517c0 -1 Couldn't init storage provider (RADOS)
2016-05-02 15:58:33.891675 7fbd500bf7c0 0 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403), process radosgw, pid 15500
2016-05-02 15:58:34.120087 7f923d49c7c0 0 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403), process radosgw, pid 15569
2016-05-02 16:03:34.120903 7f9232ee3700 -1 Initialization timeout, failed to initialize

The juju logs show mon-relation-changed activity ended at 15:58:37:
[/var/log/juju/unit-ceph-radosgw-0.log]
2016-05-02 15:58:33 INFO mon-relation-changed creating /etc/ceph/keyring.rados.gateway
2016-05-02 15:58:33 INFO mon-relation-changed added entity client.radosgw.gateway auth auth(auid = 18446744073709551615 key=AQCPeCdXn5MzCxAAbfu863TJF5LV4qSU5TjIhg== with 0 caps)
2016-05-02 15:58:33 INFO mon-relation-changed No /usr/bin/radosgw found running; none killed.
2016-05-02 15:58:33 INFO mon-relation-changed Starting client.radosgw.gateway...
2016-05-02 15:58:34 INFO mon-relation-changed Starting client.radosgw.gateway...
2016-05-02 15:58:37 INFO mon-relation-changed ^MReading package lists... 0%^M^MReading package lists... [snip]
2016-05-02 15:58:37 INFO mon-relation-changed ^MBuilding dependency tree... [snip]
2016-05-02 15:58:37 INFO mon-relation-changed ^MReading state information... 0%^M^MReading state information... 0%^M^MReading state information... Done
2016-05-02 15:58:37 INFO juju-log mon:33: Unit is ready
2016-05-02 15:58:37 INFO juju-log mon:33: Unit is ready

I was able to manually start radosgw after I noticed the failure. Not sure why it failed to initialize.

Francis Ginther (fginther) wrote :

Found another case of this "Initialization timeout, failed to initialize" error with 10.2.0.

/var/log/ceph/radosgw.log: https://pastebin.canonical.com/156342/

Francis Ginther (fginther) wrote :

/var/log/ceph/radosgw.log

Francis Ginther (fginther) wrote :

/var/log/juju/* from failure with 10.2.0.

Francis Ginther (fginther) wrote :

"juju get ceph-radosgw" from failure with 10.2.0.

Francis Ginther (fginther) wrote :

Failures with 10.2.0 was using charm: cs:trusty/ceph-radosgw-239

tags: added: kanban-cross-team
tags: removed: kanban-cross-team
James Page (james-page) wrote :

Are you using the new ceph-mon charm? I'm wondering whether radosgw can't init itself if the ceph-mon charm has provided a key, but there is no OSD storage at the point of startup.

Francis Ginther (fginther) wrote :

James, we're using "cs:trusty/ceph-mon-0"

Francis Ginther (fginther) wrote :

Another case of this. Attaching /var/log/[ceph|juju]/*

Francis Ginther (fginther) wrote :

Yet another case, ceph-radosgw failed to initializes on all three units. Logs contains /var/log/[juju|ceph]/* on all ceph-* units.

James Page (james-page) wrote :

I was able to reproduce this by deploying ceph-mon (3 units) with ceph-radosgw (single unit).

After 5 minutes, the radosgw process dies with the following error message:

2016-05-16 12:56:11.743599 7ff7d1acb700 -1 Initialization timeout, failed to initialize

Basically, because no OSD's appear in the cluster in that initial window, the radosgw never completes its initialization and shuts itself down.

radosgw is dependent on OSD storage for state - so kinda understandable.

James Page (james-page) wrote :

Default in Jewel:

common/config_opts.h:OPTION(rgw_init_timeout, OPT_INT, 300) // time in seconds

James Page (james-page) on 2016-05-16
Changed in ceph-radosgw (Juju Charms Collection):
status: New → Confirmed
importance: Undecided → Medium
milestone: none → 16.07
James Page (james-page) on 2016-05-16
Changed in ceph-radosgw (Juju Charms Collection):
importance: Medium → High
James Page (james-page) on 2016-05-17
Changed in ceph-radosgw (Juju Charms Collection):
assignee: nobody → James Page (james-page)
status: Confirmed → In Progress
James Page (james-page) wrote :

Proposed fix published here:

  cs:~james-page/xenial/ceph-mon-gated-0

Testing locally with my reproducer.

James Page (james-page) on 2016-05-20
Changed in ceph-radosgw (Juju Charms Collection):
status: In Progress → Fix Committed
James Page (james-page) on 2016-05-20
Changed in ceph-radosgw (Juju Charms Collection):
status: Fix Committed → Fix Released
Changed in landscape:
status: New → Fix Committed
milestone: none → 16.06
Changed in landscape:
status: Fix Committed → Confirmed
Chris Glass (tribaal) wrote :

Marking the charm task as confirmed again since we are still seeing failures.

Changed in ceph-radosgw (Juju Charms Collection):
status: Fix Released → Confirmed
Andreas Hasenack (ahasenack) wrote :

We are still seeing this with cs:trusty/ceph-mon-1 and cs:trusty/ceph-radosgw-239. Logs attached for ceph-{mon,osd,rados,keystone}*. I have logs for all other units as well if needed.

Changed in landscape:
importance: Undecided → High
James Page (james-page) on 2016-05-31
Changed in ceph-radosgw (Juju Charms Collection):
assignee: James Page (james-page) → nobody
Chris Holcombe (xfactor973) wrote :

Andreas, what's the repro for this? I tried deploying 3x ceph-mon, 1x ceph-osd and 1x ceph-radosgw and the radosgw waited until 2 more ceph-osds were added the cluster went healthy. I'm curious how landscape is deploying and how that is still hitting the bug.

I don't have a repro. We deploy, and most of the time with ceph/ceph it
happens. All the data we have is in the attached logs. We have logs for
other services in the deployment if you want. Uncompressed in total it's
about 2 gigabytes.
On May 31, 2016 19:21, "Chris Holcombe" <email address hidden> wrote:

> Andreas, what's the repro for this? I tried deploying 3x ceph-mon, 1x
> ceph-osd and 1x ceph-radosgw and the radosgw waited until 2 more ceph-
> osds were added the cluster went healthy. I'm curious how landscape is
> deploying and how that is still hitting the bug.
>
> --
> You received this bug notification because you are subscribed to
> Landscape Project.
> Matching subscriptions: default, everything
> https://bugs.launchpad.net/bugs/1577519
>
> Title:
> ceph-radosgw "Initialization timeout, failed to initialize"
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/landscape/+bug/1577519/+subscriptions
>

Chris Holcombe (xfactor973) wrote :

Ok. Could you show me your build/deploy process? I was chatting with
beisner and he mentioned that landscape has an amulet like deploy
procedure for deploying the charms. I think that could help me figure
out how it ended up in a race condition.

On 05/31/2016 05:08 PM, Andreas Hasenack wrote:
> I don't have a repro. We deploy, and most of the time with ceph/ceph it
> happens. All the data we have is in the attached logs. We have logs for
> other services in the deployment if you want. Uncompressed in total it's
> about 2 gigabytes.
> On May 31, 2016 19:21, "Chris Holcombe" <email address hidden> wrote:
>
>> Andreas, what's the repro for this? I tried deploying 3x ceph-mon, 1x
>> ceph-osd and 1x ceph-radosgw and the radosgw waited until 2 more ceph-
>> osds were added the cluster went healthy. I'm curious how landscape is
>> deploying and how that is still hitting the bug.
>>
>> --
>> You received this bug notification because you are subscribed to
>> Landscape Project.
>> Matching subscriptions: default, everything
>> https://bugs.launchpad.net/bugs/1577519
>>
>> Title:
>> ceph-radosgw "Initialization timeout, failed to initialize"
>>
>> To manage notifications about this bug go to:
>> https://bugs.launchpad.net/landscape/+bug/1577519/+subscriptions
>>
>

Andreas Hasenack (ahasenack) wrote :

We don't use deployer or amulet. Landscape talks to the juju api natively.

We "reserve" machines beforehand via add-machine (both machines and containers) after bootstrap, and then we deploy services onto them. We also use the juju run trick twice to wait for relations to settle.

There is no simple way to emulate landscape's deployment, you will have to use it yourself I guess, or figure out the ordering of events via the log files.

If these logs are not enough, we can enable charm debugging if that would help.

Chris Gregan (cgregan) on 2016-06-01
tags: added: cdo-qa-blocker
Chris Holcombe (xfactor973) wrote :

I'm not exactly sure at the moment what I'll need to get this race
condition to expose itself. Let me give it some thought and take a
closer look at the logs.

On 06/01/2016 09:41 AM, Andreas Hasenack wrote:
> We don't use deployer or amulet. Landscape talks to the juju api
> natively.
>
> We "reserve" machines beforehand via add-machine (both machines and
> containers) after bootstrap, and then we deploy services onto them. We
> also use the juju run trick twice to wait for relations to settle.
>
> There is no simple way to emulate landscape's deployment, you will have
> to use it yourself I guess, or figure out the ordering of events via the
> log files.
>
> If these logs are not enough, we can enable charm debugging if that
> would help.
>

Ryan Beisner (1chb1n) wrote :

@Chris H. - deploy process @ https://bazaar.launchpad.net/~landscape/landscape/release-43/view/head:/canonical/landscape/model/openstack/services.py

Also fyi:

<blackboxsw> beisner, confirmed. that link you sent is basically charms we deploy and config params we set for those charms. The charm versions are sourced from one of the files based on whether you are deploying trusty (default) or xenial. https://bazaar.launchpad.net/~landscape/landscape/release-43/view/head:/canonical/landscape/model/openstack/charms/trusty-stable.json or xenial-stable.json

Ryan Beisner (1chb1n) wrote :

@Chris H. - also, James describes how he reproduced it @ https://bugs.launchpad.net/charms/+source/ceph-radosgw/+bug/1577519/comments/10

Ryan Beisner (1chb1n) wrote :

FTR, landscape logs show that this is Trusty-Mitaka

2016-05-28 16:40:35 INFO install Ign http://ubuntu-cloud.archive.canonical.com trusty-updates/mitaka InRelease

Reviewed: https://review.openstack.org/328374
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-radosgw/commit/?id=3bacc91230dbc00305d0fac677b4022bda58e350
Submitter: Jenkins
Branch: master

commit 3bacc91230dbc00305d0fac677b4022bda58e350
Author: Chris Holcombe <email address hidden>
Date: Fri Jun 10 09:26:33 2016 -0700

    Increase rgw init timeout to 1200 seconds

    This changeset increases the initialization timeout
    for rgw instances from 300 -> 1200 seconds.

    This change decreases the chance that the rgw instance
    will timeout prior to OSD's actually joing the Ceph storage
    cluster as usable storage is required for this operation.

    Change-Id: I6c5442edc2fb25ff37d7a4bd0bc49aabd6f2d24c
    Closes-Bug: 1577519

Changed in ceph-radosgw (Juju Charms Collection):
status: Confirmed → Fix Committed

Reviewed: https://review.openstack.org/331197
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-radosgw/commit/?id=21b441810da0fd34ba6f304b23bc864767f63025
Submitter: Jenkins
Branch: stable/16.04

commit 21b441810da0fd34ba6f304b23bc864767f63025
Author: Chris Holcombe <email address hidden>
Date: Fri Jun 10 09:26:33 2016 -0700

    Increase rgw init timeout to 1200 seconds

    This changeset increases the initialization timeout
    for rgw instances from 300 -> 1200 seconds.

    This change decreases the chance that the rgw instance
    will timeout prior to OSD's actually joing the Ceph storage
    cluster as usable storage is required for this operation.

    Change-Id: I6c5442edc2fb25ff37d7a4bd0bc49aabd6f2d24c
    Closes-Bug: 1577519
    (cherry picked from commit 3bacc91230dbc00305d0fac677b4022bda58e350)

Andreas Hasenack (ahasenack) wrote :

Bumped radosgw trusty charm revision to 241 and xenial to 4 in landscape/16.06 r10084.

Changed in landscape:
status: Confirmed → Fix Committed
Andreas Hasenack (ahasenack) wrote :

Unfortunately we are still seeing this with cs:trusty/ceph-radosgw-241 which is that last fix (a timeout bump). The evidence:
- juju status shows:
    service-status: {current: blocked, message: 'Services not running that should
        be: radosgw', since: '28 Jun 2016 05:26:55Z'}
- last entry in radosgw.log in all rados units:
$ tail ceph-radosgw-{0,1,2}/var/log/ceph/radosgw.log -n 1
==> ceph-radosgw-0/var/log/ceph/radosgw.log <==
2016-06-28 04:40:41.097768 7f4188335700 -1 Initialization timeout, failed to initialize

==> ceph-radosgw-1/var/log/ceph/radosgw.log <==
2016-06-28 04:40:30.564641 7fabb583f700 -1 Initialization timeout, failed to initialize

==> ceph-radosgw-2/var/log/ceph/radosgw.log <==
2016-06-28 04:40:30.608160 7fd42d3e5700 -1 Initialization timeout, failed to initialize

Process listing on all radosgw units confirms no rados process running (file /var/log/ps-fauxww.txt created by our log collector script)

I'm attaching a tarball with /var/log/* from the rados units, and next I will attach juju status.

Changed in ceph-radosgw (Juju Charms Collection):
status: Fix Committed → Confirmed
Andreas Hasenack (ahasenack) wrote :
Andreas Hasenack (ahasenack) wrote :

The logs confirmed it waited for 20min:
2016-06-28 04:20:30.607877 7fd4488d4900 0 pidfile_write: ignore empty --pid-file
2016-06-28 04:40:30.608160 7fd42d3e5700 -1 Initialization timeout, failed to initialize

Timeout setting:
$ grep timeout ceph-radosgw-{0,1,2}/etc/ceph/ceph.conf
ceph-radosgw-0/etc/ceph/ceph.conf:rgw init timeout = 1200
ceph-radosgw-1/etc/ceph/ceph.conf:rgw init timeout = 1200
ceph-radosgw-2/etc/ceph/ceph.conf:rgw init timeout = 1200

Andreas Hasenack (ahasenack) wrote :

ceph mon and osd from the same deployment

Andreas Hasenack (ahasenack) wrote :
Andreas Hasenack (ahasenack) wrote :
Liam Young (gnuoy) on 2016-07-29
Changed in ceph-radosgw (Juju Charms Collection):
milestone: 16.07 → 16.10
Changed in landscape:
status: Fix Committed → Fix Released
James Page (james-page) on 2016-10-14
Changed in ceph-radosgw (Juju Charms Collection):
milestone: 16.10 → 17.01
heyao1992 (heyao2) wrote :

i meet the same problem,and now is there some solution?

heyao1992 (heyao2) wrote :

what situation will hit this bug, and what situation will avoid this bug?

James Page (james-page) on 2017-02-23
Changed in charm-ceph-radosgw:
importance: Undecided → High
status: New → Confirmed
Changed in ceph-radosgw (Juju Charms Collection):
status: Confirmed → Invalid
James Page (james-page) on 2018-11-20
Changed in charm-ceph-radosgw:
milestone: none → 19.04
James Page (james-page) wrote :

This issue should no longer occur as the ceph-mon cluster will not provide keys until a minimal set of viable OSD's have been booted into the cluster which ensures the radosgw process can write to the pools that have been created at the point of first configuration.

Changed in charm-ceph-radosgw:
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers