Juju Charms Collection
ceph-radosgw package

ceph-radosgw "Initialization timeout, failed to initialize"

Bug #1577519 reported by Francis Ginther on 2016-05-02

This bug affects 8 people

	Status	Importance	Assigned to	Milestone
Ceph RADOS Gateway Charm	Fix Released	High	Unassigned	Ceph RADOS Gateway Charm 19.04
Landscape Server	Fix Released	High	Unassigned	Landscape Server 16.07
16.06	Fix Released	High	Andreas Hasenack	Landscape Server 16.06
ceph-radosgw (Juju Charms Collection)	Invalid	High	Unassigned	Juju Charms Collection 17.01

Bug Description

Had a landscape OSA deployment fail (ceph object and ceph block storage). Juju status indicated that radosgw was not running. Looking at the logs:

[/var/log/ceph/radosgw.log]
2016-05-02 15:57:58.635872 7fabea0d37c0 0 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403), process radosgw, pid 14235
2016-05-02 15:57:58.644866 7fabea0d37c0 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
2016-05-02 15:57:58.644943 7fabea0d37c0 0 librados: client.radosgw.gateway initialization error (2) No such file or directory
2016-05-02 15:57:58.645304 7fabea0d37c0 -1 Couldn't init storage provider (RADOS)
2016-05-02 15:57:58.914016 7f02f61517c0 0 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403), process radosgw, pid 14299
2016-05-02 15:57:58.926124 7f02f61517c0 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
2016-05-02 15:57:58.926126 7f02f61517c0 0 librados: client.radosgw.gateway initialization error (2) No such file or directory
2016-05-02 15:57:58.928791 7f02f61517c0 -1 Couldn't init storage provider (RADOS)
2016-05-02 15:58:33.891675 7fbd500bf7c0 0 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403), process radosgw, pid 15500
2016-05-02 15:58:34.120087 7f923d49c7c0 0 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403), process radosgw, pid 15569
2016-05-02 16:03:34.120903 7f9232ee3700 -1 Initialization timeout, failed to initialize

The juju logs show mon-relation-changed activity ended at 15:58:37:
[/var/log/juju/unit-ceph-radosgw-0.log]
2016-05-02 15:58:33 INFO mon-relation-changed creating /etc/ceph/keyring.rados.gateway
2016-05-02 15:58:33 INFO mon-relation-changed added entity client.radosgw.gateway auth auth(auid = 18446744073709551615 key=AQCPeCdXn5MzCxAAbfu863TJF5LV4qSU5TjIhg== with 0 caps)
2016-05-02 15:58:33 INFO mon-relation-changed No /usr/bin/radosgw found running; none killed.
2016-05-02 15:58:33 INFO mon-relation-changed Starting client.radosgw.gateway...
2016-05-02 15:58:34 INFO mon-relation-changed Starting client.radosgw.gateway...
2016-05-02 15:58:37 INFO mon-relation-changed ^MReading package lists... 0%^M^MReading package lists... [snip]
2016-05-02 15:58:37 INFO mon-relation-changed ^MBuilding dependency tree... [snip]
2016-05-02 15:58:37 INFO mon-relation-changed ^MReading state information... 0%^M^MReading state information... 0%^M^MReading state information... Done
2016-05-02 15:58:37 INFO juju-log mon:33: Unit is ready
2016-05-02 15:58:37 INFO juju-log mon:33: Unit is ready

I was able to manually start radosgw after I noticed the failure. Not sure why it failed to initialize.

Tags:

Revision history for this message

Francis Ginther (fginther) wrote on 2016-05-12:

Found another case of this "Initialization timeout, failed to initialize" error with 10.2.0.

/var/log/ceph/radosgw.log: https://pastebin.canonical.com/156342/

Revision history for this message

Francis Ginther (fginther) wrote on 2016-05-12:

radosgw.log Edit (2.8 KiB, text/plain)

/var/log/ceph/radosgw.log

Revision history for this message

Francis Ginther (fginther) wrote on 2016-05-12:

juju.tgz Edit (1.8 MiB, application/x-tar)

/var/log/juju/* from failure with 10.2.0.

Revision history for this message

Francis Ginther (fginther) wrote on 2016-05-12:

ceph-radosgw.juju-get Edit (8.2 KiB, text/plain)

"juju get ceph-radosgw" from failure with 10.2.0.

Revision history for this message

Francis Ginther (fginther) wrote on 2016-05-12:

Failures with 10.2.0 was using charm: cs:trusty/ceph-radosgw-239

Andreas Hasenack (ahasenack) on 2016-05-12

tags:

added: kanban-cross-team

🤖 Landscape Builder (landscape-builder) on 2016-05-12

tags:

removed: kanban-cross-team

Revision history for this message

James Page (james-page) wrote on 2016-05-12:

Are you using the new ceph-mon charm? I'm wondering whether radosgw can't init itself if the ceph-mon charm has provided a key, but there is no OSD storage at the point of startup.

Revision history for this message

Francis Ginther (fginther) wrote on 2016-05-12:

James, we're using "cs:trusty/ceph-mon-0"

Revision history for this message

Francis Ginther (fginther) wrote on 2016-05-13:

ceph-radosgw.tgz Edit (986.6 KiB, application/x-tar)

Another case of this. Attaching /var/log/[ceph|juju]/*

Revision history for this message

Francis Ginther (fginther) wrote on 2016-05-16:

ceph-failed-int.tgz Edit (10.9 MiB, application/x-tar)

Yet another case, ceph-radosgw failed to initializes on all three units. Logs contains /var/log/[juju|ceph]/* on all ceph-* units.

Revision history for this message

James Page (james-page) wrote on 2016-05-16:

#10

I was able to reproduce this by deploying ceph-mon (3 units) with ceph-radosgw (single unit).

After 5 minutes, the radosgw process dies with the following error message:

2016-05-16 12:56:11.743599 7ff7d1acb700 -1 Initialization timeout, failed to initialize

Basically, because no OSD's appear in the cluster in that initial window, the radosgw never completes its initialization and shuts itself down.

radosgw is dependent on OSD storage for state - so kinda understandable.

Revision history for this message

James Page (james-page) wrote on 2016-05-16:

#11

Default in Jewel:

common/config_opts.h:OPTION(rgw_init_timeout, OPT_INT, 300) // time in seconds

James Page (james-page) on 2016-05-16

Changed in ceph-radosgw (Juju Charms Collection):
status:	New → Confirmed
importance:	Undecided → Medium
milestone:	none → 16.07

James Page (james-page) on 2016-05-16

Changed in ceph-radosgw (Juju Charms Collection):
importance:	Medium → High

James Page (james-page) on 2016-05-17

Changed in ceph-radosgw (Juju Charms Collection):
assignee:	nobody → James Page (james-page)
status:	Confirmed → In Progress

Revision history for this message

James Page (james-page) wrote on 2016-05-17:

#12

Proposed fix published here:

cs:~james-page/xenial/ceph-mon-gated-0

Testing locally with my reproducer.

Revision history for this message

James Page (james-page) wrote on 2016-05-17:

#13

https://review.openstack.org/#/c/317313/

James Page (james-page) on 2016-05-20

Changed in ceph-radosgw (Juju Charms Collection):
status:	In Progress → Fix Committed

James Page (james-page) on 2016-05-20

Changed in ceph-radosgw (Juju Charms Collection):
status:	Fix Committed → Fix Released

Francis Ginther (fginther) on 2016-05-25

Changed in landscape:
status:	New → Fix Committed
milestone:	none → 16.06

Andreas Hasenack (ahasenack) on 2016-05-30

Changed in landscape:
status:	Fix Committed → Confirmed

Revision history for this message

Chris Glass (tribaal) wrote on 2016-05-30:

#14

Marking the charm task as confirmed again since we are still seeing failures.

Changed in ceph-radosgw (Juju Charms Collection):
status:	Fix Released → Confirmed

Revision history for this message

Andreas Hasenack (ahasenack) wrote on 2016-05-30:

#15

logs-28may2016.tar.bz2 Edit (19.2 MiB, application/x-tar)

We are still seeing this with cs:trusty/ceph-mon-1 and cs:trusty/ceph-radosgw-239. Logs attached for ceph-{mon,osd,rados,keystone}*. I have logs for all other units as well if needed.

Changed in landscape:
importance:	Undecided → High

James Page (james-page) on 2016-05-31

Changed in ceph-radosgw (Juju Charms Collection):
assignee:	James Page (james-page) → nobody

Revision history for this message

Chris Holcombe (xfactor973) wrote on 2016-05-31:

#16

Andreas, what's the repro for this? I tried deploying 3x ceph-mon, 1x ceph-osd and 1x ceph-radosgw and the radosgw waited until 2 more ceph-osds were added the cluster went healthy. I'm curious how landscape is deploying and how that is still hitting the bug.

Revision history for this message

Andreas Hasenack (ahasenack) wrote on 2016-06-01: Re: [Bug 1577519] Re: ceph-radosgw "Initialization timeout, failed to initialize"

#17

I don't have a repro. We deploy, and most of the time with ceph/ceph it
happens. All the data we have is in the attached logs. We have logs for
other services in the deployment if you want. Uncompressed in total it's
about 2 gigabytes.
On May 31, 2016 19:21, "Chris Holcombe" <email address hidden> wrote:

> Andreas, what's the repro for this? I tried deploying 3x ceph-mon, 1x
> ceph-osd and 1x ceph-radosgw and the radosgw waited until 2 more ceph-
> osds were added the cluster went healthy. I'm curious how landscape is
> deploying and how that is still hitting the bug.
>
> --
> You received this bug notification because you are subscribed to
> Landscape Project.
> Matching subscriptions: default, everything
> https://bugs.launchpad.net/bugs/1577519
>
> Title:
> ceph-radosgw "Initialization timeout, failed to initialize"
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/landscape/+bug/1577519/+subscriptions
>

Revision history for this message

Chris Holcombe (xfactor973) wrote on 2016-06-01:

#18

Ok. Could you show me your build/deploy process? I was chatting with
beisner and he mentioned that landscape has an amulet like deploy
procedure for deploying the charms. I think that could help me figure
out how it ended up in a race condition.

On 05/31/2016 05:08 PM, Andreas Hasenack wrote:
> I don't have a repro. We deploy, and most of the time with ceph/ceph it
> happens. All the data we have is in the attached logs. We have logs for
> other services in the deployment if you want. Uncompressed in total it's
> about 2 gigabytes.
> On May 31, 2016 19:21, "Chris Holcombe" <email address hidden> wrote:
>
>> Andreas, what's the repro for this? I tried deploying 3x ceph-mon, 1x
>> ceph-osd and 1x ceph-radosgw and the radosgw waited until 2 more ceph-
>> osds were added the cluster went healthy. I'm curious how landscape is
>> deploying and how that is still hitting the bug.
>>
>> --
>> You received this bug notification because you are subscribed to
>> Landscape Project.
>> Matching subscriptions: default, everything
>> https://bugs.launchpad.net/bugs/1577519
>>
>> Title:
>> ceph-radosgw "Initialization timeout, failed to initialize"
>>
>> To manage notifications about this bug go to:
>> https://bugs.launchpad.net/landscape/+bug/1577519/+subscriptions
>>
>

Revision history for this message

Andreas Hasenack (ahasenack) wrote on 2016-06-01:

#19

We don't use deployer or amulet. Landscape talks to the juju api natively.

We "reserve" machines beforehand via add-machine (both machines and containers) after bootstrap, and then we deploy services onto them. We also use the juju run trick twice to wait for relations to settle.

There is no simple way to emulate landscape's deployment, you will have to use it yourself I guess, or figure out the ordering of events via the log files.

If these logs are not enough, we can enable charm debugging if that would help.

Chris Gregan (cgregan) on 2016-06-01

tags:

added: cdo-qa-blocker

Revision history for this message

Chris Holcombe (xfactor973) wrote on 2016-06-01:

#20

I'm not exactly sure at the moment what I'll need to get this race
condition to expose itself. Let me give it some thought and take a
closer look at the logs.

On 06/01/2016 09:41 AM, Andreas Hasenack wrote:
> We don't use deployer or amulet. Landscape talks to the juju api
> natively.
>
> We "reserve" machines beforehand via add-machine (both machines and
> containers) after bootstrap, and then we deploy services onto them. We
> also use the juju run trick twice to wait for relations to settle.
>
> There is no simple way to emulate landscape's deployment, you will have
> to use it yourself I guess, or figure out the ordering of events via the
> log files.
>
> If these logs are not enough, we can enable charm debugging if that
> would help.
>

Revision history for this message

Ryan Beisner (1chb1n) wrote on 2016-06-03:

#21

@Chris H. - deploy process @ https://bazaar.launchpad.net/~landscape/landscape/release-43/view/head:/canonical/landscape/model/openstack/services.py

Also fyi:

<blackboxsw> beisner, confirmed. that link you sent is basically charms we deploy and config params we set for those charms. The charm versions are sourced from one of the files based on whether you are deploying trusty (default) or xenial. https://bazaar.launchpad.net/~landscape/landscape/release-43/view/head:/canonical/landscape/model/openstack/charms/trusty-stable.json or xenial-stable.json

Revision history for this message

Ryan Beisner (1chb1n) wrote on 2016-06-03:

#22

@Chris H. - also, James describes how he reproduced it @ https://bugs.launchpad.net/charms/+source/ceph-radosgw/+bug/1577519/comments/10

Revision history for this message

Ryan Beisner (1chb1n) wrote on 2016-06-03:

#23

FTR, landscape logs show that this is Trusty-Mitaka

2016-05-28 16:40:35 INFO install Ign http://ubuntu-cloud.archive.canonical.com trusty-updates/mitaka InRelease

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-06-17: Fix merged to charm-ceph-radosgw (master)

#24

Reviewed: https://review.openstack.org/328374
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-radosgw/commit/?id=3bacc91230dbc00305d0fac677b4022bda58e350
Submitter: Jenkins
Branch: master

commit 3bacc91230dbc00305d0fac677b4022bda58e350
Author: Chris Holcombe <email address hidden>
Date: Fri Jun 10 09:26:33 2016 -0700

Increase rgw init timeout to 1200 seconds

This changeset increases the initialization timeout
for rgw instances from 300 -> 1200 seconds.

    This change decreases the chance that the rgw instance
    will timeout prior to OSD's actually joing the Ceph storage
    cluster as usable storage is required for this operation.

Change-Id: I6c5442edc2fb25ff37d7a4bd0bc49aabd6f2d24c
Closes-Bug: 1577519

Changed in ceph-radosgw (Juju Charms Collection):
status:	Confirmed → Fix Committed

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-06-17: Fix proposed to charm-ceph-radosgw (stable/16.04)

#25

Fix proposed to branch: stable/16.04
Review: https://review.openstack.org/331197

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-06-17: Fix merged to charm-ceph-radosgw (stable/16.04)

#26

Reviewed: https://review.openstack.org/331197
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-radosgw/commit/?id=21b441810da0fd34ba6f304b23bc864767f63025
Submitter: Jenkins
Branch: stable/16.04

commit 21b441810da0fd34ba6f304b23bc864767f63025
Author: Chris Holcombe <email address hidden>
Date: Fri Jun 10 09:26:33 2016 -0700

Increase rgw init timeout to 1200 seconds

This changeset increases the initialization timeout
for rgw instances from 300 -> 1200 seconds.

    This change decreases the chance that the rgw instance
    will timeout prior to OSD's actually joing the Ceph storage
    cluster as usable storage is required for this operation.

    Change-Id: I6c5442edc2fb25ff37d7a4bd0bc49aabd6f2d24c
    Closes-Bug: 1577519
    (cherry picked from commit 3bacc91230dbc00305d0fac677b4022bda58e350)

Revision history for this message

Andreas Hasenack (ahasenack) wrote on 2016-06-22:

#27

Bumped radosgw trusty charm revision to 241 and xenial to 4 in landscape/16.06 r10084.

Changed in landscape:
status:	Confirmed → Fix Committed

Revision history for this message

Andreas Hasenack (ahasenack) wrote on 2016-06-30:

#28

ceph-radosgw-logs.tar.gz Edit (3.7 MiB, application/x-tar)

Unfortunately we are still seeing this with cs:trusty/ceph-radosgw-241 which is that last fix (a timeout bump). The evidence:
- juju status shows:
service-status: {current: blocked, message: 'Services not running that should
be: radosgw', since: '28 Jun 2016 05:26:55Z'}
- last entry in radosgw.log in all rados units:
$ tail ceph-radosgw-{0,1,2}/var/log/ceph/radosgw.log -n 1
==> ceph-radosgw-0/var/log/ceph/radosgw.log <==
2016-06-28 04:40:41.097768 7f4188335700 -1 Initialization timeout, failed to initialize

==> ceph-radosgw-1/var/log/ceph/radosgw.log <==
2016-06-28 04:40:30.564641 7fabb583f700 -1 Initialization timeout, failed to initialize

==> ceph-radosgw-2/var/log/ceph/radosgw.log <==
2016-06-28 04:40:30.608160 7fd42d3e5700 -1 Initialization timeout, failed to initialize

Process listing on all radosgw units confirms no rados process running (file /var/log/ps-fauxww.txt created by our log collector script)

I'm attaching a tarball with /var/log/* from the rados units, and next I will attach juju status.

Changed in ceph-radosgw (Juju Charms Collection):
status:	Fix Committed → Confirmed

Revision history for this message

Andreas Hasenack (ahasenack) wrote on 2016-06-30:

#29

juju-status.txt Edit (76.2 KiB, text/plain)

Revision history for this message

Andreas Hasenack (ahasenack) wrote on 2016-06-30:

#30

The logs confirmed it waited for 20min:
2016-06-28 04:20:30.607877 7fd4488d4900 0 pidfile_write: ignore empty --pid-file
2016-06-28 04:40:30.608160 7fd42d3e5700 -1 Initialization timeout, failed to initialize

Timeout setting:
$ grep timeout ceph-radosgw-{0,1,2}/etc/ceph/ceph.conf
ceph-radosgw-0/etc/ceph/ceph.conf:rgw init timeout = 1200
ceph-radosgw-1/etc/ceph/ceph.conf:rgw init timeout = 1200
ceph-radosgw-2/etc/ceph/ceph.conf:rgw init timeout = 1200

Revision history for this message

Andreas Hasenack (ahasenack) wrote on 2016-06-30:

#31

ceph-mon-and-osd.tar.gz Edit (19.7 MiB, application/x-tar)

ceph mon and osd from the same deployment

Revision history for this message

Andreas Hasenack (ahasenack) wrote on 2016-06-30:

#32

Landscape reference: above logs are from https://ci.lscape.net/job/landscape-system-tests/2964/artifact/

Revision history for this message

Andreas Hasenack (ahasenack) wrote on 2016-07-06:

#33

cloud-logs.tar.bz2 Edit (56.6 MiB, application/x-tar)

Another failed run: https://ci.lscape.net/job/landscape-system-tests/3045/

Logs attached.

Liam Young (gnuoy) on 2016-07-29

Changed in ceph-radosgw (Juju Charms Collection):
milestone:	16.07 → 16.10

Andreas Hasenack (ahasenack) on 2016-08-12

Changed in landscape:
status:	Fix Committed → Fix Released

James Page (james-page) on 2016-10-14

Changed in ceph-radosgw (Juju Charms Collection):
milestone:	16.10 → 17.01

Revision history for this message

heyao1992 (heyao2) wrote on 2016-10-25:

#34

i meet the same problem,and now is there some solution？

Revision history for this message

heyao1992 (heyao2) wrote on 2016-10-25:

#35

what situation will hit this bug, and what situation will avoid this bug？

James Page (james-page) on 2017-02-23

Changed in charm-ceph-radosgw:
importance:	Undecided → High
status:	New → Confirmed
Changed in ceph-radosgw (Juju Charms Collection):
status:	Confirmed → Invalid

James Page (james-page) on 2018-11-20

Changed in charm-ceph-radosgw:
milestone:	none → 19.04

Revision history for this message

James Page (james-page) wrote on 2019-01-25:

#36

This issue should no longer occur as the ceph-mon cluster will not provide keys until a minimal set of viable OSD's have been booted into the cluster which ensures the radosgw process can write to the pools that have been created at the point of first configuration.

Changed in charm-ceph-radosgw:
status:	Confirmed → Fix Released

Revision history for this message

yuqi hou (yuqihou) wrote on 2020-09-18:

#37

@James
This issue should no longer occur as the ceph-mon cluster will not provide keys until a minimal set of viable OSD's have been booted into the cluster which ensures the radosgw process can write to the pools that have been created at the point of first configuration.
---------------------
which version is ok?
v14.2.11 still occur。
ceph-deploy can deploy versions of Ceph newer than Nautilus?

Report a bug

This report contains Public information

Everyone can see this information.

Duplicates of this bug

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.

Juju Charms Collectionceph-radosgw package

ceph-radosgw "Initialization timeout, failed to initialize"

Bug Description

Duplicates of this bug

Other bug subscribers

Bug attachments

Remote bug watches

Juju Charms Collection
ceph-radosgw package