radosgw charm + restrict-ceph-pools=True missing storage pool definitions for >= Jewel

Bug #1685536 reported by Dmitrii Shcherbakov
Affects: Ceph RADOS Gateway Charm
Status: Fix Released
Importance: High
Assigned to: James Page

Bug Description

/var/log/ceph/radosgw.log:

2017-04-22 16:22:44.814777 7f5cc1027a00 0 ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f), process radosgw, pid 27795
2017-04-22 16:22:44.816025 7f5cc1027a00 0 pidfile_write: ignore empty --pid-file
2017-04-22 16:22:44.831841 7f5cc1027a00 0 error in read_id for object name: default : (2) No such file or directory
2017-04-22 16:22:44.832557 7f5cc1027a00 0 error in read_id for object name: default : (2) No such file or directory
2017-04-22 16:22:44.890265 7f5cc1027a00 -1 Couldn't init storage provider (RADOS)
2017-04-22 16:22:45.025420 7fb73f299a00 0 ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f), process radosgw, pid 27868
2017-04-22 16:22:45.026252 7fb73f299a00 0 pidfile_write: ignore empty --pid-file
2017-04-22 16:22:45.089703 7fb73f299a00 -1 Couldn't init storage provider (RADOS)

root@juju-54661f-6-lxd-1:/home/ubuntu# systemctl status radosgw
● radosgw.service - LSB: radosgw RESTful rados gateway
   Loaded: loaded (/etc/init.d/radosgw; bad; vendor preset: enabled)
   Active: active (exited) since Sat 2017-04-22 16:22:45 UTC; 1h 29min ago
     Docs: man:systemd-sysv-generator(8)
    Tasks: 0
   Memory: 0B
      CPU: 0

Apr 22 16:22:44 juju-54661f-6-lxd-1 systemd[1]: Starting LSB: radosgw RESTful rados gateway...
Apr 22 16:22:45 juju-54661f-6-lxd-1 radosgw[27845]: Starting client.radosgw.gateway...
Apr 22 16:22:45 juju-54661f-6-lxd-1 systemd[1]: Started LSB: radosgw RESTful rados gateway.

openstack catalog list
http://paste.ubuntu.com/24435031/

dmitriis@maas-master:~/ha-bundle$ openstack endpoint list
+----------------------------------+-----------+--------------+-----------------+
| ID | Region | Service Name | Service Type |
+----------------------------------+-----------+--------------+-----------------+
| 0685e08ef30645998ee8a2777e7bd263 | RegionOne | keystone | identity |
| 37b170dccd704506a9fe6eed514fe4a6 | RegionOne | nova | compute |
| 1b95358a9f184cc0982ed0b81362ab0d | RegionOne | glance | image |
| a4934d9049824cd9859daea29d4c1be5 | RegionOne | neutron | network |
| 6c84c328eb6e4c2db788a160618b0543 | RegionOne | cinderv2 | volumev2 |
| 9955493f58c6406abecb1b771a1fed94 | RegionOne | cinder | volume |
| 648b6172a0a94078a1d69972ff68d559 | RegionOne | designate | dns |
| b68cc80b5d7d4ca8b99fcd66dc0c406f | RegionOne | image-stream | product-streams |
| 7f636c5c7d33478ebaabfb57792956ad | RegionOne | placement | placement |
+----------------------------------+-----------+--------------+-----------------+

juju status
http://paste.ubuntu.com/24435045/

series: xenial
variables:
  worker-multiplier: &worker-multiplier .2
  # Common
  openstack-origin: &openstack-origin cloud:xenial-ocata
...
  ceph-radosgw:
    charm: cs:xenial/ceph-radosgw
    num_units: 3
    bindings:
      "": *oam-space
      public: *public-space
      admin: *admin-space
      internal: *internal-space
    options:
      source: *openstack-origin
      vip: *rados-gateway-vip
      region: *openstack-region
      restrict-ceph-pools: True
    to:
    - 'lxd:storage/0'
    - 'lxd:storage/1'
    - 'lxd:storage/2'

As a result, glance-simple-streams-sync doesn't work. It uses a Swift endpoint by default and doesn't report much (I probably haven't waited long enough for it to time out).

The only indicator you get is an empty reply from a specific node or from the VIP:

dmitriis@maas-master:~/ha-bundle$ curl http://10.0.6.210/swift/v1
curl: (52) Empty reply from server

# vip
dmitriis@maas-master:~/ha-bundle$ curl http://10.0.6.210/swift/v1
curl: (52) Empty reply from server
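
A quick way to confirm that the gateway itself is down behind the empty reply (illustrative commands on a radosgw unit; not part of the original report):

# is a radosgw process running at all?
pgrep -a radosgw || echo "no radosgw process"
# is anything listening for the gateway frontend?
ss -tlnp | grep -i rados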

After a couple of restart attempts I found out that the rados gateway simply exits with status 0:

strace -f -o /tmp/strace.log -s 2048 -p 1 & systemctl restart radosgw.service
# found the pid from the error message in the log, then grepped the strace output for it
grep 61930 /tmp/strace.log

...
61930 <... futex resumed> ) = 0
61930 write(2, "2017-04-22 18:22:05.101659 7f334fa58a00 -1 Couldn't init storage provider (RADOS)", 81 <unfinished ...>
61930 <... write resumed> ) = -1 EBADF (Bad file descriptor)
61930 write(4, "2017-04-22 18:22:05.101659 7f334fa58a00 -1 Couldn't init storage provider (RADOS)\n", 82 <unfinished ...>
61930 <... write resumed> ) = 82
61930 futex(0x55e95cf3738c, FUTEX_WAIT_PRIVATE, 339, NULL <unfinished ...>
61930 <... futex resumed> ) = 0
61930 futex(0x55e95cf37308, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
61930 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
61930 futex(0x55e95cf37308, FUTEX_WAKE_PRIVATE, 1) = 0
61930 futex(0x55e95cf3738c, FUTEX_WAIT_PRIVATE, 341, NULL <unfinished ...>
61930 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
61930 futex(0x55e95cf37308, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
61930 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
61930 futex(0x55e95cf37308, FUTEX_WAKE_PRIVATE, 1) = 0
61930 futex(0x55e95cf37308, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
61930 <... futex resumed> ) = 0
61930 futex(0x55e95cf3738c, FUTEX_WAIT_PRIVATE, 343, NULL <unfinished ...>
61930 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
61930 futex(0x55e95cf37308, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
61930 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
61930 futex(0x55e95cf37308, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
61930 <... futex resumed> ) = 0
61930 futex(0x55e95cf3738c, FUTEX_WAIT_PRIVATE, 345, NULL <unfinished ...>
61930 <... futex resumed> ) = 0
61930 futex(0x55e95cf37308, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
61930 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
61930 futex(0x55e95cf37308, FUTEX_WAKE_PRIVATE, 1) = 0
61930 futex(0x55e95cf3738c, FUTEX_WAIT_PRIVATE, 347, NULL <unfinished ...>
61930 <... futex resumed> ) = 0
61930 futex(0x55e95cf37308, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
61930 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
61930 futex(0x55e95cf37308, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
61929 futex(0x7f333489b9d0, FUTEX_WAIT, 61930, NULL <unfinished ...>
61930 <... futex resumed> ) = 0
61930 madvise(0x7f333409b000, 8364032, MADV_DONTNEED) = 0
61930 exit(0) = ?
61930 +++ exited with 0 +++

---

root@juju-54661f-6-lxd-1:/home/ubuntu# systemctl status radosgw.service
● radosgw.service - LSB: radosgw RESTful rados gateway
   Loaded: loaded (/etc/init.d/radosgw; bad; vendor preset: enabled)
   Active: active (exited) since Sat 2017-04-22 18:22:05 UTC; 31min ago
     Docs: man:systemd-sysv-generator(8)
  Process: 61891 ExecStop=/etc/init.d/radosgw stop (code=exited, status=0/SUCCESS) # <<-----
  Process: 61904 ExecStart=/etc/init.d/radosgw start (code=exited, status=0/SUCCESS) # <<----

Apr 22 18:22:04 juju-54661f-6-lxd-1 systemd[1]: Starting LSB: radosgw RESTful rados gateway...
Apr 22 18:22:04 juju-54661f-6-lxd-1 radosgw[61904]: Starting client.radosgw.gateway...
Apr 22 18:22:05 juju-54661f-6-lxd-1 systemd[1]: Started LSB: radosgw RESTful rados gateway.

root@juju-54661f-6-lxd-1:/home/ubuntu# ps 61891
    PID TTY STAT TIME COMMAND
root@juju-54661f-6-lxd-1:/home/ubuntu# ps 61904
    PID TTY STAT TIME COMMAND

So, ideally, radosgw should produce a non-zero exit code on this failure, but it does not, which leaves the systemd unit shown as active.
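
A minimal check that exposes the mismatch (illustrative commands, not output from the original session; systemd treats the LSB init script's exit 0 as success even though the daemon itself has already exited):

systemctl is-active radosgw.service   # still reports "active"
pgrep -x radosgw >/dev/null && echo "radosgw running" || echo "radosgw not running"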

root@juju-54661f-6-lxd-1:/home/ubuntu# dpkg -l 'rados*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-=========================================-=========================-=========================-========================================================================================
ii radosgw 10.2.6-0ubuntu0.16.04.1 amd64 REST gateway for RADOS distributed object store

I could not find the root cause yet, but this behaviour deserves a bug (here and probably against the radosgw package as well).

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

It was pretty simple in the end: no relation with keystone.

/var/log/ceph/ceph-client.admin.log
2017-04-22 19:49:51.217395 7fadbd998a00 -1 auth: unable to find a keyring on /var/lib/ceph/radosgw/-admin/keyring: (2) No such file or directory
2017-04-22 19:49:51.217419 7fadbd998a00 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
2017-04-22 19:49:51.217421 7fadbd998a00 0 librados: client.admin initialization error (2) No such file or directory
2017-04-22 19:49:51.217908 7fadbd998a00 -1 Couldn't init storage provider (RADOS)

However, the lack of an error condition is something to address.
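
For reference, the missing relation can be added with something like this (a sketch; the application names are assumed to match the bundle above):

juju add-relation ceph-radosgw keystone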

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

On #1 - not quite. The endpoints appeared in keystone after creating the relation, but the 'Couldn't init storage provider' error was still there. After switching openstack-origin from cloud:xenial-ocata to 'distro' and redeploying, the problem disappeared. I was able to reproduce the same issue on Ocata on two different hardware and network setups (which are otherwise 'all green').

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

So it appears that commenting out restrict-ceph-pools for ceph-radosgw is what affects the "Couldn't init storage provider" issue - not openstack-origin.

"restrict-ceph-pools: True"

Changing it via `juju config ...` doesn't help for some reason, but a redeploy does.
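
For reference, the toggle in question can be changed at runtime with something like this (a sketch; application name as in the bundle above):

juju config ceph-radosgw restrict-ceph-pools=false
# in my case this did not take effect; only a redeploy did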

Revision history for this message
James Page (james-page) wrote :

"restrict-ceph-pools: True" will limit the permissions of the radosgw key to the pools that it requests are created via its relation to ceph-mon; if a pool is missed, the radosgw process will try to create it but it won't be able to as the key does not have that level of permission when this feature is enabled.

Revision history for this message
James Page (james-page) wrote :

(FTR you're using exactly the same version of ceph whether you have the xenial-ocata UCA enabled or not, as we've not shipped a new stable ceph version since xenial - so the charms will just pick up xenial's versions even if you enable the UCA).

Revision history for this message
James Page (james-page) wrote : Re: radosgw charm + restrict-ceph-pools=True missing storage pool definitions.

Just checking this one out now.

summary: - radosgw charm doesn't report an error even though a swift endpoint is
- not present due to storage provider init failure
+ radosgw charm + restrict-ceph-pools=True missing storage pool
+ definitions.
Changed in charm-ceph-radosgw:
status: New → Triaged
importance: Undecided → Medium
milestone: none → 18.02
Revision history for this message
James Page (james-page) wrote :

Confirmed with Ceph Jewel; trying to figure out which pool we're missing.

Revision history for this message
James Page (james-page) wrote :

Debug log output:

2017-09-21 13:36:30.723984 7f69f6197a00 0 ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185), process radosgw, pid 5514
2017-09-21 13:36:30.726302 7f69f6197a00 0 pidfile_write: ignore empty --pid-file
2017-09-21 13:36:30.741769 7f69f6197a00 0 librados: client.radosgw.gateway authentication error (1) Operation not permitted
2017-09-21 13:36:30.744773 7f69f6197a00 -1 Couldn't init storage provider (RADOS)
2017-09-21 13:36:30.984016 7f71898f4a00 0 ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185), process radosgw, pid 5577
2017-09-21 13:36:30.986986 7f71898f4a00 0 pidfile_write: ignore empty --pid-file
2017-09-21 13:36:31.000665 7f71898f4a00 0 librados: client.radosgw.gateway authentication error (1) Operation not permitted
2017-09-21 13:36:31.001245 7f71898f4a00 -1 Couldn't init storage provider (RADOS)
2017-09-21 13:39:10.903673 7fea9672fa00 0 ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185), process radosgw, pid 5785
2017-09-21 13:39:10.904933 7fea9672fa00 0 pidfile_write: ignore empty --pid-file
2017-09-21 13:39:10.925968 7fea9672fa00 0 librados: client.radosgw.gateway authentication error (1) Operation not permitted
2017-09-21 13:39:10.926743 7fea9672fa00 -1 Couldn't init storage provider (RADOS)
2017-09-21 13:39:48.910053 7f22ab346a00 0 ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185), process radosgw, pid 5884
2017-09-21 13:39:48.910639 7f22ab346a00 0 pidfile_write: ignore empty --pid-file
2017-09-21 13:39:48.932077 7f22ab346a00 20 get_system_obj_state: rctx=0x7ffd75b47970 obj=.rgw.root:default.realm state=0x55f77e9709a8 s->prefetch_data=0
2017-09-21 13:39:48.932217 7f22857fa700 2 RGWDataChangesLog::ChangesRenewThread: start
2017-09-21 13:39:48.940098 7f22ab346a00 20 get_system_obj_state: rctx=0x7ffd75b46970 obj=.rgw.root:converted state=0x55f77e970918 s->prefetch_data=0
2017-09-21 13:39:48.942256 7f22ab346a00 20 get_system_obj_state: rctx=0x7ffd75b45d70 obj=.rgw.root:default.realm state=0x55f77e975e78 s->prefetch_data=0
2017-09-21 13:39:48.944881 7f22ab346a00 10 could not read realm id: (2) No such file or directory
2017-09-21 13:39:48.946228 7f22ab346a00 20 RGWRados::pool_iterate: got zone_info.9693fd82-fc21-4479-8352-5758ba91004c
2017-09-21 13:39:48.947737 7f22ab346a00 20 RGWRados::pool_iterate: got zonegroup_info.d83bce5e-fb30-43d4-86e1-b3b137d87df3
2017-09-21 13:39:48.947756 7f22ab346a00 20 RGWRados::pool_iterate: got zone_names.default
2017-09-21 13:39:48.947759 7f22ab346a00 20 RGWRados::pool_iterate: got zonegroups_names.default
2017-09-21 13:39:48.947827 7f22ab346a00 20 get_system_obj_state: rctx=0x7ffd75b45ff0 obj=.rgw.root:zone_names.default state=0x55f77e977338 s->prefetch_data=0
2017-09-21 13:39:48.950924 7f22ab346a00 20 get_system_obj_state: s->obj_tag was set empty
2017-09-21 13:39:48.950945 7f22ab346a00 20 rados->read ofs=0 len=524288
2017-09-21 13:39:48.952095 7f22ab346a00 20 rados->read r=0 bl.length=46
2017-09-21 13:39:48.952144 7f22ab346a00 20 get_system_obj_state: rctx=0x7ffd75b45ff0 obj=.rgw.root:zone_info.9693fd82-fc21-4479-8352-5758ba91004c state=0x55f77e977338 s->prefetch_data=0
2017-09-21 13:39:48.953436 7f22ab346a00 20 get_system_...

Revision history for this message
James Page (james-page) wrote :

If I scrub all the rgw pools and let the radosgw process create its own, it creates:

.rgw.root
default.rgw.control
default.rgw.data.root
default.rgw.gc
default.rgw.log
default.rgw.users.uid

which seems at odds with the documentation for Jewel.

Revision history for this message
James Page (james-page) wrote :

(and with what the charm requests to be set up)
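
The two lists can be compared directly on a ceph-mon unit (illustrative command):

sudo ceph osd lspools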

Revision history for this message
James Page (james-page) wrote :

The radosgw process creates the following additional pools over and above those created by the charm:

default.rgw.data.root
default.rgw.log
default.rgw.users.uid

If the permissions are restricted, the gateway can't do this, so it fails to init.

This is not in line with the upstream docs:

http://docs.ceph.com/docs/jewel/radosgw/config/
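
Until the charm requests the correct pools, a manual workaround along these lines should be possible (a sketch only: the pg count of 8 and the mon cap of 'allow rw' are assumptions, and `ceph auth caps` replaces existing caps, so the osd cap string must list every pool the gateway key needs, not just the missing ones):

# create the pools the restricted key cannot create itself
sudo ceph osd pool create default.rgw.data.root 8
sudo ceph osd pool create default.rgw.log 8
sudo ceph osd pool create default.rgw.users.uid 8
# re-grant the gateway key access; the pool list below is the six from the
# comment above and must be extended with any other pools the charm already
# granted to this key
sudo ceph auth caps client.radosgw.gateway mon 'allow rw' \
    osd 'allow rwx pool=.rgw.root, allow rwx pool=default.rgw.control, allow rwx pool=default.rgw.gc, allow rwx pool=default.rgw.data.root, allow rwx pool=default.rgw.log, allow rwx pool=default.rgw.users.uid'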

James Page (james-page)
Changed in charm-ceph-radosgw:
importance: Medium → High
summary: radosgw charm + restrict-ceph-pools=True missing storage pool
- definitions.
+ definitions for >= Jewel
Ryan Beisner (1chb1n)
Changed in charm-ceph-radosgw:
milestone: 18.02 → 18.05
David Ames (thedac)
Changed in charm-ceph-radosgw:
milestone: 18.05 → 18.08
James Page (james-page)
Changed in charm-ceph-radosgw:
milestone: 18.08 → 18.11
David Ames (thedac)
Changed in charm-ceph-radosgw:
milestone: 18.11 → 19.04
James Page (james-page)
Changed in charm-ceph-radosgw:
status: Triaged → In Progress
assignee: nobody → James Page (james-page)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-radosgw (master)

Reviewed: https://review.openstack.org/637502
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-radosgw/commit/?id=804deed0199772cd2baf990ad26451e007faa3e6
Submitter: Zuul
Branch: master

commit 804deed0199772cd2baf990ad26451e007faa3e6
Author: James Page <email address hidden>
Date: Mon Feb 18 09:28:14 2019 +0000

    Update pool creation for >= Jewel

    The ceph broker request missed some pools for later Ceph versions,
    and created pools which where no longer required.

    Update pool list and tweak weights inline with current best practice.

    Change-Id: I4ed7e08d557c33a05aa8f8c6305914ef9734bad6
    Closes-Bug: 1685536

Changed in charm-ceph-radosgw:
status: In Progress → Fix Committed
David Ames (thedac)
Changed in charm-ceph-radosgw:
status: Fix Committed → Fix Released