systemd issues with bionic-rocky causing nagios alert and can't restart daemon

Bug #1825843 reported by Drew Freiberger
This bug affects 2 people
Affects                    Status        Importance  Assigned to  Milestone
Ceph RADOS Gateway Charm   Fix Released  High        Pen Gale
ceph (Ubuntu)              Invalid       Undecided   Unassigned

Bug Description

During deployment of a bionic-rocky cloud on 19.04 charms, we are seeing an issue with the ceph-radosgw units related to the systemd service definition for radosgw.service.

If you look through the pastebin below, you'll notice that there is a running radosgw daemon and the local haproxy unit considers all radosgw backend services available (via the nagios check), yet systemd can't control radosgw properly. Note that before a restart with systemd, systemd showed the unit as loaded but inactive; after the restart it shows active (exited), but that did not actually restart the radosgw service.

https://pastebin.ubuntu.com/p/Pn3sQ3zHXx/
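
For reference, a minimal sketch of the kind of checks described above, run on a ceph-radosgw unit (the systemd unit behaviour is assumed from this report and later comments in the thread):

ps -ef | grep [r]adosgw        # the radosgw daemon itself is running
systemctl status radosgw       # legacy init.d-generated unit: loaded inactive, later active (exited)
systemctl restart radosgw      # reports success but does not actually restart the radosgw process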

charm: cs:ceph-radosgw-266
cloud:bionic-rocky
 *** 13.2.4+dfsg1-0ubuntu0.18.10.1~cloud0 500
        500 http://ubuntu-cloud.archive.canonical.com/ubuntu bionic-updates/rocky/main amd64 Packages

ceph-radosgw/0 active idle 18/lxd/2 10.20.175.60 80/tcp Unit is ready
  hacluster-radosgw/2 active idle 10.20.175.60 Unit is ready and clustered
ceph-radosgw/1 active idle 19/lxd/2 10.20.175.48 80/tcp Unit is ready
  hacluster-radosgw/1 active idle 10.20.175.48 Unit is ready and clustered
ceph-radosgw/2* active idle 20/lxd/2 10.20.175.25 80/tcp Unit is ready
  hacluster-radosgw/0* active idle 10.20.175.25 Unit is ready and clustered

Revision history for this message
Drew Freiberger (afreiberger) wrote :

Subscribed field-high as this is an operational concern for go-live.

The workaround for managing the service is to reboot the hosting LXD container, which resets its state to that shown in the first 43 lines of the pastebin.
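
A minimal sketch of that workaround, assuming the unit is ceph-radosgw/0 (run one unit at a time):

juju ssh ceph-radosgw/0 'sudo reboot'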

Revision history for this message
Drew Freiberger (afreiberger) wrote :

This most definitely appears to be an upstream packaging issue rather than an issue with the charm itself.

Ryan Beisner (1chb1n)
Changed in charm-ceph-radosgw:
importance: Undecided → High
assignee: nobody → Pete Vander Giessen (petevg)
Revision history for this message
Ryan Beisner (1chb1n) wrote :

Please add log files (pastebins are ephemeral and will vanish).

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Non-reproducing reproducer attempt attached.

Revision history for this message
Pen Gale (pengale) wrote :

I can verify that this happens with the minimal bionic-rocky bundle found in the ceph-radosgw charm's tests.

It's a bit tricky to spot at first, though, as the workload status is green.

Filed a separate bug about that: https://bugs.launchpad.net/charm-ceph-radosgw/+bug/1825884

Changed in charm-ceph-radosgw:
status: New → Triaged
Revision history for this message
Drew Freiberger (afreiberger) wrote :

Workaround to get a temporary fix into the radosgw init script. Note: this will potentially restart all your rgw units at once; you may want to run it one unit at a time (see the per-unit variant below).

juju run --application ceph-radosgw 'perl -pi -e "s/^PREFIX=.*/PREFIX=client.rgw./" /etc/init.d/radosgw; systemctl daemon-reload; systemctl restart radosgw'
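
The same workaround can be applied one unit at a time, e.g. for a hypothetical unit ceph-radosgw/0 (repeat for each unit):

juju run --unit ceph-radosgw/0 'perl -pi -e "s/^PREFIX=.*/PREFIX=client.rgw./" /etc/init.d/radosgw; systemctl daemon-reload; systemctl restart radosgw'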

David Britton (dpb)
tags: added: tracking
Revision history for this message
Pen Gale (pengale) wrote :

After further reading of code and testing, I think that I am completely wrong about what is or is not broken. radosgw is inactive, but that's deliberate. The charm has created a new service called "ceph-radosgw@rgw...", along with a target to keep track of it, ceph-radosgw.target. Both seem to be running just fine.

https://paste.ubuntu.com/p/szvhrRT9PM/
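
A sketch of verifying the per-host units described above on a radosgw unit (the instance name rgw.$(hostname) is assumed from the naming used elsewhere in this thread):

systemctl status ceph-radosgw.target
systemctl status ceph-radosgw@rgw.$(hostname)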

Revision history for this message
Angel Vargas (angelvargas) wrote :

We upgraded the charms and radosgw broke (ceph-radosgw release 267).

After hours of debugging, we decided to test a fresh 4-node deployment to investigate the problem and try to revert. Deploying a fresh openstack-base, juju shows:

ceph-radosgw/0* blocked idle 0/lxd/0 10.100.0.61 80/tcp Services not running that should be: <email address hidden>

If we restart the LXD container and then execute:

sudo service radosgw status

we get:

Apr 23 00:18:05 juju-168b18-0-lxd-0 radosgw[36885]: Starting client.rgw.juju-168b18-0-lxd-0...
Apr 23 00:18:05 juju-168b18-0-lxd-0 systemd[1]: Started LSB: radosgw RESTful rados gateway.
Apr 23 00:19:22 juju-168b18-0-lxd-0 systemd[1]: Stopping LSB: radosgw RESTful rados gateway...
Apr 23 00:19:22 juju-168b18-0-lxd-0 systemd[1]: Stopped LSB: radosgw RESTful rados gateway.
Apr 23 00:19:26 juju-168b18-0-lxd-0 systemd[1]: radosgw.service: Failed to reset devices.list: Operation not permitted
Apr 23 00:19:26 juju-168b18-0-lxd-0 systemd[1]: Starting LSB: radosgw RESTful rados gateway...
Apr 23 00:19:26 juju-168b18-0-lxd-0 radosgw[37618]: Starting client.rgw.juju-168b18-0-lxd-0...
Apr 23 00:19:26 juju-168b18-0-lxd-0 systemd[1]: Started LSB: radosgw RESTful rados gateway.
Apr 23 00:21:48 juju-168b18-0-lxd-0 systemd[1]: Stopping LSB: radosgw RESTful rados gateway...
Apr 23 00:21:49 juju-168b18-0-lxd-0 systemd[1]: Stopped LSB: radosgw RESTful rados gateway.

That is the output after a fresh boot. Then, if we run:

sudo service radosgw start

we get the service running:

● radosgw.service - LSB: radosgw RESTful rados gateway
   Loaded: loaded (/etc/init.d/radosgw; generated)
   Active: active (running) since Tue 2019-04-23 00:22:47 UTC; 17min ago
     Docs: man:systemd-sysv-generator(8)
  Process: 811 ExecStart=/etc/init.d/radosgw start (code=exited, status=0/SUCCESS)
    Tasks: 582 (limit: 7372)
   CGroup: /system.slice/radosgw.service
           └─850 /usr/bin/radosgw -n client.rgw.juju-168b18-0-lxd-0

Apr 23 00:22:46 juju-168b18-0-lxd-0 systemd[1]: Starting LSB: radosgw RESTful rados gateway...
Apr 23 00:22:46 juju-168b18-0-lxd-0 radosgw[811]: Starting client.rgw.juju-168b18-0-lxd-0...
Apr 23 00:22:47 juju-168b18-0-lxd-0 systemd[1]: Started LSB: radosgw RESTful rados gateway.

but juju still keeps showing the unit as blocked.

This is the juju log for ceph-radosgw:

https://paste.ubuntu.com/p/kb3g9XZ7nb/

We are getting the same behaviour in our production and test environments. Even when we get the service running, the unit doesn't seem to work from the OpenStack perspective; e.g. when trying to create a bucket, the API doesn't connect.

How can I help?

Revision history for this message
James Page (james-page) wrote :

The switch from the init.d radosgw script to actual systemd units was intentional and was part of a tech debt payoff to support the implementation of RGW mirroring, which requires that all daemons have specific keys.

I think there may be an NRPE check that needs to be updated to support the switch to named systemd units (I note in the original bug report that the units all showed healthy status from juju itself).

Revision history for this message
James Page (james-page) wrote :

The nrpe check code is using the new code paths for generating the service unit names in the new style.

Revision history for this message
James Page (james-page) wrote :

@angelvargas

I'm not sure your issue is the same problem - in the original bug report, the application units are reporting that all services are running, whereas in yours they are not.

Please can you determine why the radosgw daemon is not running - /var/log/ceph and systemctl status radosgw@rgw.`hostname` will tell you more.

Changed in ceph (Ubuntu):
status: New → Invalid
Revision history for this message
Angel Vargas (angelvargas) wrote :

@james-page

In my case, after the upgrade the charm unit showed running, but the OpenStack API/Horizon wasn't able to talk to the service. We went to the logs and there was no obvious information about why the service wasn't up; only in the juju controller logs did we see that radosgw was, for some reason, going into a failed state. After rebooting the container, the service wasn't starting, so we decided to deploy a lean openstack-base r59 (bionic-stein) bundle as usual, and the charm showed the exact same problem/behavior. Not sure what is wrong with the update we did; to get radosgw working again we had to downgrade to (bionic-rocky):

ceph-mon 13.2.4+dfsg1 active 3 ceph-mon jujucharms 32 ubuntu
ceph-osd 13.2.4+dfsg1 active 3 ceph-osd jujucharms 275 ubuntu
ceph-radosgw 13.2.4+dfsg1 active 1 ceph-radosgw jujucharms 263 ubuntu

And we got the services up again. We are currently working to fix the production environment.

When I wrote the first time, I thought the way the bug initially affected us was related to the current report.

Revision history for this message
Pen Gale (pengale) wrote :

After IRC conversations and more testing, I think that I have a clean reproduction of this bug, along with a root cause.

The root cause: the charm takes control of the radosgw service, and changes the name, but doesn't remove the old nrpe check.

To reproduce:

1) juju deploy the following bundle: https://paste.ubuntu.com/p/wpVt447Vwz/
2) juju ssh into ceph-radosgw/0 and note that there is a "check_radosgw.cfg" in /etc/nagios/nrpe.d.
3) Trigger the config-changed hook on the ceph-radosgw charm. You might change the number of ceph replicas, for example.
4) Note that there is now a "check_ceph-radosgw@<hostname>.cfg" check, in addition to the check_radosgw.cfg check.
5) Run both checks (cat the files to get the command; see the sketch after this comment). Note that the new, hostname-based check succeeds, but the old check does not.

The original check will also fail if you run it during step 2, suggesting that the service has been changed, but the nagios monitoring is not updated until the config-changed hook runs.

This bug can be closed once the charm places checks in /etc/nagios/nrpe.d that accurately reflect the running services, and cleans up outdated checks as well.
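
A sketch of steps 2 and 5 above (file names are taken from the reproduction steps; the exact command inside each .cfg depends on the charm revision):

juju ssh ceph-radosgw/0
cat /etc/nagios/nrpe.d/check_radosgw.cfg            # stale check for the old service name
cat /etc/nagios/nrpe.d/check_ceph-radosgw@*.cfg     # new, hostname-based check
# Run the command[...] line from each file: the new check succeeds, the stale one fails.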

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-radosgw (master)

Fix proposed to branch: master
Review: https://review.opendev.org/655434

Changed in charm-ceph-radosgw:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/655574

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-ceph-radosgw (master)

Change abandoned by Pete Vander Giessen (<email address hidden>) on branch: master
Review: https://review.opendev.org/655574
Reason: Accidentally created extra review due to bad squash.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-radosgw (master)

Reviewed: https://review.opendev.org/655434
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-radosgw/commit/?id=ff90c0f058073292739bab33309d700d9bcfc4c6
Submitter: Zuul
Branch: master

commit ff90c0f058073292739bab33309d700d9bcfc4c6
Author: Pete Vander Giessen <email address hidden>
Date: Wed Apr 24 09:39:26 2019 -0400

    Fix spurious nagios alerts for radosgw service.

    Currently, when the charm tears down the default radosgw daemon in
    order to make way for per host daemons, it does not remove the nrpe
    check for the daemon. This PR fixes the issue.

    It also closes a gap where alerts for the per host daemons are not
    setup until a hook that happens to call update_nrpe_checks as a
    side-effect is run.

    Change-Id: I7621b9671b010a77bb3e94bdd1e80f45274c73e5
    Closes-Bug: #1825843

Changed in charm-ceph-radosgw:
status: In Progress → Fix Committed
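
For units deployed before this fix, a hedged sketch of clearing the stale check by hand (the merged change does this from the charm itself; the file name is taken from the reproduction steps above):

juju run --application ceph-radosgw 'rm -f /etc/nagios/nrpe.d/check_radosgw.cfg; systemctl restart nagios-nrpe-server'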
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-radosgw (stable/19.04)

Fix proposed to branch: stable/19.04
Review: https://review.opendev.org/655900

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-ceph-radosgw (stable/19.04)

Change abandoned by Pete Vander Giessen (<email address hidden>) on branch: stable/19.04
Review: https://review.opendev.org/655900
Reason: Cherry picked the wrong change!

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-radosgw (stable/19.04)

Fix proposed to branch: stable/19.04
Review: https://review.opendev.org/655913

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-radosgw (stable/19.04)

Reviewed: https://review.opendev.org/655913
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-radosgw/commit/?id=e034d5e18681fc2952093afe20a78d202ed5be62
Submitter: Zuul
Branch: stable/19.04

commit e034d5e18681fc2952093afe20a78d202ed5be62
Author: Pete Vander Giessen <email address hidden>
Date: Wed Apr 24 09:39:26 2019 -0400

    Fix spurious nagios alerts for radosgw service.

    Currently, when the charm tears down the default radosgw daemon in
    order to make way for per host daemons, it does not remove the nrpe
    check for the daemon. This PR fixes the issue.

    It also closes a gap where alerts for the per host daemons are not
    setup until a hook that happens to call update_nrpe_checks as a
    side-effect is run.

    Change-Id: I7621b9671b010a77bb3e94bdd1e80f45274c73e5
    Closes-Bug: #1825843
    (cherry picked from commit ff90c0f058073292739bab33309d700d9bcfc4c6)

Revision history for this message
Angel Vargas (angelvargas) wrote :

Currently trying charm ceph-radosgw rev 268, Ubuntu Bionic, OpenStack Stein (openstack-base-like bundle).

I'm having an issue where the service seems to be running, then after a few minutes the unit changes to blocked and reports that the service is not running. This is a baremetal deployment with multiple network spaces (handled by MAAS).

For this charm I deployed with individual commands per required unit:

juju deploy --to lxd:0 --config ceph-radosgw.yaml ceph-radosgw --bind="public admin=admin cluster=cluster internal=internal public=public"

The config file looks like:
---
ceph-radosgw:
  ceph-osd-replication-count: 2
  cache-size: 1200
  os-admin-network: 10.101.0.0/24
  os-internal-network: 10.50.0.0/24
  os-public-network: 10.100.0.0/24
  vip: 10.100.0.210 10.101.0.210 10.50.0.210
  pool-prefix: sc
  source: 'cloud:bionic-stein'

juju shows this:

ceph-radosgw/0* blocked idle 0/lxd/0 10.100.0.64 80/tcp Services not running that should be: <email address hidden>
  ha-radosgw/0* active executing 10.100.0.64 Unit is ready and clustered
ceph-radosgw/1 blocked executing 6/lxd/0 10.100.0.77 80/tcp Services not running that should be: <email address hidden>
  ha-radosgw/1 active idle 10.100.0.77 Unit is ready and clustered
ceph-radosgw/2 blocked executing 1/lxd/3 10.100.0.78 80/tcp Services not running that should be: <email address hidden>

Then, going to each LXD container and checking the service, I got this output on every unit:

ubuntu@juju-a2d93a-0-lxd-0:~$ sudo service jujud-unit-ceph-radosgw-0 status
● jujud-unit-ceph-radosgw-0.service - juju unit agent for ceph-radosgw/0
   Loaded: loaded (/lib/systemd/system/jujud-unit-ceph-radosgw-0/jujud-unit-ceph-radosgw-0.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2019-05-02 19:59:07 UTC; 2h 2min ago
 Main PID: 3820 (bash)
    Tasks: 67 (limit: 7372)
   CGroup: /system.slice/jujud-unit-ceph-radosgw-0.service
           ├─3820 bash /lib/systemd/system/jujud-unit-ceph-radosgw-0/exec-start.sh
           └─3824 /var/lib/juju/tools/unit-ceph-radosgw-0/jujud unit --data-dir /var/lib/juju --unit-name ceph-radosgw/0 --debug

May 02 21:57:27 juju-a2d93a-0-lxd-0 systemd[1]: jujud-unit-ceph-radosgw-0.service: Failed to reset devices.list: Operation not permitted
May 02 22:00:14 juju-a2d93a-0-lxd-0 systemd[1]: jujud-unit-ceph-radosgw-0.service: Failed to reset devices.list: Operation not permitted
May 02 22:00:15 juju-a2d93a-0-lxd-0 systemd[1]: jujud-unit-ceph-radosgw-0.service: Failed to reset devices.list: Operation not permitted
May 02 22:00:15 juju-a2d93a-0-lxd-0 systemd[1]: jujud-unit-ceph-radosgw-0.service: Failed to reset devices.list: Operation not permitted
May 02 22:00:15 juju-a2d93a-0-lxd-0 systemd[1]: jujud-unit-ceph-radosgw-0.service: Failed to reset devices.list: Operation not permitted
May 02 22:00:15 juju-a2d93a-0-lxd-0 systemd[1]: jujud-unit-ceph-radosgw-0.service: Failed to reset de...


James Page (james-page)
Changed in charm-ceph-radosgw:
milestone: none → 19.07
David Ames (thedac)
Changed in charm-ceph-radosgw:
status: Fix Committed → Fix Released