ceph-radosgw charm fails to deploy on first unit every time

Bug #1905985 reported by Andre Ruiz
Affects: Ceph RADOS Gateway Charm
Status: Fix Released
Importance: High
Assigned to: Unassigned
Milestone: 20.10

Bug Description

I'm deploying OpenStack with all the latest charms as of now, specifically the ceph-radosgw charm at cs:ceph-radosgw-291.

I have repeated this at least 10 times, and the result is exactly the same every time:

- the first unit (the /0 one) goes into an error state (log below)
- the other two units (/1 and /2) reach the ready state
- I then run juju remove-unit ceph-radosgw/0 and juju add-unit again
- it comes back with no problem (as /3) and clusters fine with the other two units

Here are some logs:

ubuntu@brtlvmrs0637co:~$ juju status ceph-radosgw
Model Controller Cloud/Region Version SLA Timestamp
openstack foundations-maas maas_cloud 2.8.6 unsupported 02:41:48Z

App Version Status Scale Charm Store Rev OS Notes
ceph-radosgw 12.2.13 error 3 ceph-radosgw jujucharms 291 ubuntu
filebeat 5.6.16 waiting 3 filebeat jujucharms 33 ubuntu
hacluster-radosgw active 3 hacluster jujucharms 72 ubuntu
landscape-client maintenance 3 landscape-client jujucharms 33 ubuntu
logrotate active 3 logrotated jujucharms 2 ubuntu
nrpe-container active 3 nrpe jujucharms 69 ubuntu
public-policy-routing active 3 advanced-routing jujucharms 3 ubuntu
telegraf active 3 telegraf jujucharms 39 ubuntu

Unit Workload Agent Machine Public address Ports Message
ceph-radosgw/0* error idle 18/lxd/3 10.243.168.77 80/tcp hook failed: "mon-relation-changed"
  filebeat/31 active idle 10.243.168.77 Filebeat ready.
  hacluster-radosgw/0* active idle 10.243.168.77 Unit is ready and clustered
  landscape-client/33 maintenance idle 10.243.168.77 Need computer-title and juju-info to proceed
  logrotate/33 active idle 10.243.168.77 Unit is ready.
  nrpe-container/0* active idle 10.243.168.77 icmp,5666/tcp ready
  public-policy-routing/0* active idle 10.243.168.77 Unit is ready
  telegraf/33 active idle 10.243.168.77 9103/tcp Monitoring ceph-radosgw/0
ceph-radosgw/1 active idle 19/lxd/3 10.243.168.68 80/tcp Unit is ready
  filebeat/61 active idle 10.243.168.68 Filebeat ready.
  hacluster-radosgw/2 active idle 10.243.168.68 Unit is ready and clustered
  landscape-client/64 maintenance idle 10.243.168.68 Need computer-title and juju-info to proceed
  logrotate/64 active idle 10.243.168.68 Unit is ready.
  nrpe-container/29 active idle 10.243.168.68 icmp,5666/tcp ready
  public-policy-routing/19 active idle 10.243.168.68 Unit is ready
  telegraf/63 active idle 10.243.168.68 9103/tcp Monitoring ceph-radosgw/1
ceph-radosgw/2 active idle 20/lxd/3 10.243.168.104 80/tcp Unit is ready
  filebeat/43 active idle 10.243.168.104 Filebeat ready.
  hacluster-radosgw/1 active idle 10.243.168.104 Unit is ready and clustered
  landscape-client/45 maintenance idle 10.243.168.104 Need computer-title and juju-info to proceed
  logrotate/45 active idle 10.243.168.104 Unit is ready.
  nrpe-container/12 active idle 10.243.168.104 icmp,5666/tcp ready
  public-policy-routing/5 active idle 10.243.168.104 Unit is ready
  telegraf/45 active idle 10.243.168.104 9103/tcp Monitoring ceph-radosgw/2

Machine State DNS Inst id Series AZ Message
18 started 10.243.168.59 brtlvmrs0642co bionic zone3 Deployed
18/lxd/3 started 10.243.168.77 juju-d51b21-18-lxd-3 bionic zone3 Container started
19 started 10.243.168.53 brtlvmrs0641co bionic zone2 Deployed
19/lxd/3 started 10.243.168.68 juju-d51b21-19-lxd-3 bionic zone2 Container started
20 started 10.243.168.36 brtlvmrs0640co bionic zone1 Deployed
20/lxd/3 started 10.243.168.104 juju-d51b21-20-lxd-3 bionic zone1 Container started

root@juju-d51b21-18-lxd-3:/var/lib/juju/agents/unit-ceph-radosgw-0/charm# ./hooks/mon-relation-changed
Synchronizing state of radosgw.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable radosgw
Unit /etc/systemd/system/radosgw.service is masked, ignoring.
Reading package lists... Done
Building dependency tree
Reading state information... Done
python-dbus is already the newest version (1.2.6-1).
The following package was automatically installed and is no longer required:
  libfreetype6
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
parse error setting 'err_to_syslog' to '' (Expected option value to be integer, got '')
parse error setting 'log_to_syslog' to '' (Expected option value to be integer, got '')
unable to get monitor info from DNS SRV with service name: ceph-mon
no monitors specified to connect to.
couldn't init storage provider
2020-11-26 02:39:39.406753 7f21fb27ce40 -1 failed for service _ceph-mon._tcp
parse error setting 'err_to_syslog' to '' (Expected option value to be integer, got '')
parse error setting 'log_to_syslog' to '' (Expected option value to be integer, got '')
unable to get monitor info from DNS SRV with service name: ceph-mon
no monitors specified to connect to.
couldn't init storage provider
2020-11-26 02:39:42.444537 7f64cc084e40 -1 failed for service _ceph-mon._tcp
parse error setting 'err_to_syslog' to '' (Expected option value to be integer, got '')
parse error setting 'log_to_syslog' to '' (Expected option value to be integer, got '')
unable to get monitor info from DNS SRV with service name: ceph-mon
no monitors specified to connect to.
couldn't init storage provider
2020-11-26 02:39:48.488368 7f6f70fd8e40 -1 failed for service _ceph-mon._tcp
parse error setting 'err_to_syslog' to '' (Expected option value to be integer, got '')
parse error setting 'log_to_syslog' to '' (Expected option value to be integer, got '')
unable to get monitor info from DNS SRV with service name: ceph-mon
no monitors specified to connect to.
couldn't init storage provider
2020-11-26 02:39:57.534079 7fb6c8f9fe40 -1 failed for service _ceph-mon._tcp
parse error setting 'err_to_syslog' to '' (Expected option value to be integer, got '')
parse error setting 'log_to_syslog' to '' (Expected option value to be integer, got '')
unable to get monitor info from DNS SRV with service name: ceph-mon
no monitors specified to connect to.
couldn't init storage provider
2020-11-26 02:40:09.634685 7f18a2bece40 -1 failed for service _ceph-mon._tcp
parse error setting 'err_to_syslog' to '' (Expected option value to be integer, got '')
parse error setting 'log_to_syslog' to '' (Expected option value to be integer, got '')
unable to get monitor info from DNS SRV with service name: ceph-mon
no monitors specified to connect to.
couldn't init storage provider
2020-11-26 02:40:24.686832 7f65dac70e40 -1 failed for service _ceph-mon._tcp
Traceback (most recent call last):
  File "./hooks/mon-relation-changed", line 704, in <module>
    hooks.execute(sys.argv)
  File "/var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/charmhelpers/core/hookenv.py", line 945, in execute
    self._hooks[hook_name]()
  File "./hooks/mon-relation-changed", line 336, in mon_relation
    _mon_relation()
  File "/var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/charmhelpers/contrib/openstack/utils.py", line 1720, in wrapped_f
    stopstart, restart_functions)
  File "/var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/charmhelpers/core/host.py", line 747, in restart_on_change_helper
    r = lambda_f()
  File "/var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/charmhelpers/contrib/openstack/utils.py", line 1719, in <lambda>
    (lambda: f(*args, **kwargs)), __restart_map_cache['cache'],
  File "./hooks/mon-relation-changed", line 328, in _mon_relation
    if zone not in multisite.list_zones():
  File "/var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/multisite.py", line 68, in _list
    result = json.loads(_check_output(cmd))
  File "/var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/charmhelpers/core/decorators.py", line 40, in _retry_on_exception_inner_2
    return f(*args, **kwargs)
  File "/var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/multisite.py", line 32, in _check_output
    return subprocess.check_output(cmd).decode('UTF-8')
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['radosgw-admin', '--id=rgw.juju-d51b21-18-lxd-3', 'zone', 'list']' returned non-zero exit status 5.

root@juju-d51b21-18-lxd-3:/var/lib/juju/agents/unit-ceph-radosgw-0/charm# radosgw-admin --id=rgw.juju-d51b21-18-lxd-3 zone list
parse error setting 'err_to_syslog' to '' (Expected option value to be integer, got '')
parse error setting 'log_to_syslog' to '' (Expected option value to be integer, got '')
unable to get monitor info from DNS SRV with service name: ceph-mon
no monitors specified to connect to.
couldn't init storage provider
2020-11-26 02:38:31.161470 7fb1169f7e40 -1 failed for service _ceph-mon._tcp
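
To make the failure easier to follow, here is a simplified sketch of the code path the traceback walks through. It is reconstructed from the traceback, not copied from the charm; the retry parameters in particular are assumptions inferred from the "Retrying '_check_output'" log lines in a later comment.

# Simplified sketch of the failing path in hooks/multisite.py,
# reconstructed from the traceback above. Module, function and
# decorator names come from the traceback; retry parameters are
# assumptions.
import json
import subprocess

from charmhelpers.core.decorators import retry_on_exception


@retry_on_exception(num_retries=5, base_delay=3,
                    exc_type=subprocess.CalledProcessError)
def _check_output(cmd):
    # radosgw-admin exits non-zero (status 5) for as long as the
    # rendered ceph.conf carries no monitor addresses, so every retry
    # fails identically and the hook finally errors out.
    return subprocess.check_output(cmd).decode('UTF-8')


def list_zones():
    cmd = ['radosgw-admin', '--id=rgw.juju-d51b21-18-lxd-3',
           'zone', 'list']
    return json.loads(_check_output(cmd))

The ordering problem is visible in the log above: _mon_relation() reaches list_zones() while ceph.conf still has no monitor addresses ("no monitors specified to connect to"), so no amount of retrying within the hook can succeed.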

Tags: ps5
Changed in charm-ceph-radosgw:
status: New → Triaged
importance: Undecided → High
Changed in charm-ceph-radosgw:
assignee: nobody → Aurelien Lourot (aurelien-lourot)
status: Triaged → In Progress
Revision history for this message
James Page (james-page) wrote :

A slightly different approach to avoiding the same race:

https://review.opendev.org/c/openstack/charm-ceph-radosgw/+/765907
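
The review itself isn't quoted here, but for illustration, a guard against this race could take roughly the following shape. This is a hypothetical sketch, not the actual patch; ready_for_multisite() and the relation keys it reads are made-up names.

# Hypothetical sketch -- not the code from the review linked above.
# Idea: refuse to touch radosgw-admin until the ceph-mon relation has
# actually handed over usable cluster details.
from charmhelpers.core.hookenv import log, relation_get


def ready_for_multisite():
    # 'fsid' and 'radosgw_key' are illustrative setting names; the
    # real charm reads its own set of keys from the ceph-mon relation.
    fsid = relation_get('fsid')
    key = relation_get('radosgw_key')
    if not (fsid and key):
        log('ceph-mon relation incomplete; deferring multisite setup')
        return False
    return True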

Nobuto Murata (nobuto)
tags: added: ps5
Changed in charm-ceph-radosgw:
status: In Progress → Fix Committed
milestone: none → 20.10
Changed in charm-ceph-radosgw:
status: Fix Committed → Fix Released
assignee: Aurelien Lourot (aurelien-lourot) → nobody
Revision history for this message
Frode Nordahl (fnordahl) wrote :

I hit a variant of this bug where the fix in #4 has an uncovered failure scenario:

2021-08-26 07:32:53 INFO unit.ceph-radosgw/0.juju-log server.go:314 mon:86: Retrying '_check_output' 5 more times (delay=3)
2021-08-26 07:32:56 INFO unit.ceph-radosgw/0.juju-log server.go:314 mon:86: Retrying '_check_output' 4 more times (delay=6)
2021-08-26 07:33:02 INFO unit.ceph-radosgw/0.juju-log server.go:314 mon:86: Retrying '_check_output' 3 more times (delay=9)
2021-08-26 07:33:12 INFO unit.ceph-radosgw/0.juju-log server.go:314 mon:86: Retrying '_check_output' 2 more times (delay=12)
2021-08-26 07:33:24 INFO unit.ceph-radosgw/0.juju-log server.go:314 mon:86: Retrying '_check_output' 1 more times (delay=15)
2021-08-26 07:33:39 WARNING unit.ceph-radosgw/0.mon-relation-changed logger.go:60 Traceback (most recent call last):
2021-08-26 07:33:39 WARNING unit.ceph-radosgw/0.mon-relation-changed logger.go:60 File "/var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/mon-relation-changed", line 748, in <module>
2021-08-26 07:33:39 WARNING unit.ceph-radosgw/0.mon-relation-changed logger.go:60 hooks.execute(sys.argv)
2021-08-26 07:33:39 WARNING unit.ceph-radosgw/0.mon-relation-changed logger.go:60 File "/var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/charmhelpers/core/hookenv.py", line 963, in execute
2021-08-26 07:33:39 WARNING unit.ceph-radosgw/0.mon-relation-changed logger.go:60 self._hooks[hook_name]()
2021-08-26 07:33:39 WARNING unit.ceph-radosgw/0.mon-relation-changed logger.go:60 File "/var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/mon-relation-changed", line 372, in mon_relation
2021-08-26 07:33:39 WARNING unit.ceph-radosgw/0.mon-relation-changed logger.go:60 _mon_relation()
2021-08-26 07:33:39 WARNING unit.ceph-radosgw/0.mon-relation-changed logger.go:60 File "/var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/charmhelpers/contrib/openstack/utils.py", line 1903, in wrapped_f
2021-08-26 07:33:39 WARNING unit.ceph-radosgw/0.mon-relation-changed logger.go:60 return restart_on_change_helper(
2021-08-26 07:33:39 WARNING unit.ceph-radosgw/0.mon-relation-changed logger.go:60 File "/var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/charmhelpers/core/host.py", line 837, in restart_on_change_helper
2021-08-26 07:33:39 WARNING unit.ceph-radosgw/0.mon-relation-changed logger.go:60 r = lambda_f()
2021-08-26 07:33:39 WARNING unit.ceph-radosgw/0.mon-relation-changed logger.go:60 File "/var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/charmhelpers/contrib/openstack/utils.py", line 1904, in <lambda>
2021-08-26 07:33:39 WARNING unit.ceph-radosgw/0.mon-relation-changed logger.go:60 (lambda: f(*args, **kwargs)),
2021-08-26 07:33:39 WARNING unit.ceph-radosgw/0.mon-relation-changed logger.go:60 File "/var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/mon-relation-changed", line 343, in _mon_relation
2021-08-26 07:33:39 WARNING unit.ceph-radosgw/0.mon-relation-changed logger.go:60 existing_zones = multisite.list_zones()
2021-08-26 07:33:39 WARNING unit.ceph-radosgw/0.mon-relation-changed logger.go:60 File "/var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/charmhelpers/core/decorators.py", line 40, in _retry_on_...

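As the truncated traceback shows, once the retries are exhausted the CalledProcessError still propagates and fails the hook. A hypothetical hardening (not the merged fix) would be to treat an exhausted retry as "not ready yet" and defer to a later hook invocation:

# Hypothetical hardening sketch -- not the merged fix.
import subprocess

import multisite  # the charm's own hooks/multisite.py

from charmhelpers.core.hookenv import log, status_set


def existing_zones_or_defer():
    try:
        return multisite.list_zones()
    except subprocess.CalledProcessError:
        # Leave the unit in 'waiting'; the next hook run retries once
        # the monitors become reachable.
        status_set('waiting', 'Ceph monitors not reachable yet')
        log('radosgw-admin still failing after retries; deferring')
        return None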

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charm-ceph-radosgw (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/charm-ceph-radosgw/+/806114

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-ceph-radosgw (master)

Change abandoned by "James Page <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/charm-ceph-radosgw/+/806114
Reason: This review is > 12 weeks without comment, and failed testing the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.
