ceph-rbd-mirror function tests fail on train

Bug #1851869 reported by Corey Bryant
Affects                      Status        Importance  Assigned to  Milestone
Ceph Monitor Charm           Fix Released  High        Liam Young   20.05
Ceph RBD Mirror Charm        Invalid       Undecided   Liam Young
OpenStack Charm Test Infra   Invalid       Undecided   Unassigned

Bug Description

charm-ceph-rbd-mirror function tests for train are failing. See gerrit review: https://review.opendev.org/#/c/691022/

In the train bundle we had to switch to juju storage for ceph-osd, which I think has something to do with the failure. Switching bionic-stein to juju storage also results in the same initial timeout issues with test_cinder_volume_mirrored (see more below).

The only bundle difference between bionic-stein.yaml and bionic-train.yaml, other than source/openstack-origin changes, is using juju storage for ceph-osd:

  # bionic-stein
  ceph-osd:
    charm: cs:~openstack-charmers-next/ceph-osd
    num_units: 3
    options:
      source: cloud:bionic-stein
      bluestore: False
      use-direct-io: False
      osd-devices: /opt

  # bionic-train
  ceph-osd:
    charm: cs:~openstack-charmers-next/ceph-osd
    num_units: 3
    storage:
      osd-devices: 'cinder,10G'
    options:
      source: cloud:bionic-train
      bluestore: False
      use-direct-io: False
      osd-devices: '/dev/test-non-existent'

The first failure I came across was timeouts with test_cinder_volume_mirrored. Note that this also occurs with bionic-stein when using juju storage as above. So I tried running with create_volume_from_image() updated as follows:

--- a/zaza/openstack/charm_tests/ceph/rbd_mirror/tests.py
+++ b/zaza/openstack/charm_tests/ceph/rbd_mirror/tests.py
@@ -210,8 +210,13 @@ class CephRBDMirrorTest(CephRBDMirrorBase):
                 return
             volume = cinder.volumes.create(8, name='zaza', imageRef=image.id)
             try:
+                # Note(coreycb): stop_after_attempt is increased because using
+                # juju storage for ceph-osd backed by cinder on undercloud
+                # takes longer than the prior method of directory-backed OSD
+                # devices.
                 openstack.resource_reaches_status(
-                    cinder.volumes, volume.id, msg='volume')
+                    cinder.volumes, volume.id, msg='volume',
+                    stop_after_attempt=30)
                 return volume
             except AssertionError:
                 logging.info('retrying')
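
Increasing stop_after_attempt just allows more polling attempts before giving up. As an illustration only (not zaza's actual implementation; the helper name and wait settings are assumptions), the same idea expressed with tenacity, which the func-smoke environment already installs:

# Illustrative sketch only -- not zaza's actual code. It shows why raising
# stop_after_attempt gives slow, cinder-backed OSDs more time: each attempt
# is one poll of the volume status, retried with exponential back-off.
import tenacity


@tenacity.retry(
    retry=tenacity.retry_if_exception_type(AssertionError),
    wait=tenacity.wait_exponential(multiplier=1, max=60),
    stop=tenacity.stop_after_attempt(30),
    reraise=True)
def wait_volume_available(cinder, volume_id):
    """Poll a Cinder volume until it reports 'available'."""
    status = cinder.volumes.get(volume_id).status
    assert status == 'available', (
        'volume {} is still {}'.format(volume_id, status))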

That allows the volume to become available after a long time:

(func-smoke) ubuntu@coreycb-bastion:~/charms/bionic/charm-ceph-rbd-mirror/build/builds/ceph-rbd-mirror$ functest-test -m zaza-381ba6318cf6
2019-11-08 15:07:15 [INFO] ## Running Test zaza.openstack.charm_tests.ceph.rbd_mirror.tests.CephRBDMirrorTest ##
2019-11-08 15:07:24 [INFO] Using keystone API V3 (or later) for overcloud auth
2019-11-08 15:07:26 [INFO] test_cinder_volume_mirrored (zaza.openstack.charm_tests.ceph.rbd_mirror.tests.CephRBDMirrorTest)
2019-11-08 15:07:26 [INFO] Validate that a volume created through Cinder is mirrored.
2019-11-08 15:07:26 [INFO] ...
2019-11-08 15:07:30 [INFO] Using keystone API V3 (or later) for overcloud auth
2019-11-08 15:07:31 [WARNING] Version 2 is deprecated, use alternative version 3 instead.
2019-11-08 15:07:34 [INFO] creating
2019-11-08 15:07:35 [INFO] creating
2019-11-08 15:07:37 [INFO] creating
2019-11-08 15:07:42 [INFO] downloading
2019-11-08 15:07:50 [INFO] downloading
2019-11-08 15:08:06 [INFO] downloading
2019-11-08 15:08:38 [INFO] downloading
2019-11-08 15:09:38 [INFO] downloading
2019-11-08 15:10:39 [INFO] downloading
2019-11-08 15:11:39 [INFO] downloading
2019-11-08 15:12:39 [INFO] downloading
2019-11-08 15:13:40 [INFO] downloading
2019-11-08 15:14:40 [INFO] downloading
2019-11-08 15:15:40 [INFO] downloading
2019-11-08 15:16:40 [INFO] downloading
2019-11-08 15:17:41 [INFO] downloading
2019-11-08 15:18:41 [INFO] available

But the tests hang after that. At some point during the test, the workload status for ceph-rbd-mirror-b/0 switched to 'Pools WARNING', which may explain the hang:

Unit Workload Agent Machine Public address Ports Message
ceph-mon-b/0 active idle 3 10.5.0.48 Unit is ready and clustered
ceph-mon-b/1* active idle 4 10.5.0.9 Unit is ready and clustered
ceph-mon-b/2 active idle 5 10.5.0.36 Unit is ready and clustered
ceph-mon/0* active idle 0 10.5.0.8 Unit is ready and clustered
ceph-mon/1 active idle 1 10.5.0.45 Unit is ready and clustered
ceph-mon/2 active idle 2 10.5.0.17 Unit is ready and clustered
ceph-osd-b/0* active idle 9 10.5.0.10 Unit is ready (1 OSD)
ceph-osd-b/1 active idle 10 10.5.0.13 Unit is ready (1 OSD)
ceph-osd-b/2 active idle 11 10.5.0.21 Unit is ready (1 OSD)
ceph-osd/0 active idle 6 10.5.0.47 Unit is ready (1 OSD)
ceph-osd/1 active idle 7 10.5.0.52 Unit is ready (1 OSD)
ceph-osd/2* active idle 8 10.5.0.14 Unit is ready (1 OSD)
ceph-rbd-mirror-b/0* blocked idle 13 10.5.0.22 Unit is ready (Pools WARNING (1) OK (1) Images unknown (1) Secondary (1))
ceph-rbd-mirror/0* active idle 12 10.5.0.26 Unit is ready (Pools OK (2) Images Primary (2))
cinder/0* active idle 14 10.5.0.40 8776/tcp Unit is ready
  cinder-ceph/0* active idle 10.5.0.40 Unit is ready
glance/0* active idle 15 10.5.0.38 9292/tcp Unit is ready
keystone/0* active idle 16 10.5.0.30 5000/tcp Unit is ready
mysql/0* active idle 17 10.5.0.27 3306/tcp Unit is ready
rabbitmq-server/0* active idle 18 10.5.0.39 5672/tcp Unit is ready
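
A rough way to watch for that workload-status transition while the test runs, purely illustrative (it shells out to plain 'juju status --format=json' rather than using any zaza helper; the default unit name is the one from this deployment):

import json
import subprocess
import time


def wait_for_pools_warning(unit='ceph-rbd-mirror-b/0', interval=30):
    """Poll `juju status` until the unit's workload message mentions
    'Pools WARNING', then return that message."""
    app = unit.split('/')[0]
    while True:
        status = json.loads(subprocess.check_output(
            ['juju', 'status', '--format=json'],
            universal_newlines=True))
        message = (status['applications'][app]['units'][unit]
                   ['workload-status'].get('message', ''))
        if 'Pools WARNING' in message:
            return message
        time.sleep(interval)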

Looking deeper at rbd mirror pool commands:

ubuntu@juju-697320-zaza-381ba6318cf6-13:/var/lib/juju/agents/unit-ceph-rbd-mirror-b-0/charm$ sudo rbd --id rbd-mirror.juju-697320-zaza-381ba6318cf6-13 mirror pool info cinder-ceph
Mode: pool
Peers:
  UUID NAME CLIENT
  7e39c18b-9ef8-435e-a671-5b5b27ebc99f remote client.rbd-mirror.juju-697320-zaza-381ba6318cf6-13

ubuntu@juju-697320-zaza-381ba6318cf6-13:/var/lib/juju/agents/unit-ceph-rbd-mirror-b-0/charm$ sudo rbd --id rbd-mirror.juju-697320-zaza-381ba6318cf6-13 mirror pool status --verbose cinder-ceph
health: WARNING
images: 1 total
    1 unknown

volume-9b2afcd8-2851-4b27-93b5-e9c96717b0d4:
  global_id: ad3dd728-4505-4d16-a445-775c878afcfc
  state: down+replaying
  description: replaying, master_position=[object_number=65, tag_tid=2, entry_tid=1], mirror_position=[object_number=65, tag_tid=2, entry_tid=1], entries_behind_master=0
  last_update: 2019-11-08 15:03:28

On ceph-rbd-mirror-b/0 I'm seeing 'image no longer exists' in ceph log:

2019-11-07 21:24:11.640 7f5d7f316700 0 rbd::mirror::ImageReplayer: 0x7f5da8028e70 [2/3b318528-c009-4524-99e8-2d57ab092d97] handle_shut_down: remote image no longer exists: scheduling deletion
2019-11-07 21:24:27.664 7f5d7f316700 0 rbd::mirror::ImageReplayer: 0x7f5da8028e70 [2/3b318528-c009-4524-99e8-2d57ab092d97] handle_shut_down: mirror image no longer exists

Also on ceph-rbd-mirror-b/0, I'm seeing a lot of timeout messages like this in the ceph log:

2019-11-08 15:10:08.110 7f5da7fff700 -1 rbd::mirror::InstanceWatcher: C_NotifyInstanceRequest: 0x7f5da805def0 finish: resending after timeout

On ceph-rbd-mirror/0 I'm also seeing 'image no longer exists' in ceph log:

2019-11-07 21:07:53.949 7f3233a6a700 0 rbd::mirror::PoolReplayer: 0x5640bcde86b0 handle_init_remote_pool_watcher: remote peer does not have mirroring configured
2019-11-07 21:24:38.562 7f320b316700 0 rbd::mirror::ImageReplayer: 0x7f3238070740 [2/3b318528-c009-4524-99e8-2d57ab092d97] handle_shut_down: mirror image no longer exists

Restarting the services with 'sudo systemctl restart ceph-rbd-mirror*' gets things looking better after a little while:

Unit Workload Agent Machine Public address Ports Message
ceph-rbd-mirror-b/0* active idle 13 10.5.0.22 Unit is ready (Pools OK (2) Images Secondary (4))
ceph-rbd-mirror/0* active idle 12 10.5.0.26 Unit is ready (Pools OK (2) Images Primary (4))

And rbd mirror pool status looks good:

ubuntu@juju-697320-zaza-381ba6318cf6-13:~$ sudo rbd --id rbd-mirror.juju-697320-zaza-381ba6318cf6-13 mirror pool status --verbose cinder-ceph
health: OK
images: 3 total
    3 replaying
rbd: failed to get service dump: (13) Permission denied

volume-9b2afcd8-2851-4b27-93b5-e9c96717b0d4:
  global_id: ad3dd728-4505-4d16-a445-775c878afcfc
  state: up+replaying
  description: replaying, master_position=[object_number=65, tag_tid=2, entry_tid=1], mirror_position=[object_number=65, tag_tid=2, entry_tid=1], entries_behind_master=0
  service:
  last_update: 2019-11-08 17:31:11

volume-aea850e0-066d-4489-9a1b-b7fd6ba58643:
  global_id: ac2b3be5-6734-4731-8bf4-9de26a8602fe
  state: up+replaying
  description: replaying, master_position=[object_number=65, tag_tid=2, entry_tid=1], mirror_position=[object_number=2, tag_tid=1, entry_tid=2982], entries_behind_master=63568
  service:
  last_update: 2019-11-08 17:31:14

volume-b3baefa0-5f9a-4933-8aa7-625b80f9108d:
  global_id: 903617e5-e578-40e1-aa76-e30f3a9e841e
  state: up+replaying
  description: replaying, master_position=[object_number=65, tag_tid=2, entry_tid=1], mirror_position=[object_number=65, tag_tid=2, entry_tid=1], entries_behind_master=0
  service:
  last_update: 2019-11-08 17:31:10

But re-running the test gets it back to the "Pools WARNING" state.

summary: - ceph-rbd-mirror function tests fail with ceph-osd using juju storage
+ ceph-rbd-mirror function tests fail with ceph-osd juju storage
Revision history for this message
Corey Bryant (corey.bryant) wrote : Re: ceph-rbd-mirror function tests fail with ceph-osd juju storage

I tried bionic-stein with juju storage and while the tests do appear to run fine, they take a very long time. Anyway, I think the juju storage issue might be separate from the rbd-mirror issue on train.

summary: - ceph-rbd-mirror function tests fail with ceph-osd juju storage
+ ceph-rbd-mirror function tests fail on train
Revision history for this message
Corey Bryant (corey.bryant) wrote :

I re-ran with all 6 ceph-osd units updated to be similar to the following, and that didn't change the behavior. It's still failing the same way.

  ceph-osd:
    charm: cs:~openstack-charmers-next/ceph-osd
    num_units: 3
    storage:
      osd-devices: 'cinder,20G'
    options:
      source: cloud:bionic-train

Revision history for this message
Corey Bryant (corey.bryant) wrote :

jamespage mentioned that rbd mirror worked OK on hardware at Ceph Day a few weeks back.

Revision history for this message
Frode Nordahl (fnordahl) wrote :

The test does indeed succeed on Stein with volume-backed OSDs, although it takes a horrendously long time.

Is this a CI infrastructure issue?

$ tox -e func-smoke
func-smoke installed: aodhclient==1.4.0,appdirs==1.4.3,async-generator==1.10,attrs==19.3.0,Babel==2.7.0,bcrypt==3.1.7,certifi==2019.9.11,cffi==1.13.2,chardet==3.0.4,cliff==2.16.0,cmd2==0.8.9,cryptography==2.8,debtcollector==1.22.0,decorator==4.4.1,dogpile.cache==0.9.0,fasteners==0.15,hvac==0.6.4,idna==2.8,importlib-metadata==0.23,iso8601==0.1.12,Jinja2==2.10.3,jmespath==0.9.4,jsonpatch==1.24,jsonpointer==2.0,jsonschema==3.1.1,juju==2.6.3,juju-wait==2.7.0,jujubundlelib==0.5.6,keystoneauth1==3.18.0,lxml==4.4.1,macaroonbakery==1.2.3,MarkupSafe==1.1.1,monotonic==1.5,more-itertools==7.2.0,msgpack==0.6.2,munch==2.5.0,netaddr==0.7.19,netifaces==0.10.9,openstacksdk==0.37.0,os-client-config==1.33.0,os-service-types==1.7.0,osc-lib==1.14.1,oslo.concurrency==3.30.0,oslo.config==6.11.1,oslo.context==2.23.0,oslo.i18n==3.24.0,oslo.log==3.44.1,oslo.serialization==2.29.2,oslo.utils==3.41.2,osprofiler==2.8.2,paramiko==2.6.0,pbr==5.4.3,pkg-resources==0.0.0,prettytable==0.7.2,protobuf==3.10.0,pyasn1==0.4.7,pycparser==2.19,pyinotify==0.9.6,pymacaroons==0.13.0,PyNaCl==1.3.0,pyOpenSSL==19.0.0,pyparsing==2.4.5,pyperclip==1.7.0,pyRFC3339==1.1,pyrsistent==0.15.5,python-cinderclient==5.0.0,python-dateutil==2.8.1,python-glanceclient==2.17.0,python-heatclient==1.18.0,python-keystoneclient==3.22.0,python-neutronclient==6.14.0,python-novaclient==16.0.0,python-octaviaclient==1.11.0,python-openstackclient==4.0.0,python-swiftclient==3.8.1,pytz==2019.3,PyYAML==5.1.2,requests==2.22.0,requestsexceptions==1.4.0,rfc3986==1.3.2,simplejson==3.16.0,six==1.13.0,stevedore==1.31.0,tenacity==6.0.0,theblues==0.5.2,toposort==1.5,urllib3==1.25.7,warlock==1.3.3,wcwidth==0.1.7,WebOb==1.8.5,websockets==7.0,wrapt==1.11.2,zaza==0.0.2.dev1,zaza.openstack==0.0.1.dev1,zipp==0.6.0
func-smoke runtests: PYTHONHASHSEED='0'
func-smoke runtests: commands[0] | functest-run-suite --keep-model --smoke
2019-11-13 05:49:03 [INFO] Deploying bundle './tests/bundles/bionic-stein.yaml' on to 'zaza-0317e0a8c505' model
2019-11-13 05:49:03 [INFO] Rendered template '<Template memory:7fa39a4e9b38>' to file '/tmp/tmpfb9dffum/local-charm-overlay.yaml'
2019-11-13 05:49:03 [INFO] Deploying overlay '/tmp/tmpfb9dffum/local-charm-overlay.yaml' on to 'zaza-0317e0a8c505' model
2019-11-13 05:49:07 [INFO] Resolving charm: cs:~openstack-charmers-next/ceph-mon
2019-11-13 05:49:07 [INFO] Resolving charm: cs:~openstack-charmers-next/ceph-mon
2019-11-13 05:49:07 [INFO] Resolving charm: cs:~openstack-charmers-next/ceph-osd
2019-11-13 05:49:07 [INFO] Resolving charm: cs:~openstack-charmers-next/ceph-osd
2019-11-13 05:49:07 [INFO] Resolving charm: cs:~openstack-charmers-next/cinder
2019-11-13 05:49:08 [INFO] Resolving charm: cs:~openstack-charmers-next/cinder-ceph
2019-11-13 05:49:08 [INFO] Resolving charm: cs:~openstack-charmers-next/glance
2019-11-13 05:49:08 [INFO] Resolving charm: cs:~openstack-charmers-next/keystone
2019-11-13 05:49:08 [INFO] Resolving charm: cs:~openstack-charmers-next/percona-cluster
2019-11-13 05:49:08 [INFO] Resolving charm: cs:~openstack-charmers-...

Revision history for this message
Corey Bryant (corey.bryant) wrote :

I have a deployment up on MAAS (icarus virt nodes, using /dev/vdb for OSDs) and while it's a lot faster:

2019-11-13 15:08:40 [INFO] Validate that a volume created through Cinder is mirrored.
2019-11-13 15:08:40 [INFO] ...
2019-11-13 15:08:44 [INFO] Using keystone API V3 (or later) for overcloud auth
2019-11-13 15:08:45 [WARNING] Version 2 is deprecated, use alternative version 3 instead.
2019-11-13 15:08:57 [INFO] creating
2019-11-13 15:09:01 [INFO] creating
2019-11-13 15:09:03 [INFO] downloading
2019-11-13 15:09:08 [INFO] downloading
2019-11-13 15:09:16 [INFO] downloading
2019-11-13 15:09:32 [INFO] downloading
2019-11-13 15:10:04 [INFO] available

it still hangs after this point with ceph-rbd-mirror-b/0 in the 'Pools WARNING' state.

$ sudo rbd --id rbd-mirror.juju-c2ffd9-6-lxd-1 mirror pool status --verbose cinder-ceph
health: WARNING
images: 1 total
    1 unknown

volume-61605fa3-59a4-42ae-81f9-863e0fb05598:
  global_id: f397d9e3-2a20-487c-b70f-54cd1e207f8f
  state: down+replaying
  description: replaying, master_position=[object_number=65, tag_tid=2, entry_tid=1], mirror_position=[object_number=65, tag_tid=2, entry_tid=1], entries_behind_master=0
  last_update: 2019-11-13 15:18:23

/var/log/ceph/ceph*.log from ceph-rbd-mirror-b/0 has:

2019-11-13 15:01:12.918 7f069f9fb300 1 mgrc service_daemon_register rbd-mirror.4813 metadata {arch=x86_64,ceph_release=nautilus,ceph_version=ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable),ceph_version_short=14.2.2,cpu=Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz,distro=ubuntu,distro_description=Ubuntu 18.04.3 LTS,distro_version=18.04,hostname=juju-c2ffd9-6-lxd-1,id=juju-c2ffd9-6-lxd-1,instance_id=4813,kernel_description=#78-Ubuntu SMP Wed Nov 6 11:30:24 UTC 2019,kernel_version=4.15.0-69-generic,mem_swap_kb=4194300,mem_total_kb=16424724,os=Linux}
2019-11-13 15:09:12.651 7f05ea7fc700 0 rbd::mirror::ImageReplayer: 0x7f067401fff0 [2/2d19d024-6352-4634-8699-5b83e82b1f8d] handle_shut_down: remote image no longer exists: scheduling deletion
2019-11-13 15:09:13.147 7f05ea7fc700 0 rbd::mirror::ImageReplayer: 0x7f067401fff0 [2/2d19d024-6352-4634-8699-5b83e82b1f8d] handle_shut_down: mirror image no longer exists

/var/log/ceph/ceph*.log from ceph-rbd-mirror/0 has:
2019-11-13 15:01:12.971 7f35e43db300 1 mgrc service_daemon_register rbd-mirror.5028 metadata {arch=x86_64,ceph_release=nautilus,ceph_version=ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable),ceph_version_short=14.2.2,cpu=Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz,distro=ubuntu,distro_description=Ubuntu 18.04.3 LTS,distro_version=18.04,hostname=juju-c2ffd9-3-lxd-1,id=juju-c2ffd9-3-lxd-1,instance_id=5028,kernel_description=#78-Ubuntu SMP Wed Nov 6 11:30:24 UTC 2019,kernel_version=4.15.0-69-generic,mem_swap_kb=4194300,mem_total_kb=16424724,os=Linux}
2019-11-13 15:01:43.059 7f35e43db300 -1 rbd::mirror::PoolReplayer: 0x55c43a3dc500 init: error accessing remote pool glance: (2) No such file or directory
2019-11-13 15:02:13.064 7f35e43db300 -1 rbd::mirror::Mirror: 0x55c43a270b70 update_pool_replayers: restarting failed pool replayer for uuid: 88c9d11b-e713-414a-abcd-b5d1a2c042db cluster: remote client: cl...

Revision history for this message
Liam Young (gnuoy) wrote :

There appears to be a problem with permissions when running the 'status' action.
Included in stdout is the non-fatal string 'rbd: failed to get service dump: (13) Permission denied', but that string prefixes the JSON dump of the status and hence breaks the JSON parsing.

It can be fixed by granting the client read-only access against the mgr, but I'm not sure at this point whether that is acceptable.
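
As an illustration of working around the parsing side specifically (a sketch only, not the charm's actual action code; the function name and the exact rbd invocation, including '--format json', are assumptions):

import json
import subprocess


def mirror_pool_status(pool, client_id):
    """Parse `rbd mirror pool status` output that may carry a non-JSON
    prefix such as 'rbd: failed to get service dump: (13) Permission
    denied' before the JSON document."""
    out = subprocess.check_output(
        ['rbd', '--id', client_id, 'mirror', 'pool', 'status',
         '--verbose', '--format', 'json', pool],
        universal_newlines=True)
    start = out.find('{')
    if start == -1:
        raise ValueError('no JSON in rbd output: {!r}'.format(out))
    return json.loads(out[start:])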

Changed in charm-ceph-rbd-mirror:
assignee: nobody → Liam Young (gnuoy)
Revision history for this message
James Page (james-page) wrote :

I think that's OK

Revision history for this message
Liam Young (gnuoy) wrote :

I think the change in behaviour upstream is from here: https://github.com/ceph/ceph/pull/24717/
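
For reference, granting that extra capability by hand would look roughly like the following. This is a sketch only, not the charms.ceph change itself; the client name is taken from the logs above and the mon/osd profiles are assumptions:

import subprocess

# Sketch only: replace the rbd-mirror client's caps so it can read the
# mgr service dump that `rbd mirror pool status` requests after the
# upstream change. Client name taken from the logs above; the mon/osd
# profiles shown are assumptions.
client = 'client.rbd-mirror.juju-697320-zaza-381ba6318cf6-13'
subprocess.check_call([
    'ceph', 'auth', 'caps', client,
    'mon', 'profile rbd-mirror',
    'osd', 'profile rbd',
    'mgr', 'allow r',
])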

Liam Young (gnuoy)
Changed in charm-ceph-mon:
assignee: nobody → Liam Young (gnuoy)
Changed in charm-ceph-rbd-mirror:
status: New → Invalid
Changed in charm-test-infra:
status: New → Invalid
Changed in charm-ceph-mon:
importance: Undecided → High
milestone: none → 20.05
status: New → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-mon (master)

Reviewed: https://review.opendev.org/729170
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-mon/commit/?id=59160b08151a4a8e5a743c141f45c529547254af
Submitter: Zuul
Branch: master

commit 59160b08151a4a8e5a743c141f45c529547254af
Author: Liam Young <email address hidden>
Date: Tue May 19 08:18:36 2020 +0000

    Sync charms.ceph

    Sync charms.ceph to pick up fix for LP #1851869

    Change-Id: Ie3027d030c71faec0108fcde7fdd624b29623d49
    Closes-Bug: #1851869
    Partial-Bug: #1879464

David Ames (thedac)
Changed in charm-ceph-mon:
status: Fix Committed → Fix Released