ceph-rbd-mirror function tests fail on train

Bug #1851869 reported by Corey Bryant on 2019-11-08
This bug affects 1 person
Affects: OpenStack Charm Test Infra (Importance: Undecided, Assigned to: Unassigned)
Affects: charm-ceph-rbd-mirror (Importance: Undecided, Assigned to: Unassigned)

Bug Description

charm-ceph-rbd-mirror function tests for train are failing. See gerrit review: https://review.opendev.org/#/c/691022/

In the train bundle we had to switch to juju storage for the ceph-osd units, which I think has something to do with the failure. Switching bionic-stein to juju storage also results in the same initial timeout issues with test_cinder_volume_mirrored (see more below).

The only bundle difference between bionic-stein.yaml and bionic-train.yaml, other than source/openstack-origin changes, is using juju storage for ceph-osd:

  # bionic-stein
  ceph-osd:
    charm: cs:~openstack-charmers-next/ceph-osd
    num_units: 3
    options:
      source: cloud:bionic-stein
      bluestore: False
      use-direct-io: False
      osd-devices: /opt

  # bionic-train
  ceph-osd:
    charm: cs:~openstack-charmers-next/ceph-osd
    num_units: 3
    storage:
      osd-devices: 'cinder,10G'
    options:
      source: cloud:bionic-train
      bluestore: False
      use-direct-io: False
      osd-devices: '/dev/test-non-existent'
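
To double-check where the OSDs actually land in each variant, a quick report over the deployed model helps. A sketch (osd_device_report is my name, not an existing helper, and the 'Stdout' key assumed in run_on_unit's result dict may differ across zaza versions):

import zaza.model

def osd_device_report(application='ceph-osd'):
    """Print block devices on each ceph-osd unit (sketch)."""
    for unit in zaza.model.get_units(application):
        # With juju storage the OSD should sit on an attached cinder
        # volume (e.g. /dev/vdb); the directory-backed variant uses /opt.
        result = zaza.model.run_on_unit(
            unit.entity_id, 'lsblk -o NAME,SIZE,TYPE,MOUNTPOINT')
        print(unit.entity_id)
        print(result.get('Stdout', ''))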

The first failure I came across was timeouts with test_cinder_volume_mirrored. Note that this also occurs with bionic-stein when using juju storage as above. So I tried updating create_volume_from_image() as follows:

--- a/zaza/openstack/charm_tests/ceph/rbd_mirror/tests.py
+++ b/zaza/openstack/charm_tests/ceph/rbd_mirror/tests.py
@@ -210,8 +210,13 @@ class CephRBDMirrorTest(CephRBDMirrorBase):
                 return
             volume = cinder.volumes.create(8, name='zaza', imageRef=image.id)
             try:
+                # Note(coreycb): stop_after_attempt is increased because
+                # using juju storage for ceph-osd backed by cinder on the
+                # undercloud takes longer than the prior method of
+                # directory-backed OSD devices.
                 openstack.resource_reaches_status(
-                    cinder.volumes, volume.id, msg='volume')
+                    cinder.volumes, volume.id, msg='volume',
+                    stop_after_attempt=30)
                 return volume
             except AssertionError:
                 logging.info('retrying')
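
For context on why bumping stop_after_attempt helps, here is a rough sketch of the retry pattern behind resource_reaches_status, assuming it wraps tenacity with capped exponential backoff (only the stop_after_attempt parameter is confirmed by the diff above; the other names and defaults are my guesses):

import logging

import tenacity

def resource_reaches_status(resource, resource_id,
                            expected_status='available', msg='resource',
                            stop_after_attempt=8):
    """Poll a resource until it reaches expected_status (sketch)."""
    @tenacity.retry(
        # Back off exponentially but cap the wait at 60s, which would
        # match the roughly one-minute poll cadence in the log below.
        wait=tenacity.wait_exponential(multiplier=1, max=60),
        stop=tenacity.stop_after_attempt(stop_after_attempt),
        reraise=True)
    def _check():
        status = resource.get(resource_id).status
        logging.info(status)
        assert status == expected_status, msg
    _check()

Under those assumed defaults, 8 attempts give up after only a couple of minutes of polling, while 30 attempts allow well over 20 minutes.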

That allows the volume to become available after a long time:

(func-smoke) ubuntu@coreycb-bastion:~/charms/bionic/charm-ceph-rbd-mirror/build/builds/ceph-rbd-mirror$ functest-test -m zaza-381ba6318cf6
2019-11-08 15:07:15 [INFO] ## Running Test zaza.openstack.charm_tests.ceph.rbd_mirror.tests.CephRBDMirrorTest ##
2019-11-08 15:07:24 [INFO] Using keystone API V3 (or later) for overcloud auth
2019-11-08 15:07:26 [INFO] test_cinder_volume_mirrored (zaza.openstack.charm_tests.ceph.rbd_mirror.tests.CephRBDMirrorTest)
2019-11-08 15:07:26 [INFO] Validate that a volume created through Cinder is mirrored.
2019-11-08 15:07:26 [INFO] ...
2019-11-08 15:07:30 [INFO] Using keystone API V3 (or later) for overcloud auth
2019-11-08 15:07:31 [WARNING] Version 2 is deprecated, use alternative version 3 instead.
2019-11-08 15:07:34 [INFO] creating
2019-11-08 15:07:35 [INFO] creating
2019-11-08 15:07:37 [INFO] creating
2019-11-08 15:07:42 [INFO] downloading
2019-11-08 15:07:50 [INFO] downloading
2019-11-08 15:08:06 [INFO] downloading
2019-11-08 15:08:38 [INFO] downloading
2019-11-08 15:09:38 [INFO] downloading
2019-11-08 15:10:39 [INFO] downloading
2019-11-08 15:11:39 [INFO] downloading
2019-11-08 15:12:39 [INFO] downloading
2019-11-08 15:13:40 [INFO] downloading
2019-11-08 15:14:40 [INFO] downloading
2019-11-08 15:15:40 [INFO] downloading
2019-11-08 15:16:40 [INFO] downloading
2019-11-08 15:17:41 [INFO] downloading
2019-11-08 15:18:41 [INFO] available

But the tests hang after that. At some point during that test the workload status switched to a Pools WARNING for ceph-rbd-mirror-b/0, which may explain the hang:

Unit Workload Agent Machine Public address Ports Message
ceph-mon-b/0 active idle 3 10.5.0.48 Unit is ready and clustered
ceph-mon-b/1* active idle 4 10.5.0.9 Unit is ready and clustered
ceph-mon-b/2 active idle 5 10.5.0.36 Unit is ready and clustered
ceph-mon/0* active idle 0 10.5.0.8 Unit is ready and clustered
ceph-mon/1 active idle 1 10.5.0.45 Unit is ready and clustered
ceph-mon/2 active idle 2 10.5.0.17 Unit is ready and clustered
ceph-osd-b/0* active idle 9 10.5.0.10 Unit is ready (1 OSD)
ceph-osd-b/1 active idle 10 10.5.0.13 Unit is ready (1 OSD)
ceph-osd-b/2 active idle 11 10.5.0.21 Unit is ready (1 OSD)
ceph-osd/0 active idle 6 10.5.0.47 Unit is ready (1 OSD)
ceph-osd/1 active idle 7 10.5.0.52 Unit is ready (1 OSD)
ceph-osd/2* active idle 8 10.5.0.14 Unit is ready (1 OSD)
ceph-rbd-mirror-b/0* blocked idle 13 10.5.0.22 Unit is ready (Pools WARNING (1) OK (1) Images unknown (1) Secondary (1))
ceph-rbd-mirror/0* active idle 12 10.5.0.26 Unit is ready (Pools OK (2) Images Primary (2))
cinder/0* active idle 14 10.5.0.40 8776/tcp Unit is ready
  cinder-ceph/0* active idle 10.5.0.40 Unit is ready
glance/0* active idle 15 10.5.0.38 9292/tcp Unit is ready
keystone/0* active idle 16 10.5.0.30 5000/tcp Unit is ready
mysql/0* active idle 17 10.5.0.27 3306/tcp Unit is ready
rabbitmq-server/0* active idle 18 10.5.0.39 5672/tcp Unit is ready

Looking deeper at rbd mirror pool commands:

ubuntu@juju-697320-zaza-381ba6318cf6-13:/var/lib/juju/agents/unit-ceph-rbd-mirror-b-0/charm$ sudo rbd --id rbd-mirror.juju-697320-zaza-381ba6318cf6-13 mirror pool info cinder-ceph
Mode: pool
Peers:
  UUID NAME CLIENT
  7e39c18b-9ef8-435e-a671-5b5b27ebc99f remote client.rbd-mirror.juju-697320-zaza-381ba6318cf6-13

ubuntu@juju-697320-zaza-381ba6318cf6-13:/var/lib/juju/agents/unit-ceph-rbd-mirror-b-0/charm$ sudo rbd --id rbd-mirror.juju-697320-zaza-381ba6318cf6-13 mirror pool status --verbose cinder-ceph
health: WARNING
images: 1 total
    1 unknown

volume-9b2afcd8-2851-4b27-93b5-e9c96717b0d4:
  global_id: ad3dd728-4505-4d16-a445-775c878afcfc
  state: down+replaying
  description: replaying, master_position=[object_number=65, tag_tid=2, entry_tid=1], mirror_position=[object_number=65, tag_tid=2, entry_tid=1], entries_behind_master=0
  last_update: 2019-11-08 15:03:28
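
In principle the test could wait for the pool to report healthy again instead of hanging. A minimal sketch of such a poll, shelling out to rbd with JSON output (the 'summary'/'health' key layout is an assumption based on Nautilus output and should be verified; the --id value is a placeholder):

import json
import subprocess
import time

def wait_for_mirror_health(pool='cinder-ceph',
                           ceph_id='rbd-mirror.juju-<machine>',  # placeholder
                           timeout=900, interval=10):
    """Poll 'rbd mirror pool status' until health is OK (sketch)."""
    health = None
    deadline = time.time() + timeout
    while time.time() < deadline:
        out = subprocess.check_output(
            ['rbd', '--id', ceph_id, 'mirror', 'pool', 'status',
             '--format', 'json', pool])
        # Assumed JSON shape: {"summary": {"health": "OK", ...}, ...}
        health = json.loads(out).get('summary', {}).get('health')
        if health == 'OK':
            return
        time.sleep(interval)
    raise TimeoutError('pool {} health still {} after {}s'.format(
        pool, health, timeout))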

On ceph-rbd-mirror-b/0 I'm seeing 'image no longer exists' in the ceph log:

2019-11-07 21:24:11.640 7f5d7f316700 0 rbd::mirror::ImageReplayer: 0x7f5da8028e70 [2/3b318528-c009-4524-99e8-2d57ab092d97] handle_shut_down: remote image no longer exists: scheduling deletion
2019-11-07 21:24:27.664 7f5d7f316700 0 rbd::mirror::ImageReplayer: 0x7f5da8028e70 [2/3b318528-c009-4524-99e8-2d57ab092d97] handle_shut_down: mirror image no longer exists

Also on ceph-rbd-mirror-b/0, I'm seeing a lot of timeout messages like this in the ceph log:

2019-11-08 15:10:08.110 7f5da7fff700 -1 rbd::mirror::InstanceWatcher: C_NotifyInstanceRequest: 0x7f5da805def0 finish: resending after timeout

On ceph-rbd-mirror/0 I'm also seeing 'image no longer exists' in the ceph log:

2019-11-07 21:07:53.949 7f3233a6a700 0 rbd::mirror::PoolReplayer: 0x5640bcde86b0 handle_init_remote_pool_watcher: remote peer does not have mirroring configured
2019-11-07 21:24:38.562 7f320b316700 0 rbd::mirror::ImageReplayer: 0x7f3238070740 [2/3b318528-c009-4524-99e8-2d57ab092d97] handle_shut_down: mirror image no longer exists

Restarting the services with 'sudo systemctl restart ceph-rbd-mirror*' gets things looking better after a little while:

Unit Workload Agent Machine Public address Ports Message
ceph-rbd-mirror-b/0* active idle 13 10.5.0.22 Unit is ready (Pools OK (2) Images Secondary (4))
ceph-rbd-mirror/0* active idle 12 10.5.0.26 Unit is ready (Pools OK (2) Images Primary (4))

And rbd mirror pool status looks good:

ubuntu@juju-697320-zaza-381ba6318cf6-13:~$ sudo rbd --id rbd-mirror.juju-697320-zaza-381ba6318cf6-13 mirror pool status --verbose cinder-ceph
health: OK
images: 3 total
    3 replaying
rbd: failed to get service dump: (13) Permission denied

volume-9b2afcd8-2851-4b27-93b5-e9c96717b0d4:
  global_id: ad3dd728-4505-4d16-a445-775c878afcfc
  state: up+replaying
  description: replaying, master_position=[object_number=65, tag_tid=2, entry_tid=1], mirror_position=[object_number=65, tag_tid=2, entry_tid=1], entries_behind_master=0
  service:
  last_update: 2019-11-08 17:31:11

volume-aea850e0-066d-4489-9a1b-b7fd6ba58643:
  global_id: ac2b3be5-6734-4731-8bf4-9de26a8602fe
  state: up+replaying
  description: replaying, master_position=[object_number=65, tag_tid=2, entry_tid=1], mirror_position=[object_number=2, tag_tid=1, entry_tid=2982], entries_behind_master=63568
  service:
  last_update: 2019-11-08 17:31:14

volume-b3baefa0-5f9a-4933-8aa7-625b80f9108d:
  global_id: 903617e5-e578-40e1-aa76-e30f3a9e841e
  state: up+replaying
  description: replaying, master_position=[object_number=65, tag_tid=2, entry_tid=1], mirror_position=[object_number=65, tag_tid=2, entry_tid=1], entries_behind_master=0
  service:
  last_update: 2019-11-08 17:31:10

But re-running the test gets it back to "Pools WARNING" state.
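
If the restart needs to be scripted into the test as a temporary workaround, something along these lines should work (a sketch; restart_rbd_mirror is my name, not a zaza helper, and run_on_unit's return format can vary by zaza version):

import zaza.model

def restart_rbd_mirror(application='ceph-rbd-mirror-b'):
    """Restart the rbd-mirror daemons on every unit of application (sketch)."""
    for unit in zaza.model.get_units(application):
        # Quote the glob so it expands on the unit rather than locally;
        # juju run executes as root, so no sudo is needed.
        zaza.model.run_on_unit(
            unit.entity_id, "systemctl restart 'ceph-rbd-mirror*'")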

summary changed:
- ceph-rbd-mirror function tests fail with ceph-osd using juju storage
+ ceph-rbd-mirror function tests fail with ceph-osd juju storage

I tried bionic-stein with juju storage and while the tests do appear to run fine, they take a very long time. Anyway, I think the juju storage issue might be separate from the rbd-mirror issue on train.


summary changed:
- ceph-rbd-mirror function tests fail with ceph-osd juju storage
+ ceph-rbd-mirror function tests fail on train
Corey Bryant (corey.bryant) wrote :

I re-ran with all 6 ceph-osd units updated to be similar to the following, and that didn't change the behavior. It's still failing the same way.

  ceph-osd:
    charm: cs:~openstack-charmers-next/ceph-osd
    num_units: 3
    storage:
      osd-devices: 'cinder,20G'
    options:
      source: cloud:bionic-train

Corey Bryant (corey.bryant) wrote :

jamespage mentioned that rbd-mirror worked OK on hardware at Ceph Day a few weeks back.

Frode Nordahl (fnordahl) wrote :

The test does indeed succeed on Stein with volume-backed OSDs, albeit taking a horrendously long time.

Is this a CI infrastructure issue?

$ tox -e func-smoke
func-smoke installed: aodhclient==1.4.0,appdirs==1.4.3,async-generator==1.10,attrs==19.3.0,Babel==2.7.0,bcrypt==3.1.7,certifi==2019.9.11,cffi==1.13.2,chardet==3.0.4,cliff==2.16.0,cmd2==0.8.9,cryptography==2.8,debtcollector==1.22.0,decorator==4.4.1,dogpile.cache==0.9.0,fasteners==0.15,hvac==0.6.4,idna==2.8,importlib-metadata==0.23,iso8601==0.1.12,Jinja2==2.10.3,jmespath==0.9.4,jsonpatch==1.24,jsonpointer==2.0,jsonschema==3.1.1,juju==2.6.3,juju-wait==2.7.0,jujubundlelib==0.5.6,keystoneauth1==3.18.0,lxml==4.4.1,macaroonbakery==1.2.3,MarkupSafe==1.1.1,monotonic==1.5,more-itertools==7.2.0,msgpack==0.6.2,munch==2.5.0,netaddr==0.7.19,netifaces==0.10.9,openstacksdk==0.37.0,os-client-config==1.33.0,os-service-types==1.7.0,osc-lib==1.14.1,oslo.concurrency==3.30.0,oslo.config==6.11.1,oslo.context==2.23.0,oslo.i18n==3.24.0,oslo.log==3.44.1,oslo.serialization==2.29.2,oslo.utils==3.41.2,osprofiler==2.8.2,paramiko==2.6.0,pbr==5.4.3,pkg-resources==0.0.0,prettytable==0.7.2,protobuf==3.10.0,pyasn1==0.4.7,pycparser==2.19,pyinotify==0.9.6,pymacaroons==0.13.0,PyNaCl==1.3.0,pyOpenSSL==19.0.0,pyparsing==2.4.5,pyperclip==1.7.0,pyRFC3339==1.1,pyrsistent==0.15.5,python-cinderclient==5.0.0,python-dateutil==2.8.1,python-glanceclient==2.17.0,python-heatclient==1.18.0,python-keystoneclient==3.22.0,python-neutronclient==6.14.0,python-novaclient==16.0.0,python-octaviaclient==1.11.0,python-openstackclient==4.0.0,python-swiftclient==3.8.1,pytz==2019.3,PyYAML==5.1.2,requests==2.22.0,requestsexceptions==1.4.0,rfc3986==1.3.2,simplejson==3.16.0,six==1.13.0,stevedore==1.31.0,tenacity==6.0.0,theblues==0.5.2,toposort==1.5,urllib3==1.25.7,warlock==1.3.3,wcwidth==0.1.7,WebOb==1.8.5,websockets==7.0,wrapt==1.11.2,zaza==0.0.2.dev1,zaza.openstack==0.0.1.dev1,zipp==0.6.0
func-smoke runtests: PYTHONHASHSEED='0'
func-smoke runtests: commands[0] | functest-run-suite --keep-model --smoke
2019-11-13 05:49:03 [INFO] Deploying bundle './tests/bundles/bionic-stein.yaml' on to 'zaza-0317e0a8c505' model
2019-11-13 05:49:03 [INFO] Rendered template '<Template memory:7fa39a4e9b38>' to file '/tmp/tmpfb9dffum/local-charm-overlay.yaml'
2019-11-13 05:49:03 [INFO] Deploying overlay '/tmp/tmpfb9dffum/local-charm-overlay.yaml' on to 'zaza-0317e0a8c505' model
2019-11-13 05:49:07 [INFO] Resolving charm: cs:~openstack-charmers-next/ceph-mon
2019-11-13 05:49:07 [INFO] Resolving charm: cs:~openstack-charmers-next/ceph-mon
2019-11-13 05:49:07 [INFO] Resolving charm: cs:~openstack-charmers-next/ceph-osd
2019-11-13 05:49:07 [INFO] Resolving charm: cs:~openstack-charmers-next/ceph-osd
2019-11-13 05:49:07 [INFO] Resolving charm: cs:~openstack-charmers-next/cinder
2019-11-13 05:49:08 [INFO] Resolving charm: cs:~openstack-charmers-next/cinder-ceph
2019-11-13 05:49:08 [INFO] Resolving charm: cs:~openstack-charmers-next/glance
2019-11-13 05:49:08 [INFO] Resolving charm: cs:~openstack-charmers-next/keystone
2019-11-13 05:49:08 [INFO] Resolving charm: cs:~openstack-charmers-next/percona-cluster
2019-11-13 05:49:08 [INFO] Resolving charm: cs:~openstack-charmers-...

Corey Bryant (corey.bryant) wrote :

I have a deployment up on MAAS (icarus virt nodes, using /dev/vdb for OSDs) and while it's a lot faster:

2019-11-13 15:08:40 [INFO] Validate that a volume created through Cinder is mirrored.
2019-11-13 15:08:40 [INFO] ...
2019-11-13 15:08:44 [INFO] Using keystone API V3 (or later) for overcloud auth
2019-11-13 15:08:45 [WARNING] Version 2 is deprecated, use alternative version 3 instead.
2019-11-13 15:08:57 [INFO] creating
2019-11-13 15:09:01 [INFO] creating
2019-11-13 15:09:03 [INFO] downloading
2019-11-13 15:09:08 [INFO] downloading
2019-11-13 15:09:16 [INFO] downloading
2019-11-13 15:09:32 [INFO] downloading
2019-11-13 15:10:04 [INFO] available

it still hangs after this point with ceph-rbd-mirror-b/0 Pools WARNING.

$ sudo rbd --id rbd-mirror.juju-c2ffd9-6-lxd-1 mirror pool status --verbose cinder-ceph
health: WARNING
images: 1 total
    1 unknown

volume-61605fa3-59a4-42ae-81f9-863e0fb05598:
  global_id: f397d9e3-2a20-487c-b70f-54cd1e207f8f
  state: down+replaying
  description: replaying, master_position=[object_number=65, tag_tid=2, entry_tid=1], mirror_position=[object_number=65, tag_tid=2, entry_tid=1], entries_behind_master=0
  last_update: 2019-11-13 15:18:23

/var/log/ceph/ceph*.log from ceph-rbd-mirror-b/0 has:

2019-11-13 15:01:12.918 7f069f9fb300 1 mgrc service_daemon_register rbd-mirror.4813 metadata {arch=x86_64,ceph_release=nautilus,ceph_version=ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable),ceph_version_short=14.2.2,cpu=Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz,distro=ubuntu,distro_description=Ubuntu 18.04.3 LTS,distro_version=18.04,hostname=juju-c2ffd9-6-lxd-1,id=juju-c2ffd9-6-lxd-1,instance_id=4813,kernel_description=#78-Ubuntu SMP Wed Nov 6 11:30:24 UTC 2019,kernel_version=4.15.0-69-generic,mem_swap_kb=4194300,mem_total_kb=16424724,os=Linux}
2019-11-13 15:09:12.651 7f05ea7fc700 0 rbd::mirror::ImageReplayer: 0x7f067401fff0 [2/2d19d024-6352-4634-8699-5b83e82b1f8d] handle_shut_down: remote image no longer exists: scheduling deletion
2019-11-13 15:09:13.147 7f05ea7fc700 0 rbd::mirror::ImageReplayer: 0x7f067401fff0 [2/2d19d024-6352-4634-8699-5b83e82b1f8d] handle_shut_down: mirror image no longer exists

/var/log/ceph/ceph*.log from ceph-rbd-mirror/0 has:
2019-11-13 15:01:12.971 7f35e43db300 1 mgrc service_daemon_register rbd-mirror.5028 metadata {arch=x86_64,ceph_release=nautilus,ceph_version=ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable),ceph_version_short=14.2.2,cpu=Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz,distro=ubuntu,distro_description=Ubuntu 18.04.3 LTS,distro_version=18.04,hostname=juju-c2ffd9-3-lxd-1,id=juju-c2ffd9-3-lxd-1,instance_id=5028,kernel_description=#78-Ubuntu SMP Wed Nov 6 11:30:24 UTC 2019,kernel_version=4.15.0-69-generic,mem_swap_kb=4194300,mem_total_kb=16424724,os=Linux}
2019-11-13 15:01:43.059 7f35e43db300 -1 rbd::mirror::PoolReplayer: 0x55c43a3dc500 init: error accessing remote pool glance: (2) No such file or directory
2019-11-13 15:02:13.064 7f35e43db300 -1 rbd::mirror::Mirror: 0x55c43a270b70 update_pool_replayers: restarting failed pool replayer for uuid: 88c9d11b-e713-414a-abcd-b5d1a2c042db cluster: remote client: cl...

