test_rebuild_volume_backed_server failing 100% on ceph job

Bug #2025096 reported by Ghanshyam Mann
This bug affects 1 person
Affects                    Status        Importance  Assigned to  Milestone
Cinder                     Invalid       Medium      Unassigned   -
OpenStack Compute (nova)   Invalid       Undecided   Unassigned   -
devstack                   Fix Released  Undecided   Unassigned   -
devstack-plugin-ceph       Invalid       Undecided   Unassigned   -
tempest                    Invalid       Undecided   Unassigned   -

Bug Description

There is some issue in the ceph job with password injection during the rebuild operation, and because of that the test is failing 100% of the time on the ceph job.

These tests pass on other jobs such as tempest-full-py3.

Failure logs:

- https://b932a1446345e101b3ef-4740624f0848c8c3257f704064a4516f.ssl.cf5.rackcdn.com/883557/2/check/nova-ceph-multistore/d7db64f/testr_results.html

    Response - Headers: {'date': 'Mon, 26 Jun 2023 01:07:28 GMT', 'server': 'Apache/2.4.52 (Ubuntu)', 'x-openstack-request-id': 'req-f707a2bb-a7c6-4e65-93e2-7cb8195dd331', 'connection': 'close', 'status': '204', 'content-location': 'https://10.209.99.44:9696/networking/v2.0/security-groups/63dc9e50-2d05-4cfa-912d-92a3c9283e42'}
        Body: b''
2023-06-26 01:07:28,442 108489 INFO [tempest.lib.common.rest_client] Request (ServerActionsV293TestJSON:_run_cleanups): 404 GET https://10.209.99.44:9696/networking/v2.0/security-groups/63dc9e50-2d05-4cfa-912d-92a3c9283e42 0.034s
2023-06-26 01:07:28,442 108489 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}
        Body: None
    Response - Headers: {'date': 'Mon, 26 Jun 2023 01:07:28 GMT', 'server': 'Apache/2.4.52 (Ubuntu)', 'content-type': 'application/json', 'content-length': '146', 'x-openstack-request-id': 'req-ae967163-b104-4ddf-b1e8-bb6da298b498', 'connection': 'close', 'status': '404', 'content-location': 'https://10.209.99.44:9696/networking/v2.0/security-groups/63dc9e50-2d05-4cfa-912d-92a3c9283e42'}
        Body: b'{"NeutronError": {"type": "SecurityGroupNotFound", "message": "Security group 63dc9e50-2d05-4cfa-912d-92a3c9283e42 does not exist", "detail": ""}}'
2023-06-26 01:07:29,135 108489 INFO [tempest.lib.common.rest_client] Request (ServerActionsV293TestJSON:_run_cleanups): 204 DELETE https://10.209.99.44:9696/networking/v2.0/floatingips/c6cc0747-06bd-4783-811d-2408a72db3cc 0.692s
2023-06-26 01:07:29,135 108489 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}
        Body: None
    Response - Headers: {'date': 'Mon, 26 Jun 2023 01:07:29 GMT', 'server': 'Apache/2.4.52 (Ubuntu)', 'x-openstack-request-id': 'req-e0797282-5cc1-4d86-b2ec-696feed6369a', 'connection': 'close', 'status': '204', 'content-location': 'https://10.209.99.44:9696/networking/v2.0/floatingips/c6cc0747-06bd-4783-811d-2408a72db3cc'}
        Body: b''

Traceback (most recent call last):
  File "/opt/stack/tempest/tempest/lib/common/ssh.py", line 136, in _get_ssh_connection
    ssh.connect(self.host, port=self.port, username=self.username,
  File "/opt/stack/tempest/.tox/tempest/lib/python3.10/site-packages/paramiko/client.py", line 365, in connect
    sock.connect(addr)
TimeoutError: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/stack/tempest/tempest/common/utils/__init__.py", line 70, in wrapper
    return f(*func_args, **func_kwargs)
  File "/opt/stack/tempest/tempest/api/compute/servers/test_server_actions.py", line 927, in test_rebuild_volume_backed_server
    linux_client.validate_authentication()
  File "/opt/stack/tempest/tempest/lib/common/utils/linux/remote_client.py", line 31, in wrapper
    return function(self, *args, **kwargs)
  File "/opt/stack/tempest/tempest/lib/common/utils/linux/remote_client.py", line 123, in validate_authentication
    self.ssh_client.test_connection_auth()
  File "/opt/stack/tempest/tempest/lib/common/ssh.py", line 245, in test_connection_auth
    connection = self._get_ssh_connection()
  File "/opt/stack/tempest/tempest/lib/common/ssh.py", line 155, in _get_ssh_connection
    raise exceptions.SSHTimeout(host=self.host,
tempest.lib.exceptions.SSHTimeout: Connection to the 172.24.5.20 via SSH timed out.
User: cirros, Password: rebuildPassw0rd
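
For context, the failing step is tempest's password-based SSH check: after the rebuild, the test logs in to the guest with the new admin password, and that connection is what times out. Roughly, the check boils down to a paramiko connect like the sketch below (host and credentials are the illustrative values from the error above; this is not tempest's actual code):

    import paramiko

    # Minimal sketch of the SSH password check that times out.
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    try:
        client.connect(
            "172.24.5.20",               # floating IP of the rebuilt server
            port=22,
            username="cirros",
            password="rebuildPassw0rd",  # admin password set during rebuild
            timeout=10,                  # per-attempt socket timeout, seconds (illustrative)
            allow_agent=False,
            look_for_keys=False,
        )
    finally:
        client.close()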

Revision history for this message
yatin (yatinkarel) wrote :
tags: added: gate-failure
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :
Changed in nova:
status: New → Confirmed
Revision history for this message
Dan Smith (danms) wrote :

Hopefully this series will fix it:

https://review.opendev.org/q/topic:reimage-timeout

Looks like we were too impatient with slow CI workers that have very limited IO for the reimage.

tags: added: ceph drivers
Changed in cinder:
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
Rajat Dhasmana (whoami-rajat) wrote :

Hi,

Dan has already proposed the patches, but I'm putting my analysis here to keep a record of the RCA and in case anyone is interested in knowing what happened :)

Using this job result to analyse logs:
https://b932a1446345e101b3ef-4740624f0848c8c3257f704064a4516f.ssl.cf5.rackcdn.com/883557/2/check/nova-ceph-multistore/d7db64f/

The nova logs show that we failed with a timeout:

Jun 26 01:03:11.721921 np0034441113 nova-compute[97944]: ERROR oslo_messaging.rpc.server result = hub.switch()
Jun 26 01:03:11.721921 np0034441113 nova-compute[97944]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/eventlet/hubs/hub.py", line 313, in switch
Jun 26 01:03:11.721921 np0034441113 nova-compute[97944]: ERROR oslo_messaging.rpc.server return self.greenlet.switch()
Jun 26 01:03:11.721921 np0034441113 nova-compute[97944]: ERROR oslo_messaging.rpc.server eventlet.timeout.Timeout: 20 seconds
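
That 20-second figure is an eventlet timeout wrapped around nova's wait for the reimage to finish. A minimal sketch of the pattern, with the wait itself stubbed out (this shows only the timeout mechanics, not nova's actual wait code):

    import eventlet
    from eventlet import timeout

    def wait_for_reimage(volume_id):
        # Stand-in for nova blocking until cinder reports the reimage done.
        eventlet.sleep(34)  # the real reimage took ~34 seconds (see below)

    try:
        with timeout.Timeout(20):  # expires before the 34-second reimage completes
            wait_for_reimage("c0dd890d-0f87-4409-996e-7d071475acfd")
    except timeout.Timeout:
        print("reimage wait timed out after 20 seconds")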

Looking at the cinder logs, we can use the request ID of the reimage request to analyse the time taken.

Request ID: req-25e4ae38-b629-4698-97ea-d5b712cb4209

First log:
Jun 26 01:02:49.782746 np0034441113 cinder-volume[99943]: DEBUG cinder.volume.volume_utils [req-25e4ae38-b629-4698-97ea-d5b712cb4209 req-e4bf7950-b372-4010-9fd0-80f48fdd26c1 tempest-ServerActionsV293TestJSON-85724339 None] Attempting download of e7228a06-ab6d-4ddd-9fa5-d23219175760 ((None, [{'url': 'file:///opt/stack/data/glance/images/e7228a06-ab6d-4ddd-9fa5-d23219175760', 'metadata': {'store': 'cheap'}}, {'url': 'rbd://937d1bdf-6a12-4028-b011-5e2cfa4ba014/images/e7228a06-ab6d-4ddd-9fa5-d23219175760/snap', 'metadata': {'store': 'robust'}}])) to volume c0dd890d-0f87-4409-996e-7d071475acfd. {{(pid=99943) copy_image_to_volume /opt/stack/cinder/cinder/volume/volume_utils.py:1190}}

Last Log:
Jun 26 01:03:23.661182 np0034441113 cinder-volume[99943]: DEBUG cinder.volume.manager [req-25e4ae38-b629-4698-97ea-d5b712cb4209 req-e4bf7950-b372-4010-9fd0-80f48fdd26c1 tempest-ServerActionsV293TestJSON-85724339 None] Re-imaged e7228a06-ab6d-4ddd-9fa5-d23219175760 to volume c0dd890d-0f87-4409-996e-7d071475acfd successfully. {{(pid=99943) reimage /opt/stack/cinder/cinder/volume/manager.py:5391}}

Total time taken: ~34 seconds
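
As a quick check of that arithmetic, the delta between the two timestamps above can be computed directly (the parsing assumes the journald-style "Jun 26 01:02:49.782746" prefix shown in these logs):

    from datetime import datetime

    # Timestamps copied from the first and last log lines for the request ID.
    fmt = "%b %d %H:%M:%S.%f"
    start = datetime.strptime("Jun 26 01:02:49.782746", fmt)
    end = datetime.strptime("Jun 26 01:03:23.661182", fmt)
    print((end - start).total_seconds())  # ~33.9 seconds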

The timeout for the reimage operation is set by the config option "reimage_timeout_per_gb", which defaults to 20 seconds per GB; with the download taking ~34 seconds, that is what causes the failure.
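
In other words, the 20-second eventlet timeout above suggests a 1 GB boot volume, and the ~34-second download blows past it. A hedged sketch of the per-GB arithmetic (the real computation lives in nova and may differ in detail, e.g. rounding or minimum bounds):

    # Illustrative values; the volume size is an assumption implied by the
    # 20-second eventlet timeout seen in the nova logs.
    reimage_timeout_per_gb = 20      # default, seconds per GB
    volume_size_gb = 1

    timeout_s = reimage_timeout_per_gb * volume_size_gb
    elapsed_s = 34                   # observed reimage duration from the logs

    assert elapsed_s > timeout_s     # 34 > 20: the wait times out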

Revision history for this message
Ghanshyam Mann (ghanshyammann) wrote :

As the fix was in devstack, I am marking the rest of the projects as invalid. Thanks Dan and Rajat for working on this.

Changed in tempest:
status: New → Invalid
Changed in devstack:
status: New → Fix Committed
Changed in devstack-plugin-ceph:
status: New → Invalid
Changed in cinder:
status: Confirmed → Invalid
Changed in nova:
status: Confirmed → Invalid
Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :

Please use "Fix released" instead of "Fix committed", the latter is a half-pending state that never get's resolved

Changed in devstack:
status: Fix Committed → Fix Released