Comment 4 for bug 2025096

Revision history for this message
Rajat Dhasmana (whoami-rajat) wrote :

Hi,

Dan has already proposed the patches but putting my analysis here just to keep a record of RCA and in case anyone is interested in knowing what happened :)

Using this job result to analyse logs:
https://b932a1446345e101b3ef-4740624f0848c8c3257f704064a4516f.ssl.cf5.rackcdn.com/883557/2/check/nova-ceph-multistore/d7db64f/

Nova logs state that we failed with a timeout

Jun 26 01:03:11.721921 np0034441113 nova-compute[97944]: ERROR oslo_messaging.rpc.server result = hub.switch()
Jun 26 01:03:11.721921 np0034441113 nova-compute[97944]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.10/dist-packages/eventlet/hubs/hub.py", line 313, in switch
Jun 26 01:03:11.721921 np0034441113 nova-compute[97944]: ERROR oslo_messaging.rpc.server return self.greenlet.switch()
Jun 26 01:03:11.721921 np0034441113 nova-compute[97944]: ERROR oslo_messaging.rpc.server eventlet.timeout.Timeout: 20 seconds

Looking at the cinder logs, we can analyse the time taken with the request ID for the reimage request

Request ID: req-25e4ae38-b629-4698-97ea-d5b712cb4209

First log:
Jun 26 01:02:49.782746 np0034441113 cinder-volume[99943]: DEBUG cinder.volume.volume_utils [req-25e4ae38-b629-4698-97ea-d5b712cb4209 req-e4bf7950-b372-4010-9fd0-80f48fdd26c1 tempest-ServerActionsV293TestJSON-85724339 None] Attempting download of e7228a06-ab6d-4ddd-9fa5-d23219175760 ((None, [{'url': 'file:///opt/stack/data/glance/images/e7228a06-ab6d-4ddd-9fa5-d23219175760', 'metadata': {'store': 'cheap'}}, {'url': 'rbd://937d1bdf-6a12-4028-b011-5e2cfa4ba014/images/e7228a06-ab6d-4ddd-9fa5-d23219175760/snap', 'metadata': {'store': 'robust'}}])) to volume c0dd890d-0f87-4409-996e-7d071475acfd. {{(pid=99943) copy_image_to_volume /opt/stack/cinder/cinder/volume/volume_utils.py:1190}}

Last Log:
Jun 26 01:03:23.661182 np0034441113 cinder-volume[99943]: DEBUG cinder.volume.manager [req-25e4ae38-b629-4698-97ea-d5b712cb4209 req-e4bf7950-b372-4010-9fd0-80f48fdd26c1 tempest-ServerActionsV293TestJSON-85724339 None] Re-imaged e7228a06-ab6d-4ddd-9fa5-d23219175760 to volume c0dd890d-0f87-4409-996e-7d071475acfd successfully. {{(pid=99943) reimage /opt/stack/cinder/cinder/volume/manager.py:5391}}

Total time taken ~34 seconds

The timeout for the reimage operation is set by the config option "reimage_timeout_per_gb" which defaults to 20 seconds per GB which causes the failure.