Comment 2 for bug 1880777

Bart Wensley (bartwensley) wrote : Re: VM live-migration command hangs

Although the failure in the VIM is the same as bug 1862049, I am re-opening this LP to address the problem that led to the VIM hang: there appears to be message loss between the VIM and the openstack pods. On the AIO-DX system where this problem was seen, I can see the VIM having trouble communicating with Cinder, not constantly, but several times an hour. The logs look like this:

2020-05-27T18:43:29.485 controller-0 VIM_Thread[2249636] INFO _task_worker_pool.py.73 Timeout worker BlockStorage-Worker-0
2020-05-27T18:43:30.490 controller-0 VIM_Thread[2249636] ERROR _task.py.200 Task(get_volumes) work (get_volumes) timed out, id=16520.
2020-05-27T18:43:30.492 controller-0 VIM_Thread[2249636] ERROR _vim_nfvi_audits.py.635 Audit-Volumes callback, not completed, responses={'completed': False, 'reason': '', 'page-request-id': '3e598409-1207-47b3-9dc2-3d8eccb2d1cb'}.
2020-05-27T18:45:16.438 controller-0 VIM_BlockStorage-Worker-0_Thread[1236728] INFO _timer_scheduler.py.57 Not scheduling on time, elapsed=127499 ms.
2020-05-27T18:45:16.439 controller-0 VIM_BlockStorage-Worker-0_Thread[1236728] INFO _thread.py.218 Thread BlockStorage-Worker-0: not scheduling on time
2020-05-27T18:45:16.439 controller-0 VIM_BlockStorage-Worker-0_Thread[1236728] INFO _thread.py.218 Thread BlockStorage-Worker-0: not scheduling on time
2020-05-27T18:45:16.439 controller-0 VIM_BlockStorage-Worker-0_Thread[1236728] INFO _thread.py.232 Thread BlockStorage-Worker-0: shutting down.
2020-05-27T18:45:16.440 controller-0 VIM_BlockStorage-Worker-0_Thread[1236728] INFO _thread.py.236 Thread BlockStorage-Worker-0: shutdown.
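
To help narrow down whether this is message loss on the network path versus Cinder simply responding slowly, something like the sketch below could be left running on the controller alongside the VIM. This is not VIM code, just a stand-alone probe written for this comment; the Cinder endpoint, token, timeout and poll interval are placeholders that would need to be filled in for the deployment (e.g. from "openstack endpoint list" and "openstack token issue").

# Hypothetical stand-alone probe - not part of the VIM. The endpoint, token
# and timing values below are placeholders for illustration only.
import time
import requests

CINDER_URL = "http://<cinder-api>:8776/v3/<project_id>/volumes"  # placeholder
TOKEN = "<keystone-token>"                                       # placeholder
TIMEOUT_S = 10    # roughly the order of the VIM worker timeout
INTERVAL_S = 30   # poll every 30 seconds

while True:
    start = time.monotonic()
    try:
        resp = requests.get(CINDER_URL,
                            headers={"X-Auth-Token": TOKEN},
                            timeout=TIMEOUT_S)
        elapsed_ms = (time.monotonic() - start) * 1000
        print("%s status=%s elapsed=%.0f ms"
              % (time.strftime("%Y-%m-%dT%H:%M:%S"), resp.status_code, elapsed_ms))
    except requests.exceptions.RequestException as exc:
        elapsed_ms = (time.monotonic() - start) * 1000
        print("%s FAILED after %.0f ms: %s"
              % (time.strftime("%Y-%m-%dT%H:%M:%S"), elapsed_ms, exc))
    time.sleep(INTERVAL_S)

If the probe's requests stall or fail at roughly the same times the VIM logs the worker timeouts above, that would point at the path between the controller and the openstack pods rather than at the VIM itself.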

Over a year ago, I noticed there was message loss between the VIM and the openstack pods and raised https://bugs.launchpad.net/starlingx/+bug/1817936.

I spent significant time on that bug, then Matt spent time on it. Finally, Austin Sun debugged further and came up with a fix: changing the sysctl setting net.ipv4.tcp_tw_reuse to 0. Looking at the system where this issue occurred, I see that this setting is still at 0 (a quick way to re-check it is sketched after the list below). This leads me to conclude one of the following:
- The fix for 1817936 did not fix the problem.
- Something changed after the fix was done and it is broken again (e.g. new kernel, new k8s version).
- We have another completely different bug.
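
For reference, a minimal sketch for re-confirming the current value on a running controller (this is just a check written for this comment, not part of the original fix; it reads the sysctl from /proc and prints a reminder if it differs):

from pathlib import Path

# net.ipv4.tcp_tw_reuse is exposed under /proc; the 1817936 fix set it to 0
value = Path("/proc/sys/net/ipv4/tcp_tw_reuse").read_text().strip()
print("net.ipv4.tcp_tw_reuse = %s" % value)
if value != "0":
    print("setting differs from the 1817936 fix; re-apply with "
          "'sysctl -w net.ipv4.tcp_tw_reuse=0' and re-test")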

The issue of message loss between the VIM and the openstack pods should be debugged under this LP.