Although the failure in the VIM is the same as bug 1862049, I am re-opening this LP to address the problem that led to the VIM hang - there appears to be message loss between the VIM and openstack pods. From the AIO-DX system where this problem was seen, I can see the VIM is having trouble communicating with Cinder (not all the time - but several times an hour). The logs look like this:
2020-05-27T18:43:29.485 controller-0 VIM_Thread[2249636] INFO _task_worker_pool.py.73 Timeout worker BlockStorage-Worker-0
2020-05-27T18:43:30.490 controller-0 VIM_Thread[2249636] ERROR _task.py.200 Task(get_volumes) work (get_volumes) timed out, id=16520.
2020-05-27T18:43:30.492 controller-0 VIM_Thread[2249636] ERROR _vim_nfvi_audits.py.635 Audit-Volumes callback, not completed, responses={'completed': False, 'reason': '', 'page-request-id': '3e598409-1207-47b3-9dc2-3d8eccb2d1cb'}.
2020-05-27T18:45:16.438 controller-0 VIM_BlockStorage-Worker-0_Thread[1236728] INFO _timer_scheduler.py.57 Not scheduling on time, elapsed=127499 ms.
2020-05-27T18:45:16.439 controller-0 VIM_BlockStorage-Worker-0_Thread[1236728] INFO _thread.py.218 Thread BlockStorage-Worker-0: not scheduling on time
2020-05-27T18:45:16.439 controller-0 VIM_BlockStorage-Worker-0_Thread[1236728] INFO _thread.py.218 Thread BlockStorage-Worker-0: not scheduling on time
2020-05-27T18:45:16.439 controller-0 VIM_BlockStorage-Worker-0_Thread[1236728] INFO _thread.py.232 Thread BlockStorage-Worker-0: shutting down.
2020-05-27T18:45:16.440 controller-0 VIM_BlockStorage-Worker-0_Thread[1236728] INFO _thread.py.236 Thread BlockStorage-Worker-0: shutdown.
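To help tell a slow or stuck Cinder apart from messages being lost on the way to the openstack pods, something along the lines of the following could be left running on the controller while the audits are failing. This is only a sketch: the Cinder endpoint URL, tenant ID and keystone token are placeholders I have not taken from this system and would need to be filled in from the service catalog (e.g. via "openstack token issue" and "openstack endpoint list").

import time
import requests

# Placeholders - replace with the real Cinder endpoint and a valid token
# for this deployment; these are NOT values taken from the failing system.
CINDER_URL = "http://<cinder-api-endpoint>:8776/v3/<tenant-id>/volumes"
TOKEN = "<keystone-token>"

while True:
    start = time.time()
    try:
        resp = requests.get(CINDER_URL,
                            headers={"X-Auth-Token": TOKEN},
                            timeout=10)
        elapsed = time.time() - start
        print("%s status=%s elapsed=%.3fs" %
              (time.strftime("%Y-%m-%dT%H:%M:%S"), resp.status_code, elapsed))
    except requests.exceptions.RequestException as exc:
        elapsed = time.time() - start
        print("%s request failed after %.3fs: %s" %
              (time.strftime("%Y-%m-%dT%H:%M:%S"), elapsed, exc))
    time.sleep(30)

If this shows requests hanging until the timeout several times an hour while Cinder itself is healthy, that points at connection/message loss rather than at the VIM.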
Over a year ago, I noticed there was message loss between the VIM and the openstack pods and raised https://bugs.launchpad.net/starlingx/+bug/1817936.
I spent significant time on that bug, then Matt spent time on it. Finally, Austin Sun debugged further and came up with a fix: changing the sysctl setting net.ipv4.tcp_tw_reuse to 0. Looking at the system where this issue occurred, I can see that this setting is still 0 (a quick way to re-check it is sketched after the list below). This leads me to conclude one of the following:
- The fix for 1817936 did not fix the problem.
- Something changed after the fix was done and it is broken again (e.g. new kernel, new k8s version).
- We have another completely different bug.
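For reference, a minimal way to re-check the setting Austin's fix touches is to read it straight from /proc on the controller. This sketch assumes nothing beyond being run on the host itself:

# Confirm the sysctl value the fix for 1817936 set; it should read back as 0.
def read_sysctl(name):
    # sysctl names use dots, /proc/sys uses path separators
    path = "/proc/sys/" + name.replace(".", "/")
    with open(path) as f:
        return f.read().strip()

value = read_sysctl("net.ipv4.tcp_tw_reuse")
print("net.ipv4.tcp_tw_reuse = %s" % value)
if value != "0":
    print("WARNING: value differs from what the 1817936 fix applied")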
The issue of message loss between the VIM and the openstack pods should be debugged under this LP.