Periodic message loss between VIM and Openstack pods

Bug #1880777 reported by Frank Miller on 2020-05-27
12
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
High
Angie Wang

Bug Description

Brief Description
-----------------
The following command hung indefinitely. The command was issued:
2020-05-26T22:02:57.000 controller-0 -sh: info HISTORY: PID=2755566 UID=42425 openstack server migrate --live-migration b657ba58-aa77-4945-979b-8e2b9d9a8111

And was still hanging 4 hours later. Initial debugging indicates the nfv-vim process seems to no longer be functioning (no logs generated) despite the nfv-vim processes showing to be alive via the ps -ef |grep nfv command.

Severity
--------
Major - VM or any operations that go through the nfv-vim process no longer function.

Steps to Reproduce
------------------
Unknown.
The system was installed on May 21 and left mostly idle over the next 5 days.
On May 26 3 VMs were launched via Horizon.
VM migration from CLI resulted in command hanging:
[sysadmin@controller-0 ~(keystone_admin)]$ openstack server migrate --live-migration b657ba58-aa77-4945-979b-8e2b9d9a8111

Expected Behavior
------------------
VM migration should have been attempted and CLI returned with pass or fail result.

Actual Behavior
----------------
CLI command hung indefinitely.

Reproducibility
---------------
Seen once

System Configuration
--------------------
AIO-DX

Branch/Pull Time/Commit
-----------------------
2020-05-18_20-00-00

Last Pass
---------
n/a

Timestamp/Logs
--------------
CLI command was run at:
2020-05-26T22:02:57.000 controller-0 -sh: info HISTORY: PID=2755566 UID=42425 openstack server migrate --live-migration b657ba58-aa77-4945-979b-8e2b9d9a8111

But last nfv-vim logs were from 7 hours earlier:
2020-05-26T07:17:42.712 controller-0 VIM_Thread[776005] INFO _vim_nfvi_audits.py.873 Audit instances called, timer_id=15.
2020-05-26T07:18:30.970 controller-0 VIM_Thread[776005] INFO _task_worker_pool.py.73 Timeout worker BlockStorage-Worker-0
2020-05-26T07:18:31.978 controller-0 VIM_Thread[776005] ERROR _task.py.200 Task(get_volume) work (get_volume) timed out, id=215342.
2020-05-26T07:18:31.981 controller-0 VIM_Thread[776005] ERROR _vim_nfvi_audits.py.1153 Audit-Volume callback, not completed, response={'completed': False, 'reason': ''}.
2020-05-26T07:18:38.207 controller-0 VIM_Thread[776005] ERROR (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely) (sqlite3.OperationalError) unable to open database file [SQL: u'UPDATE hosts_v6 SET uptime=?, elapsed_time_in_state=? WHERE hosts_v6.uuid = ?'] [parameters: (371994, 364573, u'3ff0ded8-6de9-4928-a2c1-2269527c98ef')]
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/nfv_vim/vim.py", line 167, in process_main
  File "/usr/lib64/python2.7/site-packages/nfv_common/selobj/_selobj_module.py", line 105, in selobj_dispatch
  File "/usr/lib64/python2.7/site-packages/nfv_common/tasks/_task_scheduler.py", line 332, in task_work_complete
  File "/usr/lib64/python2.7/site-packages/nfv_common/tasks/_task.py", line 193, in task_work_complete
  File "/usr/lib64/python2.7/site-packages/nfv_plugins/nfvi_plugins/nfvi_infrastructure_api.py", line 495, in get_hosts
  File "/usr/lib64/python2.7/site-packages/nfv_vim/audits/_vim_nfvi_audits.py", line 148, in _audit_nfvi_hosts_callback
  File "/usr/lib64/python2.7/site-packages/nfv_vim/objects/_host.py", line 721, in nfvi_host_update
  File "/usr/lib64/python2.7/site-packages/nfv_vim/objects/_host.py", line 863, in _persist
  File "/usr/lib64/python2.7/site-packages/nfv_vim/database/_database_infrastructure_module.py", line 68, in database_host_add
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2755, in first
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2547, in __getitem__
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2854, in __iter__
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 1407, in _autoflush
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 203, in raise_from_cause
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 1397, in _autoflush
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 2171, in flush
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 2291, in _flush
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 66, in __exit__
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 2255, in _flush
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 389, in execute
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 548, in execute
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 177, in save_obj
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 737, in _emit_update_statements
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 945, in execute
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/sql/elements.py", line 263, in _execute_on_connection
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1053, in _execute_clauseelement
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1189, in _execute_context
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1402, in _handle_dbapi_exception
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 203, in raise_from_cause
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1182, in _execute_context
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/default.py", line 470, in do_execute
OperationalError: (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely) (sqlite3.OperationalError) unable to open database file [SQL: u'UPDATE hosts_v6 SET uptime=?, elapsed_time_in_state=? WHERE hosts_v6.uuid = ?'] [parameters: (371994, 364573, u'3ff0ded8-6de9-4928-a2c1-2269527c98ef')]
2020-05-26T07:20:18.198 controller-0 VIM_BlockStorage-Worker-0_Thread[2931601] INFO _timer_scheduler.py.57 Not scheduling on time, elapsed=127281 ms.
2020-05-26T07:20:18.203 controller-0 VIM_BlockStorage-Worker-0_Thread[2931601] INFO _thread.py.218 Thread BlockStorage-Worker-0: not scheduling on time
2020-05-26T07:20:18.203 controller-0 VIM_BlockStorage-Worker-0_Thread[2931601] INFO _thread.py.218 Thread BlockStorage-Worker-0: not scheduling on time
2020-05-26T07:20:18.203 controller-0 VIM_BlockStorage-Worker-0_Thread[2931601] INFO _thread.py.232 Thread BlockStorage-Worker-0: shutting down.
2020-05-26T07:20:18.204 controller-0 VIM_BlockStorage-Worker-0_Thread[2931601] INFO _thread.py.236 Thread BlockStorage-Worker-0: shutdown.

Test Activity
-------------
Regular use

Workaround
----------
Unknown - killing the nfv-vim process may result in it recovering.

Frank Miller (sensfan22) wrote :
tags: added: stx.nfv
Changed in starlingx:
assignee: nobody → Bart Wensley (bartwensley)
Bart Wensley (bartwensley) wrote :

Although the failure in the VIM is the same as bug 1862049, I am re-opening this LP to address the problem that led to the VIM hang - there appears to be message loss between the VIM and openstack pods. From the AIO-DX system where this problem was seen, I can see the VIM is having trouble communicating with Cinder (not all the time - but several times an hour). The logs look like this:

2020-05-27T18:43:29.485 controller-0 VIM_Thread[2249636] INFO _task_worker_pool.py.73 Timeout worker BlockStorage-Worker-0
2020-05-27T18:43:30.490 controller-0 VIM_Thread[2249636] ERROR _task.py.200 Task(get_volumes) work (get_volumes) timed out, id=16520.
2020-05-27T18:43:30.492 controller-0 VIM_Thread[2249636] ERROR _vim_nfvi_audits.py.635 Audit-Volumes callback, not completed, responses={'completed': False, 'reason': '', 'page-request-id': '3e598409-1207-47b3-9dc2-3d8eccb2d1cb'}.
2020-05-27T18:45:16.438 controller-0 VIM_BlockStorage-Worker-0_Thread[1236728] INFO _timer_scheduler.py.57 Not scheduling on time, elapsed=127499 ms.
2020-05-27T18:45:16.439 controller-0 VIM_BlockStorage-Worker-0_Thread[1236728] INFO _thread.py.218 Thread BlockStorage-Worker-0: not scheduling on time
2020-05-27T18:45:16.439 controller-0 VIM_BlockStorage-Worker-0_Thread[1236728] INFO _thread.py.218 Thread BlockStorage-Worker-0: not scheduling on time
2020-05-27T18:45:16.439 controller-0 VIM_BlockStorage-Worker-0_Thread[1236728] INFO _thread.py.232 Thread BlockStorage-Worker-0: shutting down.
2020-05-27T18:45:16.440 controller-0 VIM_BlockStorage-Worker-0_Thread[1236728] INFO _thread.py.236 Thread BlockStorage-Worker-0: shutdown.

Over a year ago, I noticed there was message loss between the VIM and the openstack pods and raised https://bugs.launchpad.net/starlingx/+bug/1817936.

I spent significant time on that bug, then Matt spent time on it. Finally, Austin Sun debugged further and came up with a fix. His fix was to change the sysctl setting net.ipv4.tcp_tw_reuse to 0. Looking in the system where this issue occurred, I see that this setting is still at 0. This leads me to conclude one of the following:
- The fix for 1817936 did not fix the problem.
- Something changed after the fix was done and it is broken again (e.g. new kernel, new k8s version).
- We have another completely different bug.

The issue of message loss between the VIM and the openstack pods should be debugged under this LP.

Changed in starlingx:
assignee: Bart Wensley (bartwensley) → Frank Miller (sensfan22)
tags: added: stx.openstack
tags: added: stx.containers stx.networking
removed: stx.openstack
summary: - VM live-migration command hangs
+ Periodic message loss between VIM and Openstack pods
Frank Miller (sensfan22) wrote :

Austin: Requesting that you start the triage and investigation of the current message loss as you debugged and fixed a similar issue in 2019 in LP: https://bugs.launchpad.net/starlingx/+bug/1817936

Can you see if the current issue is due to the original fix no longer working?

Changed in starlingx:
assignee: Frank Miller (sensfan22) → Austin Sun (sunausti)
Ghada Khalil (gkhalil) on 2020-06-03
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
tags: added: stx.4.0
Austin Sun (sunausti) wrote :

Hi, Frank & Ghada:
    I took a look at this issue.
    from the log frank shared , the kernel version is kernel-3.10.0-1127.el7.1.tis.x86_64, but stx does not integrated w/ this kernel version.
    BTW:tested Duplex on latest stx images, running scripts from 1817936, no connection lose found between VIM and openstack. only Simplex have such issue.
   Simplex should be due to https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=79e9fed460385a3d8ba0b5782e9e74405cb199b1. will raise new bug and fix it.

    Thanks.
    BR
Austin Sun.

Frank Miller (sensfan22) wrote :

Austin - thank-you for your quick turnaround and summary of your investigation.

As you have shown the message loss on AIO-DX does not occur on the 4.18 kernel and only occurs on the 3.10 kernel I will mark this LP as invalid.

Please open a separate LP for the AIO-SX issue you found when using the 4.18 kernel and fix that under your new LP.

Changed in starlingx:
status: Triaged → Invalid
assignee: Austin Sun (sunausti) → Frank Miller (sensfan22)
Frank Miller (sensfan22) on 2020-09-02
Changed in starlingx:
status: Invalid → Triaged
Frank Miller (sensfan22) wrote :

This issue is being re-opened as it has been seen to occur on recent STX loads built on the master branch. In a test done using a master branch load built on August 11th message loss was observed between openstack services. In addition the following script also reproduced the message loss which is the script used by Austin in https://bugs.launchpad.net/starlingx/+bug/1817936.

import urllib
import datetime
import traceback
for i in range(1,20000):
    now = datetime.datetime.now()
    print("start %s at: %s" % (i,now))
    try:
        #response = urllib.urlopen("http://nova.openstack.svc.cluster.local:80")
        response = urllib.urlopen("http://nova-api.openstack.svc.cluster.local:8774")
        print(response.read())
        response.close()
    except Exception as e:
        now = datetime.datetime.now()
        print("Error %s at: %s" % (i,now))
        print ("exception is %s" % str(e))
        traceback.print_exc()
        exit()
    now = datetime.datetime.now()
    print("end %s at: %s" % (i,now))

Results from the script which was launched in 2 different labs by Elena:
cgcs-r430-1_2 - Connection timed out at iteration 10390
cgcs-wildcat-106 - Connection timed out at iteration 12122

Changed in starlingx:
assignee: Frank Miller (sensfan22) → Angie Wang (angiewang)
Frank Miller (sensfan22) wrote :

Assigning to Angie to implement a solution. Changed the tag to stx.5.0 as this issue has not been reported by stx.4.0 users to occur frequently enough to warrant a cherry pick to stx.4.0

tags: added: stx.5.0 stx.distro.openstack
removed: stx.4.0 stx.nfv

Fix proposed to branch: master
Review: https://review.opendev.org/749624

Changed in starlingx:
status: Triaged → In Progress

Reviewed: https://review.opendev.org/749624
Committed: https://git.openstack.org/cgit/starlingx/openstack-armada-app/commit/?id=c14bed004392f39283fab6b3039ad9333e4a3c5b
Submitter: Zuul
Branch: master

commit c14bed004392f39283fab6b3039ad9333e4a3c5b
Author: Angie Wang <email address hidden>
Date: Wed Sep 2 19:18:07 2020 -0400

    Update openstack requests to go through ingress

    The requests to Openstack services hang/fail sometimes
    due to message loss when connecting to internal service
    endpoints. This issue was observed before and fixed in
    commit https://review.opendev.org/#/c/683818/ by setting
    net.ipv4.tcp_tw_reuse to 0, however, it's still being
    seen on recent STX loads.

    It has been tested and proved that requests go through
    ingress pod do not have the issue. This commit updates
    helm charts and manifest to make all requests sending
    to openstack services go to ingress pod and then ingress
    would forward requests to the corresponding api service.

    Changes included:
    - update helm-toolkit manifest job-ks-endpoint.yaml
      to provide an ability to conditionally configure
      all types of openstack endpoints with public endpoint
      url when endpoints.identity.force_public_endpoint is
      true. Same update for keystone and keystone-api-proxy.
      With the update, for example, the admin,internal
      and public endpoints for neutron will be
      neutron.openstack.svc.cluster.local:80
    - update armada manifest to make neccessary overrides
      in openstack service configuration file to make
      communications between services go through ingress

    Change-Id: Icd646dd07d544da420a75f920bd7c3e8fc931327
    Closes-Bug: 1880777
    Signed-off-by: Angie Wang <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released

Reviewed: https://review.opendev.org/750055
Committed: https://git.openstack.org/cgit/starlingx/openstack-armada-app/commit/?id=bcadcf5218ba1cd6d99b31fdd5ccd97517858013
Submitter: Zuul
Branch: master

commit bcadcf5218ba1cd6d99b31fdd5ccd97517858013
Author: Angie Wang <email address hidden>
Date: Fri Sep 4 17:16:09 2020 -0400

    Update openstack requests to go through ingress

    A follow-up commit of https://review.opendev.org/#/c/749624/
    to update the requests from nova-api-proxy to nova
    via ingress. This requires to enable jobs in nova
    to create an additional ingress and ingress-service
    for nova-api.

    The created ingress for nova-api is called "nova-api-internal"
    and nova-api-proxy configuration file is overrided to
    listen on nova-api-internal.openstack.svc.cluster.local:80

    Change-Id: I2275be8e9458addbf1aedb203a7960f5f8d0b0de
    Partial-Bug: 1880777
    Signed-off-by: Angie Wang <email address hidden>

Reviewed: https://review.opendev.org/750001
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=342e72155df3ea070f8263178fe6133207bf5e3d
Submitter: Zuul
Branch: master

commit 342e72155df3ea070f8263178fe6133207bf5e3d
Author: Angie Wang <email address hidden>
Date: Fri Sep 4 13:19:33 2020 -0400

    Periodic message loss between VIM and Openstack pods

    The requests to Openstack services hang/fail sometimes
    due to message loss when connecting to internal service
    endpoints. This issue was observed before and fixed in
    commit https://review.opendev.org/#/c/683818/ by setting
    net.ipv4.tcp_tw_reuse to 0, however, it's still being
    seen on recent STX loads.

    It has been tested and proved that requests go through
    ingress pod do not have the issue.

    Update the default keystone auth endpoint to public url
    in sysinv so VIM and pci-irq-affinity-agent can be
    configured to talk to keystone using public urls.
    Aslo update the requests from VIM to nova to go through
    ingress as well.

    Change-Id: Ib5a49ede44f76c649cd64ed027a4d8eb559a61fe
    Partial-Bug: 1880777
    Depends-On: https://review.opendev.org/750055
    Signed-off-by: Angie Wang <email address hidden>

Reviewed: https://review.opendev.org/750002
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=85e6aed832fb0f67d4ba56676c6da250196581f3
Submitter: Zuul
Branch: master

commit 85e6aed832fb0f67d4ba56676c6da250196581f3
Author: Angie Wang <email address hidden>
Date: Fri Sep 4 13:16:02 2020 -0400

    Periodic message loss between VIM and Openstack pods

    The requests to Openstack services hang/fail sometimes
    due to message loss when connecting to internal service
    endpoints. This issue was observed before and fixed in
    commit https://review.opendev.org/#/c/683818/ by setting
    net.ipv4.tcp_tw_reuse to 0, however, it's still being
    seen on recent STX loads.

    It has been tested and proved that requests go through
    ingress pod do not have the issue. In the commit
    https://review.opendev.org/750001, pci affinity agent
    has been updated to talk to keystone public host,
    so this commit updates its default keystone auth
    port to 80 to align with keystone public host.

    Change-Id: I10897c7148a07f42f76b4a74258f736245bf9dca
    Closes-Bug: 1880777
    Depends-On: https://review.opendev.org/750001
    Signed-off-by: Angie Wang <email address hidden>

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers