tempest.api.compute.servers.test_novnc.NoVNCConsoleTestJSON.test_novnc fails intermittently in neutron multinode nv job

Bug #1669468 reported by Markus Zoeller (markus_z)
Affects: devstack
Status: Fix Released
Importance: Medium
Assigned to: melanie witt
Milestone: (none)

Bug Description

Example output:

2017-02-21 06:42:10.010442 | ==============================
2017-02-21 06:42:10.010458 | Failed 1 tests - output below:
2017-02-21 06:42:10.010471 | ==============================
2017-02-21 06:42:10.010477 |
2017-02-21 06:42:10.010507 | tempest.api.compute.servers.test_novnc.NoVNCConsoleTestJSON.test_novnc[id-c640fdff-8ab4-45a4-a5d8-7e6146cbd0dc]
2017-02-21 06:42:10.010542 | ---------------------------------------------------------------------------------------------------------------
2017-02-21 06:42:10.010548 |
2017-02-21 06:42:10.010558 | Captured traceback:
2017-02-21 06:42:10.010569 | ~~~~~~~~~~~~~~~~~~~
2017-02-21 06:42:10.010583 | Traceback (most recent call last):
2017-02-21 06:42:10.010606 | File "tempest/api/compute/servers/test_novnc.py", line 152, in test_novnc
2017-02-21 06:42:10.010621 | self._validate_rfb_negotiation()
2017-02-21 06:42:10.010646 | File "tempest/api/compute/servers/test_novnc.py", line 77, in _validate_rfb_negotiation
2017-02-21 06:42:10.010665 | 'Token must be invalid because the connection '
2017-02-21 06:42:10.010721 | File "/opt/stack/new/tempest/.tox/tempest/local/lib/python2.7/site-packages/unittest2/case.py", line 696, in assertFalse
2017-02-21 06:42:10.010737 | raise self.failureException(msg)
2017-02-21 06:42:10.010762 | AssertionError: True is not false : Token must be invalid because the connection closed.
2017-02-21 06:42:10.010768 |
2017-02-21 06:42:10.010774 |
2017-02-21 06:42:10.010785 | Captured pythonlogging:
2017-02-21 06:42:10.010796 | ~~~~~~~~~~~~~~~~~~~~~~~
2017-02-21 06:42:10.010848 | 2017-02-21 06:07:18,545 16286 INFO [tempest.lib.common.rest_client] Request (NoVNCConsoleTestJSON:test_novnc): 200 POST https://10.27.33.58:8774/v2.1/servers/82d4d4ca-c263-4ac5-85bc-a33488af5ff5/action 0.165s
2017-02-21 06:42:10.010905 | 2017-02-21 06:07:18,545 16286 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'Accept': 'application/json', 'X-Auth-Token': '<omitted>', 'Content-Type': 'application/json'}
2017-02-21 06:42:10.010925 | Body: {"os-getVNCConsole": {"type": "novnc"}}
2017-02-21 06:42:10.011109 | Response - Headers: {u'content-type': 'application/json', 'content-location': 'https://10.27.33.58:8774/v2.1/servers/82d4d4ca-c263-4ac5-85bc-a33488af5ff5/action', u'date': 'Tue, 21 Feb 2017 06:07:18 GMT', u'x-openstack-nova-api-version': '2.1', 'status': '200', u'content-length': '121', u'server': 'Apache/2.4.18 (Ubuntu)', u'connection': 'close', u'openstack-api-version': 'compute 2.1', u'vary': 'OpenStack-API-Version,X-OpenStack-Nova-API-Version', u'x-compute-request-id': 'req-d9681919-5b5e-4477-b38d-2734b660a099'}
2017-02-21 06:42:10.011153 | Body: {"console": {"url": "http://10.27.33.58:6080/vnc_auto.html?token=f8a52df3-8e0d-4d64-8877-07f607f84b74", "type": "novnc"}}
2017-02-21 06:42:10.011161 |
2017-02-21 06:42:10.011167 |
2017-02-21 06:42:10.011172 |

Full logs at: http://logs.openstack.org/38/431038/3/check/gate-tempest-dsvm-neutron-multinode-full-ubuntu-xenial-nv/5e1d485/console.html#_2017-02-21_06_07_18_740230

This started on 2017-02-21.

The very first change that failed here was https://review.openstack.org/#/c/431038/, but it is not related to the error.

Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

The only novnc related merge I found in that timeframe is https://github.com/openstack/nova/commit/b726f2553c3eebdc4ab5753a45a99724e952b4ac but that merged a few hours after the first hit.

Revision history for this message
Matt Riedemann (mriedem) wrote :

This might have started before 2/21, logstash only goes back 10 days.

Revision history for this message
Matt Riedemann (mriedem) wrote :

http://logs.openstack.org/38/431038/3/check/gate-tempest-dsvm-neutron-multinode-full-ubuntu-xenial-nv/5e1d485/console.html#_2017-02-21_06_42_10_010507

2017-02-21 06:42:10.010507 | tempest.api.compute.servers.test_novnc.NoVNCConsoleTestJSON.test_novnc[id-c640fdff-8ab4-45a4-a5d8-7e6146cbd0dc]
2017-02-21 06:42:10.010542 | ---------------------------------------------------------------------------------------------------------------
2017-02-21 06:42:10.010548 |
2017-02-21 06:42:10.010558 | Captured traceback:
2017-02-21 06:42:10.010569 | ~~~~~~~~~~~~~~~~~~~
2017-02-21 06:42:10.010583 | Traceback (most recent call last):
2017-02-21 06:42:10.010606 | File "tempest/api/compute/servers/test_novnc.py", line 152, in test_novnc
2017-02-21 06:42:10.010621 | self._validate_rfb_negotiation()
2017-02-21 06:42:10.010646 | File "tempest/api/compute/servers/test_novnc.py", line 77, in _validate_rfb_negotiation
2017-02-21 06:42:10.010665 | 'Token must be invalid because the connection '
2017-02-21 06:42:10.010721 | File "/opt/stack/new/tempest/.tox/tempest/local/lib/python2.7/site-packages/unittest2/case.py", line 696, in assertFalse
2017-02-21 06:42:10.010737 | raise self.failureException(msg)
2017-02-21 06:42:10.010762 | AssertionError: True is not false : Token must be invalid because the connection closed.
2017-02-21 06:42:10.010768 |
2017-02-21 06:42:10.010774 |
2017-02-21 06:42:10.010785 | Captured pythonlogging:
2017-02-21 06:42:10.010796 | ~~~~~~~~~~~~~~~~~~~~~~~
2017-02-21 06:42:10.010848 | 2017-02-21 06:07:18,545 16286 INFO [tempest.lib.common.rest_client] Request (NoVNCConsoleTestJSON:test_novnc): 200 POST https://10.27.33.58:8774/v2.1/servers/82d4d4ca-c263-4ac5-85bc-a33488af5ff5/action 0.165s
2017-02-21 06:42:10.010905 | 2017-02-21 06:07:18,545 16286 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'Accept': 'application/json', 'X-Auth-Token': '<omitted>', 'Content-Type': 'application/json'}
2017-02-21 06:42:10.010925 | Body: {"os-getVNCConsole": {"type": "novnc"}}
2017-02-21 06:42:10.011109 | Response - Headers: {u'content-type': 'application/json', 'content-location': 'https://10.27.33.58:8774/v2.1/servers/82d4d4ca-c263-4ac5-85bc-a33488af5ff5/action', u'date': 'Tue, 21 Feb 2017 06:07:18 GMT', u'x-openstack-nova-api-version': '2.1', 'status': '200', u'content-length': '121', u'server': 'Apache/2.4.18 (Ubuntu)', u'connection': 'close', u'openstack-api-version': 'compute 2.1', u'vary': 'OpenStack-API-Version,X-OpenStack-Nova-API-Version', u'x-compute-request-id': 'req-d9681919-5b5e-4477-b38d-2734b660a099'}
2017-02-21 06:42:10.011153 | Body: {"console": {"url": "http://10.27.33.58:6080/vnc_auto.html?token=f8a52df3-8e0d-4d64-8877-07f607f84b74", "type": "novnc"}}

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22in%20_validate_rfb_negotiation%5C%22&from=7d

203 hits in 7 days across master and stable/ocata; this mostly shows up in the neutron multinode job, which is non-voting.

Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

The failing tempest test was added with https://github.com/openstack/tempest/commit/1f87a5611a983acac2208c3e0ba07eee75fe9a51, which merged on 2017-02-03 16:39:00.

Matt Riedemann (mriedem)
summary: tempest.api.compute.servers.test_novnc.NoVNCConsoleTestJSON.test_novnc
- fails intermittently
+ fails intermittently in neutron multinode nv job
Changed in nova:
status: New → Confirmed
Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

I tried to brute-force it on a single-node system:

    $ for i in {1..500}; do tempest run --regex tempest.api.compute.servers.test_novnc.NoVNCConsoleTestJSON.test_novnc >> brute_force_1669468.txt; done

No failing tests here. I'll try again with a multinode setup.

Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

Michelle Mandel figured it out (all credit to her); I'm only summarizing things here:

The compute node in the multinode tests has a wrong value for the config
option "vnc.vncserver_proxyclient_address". It uses 127.0.0.1, but it
should be an address under which the proxy can actually reach the compute
host. This is also described in the help of that config option [1]:

    > "Private, internal IP address or hostname of VNC console proxy.

    > The VNC proxy is an OpenStack component that enables compute service
    > users to access their instances through VNC clients.

    > This option sets the private address to which proxy clients, such as
    > ``nova-xvpvncproxy``, should connect to."

As the multinode test job uses a controller node (*with* compute services)
plus a compute node, this also explains why the failure is intermittent. The
test passes if the instance gets scheduled onto the controller (*with*
compute) and fails if it gets scheduled onto the compute node.

Steps to check that assumption:
* check the failure in "screen-n-novnc.txt"
* note down the instance_uuid of the failing connection attempt
* check the file "subnode-2/screen-n-cpu.txt", the instance_uuid
  should be there

I verified this assumption against multiple failing test runs.
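The failing assertion comes from tempest's RFB version handshake. A minimal sketch of that kind of check (illustrative only, not tempest's actual code; the function names, host, and port are made up): when the proxy cannot reach the compute host's vncserver, it closes the connection and the client reads zero bytes instead of the 12-byte banner.

```python
import socket


def rfb_banner_is_valid(data):
    """True if the first bytes look like an RFB ProtocolVersion
    banner, e.g. b'RFB 003.008' followed by a newline (the banner
    is always exactly 12 bytes)."""
    return len(data) == 12 and data.startswith(b'RFB ') and data.endswith(b'\n')


def check_novnc_negotiation(host, port, timeout=5.0):
    """Connect to a VNC/noVNC endpoint and read the 12-byte version
    banner. An empty read means the server side closed the connection,
    which is what the proxy does when it cannot reach the compute
    host's vncserver (the misconfiguration described above)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        data = sock.recv(12)
    if not data:
        raise AssertionError('connection closed: invalid token or '
                             'proxy cannot reach the vncserver')
    assert rfb_banner_is_valid(data), 'unexpected banner: %r' % data
```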

Options:
1) skip multinode tests for this test case
2) enforce scheduling onto the controller (*with* compute services)
3) change the multi node test job config to setup the vnc options correctly

Personally, I lean towards option 3), as it is the more realistic scenario.
It should also not be that big of a change; at most these three options need
to be set (I haven't yet tested the minimal set to change):
* NOVNCPROXY_URL="http://$SERVICE_HOST:6080/vnc_auto.html"
* VNCSERVER_LISTEN=$HOST_IP
* VNCSERVER_PROXYCLIENT_ADDRESS=$VNCSERVER_LISTEN

References:
[1] https://github.com/openstack/nova/blob/cd3b57d0c0cb867ef48a6e9721d9b3e28cb08e84/nova/conf/vnc.py#L70-L78

Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :

Ah, good find, I was just about to start looking into this myself.

Note that the multinode job was broken earlier until https://review.openstack.org/434996 added

DEVSTACK_LOCAL_CONFIG="NOVA_VNC_ENABLED=true"

but this probably needs amending.

Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :
Changed in nova:
importance: Undecided → Medium
assignee: nobody → Markus Zoeller (markus_z) (mzoeller)
status: Confirmed → In Progress
Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

Change https://review.openstack.org/#/c/444164/6 didn't solve it completely;
the VNC settings are still incomplete. The controller node (with compute
service) is set up correctly:

    [vnc]
    xvpvncproxy_host = 0.0.0.0
    novncproxy_host = 0.0.0.0
    vncserver_proxyclient_address = 158.69.78.118
    vncserver_listen = 0.0.0.0
    xvpvncproxy_base_url = http://158.69.78.118:6081/console
    novncproxy_base_url = http://158.69.78.118:6080/vnc_auto.html

But the compute node (subnode-2) is not:

    [vnc]
    xvpvncproxy_host = 0.0.0.0
    novncproxy_host = 0.0.0.0
    vncserver_proxyclient_address =
    vncserver_listen = 0.0.0.0
    xvpvncproxy_base_url = http://158.69.78.118:6081/console
    novncproxy_base_url = http://158.69.78.118:6080/vnc_auto.html

Note the missing IP address for "vncserver_proxyclient_address". This is
probably due to the wrong order of the variables in the "local.conf" of
the subnode:

    NOVA_VNC_ENABLED=true
    VNCSERVER_LISTEN=0.0.0.0
    VNCSERVER_PROXYCLIENT_ADDRESS=$HOST_IP

    HOST_IP=158.69.78.119

The "HOST_IP" has to be *before* the "VNCSERVER_PROXYCLIENT_ADDRESS".
Logs: http://logs.openstack.org/76/400876/31/check/gate-tempest-dsvm-neutron-multinode-full-ubuntu-xenial-nv/63d2bdd/logs/
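Why the ordering matters: localrc/local.conf is consumed as shell, so `$HOST_IP` expands at assignment time. A tiny standalone demonstration (plain shell, not the actual devstack code):

```shell
#!/bin/sh
unset HOST_IP

# Broken ordering, as generated for the subnode: HOST_IP is not set
# yet, so the proxyclient address expands to an empty string.
VNCSERVER_PROXYCLIENT_ADDRESS=$HOST_IP
HOST_IP=158.69.78.119
echo "broken: vncserver_proxyclient_address='${VNCSERVER_PROXYCLIENT_ADDRESS}'"

# Fixed ordering: assign HOST_IP first, then reference it.
HOST_IP=158.69.78.119
VNCSERVER_PROXYCLIENT_ADDRESS=$HOST_IP
echo "fixed:  vncserver_proxyclient_address='${VNCSERVER_PROXYCLIENT_ADDRESS}'"
```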

The Logstash query: http://logstash.openstack.org/#/dashboard/file/logstash.json?from=7d&query=message:%5C%22AssertionError:%20True%20is%20not%20false%20:%20Token%20must%20be%20invalid%20because%20the%20connection%20closed.%5C%22

Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

An attempt to fix the order of localrc variables for the subnode: https://review.openstack.org/#/c/448078/1

Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

No hits in logstash after https://review.openstack.org/#/c/448078/1 merged. => resolved

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

It's still hitting master and stable, including non-multinode jobs. Example: http://logs.openstack.org/27/461827/1/check/gate-tempest-dsvm-py35-ubuntu-xenial/6805498/logs/testr_results.html.gz

Revision history for this message
Matt Riedemann (mriedem) wrote :
Changed in nova:
status: Fix Released → Confirmed
assignee: Markus Zoeller (markus_z) (mzoeller) → nobody
Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

Hmm... that piece stands out:

    stderr: {{{
/opt/stack/new/tempest/tempest/api/compute/servers/test_novnc.py:166: ResourceWarning: unclosed <socket.socket fd=4, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('10.36.191.29', 60968), raddr=('10.36.191.29', 6080)>
  self._validate_novnc_html(body['url'])

Maybe I can take a look at it this week. If someone else is faster, be my guest.
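That ResourceWarning only says the test left a socket open; it is a hygiene issue rather than the failure itself. A sketch of the usual remedy (this is not the actual tempest patch, and the helper name is made up), assuming a simple banner read:

```python
import socket


def fetch_banner(host, port, nbytes=12, timeout=5.0):
    """Read up to nbytes from a freshly opened connection.

    Using the socket as a context manager guarantees close() runs,
    which is what silences "ResourceWarning: unclosed <socket...>".
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        return sock.recv(nbytes)
```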

Revision history for this message
Leo Henken (lh236s) wrote :

This bug is still prevalent in most Zuul checks and is a burden to development teams. I will try to do some debugging and logging in Devstack to gather more information. If this is a Nova defect, I hope the data from these logs will help us all work towards a solution.

Revision history for this message
Leo Henken (lh236s) wrote :

It looks like the request that fails is not supported in all microversions.

https://docs.openstack.org/api-ref/compute/?expanded=get-vnc-console-os-getvncconsole-action-deprecated-detail#get-vnc-console-os-getvncconsole-action-deprecated

Unclear if this has anything to do with this issue.

Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :

Not sure whether this is really the same issue still or if we should rather have a new bug for that, but the novnc config in tempest-multinode-full looks wrong to me.

https://logs.opendev.org/92/651492/7/check/tempest-multinode-full/ac24db2/controller/logs/etc/nova/nova-cpu_conf.txt.gz
my_ip = 10.211.2.5
[vnc]
server_proxyclient_address = 127.0.0.1
server_listen = 127.0.0.1
xvpvncproxy_base_url = http://10.211.2.5:6081/console
novncproxy_base_url = http://10.211.2.5:6080/vnc_auto.html

https://logs.opendev.org/92/651492/7/check/tempest-multinode-full/ac24db2/compute1/logs/etc/nova/nova-cpu_conf.txt.gz
my_ip = 10.211.2.46
[vnc]
server_proxyclient_address = 127.0.0.1
server_listen = 127.0.0.1
xvpvncproxy_base_url = http://10.211.2.5:6081/console
novncproxy_base_url = http://10.211.2.5:6080/vnc_auto.html

The section that generates this config in devstack however says:

        # Address on which instance vncservers will listen on compute hosts.
        # For multi-host, this should be the management ip of the compute host.

https://opendev.org/openstack/devstack/src/branch/master/lib/nova#L621

So I'm assuming the job only passes when the test VM gets scheduled on the controller and fails when it gets scheduled on compute1. To fix the issue, the above config should contain the local IPs as set in my_ip instead of 127.0.0.1.
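Concretely, following that reasoning, the compute1 [vnc] section would need to look roughly like this (a sketch using the addresses from the logs above; whether server_listen should be the node IP or 0.0.0.0 is a deployment choice, not confirmed here):

```ini
[vnc]
server_proxyclient_address = 10.211.2.46
server_listen = 10.211.2.46
xvpvncproxy_base_url = http://10.211.2.5:6081/console
novncproxy_base_url = http://10.211.2.5:6080/vnc_auto.html
```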

Revision history for this message
Matt Riedemann (mriedem) wrote :

Yeah I agree with the conclusion in comment #18.

Revision history for this message
Matt Riedemann (mriedem) wrote :

This might have been the regression:

https://review.opendev.org/#/c/672493/

Revision history for this message
Matt Riedemann (mriedem) wrote :
no longer affects: nova
Changed in devstack:
status: New → In Progress
importance: Undecided → Medium
assignee: nobody → Matt Riedemann (mriedem)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to devstack (master)

Fix proposed to branch: master
Review: https://review.opendev.org/675721

Changed in devstack:
assignee: Matt Riedemann (mriedem) → melanie witt (melwitt)
Revision history for this message
Leo Henken (lh236s) wrote :

The change looks good. I will work on a client-side logging patch set aimed at making this easier to debug in the future; I feel this could have been easier to diagnose with better logs.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on devstack (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/675652
Reason: https://review.opendev.org/#/c/675721/ works.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to devstack (master)

Reviewed: https://review.opendev.org/675721
Committed: https://git.openstack.org/cgit/openstack/devstack/commit/?id=0a3288c1b40a4338df351a3fef9a346e78e12191
Submitter: Zuul
Branch: master

commit 0a3288c1b40a4338df351a3fef9a346e78e12191
Author: melanie witt <email address hidden>
Date: Fri Aug 9 15:57:50 2019 +0000

    Set console server host/address in nova-cpu.conf for multi-host

    Currently, the console server host and listen address on the compute
    host is always being set to localhost. This works fine in a single
    node all-in-one deployment, but will not work properly when
    nova-compute is running on a separate host in a multi-node deployment.

    This sets the console server host and listen address on the compute
    host to the nova host IP and service listen address instead of the
    localhost.

    Co-Authored-By: Matt Riedemann <email address hidden>

    Closes-Bug: #1669468

    Change-Id: Id8b0b4159b98c7ff3c85ec3daa03d556d9897ce9

Changed in devstack:
status: In Progress → Fix Released