gate-tempest-dsvm-neutron-dvr-ha-multinode-full-ubuntu-xenial-nv job has a very high failure rate

Bug #1707003 reported by Brian Haley
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Brian Haley
Milestone: none

Bug Description

Looking at the Neutron Failure Rate dashboard, specifically the tempest jobs:

http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=10&fullscreen

One can see the gate-tempest-dsvm-neutron-dvr-ha-multinode-full-ubuntu-xenial-nv job has a very high failure rate, over 90% for the past 5 days.

Matt Riedemann did an analysis (which I'll paste below), but the summary is that setup of the 3-node job is failing a lot: the third node is not being discovered, which leads to a failure when Tempest tries to use it.

So the first step is to change the devstack-gate (?) code to wait for all the subnodes to show up from a Nova perspective before proceeding. There was a previous attempt at a grenade change in https://review.openstack.org/#/c/426310/ that was abandoned, but based on the analysis it seems like a good starting point.
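
For context, the manual version of that check looks roughly like the following (run on the primary node; these are standard nova/nova-manage commands, not the exact invocations the gate scripts use):

    # List the registered nova-compute services; all three nodes should appear here.
    nova service-list --binary nova-compute

    # Show which compute hosts are already mapped to a cell.
    nova-manage cell_v2 list_hosts

    # Map any compute hosts that are not yet mapped.
    nova-manage cell_v2 discover_hosts --verbose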

Matt's comment #1:

Looking at the gate-tempest-dsvm-neutron-dvr-ha-multinode-full-ubuntu-xenial-nv job failure, subnode-2 and subnode-3 both look OK as far as their config. They use the same values in nova-cpu.conf, pointing at the nova_cell1 MQ, which points at the cell1 conductor and cell1 database. I see that the compute node records for both subnode-2 and subnode-3 are created *after* discover_hosts runs:

2017-07-25 15:06:55.991684 | + /opt/stack/new/devstack-gate/devstack-vm-gate.sh:main:L777: discover_hosts

Jul 25 15:06:58.945371 ubuntu-xenial-3-node-rax-iad-10067333-744503 nova-compute[794]: INFO nova.compute.resource_tracker [None req-f69c76bf-0263-494b-8257-61617c90d799 None None] Compute node record created for ubuntu-xenial-3-node-rax-iad-10067333-744503:ubuntu-xenial-3-node-rax-iad-10067333-744503 with uuid: 1788fe0b-496c-4eda-b03a-2cf4a2733a94

Jul 25 15:07:02.323379 ubuntu-xenial-3-node-rax-iad-10067333-744504 nova-compute[827]: INFO nova.compute.resource_tracker [None req-95419fec-a2a7-467f-b167-d83755273a7a None None] Compute node record created for ubuntu-xenial-3-node-rax-iad-10067333-744504:ubuntu-xenial-3-node-rax-iad-10067333-744504 with uuid: ae3420a1-20d2-42a1-909d-fc9cf1b14248

And looking at the discover_hosts output, only subnode-2 is discovered as the unmapped host:

http://logs.openstack.org/56/477556/5/experimental/gate-tempest-dsvm-neutron-dvr-ha-multinode-full-ubuntu-xenial-nv/432c235/logs/devstack-gate-discover-hosts.txt.gz

The compute node from the primary host is discovered and mapped to cell1 as part of the devstack run on the primary host:

http://logs.openstack.org/56/477556/5/experimental/gate-tempest-dsvm-neutron-dvr-ha-multinode-full-ubuntu-xenial-nv/432c235/logs/devstacklog.txt.gz#_2017-07-25_14_50_45_831

So it seems that we are simply getting lucky by discovering the compute node from subnode-2 and mapping it to cell1, but missing the compute node from subnode-3, so it doesn't get mapped and things fail when Tempest tries to use it. This could be a problem in any 3-node job and might not be related only to this devstack change.

Matt's comment #2:

I've gone through the dvr-ha 3-node job failure and it just appears to be a latent issue that we could also hit in 2-node jobs. I even noticed in a 2-node job that the subnode's compute node record is created after we start running discover_hosts from the primary via devstack-gate. So it's an existing race window, which 3-node jobs may expose more often because they are slower or put more load on the control node.

If you look at the cells v2 setup guide, it even says to make sure the computes are created before running discover_hosts:

https://docs.openstack.org/nova/latest/user/cells.html

"Configure and start your compute hosts. Before step 7, make sure you have compute hosts in the database by running nova service-list --binary nova-compute."

Step 7 is running 'nova-manage cell_v2 discover_hosts'.

Ideally, devstack-gate should pass a variable to the discover_hosts.sh script in devstack telling it how many compute services we expect (3 in the case of the dvr-ha job). That discover_hosts.sh script would then run 'nova service-list --binary nova-compute' and count the results until the expected number is reached or a timeout is hit, and then run discover_hosts. That's really what we expect from other deployment tools like TripleO and Kolla.

But overall I'm not finding anything in this change that's killing these jobs outright, so let's get it in.
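
A minimal sketch of the wait loop Matt describes above, assuming a hypothetical NOVA_NUM_EXPECTED_COMPUTES variable passed in from devstack-gate (the real change would live in devstack's discover_hosts.sh, and the exact CLI calls there may differ):

    #!/bin/bash
    # Hypothetical sketch only: wait for the expected number of nova-compute
    # services to register before running cell_v2 host discovery.
    EXPECTED=${NOVA_NUM_EXPECTED_COMPUTES:-1}   # e.g. 3 for the dvr-ha job
    TIMEOUT=${DISCOVER_HOSTS_TIMEOUT:-120}      # seconds to wait before giving up

    elapsed=0
    while [ "$elapsed" -lt "$TIMEOUT" ]; do
        # Count the registered nova-compute services.
        count=$(openstack compute service list --service nova-compute -f value -c Host | wc -l)
        if [ "$count" -ge "$EXPECTED" ]; then
            break
        fi
        sleep 5
        elapsed=$((elapsed + 5))
    done

    # Run discovery even if we timed out, matching the current behaviour.
    nova-manage cell_v2 discover_hosts --verbose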

Matt's comment #3:

This is what I see for voting jobs that fail with the 'host is not mapped to any cell' error:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Host%5C%22%20AND%20message%3A%5C%22is%20not%20mapped%20to%20any%20cell%5C%22%20AND%20tags%3A%5C%22console%5C%22%20AND%20voting%3A1%20AND%20build_status%3A%5C%22FAILURE%5C%22&from=7d

Those are all grenade multinode jobs.

Likely https://review.openstack.org/#/c/426310/, or a variant thereof, would resolve it.

tags: added: gate-failure
Ihar Hrachyshka (ihar-hrachyshka) wrote:

Not sure that's still a pressing issue. The job is currently at ~20% failure rate and seems on par with other tempest jobs.

Brian Haley (brian-haley) wrote:

The failure rate of this job is back to normal again, if you call <20% in the check queue normal. It was most likely a combination of the fixes for https://bugs.launchpad.net/neutron/+bug/1713927 and https://review.openstack.org/#/c/488381/

So I'll close.

Changed in neutron:
status: Confirmed → Fix Committed
Changed in neutron:
status: Fix Committed → Fix Released