gate-tempest-dsvm-neutron-dvr-ha-multinode-full-ubuntu-xenial-nv job has a very high failure rate
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | Fix Released | High | Brian Haley |
Bug Description
Looking at the Neutron Failure Rate dashboard, specifically the tempest jobs:
http://
One can see the gate-tempest-
Matt Riedemann did an analysis (which I'll paste below), but the summary is that setup of the 3-node job is failing a lot: the third node is not being discovered, leading to a failure when Tempest attempts to use it.
So the first step is to change the devstack-gate (?) code to wait for all of the subnodes to show up, from a Nova perspective, before proceeding. There was a previous attempt at a grenade change in https:/
Matt's comment #1:
Looking at the gate-tempest-
2017-07-25 15:06:55.991684 | + /opt/stack/
Jul 25 15:06:58.945371 ubuntu-
Jul 25 15:07:02.323379 ubuntu-
And looking at the discover_hosts output, only subnode-2 is discovered as the unmapped host:
The compute node from the primary host is discovered and mapped to cell1 as part of the devstack run on the primary host:
So it seems that we are simply getting lucky by discovering the compute node from subnode-2 and mapping it to cell1 while missing the compute node from subnode-3, so it doesn't get mapped and then things fail when Tempest tries to use it. This could be a problem on any 3-node job, and might not just be related to this devstack change.
Matt's comment #2:
I've gone through the dvr-ha 3-node job failure and it just appears to be a latent issue that we could also hit in 2-node jobs. I even noticed in a 2-node job that the subnode compute node is actually created after we start running discover_hosts from the primary via devstack-gate. So it seems to be a race window we already have, one that 3-node jobs may expose more often if they are slower or if there is more traffic slowing down the control node.
If you look at the cells v2 setup guide, it even says to make sure the computes are created before running discover_hosts:
https:/
"Configure and start your compute hosts. Before step 7, make sure you have compute hosts in the database by running nova service-list --binary nova-compute."
Step 7 is running 'nova-manage cell_v2 discover_hosts'.
Ideally, what we should be doing is have devstack-gate pass a variable to the discover_hosts.sh script in devstack telling it how many compute services we expect (3 in the case of the dvr-ha job). That script would then run nova service-list --binary nova-compute and count the results until the expected number is reached (or a timeout is hit), and only then run discover_hosts. That's really what we expect from other deployment tools like triple-o and kolla.
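The proposed fix can be sketched roughly as follows. This is a hypothetical illustration, not actual devstack-gate code: the `wait_for_computes` function and the `nova_compute_count` stub (which in a real run would shell out to `nova service-list --binary nova-compute` and count the rows) are assumptions made for the example.

```shell
#!/bin/bash
# Sketch of the polling described above; names here are hypothetical.

# Stub for testing. In a real deployment this would be something like:
#   nova service-list --binary nova-compute -f value -c Binary | wc -l
nova_compute_count() {
    echo "${MOCK_COMPUTE_COUNT:-0}"
}

# Poll until the expected number of nova-compute services have registered,
# or fail once the timeout (seconds) is exceeded.
wait_for_computes() {
    local expected=$1 timeout=${2:-300} elapsed=0
    while [ "$(nova_compute_count)" -lt "$expected" ]; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "Timed out waiting for $expected compute services" >&2
            return 1
        fi
        sleep 5
        elapsed=$((elapsed + 5))
    done
    return 0
}

# Only after all expected computes are registered would we map them, e.g.:
# wait_for_computes "$EXPECTED_COMPUTES" && nova-manage cell_v2 discover_hosts
```

With something like this in place, the dvr-ha job would pass an expected count of 3 and the race window between compute registration and host discovery would close, instead of relying on the subnodes happening to register in time.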
But overall I'm not finding anything in this change that's killing these jobs outright, so let's get it in.
Matt's comment #3:
This is what I see for voting jobs that fail with the 'host is not mapped to any cell' error:
Those are all grenade multinode jobs.
Likely https:/
tags: added: gate-failure
Changed in neutron:
status: Fix Committed → Fix Released
Not sure that's still a pressing issue. The job is currently at a ~20% failure rate and seems on par with other tempest jobs.