race between neutron port create and nova boot

Bug #1453350 reported by jason bishop
46
This bug affects 9 people
Affects Status Importance Assigned to Milestone
neutron
Fix Released
High
Kevin Benton

Bug Description

I am doing load testing with tempest scenario tests and see what I think is a race condition between neutron dhcp standup and nova boot. I believe the scenario I am seeing to be a more general case of https://bugs.launchpad.net/neutron/+bug/1334447.

test environment: 5 compute nodes, 1 controller node running all api and neutron services. ubuntu juno hand patched 1382064 and 1385257 and my workaround in 1451492. standard neutron setup otherwise.

If I run tempest scenario test test_server_basic_ops 30 times in parallel things consistently work fine. If I increase to 60 in parallel I get lots of failures (see below). Upon investigation, it looks to me that neutron standup of netns and its dnsmasq process is too slow and loses the race with nova boot and the VM comes up without a (dhcp provided) IP address (causing ssh to timeout and fail).

Traceback (most recent call last):
  File "/home/aqua/tempest/tempest/test.py", line 125, in wrapper
    return f(self, *func_args, **func_kwargs)
  File "/home/aqua/tempest/tempest/scenario/test_server_basic_ops_38.py", line 105, in test_server_basicops
    self.verify_ssh()
  File "/home/aqua/tempest/tempest/scenario/test_server_basic_ops_38.py", line 95, in verify_ssh
    private_key=self.keypair['private_key'])
  File "/home/aqua/tempest/tempest/scenario/manager.py", line 310, in get_remote_client
    linux_client.validate_authentication()
  File "/home/aqua/tempest/tempest/common/utils/linux/remote_client.py", line 55, in validate_authentication
    self.ssh_client.test_connection_auth()
  File "/home/aqua/tempest/tempest/common/ssh.py", line 150, in test_connection_auth
    connection = self._get_ssh_connection()
  File "/home/aqua/tempest/tempest/common/ssh.py", line 87, in _get_ssh_connection
    password=self.password)
tempest.exceptions.SSHTimeout: Connection to the 172.17.205.21 via SSH timed out.
User: cirros, Password: None

Ran 60 tests in 742.931s

FAILED (failures=47)

To reproduce test environment:
1) checkout tempest and remove all tempest scenario tests except test_server_basic_ops
2) run this command to make 59 copies of the test: for i in {1..59}; do cp -p test_server_basic_ops.py test_server_basic_ops_$i.py; sed --in-place -e "s/class TestServerBasicOps(manager.ScenarioTest):/class TestServerBasicOps$i(manager.ScenarioTest):/" -e "s/ super(TestServerBasicOps, self).setUp()/ super(TestServerBasicOps$i, self).setUp()/" -e "s/ @test.idempotent_id('7fff3fb3-91d8-4fd0-bd7d-0204f1f180ba')/ @test.idempotent_id(\'$(uuidgen)\')/" test_server_basic_ops_$i.py; done
3) run 30 tests and observe successful run: OS_TEST_TIMEOUT=1200 ./run_tempest.sh tempest.scenario -- --concurrency=30
4) run 60 tests and observe failures: OS_TEST_TIMEOUT=1200 ./run_tempest.sh tempest.scenario -- --concurrency=60

Revision history for this message
Kevin Benton (kevinbenton) wrote :

This has been a known issue. We've be getting by so far on hoping the DHCP agent can keep up to speed. It's probably a good time to address this.

Changed in neutron:
assignee: nobody → Kevin Benton (kevinbenton)
status: New → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/181674

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
jason bishop (jason-bishop) wrote : Re: race between neutron port create and nova

This is great. I'll setup cluster and re-run the tests.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

That would be great. I want to make sure this approach doesn't introduce other issues under load.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Kyle Mestery (<email address hidden>) on branch: master
Review: https://review.openstack.org/181674
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Kevin Benton (<email address hidden>) on branch: master
Review: https://review.openstack.org/181674
Reason: test

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Kyle Mestery (<email address hidden>) on branch: master
Review: https://review.openstack.org/181674
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/219706

Changed in neutron:
assignee: Kevin Benton (kevinbenton) → Neil Jerram (neil-jerram)
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote : Re: race between neutron port create and nova

if we could kill this race once and for all, I think we all need to bow to the committer...

Changed in neutron:
milestone: none → liberty-rc1
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

at least the ones induced by lack of dhcp...

Changed in neutron:
importance: Undecided → High
Changed in neutron:
assignee: Neil Jerram (neil-jerram) → Kevin Benton (kevinbenton)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/220758

Revision history for this message
Nell Jerram (neil-jerram) wrote : Re: race between neutron port create and nova

For your perusal, I've posted a DHCP agent patch that one of my colleagues developed for a customer who was seeing the DHCP agent appear to get behind with its handling of new VM ports.

I've just rebased this up to current master with no further testing, so it's quite unlikely to be exactly correct as currently posted. But you may find the approach interesting.

I'm going to dig out and report more history of exactly what we were observing at the time, so that we can evaluate whether it's really the same scenario as this bug.

Revision history for this message
Nell Jerram (neil-jerram) wrote :

It turns out I already mentioned the scenario that https://review.openstack.org/#/c/220758/ addresses on the ML, here: http://lists.openstack.org/pipermail/openstack-dev/2015-June/066188.html:

"Another DHCP agent problem that my team noticed is that it
call_driver('reload_allocations') takes a bit of time (to regenerate the
Dnsmasq config files, and to spawn a shell that sends a HUP signal) -
enough so that if there is a fast steady rate of port-create and
port-delete notifications coming from the Neutron server, these can
build up in DHCPAgent's RPC queue, and then they still only get
dispatched one at a time. So the queue and the time delay become longer
and longer.

I have a fix pending for this, which uses an extra thread to read those
notifications off the RPC queue onto an internal queue, and then batches
the call_driver('reload_allocations') processing when there is a
contiguous sequence of such notifications - i.e. only does the config
regeneration and HUP once, instead of lots of times."

I think that could indeed be the same problem as this bug.

Revision history for this message
Kyle Mestery (mestery) wrote :

I really, really would love to see this solved by RC1. Leaving as targeted for now.

Revision history for this message
Kyle Mestery (mestery) wrote :

Kevin will have the final patch out later today and we'll decide if we want it in the RC at that time.

Revision history for this message
Kyle Mestery (mestery) wrote :

We're not reaching consensus on the bug here, so moving out to Mitaka-1. If we reach consensus in the background and this merges to master, we could look at back porting should we have an RC2.

Changed in neutron:
milestone: liberty-rc1 → mitaka-1
Jay Pipes (jaypipes)
summary: - race between neutron port create and nova
+ race between neutron port create and nova boot
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/219706
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b9cebd18cfc1c4fa0a221d4d2518f7f94d6fc938
Submitter: Jenkins
Branch: master

commit b9cebd18cfc1c4fa0a221d4d2518f7f94d6fc938
Author: Neil Jerram <email address hidden>
Date: Wed Sep 2 15:14:30 2015 +0100

    DHCP agent: log when reloading allocations for a new VM port

    This will help us to see more clearly if failing DHCP for a new VM -
    as seen in failures such as the following traceback from [1] - is
    caused by DHCP configuration being too slow, as suggested by the
    referenced bug.

    Traceback (most recent call last):
      File "/home/aqua/tempest/tempest/test.py", line 125, in wrapper
        return f(self, *func_args, **func_kwargs)
      File "/home/aqua/tempest/tempest/scenario/test_server_basic_ops_38.py", line 105, in test_server_basicops
        self.verify_ssh()
      File "/home/aqua/tempest/tempest/scenario/test_server_basic_ops_38.py", line 95, in verify_ssh
        private_key=self.keypair['private_key'])
      File "/home/aqua/tempest/tempest/scenario/manager.py", line 310, in get_remote_client
        linux_client.validate_authentication()
      File "/home/aqua/tempest/tempest/common/utils/linux/remote_client.py", line 55, in validate_authentication
        self.ssh_client.test_connection_auth()
      File "/home/aqua/tempest/tempest/common/ssh.py", line 150, in test_connection_auth
        connection = self._get_ssh_connection()
      File "/home/aqua/tempest/tempest/common/ssh.py", line 87, in _get_ssh_connection
        password=self.password)
    tempest.exceptions.SSHTimeout: Connection to the 172.17.205.21 via SSH timed out.
    User: cirros, Password: None

    [1] http://logs.openstack.org/39/194939/19/check/gate-grenade-dsvm-neutron/82a7e58/

    Change-Id: I5d77762bf8d4caa2e6f8b3527a30bf80bac62085
    Related-Bug: #1453350

Changed in neutron:
milestone: mitaka-1 → mitaka-2
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/181674
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
Nell Jerram (neil-jerram) wrote :

Would just like to remind that I have a patch up for this: https://review.openstack.org/#/c/220758/. I believe this patch will reduce the current window - in scenarios where there are very many DHCP port updates in flight - between (1) when a VM has booted far enough to request its address over DHCP, and (2) when Dnsmasq is ready to answer that request, by making (2) happen more quickly. Please let me know what you think.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Davanum Srinivas (dims) (<email address hidden>) on branch: master
Review: https://review.openstack.org/265264

Changed in neutron:
milestone: mitaka-2 → mitaka-3
Revision history for this message
Takanori Miyagishi (miyagishi-t) wrote :

I also met this problem and really hope to solve this.
However, the patch-set <https://review.openstack.org/#/c/220758/> seems not updated from November.
I want to know the current status of this patch-set.
I hope this patch will be merged within mitaka cycle.

Revision history for this message
Takanori Miyagishi (miyagishi-t) wrote :

How to clarify whether the phenomenon that occurred in my environment is same as the issue?
Does someone have good ideas to judge that?
For example, can we find specific error messages in log files when the issue occured?

I want to judge whether the issue occurred when I don't using tempest.

Revision history for this message
Nell Jerram (neil-jerram) wrote :

Miyagishi-san, thank you for your comment.

Please look at the log file that contains dnsmasq logging. Usually this is /var/log/syslog or /var/log/messages. If you see logs saying "no address range available for DHCP request ...", I think these indicate that the VM's DHCP request cannot be satisfied because Dnsmasq is not correctly configured at that time.

Changed in neutron:
milestone: mitaka-3 → mitaka-rc1
Changed in neutron:
milestone: mitaka-rc1 → newton-1
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

nudge?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/220758
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Changed in neutron:
assignee: Kevin Benton (kevinbenton) → nobody
status: In Progress → Confirmed
tags: added: loadimpact
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Needs a new owner.

Changed in neutron:
milestone: newton-1 → none
Changed in neutron:
assignee: nobody → Kevin Benton (kevinbenton)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Davanum Srinivas (dims) (<email address hidden>) on branch: master
Review: https://review.openstack.org/265264

Changed in neutron:
milestone: none → newton-1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/181674
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b672c26cb42ad3d9a17ed049b506b5622601e891
Submitter: Jenkins
Branch: master

commit b672c26cb42ad3d9a17ed049b506b5622601e891
Author: Kevin Benton <email address hidden>
Date: Fri Apr 15 06:05:56 2016 -0700

    Add provisioning blocks to status ACTIVE transition

    Sometimes an object requires multiple disjoint actors to complete
    a set of tasks before the status of the object should be transitioned
    to ACTIVE. The main example of this is when a port is being created.
    The L2 agent has to do its business to wire up the VIF, but at the same
    time the DHCP agent has to setup the DHCP reservation. This led to
    Nova booting the VM when the L2 agent was done even though the DHCP
    agent may have been nowhere near ready.

    This patch introduces a provisioning blocks mechansim that allows the
    entities to be tracked that need to be involved to make a transition
    to ACTIVE happen. See the devref in the dependent patch for a high-level
    view of how this works.

    The ML2 code is updated to use this new mechanism to prevent updating
    the port status to ACTIVE without both the DHCP agent and L2 agent
    reporting that the port is ready.

    The DHCP RPC API required a version bump to allow the port ready
    notification.

    This also adds a devref doc for the provisioning_blocks
    module with a high-level overview of how it works in addition
    to a detailed description of how it is used specifically with
    ML2, the L2 agents, and the DHCP agents.

    Closes-Bug: #1453350
    Change-Id: Id85ff6de1a14a550ab50baf4f79d3130af3680c8

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 9.0.0.0b1

This issue was fixed in the openstack/neutron 9.0.0.0b1 development milestone.

tags: added: neutron-proactive-backport-potential
tags: removed: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/440340

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/440340
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=4ba0e75254b087c232be18c007d50045606b20b5
Submitter: Jenkins
Branch: master

commit 4ba0e75254b087c232be18c007d50045606b20b5
Author: Kevin Benton <email address hidden>
Date: Thu Mar 2 02:08:20 2017 -0800

    Skip DHCP agent query in provisioning block setup

    Provisioning blocks merged in Newton so for Pike we can
    safely assume we are not running with Liberty agents that
    don't notify the server when the port is ready.

    This allows us to skip a query to the agents and configuration
    parsing on every port provisioning block setup.

    Related-Bug: #1453350
    Change-Id: I8111469ad4b0d88580bff7a77492ad95af8e9377

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/452009

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/455790

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/455790
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=bc979efdb85b3ac2722d6607485d21e002718e02
Submitter: Jenkins
Branch: master

commit bc979efdb85b3ac2722d6607485d21e002718e02
Author: Jakub Libosvar <email address hidden>
Date: Tue Apr 11 12:27:46 2017 -0400

    fullstack: Don't let dhcp agents failover

    It turned out dhcp tests work only because agents are considered dead
    after 10 seconds while they report to server every 60 seconds. This led
    to calling network resync after agent revival and hiding the fact dhcp
    agent is not capable of receiving any amqp messages.

    This patch sets the report interval of agents to the half of
    agent_down_time on server side and uses eventlet dhcp agent in order to
    trigger eventlet monkey patching code.

    Eventlet was behind the failure with messages not getting processed. As
    [1] notes: "Note: If the “eventlet” executor is used, the threading and
    time library need to be monkeypatched."

    Because each port calls dhclient to obtain IP address and each dhclient
    instance overwrites /etc/resolv.conf there was added a script that
    generates fullstack-dhclient-script from an existing dhclient-script
    before starting fulltstack tests. This generated script is passed to
    each dhclient process running in fake fullstack machine using -sf
    parameter.

    [1] https://docs.openstack.org/developer/oslo.messaging/server.html

    Related-bug: 1453350

    Change-Id: I0336176b9c364fe3a95be5cef9e7a3af1ef9d7e9

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/467790

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Kevin Benton (<email address hidden>) on branch: master
Review: https://review.openstack.org/467790
Reason: will revisit if i get more time

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/452009
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=4094979777102dd36a79d7fe8c809f9e360a946a
Submitter: Jenkins
Branch: master

commit 4094979777102dd36a79d7fe8c809f9e360a946a
Author: Kevin Benton <email address hidden>
Date: Thu Mar 2 02:08:20 2017 -0800

    Drop 'notifies_port_ready' check for DHCP agents

    Provisioning blocks merged in Newton so for Pike we can
    safely assume we are not running with Liberty agents that
    don't notify the server when the port is ready.

    This also drops a block of logic in the agent that was providing
    forward compatibility with servers that didn't support the
    'dhcp_ready_on_ports' endpoint since servers have been supporting
    it for so long and we don't normally allow agents to be upgraded
    first anyway.

    Related-Bug: #1453350
    Change-Id: Ia86547fb4601915d7dd852b6f7a11c120089d6f6

tags: added: neutron-proactive-backport-potential
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.