neutron

race between neutron port create and nova boot

Bug #1453350 reported by jason bishop on 2015-05-09

This bug affects 9 people

Affects		Status	Importance	Assigned to	Milestone
	neutron	Fix Released	High	Kevin Benton	neutron newton-1 "n1"

Bug Description

I am doing load testing with tempest scenario tests and see what I think is a race condition between neutron dhcp standup and nova boot. I believe the scenario I am seeing to be a more general case of https://bugs.launchpad.net/neutron/+bug/1334447.

test environment: 5 compute nodes, 1 controller node running all api and neutron services. ubuntu juno hand patched 1382064 and 1385257 and my workaround in 1451492. standard neutron setup otherwise.

If I run tempest scenario test test_server_basic_ops 30 times in parallel things consistently work fine. If I increase to 60 in parallel I get lots of failures (see below). Upon investigation, it looks to me that neutron standup of netns and its dnsmasq process is too slow and loses the race with nova boot and the VM comes up without a (dhcp provided) IP address (causing ssh to timeout and fail).

Traceback (most recent call last):
  File "/home/aqua/tempest/tempest/test.py", line 125, in wrapper
    return f(self, *func_args, **func_kwargs)
  File "/home/aqua/tempest/tempest/scenario/test_server_basic_ops_38.py", line 105, in test_server_basicops
    self.verify_ssh()
  File "/home/aqua/tempest/tempest/scenario/test_server_basic_ops_38.py", line 95, in verify_ssh
    private_key=self.keypair['private_key'])
  File "/home/aqua/tempest/tempest/scenario/manager.py", line 310, in get_remote_client
    linux_client.validate_authentication()
  File "/home/aqua/tempest/tempest/common/utils/linux/remote_client.py", line 55, in validate_authentication
    self.ssh_client.test_connection_auth()
  File "/home/aqua/tempest/tempest/common/ssh.py", line 150, in test_connection_auth
    connection = self._get_ssh_connection()
  File "/home/aqua/tempest/tempest/common/ssh.py", line 87, in _get_ssh_connection
    password=self.password)
tempest.exceptions.SSHTimeout: Connection to the 172.17.205.21 via SSH timed out.
User: cirros, Password: None

Ran 60 tests in 742.931s

FAILED (failures=47)

To reproduce test environment:
1) checkout tempest and remove all tempest scenario tests except test_server_basic_ops
2) run this command to make 59 copies of the test: for i in {1..59}; do cp -p test_server_basic_ops.py test_server_basic_ops_$i.py; sed --in-place -e "s/class TestServerBasicOps(manager.ScenarioTest):/class TestServerBasicOps$i(manager.ScenarioTest):/" -e "s/ super(TestServerBasicOps, self).setUp()/ super(TestServerBasicOps$i, self).setUp()/" -e "s/ @test.idempotent_id('7fff3fb3-91d8-4fd0-bd7d-0204f1f180ba')/ @test.idempotent_id(\'$(uuidgen)\')/" test_server_basic_ops_$i.py; done
3) run 30 tests and observe successful run: OS_TEST_TIMEOUT=1200 ./run_tempest.sh tempest.scenario -- --concurrency=30
4) run 60 tests and observe failures: OS_TEST_TIMEOUT=1200 ./run_tempest.sh tempest.scenario -- --concurrency=60

Tags:

Revision history for this message

Kevin Benton (kevinbenton) wrote on 2015-05-09:

This has been a known issue. We've be getting by so far on hoping the DHCP agent can keep up to speed. It's probably a good time to address this.

Changed in neutron:
assignee:	nobody → Kevin Benton (kevinbenton)
status:	New → Confirmed

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-05-09: Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/181674

Changed in neutron:
status:	Confirmed → In Progress

Revision history for this message

jason bishop (jason-bishop) wrote on 2015-05-09: Re: race between neutron port create and nova

This is great. I'll setup cluster and re-run the tests.

Revision history for this message

Kevin Benton (kevinbenton) wrote on 2015-05-11:

That would be great. I want to make sure this approach doesn't introduce other issues under load.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-06-10: Change abandoned on neutron (master)

Change abandoned by Kyle Mestery (<email address hidden>) on branch: master
Review: https://review.openstack.org/181674
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-07-13:

Change abandoned by Kevin Benton (<email address hidden>) on branch: master
Review: https://review.openstack.org/181674
Reason: test

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-08-11:

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-09-02: Fix proposed to neutron (master)

#10

Fix proposed to branch: master
Review: https://review.openstack.org/219706

Changed in neutron:
assignee:	Kevin Benton (kevinbenton) → Neil Jerram (neil-jerram)

Revision history for this message

Armando Migliaccio (armando-migliaccio) wrote on 2015-09-04: Re: race between neutron port create and nova

#11

if we could kill this race once and for all, I think we all need to bow to the committer...

Changed in neutron:
milestone:	none → liberty-rc1

Revision history for this message

Armando Migliaccio (armando-migliaccio) wrote on 2015-09-04:

#12

at least the ones induced by lack of dhcp...

Armando Migliaccio (armando-migliaccio) on 2015-09-04

Changed in neutron:
importance:	Undecided → High

OpenStack Infra (hudson-openstack) on 2015-09-04

Changed in neutron:
assignee:	Neil Jerram (neil-jerram) → Kevin Benton (kevinbenton)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-09-05: Related fix proposed to neutron (master)

#13

Related fix proposed to branch: master
Review: https://review.openstack.org/220758

Revision history for this message

Nell Jerram (neil-jerram) wrote on 2015-09-05: Re: race between neutron port create and nova

#14

For your perusal, I've posted a DHCP agent patch that one of my colleagues developed for a customer who was seeing the DHCP agent appear to get behind with its handling of new VM ports.

I've just rebased this up to current master with no further testing, so it's quite unlikely to be exactly correct as currently posted. But you may find the approach interesting.

I'm going to dig out and report more history of exactly what we were observing at the time, so that we can evaluate whether it's really the same scenario as this bug.

Revision history for this message

Nell Jerram (neil-jerram) wrote on 2015-09-05:

#15

It turns out I already mentioned the scenario that https://review.openstack.org/#/c/220758/ addresses on the ML, here: http://lists.openstack.org/pipermail/openstack-dev/2015-June/066188.html:

"Another DHCP agent problem that my team noticed is that it
call_driver('reload_allocations') takes a bit of time (to regenerate the
Dnsmasq config files, and to spawn a shell that sends a HUP signal) -
enough so that if there is a fast steady rate of port-create and
port-delete notifications coming from the Neutron server, these can
build up in DHCPAgent's RPC queue, and then they still only get
dispatched one at a time. So the queue and the time delay become longer
and longer.

I have a fix pending for this, which uses an extra thread to read those
notifications off the RPC queue onto an internal queue, and then batches
the call_driver('reload_allocations') processing when there is a
contiguous sequence of such notifications - i.e. only does the config
regeneration and HUP once, instead of lots of times."

I think that could indeed be the same problem as this bug.

Revision history for this message

Kyle Mestery (mestery) wrote on 2015-09-15:

#16

I really, really would love to see this solved by RC1. Leaving as targeted for now.

Revision history for this message

Kyle Mestery (mestery) wrote on 2015-09-22:

#17

Kevin will have the final patch out later today and we'll decide if we want it in the RC at that time.

Revision history for this message

Kyle Mestery (mestery) wrote on 2015-09-23:

#18

We're not reaching consensus on the bug here, so moving out to Mitaka-1. If we reach consensus in the background and this merges to master, we could look at back porting should we have an RC2.

Changed in neutron:
milestone:	liberty-rc1 → mitaka-1

Jay Pipes (jaypipes) on 2015-09-23

summary:

- race between neutron port create and nova
+ race between neutron port create and nova boot

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-11-02: Related fix merged to neutron (master)

#19

Reviewed: https://review.openstack.org/219706
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b9cebd18cfc1c4fa0a221d4d2518f7f94d6fc938
Submitter: Jenkins
Branch: master

commit b9cebd18cfc1c4fa0a221d4d2518f7f94d6fc938
Author: Neil Jerram <email address hidden>
Date: Wed Sep 2 15:14:30 2015 +0100

DHCP agent: log when reloading allocations for a new VM port

    This will help us to see more clearly if failing DHCP for a new VM -
    as seen in failures such as the following traceback from [1] - is
    caused by DHCP configuration being too slow, as suggested by the
    referenced bug.

    Traceback (most recent call last):
      File "/home/aqua/tempest/tempest/test.py", line 125, in wrapper
        return f(self, *func_args, **func_kwargs)
      File "/home/aqua/tempest/tempest/scenario/test_server_basic_ops_38.py", line 105, in test_server_basicops
        self.verify_ssh()
      File "/home/aqua/tempest/tempest/scenario/test_server_basic_ops_38.py", line 95, in verify_ssh
        private_key=self.keypair['private_key'])
      File "/home/aqua/tempest/tempest/scenario/manager.py", line 310, in get_remote_client
        linux_client.validate_authentication()
      File "/home/aqua/tempest/tempest/common/utils/linux/remote_client.py", line 55, in validate_authentication
        self.ssh_client.test_connection_auth()
      File "/home/aqua/tempest/tempest/common/ssh.py", line 150, in test_connection_auth
        connection = self._get_ssh_connection()
      File "/home/aqua/tempest/tempest/common/ssh.py", line 87, in _get_ssh_connection
        password=self.password)
    tempest.exceptions.SSHTimeout: Connection to the 172.17.205.21 via SSH timed out.
    User: cirros, Password: None

[1] http://logs.openstack.org/39/194939/19/check/gate-grenade-dsvm-neutron/82a7e58/

Change-Id: I5d77762bf8d4caa2e6f8b3527a30bf80bac62085
Related-Bug: #1453350

Armando Migliaccio (armando-migliaccio) on 2015-12-03

Changed in neutron:
milestone:	mitaka-1 → mitaka-2

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-12-04: Change abandoned on neutron (master)

#20

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/181674
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message

Nell Jerram (neil-jerram) wrote on 2015-12-04:

#21

Would just like to remind that I have a patch up for this: https://review.openstack.org/#/c/220758/. I believe this patch will reduce the current window - in scenarios where there are very many DHCP port updates in flight - between (1) when a VM has booted far enough to request its address over DHCP, and (2) when Dnsmasq is ready to answer that request, by making (2) happen more quickly. Please let me know what you think.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-01-19:

#22

Change abandoned by Davanum Srinivas (dims) (<email address hidden>) on branch: master
Review: https://review.openstack.org/265264

Armando Migliaccio (armando-migliaccio) on 2016-01-20

Changed in neutron:
milestone:	mitaka-2 → mitaka-3

Revision history for this message

Takanori Miyagishi (miyagishi-t) wrote on 2016-02-10:

#23

I also met this problem and really hope to solve this.
However, the patch-set <https://review.openstack.org/#/c/220758/> seems not updated from November.
I want to know the current status of this patch-set.
I hope this patch will be merged within mitaka cycle.

Revision history for this message

Takanori Miyagishi (miyagishi-t) wrote on 2016-02-25:

#24

How to clarify whether the phenomenon that occurred in my environment is same as the issue?
Does someone have good ideas to judge that?
For example, can we find specific error messages in log files when the issue occured?

I want to judge whether the issue occurred when I don't using tempest.

Revision history for this message

Nell Jerram (neil-jerram) wrote on 2016-02-25:

#25

Miyagishi-san, thank you for your comment.

Please look at the log file that contains dnsmasq logging. Usually this is /var/log/syslog or /var/log/messages. If you see logs saying "no address range available for DHCP request ...", I think these indicate that the VM's DHCP request cannot be satisfied because Dnsmasq is not correctly configured at that time.

Armando Migliaccio (armando-migliaccio) on 2016-03-03

Changed in neutron:
milestone:	mitaka-3 → mitaka-rc1

Armando Migliaccio (armando-migliaccio) on 2016-03-14

Changed in neutron:
milestone:	mitaka-rc1 → newton-1

Revision history for this message

Armando Migliaccio (armando-migliaccio) wrote on 2016-03-31:

#26

nudge?

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-04-14:

#27

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/220758
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Armando Migliaccio (armando-migliaccio) on 2016-04-14

Changed in neutron:
assignee:	Kevin Benton (kevinbenton) → nobody
status:	In Progress → Confirmed
tags:	added: loadimpact

Revision history for this message

Armando Migliaccio (armando-migliaccio) wrote on 2016-04-14:

#28

Needs a new owner.

Changed in neutron:
milestone:	newton-1 → none

OpenStack Infra (hudson-openstack) on 2016-04-15

Changed in neutron:
assignee:	nobody → Kevin Benton (kevinbenton)
status:	Confirmed → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-04-18:

#29

Change abandoned by Davanum Srinivas (dims) (<email address hidden>) on branch: master
Review: https://review.openstack.org/265264

Armando Migliaccio (armando-migliaccio) on 2016-05-13

Changed in neutron:
milestone:	none → newton-1

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2016-05-17: Fix merged to neutron (master)

#31

Reviewed: https://review.openstack.org/181674
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b672c26cb42ad3d9a17ed049b506b5622601e891
Submitter: Jenkins
Branch: master

commit b672c26cb42ad3d9a17ed049b506b5622601e891
Author: Kevin Benton <email address hidden>
Date: Fri Apr 15 06:05:56 2016 -0700

Add provisioning blocks to status ACTIVE transition

    Sometimes an object requires multiple disjoint actors to complete
    a set of tasks before the status of the object should be transitioned
    to ACTIVE. The main example of this is when a port is being created.
    The L2 agent has to do its business to wire up the VIF, but at the same
    time the DHCP agent has to setup the DHCP reservation. This led to
    Nova booting the VM when the L2 agent was done even though the DHCP
    agent may have been nowhere near ready.

    This patch introduces a provisioning blocks mechansim that allows the
    entities to be tracked that need to be involved to make a transition
    to ACTIVE happen. See the devref in the dependent patch for a high-level
    view of how this works.

    The ML2 code is updated to use this new mechanism to prevent updating
    the port status to ACTIVE without both the DHCP agent and L2 agent
    reporting that the port is ready.

The DHCP RPC API required a version bump to allow the port ready
notification.

    This also adds a devref doc for the provisioning_blocks
    module with a high-level overview of how it works in addition
    to a detailed description of how it is used specifically with
    ML2, the L2 agents, and the DHCP agents.

Closes-Bug: #1453350
Change-Id: Id85ff6de1a14a550ab50baf4f79d3130af3680c8

Changed in neutron:
status:	In Progress → Fix Released

Revision history for this message

Doug Hellmann (doug-hellmann) wrote on 2016-06-03: Fix included in openstack/neutron 9.0.0.0b1

#32

This issue was fixed in the openstack/neutron 9.0.0.0b1 development milestone.

Ihar Hrachyshka (ihar-hrachyshka) on 2016-06-30

tags:

added: neutron-proactive-backport-potential

Ihar Hrachyshka (ihar-hrachyshka) on 2016-07-02

tags:

removed: neutron-proactive-backport-potential

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-03-02: Related fix proposed to neutron (master)

#33

Related fix proposed to branch: master
Review: https://review.openstack.org/440340

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-03-30: Related fix merged to neutron (master)

#34

Reviewed: https://review.openstack.org/440340
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=4ba0e75254b087c232be18c007d50045606b20b5
Submitter: Jenkins
Branch: master

commit 4ba0e75254b087c232be18c007d50045606b20b5
Author: Kevin Benton <email address hidden>
Date: Thu Mar 2 02:08:20 2017 -0800

Skip DHCP agent query in provisioning block setup

    Provisioning blocks merged in Newton so for Pike we can
    safely assume we are not running with Liberty agents that
    don't notify the server when the port is ready.

This allows us to skip a query to the agents and configuration
parsing on every port provisioning block setup.

Related-Bug: #1453350
Change-Id: I8111469ad4b0d88580bff7a77492ad95af8e9377

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-03-30: Related fix proposed to neutron (master)

#35

Related fix proposed to branch: master
Review: https://review.openstack.org/452009

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-04-11:

#36

Related fix proposed to branch: master
Review: https://review.openstack.org/455790

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-04-21: Related fix merged to neutron (master)

#37

Reviewed: https://review.openstack.org/455790
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=bc979efdb85b3ac2722d6607485d21e002718e02
Submitter: Jenkins
Branch: master

commit bc979efdb85b3ac2722d6607485d21e002718e02
Author: Jakub Libosvar <email address hidden>
Date: Tue Apr 11 12:27:46 2017 -0400

fullstack: Don't let dhcp agents failover

    It turned out dhcp tests work only because agents are considered dead
    after 10 seconds while they report to server every 60 seconds. This led
    to calling network resync after agent revival and hiding the fact dhcp
    agent is not capable of receiving any amqp messages.

    This patch sets the report interval of agents to the half of
    agent_down_time on server side and uses eventlet dhcp agent in order to
    trigger eventlet monkey patching code.

    Eventlet was behind the failure with messages not getting processed. As
    [1] notes: "Note: If the “eventlet” executor is used, the threading and
    time library need to be monkeypatched."

    Because each port calls dhclient to obtain IP address and each dhclient
    instance overwrites /etc/resolv.conf there was added a script that
    generates fullstack-dhclient-script from an existing dhclient-script
    before starting fulltstack tests. This generated script is passed to
    each dhclient process running in fake fullstack machine using -sf
    parameter.

[1] https://docs.openstack.org/developer/oslo.messaging/server.html

Related-bug: 1453350

Change-Id: I0336176b9c364fe3a95be5cef9e7a3af1ef9d7e9

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-05-24: Related fix proposed to neutron (master)

#38

Related fix proposed to branch: master
Review: https://review.openstack.org/467790

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-05-24: Change abandoned on neutron (master)

#39

Change abandoned by Kevin Benton (<email address hidden>) on branch: master
Review: https://review.openstack.org/467790
Reason: will revisit if i get more time

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2017-05-26: Related fix merged to neutron (master)

#40

Reviewed: https://review.openstack.org/452009
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=4094979777102dd36a79d7fe8c809f9e360a946a
Submitter: Jenkins
Branch: master

commit 4094979777102dd36a79d7fe8c809f9e360a946a
Author: Kevin Benton <email address hidden>
Date: Thu Mar 2 02:08:20 2017 -0800

Drop 'notifies_port_ready' check for DHCP agents

    Provisioning blocks merged in Newton so for Pike we can
    safely assume we are not running with Liberty agents that
    don't notify the server when the port is ready.

    This also drops a block of logic in the agent that was providing
    forward compatibility with servers that didn't support the
    'dhcp_ready_on_ports' endpoint since servers have been supporting
    it for so long and we don't normally allow agents to be upgraded
    first anyway.

Related-Bug: #1453350
Change-Id: Ia86547fb4601915d7dd852b6f7a11c120089d6f6

Ihar Hrachyshka (ihar-hrachyshka) on 2017-06-13

tags:

added: neutron-proactive-backport-potential

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.