neutron

Duplicate entries in the hosts file

Bug #1191768 reported by Gary Kotton on 2013-06-17

This bug affects 3 people

Affects		Status	Importance	Assigned to	Milestone
	neutron	Fix Released	Medium	Gary Kotton	neutron 2013.2 "havana"
	Grizzly	Fix Released	Medium	Gary Kotton	neutron 2013.1.3

Bug Description

When the Quantum service is under load, IP addresses may appear more than once in the hosts file, for example:
fa:16:3e:16:b1:62,10-0-0-3.openstacklocal,10.0.0.3
fa:16:3e:55:86:b2,10-0-0-1.openstacklocal,10.0.0.1
fa:16:3e:f9:ad:20,10-0-0-15.openstacklocal,10.0.0.15
fa:16:3e:d6:94:39,10-0-0-13.openstacklocal,10.0.0.13
fa:16:3e:7d:42:32,10-0-0-15.openstacklocal,10.0.0.15
fa:16:3e:f7:86:a6,10-0-0-16.openstacklocal,10.0.0.16
fa:16:3e:a8:d9:2e,10-0-0-19.openstacklocal,10.0.0.19

This happens because the service thinks that the DHCP agent is down, so it does not send a deletion message. The underlying cause is that the agent is unable to send its state message to the service.

This happens when using qpid as the message broker and deleting a considerable number of VMs at one time.
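
To check a hosts file for this symptom, something like the following sketch may help. It assumes each entry has the `<mac>,<hostname>,<ip>` form shown above; the function name `find_duplicate_ips` is made up for illustration and is not part of Quantum.

```python
from collections import defaultdict


def find_duplicate_ips(hosts_text):
    """Return {ip: [mac, ...]} for every IP listed on more than one line."""
    macs_by_ip = defaultdict(list)
    for line in hosts_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Assumed entry format: "<mac>,<hostname>,<ip>"
        mac, _hostname, ip = line.split(",")
        macs_by_ip[ip].append(mac)
    return {ip: macs for ip, macs in macs_by_ip.items() if len(macs) > 1}


# Sample taken from the hosts file excerpt in this report.
SAMPLE = """\
fa:16:3e:16:b1:62,10-0-0-3.openstacklocal,10.0.0.3
fa:16:3e:55:86:b2,10-0-0-1.openstacklocal,10.0.0.1
fa:16:3e:f9:ad:20,10-0-0-15.openstacklocal,10.0.0.15
fa:16:3e:d6:94:39,10-0-0-13.openstacklocal,10.0.0.13
fa:16:3e:7d:42:32,10-0-0-15.openstacklocal,10.0.0.15
fa:16:3e:f7:86:a6,10-0-0-16.openstacklocal,10.0.0.16
fa:16:3e:a8:d9:2e,10-0-0-19.openstacklocal,10.0.0.19
"""

if __name__ == "__main__":
    for ip, macs in find_duplicate_ips(SAMPLE).items():
        print(ip, macs)
```

On the excerpt above this reports 10.0.0.15 as the only address that is assigned to two MAC addresses.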

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2013-06-17: Fix proposed to quantum (master)

Fix proposed to branch: master
Review: https://review.openstack.org/33254

Changed in quantum:
assignee:	nobody → Gary Kotton (garyk)
status:	New → In Progress

Gary Kotton (garyk) on 2013-06-17

tags:

added: grizzly-backport-potential

Akihiro Motoki (amotoki) wrote on 2013-06-17:

In the default configuration agent_down_time is 5 sec and report_interval is 4 sec.
I think the cause is that report_interval and agent_down_time are too close.
If an agent report is delayed by 1 second, the server will regard the agent as down.

IMO, agent_down_time should be larger. It seems better for it to be at least more than double report_interval.
If so, an agent will not be regarded as down even if one report message is dropped.

Any thoughts?
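
The timing argument above can be sketched roughly as follows. This is only an illustration: the helper names are hypothetical, the comparison is an assumption about how a server might judge liveness (not Quantum's actual check), and message delivery latency is ignored.

```python
def is_agent_down(last_report, now, agent_down_time):
    """Assumed liveness rule: down once the last report is older than agent_down_time."""
    return (now - last_report) > agent_down_time


def tolerated_missed_reports(report_interval, agent_down_time):
    """How many consecutive reports can be lost before the agent looks down.

    If k reports are lost, the next one arrives (k + 1) * report_interval
    seconds after the last one received, which must not exceed agent_down_time.
    """
    return max(0, agent_down_time // report_interval - 1)


if __name__ == "__main__":
    # Defaults quoted in the comment: report every 4 sec, down after 5 sec.
    print(tolerated_missed_reports(4, 5))
    # A value larger than double the report interval.
    print(tolerated_missed_reports(4, 9))
    # A report expected at t=4 that is still missing at t=5.5.
    print(is_agent_down(last_report=0, now=5.5, agent_down_time=5))
```

Under these assumptions the defaults leave no room for a lost report, while an agent_down_time of 9 sec would tolerate one.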

Gary Kotton (garyk) wrote on 2013-06-17:

I have tried playing around with the timeout of the agent. The problem here is that the RPC timeout is 60 seconds. This means that the dhcp agent will not send an update for 60 seconds due to the timeout.

Akihiro Motoki (amotoki) wrote on 2013-06-17:

I see. The proposed solution sounds reasonable to me.

yong sheng gong (gongysh) wrote on 2013-06-18:

Using cast will flood the quantum server when the server cannot keep up, which will make the situation worse and worse.

OpenStack Infra (hudson-openstack) wrote on 2013-06-28: Fix merged to quantum (master)

Reviewed: https://review.openstack.org/33254
Committed: http://github.com/openstack/quantum/commit/0407b59c259b2861b4541e31379e994ea8350b83
Submitter: Jenkins
Branch: master

commit 0407b59c259b2861b4541e31379e994ea8350b83
Author: Gary Kotton <email address hidden>
Date: Mon Jun 17 11:37:11 2013 +0000

Ensure that the report state is not a blocking call

Fixes bug 1191768

    For the dhcp and l3 agents the first state report will be done
    via a call. If this succeeds then subsequent calls will be done via
    the cast method.

Change-Id: I82a1d92fc84983b7bb46758db0aee3e3eca1d3be
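
The behaviour described in the commit message could look roughly like the sketch below. It is not the actual patch: `StateReporter` and `RecordingRpc` are hypothetical names, and the RPC client is a stand-in that only records which method was used.

```python
class StateReporter:
    """Reports state with a blocking call first, then with casts."""

    def __init__(self, rpc):
        self.rpc = rpc
        self.use_call = True

    def report_state(self, state):
        if self.use_call:
            # Blocking: waits for the server's reply. If this raises,
            # use_call stays True and the next report uses call again.
            self.rpc.call("report_state", state)
            self.use_call = False
        else:
            # Non-blocking: fire and forget, so a slow server cannot
            # stall the reporting loop.
            self.rpc.cast("report_state", state)


class RecordingRpc:
    """Stand-in RPC client (assumption) that records what was sent."""

    def __init__(self):
        self.sent = []

    def call(self, method, payload):
        self.sent.append(("call", method))

    def cast(self, method, payload):
        self.sent.append(("cast", method))


if __name__ == "__main__":
    rpc = RecordingRpc()
    reporter = StateReporter(rpc)
    for _ in range(3):
        reporter.report_state({"agent": "dhcp"})
    print(rpc.sent)
```

In this sketch the first report confirms that the server is reachable, and later reports no longer wait for a reply.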

Changed in neutron:
status:	In Progress → Fix Committed

OpenStack Infra (hudson-openstack) wrote on 2013-06-29: Fix proposed to quantum (stable/grizzly)

Fix proposed to branch: stable/grizzly
Review: https://review.openstack.org/34979

OpenStack Infra (hudson-openstack) wrote on 2013-06-29: Fix merged to quantum (stable/grizzly)

Reviewed: https://review.openstack.org/34979
Committed: http://github.com/openstack/quantum/commit/06f679df5d025e657b2204151688ffa60c97a3d3
Submitter: Jenkins
Branch: stable/grizzly

commit 06f679df5d025e657b2204151688ffa60c97a3d3
Author: Gary Kotton <email address hidden>
Date: Mon Jun 17 11:37:11 2013 +0000

Ensure that the report state is not a blocking call

Fixes bug 1191768

    For the dhcp and l3 agents the first state report will be done
    via a call. If this succeeds then subsequent calls will be done via
    the cast method.

Change-Id: I82a1d92fc84983b7bb46758db0aee3e3eca1d3be
(cherry picked from commit 0407b59c259b2861b4541e31379e994ea8350b83)

tags:

added: in-stable-grizzly

Thierry Carrez (ttx) on 2013-07-17

Changed in neutron:
milestone:	none → havana-2
status:	Fix Committed → Fix Released

Alan Pevec (apevec) on 2013-08-06

tags:	removed: grizzly-backport-potential in-stable-grizzly
Changed in neutron:
importance:	Undecided → Medium

Adam Gandelman (gandelman-a) wrote on 2013-08-23:

I still seem to be hitting this or something related on 2013.1.3 using kombu/rabbitmq. I end up with stale entries in the dhcp hosts file:

fa:16:3e:33:6b:f4,host-10-5-0-1.openstacklocal,10.5.0.1
fa:16:3e:41:79:76,host-10-5-0-2.openstacklocal,10.5.0.2
fa:16:3e:93:21:e3,host-10-5-0-3.openstacklocal,10.5.0.3
fa:16:3e:ae:46:62,host-10-5-0-4.openstacklocal,10.5.0.4
fa:16:3e:f8:44:82,host-10-5-0-10.openstacklocal,10.5.0.10
fa:16:3e:e7:d0:3b,host-10-5-0-25.openstacklocal,10.5.0.25
fa:16:3e:b0:f1:f4,host-10-5-0-28.openstacklocal,10.5.0.28
fa:16:3e:05:e5:55,host-10-5-0-4.openstacklocal,10.5.0.4

Restarting the DHCP agent re-renders the file correctly:

Thierry Carrez (ttx) on 2013-10-17

Changed in neutron:
milestone:	havana-2 → 2013.2
