Duplicate entries in the hosts file

Bug #1191768 reported by Gary Kotton
This bug affects 3 people
Affects    Status          Importance    Assigned to    Milestone
neutron    Fix Released    Medium        Gary Kotton
Grizzly    Fix Released    Medium        Gary Kotton

Bug Description

When the Quantum service is under load, the hosts file can end up with duplicate entries. An IP address may appear more than once in the file, for example:
fa:16:3e:16:b1:62,10-0-0-3.openstacklocal,10.0.0.3
fa:16:3e:55:86:b2,10-0-0-1.openstacklocal,10.0.0.1
fa:16:3e:f9:ad:20,10-0-0-15.openstacklocal,10.0.0.15
fa:16:3e:d6:94:39,10-0-0-13.openstacklocal,10.0.0.13
fa:16:3e:7d:42:32,10-0-0-15.openstacklocal,10.0.0.15
fa:16:3e:f7:86:a6,10-0-0-16.openstacklocal,10.0.0.16
fa:16:3e:a8:d9:2e,10-0-0-19.openstacklocal,10.0.0.19

This happens because the service thinks that the DHCP agent is down and therefore does not send it a deletion message. The underlying cause is that the agent is unable to send its state message to the service.

This happens when using qpid as the message broker and deleting a considerable number of VMs at one time.
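
To check whether a given hosts file is affected, something like the following might help. It is only a sketch: the function name find_duplicate_ips is made up for illustration, and it assumes every line has exactly the three comma-separated fields shown above (MAC, hostname, IP).

```python
from collections import defaultdict


def find_duplicate_ips(hosts_text):
    """Return {ip: [mac, ...]} for every IP that is listed more than once.

    Assumes the "mac,hostname,ip" layout from the example above; lines
    that do not have exactly three fields are skipped.
    """
    macs_by_ip = defaultdict(list)
    for line in hosts_text.splitlines():
        parts = line.strip().split(",")
        if len(parts) != 3:
            continue
        mac, _hostname, ip = parts
        macs_by_ip[ip].append(mac)
    return {ip: macs for ip, macs in macs_by_ip.items() if len(macs) > 1}


# The entries from the bug description.
SAMPLE = """\
fa:16:3e:16:b1:62,10-0-0-3.openstacklocal,10.0.0.3
fa:16:3e:55:86:b2,10-0-0-1.openstacklocal,10.0.0.1
fa:16:3e:f9:ad:20,10-0-0-15.openstacklocal,10.0.0.15
fa:16:3e:d6:94:39,10-0-0-13.openstacklocal,10.0.0.13
fa:16:3e:7d:42:32,10-0-0-15.openstacklocal,10.0.0.15
fa:16:3e:f7:86:a6,10-0-0-16.openstacklocal,10.0.0.16
fa:16:3e:a8:d9:2e,10-0-0-19.openstacklocal,10.0.0.19
"""

if __name__ == "__main__":
    for ip, macs in find_duplicate_ips(SAMPLE).items():
        print(ip, macs)
```

For the sample above, the only address reported is 10.0.0.15, with the two MACs fa:16:3e:f9:ad:20 and fa:16:3e:7d:42:32.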

OpenStack Infra (hudson-openstack) wrote : Fix proposed to quantum (master)

Fix proposed to branch: master
Review: https://review.openstack.org/33254

Changed in quantum:
assignee: nobody → Gary Kotton (garyk)
status: New → In Progress
Gary Kotton (garyk)
tags: added: grizzly-backport-potential
Akihiro Motoki (amotoki) wrote :

In the default configuration agent_down_time is 5 sec and report_interval is 4 sec.
I think the cause is that report_interval and agent_down_time are too close.
If an agent report is delayed by 1 second, the server will regard the agent as down.

IMO, agent_down_time should be larger. At least it seems better for it to be more than double report_interval.
If so, an agent will not be regarded as down even if one report message is dropped.

Any thoughts?
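
The arithmetic behind this suggestion can be sketched as follows. This is a simplified model, not the server's actual check: it assumes the server marks an agent down once the time since its last report exceeds agent_down_time, and the function names are made up for illustration.

```python
def is_agent_down(seconds_since_last_report, agent_down_time):
    """Simplified liveness check: down once the last report is too old."""
    return seconds_since_last_report > agent_down_time


def delay_margin(report_interval, agent_down_time):
    """Seconds a single report may arrive late before the agent looks down."""
    return agent_down_time - report_interval


def survives_one_dropped_report(report_interval, agent_down_time):
    """True if the gap left by one lost report stays within agent_down_time.

    Losing one report means two report intervals pass between updates.
    """
    return not is_agent_down(2 * report_interval, agent_down_time)


if __name__ == "__main__":
    # Defaults quoted in the comment above: 4 sec interval, 5 sec down time.
    print(delay_margin(4, 5))
    print(survives_one_dropped_report(4, 5))
    # A down time larger than double the interval, as suggested.
    print(survives_one_dropped_report(4, 9))
```

Under this model the defaults leave a margin of 1 second and do not survive a single dropped report (a gap of 8 seconds against a limit of 5), while a down time of 9 seconds with the same interval does.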

Gary Kotton (garyk) wrote :

I have tried playing around with the agent timeout. The problem here is that the RPC timeout is 60 seconds. This means that the DHCP agent will not send an update for 60 seconds because of the timeout.

Akihiro Motoki (amotoki) wrote :

I see. The proposed solution sounds reasonable to me.

yong sheng gong (gongysh) wrote :

Using cast will flood the quantum server when the server cannot keep up, which will make the situation progressively worse.

OpenStack Infra (hudson-openstack) wrote : Fix merged to quantum (master)

Reviewed: https://review.openstack.org/33254
Committed: http://github.com/openstack/quantum/commit/0407b59c259b2861b4541e31379e994ea8350b83
Submitter: Jenkins
Branch: master

commit 0407b59c259b2861b4541e31379e994ea8350b83
Author: Gary Kotton <email address hidden>
Date: Mon Jun 17 11:37:11 2013 +0000

    Ensure that the report state is not a blocking call

    Fixes bug 1191768

    For the dhcp and l3 agents the first state report will be done
    via a call. If this succeeds then subsequent calls will be done via
    the cast method.

    Change-Id: I82a1d92fc84983b7bb46758db0aee3e3eca1d3be
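
The call-then-cast pattern described in the commit message could look roughly like the sketch below. This is not the code from the commit: StateReporter, FakeRpc and the "report_state" method name are hypothetical and used only for illustration, on the assumption that call blocks until the server replies (and raises on failure) while cast returns immediately without waiting.

```python
class FakeRpc:
    """Hypothetical stand-in for an RPC client; records what was sent."""

    def __init__(self):
        self.sent = []

    def call(self, method, payload):
        # A real blocking call would wait for the server's reply here.
        self.sent.append(("call", method))
        return "ok"

    def cast(self, method, payload):
        # Fire and forget: no reply is awaited.
        self.sent.append(("cast", method))


class StateReporter:
    """Report agent state: first via call, then via cast once that succeeds."""

    def __init__(self, rpc):
        self._rpc = rpc
        self._use_call = True

    def report(self, state):
        if self._use_call:
            # If this raises, _use_call stays True and the next report
            # is attempted as a blocking call again.
            self._rpc.call("report_state", state)
            self._use_call = False
        else:
            self._rpc.cast("report_state", state)


if __name__ == "__main__":
    rpc = FakeRpc()
    reporter = StateReporter(rpc)
    for _ in range(3):
        reporter.report({"agent_type": "DHCP agent"})
    print(rpc.sent)
```

In this sketch, three consecutive reports produce one call followed by two casts, so only the first report can block the agent's reporting loop.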

Changed in neutron:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to quantum (stable/grizzly)

Fix proposed to branch: stable/grizzly
Review: https://review.openstack.org/34979

OpenStack Infra (hudson-openstack) wrote : Fix merged to quantum (stable/grizzly)

Reviewed: https://review.openstack.org/34979
Committed: http://github.com/openstack/quantum/commit/06f679df5d025e657b2204151688ffa60c97a3d3
Submitter: Jenkins
Branch: stable/grizzly

commit 06f679df5d025e657b2204151688ffa60c97a3d3
Author: Gary Kotton <email address hidden>
Date: Mon Jun 17 11:37:11 2013 +0000

    Ensure that the report state is not a blocking call

    Fixes bug 1191768

    For the dhcp and l3 agents the first state report will be done
    via a call. If this succeeds then subsequent calls will be done via
    the cast method.

    Change-Id: I82a1d92fc84983b7bb46758db0aee3e3eca1d3be
    (cherry picked from commit 0407b59c259b2861b4541e31379e994ea8350b83)

tags: added: in-stable-grizzly
Thierry Carrez (ttx)
Changed in neutron:
milestone: none → havana-2
status: Fix Committed → Fix Released
Alan Pevec (apevec)
tags: removed: grizzly-backport-potential in-stable-grizzly
Changed in neutron:
importance: Undecided → Medium
Adam Gandelman (gandelman-a) wrote :

I still seem to be hitting this or something related on 2013.1.3 using kombu/rabbitmq. I end up with stale entries in the dhcp hosts file:

fa:16:3e:33:6b:f4,host-10-5-0-1.openstacklocal,10.5.0.1
fa:16:3e:41:79:76,host-10-5-0-2.openstacklocal,10.5.0.2
fa:16:3e:93:21:e3,host-10-5-0-3.openstacklocal,10.5.0.3
fa:16:3e:ae:46:62,host-10-5-0-4.openstacklocal,10.5.0.4
fa:16:3e:f8:44:82,host-10-5-0-10.openstacklocal,10.5.0.10
fa:16:3e:e7:d0:3b,host-10-5-0-25.openstacklocal,10.5.0.25
fa:16:3e:b0:f1:f4,host-10-5-0-28.openstacklocal,10.5.0.28
fa:16:3e:05:e5:55,host-10-5-0-4.openstacklocal,10.5.0.4

Restarting the DHCP agent re-renders the file correctly:

fa:16:3e:33:6b:f4,host-10-5-0-1.openstacklocal,10.5.0.1
fa:16:3e:41:79:76,host-10-5-0-2.openstacklocal,10.5.0.2
fa:16:3e:93:21:e3,host-10-5-0-3.openstacklocal,10.5.0.3
fa:16:3e:05:e5:55,host-10-5-0-4.openstacklocal,10.5.0.4

Thierry Carrez (ttx)
Changed in neutron:
milestone: havana-2 → 2013.2