Fullstack test neutron.tests.fullstack.test_dhcp_agent.TestDhcpAgentHA.test_reschedule_network_on_new_agent timeout

Bug #1799555 reported by Hongbin Lu on 2018-10-23
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
High
Rodolfo Alonso

Bug Description

Job: neutron-fullstack-*
Failed test: neutron.tests.fullstack.test_dhcp_agent.TestDhcpAgentHA.test_reschedule_network_on_new_agent(Linux bridge agent)
Sample failure: http://logs.openstack.org/88/555088/29/gate/neutron-fullstack-python36/34299ab/job-output.txt.gz

 {0} neutron.tests.fullstack.test_dhcp_agent.TestDhcpAgentHA.test_reschedule_network_on_new_agent(Linux bridge agent) [152.363616s] ... FAILED

 Captured traceback:
 ~~~~~~~~~~~~~~~~~~~
     b'Traceback (most recent call last):'
     b' File "/opt/stack/new/neutron/neutron/common/utils.py", line 641, in wait_until_true'
     b' eventlet.sleep(sleep)'
     b' File "/opt/stack/new/neutron/.tox/dsvm-fullstack-python35/lib/python3.5/site-packages/eventlet/greenthread.py", line 36, in sleep'
     b' hub.switch()'
     b' File "/opt/stack/new/neutron/.tox/dsvm-fullstack-python35/lib/python3.5/site-packages/eventlet/hubs/hub.py", line 297, in switch'
     b' return self.greenlet.switch()'
     b'eventlet.timeout.Timeout: 60 seconds'
     b''
     b'During handling of the above exception, another exception occurred:'
     b''
     b'Traceback (most recent call last):'
     b' File "/opt/stack/new/neutron/neutron/tests/base.py", line 151, in func'
     b' return f(self, *args, **kwargs)'
     b' File "/opt/stack/new/neutron/neutron/tests/fullstack/test_dhcp_agent.py", line 168, in test_reschedule_network_on_new_agent'
     b' self._wait_until_network_rescheduled(network_dhcp_agents[0])'
     b' File "/opt/stack/new/neutron/neutron/tests/fullstack/test_dhcp_agent.py", line 137, in _wait_until_network_rescheduled'
     b' common_utils.wait_until_true(_agent_rescheduled)'
     b' File "/opt/stack/new/neutron/neutron/common/utils.py", line 646, in wait_until_true'
     b' raise WaitTimeout(_("Timed out after %d seconds") % timeout)'
     b'neutron.common.utils.WaitTimeout: Timed out after 60 seconds'
     b''

Logstash: http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22test_reschedule_network_on_new_agent%5C%22%20

Hongbin Lu (hongbin.lu) on 2018-10-23
tags: added: gate-failure

Looks intermittent.

Changed in neutron:
status: New → Confirmed
importance: Undecided → Low
tags: added: fullstack
Changed in neutron:
importance: Medium → High
Changed in neutron:
assignee: nobody → Rodolfo Alonso (rodolfo-alonso-hernandez)

Change abandoned by Rodolfo Alonso Hernandez (<email address hidden>) on branch: master
Review: https://review.openstack.org/643079

Reviewed: https://review.opendev.org/660785
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b3404d900ef142637869e6e8999d29d712185e95
Submitter: Zuul
Branch: master

commit b3404d900ef142637869e6e8999d29d712185e95
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed May 22 16:47:17 2019 +0000

    Add debug information to AutoScheduler and BaseScheduler

    These new debug lines can be helpful to resolve the mentioned bug.
    Sometimes the DHCP agent does not reschedule and the log does not
    contain enough information to debug the problem.

    Spotted error during fullstack tests:
    Traceback (most recent call last):
    File "/opt/stack/new/neutron/neutron/tests/base.py", line 151, in func
      return f(self, *args, **kwargs)
    File "/opt/stack/new/neutron/neutron/tests/fullstack/test_dhcp_agent.py",
      line 168, in test_reschedule_network_on_new_agent
      self._wait_until_network_rescheduled(network_dhcp_agents[0])'
    File "/opt/stack/new/neutron/neutron/tests/fullstack/test_dhcp_agent.py",
      line 137, in _wait_until_network_rescheduled
      common_utils.wait_until_true(_agent_rescheduled)
    File "/opt/stack/new/neutron/neutron/common/utils.py", line 646,
      in wait_until_true
      raise WaitTimeout(_("Timed out after %d seconds") % timeout)
      neutron.common.utils.WaitTimeout: Timed out after 60 seconds

    Change-Id: I794e737c30f597519fba873e36f26b82b6f2c799
    Related-Bug: #1799555

Hello:

We have found that in some cases, the DHCP agent [1] does not report to the Neutron server and the heartbeat is not updated. In those cases, when the Neutron server wants to reschedule the network from the dead DHCP agent to the second one, this second agent is reported as dead too. When the agent starts to send again the sync status, the server binds the network to this new DHCP agent [2].

In order to log this information, I'll propose a patch to record when an agent is declared as "inactive" because of the not updated heartbeat and what "_get_network_hostable_dhcp_agents" returns.

[1] http://logs.openstack.org/11/660611/19/check/neutron-fullstack/7d4f6c4/controller/logs/dsvm-fullstack-logs/TestDhcpAgentHA.test_reschedule_network_on_new_agent_Linux-bridge-agent_/neutron-dhcp-agent--2019-06-07--00-47-54-171494_log.txt.gz#_2019-06-07_00_49_35_268
[2] http://logs.openstack.org/11/660611/19/check/neutron-fullstack/7d4f6c4/controller/logs/dsvm-fullstack-logs/TestDhcpAgentHA.test_reschedule_network_on_new_agent_Linux-bridge-agent_/neutron-server--2019-06-07--00-47-45-692437_log.txt.gz#_2019-06-07_00_49_35_699

Reviewed: https://review.opendev.org/663903
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d01a1deca9a9ccc8c396da6ca88839fa0862dd65
Submitter: Zuul
Branch: master

commit d01a1deca9a9ccc8c396da6ca88839fa0862dd65
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Fri Jun 7 11:18:57 2019 +0000

    Add log information in agent schedulers

    Added log information about the hostable DHCP agents per network.
    Added log information when an agent is declared "down" because the
    heartbeat is too old.

    Those log messages are included according to [1].

    [1] https://bugs.launchpad.net/neutron/+bug/1799555/comments/8

    Change-Id: I7b30499a86e5ae5de49813dfca01788fd5fce139
    Related-Bug: #1799555

Hello:

Again in [1] we can see the second DHCP is declared as dead, although this agent is still running in the server [2]. As we can see in [3][4], the agent had several hiccups. But I'm not sure if the hiccup happened in the DHCP agent or in the server. IMO, this happened in the server during the "report_state" process, maybe because the RPC call was attended too late. However, we can see the call was received in time [5].

Recommendations:
- Increase the "agent_down_time" parameter in those tests to 30 secs, to avoid problems with loaded servers during the test executions.
- Activate "log_agent_heartbeats" to see when the server updates the agent heartbeats.

[1] http://logs.openstack.org/41/663141/4/check/neutron-fullstack/2f1cda7/controller/logs/dsvm-fullstack-logs/TestDhcpAgentHA.test_reschedule_network_on_new_agent_Open-vSwitch-agent_/neutron-server--2019-06-14--09-16-46-288014_log.txt.gz#_2019-06-14_09_18_53_011
[2] http://logs.openstack.org/41/663141/4/check/neutron-fullstack/2f1cda7/controller/logs/dsvm-fullstack-logs/TestDhcpAgentHA.test_reschedule_network_on_new_agent_Open-vSwitch-agent_/neutron-dhcp-agent--2019-06-14--09-16-55-029864_log.txt.gz

[3] http://logs.openstack.org/41/663141/4/check/neutron-fullstack/2f1cda7/controller/logs/dsvm-fullstack-logs/TestDhcpAgentHA.test_reschedule_network_on_new_agent_Open-vSwitch-agent_/neutron-dhcp-agent--2019-06-14--09-16-55-029864_log.txt.gz#_2019-06-14_09_17_51_850
[4] http://logs.openstack.org/41/663141/4/check/neutron-fullstack/2f1cda7/controller/logs/dsvm-fullstack-logs/TestDhcpAgentHA.test_reschedule_network_on_new_agent_Open-vSwitch-agent_/neutron-dhcp-agent--2019-06-14--09-16-55-029864_log.txt.gz#_2019-06-14_09_18_42_472

[5] http://logs.openstack.org/41/663141/4/check/neutron-fullstack/2f1cda7/controller/logs/dsvm-fullstack-logs/TestDhcpAgentHA.test_reschedule_network_on_new_agent_Open-vSwitch-agent_/neutron-server--2019-06-14--09-16-46-288014_log.txt.gz#_2019-06-14_09_18_22_504

Related fix proposed to branch: master
Review: https://review.opendev.org/665426

Reviewed: https://review.opendev.org/665426
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b76321f1d8864b013d15b236b1868a21ffb8b875
Submitter: Zuul
Branch: master

commit b76321f1d8864b013d15b236b1868a21ffb8b875
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Fri Jun 14 16:15:16 2019 +0000

    Add agent timestamp to "_log_heartbeat" method

    When an agent reports the state, the timestamp is sent along with the
    agent status. This timestamp now is logged if "log_agent_heartbeats" is
    activated.

    Change-Id: Ifc88dfb3041aa07b197f395172b69399796ba46a
    Related-Bug: #1799555

tags: added: neutron-proactive-backport-potential
tags: added: neutron-easy-proactive-backport-potential

Reviewed: https://review.opendev.org/665422
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d7c5ae8a0378fa50853e91241426b6470ffe7492
Submitter: Zuul
Branch: master

commit d7c5ae8a0378fa50853e91241426b6470ffe7492
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Fri Jun 14 15:50:39 2019 +0000

    Increase TestDhcpAgentHA.agent_down_time to 30 seconds

    Currently there is a delay (around 20 seconds) between the the agent
    update call and the server reply, due to the testing servers load. This
    time should be higher than the agent-server communication delay but
    still short enough to detect when the DHCP agent is dead during the
    active wait during the DHCP agent network rescheduling.

    "log_agent_heartbeats" is activated to add information about when the
    server has processed the agent report status call. This log will allow
    to check the different between the server updating time and the previous
    agent heartbeat timestamp.

    Change-Id: Icf9a8802585c908fd4a70d0508139a81d5ac90ee
    Related-Bug: #1799555

Slawek Kaplonski (slaweq) wrote :

Looks like we don't have those issues since https://review.opendev.org/665422 was merged. I will mark this bug as fixed now.

Changed in neutron:
status: Confirmed → Fix Committed

Reviewed: https://review.opendev.org/680659
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=57725fc7d4513c37de597a4b86f8d6878e2be034
Submitter: Zuul
Branch: stable/queens

commit 57725fc7d4513c37de597a4b86f8d6878e2be034
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Fri Jun 14 15:50:39 2019 +0000

    Increase TestDhcpAgentHA.agent_down_time to 30 seconds

    Currently there is a delay (around 20 seconds) between the the agent
    update call and the server reply, due to the testing servers load. This
    time should be higher than the agent-server communication delay but
    still short enough to detect when the DHCP agent is dead during the
    active wait during the DHCP agent network rescheduling.

    "log_agent_heartbeats" is activated to add information about when the
    server has processed the agent report status call. This log will allow
    to check the different between the server updating time and the previous
    agent heartbeat timestamp.

    Conflicts:
        neutron/tests/fullstack/resources/config.py

    Change-Id: Icf9a8802585c908fd4a70d0508139a81d5ac90ee
    Related-Bug: #1799555
    (cherry picked from commit d7c5ae8a0378fa50853e91241426b6470ffe7492)

tags: added: in-stable-queens

Reviewed: https://review.opendev.org/680658
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=68077a4cc9af638256a272cda42b44346d90589d
Submitter: Zuul
Branch: stable/rocky

commit 68077a4cc9af638256a272cda42b44346d90589d
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Fri Jun 14 15:50:39 2019 +0000

    Increase TestDhcpAgentHA.agent_down_time to 30 seconds

    Currently there is a delay (around 20 seconds) between the the agent
    update call and the server reply, due to the testing servers load. This
    time should be higher than the agent-server communication delay but
    still short enough to detect when the DHCP agent is dead during the
    active wait during the DHCP agent network rescheduling.

    "log_agent_heartbeats" is activated to add information about when the
    server has processed the agent report status call. This log will allow
    to check the different between the server updating time and the previous
    agent heartbeat timestamp.

    Conflicts:
        neutron/tests/fullstack/resources/config.py

    Change-Id: Icf9a8802585c908fd4a70d0508139a81d5ac90ee
    Related-Bug: #1799555
    (cherry picked from commit d7c5ae8a0378fa50853e91241426b6470ffe7492)

tags: added: in-stable-rocky

Reviewed: https://review.opendev.org/680657
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=5b07cb21e3d8abe15c6322616b0b8976b54aef8d
Submitter: Zuul
Branch: stable/stein

commit 5b07cb21e3d8abe15c6322616b0b8976b54aef8d
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Fri Jun 14 15:50:39 2019 +0000

    Increase TestDhcpAgentHA.agent_down_time to 30 seconds

    Currently there is a delay (around 20 seconds) between the the agent
    update call and the server reply, due to the testing servers load. This
    time should be higher than the agent-server communication delay but
    still short enough to detect when the DHCP agent is dead during the
    active wait during the DHCP agent network rescheduling.

    "log_agent_heartbeats" is activated to add information about when the
    server has processed the agent report status call. This log will allow
    to check the different between the server updating time and the previous
    agent heartbeat timestamp.

    Conflicts:
        neutron/tests/fullstack/resources/config.py

    Change-Id: Icf9a8802585c908fd4a70d0508139a81d5ac90ee
    Related-Bug: #1799555
    (cherry picked from commit d7c5ae8a0378fa50853e91241426b6470ffe7492)

tags: added: in-stable-stein
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers