Pacemaker turns DHCP agent resource into unmanaged state

Bug #1436414 reported by Ilya Shakhat on 2015-03-25
This bug affects 2 people
Affects             Importance  Assigned to
Fuel for OpenStack  High        Ilya Shakhat
6.0.x               High        Alexander Nevenchannyy
7.0.x               High        Ilya Shakhat

Bug Description

Symptoms:
 * Pacemaker reports that resource p_neutron-dhcp-agent is in the unmanaged state
 * "neutron agent-list | grep DHCP" shows that the DHCP agents are dead on all 3 controllers
 * the last message in the DHCP agent logs is "Caught SIGTERM, exiting"
 * there are no agent processes on the controllers

Environment:
 * MOS 6.1 build 192
 * Ubuntu + Neutron GRE
 * OpenStack contains more than 2000 networks

Steps to reproduce hanging q-agent-cleanup:
 * Create 2k namespaces:
   for i in {1000..2999}; do ip netns add "qdhcp-6a6302b1-37a7-4142-a11d-5be2778f$i" ; done
 * Run q-agent-cleanup:
   q-agent-cleanup.py -a dhcp --cleanup-ports --noop
The process hangs; the strace output is in comment 5: https://bugs.launchpad.net/fuel/+bug/1436414/comments/5

Ilya Shakhat (shakhat) wrote :

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "6.1"
  api: "1.0"
  build_number: "192"
  build_id: "2015-03-12_22-54-44"
  nailgun_sha: "c186f71158ed27b03d8db87561ea66c19e39b452"
  python-fuelclient_sha: "59513d6b75f86060ff5059f39fdd9cca56c83f19"
  astute_sha: "ed76b0cacf34a4a683b464ebd86e0beb273b5473"
  fuellib_sha: "fda8128b9ca7a8ce818421040f597a50eece8078"
  ostf_sha: "ecb8e294b0acbdc5b0300d5e39028fb26ecc9088"
  fuelmain_sha: "3764b8a73b3a93fd7ee66937ba4c4c77da409b78"

Ilya Shakhat (shakhat) wrote :

Logs (note that the Fuel master is in GMT, the nodes are in PDT):

neutron-dhcp-agent (node-14)
------------------------------------------
2015-03-25 03:47:47.461 16245 INFO neutron.openstack.common.service [req-1b24b7b7-de3c-4244-ae0e-1e75ed824399 None] Caught SIGTERM, exiting

neutron-dhcp-agent (node-9)
-----------------------------------------
2015-03-25T10:47:47.467007+00:00 info: 2015-03-25 03:47:47.466 16290 INFO neutron.openstack.common.service [req-62050a3f-f4bf-4cee-99ca-aae8f7e99b6c None] Caught SIGTERM, exiting

ocf.log (node-14)
-----------------------
2015-03-25T10:47:50.469809+00:00 err: ERROR: /usr/lib/ocf/lib/heartbeat/ocf-shellfuncs: line 379: kill: (16245) - No such process
2015-03-25T10:47:50.478260+00:00 err: ERROR: /usr/lib/ocf/lib/heartbeat/ocf-shellfuncs: line 379: kill: (5423) - No such process
2015-03-25T10:47:50.483580+00:00 info: INFO: OpenStack DHCP Agent (neutron-dhcp-agent) stopped

pacemaker.log (node-14)
---------------------------------
<28>Mar 25 03:48:47 node-14 lrmd[6974]: warning: child_timeout_callback: p_neutron-dhcp-agent_stop_0 process (PID 19318) timed out
<28>Mar 25 03:48:47 node-14 lrmd[6974]: warning: operation_finished: p_neutron-dhcp-agent_stop_0:19318 - timed out after 60000ms
<27>Mar 25 03:48:47 node-14 crmd[6977]: error: process_lrm_event: LRM operation p_neutron-dhcp-agent_stop_0 (407) Timed Out (timeout=60000ms)
<29>Mar 25 03:48:47 node-14 crmd[6977]: notice: process_lrm_event: node-14.domain.tld-p_neutron-dhcp-agent_stop_0:407 [ 2015-03-25 03:47:50,827 - INFO - Started: /usr/bin/q-agent-cleanup.py --agent=dhcp
--cleanup-ports\n ]
<29>Mar 25 03:48:47 node-14 attrd[6975]: notice: attrd_cs_dispatch: Update relayed from node-7.domain.tld
<29>Mar 25 03:48:47 node-14 attrd[6975]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-p_neutron-dhcp-agent (INFINITY)
<29>Mar 25 03:48:47 node-14 attrd[6975]: notice: attrd_perform_update: Sent update 2845: fail-count-p_neutron-dhcp-agent=INFINITY
<29>Mar 25 03:48:47 node-14 attrd[6975]: notice: attrd_cs_dispatch: Update relayed from node-7.domain.tld
<29>Mar 25 03:48:47 node-14 attrd[6975]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-p_neutron-dhcp-agent (1427280528)
<29>Mar 25 03:48:47 node-14 attrd[6975]: notice: attrd_perform_update: Sent update 2848: last-failure-p_neutron-dhcp-agent=1427280528
<29>Mar 25 03:48:47 node-14 attrd[6975]: notice: attrd_cs_dispatch: Update relayed from node-7.domain.tld
<29>Mar 25 03:48:47 node-14 attrd[6975]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-p_neutron-dhcp-agent (INFINITY)
<29>Mar 25 03:48:47 node-14 attrd[6975]: notice: attrd_perform_update: Sent update 2851: fail-count-p_neutron-dhcp-agent=INFINITY
<29>Mar 25 03:48:47 node-14 attrd[6975]: notice: attrd_cs_dispatch: Update relayed from node-7.domain.tld
<29>Mar 25 03:48:47 node-14 attrd[6975]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-p_neutron-dhcp-agent (1427280528)
<29>Mar 25 03:48:47 node-14 attrd[6975]: notice: attrd_perform_update: Sent update 2854: last-failure-p_neutron-dhcp-agent=1427280528
<29>Mar 25 03...


Ilya Shakhat (shakhat) wrote :

Reconstruction of workflow sequence:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

03:47:47 - 03:47:50
--------------------------
The OCF script /usr/lib/ocf/resource.d/fuel/ocf-neutron-dhcp-agent runs its "neutron_dhcp_agent_stop" function (as shown by the message "OpenStack DHCP Agent (neutron-dhcp-agent) stopped", https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/cluster/files/ocf/ocf-neutron-dhcp-agent#L633)

During this time the function kills all processes (https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/cluster/files/ocf/ocf-neutron-dhcp-agent#L599) by sending SIGTERM, which shows up as the SIGTERM message in the dhcp-agent log. At least 2 processes are not killed in time, so the script tries to kill them again 3 seconds later. By then they are gone: kill reports "No such process" for both PIDs.

03:47:50
------------
The OCF script logs that the agent is stopped (https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/cluster/files/ocf/ocf-neutron-dhcp-agent#L633)

03:48:47
-----------
Pacemaker reports that the stop operation timed out after 60 seconds while q-agent-cleanup.py was still running:
<27>Mar 25 03:48:47 node-14 crmd[6977]: error: process_lrm_event: LRM operation p_neutron-dhcp-agent_stop_0 (407) Timed Out (timeout=60000ms)
<29>Mar 25 03:48:47 node-14 crmd[6977]: notice: process_lrm_event: node-14.domain.tld-p_neutron-dhcp-agent_stop_0:407 [ 2015-03-25 03:47:50,827 - INFO - Started: /usr/bin/q-agent-cleanup.py --agent=dhcp
--cleanup-ports\n ]
This means that the OCF script is at https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/cluster/files/ocf/ocf-neutron-dhcp-agent#L639 and the function has not finished.
As a result, Pacemaker considers the stop operation hung and marks the resource as unmanaged. The same happens on all 3 controllers, leaving OpenStack without any DHCP agents.

q-agent-cleanup is slow when the number of namespaces is high. For example, __collect_ports_for_namespace is called for every namespace. Even under low load, a call to "ip netns exec <namespace> ip l show" takes about 0.1 seconds, which adds up to 200 seconds for 2k namespaces, more than 3 times the 60-second stop timeout in Pacemaker.
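
The estimate above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the 0.1-second cost per call and the 2000 namespaces come from this comment, the 60-second timeout from pacemaker.log, and the function name is made up for illustration.

```python
# Rough lower bound for the cleanup duration: one
# "ip netns exec <namespace> ip l show" call per namespace.
STOP_TIMEOUT_SECONDS = 60  # "timeout=60000ms" in pacemaker.log


def estimated_cleanup_seconds(namespaces, seconds_per_call=0.1):
    """Return the time spent only on the per-namespace calls (hypothetical helper)."""
    return namespaces * seconds_per_call


estimate = estimated_cleanup_seconds(2000)
ratio = estimate / STOP_TIMEOUT_SECONDS
print("estimated cleanup: %.0f s, stop timeout: %d s, ratio: %.1f"
      % (estimate, STOP_TIMEOUT_SECONDS, ratio))
```

Any other work done per namespace (port removal, process termination) only adds to this figure.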

tags: added: scale
Changed in fuel:
milestone: none → 6.1
importance: Undecided → High
Dmitry Ilyin (idv1985) wrote :

As far as I remember this cleanup script was forking and exiting instantly. Was it changed?
It's possible to increase timeout also...

Ilya Shakhat (shakhat) wrote :

Some more observations from a repro:

root@node-7:~# strace -p 7257
Process 7257 attached
wait4(7264, ^CProcess 7257 detached

PID 7257 is q-agent-cleanup.
It waits for process 7264, which is ip netns list, which in turn is blocked writing to stdout:

root@node-7:~# strace -p 7264
Process 7264 attached
write(1, "p-81026552-158c-4343-b6e0-c541d5"..., 4096

Ilya Shakhat (shakhat) wrote :

q-agent-cleanup uses the following code to call processes:
    process = subprocess.Popen(
        cmd,
        shell=False,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE
    )
    rc = process.wait()
(https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/cluster/files/q-agent-cleanup.py#L267)

According to documentation (https://docs.python.org/2/library/subprocess.html#subprocess.Popen.wait):
Popen.wait()
    Wait for child process to terminate. Set and return returncode attribute.

    Warning! This will deadlock when using stdout=PIPE and/or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use communicate() to avoid that.
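
A minimal sketch of the communicate()-based alternative that the documentation recommends. The helper name run_command and the test child process are illustrative only; this is not the code from the merged patch.

```python
import subprocess
import sys


def run_command(cmd):
    """Run cmd without a shell and return (returncode, stdout, stderr)."""
    process = subprocess.Popen(
        cmd,
        shell=False,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # communicate() drains both pipes until EOF and then waits for the
    # child, so the child never blocks on a full pipe buffer.
    stdout, stderr = process.communicate()
    return process.returncode, stdout, stderr


# The child writes about 200 kB to stdout, far more than a pipe buffer holds.
child = [sys.executable, "-c", "import sys; sys.stdout.write('x' * 200000)"]
rc, out, err = run_command(child)
print(rc, len(out))
```

With wait() in place of communicate(), the same child would block in write() once the pipe filled up, which is the hang seen in strace.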

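From the strace output (https://bugs.launchpad.net/fuel/+bug/1436414/comments/5) the child is blocked in a 4096-byte write() to its stdout pipe. The call to wait() hangs as soon as ip netns list outputs more than the pipe buffer can hold. Each namespace name takes 43 bytes, so a 4096-byte buffer would allow only 4096 / 43 = 95 namespaces (networks); with the 64 KiB pipe capacity that is the usual Linux default, the limit is about 65536 / 43 = 1524, which is still below the 2000 namespaces in this environment.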
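
The arithmetic can be checked with a short sketch. The 43 bytes per line follow from the namespace names used in the reproduction steps; the 4096-byte figure is the one taken from strace in this comment, and 65536 bytes is assumed here as the typical default pipe capacity on Linux. The helper names are made up for illustration.

```python
# One line of "ip netns list" output: "qdhcp-" + 36-character UUID + newline.
SAMPLE_LINE = "qdhcp-6a6302b1-37a7-4142-a11d-5be2778f1000\n"
BYTES_PER_LINE = len(SAMPLE_LINE)


def max_namespaces_before_block(pipe_capacity_bytes):
    """Number of names that fit into the pipe before the child blocks."""
    return pipe_capacity_bytes // BYTES_PER_LINE


def listing_size_bytes(namespaces):
    """Total size of the listing for the given number of namespaces."""
    return namespaces * BYTES_PER_LINE


for capacity in (4096, 65536):
    print(capacity, "bytes ->", max_namespaces_before_block(capacity), "namespaces")
print("2000 namespaces ->", listing_size_bytes(2000), "bytes of output")
```

Under either assumption the listing for 2000 namespaces does not fit into the pipe, so wait() never returns.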

tags: added: ha
Ilya Shakhat (shakhat) on 2015-03-26
description: updated
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
status: New → Confirmed
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Sergey Vasilenko (xenolog)

Fix proposed to branch: master
Review: https://review.openstack.org/168315

Changed in fuel:
assignee: Sergey Vasilenko (xenolog) → Ilya Shakhat (shakhat)
status: Confirmed → In Progress
Ilya Shakhat (shakhat) wrote :

@Vladimir -- since Sergey is on PTO I'm pushing the first part of the fix for review.

Changed in fuel:
assignee: Ilya Shakhat (shakhat) → Fuel Library Team (fuel-library)
assignee: Fuel Library Team (fuel-library) → Sergey Vasilenko (xenolog)
Vladimir Kuklin (vkuklin) wrote :

@Ilya

Is this fix enough to close the bug? Could you please recheck it with our scale lab?

Ilya Shakhat (shakhat) wrote :

@Vladimir

No, this fix is not enough. It resolves the hang of q-agent-cleanup (described in https://bugs.launchpad.net/fuel/+bug/1436414/comments/6), but the issue of exceeding the time limit remains.

Changed in fuel:
assignee: Sergey Vasilenko (xenolog) → Sergey Kolekonov (skolekonov)

Fix proposed to branch: master
Review: https://review.openstack.org/172390

Changed in fuel:
assignee: Sergey Kolekonov (skolekonov) → Vladimir Kuklin (vkuklin)
Changed in fuel:
assignee: Vladimir Kuklin (vkuklin) → Sergey Kolekonov (skolekonov)
tags: added: neutron
Bogdan Dobrelya (bogdando) wrote :

Raised to Critical, as unmanaged Pacemaker resources badly impact large deployments

Changed in fuel:
importance: High → Critical

Reviewed: https://review.openstack.org/172390
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=d6cff81273fe05e59ac0b0a4c7f94cb56d648f86
Submitter: Jenkins
Branch: master

commit d6cff81273fe05e59ac0b0a4c7f94cb56d648f86
Author: Sergey Kolekonov <email address hidden>
Date: Fri Apr 10 15:07:56 2015 +0300

    Disable resources cleanup for DHCP agent by default

    Now all resources created by DHCP agent (namespaces, processes) are removed by
    Pacemaker on agent start and stop. This operation can be rather slow when many
    networks exist in a cloud, and timeout will turn the resource to unmanaged
    state. To avoid such cases resources are kept by default.
    It also helps to avoid service interruption because all resource have to be
    re-created by the agent in case of restart.

    Change-Id: Ia6cdc3b86b162f4614a8ed93838d3e09d9126ffc
    Closes-bug: #1436414

Changed in fuel:
status: In Progress → Fix Committed

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/6.0
Review: https://review.openstack.org/174235
Reason: the main patch was merged w/o review process finished, hence will be reverted

Vladimir Kuklin (vkuklin) wrote :

Reopened the bug to discuss how to really address q-agent cleanup functionality

Decision is the following:

1) Introduce ability to reload agents without doing a cleanup
2) Optimize namespaces cleanup process using python-pyroute2 library

Changed in fuel:
status: Fix Committed → Triaged
Bogdan Dobrelya (bogdando) wrote :

If we have agreed on the next steps, could we then revert the merged patch? (https://review.openstack.org/#/c/174239/ )

Changed in fuel:
status: Triaged → In Progress

Reviewed: https://review.openstack.org/168315
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=ea5967a0e1fee5c0b29b4864be4fe27ef24471ae
Submitter: Jenkins
Branch: master

commit ea5967a0e1fee5c0b29b4864be4fe27ef24471ae
Author: Ilya Shakhat <email address hidden>
Date: Fri Mar 27 15:17:30 2015 +0300

    Fix the way q-agent-cleanup executes shell processes

    Shell commands are executed via subprocess.Popen(). Currently the
    status of process is tracked by wait() method which results in deadlocks
    when output is large enough (tens of kB). The correct way is to use
    communicate() method.

    Partial-bug 1436414

    Change-Id: Ibbbfa3a5331f865e48160c37e9ba558af19dc680

Reviewed: https://review.openstack.org/177595
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=d704db0b82c89549db36a78f6a9f6312b3e5fe73
Submitter: Jenkins
Branch: master

commit d704db0b82c89549db36a78f6a9f6312b3e5fe73
Author: Sergey Kolekonov <email address hidden>
Date: Sun Apr 26 15:24:21 2015 +0300

    Implement reload operation for Neutron DHCP agent

    - added reload operation to Neutron DHCP agent resource
    - enable artifacts cleanup on agent stop and start by default because
      fast restart is done using reload operation
    - added dummy parameter to resource because it's needed for reload operation
      to work (see LP bug #1448160)

    Related-bug: #1436414
    Change-Id: I0dc9f9dd21ec938912a8b588d71da93397c73597

tags: added: release-notes

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/174239

Bogdan Dobrelya (bogdando) wrote :

Please update the bug status: is it Fix Committed now or still In Progress?

Alexander Ignatov (aignatov) wrote :

Marking #1436414 as a Known Issue and Won't Fix for 6.1. It's too late to make any changes around q_agent_cleanup before the release.

Sergey Kolekonov has already made many changes that let customers turn off cleanup if needed, and also added fast agent reload as part of #1448160.
All of this will be described in detail in the 6.1 release notes.

We believe that the #1436414 issue will go away when we move resource cleanup to Neutron core.

Changed in fuel:
status: In Progress → Won't Fix
importance: Critical → High
Alexander Ignatov (aignatov) wrote :

Also decreased priority from Critical to High since there are several possible workarounds

Sergey Kolekonov (skolekonov) wrote :

Release notes:

Pacemaker can put the DHCP agent resource into the unmanaged state when there is a large number of networks.
Pacemaker monitors DHCP agents on all controller nodes and restarts them if they appear to be dead.
By default Pacemaker tries to clean up all artefacts created by the agent (namespaces, ports, processes). With a large number of
networks this procedure can take too long to finish, and the resource is then marked as unmanaged.
In that case the resource should be cleaned up by executing the following command: pcs resource cleanup p_neutron-dhcp-agent.

To prevent such cases, cleanup on start/stop of the Neutron DHCP agent resource can be disabled by executing the following command: pcs resource update p_neutron-dhcp-agent remove_artifacts_on_stop_start=false --force. Then disable and enable the resource to apply the change.

If the resource needs to be restarted without removing any artefacts (for example, to apply configuration changes), it supports the reload operation. A reload is triggered by changing one of the resource parameters.
For example, execute pcs resource update p_neutron-dhcp-agent debug=true. The resource should be reloaded. If it is restarted instead of reloaded, change the parameter again, because Pacemaker may not pick up the change on the first attempt.

tags: added: release-notes-done
removed: release-notes

Reviewed: https://review.openstack.org/192128
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=01d3498ab1cb26cee606ecada8906f218d838c26
Submitter: Jenkins
Branch: stable/6.0

commit 01d3498ab1cb26cee606ecada8906f218d838c26
Author: Ilya Shakhat <email address hidden>
Date: Fri Mar 27 15:17:30 2015 +0300

    Fix the way q-agent-cleanup executes shell processes

    Shell commands are executed via subprocess.Popen(). Currently the
    status of process is tracked by wait() method which results in deadlocks
    when output is large enough (tens of kB). The correct way is to use
    communicate() method.

    Partial-bug 1436414

    Change-Id: Ibbbfa3a5331f865e48160c37e9ba558af19dc680

OSCI Robot (oscirobot) wrote :

Changeset merged. Package placed on primary repository.
RPM package fuel-library6.0 has been built for project stackforge/fuel-library.
Files placed in repository:
fuel-ha-utils6.0-6.0.0-6200.1.noarch.rpm
fuel-library6.0-6.0.0-6200.1.noarch.rpm
Repository URL: http://osci-obs.vm.mirantis.net:82/centos-fuel-6.0-updates-stable/centos .

OSCI Robot (oscirobot) wrote :

Changeset merged. Package placed on primary repository.
DEB package fuel-library has been built for project stackforge/fuel-library.
Files placed in repository:
fuel-ha-utils6.0_6.0.0-6200.1_all.deb
fuel-library6.0_6.0.0-6200.1_all.deb
Repository URL: http://osci-obs.vm.mirantis.net:82/ubuntu-fuel-6.0-updates-stable/ubuntu .

Related fix proposed to branch: openstack-ci/fuel-7.0/2015.1.0
Change author: Ilya Shakhat <email address hidden>
Review: https://review.fuel-infra.org/9860

Related fix proposed to branch: master
Change author: Sergey Kolekonov <email address hidden>
Review: https://review.fuel-infra.org/9892

Reviewed: https://review.fuel-infra.org/9892
Submitter: Andrey Nikitin <email address hidden>
Branch: master

Commit: d4bc31bdec23f3e2ebe0c3e1edc0aae31dce4b58
Author: Sergey Kolekonov <email address hidden>
Date: Tue Jul 28 09:59:19 2015

Add pyroute2 package for Trusty

Neutron requires pyroute2 library to operate with namespaces during cleanup
operations.

Change-Id: I62a0ad588feae73860fc7d8bea876d76a1c2d6dc
Related-bug: #1436414

Related fix proposed to branch: 7.0
Change author: Sergey Kolekonov <email address hidden>
Review: https://review.fuel-infra.org/9949

Related fix proposed to branch: openstack-ci/fuel-7.0/2015.1.0
Change author: Sergey Kolekonov <email address hidden>
Review: https://review.fuel-infra.org/9950

Reviewed: https://review.fuel-infra.org/9950
Submitter: Dmitry Teselkin <email address hidden>
Branch: openstack-ci/fuel-7.0/2015.1.0

Commit: f28290bca9c3aa72756f3dad6d2192784a2bef43
Author: Sergey Kolekonov <email address hidden>
Date: Wed Jul 29 12:32:42 2015

Add pyroute2 to dependencies for neutron-common

Add pyroute2 to dependencies for neutron-common as it's required to operate
with namespaces during cleanup operations

Change-Id: I6d368504ab39c6133ec4b014dd9efb882d087bf3
Related-bug: #1436414

Reviewed: https://review.fuel-infra.org/9949
Submitter: Dmitry Teselkin <email address hidden>
Branch: 7.0

Commit: 77ffb1499ed615fde5ce775dc58499ac130dde3c
Author: Sergey Kolekonov <email address hidden>
Date: Wed Jul 29 12:24:47 2015

Add pyroute2 for Trusty

- Neutron requires pyroute2 library to operate with namespaces during cleanup
  operations
- Source code and specification are backported from Debian
  https://packages.debian.org/stretch/python-pyroute2

Change-Id: I46649a21a9a76d34c404ebc10a8087ac54815d49
Related-bug: #1436414

Reviewed: https://review.fuel-infra.org/9860
Submitter: mos-infra-ci <>
Branch: openstack-ci/fuel-7.0/2015.1.0

Commit: 3b15c9623b92057d2280114bbdb74463d4df0499
Author: Ilya Shakhat <email address hidden>
Date: Tue Aug 4 11:51:51 2015

Implement faster version of neutron-netns-cleanup

Faster version is enabled if neutron-netns-cleanup is started by
privileged user. It uses pyroute2 library for operations with namespaces.

Change-Id: I1080a1baba9cb5ca58c61aefcfbc1dece726d15f
Related-Bug: #1436414
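
To illustrate why working with namespaces from inside the process is faster than forking "ip netns" for each step, here is a minimal sketch that enumerates namespaces without starting any child process. It does not use pyroute2 and is not the code from this change; it assumes that named namespaces appear as entries under /var/run/netns, which is where "ip netns add" places them, and the function name is made up for illustration.

```python
import os


def list_namespaces(netns_dir="/var/run/netns", prefix="qdhcp-"):
    """Return the sorted names under netns_dir that start with prefix."""
    try:
        names = os.listdir(netns_dir)
    except OSError:
        # The directory does not exist until the first namespace is created.
        return []
    return sorted(name for name in names if name.startswith(prefix))


print(len(list_namespaces()), "qdhcp namespaces found")
```

A single directory listing replaces the "ip netns list" child process, so there is no pipe to fill and no per-call fork overhead.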

Changed in fuel:
assignee: Sergey Kolekonov (skolekonov) → Ilya Shakhat (shakhat)

Reviewed: https://review.openstack.org/204075
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=14240365c777a26bb467c14e322081e8864cc84f
Submitter: Jenkins
Branch: master

commit 14240365c777a26bb467c14e322081e8864cc84f
Author: Ilya Shakhat <email address hidden>
Date: Tue Jul 21 16:00:37 2015 +0300

    Use neutron-netns-cleanup utility

    Utility neutron-netns-cleanup is extended to cover cleaning features
    of both OCF and q-agent-cleanup. It is responsible for:
     * terminating in-namespace processes
     * removing net interfaces and OVS ports
     * removing namespaces
    New utility uses pyroute2 library and at least 3 times faster than
    the old solution.

    Closes-Bug: #1436414
    Closes-Bug: #1434196

    Change-Id: Id8f12e7a342ccad3c3362d70159101bb95d918fd

Changed in fuel:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/213097
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=1314487097880756a9dc34e2b674517ac3803613
Submitter: Jenkins
Branch: master

commit 1314487097880756a9dc34e2b674517ac3803613
Author: Sergey Kolekonov <email address hidden>
Date: Fri Aug 14 14:13:30 2015 +0300

    Add reload operation for Neutron L3 agent resource

    - implement reload operation for L3 agent resource
    - remove old unused code
    - update DHCP agent resource to keep consistency

    Change-Id: I8a586ce41ddf37295f939d5ee54fe2c437cc82c6
    Related-bug: #1436414
    Closes-bug: #1464817

Related fix proposed to branch: 8.0
Change author: Sergey Kolekonov <email address hidden>
Review: https://review.fuel-infra.org/11165

Related fix proposed to branch: master
Change author: Sergey Kolekonov <email address hidden>
Review: https://review.fuel-infra.org/11187

Reviewed: https://review.fuel-infra.org/11165
Submitter: Igor Yozhikov <email address hidden>
Branch: 8.0

Commit: 0bc64b6a5856003c378c1447d8dabd8dbca3f2c0
Author: Sergey Kolekonov <email address hidden>
Date: Fri Sep 4 08:45:21 2015

Add pyroute2 for MOS 8.0

- Neutron requires pyroute2 library to operate with namespaces during cleanup
  operations
- cherry-picked from 7.0 branch 77ffb1499ed615fde5ce775dc58499ac130dde3c

Change-Id: I46649a21a9a76d34c404ebc10a8087ac54815d49
Related-bug: #1436414

Leontiy Istomin (listomin) wrote :

hasn't been reproduced with at least 7.0-293 build on scale lab with 200 nodes

Leontiy Istomin (listomin) wrote :

I've tried to reproduce the issue using the following steps:
root@node-6:~# ip netns | grep -c qdhcp
3
 * Create 2k namespaces:
   for i in {1000..2999}; do ip netns add "qdhcp-6a6302b1-37a7-4142-a11d-5be2778f$i" ; done
 * Run neutron-netns-cleanup:
   neutron-netns-cleanup --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/dhcp_agent.ini
after command execution:
root@node-6:~# ip netns | grep -c qdhcp
2003
from /var/log/neutron/neutron-netns-cleanup.log on node-6: http://paste.openstack.org/show/474018/

Leontiy Istomin (listomin) wrote :

https://bugs.launchpad.net/mos/+bug/1499729 prevents checking this issue on 7.0 using these steps:
root@node-6:~# ip netns | grep -c qdhcp
3
 * Create 2k namespaces:
   for i in {1000..2999}; do ip netns add "qdhcp-6a6302b1-37a7-4142-a11d-5be2778f$i" ; done
 * Run neutron-netns-cleanup:
   neutron-netns-cleanup --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/dhcp_agent.ini

tags: added: on-verification
tags: removed: on-verification
tags: added: on-verification
tags: removed: on-verification
tags: added: on-verification
tags: removed: on-verification

Added to 7.0 MU1, since it needs QA verification.

tags: added: 70mu1-confirmed
tags: added: on-verification

removed on-verification because steps to verify are blocked by another bug
https://bugs.launchpad.net/mos/+bug/1499729

tags: removed: on-verification

Related fix proposed to branch: openstack-ci/fuel-8.0/liberty
Change author: Ilya Shakhat <email address hidden>
Review: https://review.fuel-infra.org/13298

Related fix proposed to branch: stable/liberty
Change author: Ilya Shakhat <email address hidden>
Review: https://review.fuel-infra.org/13426

Change abandoned by Ann Kamyshnikova <email address hidden> on branch: openstack-ci/fuel-8.0/liberty
Review: https://review.fuel-infra.org/13298
Reason: replaced with https://review.fuel-infra.org/#/c/13426/

Change restored by Ann Kamyshnikova <email address hidden> on branch: openstack-ci/fuel-8.0/liberty
Review: https://review.fuel-infra.org/13298

Change abandoned by Ann Kamyshnikova <email address hidden> on branch: stable/liberty
Review: https://review.fuel-infra.org/13426

Reviewed: https://review.fuel-infra.org/13298
Submitter: Pkgs Jenkins <email address hidden>
Branch: openstack-ci/fuel-8.0/liberty

Commit: 80d67be0944a54e59e45d92b0cb90cb2b4a92753
Author: Ilya Shakhat <email address hidden>
Date: Mon Nov 2 10:56:14 2015

Implement faster version of neutron-netns-cleanup

Faster version is enabled if neutron-netns-cleanup is started by
privileged user. It uses pyroute2 library for operations with namespaces.

Change-Id: I1080a1baba9cb5ca58c61aefcfbc1dece726d15f
Related-Bug: #1436414

Change abandoned by Sergey Kolekonov <email address hidden> on branch: master
Review: https://review.fuel-infra.org/11187

Related fix proposed to branch: 9.0/mitaka
Change author: Ilya Shakhat <email address hidden>
Review: https://review.fuel-infra.org/18404

Change abandoned by Sergey Belous <email address hidden> on branch: stable/mitaka
Review: https://review.fuel-infra.org/18770
Reason: Ops. Wrong branch

Related fix proposed to branch: 9.0/mitaka
Change author: Sergey Belous <email address hidden>
Review: https://review.fuel-infra.org/18773

Change abandoned by Sergey Belous <email address hidden> on branch: 9.0/mitaka
Review: https://review.fuel-infra.org/18404
Reason: Abandoned because this change included into https://review.fuel-infra.org/#/c/18773/

Related fix proposed to branch: stable/mitaka
Change author: Sergey Belous <email address hidden>
Review: https://review.fuel-infra.org/19166

Change abandoned by Ann Kamyshnikova <email address hidden> on branch: stable/mitaka
Review: https://review.fuel-infra.org/19166

Reviewed: https://review.fuel-infra.org/18773
Submitter: Pkgs Jenkins <email address hidden>
Branch: 9.0/mitaka

Commit: 1bacd47ed29fd621430f23f89f94c68e2cbd2543
Author: Sergey Belous <email address hidden>
Date: Thu Apr 7 10:18:06 2016

Neutron netns_cleanup improvements

Implement faster version of neutron-netns-cleanup

Faster version is enabled if neutron-netns-cleanup is started by
privileged user. It uses pyroute2 library for operations with namespaces.
Related-Bug: #1436414
=========================================================================

Improve performance of net-ns cleanup.

There is a couple of things that may increase performance:
1. Using python ovs idl library to connect to ovsdb server
directly rather than using ovs-vsctl.
That also allows transactional execution of ovs commands such
as port deletion.
2. Parallel execution of netns operations.
Candidates for deletion are split to chunks each processed
by a separate worker. That helps to avoid contention working
with pyroute2 library.
Closes-Bug: #1522432

Change-Id: I96c2f3216a64ef8fdc1df52e681399849d59de25
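
The chunking idea from point 2 of the commit message above can be sketched as follows. This is an illustration under assumptions, not the merged code: the function names are hypothetical, threads are used as workers only for brevity (the commit message does not say which kind of worker it uses), and delete_namespace stands for whatever routine removes a single namespace.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, workers):
    """Split items into at most `workers` contiguous chunks of similar size."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    items = list(items)
    if not items:
        return []
    size = -(-len(items) // workers)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def cleanup_chunk(chunk, delete_namespace):
    """Delete every namespace in one chunk and return how many were handled."""
    for name in chunk:
        delete_namespace(name)
    return len(chunk)


def parallel_cleanup(names, delete_namespace, workers=4):
    """Process each chunk in its own worker and return the total count."""
    chunks = split_into_chunks(names, workers)
    if not chunks:
        return 0
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        counts = pool.map(lambda c: cleanup_chunk(c, delete_namespace), chunks)
        return sum(counts)


# Usage with a stand-in delete function that only records the names.
removed = []
total = parallel_cleanup(["qdhcp-%d" % i for i in range(10)], removed.append)
print(total, "namespaces processed")
```

Each worker owns a whole chunk, so workers do not compete for individual items while the list is being processed.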

Related fix proposed to branch: 10.0/newton
Change author: Sergey Belous <email address hidden>
Review: https://review.fuel-infra.org/32539

Related fix proposed to branch: mcp/newton
Change author: Sergey Belous <email address hidden>
Review: https://review.fuel-infra.org/33748

Related fix proposed to branch: 11.0/ocata
Change author: Sergey Belous <email address hidden>
Review: https://review.fuel-infra.org/34150

Related fix proposed to branch: mcp/ocata
Change author: Sergey Belous <email address hidden>
Review: https://review.fuel-infra.org/34918

Change abandoned by Roman Podoliaka <email address hidden> on branch: 11.0/ocata
Review: https://review.fuel-infra.org/34150
Reason: we do not need 11.0/ocata anymore - use mcp/ocata instead

Change abandoned by Alexander Ignatov <email address hidden> on branch: mcp/newton
Review: https://review.fuel-infra.org/33748
Reason: This was mos-specific patch. Not required for Newton as well as Ocata.

Change abandoned by Alexander Ignatov <email address hidden> on branch: mcp/ocata
Review: https://review.fuel-infra.org/34918
Reason: This was mos-specific patch. Not required for Newton as well as Ocata.
