dhcp-agent performance issue

Bug #1548190 reported by Prashant Shetty
18
This bug affects 3 people
Affects Status Importance Assigned to Milestone
neutron
Fix Released
High
Kevin Benton

Bug Description

Single controller[48 GB RAM, 16vCPU, 120GB Disk]
1 Network Node
100 ESX hypervisors distributed in 10 nova-compute nodes
100 KVM hypervisors

Test:

1. Create /16 network
2. Template to spawn 500 instances on network created in step 1

Observation:

1. Tried spinning 500 instances on KVM and 500 instances on ESX hypervisor using CirrOS qemu & vmdk image respectively
2. Stack creation went through fine, but randomly few VM's missed dhcp IP on first boot
3. After forcing sudo udhcpc from guest OS, VM's was able to receive IP
4. Tried tcpdump on network dhcp port on network node, I could able to see only dhcp request coming from guest VM but there was no dhcp offer or ack from dhcp agent.

Assume this issue is due to scale test on this setup.

Please let me know if you need any particular logs or details to debug this issue, I can work and re-create issue and collect logs.

Logs:

Network node:

stack@network:/opt/stack/neutron$ git log -1
commit 171351d5fb55f96e9ed9e56eab755f2717ee67b0
Merge: 0a5a3cf 8f15d72
Author: Jenkins <email address hidden>
Date: Fri Feb 19 07:25:52 2016 +0000

    Merge "Adopt Grafana to plot Neutron Failure Rates"
stack@network:/opt/stack/neutron$

stack@network:/opt/stack/neutron$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.04
DISTRIB_CODENAME=trusty
DISTRIB_DESCRIPTION="Ubuntu 14.04.3 LTS"
stack@network:/opt/stack/neutron$

stack@network:/opt/stack/neutron$ dpkg -l | grep dhcp
ii isc-dhcp-client 4.2.4-7ubuntu12.3 amd64 ISC DHCP client
ii isc-dhcp-common 4.2.4-7ubuntu12.3 amd64 common files used by all the isc-dhcp* packages
stack@network:/opt/stack/neutron$

nobody 32359 0.0 0.0 28212 2324 ? S Feb19 0:00 dnsmasq --no-hosts --no-resolv --strict-order --except-interface=lo --pid-file=/opt/stack/data/neutron/dhcp/dba9d787-d145-4b2b-8c5d-d1b3375935ba/pid --dhcp-hostsfile=/opt/stack/data/neutron/dhcp/dba9d787-d145-4b2b-8c5d-d1b3375935ba/host --addn-hosts=/opt/stack/data/neutron/dhcp/dba9d787-d145-4b2b-8c5d-d1b3375935ba/addn_hosts --dhcp-optsfile=/opt/stack/data/neutron/dhcp/dba9d787-d145-4b2b-8c5d-d1b3375935ba/opts --dhcp-leasefile=/opt/stack/data/neutron/dhcp/dba9d787-d145-4b2b-8c5d-d1b3375935ba/leases --dhcp-match=set:ipxe,175 --bind-interfaces --interface=ns-460a72e9-8f --dhcp-range=set:tag0,169.254.169.252,static,86400s --dhcp-lease-max=4 --conf-file= --domain=openstacklocal

=== sshd host keys ===
-----BEGIN SSH HOST KEY KEYS-----
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAAAgmNQGIVf2P+n6U3yYkWwv/V598atmu3ZCHN8XRHIgujUkKKJ5IaurmDFH9XqSJhB2B+90ZP/z0bTqkI0to7QuoFw0GzyV2og59cZZrjntimdoAdo7RZKuwtyQZoIOHILBE2On/y1F3ixhVeE5j78GuiDN+yCNiIBhAnm0V9d0QxGtqc= root@kvm-218-admin-3
ssh-dss AAAAB3NzaC1kc3MAAACBAJNHiG1dQxMG+vmlJEWY1xgak5xHGFzm+QFEkFXOaRzdzPw8y9DX/O+bxnrcaSKwKHBJcKaJZJ+SHnVEpX2QgQQA+uSSMCMKQFISjdTCtTQ/0tUiLculV/ISrXDCm9cHG6DWqplVoc/LmXJObh1NZrjPunW2NrUCoI2qXtFMbdiVAAAAFQC4GI/n1hxNcGEjkzJ2GRRcenU+hwAAAIAjJ09CspilFn7DintV3Sh7lCs1I5+m+mQcI/AJeLSE+Ee04Nw47KWgrFyoaCQNYyHt1wgONlTmcKYyDxUGwjbRTyZYEwdaxgd+klxt7NP6fR2kmuzCLiyFP+W2WNJcqRr6xxlRif4TOwbOd/Qlt0JeM1LufaxyuogSBgWS/kys6AAAAIAIPMlAPi41qCGhUnUkbx23OrzqkP+xZEj7cGnxa/HxpnIkZ+qMhRIhBMFADEt4MKKmdZyTPYDiSaiBR298ic6ywekA1YxAqwCEtKlA+7Cdl9q7REIPVh1CSWZdJEKi9qKBMYaMzuw7cbkdh5ZbOMHWALS8JgndYGD77xWQwKamCA== root@kvm-218-admin-3
-----END SSH HOST KEY KEYS-----
=== network info ===
if-info: lo,up,127.0.0.1,8,::1
if-info: eth0,up,,8,fe80::f816:3eff:feec:77dc
=== datasource: configdrive local ===
instance-id: 11a83228-beef-4c0e-a7a3-15748de381e9
name: kvm-218-admin-3
availability-zone: nova
local-hostname: kvm-218-admin-3.novalocal
launch-index: 0
=== cirros: current=0.3.4 uptime=191.83 ===
route: fscanf
=== pinging gateway failed, debugging connection ===
############ debug start ##############
### /etc/init.d/sshd start
Starting dropbear sshd: OK
route: fscanf
### ifconfig -a
eth0 Link encap:Ethernet HWaddr FA:16:3E:EC:77:DC
          inet6 addr: fe80::f816:3eff:feec:77dc/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:84 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3840 (3.7 KiB) TX bytes:1150 (1.1 KiB)

lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:16436 Metric:1
          RX packets:12 errors:0 dropped:0 overruns:0 frame:0
          TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1020 (1020.0 B) TX bytes:1020 (1020.0 B)

### route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
route: fscanf
### cat /etc/resolv.conf
cat: can't open '/etc/resolv.conf': No such file or directory
### gateway not found
/sbin/cirros-status: line 1: can't open /etc/resolv.conf: no such file
### pinging nameservers
### uname -a
Linux kvm-218-admin-3 3.2.0-80-virtual #116-Ubuntu SMP Mon Mar 23 17:28:52 UTC 2015 x86_64 GNU/Linux
### lsmod
Module Size Used by Not tainted
nls_iso8859_1 12713 0
nls_cp437 16991 0
vfat 17585 0
fat 61512 1 vfat
isofs 40259 0
ip_tables 27473 0
x_tables 29891 1 ip_tables
pcnet32 42119 0
8139cp 27360 0
ne2k_pci 13691 0
8390 18856 1 ne2k_pci
e1000 108589 0
acpiphp 24231 0
### dmesg | tail
[ 2.581084] EXT3-fs (vda): mounted filesystem with ordered data mode
[ 2.746639] ISO 9660 Extensions: Microsoft Joliet Level 3
[ 2.790563] ISO 9660 Extensions: RRIP_1991A
[ 2.955492] kjournald starting. Commit interval 5 seconds
[ 2.973249] EXT3-fs (vda): using internal journal
[ 2.973330] EXT3-fs (vda): mounted filesystem with ordered data mode
[ 5.386057] EXT3-fs (vda): using internal journal
[ 7.067755] ISO 9660 Extensions: Microsoft Joliet Level 3
[ 7.151524] ISO 9660 Extensions: RRIP_1991A
[ 19.008702] eth0: no IPv6 routers present
### tail -n 25 /var/log/messages
Feb 3 11:31:14 cirros kern.info kernel: [ 2.267494] acpiphp: Slot [25] registered
Feb 3 11:31:14 cirros kern.info kernel: [ 2.267701] acpiphp: Slot [26] registered
Feb 3 11:31:14 cirros kern.info kernel: [ 2.267899] acpiphp: Slot [27] registered
Feb 3 11:31:14 cirros kern.info kernel: [ 2.268298] acpiphp: Slot [28] registered
Feb 3 11:31:14 cirros kern.info kernel: [ 2.268528] acpiphp: Slot [29] registered
Feb 3 11:31:14 cirros kern.info kernel: [ 2.268721] acpiphp: Slot [30] registered
Feb 3 11:31:14 cirros kern.info kernel: [ 2.268907] acpiphp: Slot [31] registered
Feb 3 11:31:14 cirros kern.info kernel: [ 2.302428] e1000: Intel(R) PRO/1000 Network Driver - version 7.3.21-k8-NAPI
Feb 3 11:31:14 cirros kern.info kernel: [ 2.302460] e1000: Copyright (c) 1999-2006 Intel Corporation.
Feb 3 11:31:14 cirros kern.info kernel: [ 2.325543] ne2k-pci.c:v1.03 9/22/2003 D. Becker/P. Gortmaker
Feb 3 11:31:14 cirros kern.info kernel: [ 2.348426] 8139cp: 8139cp: 10/100 PCI Ethernet driver v1.3 (Mar 22, 2004)
Feb 3 11:31:14 cirros kern.info kernel: [ 2.375341] pcnet32: pcnet32.c:v1.35 21.Apr.2008 <email address hidden>
Feb 3 11:31:14 cirros kern.info kernel: [ 2.400792] ip_tables: (C) 2000-2006 Netfilter Core Team
Feb 3 11:31:14 cirros kern.info kernel: [ 2.579114] kjournald starting. Commit interval 5 seconds
Feb 3 11:31:14 cirros kern.info kernel: [ 2.581084] EXT3-fs (vda): mounted filesystem with ordered data mode
Feb 3 11:31:14 cirros kern.debug kernel: [ 2.746639] ISO 9660 Extensions: Microsoft Joliet Level 3
Feb 3 11:31:14 cirros kern.debug kernel: [ 2.790563] ISO 9660 Extensions: RRIP_1991A
Feb 3 11:31:14 cirros kern.info kernel: [ 2.955492] kjournald starting. Commit interval 5 seconds
Feb 3 11:31:14 cirros kern.info kernel: [ 2.973249] EXT3-fs (vda): using internal journal
Feb 3 11:31:14 cirros kern.info kernel: [ 2.973330] EXT3-fs (vda): mounted filesystem with ordered data mode
Feb 3 11:31:14 cirros kern.info kernel: [ 5.386057] EXT3-fs (vda): using internal journal
Feb 3 11:31:15 cirros kern.debug kernel: [ 7.067755] ISO 9660 Extensions: Microsoft Joliet Level 3
Feb 3 11:31:15 cirros kern.debug kernel: [ 7.151524] ISO 9660 Extensions: RRIP_1991A
Feb 3 11:31:27 cirros kern.debug kernel: [ 19.008702] eth0: no IPv6 routers present
Feb 3 11:34:19 cirros authpriv.info dropbear[261]: Running in background
############ debug end ##############
  ____ ____ ____
 / __/ __ ____ ____ / __ \/ __/
/ /__ / // __// __// /_/ /\ \
\___//_//_/ /_/ \____/___/
   http://cirros-cloud.net

login as 'cirros' user. default password: 'cubswin:)'. use 'sudo' for root.
kvm-218-admin-3 login:

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

can you check system logs for dnsmasq messages? logs should show pretty much all requests/responses.
You'll be able to see if it is performance issue or something else. Also note process ids on those messages.
Right now dnsmasq gets restarted with each VM being spawned after its config files are renewed.
That might be the reason.

Revision history for this message
Soumiyajit (soumiyajit-v) wrote :

This happens as the dnsmasq request for ip may be not having a success check by the nova-compute. Need to verify but this is a usual case for multiple vms spawn.

Changed in neutron:
status: New → Confirmed
assignee: nobody → Soumiyajit (soumiyajit-v)
Revision history for this message
Kevin Benton (kevinbenton) wrote :
Changed in neutron:
status: Confirmed → Fix Released
importance: Undecided → High
assignee: Soumiyajit (soumiyajit-v) → Kevin Benton (kevinbenton)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/399535
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d1930cefd27448eefc373a229a26f8da25581983
Submitter: Jenkins
Branch: master

commit d1930cefd27448eefc373a229a26f8da25581983
Author: Kevin Benton <email address hidden>
Date: Fri Nov 18 04:25:32 2016 -0700

    Lock in DHCP agent based on network_id

    All cache operations and dnsmasq process operations
    are scoped to a network ID so we can always safely
    perform concurrent actions on different network IDs.
    This patch adjusts the DHCP agent to lock based on
    network ID rather than having a global lock for every
    operation.

    sync_state calls are still protected with a reader/writer
    lock to ensure that when sync_state needs to run, all
    other operations are blocked.

    Related-Bug: #1548190
    Change-Id: I56010dc801d82be56f12e834c5164316872c2f8b

tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/newton)

Related fix proposed to branch: stable/newton
Review: https://review.openstack.org/450771

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/newton)

Change abandoned by Ihar Hrachyshka (<email address hidden>) on branch: stable/newton
Review: https://review.openstack.org/450771
Reason: Superseded by https://review.openstack.org/#/c/452623/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/newton)

Reviewed: https://review.openstack.org/452623
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=37de2f6c4c831a8f4515eee9938996936dd2776c
Submitter: Jenkins
Branch: stable/newton

commit 37de2f6c4c831a8f4515eee9938996936dd2776c
Author: Kevin Benton <email address hidden>
Date: Fri Nov 18 04:25:32 2016 -0700

    Lock in DHCP agent based on network_id

    All cache operations and dnsmasq process operations
    are scoped to a network ID so we can always safely
    perform concurrent actions on different network IDs.
    This patch adjusts the DHCP agent to lock based on
    network ID rather than having a global lock for every
    operation.

    sync_state calls are still protected with a reader/writer
    lock to ensure that when sync_state needs to run, all
    other operations are blocked.

    Related-Bug: #1548190
    Change-Id: I56010dc801d82be56f12e834c5164316872c2f8b
    (cherry picked from commit d1930cefd27448eefc373a229a26f8da25581983)

    Squashed this this commit since tests fail otherwise:

    Always acquire network.id lock in dhcp port update

    Looking at the cache before aqcuiring a lock may cause the
    agent to mistakenly think the network doesn't exist when it
    is actually being wired in parallel.

    Always acquiring the network-based semaphore will ensure that
    the network isn't currently being setup in another coroutine.

    Closes-Bug: #1659919
    Change-Id: I99ae71e3c5b1cd91dca3f6c80b04d2ecb79de64f
    (cherry picked from commit 38de22bf2d4c0879a84db4fbc9fa030f181affc0)

tags: added: in-stable-newton
tags: removed: neutron-proactive-backport-potential
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.