20 nodes discovering failed

Bug #1379917 reported by Sergey Galkin
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
Fuel for OpenStack | Fix Released | High | Tomasz 'Zen' Napierala |
5.1.x | Fix Committed | High | Tomasz 'Zen' Napierala |

Bug Description

api: '1.0'
astute_sha: f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13
auth_required: true
build_id: 2014-10-09_01-14-07
build_number: '23'
feature_groups:
- mirantis
fuellib_sha: 46ad455514614ec2600314ac80191e0539ddfc04
fuelmain_sha: 8c9f42552fb37de643a500f8f99ba8e7fa4e15e3
nailgun_sha: eb8f2b358ea4bb7eb0b2a0075e7ad3d3a905db0d
ostf_sha: 64cb59c681658a7a55cc2c09d079072a41beb346
production: docker
release: 5.1.1

Steps to reproduce:
1. Create a lab with 21 nodes that PXE boot on 10G interfaces.
2. Power off 20 nodes.
3. Install Fuel in KVM on the one remaining node.
4. Power on the 20 nodes simultaneously.
Fuel discovers only 15-17 nodes. The other 3-5 nodes fail to boot from PXE and boot from HDD instead.

Logs from the Jenkins job (the first column is the time elapsed since the job started)

Case #1, power on 20 nodes simultaneously, only 16 nodes discovered.

00:04:34.359 2014-10-10 20:13:46,936 DEBUG - Discover 1 nodes
00:04:44.496 2014-10-10 20:13:57,072 DEBUG - Discover 2 nodes
00:04:59.724 2014-10-10 20:14:12,301 DEBUG - Discover 5 nodes
00:05:04.824 2014-10-10 20:14:17,401 DEBUG - Discover 7 nodes
00:05:15.035 2014-10-10 20:14:27,612 DEBUG - Discover 8 nodes
00:05:20.152 2014-10-10 20:14:32,729 DEBUG - Discover 10 nodes
00:05:30.401 2014-10-10 20:14:42,978 DEBUG - Discover 11 nodes
00:05:45.794 2014-10-10 20:14:58,371 DEBUG - Discover 14 nodes
00:05:50.925 2014-10-10 20:15:03,502 DEBUG - Discover 15 nodes
00:05:56.094 2014-10-10 20:15:08,671 DEBUG - Discover 16 nodes

Case #2, power on 20 nodes with a 10-second interval, only 19 nodes discovered.

00:04:32.024 2014-10-10 21:21:57,947 DEBUG - Discover 1 nodes
00:04:37.106 2014-10-10 21:22:03,029 DEBUG - Discover 2 nodes
00:05:02.467 2014-10-10 21:22:28,390 DEBUG - Discover 3 nodes
00:05:22.761 2014-10-10 21:22:48,684 DEBUG - Discover 4 nodes
00:05:38.011 2014-10-10 21:23:03,934 DEBUG - Discover 6 nodes
00:05:43.107 2014-10-10 21:23:09,030 DEBUG - Discover 7 nodes
00:05:48.203 2014-10-10 21:23:14,126 DEBUG - Discover 9 nodes
00:05:58.419 2014-10-10 21:23:24,342 DEBUG - Discover 10 nodes
00:06:29.079 2014-10-10 21:23:55,002 DEBUG - Discover 11 nodes
00:06:44.424 2014-10-10 21:24:10,347 DEBUG - Discover 12 nodes
00:06:59.814 2014-10-10 21:24:25,737 DEBUG - Discover 14 nodes
00:07:15.213 2014-10-10 21:24:41,135 DEBUG - Discover 15 nodes
00:07:35.757 2014-10-10 21:25:01,680 DEBUG - Discover 16 nodes
00:07:56.359 2014-10-10 21:25:22,282 DEBUG - Discover 17 nodes
00:08:06.646 2014-10-10 21:25:32,569 DEBUG - Discover 18 nodes
00:08:32.377 2014-10-10 21:25:58,300 DEBUG - Discover 19 nodes

Case #3, power on 20 nodes with a 15-second interval, all nodes discovered.

00:05:21.753 2014-10-10 22:05:23,777 DEBUG - Discover 2 nodes
00:05:26.839 2014-10-10 22:05:28,862 DEBUG - Discover 3 nodes
00:06:02.439 2014-10-10 22:06:04,462 DEBUG - Discover 4 nodes
00:06:12.747 2014-10-10 22:06:14,770 DEBUG - Discover 5 nodes
00:06:22.954 2014-10-10 22:06:24,978 DEBUG - Discover 6 nodes
00:07:03.673 2014-10-10 22:07:05,697 DEBUG - Discover 10 nodes
00:07:13.909 2014-10-10 22:07:15,933 DEBUG - Discover 11 nodes
00:07:54.799 2014-10-10 22:07:56,823 DEBUG - Discover 13 nodes
00:08:05.056 2014-10-10 22:08:07,080 DEBUG - Discover 14 nodes
00:08:20.456 2014-10-10 22:08:22,480 DEBUG - Discover 15 nodes
00:08:46.131 2014-10-10 22:08:48,154 DEBUG - Discover 16 nodes
00:09:06.691 2014-10-10 22:09:08,715 DEBUG - Discover 17 nodes
00:09:11.844 2014-10-10 22:09:13,868 DEBUG - Discover 19 nodes
00:10:08.528 2014-10-10 22:10:10,552 DEBUG - Discover 20 nodes
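
The three cases above can be compared mechanically. The following is a minimal sketch, assuming the log lines keep exactly the layout shown (elapsed time, timestamp, then "DEBUG - Discover N nodes"); the function name `summarize` and the two-line sample are illustrative and not part of the Jenkins job.

```python
import re
from datetime import datetime

# Matches lines such as:
# 00:05:56.094 2014-10-10 20:15:08,671 DEBUG - Discover 16 nodes
LINE_RE = re.compile(
    r"^(?P<elapsed>\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"DEBUG - Discover (?P<count>\d+) nodes$"
)


def summarize(log_text, expected):
    """Return (final count, missing nodes, seconds from first to last line)."""
    rows = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group("stamp"), "%Y-%m-%d %H:%M:%S,%f")
            rows.append((stamp, int(match.group("count"))))
    if not rows:
        raise ValueError("no discovery lines found")
    final = rows[-1][1]
    span = (rows[-1][0] - rows[0][0]).total_seconds()
    return final, expected - final, span


# First and last lines of case #1, copied from the report.
CASE_1 = """\
00:04:34.359 2014-10-10 20:13:46,936 DEBUG - Discover 1 nodes
00:05:56.094 2014-10-10 20:15:08,671 DEBUG - Discover 16 nodes
"""

final, missing, span = summarize(CASE_1, expected=20)
print(final, missing, span)
```

Feeding each case's full log through the same function gives the number of undiscovered nodes and the discovery window per run.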

Tags: scale
Sergey Galkin (sgalkin) wrote :

Snapshot for case #1

Łukasz Oleś (loles)
Changed in fuel:
status: New → Triaged
importance: Undecided → High
milestone: none → 6.0
Łukasz Oleś (loles) wrote :

It looks like dnsmasq assigned the same IP to two nodes:
IP MAC
10.20.1.131 0c:c4:7a:1d:90:fe
10.20.1.131 0c:c4:7a:1d:92:76

10.20.1.45 0c:c4:7a:1d:91:64
10.20.1.45 0c:c4:7a:1d:93:da

As Alexandr Shaposhnikov suggested in another bug, setting the dhcp-sequential-ip option in dnsmasq fixes the problem.
This change requires more investigation.
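
A duplicate assignment like the one above can be spotted automatically from a list of (IP, MAC) pairs. This is a minimal sketch; the pairs are the ones quoted in this comment, and how they are collected (lease file, logs, ARP table) is left open here.

```python
from collections import defaultdict


def find_duplicate_ips(pairs):
    """Map each IP held by more than one MAC to the sorted list of those MACs."""
    macs_by_ip = defaultdict(set)
    for ip, mac in pairs:
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


# Pairs copied from the table in this comment.
LEASES = [
    ("10.20.1.131", "0c:c4:7a:1d:90:fe"),
    ("10.20.1.131", "0c:c4:7a:1d:92:76"),
    ("10.20.1.45", "0c:c4:7a:1d:91:64"),
    ("10.20.1.45", "0c:c4:7a:1d:93:da"),
]

duplicates = find_duplicate_ips(LEASES)
for ip, macs in duplicates.items():
    print(ip, "->", ", ".join(macs))
```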

Łukasz Oleś (loles)
tags: added: scale
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
assignee: nobody → Tomasz 'Zen' Napierala (tzn)
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/127946
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=de060cc37f232870bf1f699b3973fc425e9abe05
Submitter: Jenkins
Branch: master

commit de060cc37f232870bf1f699b3973fc425e9abe05
Author: Łukasz Oleś <email address hidden>
Date: Sun Oct 12 22:05:54 2014 +0200

    Add dhcp-sequential-ip option to dnsmasq

    For many simultaneously DHCPDISCOVER requests dnsmasq
    can offer the same IP for two different MAC addresses.
    This option prevents it by assigning IPs one by one
    instead of using hashing algorithm.

    Change-Id: Iff3c42d21e1f1c09cb9eab5f07dbb066508dcb56
    Related-bug: 1378000
    Related-bug: 1376680
    Related-bug: 1379917
    Blueprint: 100-nodes-support
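
To confirm that a deployed dnsmasq configuration actually carries the option, the config text can be scanned for it. This is a minimal sketch: the parser handles only the simple `option` and `option=value` forms with `#` comments, and the sample config (including its address range) is hypothetical rather than taken from fuel-library.

```python
def parse_dnsmasq_options(conf_text):
    """Return {option: value or None} from dnsmasq.conf-style text.

    Repeated options keep only the last value, which is enough for a
    presence check but not for a full config audit.
    """
    options = {}
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        name, sep, value = line.partition("=")
        options[name.strip()] = value.strip() if sep else None
    return options


# Hypothetical config fragment, for illustration only.
SAMPLE_CONF = """\
# DHCP settings
dhcp-range=10.20.0.10,10.20.0.200,60m
dhcp-sequential-ip
"""

options = parse_dnsmasq_options(SAMPLE_CONF)
print("dhcp-sequential-ip enabled:", "dhcp-sequential-ip" in options)
```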

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (stable/5.1)

Related fix proposed to branch: stable/5.1
Review: https://review.openstack.org/128611

OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: stable/5.1
Review: https://review.openstack.org/128621

OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/128591
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=bcdd8043d2ccd403f2b22c1ba40a81ea7c288090
Submitter: Jenkins
Branch: master

commit bcdd8043d2ccd403f2b22c1ba40a81ea7c288090
Author: Matthew Mosesohn <email address hidden>
Date: Wed Oct 15 13:29:53 2014 +0400

    Increase tolerance of install DHCP

    For Ubuntu, added the following kernel options:
    netcfg/link_detection_timeout=20
    netcfg/dhcptimeout=120
    For CentOS:
    dhcptimeout=120

    Closes-Bug: #1381266
    Related-Bug: #1376680
    Related-Bug: #1379917
    blueprint 100-nodes-support

    Change-Id: I3e5f687590a145ba174d19d392bdbb73d4d9a11e

OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (stable/5.1)

Reviewed: https://review.openstack.org/128621
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=dfcb881320a7ce3d1e2f7ee578f62519155431b0
Submitter: Jenkins
Branch: stable/5.1

commit dfcb881320a7ce3d1e2f7ee578f62519155431b0
Author: Matthew Mosesohn <email address hidden>
Date: Wed Oct 15 13:29:53 2014 +0400

    Increase tolerance of install DHCP

    For Ubuntu, added the following kernel options:
    netcfg/link_detection_timeout=20
    netcfg/dhcptimeout=120
    For CentOS:
    dhcptimeout=120

    Closes-Bug: #1381266
    Related-Bug: #1376680
    Related-Bug: #1379917
    blueprint 100-nodes-support

    Change-Id: I3e5f687590a145ba174d19d392bdbb73d4d9a11e

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/128806

OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/128806
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=a28fbb4ce9476bded8633abd11031a80b2eea47c
Submitter: Jenkins
Branch: master

commit a28fbb4ce9476bded8633abd11031a80b2eea47c
Author: Łukasz Oleś <email address hidden>
Date: Thu Oct 16 03:44:20 2014 +0200

    Increase tolerance of install DHCP

    For Ubuntu, added the following kernel options:
    netcfg/link_detection_timeout=20
    netcfg/dhcptimeout=120
    For CentOS:
    dhcptimeout=120

    Closes-Bug: #1381266
    Related-Bug: #1376680
    Related-Bug: #1379917
    blueprint 100-nodes-support

    Change-Id: I8bf38ae18b5741b03e6bda5f8e69748a8ecdf2ec

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (stable/5.1)

Related fix proposed to branch: stable/5.1
Review: https://review.openstack.org/129184

OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (master)

Change abandoned by Sergii Golovatiuk (<email address hidden>) on branch: master
Review: https://review.openstack.org/128939

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/129203

Changed in fuel:
status: Triaged → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (stable/5.1)

Reviewed: https://review.openstack.org/129184
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=335f009ba3b4ab63f90e1394e9fe08cd16c84ead
Submitter: Jenkins
Branch: stable/5.1

commit 335f009ba3b4ab63f90e1394e9fe08cd16c84ead
Author: Łukasz Oleś <email address hidden>
Date: Thu Oct 16 03:44:20 2014 +0200

    Increase tolerance of install DHCP

    For Ubuntu, added the following kernel options:
    netcfg/link_detection_timeout=20
    netcfg/dhcptimeout=120
    For CentOS:
    dhcptimeout=120

    Closes-Bug: #1381266
    Related-Bug: #1376680
    Related-Bug: #1379917
    blueprint 100-nodes-support

    Change-Id: I8bf38ae18b5741b03e6bda5f8e69748a8ecdf2ec

OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/129203
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=3070f3fa3c6d1b227242fd2a00a7babfc73424f5
Submitter: Jenkins
Branch: master

commit 3070f3fa3c6d1b227242fd2a00a7babfc73424f5
Author: Sergii Golovatiuk <email address hidden>
Date: Thu Oct 16 16:27:18 2014 +0200

    Increase settings for dnsmasq and sysctl

    * Make a new variable dhcp_lease_max. It increases the number of
      available leases from 1000 to 1800. It allows to provision nodes on
      scale, when Debian Installer or Anaconda looses IP in the middle of
      install.
    * Make a new variable lease_time. It increases the default lease size
      to 120m, up from the default 60m.
    * Add cache-size to dnsmasq template. dnsmasq will keep more entries in
      case.
    * Increased neighbour table on master node to keep more ARP requests
      that come in parallel once deployment is started. This change also
      removes unneed broadcast traffic. New values are:
      net.ipv4.neigh.default.gc_thresh1 = 256
      net.ipv4.neigh.default.gc_thresh2 = 1024
      net.ipv4.neigh.default.gc_thresh3 = 2048
    * Fix linting

    Related-Bug: #1376680
    Related-Bug: #1379917
    Related-Bug: #1381997
    blueprint 100-nodes-support
    DocImpact

    Change-Id: I4da8070143e401f7a9246e72eda35e601b8c6386
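
Whether a master node already has the neighbour-table thresholds from this commit can be checked by comparing the live values with the ones listed above. This is a minimal sketch: the helper names are illustrative, `read_current` assumes a Linux host that exposes these keys under /proc/sys, and the "before" values in the example are an assumption about a typical unmodified host, not something stated in this report.

```python
# Values listed in the commit message above.
RECOMMENDED = {
    "net.ipv4.neigh.default.gc_thresh1": 256,
    "net.ipv4.neigh.default.gc_thresh2": 1024,
    "net.ipv4.neigh.default.gc_thresh3": 2048,
}


def sysctl_path(key, root="/proc/sys"):
    """Translate a dotted sysctl key into its /proc/sys file path."""
    return root + "/" + key.replace(".", "/")


def read_current(keys, root="/proc/sys"):
    """Read the live integer value of each key (Linux only)."""
    values = {}
    for key in keys:
        with open(sysctl_path(key, root)) as handle:
            values[key] = int(handle.read().split()[0])
    return values


def below_recommended(current, recommended=RECOMMENDED):
    """Return {key: (current, wanted)} for keys that are missing or too low."""
    return {
        key: (current.get(key), wanted)
        for key, wanted in recommended.items()
        if current.get(key) is None or current[key] < wanted
    }


# Assumed "before" values, for illustration only.
EXAMPLE_CURRENT = {
    "net.ipv4.neigh.default.gc_thresh1": 128,
    "net.ipv4.neigh.default.gc_thresh2": 512,
    "net.ipv4.neigh.default.gc_thresh3": 1024,
}

shortfalls = below_recommended(EXAMPLE_CURRENT)
for key, (have, want) in shortfalls.items():
    print(f"{key}: {have} < {want}")
```

On a real master node, `below_recommended(read_current(RECOMMENDED))` would perform the same comparison against the running kernel.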

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (stable/5.1)

Related fix proposed to branch: stable/5.1
Review: https://review.openstack.org/129850

OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (stable/5.1)

Reviewed: https://review.openstack.org/129850
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=0e0479727e2240f8c51eb899435bac505377e245
Submitter: Jenkins
Branch: stable/5.1

commit 0e0479727e2240f8c51eb899435bac505377e245
Author: Sergii Golovatiuk <email address hidden>
Date: Thu Oct 16 16:27:18 2014 +0200

    Increase settings for dnsmasq and sysctl

    * Make a new variable dhcp_lease_max. It increases the number of
      available leases from 1000 to 1800. It allows to provision nodes on
      scale, when Debian Installer or Anaconda looses IP in the middle of
      install.
    * Make a new variable lease_time. It increases the default lease size
      to 120m, up from the default 60m.
    * Add cache-size to dnsmasq template. dnsmasq will keep more entries in
      case.
    * Increased neighbour table on master node to keep more ARP requests
      that come in parallel once deployment is started. This change also
      removes unneed broadcast traffic. New values are:
      net.ipv4.neigh.default.gc_thresh1 = 256
      net.ipv4.neigh.default.gc_thresh2 = 1024
      net.ipv4.neigh.default.gc_thresh3 = 2048
    * Fix linting

    Related-Bug: #1376680
    Related-Bug: #1379917
    Related-Bug: #1381997
    blueprint 100-nodes-support

    Change-Id: I4da8070143e401f7a9246e72eda35e601b8c6386

Sergey Galkin (sgalkin)
description: updated
Sergey Galkin (sgalkin) wrote :

On this build:

api: '1.0'
astute_sha: 97eea90efe0a1f17b4934919d6e459d270c10372
auth_required: true
build_id: 2014-10-23_19-30-06
build_number: '44'
feature_groups:
- mirantis
fuellib_sha: fb088f2c77b00b75c9c894a06d5779fdeeb536ca
fuelmain_sha: 7b34d04edb929659dad5fa4ddb224346d4e74ec3
nailgun_sha: 52e777a8dee1bfa8ca3f69e42bb563696fdfd995
ostf_sha: de177931b53fbe9655502b73d03910b8118e25f1
production: docker

If we power on all 100 nodes at once, not all of them are discovered.
All 100 nodes are discovered only when they are powered on at 7-second intervals.
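
The staggered power-on used as a workaround can be sketched as follows. This is a minimal illustration, not the lab's actual tooling: `power_on` is a hypothetical callback (for example a wrapper around whatever out-of-band management the lab uses), and the 7-second interval is the value reported in this comment.

```python
import time


def stagger_offsets(node_count, interval_s):
    """Start offset in seconds for each node when powering on one by one."""
    return [index * interval_s for index in range(node_count)]


def power_on_staggered(nodes, interval_s, power_on, sleep=time.sleep):
    """Call power_on(node) for each node, waiting interval_s between calls.

    power_on is a caller-supplied (hypothetical) function; sleep is
    injectable so the loop can be exercised without real delays.
    """
    for index, node in enumerate(nodes):
        if index:
            sleep(interval_s)
        power_on(node)


offsets = stagger_offsets(100, 7)
print("last node starts after", offsets[-1], "seconds")
```

With 100 nodes and a 7-second interval the last node is powered on 693 seconds after the first, so the workaround adds roughly 11.5 minutes to a run.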

Tomasz 'Zen' Napierala (tzn) wrote :

We need to reproduce it with the latest ISO (#53) and observe the entire deployment.

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/128611
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=1a785608f7b45af20a165b7cae1a5e2f0a4d63e0
Submitter: Jenkins
Branch: stable/5.1

commit 1a785608f7b45af20a165b7cae1a5e2f0a4d63e0
Author: Łukasz Oleś <email address hidden>
Date: Sun Oct 12 22:05:54 2014 +0200

    Add dhcp-sequential-ip option to dnsmasq

    For many simultaneously DHCPDISCOVER requests dnsmasq
    can offer the same IP for two different MAC addresses.
    This option prevents it by assigning IPs one by one
    instead of using hashing algorithm.

    Change-Id: Iff3c42d21e1f1c09cb9eab5f07dbb066508dcb56
    Related-bug: 1378000
    Related-bug: 1376680
    Related-bug: 1379917
    Blueprint: 100-nodes-support

Changed in fuel:
status: Fix Committed → Fix Released