Deploy cluster with 57 nodes failed

Bug #1376680 reported by Sergey Galkin
This bug affects 2 people
Affects: Fuel for OpenStack
Status: Fix Committed
Importance: High
Assigned to: Matthew Mosesohn

Bug Description

api: '1.0'
astute_sha: f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13
auth_required: true
build_id: 2014-10-01_16-53-30
build_number: '47'
feature_groups:
- mirantis
fuellib_sha: 1cf0534efe911059853528e274891481146cedc5
fuelmain_sha: 0bd3360eb68d2604971fee2a27be0fa8e15d949c
nailgun_sha: eb8f2b358ea4bb7eb0b2a0075e7ad3d3a905db0d
ostf_sha: 64cb59c681658a7a55cc2c09d079072a41beb346
production: docker
release: 5.1.1

Steps to reproduce:
1. Start deploying a cluster with Murano, Sahara, and Ceilometer in HA mode on 3 controllers, with compute nodes that also have the Cinder LVM role, on Ubuntu.

About 30 nodes hang with the progress bar at 0% on the Ubuntu installation step.
After rebooting all nodes through IPMI, about 15 of those 30 nodes installed successfully.
The remaining 15 nodes still hang, and rebooting does not help. All of them show the error "Network autoconfiguration failed".

A boot video for one node and a screenshot of the error are attached.

Tags: scale
Revision history for this message
Sergey Galkin (sgalkin) wrote :
Revision history for this message
Sergey Galkin (sgalkin) wrote :

Screenshot

Changed in fuel:
milestone: none → 6.0
importance: Undecided → High
Revision history for this message
Sergey Galkin (sgalkin) wrote :

I have the snapshot, but it is about 450 GB in size. If you need it, let me know.

Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

I don't want a 450GB snapshot :D

We'll try raising the DHCP timeout to 120 seconds and test whether that fixes it.

Changed in fuel:
assignee: nobody → Matthew Mosesohn (raytrac3r)
status: New → In Progress
Revision history for this message
Sergey Galkin (sgalkin) wrote :

Looks like this has the same root cause as https://bugs.launchpad.net/fuel/+bug/1378000

Revision history for this message
Tomasz 'Zen' Napierala (tzn) wrote :

Do we want to keep the 120-second timeout as the final solution for this? What is the root cause, then?

Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

120s timeout didn't help. Duplicate ip address allocation is the more serious issue and that will definitely cause all sorts of connection issues. I suggest we break the system again and try to see how many duplicate IPs we have, where they are, and who we can blame for it. If it's nailgun, we've got some serious design flaws, probably related to DB consistency.
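
For the "how many duplicate IPs we have, and where" part, a minimal sketch is below. It assumes you can already collect (MAC, IP) pairs from whatever source you trust (Nailgun's node list, the dnsmasq lease file, ARP tables); the function name and the sample addresses are hypothetical and not taken from the affected environment.

```python
from collections import defaultdict


def find_duplicate_ips(assignments):
    """Return {ip: sorted list of MACs} for every IP claimed by more than one MAC.

    `assignments` is an iterable of (mac, ip) pairs. Where the pairs come from
    is up to the caller; this helper only groups and filters them.
    """
    macs_by_ip = defaultdict(set)
    for mac, ip in assignments:
        # Normalise case so the same MAC written two ways is not counted twice.
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [
        ("52:54:00:00:00:01", "10.20.0.101"),
        ("52:54:00:00:00:02", "10.20.0.102"),
        ("52:54:00:00:00:03", "10.20.0.101"),
    ]
    for ip, macs in find_duplicate_ips(sample).items():
        print(ip, "is claimed by", ", ".join(macs))
```

Running the same check against two sources (for example the database and the lease file) would also show which side introduced the duplicate.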

Revision history for this message
Łukasz Oleś (loles) wrote :

Can you extract the cobbler logs from the snapshot?

tags: added: scale
Revision history for this message
Aleksandr Shaposhnikov (alashai8) wrote :

Well, the main problem is that the lease time for discovered nodes is around 16 minutes (1000s) and dnsmasq uses non-sequential, hash-based assignment of IPs. So if a lease is not renewed, at this scale there is a high risk that the IP will be leased to another node, because the hash algorithm dnsmasq currently uses to assign IPs dynamically based on MAC will run into a lot of collisions. So, first of all, I suggest the following:
1. Increase the lease time for discovered nodes to at least 2h.
2. Switch dnsmasq to the sequential allocation algorithm.
3. Check that the bootstrap OS is able to renew the lease.
4. Check that MOS/Fuel can accept a node with a new IP and handle that correctly in terms of the DB.
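
To make the collision argument concrete, here is a toy model. It is not dnsmasq's real hash or allocation code: it assumes every client's first-choice address is derived only from its MAC (SHA-256 here, purely for illustration) and that many clients ask at once, before any lease has been recorded. The function names, MAC values, and pool sizes are made up.

```python
import hashlib


def hashed_offer(mac, pool_size):
    """First-choice slot in the pool, derived only from the MAC (toy hash)."""
    digest = hashlib.sha256(mac.encode("ascii")).digest()
    return int.from_bytes(digest[:4], "big") % pool_size


def count_clashing_offers(macs, pool_size, sequential):
    """Count clients whose first offer matches a slot already offered to another."""
    seen = set()
    clashes = 0
    for index, mac in enumerate(macs):
        # Sequential mode hands out slots one by one; hashed mode ignores order.
        slot = index % pool_size if sequential else hashed_offer(mac, pool_size)
        if slot in seen:
            clashes += 1
        seen.add(slot)
    return clashes


if __name__ == "__main__":
    # 57 made-up MACs, matching the node count in the bug title.
    macs = ["52:54:00:00:00:%02x" % i for i in range(57)]
    for pool_size in (100, 250, 1000):
        hashed = count_clashing_offers(macs, pool_size, sequential=False)
        ordered = count_clashing_offers(macs, pool_size, sequential=True)
        print("pool=%d hashed clashes=%d sequential clashes=%d"
              % (pool_size, hashed, ordered))
```

In this model sequential assignment never clashes while the pool is larger than the number of clients, whereas hashed first choices can overlap even when most of the pool is free.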

Revision history for this message
Sergey Galkin (sgalkin) wrote :

Reproduced on 20 nodes with the following build:

api: '1.0'
astute_sha: f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13
auth_required: true
build_id: 2014-10-13_00-01-06
build_number: '27'
feature_groups:
- mirantis
fuellib_sha: 46ad455514614ec2600314ac80191e0539ddfc04
fuelmain_sha: 431350ba204146f815f0e51dd47bf44569ae1f6d
nailgun_sha: 88a94a11426d356540722593af1603e5089d442c
ostf_sha: 64cb59c681658a7a55cc2c09d079072a41beb346
production: docker
release: 5.1.1

Revision history for this message
Sergey Galkin (sgalkin) wrote :

Snapshot for 20 nodes

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/127946
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=de060cc37f232870bf1f699b3973fc425e9abe05
Submitter: Jenkins
Branch: master

commit de060cc37f232870bf1f699b3973fc425e9abe05
Author: Łukasz Oleś <email address hidden>
Date: Sun Oct 12 22:05:54 2014 +0200

    Add dhcp-sequential-ip option to dnsmasq

    For many simultaneously DHCPDISCOVER requests dnsmasq
    can offer the same IP for two different MAC addresses.
    This option prevents it by assigning IPs one by one
    instead of using hashing algorithm.

    Change-Id: Iff3c42d21e1f1c09cb9eab5f07dbb066508dcb56
    Related-bug: 1378000
    Related-bug: 1376680
    Related-bug: 1379917
    Blueprint: 100-nodes-support

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (stable/5.1)

Related fix proposed to branch: stable/5.1
Review: https://review.openstack.org/128611

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: stable/5.1
Review: https://review.openstack.org/128621

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/128591
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=bcdd8043d2ccd403f2b22c1ba40a81ea7c288090
Submitter: Jenkins
Branch: master

commit bcdd8043d2ccd403f2b22c1ba40a81ea7c288090
Author: Matthew Mosesohn <email address hidden>
Date: Wed Oct 15 13:29:53 2014 +0400

    Increase tolerance of install DHCP

    For Ubuntu, added the following kernel options:
    netcfg/link_detection_timeout=20
    netcfg/dhcptimeout=120
    For CentOS:
    dhcptimeout=120

    Closes-Bug: #1381266
    Related-Bug: #1376680
    Related-Bug: #1379917
    blueprint 100-nodes-support

    Change-Id: I3e5f687590a145ba174d19d392bdbb73d4d9a11e
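
As an illustration of the kernel options named in the commit message above, the sketch below builds the option list per distribution and appends it to an existing kernel command line. The helper names and the sample command line are hypothetical; how fuel-library actually injects these options is not shown in this report.

```python
def install_kernel_opts(distro, dhcp_timeout=120, link_timeout=20):
    """Return the installer kernel options listed in the commit message."""
    if distro == "ubuntu":
        return [
            "netcfg/link_detection_timeout=%d" % link_timeout,
            "netcfg/dhcptimeout=%d" % dhcp_timeout,
        ]
    if distro == "centos":
        return ["dhcptimeout=%d" % dhcp_timeout]
    raise ValueError("unsupported distro: %r" % distro)


def append_kernel_opts(cmdline, opts):
    """Append options to a kernel command line, skipping ones already present."""
    present = cmdline.split()
    extra = [opt for opt in opts if opt not in present]
    return " ".join(present + extra)


if __name__ == "__main__":
    # The base command line here is a made-up example.
    base = "priority=critical locale=en_US"
    print(append_kernel_opts(base, install_kernel_opts("ubuntu")))
    print(append_kernel_opts(base, install_kernel_opts("centos")))
```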

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (stable/5.1)

Reviewed: https://review.openstack.org/128621
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=dfcb881320a7ce3d1e2f7ee578f62519155431b0
Submitter: Jenkins
Branch: stable/5.1

commit dfcb881320a7ce3d1e2f7ee578f62519155431b0
Author: Matthew Mosesohn <email address hidden>
Date: Wed Oct 15 13:29:53 2014 +0400

    Increase tolerance of install DHCP

    For Ubuntu, added the following kernel options:
    netcfg/link_detection_timeout=20
    netcfg/dhcptimeout=120
    For CentOS:
    dhcptimeout=120

    Closes-Bug: #1381266
    Related-Bug: #1376680
    Related-Bug: #1379917
    blueprint 100-nodes-support

    Change-Id: I3e5f687590a145ba174d19d392bdbb73d4d9a11e

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/128806

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/128806
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=a28fbb4ce9476bded8633abd11031a80b2eea47c
Submitter: Jenkins
Branch: master

commit a28fbb4ce9476bded8633abd11031a80b2eea47c
Author: Łukasz Oleś <email address hidden>
Date: Thu Oct 16 03:44:20 2014 +0200

    Increase tolerance of install DHCP

    For Ubuntu, added the following kernel options:
    netcfg/link_detection_timeout=20
    netcfg/dhcptimeout=120
    For CentOS:
    dhcptimeout=120

    Closes-Bug: #1381266
    Related-Bug: #1376680
    Related-Bug: #1379917
    blueprint 100-nodes-support

    Change-Id: I8bf38ae18b5741b03e6bda5f8e69748a8ecdf2ec

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (stable/5.1)

Related fix proposed to branch: stable/5.1
Review: https://review.openstack.org/129184

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (master)

Change abandoned by Sergii Golovatiuk (<email address hidden>) on branch: master
Review: https://review.openstack.org/128939

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/129203

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (stable/5.1)

Reviewed: https://review.openstack.org/129184
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=335f009ba3b4ab63f90e1394e9fe08cd16c84ead
Submitter: Jenkins
Branch: stable/5.1

commit 335f009ba3b4ab63f90e1394e9fe08cd16c84ead
Author: Łukasz Oleś <email address hidden>
Date: Thu Oct 16 03:44:20 2014 +0200

    Increase tolerance of install DHCP

    For Ubuntu, added the following kernel options:
    netcfg/link_detection_timeout=20
    netcfg/dhcptimeout=120
    For CentOS:
    dhcptimeout=120

    Closes-Bug: #1381266
    Related-Bug: #1376680
    Related-Bug: #1379917
    blueprint 100-nodes-support

    Change-Id: I8bf38ae18b5741b03e6bda5f8e69748a8ecdf2ec

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/129203
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=3070f3fa3c6d1b227242fd2a00a7babfc73424f5
Submitter: Jenkins
Branch: master

commit 3070f3fa3c6d1b227242fd2a00a7babfc73424f5
Author: Sergii Golovatiuk <email address hidden>
Date: Thu Oct 16 16:27:18 2014 +0200

    Increase settings for dnsmasq and sysctl

    * Make a new variable dhcp_lease_max. It increases the number of
      available leases from 1000 to 1800. It allows to provision nodes on
      scale, when Debian Installer or Anaconda looses IP in the middle of
      install.
    * Make a new variable lease_time. It increases the default lease size
      to 120m, up from the default 60m.
    * Add cache-size to dnsmasq template. dnsmasq will keep more entries in
      case.
    * Increased neighbour table on master node to keep more ARP requests
      that come in parallel once deployment is started. This change also
      removes unneed broadcast traffic. New values are:
      net.ipv4.neigh.default.gc_thresh1 = 256
      net.ipv4.neigh.default.gc_thresh2 = 1024
      net.ipv4.neigh.default.gc_thresh3 = 2048
    * Fix linting

    Related-Bug: #1376680
    Related-Bug: #1379917
    Related-Bug: #1381997
    blueprint 100-nodes-support
    DocImpact

    Change-Id: I4da8070143e401f7a9246e72eda35e601b8c6386
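
For reference, the dnsmasq side of the changes discussed in this bug could end up looking roughly like the fragment rendered below. This is a sketch, not the actual fuel-library template: the function name and the address range are placeholders, and the cache-size value is left unset by default because the commit message does not state it. The option names themselves (dhcp-sequential-ip, dhcp-range, dhcp-lease-max, cache-size) are standard dnsmasq options.

```python
def render_dnsmasq_fragment(range_start, range_end, lease_time="120m",
                            dhcp_lease_max=1800, cache_size=None):
    """Render a dnsmasq config fragment with the settings named in this bug."""
    lines = [
        # Hand out addresses in order instead of hashing the client MAC.
        "dhcp-sequential-ip",
        # Pool boundaries plus the lease time (120m per the commit message).
        "dhcp-range=%s,%s,%s" % (range_start, range_end, lease_time),
        # Upper bound on the number of leases dnsmasq will track.
        "dhcp-lease-max=%d" % dhcp_lease_max,
    ]
    if cache_size is not None:
        lines.append("cache-size=%d" % cache_size)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder pool; substitute the admin network range of the master node.
    print(render_dnsmasq_fragment("10.20.0.3", "10.20.0.254"), end="")
```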

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (stable/5.1)

Related fix proposed to branch: stable/5.1
Review: https://review.openstack.org/129850

Łukasz Oleś (loles)
Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (stable/5.1)

Reviewed: https://review.openstack.org/129850
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=0e0479727e2240f8c51eb899435bac505377e245
Submitter: Jenkins
Branch: stable/5.1

commit 0e0479727e2240f8c51eb899435bac505377e245
Author: Sergii Golovatiuk <email address hidden>
Date: Thu Oct 16 16:27:18 2014 +0200

    Increase settings for dnsmasq and sysctl

    * Make a new variable dhcp_lease_max. It increases the number of
      available leases from 1000 to 1800. It allows to provision nodes on
      scale, when Debian Installer or Anaconda looses IP in the middle of
      install.
    * Make a new variable lease_time. It increases the default lease size
      to 120m, up from the default 60m.
    * Add cache-size to dnsmasq template. dnsmasq will keep more entries in
      case.
    * Increased neighbour table on master node to keep more ARP requests
      that come in parallel once deployment is started. This change also
      removes unneed broadcast traffic. New values are:
      net.ipv4.neigh.default.gc_thresh1 = 256
      net.ipv4.neigh.default.gc_thresh2 = 1024
      net.ipv4.neigh.default.gc_thresh3 = 2048
    * Fix linting

    Related-Bug: #1376680
    Related-Bug: #1379917
    Related-Bug: #1381997
    blueprint 100-nodes-support

    Change-Id: I4da8070143e401f7a9246e72eda35e601b8c6386
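
To check whether a master node already has the neighbour-table values listed in the commit message above, something like the following could be run against the text printed by `sysctl -a` or the contents of a sysctl.conf file. It is a sketch under the assumption that the input uses plain `key = value` lines; the function names are hypothetical.

```python
RECOMMENDED = {
    "net.ipv4.neigh.default.gc_thresh1": 256,
    "net.ipv4.neigh.default.gc_thresh2": 1024,
    "net.ipv4.neigh.default.gc_thresh3": 2048,
}


def parse_sysctl(text):
    """Parse `key = value` lines, ignoring blanks and comment lines."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def below_recommended(text, recommended=RECOMMENDED):
    """Return {key: current value or None} for keys missing or below the target."""
    current = parse_sysctl(text)
    problems = {}
    for key, target in recommended.items():
        raw = current.get(key)
        if raw is None or not raw.isdigit() or int(raw) < target:
            problems[key] = raw
    return problems


if __name__ == "__main__":
    # Made-up input for illustration.
    sample = """
    # neighbour table settings
    net.ipv4.neigh.default.gc_thresh1 = 128
    net.ipv4.neigh.default.gc_thresh2 = 1024
    """
    for key, value in sorted(below_recommended(sample).items()):
        print(key, "is", value if value is not None else "not set")
```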

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/128611
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=1a785608f7b45af20a165b7cae1a5e2f0a4d63e0
Submitter: Jenkins
Branch: stable/5.1

commit 1a785608f7b45af20a165b7cae1a5e2f0a4d63e0
Author: Łukasz Oleś <email address hidden>
Date: Sun Oct 12 22:05:54 2014 +0200

    Add dhcp-sequential-ip option to dnsmasq

    For many simultaneously DHCPDISCOVER requests dnsmasq
    can offer the same IP for two different MAC addresses.
    This option prevents it by assigning IPs one by one
    instead of using hashing algorithm.

    Change-Id: Iff3c42d21e1f1c09cb9eab5f07dbb066508dcb56
    Related-bug: 1378000
    Related-bug: 1376680
    Related-bug: 1379917
    Blueprint: 100-nodes-support
