Duplicate IPs during deploy on 1000+ nodes env

Bug #1630299 reported by Alexander Gordeev on 2016-10-04
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
Fuel for OpenStack | | High | Vladimir Kozhukalov |
Mitaka | | High | Vladimir Kozhukalov |
Newton | | High | Vladimir Kozhukalov |
Ocata | | High | Vladimir Kozhukalov |

Bug Description

This bug is a follow-up to https://bugs.launchpad.net/fuel/+bug/1378000 but for a newer release.

Detailed bug description:
  Still getting duplicated IPs for a couple of nodes. Fuel 9.0 (9.1)

[root@fuel ~]# grep 10.21.8.171 fresh.node.list
| 1078 | Untitled (d1:28) | discover | ubuntu | [] | 10.21.8.171 | 52:54:00:6c:d1:28 | None | Standard PC (i440FX + PIIX, 1996) | False |
| 39 | Untitled (39:04) | discover | ubuntu | [] | 10.21.8.171 | 52:54:00:2a:39:04 | None | Standard PC (i440FX + PIIX, 1996) | True |
[root@fuel ~]#

dnsmasq.conf contains the dhcp-sequential-ip option. Perhaps it doesn't work with more than 1000 nodes.

There are more than 200 GB of logs, so I can't attach all of them. The env is still online, but we need to react ASAP; I have no idea how long it will be kept.

Changed in fuel:
milestone: none → 9.2
importance: Undecided → High
status: New → Confirmed
Łukasz Oleś (loles) wrote :

It usually happens when you delete an old environment and some nodes are not rebooted. These non-rebooted nodes still have the old node configuration.
A rebooted node can get the same IP as a node from the old env.

To make sure that this is not the case here, please reboot the nodes with the same IPs.

Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Alexander Gordeev (a-gordeev)
Dmitry Pyzhov (dpyzhov) on 2016-11-01
Changed in fuel:
assignee: Alexander Gordeev (a-gordeev) → Fuel Sustaining (fuel-sustaining-team)
Dmitry Pyzhov (dpyzhov) on 2016-11-17
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Georgy Kibardin (gkibardin)
Aleksandr Didenko (adidenko) wrote :

Problem description:

1) node-A gets IP x.x.x.x
2) node-A reports IP x.x.x.x to nailgun and it's saved to DB
3) node-A goes offline (or DHCPREQUEST/DHCPACK packets are lost during IP lease renewal)
4) DHCP lease for IP x.x.x.x expires
5) node-B gets IP x.x.x.x
6) node-B reports IP x.x.x.x to nailgun and it's saved to DB (here we have duplicate IPs in DB)
7) deployment task is started where node-A and node-B have the same x.x.x.x IP
8) provisioning fails with error:
#<XMLRPC::FaultException: <class 'cobbler.cexceptions.CX'>:'IP address duplicated: x.x.x.x'>
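The sequence above can be reproduced with a toy model. This is only a sketch under stated assumptions: `DhcpServer` is a hypothetical first-free allocator, not dnsmasq, the MAC addresses are made up, and time is counted in minutes. It shows how an expired lease lets two nodes end up with the same IP in the inventory, and how a lease longer than the gap between the two boots avoids that.

```python
from dataclasses import dataclass, field


@dataclass
class DhcpServer:
    """Toy first-free DHCP allocator with lease expiry (hypothetical, not dnsmasq)."""
    pool: list
    lease_time: int  # minutes
    leases: dict = field(default_factory=dict)  # ip -> (mac, expires_at)

    def request(self, mac, now):
        # Forget leases that have already expired at time `now`.
        self.leases = {ip: (m, exp) for ip, (m, exp) in self.leases.items() if exp > now}
        # Renew if this MAC still holds a lease.
        for ip, (holder, _) in self.leases.items():
            if holder == mac:
                self.leases[ip] = (mac, now + self.lease_time)
                return ip
        # Otherwise hand out the first free address in the pool.
        for ip in self.pool:
            if ip not in self.leases:
                self.leases[ip] = (mac, now + self.lease_time)
                return ip
        raise RuntimeError("pool exhausted")


def simulate(lease_time):
    """Replay steps 1-6 and return the inventory (node name -> reported IP)."""
    dhcp = DhcpServer(pool=["10.21.0.25", "10.21.0.26"], lease_time=lease_time)
    inventory = {}
    # Steps 1-2: node-A boots at t=0 and its IP is saved.
    inventory["node-A"] = dhcp.request("52:54:00:00:00:0a", now=0)
    # Steps 3-4: node-A goes silent and never renews its lease.
    # Steps 5-6: node-B boots at t=200 and its IP is saved as well.
    inventory["node-B"] = dhcp.request("52:54:00:00:00:0b", now=200)
    return inventory


def has_duplicates(inventory):
    ips = list(inventory.values())
    return len(ips) != len(set(ips))
```

With a 120-minute lease, `has_duplicates(simulate(120))` is true because node-A's lease has expired by the time node-B asks for an address; with a three-day lease, `simulate(3 * 24 * 60)`, node-B gets the next address instead.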

Simple solution:
Increase DHCP lease time up to several days (average time between bootstrap of first nodes and provisioning step)

Complex solution (new fuel feature):
Update MAC<-->IP pinning in Cobbler/DHCP as soon as node (mac, IP) is updated in the DB, not only on provision step as we do now. This way we won't have "split brain" information on MAC<-->IP mappings.
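
One way to picture the MAC<-->IP pinning idea is a helper that turns the (mac, IP) pairs known to the DB into static dnsmasq `dhcp-host` entries and refuses to emit a mapping that would pin one IP to two MACs. This is a rough sketch only: `render_dhcp_hosts` is a hypothetical name, and how Fuel or Cobbler would actually write and reload such entries is not covered here.

```python
def render_dhcp_hosts(nodes):
    """Render one dhcp-host line per (mac, ip) pair, refusing duplicate IPs.

    `nodes` is an iterable of (mac, ip) tuples, e.g. read from the inventory DB.
    """
    mac_by_ip = {}
    lines = []
    for mac, ip in nodes:
        if ip in mac_by_ip:
            if mac_by_ip[ip] == mac:
                continue  # same pair seen twice: nothing new to pin
            raise ValueError(
                f"IP address duplicated: {ip} ({mac_by_ip[ip]} and {mac})"
            )
        mac_by_ip[ip] = mac
        lines.append(f"dhcp-host={mac},{ip}")
    return lines
```

Failing at the moment the second node reports the address, rather than at the provisioning step, is the point of the design: the conflict surfaces while it is still cheap to resolve.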

Fix proposed to branch: master
Review: https://review.openstack.org/399127

Changed in fuel:
assignee: Georgy Kibardin (gkibardin) → Aleksandr Didenko (adidenko)
status: Confirmed → In Progress
Changed in fuel:
assignee: Aleksandr Didenko (adidenko) → Georgy Kibardin (gkibardin)

Change abandoned by Aleksandr Didenko (<email address hidden>) on branch: master
Review: https://review.openstack.org/399127
Reason: This was made for testing purposes

Changed in fuel:
status: In Progress → Confirmed
tags: added: release-notes

Related fix proposed to branch: master
Change author: Olena Logvinova <email address hidden>
Review: https://review.fuel-infra.org/29773

Reviewed: https://review.fuel-infra.org/29773
Submitter: Mariia Zlatkova <email address hidden>
Branch: master

Commit: b7d8b4438bcd9b6eb00a55b07ab4e002995d4b4a
Author: Olena Logvinova <email address hidden>
Date: Wed Jan 11 14:46:49 2017

[RN9.2] Known issues - Duplicated IPs during provisioning

Change-Id: I06dd2832c00eab0e67350a75df7d713a7bc87040
Related-Bug: #1630299

Sergey Galkin (sgalkin) wrote :

Reproduced on http://mirror.fuel-infra.org/mos-repos/centos/mos9.0-centos7/snapshots/proposed-2017-01-13-184421/x86_64

10 duplicated IPs on a 448-node cluster

As a result:
14:10:48 Node 'Untitled (89:60)' is back online
14:10:47 Node 'Untitled (8d:50)' has gone away
14:10:47 Node 'Untitled (89:60)' has gone away
14:10:47 Node 'Untitled (5e:e0)' has gone away
14:09:55 Node 'Untitled (ce:f0)' is back online
14:09:46 Node 'Untitled (9f:7c)' has gone away
14:09:46 Node 'Untitled (ca:30)' has gone away
14:08:46 Node 'Untitled (ce:f0)' has gone away
14:08:46 Node 'Untitled (aa:24)' has gone away
14:08:22 Node 'Untitled (7a:e0)' is back online
14:07:16 Node 'Untitled (7a:e0)' has gone away
14:06:46 Node 'Untitled (b5:30)' has gone away
14:06:32 Node 'Untitled (44:b0)' is back online

[root@fuel ~]# fuel nodes | awk -F '|' '{print $5}' | sort | uniq -c | grep -v ' 1 '
      2 10.21.0.25
      2 10.21.0.28
      2 10.21.0.32
      2 10.21.0.34
      2 10.21.0.37
      2 10.21.0.38
      2 10.21.0.39
      2 10.21.0.42
      2 10.21.0.43
      2 10.21.0.44
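
The same check can be sketched in Python for cases where the column position of the IP differs between outputs. The function below is an illustration, not part of the Fuel CLI: `find_duplicate_ips` is a hypothetical name, and it assumes that each data row starts with a numeric node id and that the first IPv4-looking cell in a row is the node's IP.

```python
import re
from collections import defaultdict

IPV4_RE = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}$")


def find_duplicate_ips(table_text):
    """Return {ip: [node ids]} for every IP that appears in more than one row."""
    ids_by_ip = defaultdict(list)
    for line in table_text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) < 2 or not cells[0].isdigit():
            continue  # skip headers, separators and blank lines
        for cell in cells[1:]:
            if IPV4_RE.match(cell):
                ids_by_ip[cell].append(int(cells[0]))
                break  # the first IPv4-looking cell is taken as the node IP
    return {ip: ids for ip, ids in ids_by_ip.items() if len(ids) > 1}


# The two rows quoted in the bug description.
SAMPLE = """\
| 1078 | Untitled (d1:28) | discover | ubuntu | [] | 10.21.8.171 | 52:54:00:6c:d1:28 | None | Standard PC (i440FX + PIIX, 1996) | False |
| 39 | Untitled (39:04) | discover | ubuntu | [] | 10.21.8.171 | 52:54:00:2a:39:04 | None | Standard PC (i440FX + PIIX, 1996) | True |
"""

print(find_duplicate_ips(SAMPLE))  # {'10.21.8.171': [1078, 39]}
```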

After detailed research of the situation, it turned out that some nodes in the lab had got their addresses from another DHCP server or during the previous test cycle, after which the Fuel master node was re-installed and all DHCP leases were removed. Once the new test cycle started, nodes got duplicate IP addresses.

After careful removal of all DHCP leases and disabling all unnecessary nodes, everything went well. Setting the bug to Invalid status.

Sergey Galkin (sgalkin) wrote :

I have a repro with one Fuel. Steps to reproduce:
1. Deploy a cluster with 400 nodes.
2. The deployment should fail (for example, with one node offline).
3. Remove 100 nodes from the cluster and start the deployment again.
4. The deployment should fail again (for example, with one node offline).
5. Reset the cluster.

After the reset we have a lot of duplicate IPs, 33 in my case.

Georgy Kibardin (gkibardin) wrote :
Dmitry Sutyagin (dsutyagin) wrote :

IMHO, a proper fix requires obtaining control over the lease process. This might be possible with dnsmasq using the --dhcp-script option together with --leasefile-ro (maybe), and writing our own lease management, over which we will have full control, preventing any IP duplication. Alternatively, we need to switch to a different DHCP server which provides an API/CLI to control leases. Then we can set static leases for all nodes except "discover" ones, and remove them on reset / deletion of a node. I'll try to look into dnsmasq, but it may be wiser in the long term to switch to something like ISC DHCP server (dhcpd).

Georgy Kibardin (gkibardin) wrote :

I think it is better to start with the second option, i.e. replace dnsmasq with ISC. Dnsmasq was not initially designed for such a load.

Roman Rufanov (rrufanov) wrote :

Switching to dhcpd is not feasible with the timeline we have; please consider other options.

Georgy Kibardin (gkibardin) wrote :

Looking at the dnsmasq codebase, I suspect that finding the race condition which leads to IP duplication may take even longer.

tags: added: release-notes-done
removed: release-notes
Dmitry Sutyagin (dsutyagin) wrote :

I've looked into the dnsmasq lease control options and it looks like there aren't any. dnsmasq can have an external database (with --dhcp-script and --leasefile-ro), BUT it will still maintain its in-memory copy and hand out new IPs based on that. This means that we cannot dynamically pre-allocate IPs outside dnsmasq without restarting it: it will not pick up the new values automatically, which makes the whole process unreliable (there will be a race between the IP update in our code and dnsmasq giving out new IPs before it is restarted; also, frequent dnsmasq restarts may cause DHCP renew issues). So far I only see the path of switching to a different DHCP server plus writing IP management code to control it.

Quick in-place fix:
Edit the file `/etc/dnsmasq.d/default.conf`:
`dhcp-range=default,192.168.2.110,192.168.2.210,255.255.0.0,120m`
where 120m is the default lease time. After the change, restart dnsmasq:
`systemctl restart dnsmasq.service`
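
For scripted edits, the change to the `dhcp-range` line could be sketched as below. This is an assumption-laden illustration: `set_lease_time` is a hypothetical helper that only transforms text (reading and writing the file is left to the caller), it treats a trailing field of the form `<digits>`, `<digits>m`, `<digits>h` or `infinite` as the lease time, and `72h` is just an example value for "several days".

```python
import re

# A trailing field that looks like a dnsmasq lease time: seconds, minutes,
# hours, or the literal "infinite".
LEASE_RE = re.compile(r"^(\d+[mh]?|infinite)$")


def set_lease_time(conf_text, new_lease):
    """Set the lease time on every dhcp-range line of a dnsmasq config text."""
    out = []
    for line in conf_text.splitlines():
        if line.startswith("dhcp-range="):
            key, _, value = line.partition("=")
            fields = value.split(",")
            if LEASE_RE.match(fields[-1]):
                fields[-1] = new_lease    # existing lease time: overwrite it
            else:
                fields.append(new_lease)  # no lease time given: append one
            line = key + "=" + ",".join(fields)
        out.append(line)
    result = "\n".join(out)
    # Preserve a trailing newline if the input had one.
    return result + "\n" if conf_text.endswith("\n") else result


print(set_lease_time(
    "dhcp-range=default,192.168.2.110,192.168.2.210,255.255.0.0,120m", "72h"
))
# dhcp-range=default,192.168.2.110,192.168.2.210,255.255.0.0,72h
```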

The preferred fix, however, is in fuel-library. On the Fuel master node, edit the file:
/etc/puppet/mitaka-9.0/modules/fuel/manifests/dnsmasq/dhcp_range.pp
(at the line `$lease_time = '120m'`; read more in [1])
and re-run puppet apply with the command:
`/etc/puppet/modules/fuel/examples/deploy.sh` (read more in [3])

Otherwise, your change will be overwritten by puppet when an update is applied or puppet is re-run.

After you are done with the dnsmasq config, reboot all discovered nodes
(or restart dhclient on each node).

# See in code
[1] https://github.com/openstack/fuel-library/blob/211e873bdd2e35313043b122990b2793d6249714/deployment/puppet/fuel/manifests/dnsmasq/dhcp_range.pp#L19
[2] https://github.com/openstack/fuel-library/blob/211e873bdd2e35313043b122990b2793d6249714/deployment/puppet/fuel/templates/dnsmasq.dhcp-range.erb#L3
[3] https://github.com/openstack/fuel-main/blob/master/iso/bootstrap_admin_node.sh#L534

Related fix proposed to branch: stable/9.2
Change author: Olena Logvinova <email address hidden>
Review: https://review.fuel-infra.org/31621

Reviewed: https://review.fuel-infra.org/31621
Submitter: Olga Gusarenko <email address hidden>
Branch: stable/9.2

Commit: 2bca7db458180dd4e5e0dd870c5f9aaea4e53611
Author: Olena Logvinova <email address hidden>
Date: Tue Mar 7 15:41:45 2017

[RN 9.2] Known issues - Add workaround steps for 1630299

This patch adds detailed steps for the workaround of bug
https://bugs.launchpad.net/fuel/+bug/1630299

Change-Id: Iec8147ceb238477316ef38f3cea0d08ff439923f
Related-Bug: #1630299
