Fuel for OpenStack

Duplicate IPs during deploy on 1000+ nodes env

Bug #1630299 reported by Alexander Gordeev on 2016-10-04

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Fuel for OpenStack	In Progress	High	Vladimir Kozhukalov	Fuel for OpenStack 11.0
Mitaka	Won't Fix	High	Vladimir Kozhukalov	Fuel for OpenStack 9.x-updates
Newton	In Progress	High	Vladimir Kozhukalov	Fuel for OpenStack 10.1
Ocata	In Progress	High	Vladimir Kozhukalov	Fuel for OpenStack 11.0

Bug Description

This bug is a follow-up to https://bugs.launchpad.net/fuel/+bug/1378000 but for a newer release.

Detailed bug description:
Still getting duplicated IPs for a couple of nodes. Fuel 9.0 (9.1)

dnsmasq.conf contains the dhcp-sequential-ip option. Perhaps it doesn't work with more than 1000 nodes.

There are more than 200 GB of logs, so I can't attach all of them. The env is still online, but we need to react ASAP; I don't know how long it will be kept.

Stanislaw Bogatkin (sbogatkin) on 2016-10-05

Changed in fuel:
milestone:	none → 9.2
importance:	Undecided → High
status:	New → Confirmed

Łukasz Oleś (loles) wrote on 2016-10-10:

This usually happens when you delete an old environment and some nodes are not rebooted. The nodes that were not rebooted still have the old node configuration.
A rebooted node can then get the same IP as a node from the old env.

To make sure that is not the case here, please reboot the nodes that share an IP.

Oleksiy Molchanov (omolchanov) on 2016-10-28

Changed in fuel:
assignee:	Fuel Sustaining (fuel-sustaining-team) → Alexander Gordeev (a-gordeev)

Dmitry Pyzhov (dpyzhov) on 2016-11-01

Changed in fuel:
assignee:	Alexander Gordeev (a-gordeev) → Fuel Sustaining (fuel-sustaining-team)

Dmitry Pyzhov (dpyzhov) on 2016-11-17

Changed in fuel:
assignee:	Fuel Sustaining (fuel-sustaining-team) → Georgy Kibardin (gkibardin)

Aleksandr Didenko (adidenko) wrote on 2016-11-17:

Problem description:

1) node-A gets IP x.x.x.x
2) node-A reports IP x.x.x.x to nailgun and it's saved to DB
3) node-A goes offline (or DHCPREQUEST/DHCPACK packets are lost during IP lease renewal)
4) DHCP lease for IP x.x.x.x expires
5) node-B gets IP x.x.x.x
6) node-B reports IP x.x.x.x to nailgun and it's saved to DB (here we have duplicate IPs in DB)
7) deployment task is started where node-A and node-B have the same x.x.x.x IP
8) provisioning fails with error:
#<XMLRPC::FaultException: <class 'cobbler.cexceptions.CX'>:'IP address duplicated: x.x.x.x'>
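
The sequence above can be replayed with a toy model. This is only a sketch under stated assumptions: `ToyDhcpServer`, `find_duplicates`, `simulate`, the addresses, and the 7200-second lease are all hypothetical stand-ins, not dnsmasq or nailgun code. It shows how a DB that only records the last reported IP per node ends up with two nodes on one address once a lease expires.

```python
# Toy model of the race described above; every name and value is hypothetical.

class ToyDhcpServer:
    """Hands out the first free address and forgets leases once they expire."""

    def __init__(self, pool, lease_seconds):
        self.pool = list(pool)
        self.lease_seconds = lease_seconds
        self.leases = {}  # ip -> (mac, expiry time)

    def request(self, mac, now):
        # Drop expired leases first.
        self.leases = {
            ip: (m, exp) for ip, (m, exp) in self.leases.items() if exp > now
        }
        # Renew the existing lease for this MAC, if any.
        for ip, (holder, _) in self.leases.items():
            if holder == mac:
                self.leases[ip] = (mac, now + self.lease_seconds)
                return ip
        # Otherwise hand out the first free address in the pool.
        for ip in self.pool:
            if ip not in self.leases:
                self.leases[ip] = (mac, now + self.lease_seconds)
                return ip
        raise RuntimeError("pool exhausted")


def find_duplicates(node_db):
    """Return {ip: [macs]} for every IP recorded for more than one node."""
    by_ip = {}
    for mac, ip in node_db.items():
        by_ip.setdefault(ip, []).append(mac)
    return {ip: sorted(macs) for ip, macs in by_ip.items() if len(macs) > 1}


def simulate():
    dhcp = ToyDhcpServer(["10.20.0.3", "10.20.0.4"], lease_seconds=7200)
    node_db = {}  # stands in for the nailgun DB: mac -> last reported IP

    # Steps 1-2: node-A gets an address and reports it.
    node_db["mac-a"] = dhcp.request("mac-a", now=0)
    # Steps 3-4: node-A goes silent and its lease expires.
    # Steps 5-6: node-B boots later and receives the same address.
    node_db["mac-b"] = dhcp.request("mac-b", now=7201)
    return find_duplicates(node_db)
```

In this model the stale record for node-A is never invalidated, which is the "split brain" between the DHCP server and the DB that the report describes.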

Simple solution:
Increase DHCP lease time up to several days (average time between bootstrap of first nodes and provisioning step)

Complex solution (new fuel feature):
Update MAC<-->IP pinning in Cobbler/DHCP as soon as node (mac, IP) is updated in the DB, not only on provision step as we do now. This way we won't have "split brain" information on MAC<-->IP mappings.
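
The complex solution above amounts to regenerating static MAC-to-IP pins whenever the DB changes. A minimal sketch, assuming the pins are expressed as dnsmasq `dhcp-host=<mac>,<ip>` lines; `render_dhcp_hosts` and the `node_db` mapping are hypothetical names, and how Fuel or Cobbler would actually write and reload such a file is not covered here.

```python
def render_dhcp_hosts(node_db):
    """Render one dnsmasq dhcp-host line per (mac, ip) pair.

    node_db is a hypothetical mapping of MAC address -> IP address.
    Raises ValueError instead of emitting two pins for the same IP.
    """
    seen = {}  # ip -> mac that already claimed it
    lines = []
    for mac, ip in sorted(node_db.items()):
        if ip in seen:
            raise ValueError(f"IP {ip} recorded for both {seen[ip]} and {mac}")
        seen[ip] = mac
        lines.append(f"dhcp-host={mac},{ip}\n")
    return "".join(lines)
```

Refusing to render on a duplicate surfaces the conflict when the DB is updated, rather than later at the provisioning step.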

OpenStack Infra (hudson-openstack) wrote on 2016-11-17: Fix proposed to fuel-astute (master)

Fix proposed to branch: master
Review: https://review.openstack.org/399127

Changed in fuel:
assignee:	Georgy Kibardin (gkibardin) → Aleksandr Didenko (adidenko)
status:	Confirmed → In Progress

Aleksandr Didenko (adidenko) on 2016-11-17

Changed in fuel:
assignee:	Aleksandr Didenko (adidenko) → Georgy Kibardin (gkibardin)

OpenStack Infra (hudson-openstack) wrote on 2016-11-18: Change abandoned on fuel-astute (master)

Change abandoned by Aleksandr Didenko (<email address hidden>) on branch: master
Review: https://review.openstack.org/399127
Reason: This was made for testing purposes

Georgy Kibardin (gkibardin) on 2016-12-02

Changed in fuel:
status:	In Progress → Confirmed

Alexey Shtokolov (ashtokolov) on 2016-12-21

tags:

added: release-notes

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2017-01-11: Related fix proposed to mos/mos-docs (master)

Related fix proposed to branch: master
Change author: Olena Logvinova <email address hidden>
Review: https://review.fuel-infra.org/29773

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2017-01-11: Related fix merged to mos/mos-docs (master)

Reviewed: https://review.fuel-infra.org/29773
Submitter: Mariia Zlatkova <email address hidden>
Branch: master

Commit: b7d8b4438bcd9b6eb00a55b07ab4e002995d4b4a
Author: Olena Logvinova <email address hidden>
Date: Wed Jan 11 14:46:49 2017

[RN9.2] Known issues - Duplicated IPs during provisioning

Change-Id: I06dd2832c00eab0e67350a75df7d713a7bc87040
Related-Bug: #1630299

Sergey Galkin (sgalkin) wrote on 2017-01-16:

Reproduced on http://mirror.fuel-infra.org/mos-repos/centos/mos9.0-centos7/snapshots/proposed-2017-01-13-184421/x86_64

10 duplicated IPs on a 448-node cluster

As a result:
14:10:48 Node 'Untitled (89:60)' is back online
14:10:47 Node 'Untitled (8d:50)' has gone away
14:10:47 Node 'Untitled (89:60)' has gone away
14:10:47 Node 'Untitled (5e:e0)' has gone away
14:09:55 Node 'Untitled (ce:f0)' is back online
14:09:46 Node 'Untitled (9f:7c)' has gone away
14:09:46 Node 'Untitled (ca:30)' has gone away
14:08:46 Node 'Untitled (ce:f0)' has gone away
14:08:46 Node 'Untitled (aa:24)' has gone away
14:08:22 Node 'Untitled (7a:e0)' is back online
14:07:16 Node 'Untitled (7a:e0)' has gone away
14:06:46 Node 'Untitled (b5:30)' has gone away
14:06:32 Node 'Untitled (44:b0)' is back online

[root@fuel ~]# fuel nodes | awk -F '|' '{print $5}' | sort | uniq -c | grep -v ' 1 '
      2 10.21.0.25
      2 10.21.0.28
      2 10.21.0.32
      2 10.21.0.34
      2 10.21.0.37
      2 10.21.0.38
      2 10.21.0.39
      2 10.21.0.42
      2 10.21.0.43
      2 10.21.0.44
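
The same duplicate check can be sketched in Python. This assumes only what the awk command above implies: the output is pipe-separated and the IP sits in the fifth field. `duplicated_ips` is a hypothetical helper name; rows whose fifth field is not a valid IP address (headers, separators) are skipped.

```python
import ipaddress
from collections import Counter


def duplicated_ips(table_text, ip_column=4):
    """Count IPs in a pipe-separated table and return those seen more than once.

    ip_column is zero-based, so 4 matches awk's $5 in the command above.
    """
    counts = Counter()
    for line in table_text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) <= ip_column:
            continue
        try:
            ipaddress.ip_address(fields[ip_column])
        except ValueError:
            continue  # header, separator, or empty cell
        counts[fields[ip_column]] += 1
    return {ip: n for ip, n in counts.items() if n > 1}
```

Feeding it the captured text of `fuel nodes` should yield the same pairs as the `uniq -c` listing, as a dictionary of IP to occurrence count.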

Vladimir Kozhukalov (kozhukalov) wrote on 2017-01-17:

After detailed investigation it turned out that some nodes in the lab had obtained addresses from another DHCP server, or during the previous test cycle, after which the Fuel master node was re-installed and all DHCP leases were removed. Once the new test cycle started, nodes got duplicate IP addresses.

After carefully removing all DHCP leases and disabling all unnecessary nodes, everything went fine. Setting the bug to Invalid status.

Sergey Galkin (sgalkin) wrote on 2017-01-23:

I have a repro with one Fuel. Steps to reproduce:
1. Deploy a cluster with 400 nodes.
2. The deployment should fail (for example, with one node offline).
3. Remove 100 nodes from the cluster and start the deployment again.
4. The deployment should fail (for example, with one node offline).
5. Reset the cluster.

After the reset we have a lot of duplicate IPs, 33 in my case.

Georgy Kibardin (gkibardin) wrote on 2017-01-23:

#10

bug_dnsmasq_duplicates.tgz (3.8 KiB, application/x-tar)

Dmitry Sutyagin (dsutyagin) wrote on 2017-01-25:

#11

IMHO, a proper fix requires obtaining control over the lease process. That might be possible with dnsmasq using the --dhcp-script option together with --leasefile-ro (maybe), and writing our own lease management, over which we would have full control, preventing any IP duplication. Alternatively, we need to switch to a different DHCP server which provides an API/CLI to control leases. Then we can set static leases for all nodes except "discover" ones, and remove them on reset or deletion of a node. I'll try to look into dnsmasq, but it may be wiser in the long term to switch to something like ISC DHCP server (dhcpd).

Georgy Kibardin (gkibardin) wrote on 2017-01-25:

#12

I think it is better to start with the second option, i.e. replace dnsmasq with ISC. Dnsmasq was not initially designed for such a load.

Roman Rufanov (rrufanov) wrote on 2017-01-25:

#13

Switching to dhcpd does not fit the timeline we have; please consider other options.

Georgy Kibardin (gkibardin) wrote on 2017-01-26:

#14

Looking at the dnsmasq codebase, I suspect that finding the race condition which leads to IP duplication may take even longer.

Maria Zlatkova (mzlatkova) on 2017-01-26

tags:

added: release-notes-done
removed: release-notes

Dmitry Sutyagin (dsutyagin) wrote on 2017-01-30:

#15

I've looked into dnsmasq's lease control options and it looks like there aren't any. dnsmasq can use an external database (with --dhcp-script and --leasefile-ro), BUT it will still maintain its in-memory copy and hand out new IPs based on that. This means that we cannot dynamically pre-allocate IPs outside dnsmasq without restarting it: it will not pick up the new values automatically, which makes the whole process unreliable (there will be a race between the IP update in our code and dnsmasq giving out new IPs before it is restarted; frequent dnsmasq restarts may also cause DHCP renew issues). So far I only see the path of switching to a different DHCP server plus writing IP management code to control it.
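
Even though the hook cannot steer allocation, it can still observe lease events. A minimal sketch, assuming dnsmasq's documented --dhcp-script calling convention (an action of `add`, `old`, or `del`, followed by the MAC and the IP); `LeaseHistory` is a hypothetical class, and what the caller does with a reported stale MAC (for example, invalidating that node's IP in the DB) is left open.

```python
class LeaseHistory:
    """Remembers the last MAC seen on each IP, even after the lease is deleted."""

    def __init__(self):
        self.last_holder = {}  # ip -> mac

    def handle(self, action, mac, ip):
        """Apply one lease event.

        Returns the MAC whose recorded IP has just become stale, or None.
        """
        if action not in ("add", "old"):
            # "del" is ignored on purpose: the history must outlive the lease
            # so that a later reassignment of the IP can be noticed.
            return None
        previous = self.last_holder.get(ip)
        self.last_holder[ip] = mac
        if previous is not None and previous != mac:
            return previous
        return None
```

This only detects that an address moved to a different node; it does not prevent dnsmasq from handing it out, which is the limitation described above.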

Dmitry Pyzhov (dpyzhov) wrote on 2017-02-28:

#16

The bug is being fixed in blueprint https://blueprints.launchpad.net/fuel/+spec/get-rid-cobbler-dnsmasq

code is here: https://review.openstack.org/#/q/topic:bp/get-rid-cobbler-dnsmasq

Aleksey Zvyagintsev (azvyagintsev) wrote on 2017-03-07:

#17

Quick in-place fix:
Edit the file `/etc/dnsmasq.d/default.conf`:
`dhcp-range=default,192.168.2.110,192.168.2.210,255.255.0.0,120m`
where 120m is the default lease time. After the change you need to restart dnsmasq:
`systemctl restart dnsmasq.service`

But the preferred way is to fix it in fuel-library:
on the Fuel master, edit the file:
/etc/puppet/mitaka-9.0/modules/fuel/manifests/dnsmasq/dhcp_range.pp
(at the line `$lease_time = '120m'`)
(Read more in [1])
And re-run puppet apply with the command:
`/etc/puppet/modules/fuel/examples/deploy.sh` (read more in [3])

Otherwise, when an update is applied or puppet is re-run, your change will be overwritten by puppet.

After you are done with the dnsmasq config, you need to reboot all discovered nodes.
(or restart dhclient on each node)
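
The in-place edit of the `dhcp-range` line can be sketched as a small text transformation. This assumes the lease time is the last comma-separated field, as in the line quoted above; `set_lease_time` is a hypothetical helper, and `72h` is only an example value. Reading and writing the file, and restarting dnsmasq, are left to the operator.

```python
def set_lease_time(conf_text, lease_time):
    """Return conf_text with the lease time of every dhcp-range line replaced.

    Assumes the lease time is the last comma-separated field of the line.
    """
    out = []
    for line in conf_text.splitlines():
        if line.startswith("dhcp-range="):
            key, _, value = line.partition("=")
            fields = value.split(",")
            fields[-1] = lease_time
            line = key + "=" + ",".join(fields)
        out.append(line + "\n")
    return "".join(out)
```

For example, `set_lease_time(text, "72h")` turns the trailing `120m` into `72h` and leaves all other lines unchanged.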

# See in code
[1] https://github.com/openstack/fuel-library/blob/211e873bdd2e35313043b122990b2793d6249714/deployment/puppet/fuel/manifests/dnsmasq/dhcp_range.pp#L19
[2] https://github.com/openstack/fuel-library/blob/211e873bdd2e35313043b122990b2793d6249714/deployment/puppet/fuel/templates/dnsmasq.dhcp-range.erb#L3
[3] https://github.com/openstack/fuel-main/blob/master/iso/bootstrap_admin_node.sh#L534

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2017-03-07: Related fix proposed to mos/mos-docs (stable/9.2)

#18

Related fix proposed to branch: stable/9.2
Change author: Olena Logvinova <email address hidden>
Review: https://review.fuel-infra.org/31621

Fuel Devops McRobotson (fuel-devops-robot) wrote on 2017-03-09: Related fix merged to mos/mos-docs (stable/9.2)

#19

Reviewed: https://review.fuel-infra.org/31621
Submitter: Olga Gusarenko <email address hidden>
Branch: stable/9.2

Commit: 2bca7db458180dd4e5e0dd870c5f9aaea4e53611
Author: Olena Logvinova <email address hidden>
Date: Tue Mar 7 15:41:45 2017

[RN 9.2] Known issues - Add workaround steps for 1630299

This patch adds detailed steps for the workaround of bug
https://bugs.launchpad.net/fuel/+bug/1630299

Change-Id: Iec8147ceb238477316ef38f3cea0d08ff439923f
Related-Bug: #1630299

Related blueprints

Rework dhcp/dns/tftp related stuff