bootstrap/centos: dhcp client does not handle IP address conflicts

Bug #1492307 reported by Leontii Istomin
This bug affects 5 people
Affects            | Status       | Importance | Assigned to       | Milestone
Fuel for OpenStack | Fix Released | High       | Alexei Sheplyakov |
7.0.x              | Won't Fix    | High       | Alexei Sheplyakov |

Bug Description

5 out of 203 bootstrapped nodes can't get an IP address via DHCP.
Screenshots with debug information are attached.
If I run dhclient manually on the node, it works fine.

There were some bootstrap nodes booted by the previous installation of
the master node (perhaps they failed to reboot). The DHCP server on
the newly deployed master node is not aware of those nodes, so
the IP addresses it leases to the new nodes collide with the ones held by
those stale nodes. Apparently CentOS' network configuration scripts
are unable to handle such a conflict without special configuration.

dhclient shipped with CentOS bails out (after sending DHCPDECLINE) if
a duplicate address has been detected and the `-1' ("one shot") flag has
been given. This arguably violates RFC 2131, which requires
the client to restart the configuration process. On the other
hand, retrying the configuration defeats the purpose of the `-1' switch.
CentOS developers chose to violate the RFC to preserve the "one shot"
behavior. As a result, network-scripts are unable to handle an IP
address conflict and fail to bring the interfaces up during
the OS boot sequence.

api: '1.0'
astute_sha: ad6d59812b775bc12e7bd7aec8f81374595ffa63
auth_required: true
build_id: '268'
build_number: '268'
feature_groups:
- mirantis
fuel-agent_sha: 082a47bf014002e515001be05f99040437281a2d
fuel-library_sha: f3780484874f5f4a1831714710ff552f33522915
fuel-nailgun-agent_sha: d7027952870a35db8dc52f185bb1158cdd3d1ebd
fuel-ostf_sha: 582a81ccaa1e439a3aec4b8b8f6994735de840f4
fuelmain_sha: 9ab01caf960013dc882825dc9b0e11ccf0b81cb0
nailgun_sha: f882c428db97ee1eb93a4871f9d5857c5a7771b2
openstack_version: 2015.1.0-7.0
production: docker
python-fuelclient_sha: 9643fa07f1290071511066804f962f62fe27b512
release: '7.0'

Diagnostic Snapshot is here: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-09-04_14-13-11.tar.xz

Leontii Istomin (listomin) wrote :

/var/log from the node

description: updated
Leontii Istomin (listomin) wrote :

Please pay attention to this screenshot. The interface probably wasn't ready when dhclient tried to get an IP address.

Alexei Sheplyakov (asheplyakov) wrote :

init does not wait for all NICs to initialize before starting /etc/init.d/network.
Neither does the setup-bootdev [1] script, which is supposed to create a network configuration in which
only the boot interface is configured.

[1] https://github.com/stackforge/fuel-main/blob/master/bootstrap/sync/etc/init.d/setup-bootdev

Georgy Okrokvertskhov (gokrokvertskhov) wrote :

This issue has a simple workaround: a server which did not get an IP address should be rebooted. Since we are talking about initial cloud set-up, it is reasonable to ask operators to restart any nodes that did not appear in the discovery list.
This issue should be properly documented in the known issues.

Aleksey Zvyagintsev (azvyagintsev) wrote :

@Alexey
Is this issue in the setup-bootdev [1] script related to the following situation?
1) Boot the node into bootstrap
2) eth0 is the admin interface
3) Reboot the node
4) The node boots into bootstrap again
5) eth4 is now the admin interface
Is the problem interface flapping?
(I suppose so, but please correct me if I am mistaken.)

[1] https://github.com/stackforge/fuel-main/blob/master/bootstrap/sync/etc/init.d/setup-bootdev

We encountered the flapping issue on Dell 630 instances (ixgbe).

Changed in fuel:
importance: Undecided → High
milestone: none → 7.0
Leontii Istomin (listomin) wrote :

I've reproduced the issue with the 7.0-182 build.
I can't reproduce the issue with the 7.0-98 build.

Leontii Istomin (listomin) wrote :

Reproduced the issue with the 7.0-98 build.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-main (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/220728

Aleksey Zvyagintsev (azvyagintsev) wrote : Re: 5 from 203 bootstraped nodes can't get ip address via dhcp

Reproduced the issue with the 7.0-286 build on customer hardware.
Raising importance.

Changed in fuel:
importance: High → Critical
Aleksey Zvyagintsev (azvyagintsev) wrote :

Reproduced (intermittently, as usual) with this flow:
1) Boot the node into bootstrap
2) eth4 is the admin interface
3) Deploy the environment
4) Reset the environment
5) The node boots into bootstrap again
6) eth0 is now the admin interface

Also, we may have hit another bug: the Nailgun UI keeps reporting eth4 as the node's admin interface (but I think this is related to some timeout in the information flow between nailgun-agent and the Nailgun UI/DB).

Leontii Istomin (listomin) wrote :

1. Grab linux and initramfs.img from the http://jenkins-product.srt.mirantis.net:8080/view/custom_iso/job/custom_7.0_iso/1170 ISO
2. Put them into /var/www/nailgun/bootstrap/
3. chmod 755 those files
4. dockerctl shell cobbler cobbler sync
5. dockerctl shell cobbler service dnsmasq restart

reproduced the issue (logs from the node are attached)

Leontii Istomin (listomin) wrote :

screenshot of node's IPMI is attached

Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
Igor Marnat (imarnat)
summary: - 5 from 203 bootstraped nodes can't get ip address via dhcp
+ 5 out of 203 bootstraped nodes can't get ip address via dhcp
Eugene Bogdanov (ebogdanov) wrote : Re: 5 out of 203 bootstraped nodes can't get ip address via dhcp

A DHCP issue cannot be treated as critical. Decreasing priority to high. The issue with interface naming must be filed as a separate bug.

Alexei Sheplyakov (asheplyakov) wrote :

Actually this is a configuration issue: the network configuration generated by bootstrap is wrong/error prone.

summary: - 5 out of 203 bootstraped nodes can't get ip address via dhcp
+ bootstrap/centos: dhcp client does not retry to obtain an IP
summary: - bootstrap/centos: dhcp client does not retry to obtain an IP
+ bootstrap/centos: dhcp client does not handle IP address conflicts
description: updated
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-main (master)

Fix proposed to branch: master
Review: https://review.openstack.org/221315

Changed in fuel:
status: Confirmed → In Progress
Igor Marnat (imarnat) wrote :

Marked as Won't Fix for 7.0: the risk of regression is too high and the impact is too low. Will fix in later releases.

Dmitry Pyzhov (dpyzhov) wrote :

I'm not sure if Alexey is an appropriate assignee. But 'fuel' user is definitely not appropriate.

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-main (master)

Reviewed: https://review.openstack.org/221315
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=ace6a4a56074299fd10f92217c09306b3af2714d
Submitter: Jenkins
Branch: master

commit ace6a4a56074299fd10f92217c09306b3af2714d
Author: Alexey Sheplyakov <email address hidden>
Date: Tue Sep 8 17:06:47 2015 +0300

    bootstrap: force dhclient to restart configuration on failures

    dhclient shipped with CentOS bails out (after sending DHCPDECLINE) if
    a duplicate address has been detected and `-1' ("one shot") flag has
    been given. This kind of violates the RFC 2131 which prescribes
    the client to restart the configuration process. On the other hand
    retrying the configuration denies the `-1' switch purpose. CentOS'
    developers choose to violate the RFC to obey the "one shot" behavior.
    As a result network-scripts are unable to handle IP addresses conflict
    and fail to bring the interfaces up during the OS boot sequence.

    Suppose there are some bootstrap nodes booted by a different master
    node (say, by a previously deployed one) in the admin/pxe network.
    The DHCP server on the current (the newly deployed) master node is
    not aware of those nodes, thus the IP addresses leased to the nodes
    collide with the ones of the (stale) nodes booted by a different
    master node. As a result some nodes will fail to boot.

    In order to solve the problem specify the PERSISTENT_DHCLIENT flag
    in the ifcfg-${boot_iface} config file, so dhclient will retry
    the configuration.

    Closes-Bug: #1492307
    Change-Id: Ica89fc05d4f936feb3066d27a8fb9a24bc36b55c
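
For reference, here is a minimal sketch of the kind of configuration the commit message above describes. On CentOS, the network-scripts pass `-1' to dhclient unless PERSISTENT_DHCLIENT is set in the interface's ifcfg file. The helper name write_boot_ifcfg, its optional directory argument, and the other ifcfg keys shown are assumptions made for illustration; this is not the code from the merged change.

```shell
# Sketch only: write_boot_ifcfg is a hypothetical helper, not the merged fix.
# $1: boot interface name (e.g. eth0)
# $2: target directory (assumed default: /etc/sysconfig/network-scripts)
write_boot_ifcfg() {
    ifcfg_iface="$1"
    ifcfg_dir="${2:-/etc/sysconfig/network-scripts}"
    # PERSISTENT_DHCLIENT=yes makes the network scripts start dhclient
    # without the one-shot flag, so it keeps retrying after a DHCPDECLINE.
    cat > "${ifcfg_dir}/ifcfg-${ifcfg_iface}" <<EOF
DEVICE=${ifcfg_iface}
BOOTPROTO=dhcp
ONBOOT=yes
PERSISTENT_DHCLIENT=yes
EOF
}
```

Calling `write_boot_ifcfg eth0` would write ifcfg-eth0 into the default directory; passing a second argument writes the file elsewhere, which is convenient for trying the sketch without touching the live configuration.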

Changed in fuel:
status: In Progress → Fix Committed
Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/8.0.x
Changed in fuel:
milestone: 7.0 → 8.0
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-main (master)

Reviewed: https://review.openstack.org/220728
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=29e6f272ca843ee215c68e445ce41886a253a1a1
Submitter: Jenkins
Branch: master

commit 29e6f272ca843ee215c68e445ce41886a253a1a1
Author: Alexey Sheplyakov <email address hidden>
Date: Sat Sep 5 12:57:31 2015 +0300

    bootstrap/centos: try to detect boot NIC harder

    A network hardware might take a while to initialize (due to loading
    firmware, STP, hardware VLAN filtering initialization, etc). Therefore
    setup-bootdev should retry to find out the boot interface name instead
    of giving up immediately, chances are that the hardware hasn't been
    initialized yet.

    Related-Bug: #1492307
    Change-Id: I5d6274fc3d632591631780d1e5824929632a78b6
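
The retry idea from the commit message above can be sketched roughly as follows. Matching the boot interface by MAC address under /sys/class/net, the function names, and the retry count and delay arguments are all assumptions made for illustration; the actual setup-bootdev script may work differently.

```shell
# Sketch only: find_iface_by_mac and wait_for_iface are hypothetical helpers.

# Print the name of the interface whose MAC address equals $1, searching
# the sysfs-style directory $2 (assumed default: /sys/class/net).
find_iface_by_mac() {
    fi_mac="$1"
    fi_sysfs="${2:-/sys/class/net}"
    for fi_dev in "$fi_sysfs"/*; do
        # Skip entries that do not expose a readable address file.
        [ -r "$fi_dev/address" ] || continue
        if [ "$(cat "$fi_dev/address")" = "$fi_mac" ]; then
            basename "$fi_dev"
            return 0
        fi
    done
    return 1
}

# Retry the lookup up to $2 times, sleeping $3 seconds after each failed
# attempt, so that a NIC which is still initializing gets a chance to appear.
wait_for_iface() {
    wi_mac="$1"
    wi_retries="$2"
    wi_delay="$3"
    wi_sysfs="${4:-/sys/class/net}"
    wi_attempt=0
    while [ "$wi_attempt" -lt "$wi_retries" ]; do
        if find_iface_by_mac "$wi_mac" "$wi_sysfs"; then
            return 0
        fi
        wi_attempt=$((wi_attempt + 1))
        sleep "$wi_delay"
    done
    return 1
}
```

For example, `wait_for_iface "$boot_mac" 10 1` would poll for up to about ten seconds and print the interface name once it shows up, returning non-zero if it never does.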

Dmitry Pyzhov (dpyzhov)
tags: added: mos-linux
Dmitry Pyzhov (dpyzhov)
tags: added: area-mos
Timur Nurlygayanov (tnurlygayanov) wrote :

It looks like the issue was fixed: it is hard to reproduce even on an environment without the fix, and it does not reproduce on MOS 8.0 RC1.

Changed in fuel:
status: Fix Committed → Fix Released