Inspection should wait for IP addresses to appear on nodes

Bug #1564954 reported by Dmitry Tantsur
Affects              Status        Importance  Assigned to     Milestone
ironic-python-agent  Fix Released  Medium      Dmitry Tantsur  -

Bug Description

While it's not a problem for the CoreOS image, the DIB build uses a pretty hacky approach to DHCP'ing. Sometimes this results in inspection reporting no IP addresses for any NICs, when in fact IPA simply ran before DHCP had finished. See e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1322892. This results in failed inspection.

Dmitry Tantsur (divius)
Changed in ironic-python-agent:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic-python-agent (master)

Fix proposed to branch: master
Review: https://review.openstack.org/300548

Changed in ironic-python-agent:
assignee: Dmitry Tantsur (divius) → Jim Rollenhagen (jim-rollenhagen)
Changed in ironic-python-agent:
assignee: Jim Rollenhagen (jim-rollenhagen) → Dmitry Tantsur (divius)
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic-python-agent (master)

Reviewed: https://review.openstack.org/300548
Committed: https://git.openstack.org/cgit/openstack/ironic-python-agent/commit/?id=3deb25a3cec7955c5e38d83af74add58478f884c
Submitter: Jenkins
Branch: master

commit 3deb25a3cec7955c5e38d83af74add58478f884c
Author: Dmitry Tantsur <email address hidden>
Date: Fri Apr 1 16:36:01 2016 +0200

    Wait for the interfaces to get IP addresses before inspection

    In the DIB build the DHCP code (provided by the dhcp-all-interfaces element)
    races with the service starting IPA. It does not matter for deployment itself,
    as we're waiting for the route to the Ironic API to appear. However, for
    inspection it may result in reporting back all NIC's without IP addresses.
    Inspection fails in this case.

    This change makes inspection wait for *all* NIC's to get their IP addresses up
    to a small timeout. The timeout is 60 seconds by default and can be changed
    via the new ipa-inspection-dhcp-wait-timeout kernel option (0 to not wait).

    After the wait, inspection proceeds in any case, so the worst downside
    is making inspection 60 seconds longer.

    To avoid waiting for NIC's that are not even connected, this change extends the
    NetworkInterface class with 'has_carrier' field.

    Closes-Bug: #1564954
    Change-Id: I5bf14de4c1c622f4bf6e3eadbe20c44759da5d66
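
The wait described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the code from the actual change: the function name `wait_for_ips`, the `get_addresses` callback, and the `interval`, `sleep`, and `clock` parameters are all hypothetical. Only the 60-second default and the "0 means do not wait" behaviour come from the commit message.

```python
import time
from typing import Callable, Dict, Optional


def wait_for_ips(
    get_addresses: Callable[[], Dict[str, Optional[str]]],
    timeout: float = 60.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until every interface has an address or the timeout expires.

    get_addresses is a hypothetical callback returning a mapping of
    interface name to IP address (None or '' when not yet assigned).
    Returns True if all interfaces got an address, False otherwise.
    The caller is expected to proceed with inspection in both cases.
    """
    if timeout <= 0:
        # A timeout of 0 disables waiting: check once and return.
        return all(get_addresses().values())

    deadline = clock() + timeout
    while True:
        missing = [name for name, ip in get_addresses().items() if not ip]
        if not missing:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Because the function only reports whether the wait succeeded and never raises on timeout, the worst case is a delay of `timeout` seconds, which matches the trade-off the commit message describes.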

Changed in ironic-python-agent:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic-python-agent (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/305916

OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic-python-agent (stable/mitaka)

Reviewed: https://review.openstack.org/305916
Committed: https://git.openstack.org/cgit/openstack/ironic-python-agent/commit/?id=3fba1ee8db0aa0b1519ef2135e602268488570f4
Submitter: Jenkins
Branch: stable/mitaka

commit 3fba1ee8db0aa0b1519ef2135e602268488570f4
Author: Dmitry Tantsur <email address hidden>
Date: Fri Apr 1 16:36:01 2016 +0200

    Wait for the interfaces to get IP addresses before inspection

    In the DIB build the DHCP code (provided by the dhcp-all-interfaces element)
    races with the service starting IPA. It does not matter for deployment itself,
    as we're waiting for the route to the Ironic API to appear. However, for
    inspection it may result in reporting back all NIC's without IP addresses.
    Inspection fails in this case.

    This change makes inspection wait for *all* NIC's to get their IP addresses up
    to a small timeout. The timeout is 60 seconds by default and can be changed
    via the new ipa-inspection-dhcp-wait-timeout kernel option (0 to not wait).

    After the wait, inspection proceeds in any case, so the worst downside
    is making inspection 60 seconds longer.

    To avoid waiting for NIC's that are not even connected, this change extends the
    NetworkInterface class with 'has_carrier' field.

    Closes-Bug: #1564954
    Change-Id: I5bf14de4c1c622f4bf6e3eadbe20c44759da5d66
    (cherry picked from commit 3deb25a3cec7955c5e38d83af74add58478f884c)
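
For readers unfamiliar with how a kernel option such as ipa-inspection-dhcp-wait-timeout can be consumed, here is a small sketch of reading it from a kernel command line string (on Linux, the contents of /proc/cmdline). The helper names `parse_kernel_params` and `dhcp_wait_timeout` are hypothetical, and falling back to the default on a malformed value is an assumption of this sketch. IPA's own option handling may well differ.

```python
from typing import Dict, Optional


def parse_kernel_params(cmdline: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into a key -> value mapping.

    Flags without '=' (for example 'quiet') map to None.
    """
    params: Dict[str, Optional[str]] = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def dhcp_wait_timeout(cmdline: str, default: int = 60) -> int:
    """Return the DHCP wait timeout in seconds; 0 means do not wait."""
    raw = parse_kernel_params(cmdline).get("ipa-inspection-dhcp-wait-timeout")
    if raw is None:
        return default
    try:
        # Assumption of this sketch: negative values are clamped to 0.
        return max(0, int(raw))
    except ValueError:
        # Assumption of this sketch: malformed values fall back to the default.
        return default
```

On a running ramdisk this could be called as `dhcp_wait_timeout(open("/proc/cmdline").read())`.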

tags: added: in-stable-mitaka
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ironic-python-agent (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/313511

OpenStack Infra (hudson-openstack) wrote : Related fix merged to ironic-python-agent (master)

Reviewed: https://review.openstack.org/313511
Committed: https://git.openstack.org/cgit/openstack/ironic-python-agent/commit/?id=6da6ace3840d56c7145ddf528bbdcbb813fc6ce2
Submitter: Jenkins
Branch: master

commit 6da6ace3840d56c7145ddf528bbdcbb813fc6ce2
Author: Dmitry Tantsur <email address hidden>
Date: Fri May 6 13:26:44 2016 +0200

    [inspection] wait for the PXE DHCP by default and remove the carrier check

    We hoped that checking /sys/class/net/XXX/carrier will allow us
    to not wait for interfaces that are not connected at all.
    In reality this field turned out to be unreliable. For example, it is
    also set to 0 when the interface is down or is being configured.
    The bug https://bugzilla.redhat.com/show_bug.cgi?id=1327255 shows
    the case when carrier is 0 for all interfaces, including the one that is
    used to post back data, which is obviously nonsense.

    This change removes check on carrier for the loop. To avoid 60 seconds
    wait for people with several NIC's, it's changed to only wait for the
    PXE booting NIC, which obviously must get an IP address.

    This makes IP addresses in the inspection data for other NIC's somewhat
    unreliable. A new option inspection_dhcp_all_interfaces is introduced
    to allow waiting for all NIC's to get IP addresses.

    This change should finally fix bug 1564954.

    Change-Id: I8b04bf726980fdcf6bd536c6bb28e30ac50658fb
    Related-Bug: #1564954
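
The selection rule described in this commit message (wait only for the PXE-booting NIC by default, or for every NIC when inspection_dhcp_all_interfaces is enabled) might look roughly like the sketch below. The function name, its parameters, and the idea of identifying the PXE NIC by its MAC address are assumptions made for illustration. The commit message does not say how the PXE-booting NIC is identified.

```python
from typing import Dict, List, Optional


def interfaces_to_wait_for(
    macs_by_name: Dict[str, str],
    pxe_mac: Optional[str],
    dhcp_all_interfaces: bool = False,
) -> List[str]:
    """Pick the interfaces whose IP addresses inspection should wait for.

    macs_by_name maps interface name to MAC address (hypothetical input).
    pxe_mac is the MAC of the PXE-booting NIC, if known (assumption).
    """
    if dhcp_all_interfaces:
        # Opt-in behaviour: wait for every NIC, at the cost of a longer
        # delay when some NICs never receive an address.
        return sorted(macs_by_name)
    if pxe_mac is None:
        # Nothing to identify the PXE NIC with, so wait for nothing.
        return []
    wanted = pxe_mac.lower()
    return sorted(
        name for name, mac in macs_by_name.items() if mac.lower() == wanted
    )
```

With the default setting only the PXE NIC is guaranteed to have had time to obtain an address, which is why the commit message calls the addresses of the other NICs somewhat unreliable.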

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ironic-python-agent (stable/mitaka)

Related fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/314713

OpenStack Infra (hudson-openstack) wrote : Related fix merged to ironic-python-agent (stable/mitaka)

Reviewed: https://review.openstack.org/314713
Committed: https://git.openstack.org/cgit/openstack/ironic-python-agent/commit/?id=ed978f312e1079c6eb7166947253007d141eb82d
Submitter: Jenkins
Branch: stable/mitaka

commit ed978f312e1079c6eb7166947253007d141eb82d
Author: Dmitry Tantsur <email address hidden>
Date: Fri May 6 13:26:44 2016 +0200

    [inspection] wait for the PXE DHCP by default and remove the carrier check

    We hoped that checking /sys/class/net/XXX/carrier will allow us
    to not wait for interfaces that are not connected at all.
    In reality this field turned out to be unreliable. For example, it is
    also set to 0 when the interface is down or is being configured.
    The bug https://bugzilla.redhat.com/show_bug.cgi?id=1327255 shows
    the case when carrier is 0 for all interfaces, including the one that is
    used to post back data, which is obviously nonsense.

    This change removes check on carrier for the loop. To avoid 60 seconds
    wait for people with several NIC's, it's changed to only wait for the
    PXE booting NIC, which obviously must get an IP address.

    This makes IP addresses in the inspection data for other NIC's somewhat
    unreliable. A new option inspection_dhcp_all_interfaces is introduced
    to allow waiting for all NIC's to get IP addresses.

    This change should finally fix bug 1564954.

    Change-Id: I8b04bf726980fdcf6bd536c6bb28e30ac50658fb
    Related-Bug: #1564954
    (cherry picked from commit 6da6ace3840d56c7145ddf528bbdcbb813fc6ce2)

Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/ironic-python-agent 1.2.1

This issue was fixed in the openstack/ironic-python-agent 1.2.1 release.

Thierry Carrez (ttx) wrote : Fix included in openstack/ironic-python-agent 1.3.0

This issue was fixed in the openstack/ironic-python-agent 1.3.0 release.
