conductor lacks periodic task to keep PXE env up to date

Bug #1279331 reported by aeva black
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
Ironic | Fix Released | High | aeva black |

Bug Description

There is currently no periodic task that ensures that a conductor's local PXE boot environment (the TFTP config file and cached kernels and ramdisks) matches the set of nodes currently mapped to that host by the hash ring. The two can get out of sync if, e.g., the size of the hash ring changes, the conductor was temporarily offline while a node was created or deleted, or the configuration option for the number of hash replicas was changed and the service restarted.

Related to this, ironic.common.hash_ring defines the hash_distribution_replicas option. The purpose of this option is to provide a faster fail-over time when a conductor goes offline by allowing the next conductor in the ring to pre-cache the environment. The "do_node_deploy" RPC message will only be sent to the first conductor that the node is mapped to; if the number of replicas is greater than one, the additional conductor(s) should pre-cache the kernel and ramdisk from within this periodic task. Note that they should not cache the entire user image; that is only needed while deploying.
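
To illustrate how a replica count greater than one yields a primary conductor plus backups for each node, here is a minimal sketch of a consistent hash ring. It is not the code in ironic.common.hash_ring; the class name ToyHashRing, the points_per_host parameter, and the choice of SHA-256 are assumptions made for this example only.

```python
import bisect
import hashlib


def _hash(key):
    # Stable integer hash, used only to spread keys around the ring.
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class ToyHashRing:
    """Simplified consistent hash ring (hypothetical, not Ironic's code)."""

    def __init__(self, hosts, replicas=1, points_per_host=32):
        # Never ask for more replicas than there are hosts.
        self.replicas = max(1, min(replicas, len(hosts)))
        self._ring = sorted(
            (_hash("%s-%d" % (host, i)), host)
            for host in hosts
            for i in range(points_per_host)
        )
        self._keys = [point for point, _ in self._ring]

    def get_hosts(self, node_uuid):
        """Return [primary, backup, ...] conductors for a node."""
        start = bisect.bisect(self._keys, _hash(node_uuid))
        found = []
        # Walk clockwise from the node's position, collecting distinct hosts.
        for offset in range(len(self._ring)):
            host = self._ring[(start + offset) % len(self._ring)][1]
            if host not in found:
                found.append(host)
            if len(found) == self.replicas:
                break
        return found
```

In this sketch, the first host returned would receive "do_node_deploy", and any further hosts are the ones that would pre-cache the kernel and ramdisk.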

Within this periodic task, the conductor should compare its locally-cached deploy environments with the list of nodes mapped to it, and then either prepare or clean up those deployment environments as appropriate.
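
The comparison described above amounts to two set differences. The following is a rough sketch only: the function name reconcile_pxe_env and the prepare/clean_up callbacks are hypothetical stand-ins, not Ironic APIs.

```python
def reconcile_pxe_env(cached_nodes, mapped_nodes, prepare, clean_up):
    """Bring the local PXE environment in line with the hash ring mapping.

    cached_nodes: node UUIDs that have a local deploy environment
    mapped_nodes: node UUIDs the hash ring currently maps to this conductor
    prepare / clean_up: hypothetical callbacks taking a node UUID
    """
    cached = set(cached_nodes)
    mapped = set(mapped_nodes)

    # Mapped here but nothing cached locally: build the environment.
    to_prepare = sorted(mapped - cached)
    # Cached locally but no longer mapped here: remove the stale files.
    to_clean = sorted(cached - mapped)

    for uuid in to_prepare:
        prepare(uuid)
    for uuid in to_clean:
        clean_up(uuid)
    return to_prepare, to_clean
```

For example, with cached nodes {"n1", "n2"} and mapped nodes {"n2", "n3"}, the sketch prepares "n3" and cleans up "n1", leaving "n2" untouched.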

aeva black (tenbrae)
Changed in ironic:
status: New → Triaged
importance: Undecided → Medium
Ling Gao (linggao)
Changed in ironic:
assignee: nobody → Ling Gao (linggao)
Ling Gao (linggao)
Changed in ironic:
assignee: Ling Gao (linggao) → nobody
Dmitry Tantsur (divius)
Changed in ironic:
assignee: nobody → Dmitry "Divius" Tantsur (divius)
Openstack Gerrit (openstack-gerrit) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/92115

Changed in ironic:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ironic (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/93748

Dmitry Tantsur (divius)
Changed in ironic:
assignee: Dmitry "Divius" Tantsur (divius) → nobody
status: In Progress → Confirmed
aeva black (tenbrae)
Changed in ironic:
milestone: none → juno-rc1
importance: Medium → High
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/124610

Changed in ironic:
assignee: nobody → Devananda van der Veen (devananda)
status: Confirmed → In Progress
Changed in ironic:
assignee: Devananda van der Veen (devananda) → Lucas Alvares Gomes (lucasagomes)
Changed in ironic:
assignee: Lucas Alvares Gomes (lucasagomes) → Devananda van der Veen (devananda)
Changed in ironic:
assignee: Devananda van der Veen (devananda) → Lucas Alvares Gomes (lucasagomes)
Changed in ironic:
assignee: Lucas Alvares Gomes (lucasagomes) → David Shrewsbury (dshrews)
Changed in ironic:
assignee: David Shrewsbury (dshrews) → Devananda van der Veen (devananda)
Changed in ironic:
assignee: Devananda van der Veen (devananda) → Lucas Alvares Gomes (lucasagomes)
Changed in ironic:
assignee: Lucas Alvares Gomes (lucasagomes) → Devananda van der Veen (devananda)
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ironic (master)

Reviewed: https://review.openstack.org/124493
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=d6a277130b72534e499c64e3e468ec5e572fcc4d
Submitter: Jenkins
Branch: master

commit d6a277130b72534e499c64e3e468ec5e572fcc4d
Author: Devananda van der Veen <email address hidden>
Date: Fri Sep 26 11:35:38 2014 -0700

    Add "affinity" tracking to nodes and conductors

    Add "conductor_affinity" column to nodes table, containing a reference to the
    `id` of the conductor service (not its hostname) that has most recently
    performed some action which could require local state to be maintained
    (eg, built a PXE config, or started a SOL session).

    Using the `id` as a foreign key necessitates not deleting conductors
    when unregistering them, but instead marking them offline. This also
    helps in determining if a conductor service was only restarted (though
    this patch does not implement graceful shutdown).

    Thus, this patch also adds an "online" boolean column to the conductors
    table to track whether a conductor is on- or offline, and updates
    the register and unregister methods to use that field transparently.
    It may be noted that this does not change the behavior of
    register_conductor or unregister_conductor, though an optional
    "update_existing" parameter has been added to register_conductor. This
    replaces a DELETE query with an UPDATE query instead.

    Co-Authored-By: David Shrewsbury <email address hidden>
    Co-Authored-By: Lucas Alvares Gomes <email address hidden>

    Related-bug: #1279331
    Change-Id: I8e8b5cc00fc9f565ad2fb442e9a26077342e0a25
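
The idea of marking a conductor offline instead of deleting its row, so that its `id` stays valid as a foreign key, can be sketched with sqlite3 as below. The schema, the function bodies, and the error raised for an already-online conductor are assumptions for illustration; only the names register_conductor, unregister_conductor, update_existing and the "online" column come from the commit message.

```python
import sqlite3

# Hypothetical, minimal schema; the real conductors table has more columns.
SCHEMA = """
CREATE TABLE conductors (
    id INTEGER PRIMARY KEY,
    hostname TEXT UNIQUE NOT NULL,
    online INTEGER NOT NULL DEFAULT 1
)
"""


def register_conductor(conn, hostname, update_existing=False):
    """Insert the conductor, or flip an existing row back to online."""
    row = conn.execute(
        "SELECT id, online FROM conductors WHERE hostname = ?", (hostname,)
    ).fetchone()
    if row is None:
        cur = conn.execute(
            "INSERT INTO conductors (hostname, online) VALUES (?, 1)",
            (hostname,),
        )
        return cur.lastrowid
    conductor_id, online = row
    if online and not update_existing:
        # Assumed behaviour: refuse to register a host that looks alive.
        raise RuntimeError("conductor %s is already registered" % hostname)
    conn.execute("UPDATE conductors SET online = 1 WHERE id = ?", (conductor_id,))
    return conductor_id


def unregister_conductor(conn, hostname):
    """Mark the conductor offline with an UPDATE rather than a DELETE."""
    conn.execute("UPDATE conductors SET online = 0 WHERE hostname = ?", (hostname,))
```

Because the row survives unregistration, a restarted conductor gets the same `id` back, and any node whose conductor_affinity points at it remains consistent.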

Changed in ironic:
assignee: Devananda van der Veen (devananda) → Lucas Alvares Gomes (lucasagomes)
Changed in ironic:
assignee: Lucas Alvares Gomes (lucasagomes) → Devananda van der Veen (devananda)
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (master)

Reviewed: https://review.openstack.org/124610
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=552a927e56030dc21221c035537d62b2077664a8
Submitter: Jenkins
Branch: master

commit 552a927e56030dc21221c035537d62b2077664a8
Author: Devananda van der Veen <email address hidden>
Date: Sat Sep 27 18:41:46 2014 -0700

    Add periodic task to rebuild conductor local state

    This adds a periodic task which can rebuild the conductor's local state
    (PXE config files, etc) when conductors join or leave the cluster.

    For any node which is newly mapped to the conductor, this will
    trigger calling prepare() and take_over() on that node's deploy
    interface.

    This uses the periodic_max_worker setting like other periodic jobs,
    starting the take over process in separate threads. Thus, in a large
    cluster, it may take some time for all nodes to settle down.
    It also adds a new CONF option to control the timing of this job.

    There is a lot of room for improvement and optimization in this, however
    getting a fix in place is critical to the Juno release.

    NOTE: This does not re-establish any console sessions.

    Co-Authored-By: Lucas Alvares Gomes <email address hidden>
    Change-Id: I0dbe9a5a98ec5fd0c69f32d7590d8141da5a23c2
    Closes-bug: #1279331
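
As a rough sketch of the approach this commit message describes, the function below takes over nodes that are mapped to the local conductor but whose affinity points elsewhere, using a bounded thread pool. It is not the merged implementation: rebuild_local_state, the dict-based node records, and the take_over callback (standing in for calling prepare() and take_over() on the deploy interface) are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def rebuild_local_state(nodes, conductor_id, mapped_uuids, take_over,
                        max_workers=8):
    """Take over nodes mapped here whose affinity points elsewhere.

    nodes:        list of dicts with "uuid" and "conductor_affinity" keys
    mapped_uuids: set of node UUIDs the hash ring maps to this conductor
    take_over:    callable(node); stands in for prepare() + take_over()
    """
    pending = [
        n for n in nodes
        if n["uuid"] in mapped_uuids
        and n["conductor_affinity"] != conductor_id
    ]
    # The pool size caps how many take-overs run at the same time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for completion and re-raises any worker exception.
        list(pool.map(take_over, pending))
    # Record that this conductor now holds the local state for these nodes.
    for node in pending:
        node["conductor_affinity"] = conductor_id
    return [n["uuid"] for n in pending]
```

With a small worker limit and many newly mapped nodes, the take-overs are processed in batches, which matches the note above that a large cluster may take some time to settle.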

Changed in ironic:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in ironic:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in ironic:
milestone: juno-rc1 → 2014.2