ironic-python-agent silently fails to write a configdrive if a previous drive is found

Bug #1433812 reported by Julia Kreger
This bug affects 1 person
Affects: Ironic
Status: Fix Released
Importance: High
Assigned to: Ramakrishnan G (rameshg87)

Bug Description

When using a standalone ironic install (via bifrost, https://github.com/juliakreger/bifrost) to re-deploy physical and virtual nodes that already had a configuration drive, I've found that the config drive silently fails to be written.

This is on ironic's master branch as of yesterday afternoon and IPA's master branch as of today. My understanding is that cleaning will largely resolve this issue for redeployments, although a node that is added to ironic with a pre-existing configdrive from a previous deployment will still fail in this same manner.

Investigation has revealed two things are occurring:
1) CoreOS is auto-mounting the pre-existing configuration drive, which prevents the kernel's view of the disk devices from being updated after a disk image has been written out. As a result, the deployment writes to an _existing_ configdrive partition that is not in the partition table on disk. Upon reboot, only the original partition from the disk image is present. This can be bypassed by setting the kernel parameter "coreos.configdrive=0", although needing it also means that the masked units in oem/cloud-config.yml are not working as expected. This was identified and verified by logging into the IPA node and checking the mounted devices.

2) The execution of partx in ironic_python_agent/shell/copy_configdrive_to_disk.sh does not appear to work as expected (verified in a VM with a virtio disk bus) after a new image is written out via ironic_python_agent/shell/write_image.sh. This was determined by logging into the node running IPA, which was booted with coreos.configdrive=0 (having short-circuited Ironic's reboot and complete-deployment step), and investigating the state after the IPA debug logging indicated that it had successfully written to the _existing configdrive device_. The partition table on disk showed a single partition (/dev/vda1), whereas the kernel device listing showed two partitions (/dev/vda1, /dev/vda2). Manually re-running `partx -u /dev/vda` successfully re-synced the table; however, because the first execution apparently did not register (a possible race condition between partx and device usage?), the node lacks a configdrive upon reboot.
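
For anyone trying to reproduce the two observations above, the checks can be sketched in Python roughly as follows. This is only an illustration and not code from IPA: the function names are hypothetical, the sample /proc/partitions and /proc/mounts contents (sizes, device numbers, mountpoint) are made up, and the set of partitions actually present in the on-disk table is assumed to come from some other tool.

```python
import re


def _partition_pattern(disk):
    # Matches partition names of `disk`, e.g. vda1 for vda or nvme0n1p1 for nvme0n1.
    return re.compile(r"^%s(p?\d+)$" % re.escape(disk))


def partitions_of(disk, proc_partitions_text):
    """Return the partition names the kernel currently lists for `disk`."""
    pattern = _partition_pattern(disk)
    names = set()
    for line in proc_partitions_text.splitlines():
        fields = line.split()
        # Data rows have four fields: major, minor, #blocks, name.
        if len(fields) == 4 and pattern.match(fields[3]):
            names.add(fields[3])
    return names


def mounted_partitions(disk, proc_mounts_text):
    """Return {partition name: mountpoint} for mounted partitions of `disk`."""
    pattern = _partition_pattern(disk)
    mounted = {}
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].startswith("/dev/"):
            name = fields[0][len("/dev/"):]
            if pattern.match(name):
                mounted[name] = fields[1]
    return mounted


def stale_partitions(kernel_view, on_disk):
    """Partitions the kernel still believes in but the on-disk table lacks."""
    return set(kernel_view) - set(on_disk)


# Illustrative sample data only; not captured from a real node.
SAMPLE_PROC_PARTITIONS = """\
major minor  #blocks  name

 253        0   10485760 vda
 253        1   10419200 vda1
 253        2      65536 vda2
"""
SAMPLE_PROC_MOUNTS = "/dev/vda2 /media/configdrive iso9660 ro,relatime 0 0\n"
```

On a live node the two text arguments would be read from /proc/partitions and /proc/mounts. A non-empty result from stale_partitions corresponds to the /dev/vda2 mismatch described in 2), and a non-empty result from mounted_partitions corresponds to the auto-mount described in 1).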

Setting coreos.configdrive=0 and deploying a locally built IPA ramdisk with 877f66826cd1b50163c67e73c8ebb4590c0f7ec8 reverted, as JayF requested I try in IRC, results in the configuration drive always being written out even if a stale partition existed, as it is wiped and partprobe updates the kernel partition table.

+------------------------+-------------------------------------------------------------------------+
| Property | Value |
+------------------------+-------------------------------------------------------------------------+
| instance_uuid | None |
| target_power_state | None |
| properties | {u'memory_mb': u'512', u'cpu_arch': u'x86_64', u'local_gb': u'10', |
| | u'cpus': u'1'} |
| maintenance | False |
| driver_info | {u'ssh_port': 22, u'ssh_username': u'ironic', u'deploy_kernel': |
| | u'http://192.168.122.1:8080/coreos_production_pxe.vmlinuz', |
| | u'deploy_ramdisk': u'http://192.168.122.1:8080 |
| | /coreos_production_pxe_image-oem.cpio.gz', u'ssh_key_filename': |
| | u'/home/ironic/.ssh/id_rsa', u'ssh_address': u'127.0.0.1', |
| | u'ssh_virt_type': u'virsh'} |
| extra | {} |
| last_error | None |
| created_at | 2015-03-18T16:17:01+00:00 |
| target_provision_state | active |
| driver | agent_ssh |
| updated_at | 2015-03-18T22:30:16+00:00 |
| maintenance_reason | None |
| instance_info | {u'root_gb': 10, u'image_source': |
| | u'http://192.168.122.1:8080/deployment_image.qcow2', u'image_checksum': |
| | u'684649dec72ecfc600842ff8af836f57', u'image_url': |
| | u'http://192.168.122.1:8080/deployment_image.qcow2', |
| | u'image_disk_format': u'raw', u'configdrive': |
| | u'http://192.168.122.1:8080/configdrive-a8cb6624-0d9f-c882-affc- |
| | 046ebb96ec01.iso.gz'} |
| driver_internal_info | |
| chassis_uuid | |
| provision_state | deploying |
| reservation | None |
| power_state | power on |
| console_enabled | False |
| uuid | a8cb6624-0d9f-c882-affc-046ebb96ec01 |
+------------------------+-------------------------------------------------------------------------+

Tags: agent
Revision history for this message
Julia Kreger (juliaashleykreger) wrote :
Revision history for this message
Jay Faulkner (jason-oldos) wrote :

After a conversation with Alex Crawford from CoreOS, it sounds like this *is* a race. The ConfigDrive is mounted before oem/cloud-config.yml is processed.

A short-term fix will be to modify the coreos-oem-inject.py script to exclude the mount units causing the problem. CoreOS upstream is working on a longer term fix.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic-python-agent (master)

Fix proposed to branch: master
Review: https://review.openstack.org/165954

Changed in ironic:
assignee: nobody → Jay Faulkner (jason-oldos)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic-python-agent (master)

Reviewed: https://review.openstack.org/165954
Committed: https://git.openstack.org/cgit/openstack/ironic-python-agent/commit/?id=79790171bc76207787873e58561b00652e8875c6
Submitter: Jenkins
Branch: master

commit 79790171bc76207787873e58561b00652e8875c6
Author: Jay Faulkner <email address hidden>
Date: Thu Mar 19 11:25:35 2015 -0700

    Call partprobe+partx before writing configdrive

    partx -u $DEVICE doesn't work in some cases, but partprobe $DEVICE fails
    in virtual environments (like devstack). For now, run both commands and
    ignore partprobe failures.

    Moving forward, this shell should be factored into python and share code
    with the other partition-modifying code added this cycle.

    Change-Id: I7e4c010e260be2a23dcc894bc0c1b30aea949084
    Partial-bug: 1433812
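
The commit message above describes running both commands and tolerating partprobe failures, and it suggests eventually moving this logic into Python. A minimal sketch of what that could look like is below. It is not the merged change, which was made in the shell script: the function name and the injectable `run` parameter are hypothetical, and treating a partx failure as fatal is an assumption.

```python
import subprocess


def rescan_partitions(device, run=subprocess.run):
    """Ask the kernel to re-read the partition table of `device`.

    partprobe is attempted first and its failure is ignored, since it is
    reported to fail in virtual environments. partx -u is then run, and its
    failure is allowed to propagate (an assumption of this sketch).
    """
    try:
        run(["partprobe", device], check=True)
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Best effort only: fall through to partx.
        pass
    run(["partx", "-u", device], check=True)
```

The `run` parameter exists only so the function can be exercised without touching a real block device; a caller would normally use the default and call rescan_partitions("/dev/vda") before writing the configdrive.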

Dmitry Tantsur (divius)
Changed in ironic:
importance: Undecided → High
Revision history for this message
Jay Faulkner (jason-oldos) wrote :

I filed a bug upstream with CoreOS and suggested a couple of workarounds there: https://github.com/coreos/bugs/issues/314

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/167061

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/167063

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ironic (stable/juno)

Change abandoned by Jay Faulkner (<email address hidden>) on branch: stable/juno
Review: https://review.openstack.org/167061

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (master)

Reviewed: https://review.openstack.org/167063
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=87abb934e0d97cce0562a028e68c0a70a35c19ce
Submitter: Jenkins
Branch: master

commit 87abb934e0d97cce0562a028e68c0a70a35c19ce
Author: Jay Faulkner <email address hidden>
Date: Mon Mar 23 17:59:38 2015 -0700

    Ensure configdrive isn't mounted in CoreOS ramdisks

    Temporary workaround for bug #1433812. CoreOS processes the
    cloud-config.yml too late the boot process to prevent mounting and
    processing the configdrive. Pass coreos.configdrive=0 on the kernel
    command line to ensure this doesn't occur, as it can be a security risk
    (previous tenants may have written a malicious configdrive, and it would
    be read before being cleaned).

    Long-term, we should remove this workaround and either completely remove
    the mount units from the ramdisk during the build process or get a
    better fix from upstream (https://github.com/coreos/bugs/issues/314).

    Change-Id: I59575b2c5c89c3ceef03598f8b86f0e330cfacad
    Partial-bug: 1433812
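
The workaround in the commit message above comes down to making sure coreos.configdrive=0 ends up on the deploy ramdisk's kernel command line. As a rough illustration of that idea only, and not a reproduction of the actual ironic change, a hypothetical helper that adds the parameter to a command-line string without duplicating it might look like this:

```python
def ensure_kernel_param(cmdline, param="coreos.configdrive=0"):
    """Return `cmdline` with `param` present exactly once, at the end.

    Any existing setting of the same key (e.g. coreos.configdrive=1) is
    dropped so the appended value is the only one on the command line.
    """
    key = param.split("=", 1)[0]
    kept = [arg for arg in cmdline.split() if arg.split("=", 1)[0] != key]
    kept.append(param)
    return " ".join(kept)
```

Dropping an existing value for the same key keeps the result idempotent, so applying the helper twice yields the same command line as applying it once.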

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/167449

Changed in ironic:
assignee: Jay Faulkner (jason-oldos) → Lucas Alvares Gomes (lucasagomes)
Changed in ironic:
assignee: Lucas Alvares Gomes (lucasagomes) → Jay Faulkner (jason-oldos)
Changed in ironic:
assignee: Jay Faulkner (jason-oldos) → Ramakrishnan G (rameshg87)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/167700

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (master)

Reviewed: https://review.openstack.org/167449
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=be5e85ffd9437cb25b771323eac5b513eb8592ee
Submitter: Jenkins
Branch: master

commit be5e85ffd9437cb25b771323eac5b513eb8592ee
Author: Jay Faulkner <email address hidden>
Date: Tue Mar 24 16:35:38 2015 -0700

    Ensure configdrive isn't mounted for ipxe/elilo

    This extends 87abb934e0d97cce0562a028e68c0a70a35c19ce to work with elilo
    and ipxe.

    Temporary workaround for bug #1433812. CoreOS processes the
    cloud-config.yml too late the boot process to prevent mounting and
    processing the configdrive. Pass coreos.configdrive=0 on the kernel
    command line to ensure this doesn't occur, as it can be a security
    risk (previous tenants may have written a malicious configdrive,
    and it would be read before being cleaned).

    Long-term, we should remove this workaround and either completely remove
    the mount units from the ramdisk during the build process or get a
    better fix from upstream (https://github.com/coreos/bugs/issues/314).

    Change-Id: I03fd230a9d03dd4daeaa53148ec9975d741c14a0
    Partial-bug: 1433812

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/167700
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=05075b36c0c2195084ff13588df0fb1c97dccf2b
Submitter: Jenkins
Branch: master

commit 05075b36c0c2195084ff13588df0fb1c97dccf2b
Author: Ramakrishnan G <email address hidden>
Date: Wed Mar 25 17:15:20 2015 +0000

    Ensure configdrive isn't mounted for ilo drivers

    This extends 87abb934e0d97cce0562a028e68c0a70a35c19ce to
    work with iscsi_ilo and agent_ilo drivers.

    Change-Id: I5b0550485edf4854ca8b9a606003534f53f94408
    Partial-bug: 1433812

Revision history for this message
Jim Rollenhagen (jim-rollenhagen) wrote :

What's left to do here?

Changed in ironic:
status: In Progress → Fix Committed
Changed in ironic:
milestone: none → 4.1.0
status: Fix Committed → Fix Released
Revision history for this message
Lucas Alvares Gomes (lucasagomes) wrote :

It seems all patches tagging this bug have been merged, but none of them closed the bug (they used the partial-bug tag). What else is needed for it to be closed?

Revision history for this message
Lucas Alvares Gomes (lucasagomes) wrote :

Talking to ramesh on IRC, the bug is completed:

 <lucasagomes> rameshg87, what else is needed there? Seems all patches have been merged but none were closing the bug (just partial-bug tags)
 <rameshg87> lucasagomes: hi
 <rameshg87> lucasagomes: let me check
 <lucasagomes> rameshg87, yeah was having some connection problems, closed the others
 <rameshg87> lucasagomes: nothing else I guess.
 <rameshg87> lucasagomes: we can close the bug
 <lucasagomes> rameshg87, ack! Thanks
