Double manage/provide cycle causes trace for pxe_drac

Bug #1676387 reported by George Shuklin
This bug affects 1 person
Affects: Ironic
Status: Fix Released
Importance: High
Assigned to: Richard G. Pioso

Bug Description

Affected version: Newton

Configuration: pxe_drac driver, Neutron networking.

Steps to reproduce:

1. Put the node into the 'manageable' state with maintenance off.
2. ironic node-set-provision-state $NODE provide
3. Wait until it becomes available.
4. ironic node-set-provision-state $NODE manage
5. ironic node-set-provision-state $NODE provide

Expected result: the node becomes available.
Actual result: the node ends up in the 'clean failed' state.

AFAIK the conductor is trying to connect to the node while it is powered off and receives an error.
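For reference, a minimal sketch of the same double manage/provide cycle driven through python-ironicclient (a hedged illustration; the credential keyword arguments are placeholders and vary by client version and environment):

import time

from ironicclient import client

# Placeholder credentials/endpoints -- adjust for your environment.
ironic = client.get_client(1,
                           os_username='admin',
                           os_password='secret',
                           os_project_name='admin',
                           os_auth_url='http://keystone.example.com:5000/v2.0')

NODE = 'dbf66326-3def-47a8-b061-0b409d3d544c'


def wait_for(node_uuid, target_state, timeout=1800, interval=15):
    """Poll provision_state until the node reaches target_state."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        node = ironic.node.get(node_uuid)
        if node.provision_state == target_state:
            return node
        if node.provision_state.endswith('failed'):
            raise RuntimeError('node %s entered %s: %s' % (
                node_uuid, node.provision_state, node.last_error))
        time.sleep(interval)
    raise RuntimeError('timed out waiting for %s' % target_state)


# First cycle: manageable -> available (succeeds).
ironic.node.set_provision_state(NODE, 'provide')
wait_for(NODE, 'available')

# Second cycle: available -> manageable -> available (ends in 'clean failed').
ironic.node.set_provision_state(NODE, 'manage')
wait_for(NODE, 'manageable')
ironic.node.set_provision_state(NODE, 'provide')
wait_for(NODE, 'available')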

Trace from ironic-conductor:

ironic.conductor.task_manager[26379]: DEBUG Attempting to get exclusive lock on node dbf66326-3def-47a8-b061-0b409d3d544c (for provision action provide) [req-fd6b2d9e-4f22-4f63-a783-2ab6f72142c0 - - - - -] __init__ /usr/lib/python2.7/dist-packages/ironic/conductor/task_manager.py:208
ironic.conductor.task_manager[26379]: DEBUG Node dbf66326-3def-47a8-b061-0b409d3d544c successfully reserved for provision action provide (took 0.02 seconds) [req-fd6b2d9e-4f22-4f63-a783-2ab6f72142c0 - - - - -] reserve_node /usr/lib/python2.7/dist-packages/ironic/conductor/task_manager.py:252
ironic.common.states[26379]: DEBUG Exiting old state 'manageable' in response to event 'provide' [req-fd6b2d9e-4f22-4f63-a783-2ab6f72142c0 - - - - -] on_exit /usr/lib/python2.7/dist-packages/ironic/common/states.py:219
ironic.common.states[26379]: DEBUG Entering new state 'cleaning' in response to event 'provide' [req-fd6b2d9e-4f22-4f63-a783-2ab6f72142c0 - - - - -] on_enter /usr/lib/python2.7/dist-packages/ironic/common/states.py:225
ironic.conductor.manager[26379]: DEBUG Starting automated cleaning for node dbf66326-3def-47a8-b061-0b409d3d544c [req-fd6b2d9e-4f22-4f63-a783-2ab6f72142c0 - - - - -] _do_node_clean /usr/lib/python2.7/dist-packages/ironic/conductor/manager.py:793
ironic.conductor.manager[26379]: INFO Executing cleaning on node dbf66326-3def-47a8-b061-0b409d3d544c, remaining steps: [{u'priority': 99, u'interface': u'deploy', u'reboot_requested': False, u'abortable': True, u'step': u'erase_devices_metadata'}, {u'priority': 10, u'interface': u'deploy', u'reboot_requested': False, u'abortable': True, u'step': u'erase_devices'}] [req-fd6b2d9e-4f22-4f63-a783-2ab6f72142c0 - - - - -]
ironic.conductor.manager[26379]: INFO Executing {u'priority': 99, u'interface': u'deploy', u'reboot_requested': False, u'abortable': True, u'step': u'erase_devices_metadata'} on node dbf66326-3def-47a8-b061-0b409d3d544c [req-fd6b2d9e-4f22-4f63-a783-2ab6f72142c0 - - - - -]
ironic.drivers.modules.agent_client[26379]: DEBUG Executing agent command clean.execute_clean_step for node dbf66326-3def-47a8-b061-0b409d3d544c [req-fd6b2d9e-4f22-4f63-a783-2ab6f72142c0 - - - - -] _command /usr/lib/python2.7/dist-packages/ironic/drivers/modules/agent_client.py:62

ironic.drivers.modules.agent_client[26379]: ERROR Error invoking agent command clean.execute_clean_step for node dbf66326-3def-47a8-b061-0b409d3d544c. Error: HTTPConnectionPool(host='204.74.228.42', port=9999): Max retries exceeded with url: /v1/commands?wait=false (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f4d5241ddd0>: Failed to establish a new connection: [Errno 113] EHOSTUNREACH',)) [req-fd6b2d9e-4f22-4f63-a783-2ab6f72142c0 - - - - -]
ironic.conductor.manager[26379]: ERROR Node dbf66326-3def-47a8-b061-0b409d3d544c failed step {u'priority': 99, u'interface': u'deploy', u'reboot_requested': False, u'abortable': True, u'step': u'erase_devices_metadata'}: Error invoking agent command clean.execute_clean_step for node dbf66326-3def-47a8-b061-0b409d3d544c. Error: HTTPConnectionPool(host='204.74.228.42', port=9999): Max retries exceeded with url: /v1/commands?wait=false (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f4d5241ddd0>: Failed to establish a new connection: [Errno 113] EHOSTUNREACH',)) [req-fd6b2d9e-4f22-4f63-a783-2ab6f72142c0 - - - - -]
ironic.conductor.manager[26379]: TRACE Traceback (most recent call last):
ironic.conductor.manager[26379]: TRACE File "/usr/lib/python2.7/dist-packages/ironic/conductor/manager.py", line 893, in _do_next_clean_step
ironic.conductor.manager[26379]: TRACE result = interface.execute_clean_step(task, step)
ironic.conductor.manager[26379]: TRACE File "/usr/lib/python2.7/dist-packages/ironic_lib/metrics.py", line 61, in wrapped
ironic.conductor.manager[26379]: TRACE result = f(*args, **kwargs)
ironic.conductor.manager[26379]: TRACE File "/usr/lib/python2.7/dist-packages/ironic/drivers/modules/iscsi_deploy.py", line 532, in execute_clean_step
ironic.conductor.manager[26379]: TRACE return deploy_utils.agent_execute_clean_step(task, step)
ironic.conductor.manager[26379]: TRACE File "/usr/lib/python2.7/dist-packages/ironic/drivers/modules/deploy_utils.py", line 624, in agent_execute_clean_step
ironic.conductor.manager[26379]: TRACE result = client.execute_clean_step(step, task.node, ports)
ironic.conductor.manager[26379]: TRACE File "/usr/lib/python2.7/dist-packages/ironic_lib/metrics.py", line 61, in wrapped
ironic.conductor.manager[26379]: TRACE result = f(*args, **kwargs)
ironic.conductor.manager[26379]: TRACE File "/usr/lib/python2.7/dist-packages/ironic/drivers/modules/agent_client.py", line 227, in execute_clean_step
ironic.conductor.manager[26379]: TRACE params=params)
ironic.conductor.manager[26379]: TRACE File "/usr/lib/python2.7/dist-packages/ironic_lib/metrics.py", line 61, in wrapped
ironic.conductor.manager[26379]: TRACE result = f(*args, **kwargs)
ironic.conductor.manager[26379]: TRACE File "/usr/lib/python2.7/dist-packages/ironic/drivers/modules/agent_client.py", line 71, in _command
ironic.conductor.manager[26379]: TRACE raise exception.IronicException(msg)
ironic.conductor.manager[26379]: TRACE IronicException: Error invoking agent command clean.execute_clean_step for node dbf66326-3def-47a8-b061-0b409d3d544c. Error: HTTPConnectionPool(host='204.74.228.42', port=9999): Max retries exceeded with url: /v1/commands?wait=false (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f4d5241ddd0>: Failed to establish a new connection: [Errno 113] EHOSTUNREACH',))
ironic.conductor.manager[26379]: TRACE

204.74.228.42 was the IP address of the node's Neutron port on the provisioning/cleaning network.

The node was powered off (both physically and according to its ironic power_state).

Revision history for this message
Julia Kreger (juliaashleykreger) wrote :

The fact that it has an IP to call makes me think we didn't clean up the old heartbeat information (which is likely). At the same time, I'm curious whether the node was ever actually powered on in this sequence upon the second execution of cleaning steps.

Would it be possible to get a complete conductor log with timestamps, starting from when the node is in the available state?

Also, out of curiosity, has this been confirmed on Ocata?

Revision history for this message
George Shuklin (george-shuklin) wrote :

I had the iDRAC remote console open during the test.

It performs the first cleaning properly, but if I run a second loop (-> manage -> provide), it fails without powering on the node.

Revision history for this message
Julia Kreger (juliaashleykreger) wrote :

This looks like data is not being cleaned up somewhere along the way; however, I think we fixed a similar bug early on in Ocata. Please provide the full output of the node-show command, redacted naturally. In particular, we need to see what the node has in driver_internal_info after the initial move to the available state and in its failed state.
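As a hedged illustration of the kind of data being requested here, driver_internal_info can be dumped with python-ironicclient (assuming an `ironic` client handle as in the earlier sketch; the field names are the ones visible in the node JSON below):

import json

node = ironic.node.get('dbf66326-3def-47a8-b061-0b409d3d544c')
info = node.driver_internal_info
# agent_url and agent_last_heartbeat are the fields of interest: a stale
# agent_url left over from the previous cleaning run is what the conductor
# ends up calling on the powered-off node.
print(json.dumps({'provision_state': node.provision_state,
                  'agent_url': info.get('agent_url'),
                  'agent_last_heartbeat': info.get('agent_last_heartbeat'),
                  'agent_cached_clean_steps':
                      list(info.get('agent_cached_clean_steps', {}))},
                 indent=4))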

Changed in ironic:
status: New → Incomplete
Revision history for this message
George Shuklin (george-shuklin) wrote :

Before 1st cycle (manage state):

{
    "target_power_state": null,
    "extra": {},
    "last_error": null,
    "updated_at": "2017-03-28T12:52:33+00:00",
    "maintenance_reason": null,
    "provision_state": "manageable",
    "clean_step": {},
    "uuid": "dbf66326-3def-47a8-b061-0b409d3d544c",
    "console_enabled": false,
    "target_provision_state": null,
    "raid_config": "",
    "provision_updated_at": "2017-03-28T12:52:33+00:00",
    "maintenance": false,
    "target_raid_config": "",
    "network_interface": "",
    "inspection_started_at": null,
    "inspection_finished_at": null,
    "power_state": "power off",
    "driver": "pxe_drac",
    "reservation": null,
    "properties": {
        "memory_mb": 8096,
        "cpu_arch": "x86_64",
        "local_gb": 480,
        "cpus": 4,
        "capabilities": "availability_zone:az1"
    },
    "instance_uuid": null,
    "name": "s8006",
    "driver_info": {
        "drac_password": "****",
        "drac_username": "root",
        "deploy_kernel": "40b0c54c-d49f-4cf4-b357-f7f616e7b4f3",
        "drac_host": "192.168.212.199",
        "deploy_ramdisk": "723e26ca-8d82-4f52-bfc4-b288e8021e4c"
    },
    "resource_class": "",
    "created_at": "2017-03-27T09:49:20+00:00",
    "driver_internal_info": {
        "agent_cached_clean_steps_refreshed": "2017-03-27 13:01:59.049338",
        "agent_erase_devices_zeroize": true,
        "drac_boot_device": null,
        "agent_cached_clean_steps": {
            "deploy": [
                {
                    "priority": 99,
                    "interface": "deploy",
                    "reboot_requested": false,
                    "abortable": true,
                    "step": "erase_devices_metadata"
                },
                {
                    "priority": 10,
                    "interface": "deploy",
                    "reboot_requested": false,
                    "abortable": true,
                    "step": "erase_devices"
                }
            ]
        },
        "clean_steps": [
            {
                "priority": 99,
                "interface": "deploy",
                "reboot_requested": false,
                "abortable": true,
                "step": "erase_devices_metadata"
            },
            {
                "priority": 10,
                "interface": "deploy",
                "reboot_requested": false,
                "abortable": true,
                "step": "erase_devices"
            }
        ],
        "hardware_manager_version": {
            "generic_hardware_manager": "1.1"
        },
        "is_whole_disk_image": true,
        "agent_continue_if_ata_erase_failed": false,
        "agent_erase_devices_iterations": 1,
        "root_uuid_or_disk_id": "0xd77da3bb",
        "agent_url": "http://redacted_ip:9999",
        "agent_last_heartbeat": 1490620184
    },
    "chassis_uuid": "",
    "instance_info": {}
}

1st cycle after 'provide':

{
    "target_power_state": null,
    "extra": {},
    "last_error": null,
    "updated_at": "2017-03-28T12:58:33+00:00",
    "maintenance_reason": null,
    "provision_state": "clean wait",
    "clean_step": {},
    "uuid": "dbf66326-3def-47a8-b061-...

Changed in ironic:
status: Incomplete → New
Revision history for this message
Vladyslav Drok (vdrok) wrote :

So, it seems like an issue with the DRAC deploy interface, which uses the cached values if they are present -- https://github.com/openstack/ironic/blob/master/ironic/drivers/modules/drac/deploy.py#L51

In the case of iSCSI, we always do prepare_inband_cleaning; we might need to switch to that approach, although I'm not a DRAC guru :)
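For readers unfamiliar with the code being linked, here is a simplified sketch of the shape of that override (illustrative only, not the exact upstream DracDeploy; the helper and its condition are hypothetical). ISCSIDeploy.prepare_cleaning() always prepares in-band cleaning, while an override that sometimes skips it leaves a stale agent_url behind for the conductor to call:

from ironic.drivers.modules import deploy_utils
from ironic.drivers.modules import iscsi_deploy


class DracDeploySketch(iscsi_deploy.ISCSIDeploy):
    """Illustrative stand-in for the DracDeploy override under discussion."""

    def prepare_cleaning(self, task):
        # Hypothetical condition: consult clean-step data cached in
        # driver_internal_info to guess that only out-of-band (RAID) steps
        # will run; the exact upstream check differs.
        cached = task.node.driver_internal_info.get('agent_cached_clean_steps')
        if cached and _looks_out_of_band_only(cached):
            # Skip creating cleaning ports and booting the IPA ramdisk.
            # If the conductor then schedules an in-band step such as
            # erase_devices_metadata, it calls the stale agent_url of a
            # powered-off node -> EHOSTUNREACH -> 'clean failed'.
            return

        # ISCSIDeploy's behaviour: always prepare in-band cleaning.
        return deploy_utils.prepare_inband_cleaning(task, manage_boot=True)


def _looks_out_of_band_only(cached_steps):
    # Hypothetical helper for the sketch.
    return not any(step.get('interface') == 'deploy'
                   for steps in cached_steps.values()
                   for step in steps)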

Changed in ironic:
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
Vladyslav Drok (vdrok) wrote :

Also, it seems that because of this we can't do cleaning in a multitenant setup, since we don't call prepare_inband_cleaning, which creates the cleaning ports.

Changed in ironic:
importance: Medium → High
Vladyslav Drok (vdrok)
tags: added: drac
Dmitry Tantsur (divius)
Changed in ironic:
status: Confirmed → Triaged
Changed in ironic:
assignee: nobody → Richard G. Pioso (richard.pioso)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/493421

Changed in ironic:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/494736

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ironic (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/494737

Changed in ironic:
assignee: Richard G. Pioso (richard.pioso) → Vladyslav Drok (vdrok)
Vladyslav Drok (vdrok)
Changed in ironic:
assignee: Vladyslav Drok (vdrok) → Richard G. Pioso (richard.pioso)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (master)

Reviewed: https://review.openstack.org/493421
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=86e3a100a3a6fcfcf9d498886e1d7784f07ecd73
Submitter: Jenkins
Branch: master

commit 86e3a100a3a6fcfcf9d498886e1d7784f07ecd73
Author: Richard Pioso <email address hidden>
Date: Sun Aug 13 21:10:06 2017 -0400

    Fix DRAC classic driver double manage/provide

    This change fixes an issue that caused a node using a Dell EMC
    integrated Dell Remote Access Controller (iDRAC) classic driver,
    'pxe_drac' or 'pxe_drac_inspector', to be placed in the 'clean failed'
    state after a double manage/provide cycle, instead of the 'available'
    state.

    The deploy interface implementation used by iDRAC classic drivers has
    been class ironic.drivers.modules.drac.deploy.DracDeploy, which is
    derived from class ironic.drivers.modules.iscsi_deploy.ISCSIDeploy. The
    only difference between them is that DracDeploy overrides the
    prepare_cleaning() method to prevent the booting of the Ironic Python
    Agent (IPA) ramdisk when only out-of-band RAID clean steps are
    requested. However, it caused the issue and did not have its intended
    effect, because Ironic Conductor boots the ramdisk regardless. The
    Ironic Conductor should be modified to preclude the booting of the IPA
    ramdisk, rather than leaving that fix to individual drivers.

    The iDRAC classic drivers' deploy interface implementation has been
    changed to ISCSIDeploy. Since class DracDeploy is no longer needed, its
    source code and automated tests have been removed.

    Change-Id: Ib2c9b7f9f780aaf5f6345825b1f6c9ddb4f9c41f
    Closes-Bug: #1676387
    Related-Bug: #1572529
    Related-Bug: #1705741
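The shape of the fix, as a hedged illustration (not a verbatim diff; the real change is in the review linked above and touches ironic/drivers/drac.py): the classic driver's composition drops DracDeploy and wires in ISCSIDeploy directly, so prepare_cleaning() always prepares in-band cleaning.

from ironic.drivers import base
from ironic.drivers.modules import iscsi_deploy
from ironic.drivers.modules import pxe
from ironic.drivers.modules.drac import management as drac_mgmt
from ironic.drivers.modules.drac import power as drac_power


class PXEDracDriverSketch(base.BaseDriver):
    """Sketch of a pxe_drac-style classic driver composition after the fix."""

    def __init__(self):
        self.power = drac_power.DracPower()
        self.boot = pxe.PXEBoot()
        # Before the fix: self.deploy = drac_deploy.DracDeploy()
        self.deploy = iscsi_deploy.ISCSIDeploy()
        self.management = drac_mgmt.DracManagement()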

Changed in ironic:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ironic (master)

Reviewed: https://review.openstack.org/491263
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=d75fecc974bb9233cf41a457af35e1911232a2d9
Submitter: Jenkins
Branch: master

commit d75fecc974bb9233cf41a457af35e1911232a2d9
Author: Richard Pioso <email address hidden>
Date: Sun Aug 6 13:01:43 2017 -0400

    Add 'hardware type' for Dell EMC iDRACs

    The ironic project is engaged in a driver composition reform [0] which
    introduces dynamic drivers. That reform places hardware vendors in
    charge of defining a new form of drivers for their hardware. The new
    form is called 'hardware type'. The reform will deprecate [1] the
    existing ironic drivers, which are now referred to as 'classic drivers'.

    This change adds a single new 'hardware type', 'idrac', for servers
    controlled by Dell EMC integrated Dell Remote Access Controllers
    (iDRAC).

    [0]
    http://specs.openstack.org/openstack/ironic-specs/specs/7.0/driver-composition-reform.html
    [1]
    http://specs.openstack.org/openstack/ironic-specs/specs/not-implemented/classic-drivers-future.html

    Change-Id: Ie332a5adebfc229a79663c771c9f4409174dd8c9
    Closes-Bug: #1705741
    Related-Bug: #1676387
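For context on what a 'hardware type' is, a hedged sketch of the general shape of such a declaration follows (illustrative only; the real idrac hardware type in ironic/drivers/drac.py lists more supported interfaces). Operators then enable it through enabled_hardware_types and the matching enabled_*_interfaces options in ironic.conf.

from ironic.drivers import generic
from ironic.drivers.modules.drac import management as drac_mgmt
from ironic.drivers.modules.drac import power as drac_power


class IDRACHardwareSketch(generic.GenericHardware):
    """Sketch of an iDRAC hardware type declaration."""

    @property
    def supported_management_interfaces(self):
        # Interface implementations are listed in order of preference.
        return [drac_mgmt.DracManagement]

    @property
    def supported_power_interfaces(self):
        return [drac_power.DracPower]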

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (stable/pike)

Reviewed: https://review.openstack.org/494736
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=8d63821bdd6df30d0a015cf427b42432e7a2603c
Submitter: Jenkins
Branch: stable/pike

commit 8d63821bdd6df30d0a015cf427b42432e7a2603c
Author: Richard Pioso <email address hidden>
Date: Sun Aug 13 21:10:06 2017 -0400

    Fix DRAC classic driver double manage/provide

    This change fixes an issue that caused a node using a Dell EMC
    integrated Dell Remote Access Controller (iDRAC) classic driver,
    'pxe_drac' or 'pxe_drac_inspector', to be placed in the 'clean failed'
    state after a double manage/provide cycle, instead of the 'available'
    state.

    The deploy interface implementation used by iDRAC classic drivers has
    been class ironic.drivers.modules.drac.deploy.DracDeploy, which is
    derived from class ironic.drivers.modules.iscsi_deploy.ISCSIDeploy. The
    only difference between them is that DracDeploy overrides the
    prepare_cleaning() method to prevent the booting of the Ironic Python
    Agent (IPA) ramdisk when only out-of-band RAID clean steps are
    requested. However, it caused the issue and did not have its intended
    effect, because Ironic Conductor boots the ramdisk regardless. The
    Ironic Conductor should be modified to preclude the booting of the IPA
    ramdisk, rather than leaving that fix to individual drivers.

    The iDRAC classic drivers' deploy interface implementation has been
    changed to ISCSIDeploy. Since class DracDeploy is no longer needed, its
    source code and automated tests have been removed.

    Change-Id: Ib2c9b7f9f780aaf5f6345825b1f6c9ddb4f9c41f
    Closes-Bug: #1676387
    Related-Bug: #1572529
    Related-Bug: #1705741
    (cherry picked from commit f049eae3fc6aadab69d9d9faf9de89bb1f7e4f38)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ironic (stable/pike)

Reviewed: https://review.openstack.org/494737
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=ccbcd9f52029a3ef7626636a9fbda17615eadb8b
Submitter: Jenkins
Branch: stable/pike

commit ccbcd9f52029a3ef7626636a9fbda17615eadb8b
Author: Richard Pioso <email address hidden>
Date: Sun Aug 6 13:01:43 2017 -0400

    Add 'hardware type' for Dell EMC iDRACs

    The ironic project is engaged in a driver composition reform [0] which
    introduces dynamic drivers. That reform places hardware vendors in
    charge of defining a new form of drivers for their hardware. The new
    form is called 'hardware type'. The reform will deprecate [1] the
    existing ironic drivers, which are now referred to as 'classic drivers'.

    This change adds a single new 'hardware type', 'idrac', for servers
    controlled by Dell EMC integrated Dell Remote Access Controllers
    (iDRAC).

    [0]
    http://specs.openstack.org/openstack/ironic-specs/specs/7.0/driver-composition-reform.html
    [1]
    http://specs.openstack.org/openstack/ironic-specs/specs/not-implemented/classic-drivers-future.html

    Change-Id: Ie332a5adebfc229a79663c771c9f4409174dd8c9
    Closes-Bug: #1705741
    Related-Bug: #1676387
    (cherry picked from commit 22e716d9808725efea18b8236dd4fe1616d754f2)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/ironic 9.1.0

This issue was fixed in the openstack/ironic 9.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/498412

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (stable/ocata)

Reviewed: https://review.openstack.org/498412
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=f68f92a917bd50ea21bd000287b8dd8d829b7a01
Submitter: Jenkins
Branch: stable/ocata

commit f68f92a917bd50ea21bd000287b8dd8d829b7a01
Author: Richard Pioso <email address hidden>
Date: Sun Aug 13 21:10:06 2017 -0400

    Fix DRAC classic driver double manage/provide

    This change fixes an issue that caused a node using a Dell EMC
    integrated Dell Remote Access Controller (iDRAC) classic driver,
    'pxe_drac' or 'pxe_drac_inspector', to be placed in the 'clean failed'
    state after a double manage/provide cycle, instead of the 'available'
    state.

    The deploy interface implementation used by iDRAC classic drivers has
    been class ironic.drivers.modules.drac.deploy.DracDeploy, which is
    derived from class ironic.drivers.modules.iscsi_deploy.ISCSIDeploy. The
    only difference between them is that DracDeploy overrides the
    prepare_cleaning() method to prevent the booting of the Ironic Python
    Agent (IPA) ramdisk when only out-of-band RAID clean steps are
    requested. However, it caused the issue and did not have its intended
    effect, because Ironic Conductor boots the ramdisk regardless. The
    Ironic Conductor should be modified to preclude the booting of the IPA
    ramdisk, rather than leaving that fix to individual drivers.

    The iDRAC classic drivers' deploy interface implementation has been
    changed to ISCSIDeploy. Since class DracDeploy is no longer needed, its
    source code and automated tests have been removed.

    Change-Id: Ib2c9b7f9f780aaf5f6345825b1f6c9ddb4f9c41f
    Closes-Bug: #1676387
    Related-Bug: #1572529
    Related-Bug: #1705741
    (cherry picked from commit 86e3a100a3a6fcfcf9d498886e1d7784f07ecd73)

tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/ironic 7.0.3

This issue was fixed in the openstack/ironic 7.0.3 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/ironic 9.2.0

This issue was fixed in the openstack/ironic 9.2.0 release.
