Node gets stuck in 'deploying' when configdrive is too large

Bug #1745630 reported by Hironori Shiina
This bug affects 1 person
Affects: Ironic
Status: Fix Released
Importance: Low
Assigned to: Hironori Shiina
Milestone: (none)

Bug Description

As mentioned in another RFE[1], a configdrive may be larger than the instance_info field in the DB can hold. In this case, the node gets stuck in the 'deploying' state. After the configdrive is set on the node object[2], an exception is raised when node.save() is called. The node cannot be moved to 'deploy failed' because it cannot be saved with the new provision state while the large configdrive remains in instance_info of the node object.

[1] https://bugs.launchpad.net/ironic/+bug/1596421
[2] https://github.com/openstack/ironic/blob/c1cce7eb452c228dc2633e80f2c98fd142574fa9/ironic/conductor/manager.py#L3172
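
A minimal sketch of the failure mode described above, not ironic code. FakeNode, FakeDBError and deploy are hypothetical stand-ins, and the 2^16 - 1 byte limit is an assumption about the column type:

```python
import json

TEXT_MAX_BYTES = 2**16 - 1  # assumed column limit for instance_info


class FakeDBError(Exception):
    """Hypothetical stand-in for the DB error raised on an oversized value."""


class FakeNode:
    """Hypothetical, minimal stand-in for a node object."""

    def __init__(self):
        self.provision_state = "deploying"
        self.instance_info = {}
        self.saved_state = "deploying"  # what the "DB" currently holds

    def save(self):
        # Each save writes every pending change, including instance_info.
        if len(json.dumps(self.instance_info).encode()) > TEXT_MAX_BYTES:
            raise FakeDBError("instance_info too large")
        self.saved_state = self.provision_state


def deploy(node, configdrive):
    node.instance_info["configdrive"] = configdrive
    try:
        node.save()
    except FakeDBError:
        # The error path tries to record the failure, but the oversized
        # configdrive is still on the object, so this save fails as well.
        node.provision_state = "deploy failed"
        node.save()


node = FakeNode()
try:
    deploy(node, "x" * 100_000)
except FakeDBError:
    pass
print(node.saved_state)
```

In this sketch the stored state never leaves "deploying", because every later save carries the same oversized field.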

Michael Turek (mjturek) wrote :

So looking through the links Hironori provided, this happens when the config drive is larger than 64 KiB (the max size of a Text field is 2^16 - 1 bytes).

@hshiina - can you detail what exception gets thrown and possibly provide some logs around the exception being thrown? I would guess that we need to catch this in do_node_deploy and do a handle_failure there.

Changed in ironic:
importance: Undecided → Low
status: New → Incomplete
Hironori Shiina (shiina-hironori) wrote :

Here is the log of ironic-conductor:

2018-02-02 13:12:10.787 59087 INFO ironic.conductor.task_manager [req-f2bac1cc-206e-44d1-8f7e-45c3a7662095 492eb7f94a544916ac0eacfde9ee0754 e6b4771ada4a45bfbc715bea3507af24 - default default] Node 589cc2e4-b038-4595-b49a-29e5d4367717 moved to provision state "deploying" from state "available"; target provision state is "active"
2018-02-02 13:12:10.853 59087 ERROR root [req-f2bac1cc-206e-44d1-8f7e-45c3a7662095 492eb7f94a544916ac0eacfde9ee0754 e6b4771ada4a45bfbc715bea3507af24 - default default] Original exception being dropped: ['Traceback (most recent call last):\n
', ' File "/usr/lib/python2.7/site-packages/ironic/conductor/manager.py", line 2842, in do_node_deploy\n
    task.driver.deploy.prepare(task)\n
', ' File "/usr/lib/python2.7/site-packages/ironic_lib/metrics.py", line 61, in wrapped\n
    result = f(*args, **kwargs)\n
', ' File "/usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 148, in wrapper\n
    return f(*args, **kwargs)\n
', ' File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/iscsi_deploy.py", line 525, in prepare\n
    manager_utils.node_power_action(task, states.POWER_OFF)\n
', ' File "/usr/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 148, in wrapper\n
    return f(*args, **kwargs)\n
', ' File "/usr/lib/python2.7/site-packages/ironic/conductor/utils.py", line 210, in node_power_action\n
    if _can_skip_state_change(task, new_state):\n
', ' File "/usr/lib/python2.7/site-packages/ironic/conductor/utils.py", line 177, in _can_skip_state_change\n
    _not_going_to_change()\n
', ' File "/usr/lib/python2.7/site-packages/ironic/conductor/utils.py", line 150, in _not_going_to_change\n
    node.save()\n
', ' File "/usr/lib/python2.7/site-packages/ironic/objects/node.py", line 375, in save\n
    db_node = self.dbapi.update_node(self.uuid, updates)\n
', ' File "/usr/lib/python2.7/site-packages/ironic/db/sqlalchemy/api.py", line 429, in update_node\n
    return self._do_update_node(node_id, values)\n
', ' File "/usr/lib/python2.7/site-packages/ironic/db/sqlalchemy/api.py", line 465, in _do_update_node\n
    ref.update(values)\n
', ' File "/usr/lib64/python2.7/contextlib.py", line 24, in __exit__\n
    self.gen.next()\n
', ' File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 1029, in _transaction_scope\n
    yield resource\n
', ' File "/usr/lib64/python2.7/contextlib.py", line 24, in __exit__\n
    self.gen.next()\n
', ' File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 641, in _session\n
    self.session.rollback()\n
', ' File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n
    self.force_reraise()\n
', ' File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n
    six.reraise(self.type_, self.value, self.tb)\n
', ' File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 638, in _session\n
    self._end_session_transaction(self.session)\n
', ' File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 666, in _end_session_transaction\n
    session.c...

Hironori Shiina (shiina-hironori) wrote :

I guess this issue happened as follows:

At the first node.save() call after the configdrive was set on the node object, an error was raised[1][2][3].

[1] https://github.com/openstack/ironic/blob/9.1.2/ironic/conductor/manager.py#L2842
[2] https://github.com/openstack/ironic/blob/9.1.2/ironic/drivers/modules/iscsi_deploy.py#L525
[3] https://github.com/openstack/ironic/blob/9.1.2/ironic/conductor/utils.py#L150

"Original exception being dropped" means another exception was raised after the original exception was caught[4]. I guess node.save() failed while updating the provision state to DEPLOY_FAILED[5].

[4] https://github.com/openstack/ironic/blob/9.1.2/ironic/conductor/manager.py#L2844
[5] https://github.com/openstack/ironic/blob/9.1.2/ironic/conductor/manager.py#L2823

Finally, node.save() failed again when the worker was released[6][7].

[6] https://github.com/openstack/ironic/blob/9.1.2/ironic/conductor/task_manager.py#L425
[7] https://github.com/openstack/ironic/blob/9.1.2/ironic/conductor/task_manager.py#L380

This issue happened in an environment with Pike.
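
The sequence guessed above can be sketched in plain Python 3. This is only an illustration: save, do_node_deploy and run_task are hypothetical stand-ins rather than the ironic or oslo implementations, and Python 3 keeps the first error as __context__ where the Python 2.7 environment in the log could only report it as dropped.

```python
calls = []


def save(label):
    """Hypothetical stand-in for node.save() while the configdrive is too large."""
    calls.append(label)
    raise ValueError("save failed: " + label)


def do_node_deploy():
    try:
        save("power off in prepare")
    except ValueError:
        # The handler tries to record the failed state, which needs another
        # save. That save raises too and replaces the error being handled.
        save("set deploy failed")
        raise  # never reached in this sketch


def run_task():
    try:
        do_node_deploy()
    finally:
        # Releasing the task saves the node once more. The failure is
        # swallowed here only to keep the sketch short.
        try:
            save("release")
        except ValueError:
            pass


try:
    run_task()
except ValueError as exc:
    surfaced = exc
    original = exc.__context__

print(surfaced)
print(original)
print(calls)
```

The error that surfaces comes from the second save, and all three saves fail for the same reason.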

Hironori Shiina (shiina-hironori) wrote :

I think this can be fixed by calling node.save() when the configdrive is set on the object[1]. If an exception is raised, we remove the configdrive from the object and re-raise the exception.

[1] https://github.com/openstack/ironic/blob/master/ironic/conductor/manager.py#L3169
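
A rough sketch of that idea, not the actual patch. FakeNode, FakeDBError, store_configdrive and deploy are hypothetical names, and the size limit is an assumed value:

```python
import json

TEXT_MAX_BYTES = 2**16 - 1  # assumed column limit for instance_info


class FakeDBError(Exception):
    """Hypothetical stand-in for the DB error raised on an oversized value."""


class FakeNode:
    """Hypothetical, minimal stand-in for a node object."""

    def __init__(self):
        self.provision_state = "deploying"
        self.instance_info = {}
        self.saved_state = "deploying"  # what the "DB" currently holds

    def save(self):
        if len(json.dumps(self.instance_info).encode()) > TEXT_MAX_BYTES:
            raise FakeDBError("instance_info too large")
        self.saved_state = self.provision_state


def store_configdrive(node, configdrive):
    """Save right after setting the configdrive; undo the change on failure."""
    node.instance_info["configdrive"] = configdrive
    try:
        node.save()
    except Exception:
        # Drop the oversized value so that later saves can succeed.
        node.instance_info.pop("configdrive", None)
        raise


def deploy(node, configdrive):
    try:
        store_configdrive(node, configdrive)
    except FakeDBError:
        node.provision_state = "deploy failed"
        node.save()  # succeeds now that the configdrive is gone


node = FakeNode()
deploy(node, "x" * 100_000)
print(node.saved_state)
```

With the configdrive removed before the error is re-raised, the failure state can be written in this sketch and the node does not stay in "deploying".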

Michael Turek (mjturek)
Changed in ironic:
status: Incomplete → Triaged
Kaifeng Wang (kaifeng) wrote :

I ran into the same issue this week. When nova is configured to use vfat as the configdrive format, the size is approximately 93K, which exceeds the MySQL TEXT type.

Basically, the issue is what Hironori Shiina pointed out, and I think the problem is serious: once a node is stuck in the deploying state, we have to restart the conductor service to move it back.

We can avoid this issue by adding node.save() in _store_configdrive and catching the DBDataError exception around it.

Although we rarely reach the maximum size of 64M, 64K still seems too small to me. I think we can at least upgrade instance_info to the mediumtext type, which can hold up to 16M. Any ideas?
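
For reference, the sizes mentioned here can be checked with a little arithmetic. The sketch assumes MySQL's documented limits for TEXT, MEDIUMTEXT and LONGTEXT, reads "93K" and "64M" as KiB and MiB, and ignores the rest of instance_info and any encoding overhead. smallest_fitting_type is a hypothetical helper.

```python
# Maximum lengths in bytes (assumed from MySQL's documented type limits).
COLUMN_LIMITS = {
    "TEXT": 2**16 - 1,        # 65,535
    "MEDIUMTEXT": 2**24 - 1,  # 16,777,215
    "LONGTEXT": 2**32 - 1,    # 4,294,967,295
}


def smallest_fitting_type(size_bytes):
    """Return the first listed column type that can hold size_bytes."""
    for name, limit in COLUMN_LIMITS.items():
        if size_bytes <= limit:
            return name
    return None


vfat_configdrive = 93 * 1024        # about 93K, as observed above
max_configdrive = 64 * 1024 * 1024  # the 64M maximum mentioned above

print(smallest_fitting_type(vfat_configdrive))
print(smallest_fitting_type(max_configdrive))
```

Under those assumptions a 93K configdrive needs MEDIUMTEXT, while a configdrive at the 64M maximum would not fit in MEDIUMTEXT either.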

Hironori Shiina (shiina-hironori) wrote :

Regarding the size of the DB column, it seems that the ironic team decided not to store the configdrive in the DB, according to the discussion linked in another RFE[1]. The stored configdrive is reused when rebuilding. A configdrive can be passed to the rebuild API from API version 1.35[2]. I guess it would be necessary to add a way, such as a new config parameter, not to store the configdrive in the DB for users of the new API.

[1] https://bugs.launchpad.net/ironic/+bug/1596421
[2] https://bugs.launchpad.net/ironic/+bug/1575935

Kaifeng Wang (kaifeng) wrote :

I don't see that the issue is addressed by the solutions in the referenced links; for now we still save the configdrive to the database.
I think we should at least add some protection here.

Changed in ironic:
assignee: nobody → Hironori Shiina (shiina-hironori)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/542097

Changed in ironic:
status: Triaged → In Progress
Hironori Shiina (shiina-hironori) wrote :

@Kaifeng,
Yes, we still store the configdrive in the DB. More changes are necessary for the ideal solution with API 1.35 in the future. We also need a fix now to avoid getting a node stuck in 'deploying'.

OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (master)

Reviewed: https://review.openstack.org/542097
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=927c487a0f822ab789f38be5f895fe061e5e71cb
Submitter: Zuul
Branch: master

commit 927c487a0f822ab789f38be5f895fe061e5e71cb
Author: Hironori Shiina <email address hidden>
Date: Thu Feb 8 18:03:09 2018 +0900

    Remove too large configdrive for handling error

    When configdrive is too large, a node object cannot be saved to DB. If
    it happens, the node cannot be moved to DEPLOYFAIL because saving the node
    is prevented again by the large configdrive in the object. In this
    case, the node gets stuck in DEPLOYING, which doesn't allow any state
    transition.

    This patch removes the configdrive from a node object when storing the
    configdrive fails. This also catches ConfigInvalid exception, which is
    mentioned in the docstring, and any unexpected exception from
    _store_configdrive() to avoid getting a node stuck in DEPLOYING.

    Change-Id: I83cf3e02622fc3ed8f5b5389f533e374c1b985f3
    Closes-Bug: 1745630

Changed in ironic:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/544877

OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (stable/queens)

Reviewed: https://review.openstack.org/544877
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=4545139c755a9da9538ded50a8f51f9453fc5b48
Submitter: Zuul
Branch: stable/queens

commit 4545139c755a9da9538ded50a8f51f9453fc5b48
Author: Hironori Shiina <email address hidden>
Date: Thu Feb 8 18:03:09 2018 +0900

    Remove too large configdrive for handling error

    When configdrive is too large, a node object cannot be saved to DB. If
    it happens, the node cannot be moved to DEPLOYFAIL because saving the node
    is prevented again by the large configdrive in the object. In this
    case, the node gets stuck in DEPLOYING, which doesn't allow any state
    transition.

    This patch removes the configdrive from a node object when storing the
    configdrive fails. This also catches ConfigInvalid exception, which is
    mentioned in the docstring, and any unexpected exception from
    _store_configdrive() to avoid getting a node stuck in DEPLOYING.

    Change-Id: I83cf3e02622fc3ed8f5b5389f533e374c1b985f3
    Closes-Bug: 1745630
    (cherry picked from commit 927c487a0f822ab789f38be5f895fe061e5e71cb)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/ironic 10.1.1

This issue was fixed in the openstack/ironic 10.1.1 release.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/560845

OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (stable/pike)

Reviewed: https://review.openstack.org/560845
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=a55331cf084f0dd12796581605585d5bdfb22ecb
Submitter: Zuul
Branch: stable/pike

commit a55331cf084f0dd12796581605585d5bdfb22ecb
Author: Hironori Shiina <email address hidden>
Date: Thu Feb 8 18:03:09 2018 +0900

    Remove too large configdrive for handling error

    When configdrive is too large, a node object cannot be saved to DB. If
    it happens, the node cannot be moved to DEPLOYFAIL because saving the node
    is prevented again by the large configdrive in the object. In this
    case, the node gets stuck in DEPLOYING, which doesn't allow any state
    transition.

    This patch removes the configdrive from a node object when storing the
    configdrive fails. This also catches ConfigInvalid exception, which is
    mentioned in the docstring, and any unexpected exception from
    _store_configdrive() to avoid getting a node stuck in DEPLOYING.

    (cherry picked from commit 927c487a0f822ab789f38be5f895fe061e5e71cb)
    Change-Id: I83cf3e02622fc3ed8f5b5389f533e374c1b985f3
    Closes-Bug: 1745630

tags: added: in-stable-pike
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/ironic 9.1.5

This issue was fixed in the openstack/ironic 9.1.5 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/ironic 11.0.0

This issue was fixed in the openstack/ironic 11.0.0 release.
