Failed to live-migrate instance in cell with microversion >= 2.34

Bug #1716903 reported by Yikun Jiang
This bug affects 1 person
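+--------------------------+---------------+------------+----------------+-----------+
| Affects                  | Status        | Importance | Assigned to    | Milestone |
+--------------------------+---------------+------------+----------------+-----------+
| OpenStack Compute (nova) | Fix Released  | High       | Matt Riedemann |           |
| Pike                     | Fix Committed | High       | Matt Riedemann |           |
+--------------------------+---------------+------------+----------------+-----------+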

Bug Description

Step 1: Create an instance in cell1
+--------------------------------------+--------+--------+------------+-------------+---------------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+--------+--------+------------+-------------+---------------------------------+
| 84038890-8d70-45e1-8240-2303f4227e11 | yikun1 | ACTIVE | - | Running | public=2001:db8::a, 172.24.4.13 |
+--------------------------------------+--------+--------+------------+-------------+---------------------------------+

Step 2: Live-migrate the instance
nova live-migration 84038890-8d70-45e1-8240-2303f4227e11

Step 3: Check the instance status
The instance is stuck in the "MIGRATING" state:
+--------------------------------------+--------+-----------+------------+-------------+---------------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+--------+-----------+------------+-------------+---------------------------------+
| 84038890-8d70-45e1-8240-2303f4227e11 | yikun1 | MIGRATING | migrating | Running | public=2001:db8::a, 172.24.4.13 |
+--------------------------------------+--------+-----------+------------+-------------+---------------------------------+

It seems we need to add the @targets_cell decorator to the **live_migrate_instance** method in conductor:
https://github.com/openstack/nova/blob/master/nova/conductor/manager.py#L378

ERROR LOG in super conductor:
Exception during message handling: InstanceActionNotFound: Action for request_id req-5aa03558-ae14-458e-9c35-c3d377c7ce45 on instance 84038890-8d70-45e1-8240-2303f4227e11 not found
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py", line 160, in _process_incoming
    res = self.dispatcher.dispatch(message)
  File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 213, in dispatch
    return self._do_dispatch(endpoint, method, ctxt, args)
  File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 183, in _do_dispatch
    result = func(ctxt, **new_args)
  File "/opt/stack/nova/nova/compute/utils.py", line 875, in decorated_function
    with EventReporter(context, event_name, instance_uuid):
  File "/opt/stack/nova/nova/compute/utils.py", line 846, in __enter__
    self.context, uuid, self.event_name, want_result=False)
  File "/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 184, in wrapper
    result = fn(cls, context, *args, **kwargs)
  File "/opt/stack/nova/nova/objects/instance_action.py", line 169, in event_start
    db_event = db.action_event_start(context, values)
  File "/opt/stack/nova/nova/db/api.py", line 1957, in action_event_start
    return IMPL.action_event_start(context, values)
  File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 250, in wrapped
    return f(context, *args, **kwargs)
  File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 6155, in action_event_start
    instance_uuid=values['instance_uuid'])
InstanceActionNotFound: Action for request_id req-5aa03558-ae14-458e-9c35-c3d377c7ce45 on instance 84038890-8d70-45e1-8240-2303f4227e11 not found

Changed in nova:
assignee: nobody → Yikun Jiang (yikunkero)
status: New → In Progress
Yikun Jiang (yikunkero)
description: updated
Zhenyu Zheng (zhengzhenyu) wrote : Re: Failed to live-migrate instance in cell.
summary: - Failed to live-migrate instance in cell1.
+ Failed to live-migrate instance in cell.
Matt Riedemann (mriedem) wrote :

This needs more details. When we look up the instance in the API, we target the context to the cell that the instance is in, and that context is then used when the API calls conductor, so we should be fine. We also have live migration CI tests that are not failing, so this seems invalid.

Matt Riedemann (mriedem) wrote :

Is your super conductor nova.conf configured to hit the [api_database]?

Changed in nova:
status: In Progress → Incomplete
Yikun Jiang (yikunkero) wrote :

Sorry for the lack of information I gave.

In this case, the API casts the live-migrate RPC to the super conductor.
The default DB of the super conductor is cell0, but the instance action
information for this instance is stored in cell1, and the super conductor
is not aware of this without the targets_cell decorator.

According to the implementation of bp/cells-aware-api, I found that the RPC cast
from compute to conductor DOESN'T refresh the transport of the conductor RPC
client. So it casts to the super conductor, and the ctxt in conductor doesn't
contain the MQ or DB connection that was refreshed in the API.

That is,
1. The API currently casts the message to the **super conductor** (yes, not the cell conductor).
2. The super conductor only learns the instance's db_connection through the
targets_cell decorator.

Unfortunately, the live_migrate_instance method doesn't have this decorator.
I think Dan might have missed this method when adding decorators to
every super conductor operation (https://review.openstack.org/#/c/438022/).

Thus, to let the super conductor know which DB the instance uses,
we should add the decorator to the live_migrate_instance method.
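
The failure mode described above can be reproduced in a small, self-contained toy model. This is only a sketch and not nova's code: CELL_DBS, INSTANCE_MAPPINGS, Context, and the two live_migrate_* functions are hypothetical stand-ins, and the toy targets_cell below only imitates the idea of pointing the context at the instance's cell before the action event is recorded.

```python
import functools

# Hypothetical stand-ins for the per-cell databases. Each one maps
# (request_id, instance_uuid) to an instance action record.
CELL_DBS = {
    "cell0": {},
    "cell1": {("req-1", "inst-1"): {"action": "live-migration"}},
}

# Hypothetical stand-in for the instance mappings in the API database.
INSTANCE_MAPPINGS = {"inst-1": "cell1"}


class InstanceActionNotFound(Exception):
    pass


class Context:
    def __init__(self, request_id):
        self.request_id = request_id
        # None means "fall back to the process's default database".
        self.db_connection = None


def targets_cell(fn):
    """Toy decorator: target the context at the instance's cell."""
    @functools.wraps(fn)
    def wrapper(context, instance_uuid):
        cell = INSTANCE_MAPPINGS.get(instance_uuid)
        if cell is not None:
            context.db_connection = cell
        return fn(context, instance_uuid)
    return wrapper


def action_event_start(context, instance_uuid, default_db="cell0"):
    """Look up the action record in whichever DB the context points at."""
    db = CELL_DBS[context.db_connection or default_db]
    key = (context.request_id, instance_uuid)
    if key not in db:
        raise InstanceActionNotFound(
            "Action for request_id %s on instance %s not found" % key)
    return db[key]


def live_migrate_untargeted(context, instance_uuid):
    # No cell targeting: the lookup runs against the default DB (cell0).
    return action_event_start(context, instance_uuid)


@targets_cell
def live_migrate_targeted(context, instance_uuid):
    # The context was targeted first, so the lookup runs against cell1.
    return action_event_start(context, instance_uuid)
```

In this model, live_migrate_untargeted(Context("req-1"), "inst-1") raises InstanceActionNotFound because the record lives in cell1, while live_migrate_targeted with the same arguments returns the record.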

Changed in nova:
status: Incomplete → In Progress
Yikun Jiang (yikunkero) wrote :

> Is your super conductor nova.conf configured to hit the [api_database]?

@matt It seems not; my config is below:

### /etc/nova/nova_cell1.conf
[DEFAULT]
transport_url = rabbit://stackrabbit:1@XXXXX:5672/nova_cell1
[database]
connection = mysql+pymysql://root:1@127.0.0.1/nova_cell1?charset=utf8

### /etc/nova/nova.conf
[DEFAULT]
transport_url = rabbit://stackrabbit:1@XXXXX:5672/
[database]
connection = mysql+pymysql://root:1@127.0.0.1/nova_cell0?charset=utf8
[api_database]
connection = mysql+pymysql://root:1@127.0.0.1/nova_api?charset=utf8

### super-conductor process
/usr/bin/python /usr/local/bin/nova-conductor --config-file /etc/nova/nova.conf

### cell-conductor process
/usr/bin/python /usr/local/bin/nova-conductor --config-file /etc/nova/nova_cell1.conf
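
As a quick way to see which database each conductor falls back to, the two config files above can be parsed with Python's standard configparser. This is only an illustrative sketch: nova itself reads its configuration through oslo.config, and the database_name helper is a hypothetical name introduced here.

```python
import configparser
from urllib.parse import urlsplit

# Contents copied from the two config files shown above.
SUPER_CONDUCTOR_CONF = """
[DEFAULT]
transport_url = rabbit://stackrabbit:1@XXXXX:5672/
[database]
connection = mysql+pymysql://root:1@127.0.0.1/nova_cell0?charset=utf8
[api_database]
connection = mysql+pymysql://root:1@127.0.0.1/nova_api?charset=utf8
"""

CELL_CONDUCTOR_CONF = """
[DEFAULT]
transport_url = rabbit://stackrabbit:1@XXXXX:5672/nova_cell1
[database]
connection = mysql+pymysql://root:1@127.0.0.1/nova_cell1?charset=utf8
"""


def database_name(conf_text, section="database"):
    """Return the database name from a section's connection URL, or None."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    if not parser.has_option(section, "connection"):
        return None
    url = parser.get(section, "connection")
    # The database name is the path component of the connection URL.
    return urlsplit(url).path.lstrip("/")
```

With these inputs, database_name(SUPER_CONDUCTOR_CONF) gives nova_cell0 and database_name(CELL_CONDUCTOR_CONF) gives nova_cell1, which matches the statement that the super conductor's default DB is cell0 while the instance's records live in cell1.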

description: updated
Yikun Jiang (yikunkero) wrote :

BTW, without the @targets_cell decorator:
1. (Before casting) context.db_connection in the API is the **cell1** DB connection.
I printed the API context before casting:
context.db_connection._root._root_factory.__dict__ ==>'connection': u'mysql+pymysql://root:1@127.0.0.1/nova_cell1

2. (After casting) context.db_connection in the super conductor is None.
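
One way to picture why the connection is lost: the context is serialized for the RPC cast, and a live connection handle is process-local state that does not travel with it. The sketch below assumes exactly that. ToyRequestContext and its to_dict/from_dict are hypothetical and are not nova's RequestContext.

```python
class ToyRequestContext:
    """Toy context used only to illustrate the serialization round trip."""

    def __init__(self, request_id, db_connection=None):
        self.request_id = request_id
        # Process-local handle; stands in for the targeted DB connection.
        self.db_connection = db_connection

    def to_dict(self):
        # Only plain, serializable fields are sent over RPC (assumption).
        return {"request_id": self.request_id}

    @classmethod
    def from_dict(cls, values):
        # The receiving side rebuilds the context without any connection.
        return cls(request_id=values["request_id"])


# API side: the context has been targeted at cell1.
api_ctxt = ToyRequestContext("req-1", db_connection="cell1-db-handle")

# Conductor side: the context is rebuilt from the serialized form.
conductor_ctxt = ToyRequestContext.from_dict(api_ctxt.to_dict())
```

After the round trip, conductor_ctxt.db_connection is None even though api_ctxt.db_connection is still set, so the receiving side has to re-target the context itself.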

Changed in nova:
status: In Progress → Incomplete
Changed in nova:
status: Incomplete → In Progress
Matt Riedemann (mriedem) wrote :

The reason the live migration CI job didn't catch this is that it doesn't test microversion >= 2.34, which switches to the live_migrate_instance method. With microversion < 2.34, it uses the migrate_server method, which has the @targets_cell decorator.
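
The microversion split can be sketched as a small dispatch function. This is a hypothetical illustration rather than nova's API code: parse_microversion and conductor_method_for are names made up here, and only the 2.34 threshold and the two conductor method names come from this report.

```python
def parse_microversion(value):
    """Turn a string such as '2.34' into a comparable (major, minor) tuple."""
    major, minor = value.split(".")
    return int(major), int(minor)


def conductor_method_for(microversion):
    """Return the conductor method name the live-migrate request would hit."""
    if parse_microversion(microversion) >= (2, 34):
        # Path that was missing @targets_cell before the fix.
        return "live_migrate_instance"
    # Older path, which already had @targets_cell.
    return "migrate_server"
```

Comparing integer tuples rather than strings or floats matters here: as a string, "2.9" sorts after "2.34", but (2, 9) is correctly less than (2, 34). A CI job that only exercises lower microversions therefore stays on the migrate_server path and never reaches the undecorated method.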

summary: - Failed to live-migrate instance in cell.
+ Failed to live-migrate instance in cell with microversion >= 2.34
Changed in nova:
importance: Undecided → High
tags: added: cells live-migration
Changed in nova:
assignee: Yikun Jiang (yikunkero) → Matt Riedemann (mriedem)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/505285

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/503601
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=062f5b2e876a09119e43c1905f91610cd4e5d015
Submitter: Jenkins
Branch: master

commit 062f5b2e876a09119e43c1905f91610cd4e5d015
Author: Yikun Jiang <email address hidden>
Date: Wed Sep 13 19:35:49 2017 +0800

    Add @targets_cell for live_migrate_instance method in conductor

    With microversion < 2.34, the API casts to the migrate_server
    method in super conductor which targets the context using the
    @targets_cell decorator.

    With microversion >= 2.34, the API casts to the live_migrate_instance
    method in super conductor which does not use the @targets_cell
    decorator, which results in a failure to lookup the instance action
    record when recording the start of the action event with the
    @wrap_instance_event decorator.

    This change simply adds the decorator and provides a test which
    was missing for this before. Note that the live migration CI job
    didn't catch this regression since it only tests up to microversion
    2.26.

    Co-Authored-By: Matt Riedemann <email address hidden>

    Closes-bug: #1716903
    Change-Id: I21d3f3b7589221b7e0a46c332510afc876ca5a79

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/505285
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a9f9e70dd08295618731445f38af2031adc859f1
Submitter: Jenkins
Branch: stable/pike

commit a9f9e70dd08295618731445f38af2031adc859f1
Author: Yikun Jiang <email address hidden>
Date: Wed Sep 13 19:35:49 2017 +0800

    Add @targets_cell for live_migrate_instance method in conductor

    With microversion < 2.34, the API casts to the migrate_server
    method in super conductor which targets the context using the
    @targets_cell decorator.

    With microversion >= 2.34, the API casts to the live_migrate_instance
    method in super conductor which does not use the @targets_cell
    decorator, which results in a failure to lookup the instance action
    record when recording the start of the action event with the
    @wrap_instance_event decorator.

    This change simply adds the decorator and provides a test which
    was missing for this before. Note that the live migration CI job
    didn't catch this regression since it only tests up to microversion
    2.26.

    Co-Authored-By: Matt Riedemann <email address hidden>

    Closes-bug: #1716903
    Change-Id: I21d3f3b7589221b7e0a46c332510afc876ca5a79
    (cherry picked from commit 062f5b2e876a09119e43c1905f91610cd4e5d015)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 16.0.1

This issue was fixed in the openstack/nova 16.0.1 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.0.0b1

This issue was fixed in the openstack/nova 17.0.0.0b1 development milestone.
