NFS based Nova Live Migration eratically fails

Bug #1617299 reported by Sebastian on 2016-08-26
This bug affects 2 people
Affects: OpenStack Compute (nova) | Importance: Medium | Assigned to: Tom Patzig
Affects: Newton | Importance: Medium | Assigned to: Matt Riedemann

Bug Description

Hello,

In our production OpenStack environment, Nova VM live migrations have been failing intermittently over the last few weeks.
Currently this is only visible in our automated test environment: an automated test starts every 15 minutes and fails 3-4 times a day.

A central NetApp NFS share is mounted on the Nova instances path to support true live migrations between different hypervisors.

When we analysed the issue, we found the following error message and traceback:
BadRequest: <Compute-Node> is not on shared storage: Live migration can not be used without shared storage except a booted from volume VM which does not have a local disk. (HTTP 400) (Request-ID: req-8e709fd1-9d72-453b-b4b1-1f26112ea3d3)

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/rally/task/runner.py", line 66, in _run_scenario_once
    getattr(scenario_inst, method_name)(**scenario_kwargs)
  File "/usr/lib/python2.7/site-packages/rally/plugins/openstack/scenarios/nova/servers.py", line 640, in boot_and_live_migrate_server
    block_migration, disk_over_commit)
  File "/usr/lib/python2.7/site-packages/rally/task/atomic.py", line 84, in func_atomic_actions
    f = func(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/rally/plugins/openstack/scenarios/nova/utils.py", line 721, in _live_migrate
    disk_over_commit=disk_over_commit)
  File "/usr/lib/python2.7/site-packages/novaclient/v2/servers.py", line 433, in live_migrate
    disk_over_commit)
  File "/usr/lib/python2.7/site-packages/novaclient/api_versions.py", line 370, in substitution
    return methods[-1].func(obj, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/novaclient/v2/servers.py", line 1524, in live_migrate
    'disk_over_commit': disk_over_commit})
  File "/usr/lib/python2.7/site-packages/novaclient/v2/servers.py", line 1691, in _action
    info=info, **kwargs)
  File "/usr/lib/python2.7/site-packages/novaclient/v2/servers.py", line 1702, in _action_return_resp_and_body
    return self.api.client.post(url, body=body)
  File "/usr/lib/python2.7/site-packages/novaclient/client.py", line 461, in post
    return self._cs_request(url, 'POST', **kwargs)
  File "/usr/lib/python2.7/site-packages/novaclient/client.py", line 436, in _cs_request
    resp, body = self._time_request(url, method, **kwargs)
  File "/usr/lib/python2.7/site-packages/novaclient/client.py", line 409, in _time_request
    resp, body = self.request(url, method, **kwargs)
  File "/usr/lib/python2.7/site-packages/novaclient/client.py", line 403, in request
    raise exceptions.from_response(resp, body, url, method)
BadRequest: <Compute-Node> is not on shared storage: Live migration can not be used without shared storage except a booted from volume VM which does not have a local disk. (HTTP 400) (Request-ID: req-8e709fd1-9d72-453b-b4b1-1f26112ea3d3)

We checked the affected hypervisors for problems with the NFS share/mount, but found none. The messages log file also shows no issues during the test timeframe.

The next step was to examine the Nova code for a hint as to why Nova raises this error.
There we found the procedure Nova uses to check whether the source and destination hypervisors share a filesystem.

In "nova/nova/virt/libvirt/driver.py"

In the function "check_can_live_migrate_destination", a temporary file is created on the destination hypervisor:

# Create file on storage, to be checked on source host
filename = self._create_shared_storage_test_file()

After that, in the same class, in the function "check_can_live_migrate_source":
dest_check_data.is_shared_instance_path = (
    self._check_shared_storage_test_file(
        dest_check_data.filename))

This code checks whether the temporary file exists. The check sometimes fails because the file is not yet visible on the source hypervisor, and the migration then aborts with the error message above:

elif not (dest_check_data.is_shared_block_storage or
          dest_check_data.is_shared_instance_path or
          (booted_from_volume and not has_local_disk)):
    reason = _("Live migration can not be used "
               "without shared storage except "
               "a booted from volume VM which "
               "does not have a local disk.")
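
For illustration, the handshake described above can be reproduced outside Nova. This is a minimal sketch, not Nova's implementation: the function names and the `instances_path` argument are hypothetical, and any directory can stand in for the NFS mount. It only shows the logic, namely that the destination creates a uniquely named file and the source treats the storage as shared if it can see that file.

```python
import os
import tempfile


def create_shared_storage_test_file(instances_path):
    """Destination side: create an empty, uniquely named file.

    Returns only the file name, which is what gets handed to the source.
    """
    fd, path = tempfile.mkstemp(dir=instances_path)
    os.close(fd)
    return os.path.basename(path)


def check_shared_storage_test_file(instances_path, filename):
    """Source side: storage counts as shared only if the file is visible."""
    return os.path.exists(os.path.join(instances_path, filename))


def cleanup_shared_storage_test_file(instances_path, filename):
    """Destination side: remove the test file once the check is done."""
    os.remove(os.path.join(instances_path, filename))
```

On a real NFS mount the two sides run on different hosts, so the source-side check can return False simply because the client has not yet picked up the new directory entry, which is the failure reported here.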

Tom Patzig (tom-patzig) on 2016-08-26
Changed in nova:
assignee: nobody → Tom Patzig (tom-patzig)
Tom Patzig (tom-patzig) wrote :

Just to add, the setup is Liberty based.
We also observed that it can take up to 30s until the test file created on the target becomes visible on the source HV, so the file-exists check fails in the meantime.

Matt Riedemann (mriedem) wrote :

You might want to talk to Timofey Durakov (tdurakov in IRC), he's been looking at the NFS-based live migration job we have in the upstream CI that has also been randomly failing.

He's been testing that here:

https://review.openstack.org/#/c/329466/

tags: added: live-migration nfs
Matt Riedemann (mriedem) wrote :

Can you hack nova to put in a configurable retry loop in the _check_shared_storage_test_file method? Maybe set that to a timeout of 30 seconds or something with a retry backoff decorator and see if it helps.
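
A rough sketch of what such a retry loop could look like, assuming Python 3 and plain standard-library calls. The function name `wait_for_file` and all of its parameters are hypothetical and are not existing Nova options. The `exists`, `sleep` and `clock` hooks exist only so that the loop can be exercised without real waiting.

```python
import os
import time


def wait_for_file(path, timeout=30.0, initial_delay=0.1, backoff=2.0,
                  max_delay=5.0, exists=os.path.exists, sleep=time.sleep,
                  clock=time.monotonic):
    """Poll for `path` until it exists or `timeout` seconds have passed.

    The delay between attempts grows by `backoff` each round, capped at
    `max_delay`. Returns True if the file showed up, False on timeout.
    """
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        if exists(path):
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(delay, remaining))
        delay = min(delay * backoff, max_delay)
```

With the 30 second default, a file that appears late would still be found, at the cost of delaying the "not on shared storage" answer by up to the full timeout when the storage really is not shared.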

Changed in nova:
importance: Undecided → Medium
status: New → Confirmed
summary: - Share based Nova Live Migration eratically fails
+ NFS based Nova Live Migration eratically fails
Sebastian (sebastian-schee) wrote :

We already tested this behavior and are pretty sure that this is the issue we are running into. The "debug" log entry is not helpful for our production environment, because we would have to enable debug logging on all HVs.

But our investigation showed that, at least with NFS, doing an "ls" or a "touch" on the Nova instances share forces the NFS client to fetch all updates from the NFS server. As a result, the temporary file becomes visible on the client immediately (within microseconds).
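
In Python, the "touch, then check" idea could be sketched roughly as follows. The function name is hypothetical and this is not the actual Nova patch. `os.utime(path, None)` sets the directory's access and modification times to the current time, like `touch` does. Whether that makes the NFS client refresh its view of the directory is the behavior we observed in our setup, not something the code itself guarantees.

```python
import os


def check_test_file_after_refresh(instances_path, filename):
    """Touch the instances directory, then look for the test file."""
    # Same effect as `touch <instances_path>`: set atime/mtime to now.
    os.utime(instances_path, None)
    return os.path.exists(os.path.join(instances_path, filename))
```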

Timofey Durakov (tdurakov) wrote :

Could you please attach logs from both compute nodes? I'm also interested in the temp file created during this check, especially its owner and group.

Fix proposed to branch: master
Review: https://review.openstack.org/366857

Changed in nova:
status: Confirmed → In Progress

Reviewed: https://review.openstack.org/365140
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=93419deb0af159fe8e5b1edb8551446928d73c6e
Submitter: Jenkins
Branch: master

commit 93419deb0af159fe8e5b1edb8551446928d73c6e
Author: Matt Riedemann <email address hidden>
Date: Fri Sep 2 17:02:05 2016 -0400

    libvirt: improve logging for shared storage check

    Log a message when checking if shared storage is being
    used during live migration, and add the instance for context
    in both source and dest tmp file methods.

    Change-Id: I6cca25708cab7c34163511590665ff2b5e3e8ea6
    Related-Bug: #1617299

Changed in nova:
assignee: Tom Patzig (tom-patzig) → Matt Riedemann (mriedem)
Matt Riedemann (mriedem) on 2016-10-01
Changed in nova:
assignee: Matt Riedemann (mriedem) → Tom Patzig (tom-patzig)

Reviewed: https://review.openstack.org/366857
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1af73d1fb3169c5b3cce77d94316922496bbaf9a
Submitter: Jenkins
Branch: master

commit 1af73d1fb3169c5b3cce77d94316922496bbaf9a
Author: Tom Patzig <email address hidden>
Date: Wed Sep 7 11:16:49 2016 +0200

    refresh instances_path when shared storage used

    When doing Live migration with shared storage, it happens erratically,
    that the check for the shared storage test_file fails. Because the shared
    volume is under heavy IO (many instances on many compute nodes) the client
    does not immediately sees the new content of the folder. This delay
    could take up to 30s.
    This can be fixed if the client is forced to refresh the directories
    content, which can be achieved by 'touch' on the directory. Doing so,
    the test_file is visibile instantly, within ms.
    The patch adds a 'touch' on instances_path in check_shared_storage_test_file,
    before checking the existence of the file.

    Change-Id: I16be39142278517f43e6eca3441a56cbc9561113
    Closes-Bug: #1617299

Changed in nova:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/381937
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=f7d619731286c4c76dcdadbf61aaac9644fece9a
Submitter: Jenkins
Branch: stable/newton

commit f7d619731286c4c76dcdadbf61aaac9644fece9a
Author: Tom Patzig <email address hidden>
Date: Wed Sep 7 11:16:49 2016 +0200

    refresh instances_path when shared storage used

    When doing Live migration with shared storage, it happens erratically,
    that the check for the shared storage test_file fails. Because the shared
    volume is under heavy IO (many instances on many compute nodes) the client
    does not immediately sees the new content of the folder. This delay
    could take up to 30s.
    This can be fixed if the client is forced to refresh the directories
    content, which can be achieved by 'touch' on the directory. Doing so,
    the test_file is visibile instantly, within ms.
    The patch adds a 'touch' on instances_path in check_shared_storage_test_file,
    before checking the existence of the file.

    Change-Id: I16be39142278517f43e6eca3441a56cbc9561113
    Closes-Bug: #1617299
    (cherry picked from commit 1af73d1fb3169c5b3cce77d94316922496bbaf9a)

This issue was fixed in the openstack/nova 14.0.1 release.

Reviewed: https://review.openstack.org/382159
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=eeb23c78914891a5a6943c09c87aceb720d45f58
Submitter: Jenkins
Branch: stable/mitaka

commit eeb23c78914891a5a6943c09c87aceb720d45f58
Author: Tom Patzig <email address hidden>
Date: Wed Sep 7 11:16:49 2016 +0200

    refresh instances_path when shared storage used

    When doing Live migration with shared storage, it happens erratically,
    that the check for the shared storage test_file fails. Because the shared
    volume is under heavy IO (many instances on many compute nodes) the client
    does not immediately sees the new content of the folder. This delay
    could take up to 30s.
    This can be fixed if the client is forced to refresh the directories
    content, which can be achieved by 'touch' on the directory. Doing so,
    the test_file is visibile instantly, within ms.
    The patch adds a 'touch' on instances_path in check_shared_storage_test_file,
    before checking the existence of the file.

    Conflicts:
        nova/tests/unit/virt/libvirt/test_driver.py

    NOTE(lyarwood): Conflict caused by the signature of
    _check_shared_storage_test_file changing as part of I6cca257

    Change-Id: I16be39142278517f43e6eca3441a56cbc9561113
    Closes-Bug: #1617299
    (cherry picked from commit 1af73d1fb3169c5b3cce77d94316922496bbaf9a)

tags: added: in-stable-mitaka

This issue was fixed in the openstack/nova 15.0.0.0b1 development milestone.

Change abandoned by Matt Riedemann (<email address hidden>) on branch: stable/liberty
Review: https://review.openstack.org/382160
Reason: liberty is end of life

This issue was fixed in the openstack/nova 13.1.3 release.
