NFS based Nova Live Migration erratically fails
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | | Medium | Tom Patzig |
Newton | | Medium | Matt Riedemann |
Bug Description
Hello,
Over the last few weeks, Nova VM live migrations in our production OpenStack environment have been failing intermittently.
Currently this is only visible in our automated test environment: an automated test starts every 15 minutes, and it fails 3-4 times a day.
A central NetApp NFS share is mounted on the Nova instances path to support true live migration between different hypervisors.
When we analysed the issue we found the following error message and trace:
BadRequest: <Compute-Node> is not on shared storage: Live migration can not be used without shared storage except a booted from volume VM which does not have a local disk. (HTTP 400) (Request-ID: req-8e709fd1-
Traceback (most recent call last):
File "/usr/lib/
getattr(
File "/usr/lib/
block_
File "/usr/lib/
f = func(self, *args, **kwargs)
File "/usr/lib/
disk_
File "/usr/lib/
disk_
File "/usr/lib/
return methods[
File "/usr/lib/
'disk_
File "/usr/lib/
info=info, **kwargs)
File "/usr/lib/
return self.api.
File "/usr/lib/
return self._cs_
File "/usr/lib/
resp, body = self._time_
File "/usr/lib/
resp, body = self.request(url, method, **kwargs)
File "/usr/lib/
raise exceptions.
BadRequest: <Compute-Node> is not on shared storage: Live migration can not be used without shared storage except a booted from volume VM which does not have a local disk. (HTTP 400) (Request-ID: req-8e709fd1-
We checked the affected hypervisors for problems with the NFS share or mount, but found none. The messages log file also shows no issues during the test timeframe.
The next step was to examine the Nova code for a hint as to why Nova raises this error.
There we found the procedure Nova uses to check whether the source and destination hypervisors share a filesystem.
In "nova/nova/
In function "check_
# Create file on storage, to be checked on source host
filename = self._create_
After that, in the same class, in function "check_
dest_check_
self.
is used to check whether the temporary file exists. This check sometimes fails, and the migration returns with this error message, because the file is not yet visible on the source hypervisor:
elif not (dest_check_
reason = _("Live migration can not be used "
"a booted from volume VM which "
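The check described above can be sketched in a few lines of Python. This is a simplified illustration, not the Nova implementation: the function names and the instances_path argument are hypothetical, and both sides run against the same directory here, whereas in Nova the file is created on the destination host and looked up on the source host.

```python
import os
import tempfile


def create_marker_file(instances_path):
    """Destination side: create an empty marker file, return its base name."""
    fd, tmp_path = tempfile.mkstemp(dir=instances_path)
    os.close(fd)
    return os.path.basename(tmp_path)


def marker_file_visible(instances_path, filename):
    """Source side: report whether the marker file is visible from here."""
    return os.path.exists(os.path.join(instances_path, filename))


def remove_marker_file(instances_path, filename):
    """Destination side: delete the marker file once the check is done."""
    os.remove(os.path.join(instances_path, filename))
```

If the source host does not see the marker file, the storage is treated as not shared. On NFS, a client whose cached view of the directory is stale can report the file as missing even though it exists on the server, which matches the intermittent failure described here.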
Changed in nova: | |
assignee: | nobody → Tom Patzig (tom-patzig) |
Tom Patzig (tom-patzig) wrote : | #1 |
Matt Riedemann (mriedem) wrote : | #2 |
You might want to talk to Timofey Durakov (tdurakov in IRC), he's been looking at the NFS-based live migration job we have in the upstream CI that has also been randomly failing.
He's been testing that here:
tags: | added: live-migration nfs |
Matt Riedemann (mriedem) wrote : | #3 |
Can you hack nova to put in a configurable retry loop in the _check_
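A configurable retry loop of the kind suggested here might look like the following sketch. The function name, the attempts and interval parameters, and the injectable sleep argument are all assumptions made for illustration; none of them are existing Nova options.

```python
import os
import time


def wait_for_file(path, attempts=5, interval=1.0, sleep=time.sleep):
    """Poll for path up to `attempts` times; return True once it exists."""
    for attempt in range(attempts):
        if os.path.exists(path):
            return True
        # Do not sleep after the final attempt; just report the failure.
        if attempt < attempts - 1:
            sleep(interval)
    return False
```

Passing sleep as a parameter keeps the loop testable without real delays. In a real deployment the attempt count and interval would presumably come from configuration.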
Changed in nova: | |
importance: | Undecided → Medium |
status: | New → Confirmed |
summary: |
- Share based Nova Live Migration erratically fails
+ NFS based Nova Live Migration erratically fails |
Related fix proposed to branch: master
Review: https:/
Sebastian (sebastian-schee) wrote : | #5 |
We already tested this behavior and are pretty sure that we are running into this issue. The "debug" log entry does not help in our production environment, because we would have to enable debug logging on all hypervisors.
Our investigation showed that, at least with NFS, running "ls" or "touch" on the Nova instances share forces the NFS client to fetch all updates from the NFS server. As a result, the temporary file becomes visible on the client immediately (within microseconds).
Timofey Durakov (tdurakov) wrote : | #6 |
Could you please attach logs from both compute nodes? I'm also interested in the temp file created during this check, especially its owner and group.
Fix proposed to branch: master
Review: https:/
Changed in nova: | |
status: | Confirmed → In Progress |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit 93419deb0af159f
Author: Matt Riedemann <email address hidden>
Date: Fri Sep 2 17:02:05 2016 -0400
libvirt: improve logging for shared storage check
Log a message when checking if shared storage is being
used during live migration, and add the instance for context
in both source and dest tmp file methods.
Change-Id: I6cca25708cab7c
Related-Bug: #1617299
Changed in nova: | |
assignee: | Tom Patzig (tom-patzig) → Matt Riedemann (mriedem) |
Changed in nova: | |
assignee: | Matt Riedemann (mriedem) → Tom Patzig (tom-patzig) |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit 1af73d1fb3169c5
Author: Tom Patzig <email address hidden>
Date: Wed Sep 7 11:16:49 2016 +0200
refresh instances_path when shared storage used
When doing live migration with shared storage, the check for the
shared storage test_file fails erratically. Because the shared volume
is under heavy IO (many instances on many compute nodes), the client
does not immediately see the new content of the folder. This delay
can be up to 30s.
This can be fixed by forcing the client to refresh the directory's
content, which can be achieved by a 'touch' on the directory. Doing so
makes the test_file visible instantly, within ms.
The patch adds a 'touch' on instances_path in check_shared_
before checking the existence of the file.
Change-Id: I16be3914227851
Closes-Bug: #1617299
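The approach in this commit message can be sketched as follows. This is an illustration under assumptions, not the merged patch: the function name and arguments are hypothetical, and os.utime is used here as the Python counterpart of 'touch' on a directory.

```python
import os


def marker_visible_after_refresh(instances_path, filename):
    """Touch the directory, then check for the marker file inside it."""
    # Updating the directory's timestamps has the same effect as running
    # `touch` on it. According to the commit message, on a busy NFS mount
    # this forces the client to refresh its view of the directory.
    os.utime(instances_path, None)
    return os.path.exists(os.path.join(instances_path, filename))
```

On a local filesystem the touch makes no observable difference to the result. The benefit described in the commit message applies to NFS clients with stale directory caches.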
Changed in nova: | |
status: | In Progress → Fix Released |
Fix proposed to branch: stable/newton
Review: https:/
Fix proposed to branch: stable/mitaka
Review: https:/
Fix proposed to branch: stable/liberty
Review: https:/
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: stable/newton
commit f7d619731286c4c
Author: Tom Patzig <email address hidden>
Date: Wed Sep 7 11:16:49 2016 +0200
refresh instances_path when shared storage used
When doing live migration with shared storage, the check for the
shared storage test_file fails erratically. Because the shared volume
is under heavy IO (many instances on many compute nodes), the client
does not immediately see the new content of the folder. This delay
can be up to 30s.
This can be fixed by forcing the client to refresh the directory's
content, which can be achieved by a 'touch' on the directory. Doing so
makes the test_file visible instantly, within ms.
The patch adds a 'touch' on instances_path in check_shared_
before checking the existence of the file.
Change-Id: I16be3914227851
Closes-Bug: #1617299
(cherry picked from commit 1af73d1fb3169c5
This issue was fixed in the openstack/nova 14.0.1 release.
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: stable/mitaka
commit eeb23c78914891a
Author: Tom Patzig <email address hidden>
Date: Wed Sep 7 11:16:49 2016 +0200
refresh instances_path when shared storage used
When doing live migration with shared storage, the check for the
shared storage test_file fails erratically. Because the shared volume
is under heavy IO (many instances on many compute nodes), the client
does not immediately see the new content of the folder. This delay
can be up to 30s.
This can be fixed by forcing the client to refresh the directory's
content, which can be achieved by a 'touch' on the directory. Doing so
makes the test_file visible instantly, within ms.
The patch adds a 'touch' on instances_path in check_shared_
before checking the existence of the file.
Conflicts:
NOTE(lyarwood): Conflict caused by the signature of
_check_
Change-Id: I16be3914227851
Closes-Bug: #1617299
(cherry picked from commit 1af73d1fb3169c5
tags: | added: in-stable-mitaka |
This issue was fixed in the openstack/nova 15.0.0.0b1 development milestone.
Change abandoned by Matt Riedemann (<email address hidden>) on branch: stable/liberty
Review: https:/
Reason: liberty is end of life
This issue was fixed in the openstack/nova 13.1.3 release.
Just to add, the setup is Liberty based.
We also observed that, for this test file existence check, it can take up to 30s until the file created on the target becomes visible on the source hypervisor.