nova-compute unexpected input/output errors on starting instances (NFS + image-cache)

Bug #1621818 reported by Joris S'heeren
Affects                   Status         Importance  Assigned to     Milestone
OpenStack Compute (nova)  Fix Released   Medium      Matt Riedemann
Mitaka                    Won't Fix      Undecided   Joris S'heeren
Newton                    Fix Committed  Low         Lee Yarwood

Bug Description

Our setup consists of multiple controllers and multiple hypervisors. The shared storage for the instances is on an NFS 4.1 export. We are using Ubuntu 16.04 LTS and OpenStack Mitaka.

When we launch an instance, nova updates the mtime for the _base image to let the image cache manager know the image is actively used. I think this was added here: https://review.openstack.org/gitweb?p=openstack/nova.git;a=commitdiff;h=fb6ca3e7c8a38328d384cd41c061ded6623dac90
Because of this, in our setup, we are seeing unexpected input/output errors:

Stderr: u"/bin/touch: setting times of '/var/lib/nova/instances/_base/79e34519bacb47ad6f64c4baca4d33fd5c57d34d': Input/output error

A full trace can be found here: http://paste.openstack.org/show/570161/

This error particularly shows itself when launching multiple instances at once.
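
To illustrate why the mtime refresh described above matters, here is a minimal sketch of an age check of the kind an image cache cleaner could apply. It is not nova's implementation: the function names and the 24-hour threshold are hypothetical, and `os.utime` stands in for the `touch` command.

```python
import os
import time

# Hypothetical threshold for illustration only; not taken from this report.
MIN_AGE_SECONDS = 24 * 60 * 60


def is_recently_used(path, min_age_seconds=MIN_AGE_SECONDS, now=None):
    """Return True if the file's mtime is newer than the age threshold."""
    now = time.time() if now is None else now
    age = now - os.path.getmtime(path)
    return age < min_age_seconds


def refresh_mtime(path):
    """Set atime and mtime to the current time.

    Unlike `touch -c`, this raises FileNotFoundError if the file is missing.
    """
    os.utime(path, None)
```

A base image whose mtime is refreshed on every launch keeps passing the check, so it is not treated as stale while instances still depend on it.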

Also, because of this error, the instances are rescheduled. The assigned neutron ports, however, are not deleted. This results in multiple IPs being assigned to the instances, with only one of them UP. It also results in attached floating IPs not working.
This is similar to https://bugs.launchpad.net/nova/+bug/1609526: nova should tell neutron either to delete the unused port or to update it instead of creating a new one.

Some more info on our environment:
----------------------------------
Using libvirt + kvm, neutron with openvswitch L3 HA

# dpkg -l | grep nova
ii nova-common 2:13.0.0-0ubuntu2 all OpenStack Compute - common files
ii nova-compute 2:13.0.0-0ubuntu2 all OpenStack Compute - compute node base
ii nova-compute-kvm 2:13.0.0-0ubuntu2 all OpenStack Compute - compute node (KVM)
ii nova-compute-libvirt 2:13.0.0-0ubuntu2 all OpenStack Compute - compute node libvirt support
ii python-nova 2:13.0.0-0ubuntu2 all OpenStack Compute Python libraries
ii python-novaclient 2:3.3.1-2 all client library for OpenStack Compute API - Python 2.7

# dpkg -l |grep libvirt
ii libvirt-bin 1.3.1-1ubuntu10.1 amd64 programs for the libvirt library
ii libvirt0:amd64 1.3.1-1ubuntu10.1 amd64 library for interfacing with different virtualization systems
ii nova-compute-libvirt 2:13.0.0-0ubuntu2 all OpenStack Compute - compute node libvirt support
ii python-libvirt 1.3.1-1ubuntu1 amd64 libvirt Python bindings

description: updated
Matt Riedemann (mriedem)
tags: added: compute image-cache nfs
Changed in nova:
status: New → Confirmed
summary: - nova-compute unexpected input/output errors on starting instances
+ nova-compute unexpected input/output errors on starting instances (NFS +
+ image-cache)
Matt Riedemann (mriedem) wrote :

Is there any more useful information in dmesg or syslog when this fails? Is this 100% fail or intermittent, i.e. a timing issue?

Does the image cache base directory exist?

/var/lib/nova/instances/_base/

And can nova write to it?

Changed in nova:
status: Confirmed → Incomplete
Joris S'heeren (jsheeren) wrote :

Unfortunately, there is no useful information in dmesg or syslog.

The failures are intermittent. We see the error when launching many instances at once from the same _base image.

The image cache base directory exists and nova can write to it:

root@compute:/var/lib/nova/instances# ls -l
total 256
drwxr-xr-x 2 nova nova 4096 Aug 17 14:09 02db8511-2f20-41da-bcc2-797a9bbbe63b
... snip ...
drwxr-xr-x 2 nova nova 4096 Aug 29 17:24 bab8ddbf-c483-4462-9273-755812d84903
drwxr-xr-x 2 nova nova 4096 Sep 7 13:33 _base
drwxr-xr-x 2 nova nova 4096 Sep 9 17:10 c3251e4f-4c0e-42d8-a039-78ed9263b46c
... snip

root@compute:/var/lib/nova/instances/_base# ls -la
total 34802256
drwxr-xr-x 2 nova nova 4096 Sep 7 13:33 .
drwxr-xr-x 65 nova nova 8192 Sep 9 17:10 ..
-rw-r--r-- 1 libvirt-qemu kvm 8589934592 Sep 8 12:50 21171f1738d671d6801abab7196e4a5460c57af9
-rw-r--r-- 1 libvirt-qemu kvm 16105807872 Sep 9 09:13 3e58771f795c5e889445b424cbce395a69bbfb08
... snip

The NFS mount is:
1.2.3.4:/data on /var/lib/nova/instances type nfs4 (rw,relatime,vers=4.1,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=5.6.7.8,local_lock=none,addr=1.2.3.4)

We can reproduce it outside of nova by creating a file of a certain size inside the NFS export, then running the touch operation on it in one loop while copying it elsewhere in a second loop.
Every now and then we see the input/output error.
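
The reproduction described above can be sketched roughly as follows. This is an approximation under stated assumptions, not the exact commands that were run: `os.utime` stands in for `touch`, `shutil.copyfile` stands in for the copy, and the function names and iteration count are hypothetical.

```python
import os
import shutil
import threading


def touch_loop(path, iterations, errors):
    """Repeatedly update the timestamps of path, recording any OSError."""
    for _ in range(iterations):
        try:
            os.utime(path, None)
        except OSError as exc:
            errors.append(exc)


def copy_loop(src, dst, iterations, errors):
    """Repeatedly copy src to dst, recording any OSError."""
    for _ in range(iterations):
        try:
            shutil.copyfile(src, dst)
        except OSError as exc:
            errors.append(exc)


def reproduce(src, dst, iterations=100):
    """Run the touch loop and the copy loop concurrently; return the errors seen."""
    errors = []
    threads = [
        threading.Thread(target=touch_loop, args=(src, iterations, errors)),
        threading.Thread(target=copy_loop, args=(src, dst, iterations, errors)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return errors
```

Calling `reproduce(src, dst)` with `src` on the NFS export and `dst` anywhere else returns the list of errors collected. On a local filesystem the list is expected to stay empty.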

Joris S'heeren (jsheeren) wrote :

I've submitted a patch to fix this issue: https://review.openstack.org/#/c/368590/

With this patch, the execute call is wrapped in a try/except block to catch possible errors. It also logs a warning with the path and the error message.

Now, when many instances are launched from the same _base image, at least one of the updates will succeed.
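
The approach can be sketched as below. This is a self-contained approximation, not the patch itself: it uses `subprocess.run` in place of nova's own execute helper, omits rootwrap, and the `run` parameter is a hypothetical hook added only so the failure path can be exercised without a faulty NFS mount.

```python
import logging
import subprocess

LOG = logging.getLogger(__name__)


def update_mtime(path, run=subprocess.run):
    """Try to refresh the mtime of path; log a warning instead of raising.

    Returns True on success and False if the touch command failed.
    """
    try:
        # `touch -c` updates the timestamps without creating a missing file.
        run(["touch", "-c", path], check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as exc:
        LOG.warning("Failed to update mtime on path %s: %s", path, exc.stderr)
        return False
    return True
```

Because the failure is swallowed and only logged, a single failed touch no longer aborts the spawn and triggers a reschedule.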

Logs now show:
2016-09-09 17:10:22.033 22753 WARNING nova.virt.libvirt.utils [req-a8b01dc3-f349-44ba-85ef-2ef87b0d7eb6 9bded406068d46c5b817a1f7b604dd89 028894d76a79489ead5d682232ecbe83 - - -] Failed to update mtime on path /var/lib/nova/instances/_base/79e34519bacb47ad6f64c4baca4d33fd5c57d34d: Unexpected error while running command.
Command: sudo nova-rootwrap /etc/nova/rootwrap.conf touch -c /var/lib/nova/instances/_base/79e34519bacb47ad6f64c4baca4d33fd5c57d34d
Exit code: 1
Stdout: u''
Stderr: u"/bin/touch: setting times of '/var/lib/nova/instances/_base/79e34519bacb47ad6f64c4baca4d33fd5c57d34d': Input/output error\n"

Augustina Ragwitz (auggy) wrote :

Updated bug to indicate this is In Progress as there is a change pushed for it.

Changed in nova:
assignee: nobody → Joris S'heeren (jsheeren)
status: Incomplete → In Progress
Claudiu Belu (cbelu) wrote :

Set bug priority to Medium, as the failure occurs during spawn. I would have marked it higher if the error were consistent rather than having a limited hit rate. Still nice to fix, though.

Changed in nova:
importance: Undecided → Medium
Matt Riedemann (mriedem) wrote :

The patch was proposed against stable/mitaka but needs to be proposed against master first, then backported to stable/newton and finally stable/mitaka if that's your target release.

Matt Riedemann (mriedem) wrote :

Otherwise, if we landed the patch only in mitaka, the fix would be gone once you upgraded to newton or ocata.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/386956

Changed in nova:
assignee: Joris S'heeren (jsheeren) → Matt Riedemann (mriedem)
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/386956
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=53da313a86e81bf1df75119ca0e8f857e7b2909c
Submitter: Jenkins
Branch: master

commit 53da313a86e81bf1df75119ca0e8f857e7b2909c
Author: Joris S'heeren <email address hidden>
Date: Fri Sep 9 15:40:58 2016 +0200

    Catch error and log warning when not able to update mtimes

    When we launch an instance, nova updates the mtime for the _base image
    to let the image cache manager know the image is actively used. This
    can lead to unexpected I/O errors when launching a large amount of
    instances at once coming from the same _base image.

    This commit puts the execute call in a try, except block to catch
    possible errors. It also logs a warning with the path and error message.
    With this, at least once the update will succeed.

    Closes-Bug: #1621818

    Co-Authored-By: Matt Riedemann <email address hidden>

    Change-Id: I2fd1700aa4563a906eb574cbbe16caa63abae0d6
    Signed-off-by: Joris S'heeren <email address hidden>

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/391091

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 15.0.0.0b1

This issue was fixed in the openstack/nova 15.0.0.0b1 development milestone.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/mitaka)

Change abandoned by Lee Yarwood (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/368590
Reason: Hello John,

stable/mitaka has now entered phase II support [1][2], only accepting critical bugfixes and security patches. As this review does not meet these criteria it is being abandoned at this time.

However please reopen this review if you feel it is still suitable for stable/mitaka and the nova-stable-maint team will revisit this decision.

[1] http://docs.openstack.org/project-team-guide/stable-branches.html#support-phases
[2] https://releases.openstack.org/#release-series

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/newton)

Reviewed: https://review.openstack.org/391091
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=68402ae5443fd811b3dcb41537b621935d23f4ad
Submitter: Jenkins
Branch: stable/newton

commit 68402ae5443fd811b3dcb41537b621935d23f4ad
Author: Joris S'heeren <email address hidden>
Date: Fri Sep 9 15:40:58 2016 +0200

    Catch error and log warning when not able to update mtimes

    When we launch an instance, nova updates the mtime for the _base image
    to let the image cache manager know the image is actively used. This
    can lead to unexpected I/O errors when launching a large amount of
    instances at once coming from the same _base image.

    This commit puts the execute call in a try, except block to catch
    possible errors. It also logs a warning with the path and error message.
    With this, at least once the update will succeed.

    Closes-Bug: #1621818

    Co-Authored-By: Matt Riedemann <email address hidden>

    Change-Id: I2fd1700aa4563a906eb574cbbe16caa63abae0d6
    Signed-off-by: Joris S'heeren <email address hidden>
    (cherry picked from commit 53da313a86e81bf1df75119ca0e8f857e7b2909c)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 14.0.4

This issue was fixed in the openstack/nova 14.0.4 release.
