Instance reaches ERROR status during resize with "Conflict updating instance"

Bug #1589490 reported by Georgy Dyuldin
This bug affects 1 person
Affects: Mirantis OpenStack (status tracked in 10.0.x)
  10.0.x: Fix Committed, Importance: Medium, Assigned to: Ivan Udovichenko
  9.x: Fix Released, Importance: Medium, Assigned to: Ivan Udovichenko

Bug Description

Detailed bug description:

While resizing a Nova instance, it goes to ERROR status with:
Conflict updating instance b07417fc-da95-4b2f-ae4e-01a916f7066f. Expected: {'task_state': [u'resize_prep']}. Actual: {'task_state': None}

Steps to reproduce:

1. Set use_cow_images=0 in the Nova config and restart the nova-compute service on all compute nodes
2. Create an Ubuntu image from an Ubuntu Trusty qcow image
3. Create 2 flavors:
    name='test-eph', ram=1024, vcpus=1, disk=5, ephemeral=1
    and
    name='test-eph-large', ram=2048, vcpus=1, disk=5, ephemeral=1
4. Boot an instance from the image from step 2 with the 'test-eph' flavor
5. Resize the instance to the 'test-eph-large' flavor with the Nova API ( instance.resize(flavor) ), as in the sketch below
6. Wait until the instance reaches VERIFY_RESIZE status
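
For reference, a minimal python-novaclient sketch of steps 4-6 (credentials, endpoint and the image ID are placeholders, not values from this report; written against a Mitaka-era client):

    import time
    from novaclient import client

    # Placeholder credentials/endpoint; adjust to the environment under test.
    nova = client.Client('2', 'admin', 'secret', 'admin',
                         'http://<keystone-host>:5000/v2.0')

    flavor_small = nova.flavors.find(name='test-eph')
    flavor_large = nova.flavors.find(name='test-eph-large')

    # Step 4: boot from the image created in step 2 (pass its UUID).
    server = nova.servers.create(name='resize-test',
                                 image='<image-uuid-from-step-2>',
                                 flavor=flavor_small)
    while nova.servers.get(server.id).status != 'ACTIVE':
        time.sleep(5)

    # Step 5: request the resize.
    server.resize(flavor_large)

    # Step 6: poll until VERIFY_RESIZE (or ERROR, which is what this bug shows).
    while nova.servers.get(server.id).status not in ('VERIFY_RESIZE', 'ERROR'):
        time.sleep(5)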

Expected results:

All steps pass.

Actual result:

Instance reaches ERROR status:
Conflict updating instance b07417fc-da95-4b2f-ae4e-01a916f7066f. Expected: {'task_state': [u'resize_prep']}. Actual: {'task_state': None}
File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 375, in decorated_function
  return function(self, context, *args, **kwargs)
File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 3892, in resize_instance
  instance.save(expected_task_state=task_states.RESIZE_PREP)
File "/usr/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 207, in wrapper
  ctxt, self, fn.__name__, args, kwargs)
File "/usr/lib/python2.7/dist-packages/nova/conductor/rpcapi.py", line 245, in object_action
  objmethod=objmethod, args=args, kwargs=kwargs)
File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 158, in call
  retry=self.retry)
File "/usr/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 91, in _send
  timeout=timeout, retry=retry)
File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 512, in send
  retry=retry)
File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 503, in _send
  raise result
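
For context, the failing call instance.save(expected_task_state=task_states.RESIZE_PREP) is a compare-and-swap style update: the save only goes through if the instance's task_state still matches what the caller expects, otherwise Nova raises the "Conflict updating instance" error quoted above. A simplified illustration of that check (not Nova's actual implementation):

    # Illustrative only; the real check lives in Nova's DB layer and raises
    # UnexpectedTaskStateError with the message format seen in this traceback.
    class TaskStateConflict(Exception):
        pass

    def save_instance(instance, updates, expected_task_state):
        # Reject the update if another actor has already changed task_state.
        if instance['task_state'] not in expected_task_state:
            raise TaskStateConflict(
                "Conflict updating instance %s. Expected: %r. Actual: %r"
                % (instance['uuid'], {'task_state': expected_task_state},
                   {'task_state': instance['task_state']}))
        instance.update(updates)
        return instance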

Reproducibility:

From time to time

Description of the environment:
- Versions of components: MOS ISO 9.0, builds #424 and #443
- Network model: Neutron VLAN

Revision history for this message
Georgy Dyuldin (g-dyuldin) wrote :
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

This is weird: the resize task fails because it finds the instance in an unexpected state, and the reason for that is that the task state was explicitly reset by nova-compute on start (as part of the cleanup for unfinished migrations / resizes):

2016-06-06 12:44:07.650 18216 DEBUG nova.compute.manager [req-f182f0f6-29f2-4a91-9f75-44f51af1f989 - - - - -] [instance: b07417fc-da95-4b2f-ae4e-01a916f7066f] Instance in transitional state resize_prep at start-up clearing task state _init_instance /usr/lib/python2.7/dist-packages/nova/compute/manager.py:1028

which effectively means nova-compute was restarted between two events:

1) nova-api receiving a REST request to resize an instance

and

2) nova-compute actually receiving an RPC call to fulfil the request on the compute node.
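
A toy timeline of this race, just to make the ordering concrete (illustrative Python, not Nova code):

    instance = {'task_state': None}

    # 1) nova-api accepts the resize request, sets the transitional state and
    #    casts an RPC to nova-compute.
    instance['task_state'] = 'resize_prep'

    # 2) nova-compute is restarted before consuming the RPC; _init_instance()
    #    clears the seemingly stale transitional state.
    instance['task_state'] = None

    # 3) The queued resize_instance RPC is finally delivered and saves with
    #    expected_task_state='resize_prep', which no longer matches -> conflict.
    assert instance['task_state'] == 'resize_prep', "Conflict: task_state is None"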

Unfortunately, the logs are very scarce on why this happens; we can only see that the process was restarted:

2016-06-06 12:44:07.117 18216 WARNING oslo_reports.guru_meditation_report [-] Guru mediation now registers SIGUSR1 and SIGUSR2 by default for backward compatibility. SIGUSR1 will no longer be registered in a future release, so please use SIGUSR2 to generate reports.

The previous entry suggests it was not terminated properly:

2016-06-06 12:44:03.463 6303 DEBUG oslo_service.periodic_task [req-e7eaf8d3-bc19-4889-a195-e3b9066f373b - - - - -] Running periodic task ComputeManager._check_instance_build_time run_periodic_tasks /usr/lib/python2.7/dist-packages/oslo_service/periodic_task.py:215

as there is no mention of receiving any of the signals we handle, which are SIGHUP, SIGINT and SIGTERM.

There are no upstart logs in the snapshot, so we can't say for sure.

Did you by any chance kill the nova-compute process with SIGKILL? Please ping us on Slack if this is reproduced again and you have an environment for live debugging.

Changed in mos:
status: New → Incomplete
assignee: nobody → Georgy Dyuldin (g-dyuldin)
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

I checked Georgy's environment where this was reproduced and we found that:

1) Georgy changes the nova-compute configuration and restarts the process right *before* the test

2) the latter actually happens in the middle of the test due to its asynchronous nature

3) from logs we see that nova-compute was stopped abruptly with KILL

root@node-3:~# dmesg | grep -i nova
[ 1876.863558] init: nova-compute main process (25612) killed by KILL signal
[ 2705.071550] init: nova-compute main process (6303) killed by KILL signal

4) tracing with perf_events shows that this happens on "service nova-compute restart" ( http://paste.openstack.org/show/508564/ ): first upstart sends a TERM signal and waits for the process to terminate gracefully, and after $timeout seconds (defaults to 5 - http://upstart.ubuntu.com/cookbook/#kill-timeout) it sends a KILL signal

5) the problem with the latter is that it's not aligned with the graceful shutdown timeout value in oslo.service - https://github.com/openstack/oslo.service/blob/master/oslo_service/_options.py#L49 - which is 60 seconds.

In my opinion we should tweak the upstart scripts of all OpenStack services which use oslo.service (basically every one of them) and set the "kill timeout" value in upstart to (graceful_shutdown_timeout + 5) to make sure we first give services a chance to terminate gracefully and finish all in-flight requests.
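
For illustration, the proposed alignment would look roughly like this (file paths and exact values are examples here, not the final packaging change):

    # /etc/nova/nova.conf -- oslo.service's graceful shutdown window (60 s by default)
    [DEFAULT]
    graceful_shutdown_timeout = 60

    # /etc/init/nova-compute.conf -- upstart job: wait a bit longer than that
    # before escalating from SIGTERM to SIGKILL
    kill timeout 65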

Changed in mos:
status: Incomplete → Confirmed
status: Confirmed → Won't Fix
tags: added: area-packaging
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Not sure the impact is enough to put this into 9.0 at this point, but we could test this on 10.0 and then put it into an MU for 9.0.

tags: added: move-to-mu
Dina Belova (dbelova)
Changed in mos:
status: Won't Fix → Confirmed
milestone: 9.0 → 9.0-updates
Revision history for this message
Ivan Udovichenko (iudovichenko) wrote :
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix proposed to packages/xenial/openstack-pkg-tools (master)

Related fix proposed to branch: master
Change author: Ivan Udovichenko <email address hidden>
Review: https://review.fuel-infra.org/23797

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix merged to packages/xenial/openstack-pkg-tools (master)

Reviewed: https://review.fuel-infra.org/23797
Submitter: Pkgs Jenkins <email address hidden>
Branch: master

Commit: e3aa7e36e72055fd8840a0c5270f8b195102c3a7
Author: Ivan Udovichenko <email address hidden>
Date: Fri Jul 29 14:36:04 2016

Update openstack-pkg-tools package

* Align kill timeout with graceful shutdown timeout of oslo.service (LP: #1589490).

Change-Id: I1373ccdff2579944d88af9b9e81e203efcc6ab08
Related-Bug: #1589490

tags: added: on-verification
Revision history for this message
Dmitry Belyaninov (dbelyaninov) wrote :

Verified on snapshot #152.

tags: removed: on-verification