Instance reaches ERROR status during resize with "Conflict updating instance"

Bug #1589490 reported by Georgy Dyuldin
This bug affects 1 person
Affects: Mirantis OpenStack (status tracked in 10.0.x)
  10.0.x: Fix Committed, Importance: Medium, Assigned to: Ivan Udovichenko
  9.x: Fix Released, Importance: Medium, Assigned to: Ivan Udovichenko

Bug Description

Detailed bug description:

While resizing a Nova instance, it goes to ERROR status with:
Conflict updating instance b07417fc-da95-4b2f-ae4e-01a916f7066f. Expected: {'task_state': [u'resize_prep']}. Actual: {'task_state': None}

Steps to reproduce:

1. Set use_cow_images=0 in the Nova config and restart the nova-compute service on all compute nodes
2. Create an Ubuntu image from an Ubuntu Trusty qcow image
3. Create 2 flavors:
    name='test-eph', ram=1024, vcpus=1, disk=5, ephemeral=1
    and
    name='test-eph-large', ram=2048, vcpus=1, disk=5, ephemeral=1
4. Boot an instance from the image from step 2 with the 'test-eph' flavor
5. Resize the instance to the 'test-eph-large' flavor with the Nova API ( instance.resize(flavor) ), as in the sketch below
6. Wait until the instance reaches VERIFY_RESIZE status
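
For reference, a minimal python-novaclient sketch of steps 4-6 (credentials, endpoint and the image ID are placeholders, not values from this report; written against a Mitaka-era client):

    import time
    from novaclient import client

    # Placeholder credentials/endpoint; adjust to the environment under test.
    nova = client.Client('2', 'admin', 'secret', 'admin',
                         'http://<keystone-host>:5000/v2.0')

    flavor_small = nova.flavors.find(name='test-eph')
    flavor_large = nova.flavors.find(name='test-eph-large')

    # Step 4: boot from the image created in step 2 (pass its UUID).
    server = nova.servers.create(name='resize-test',
                                 image='<image-uuid-from-step-2>',
                                 flavor=flavor_small)
    while nova.servers.get(server.id).status != 'ACTIVE':
        time.sleep(5)

    # Step 5: request the resize.
    server.resize(flavor_large)

    # Step 6: poll until VERIFY_RESIZE (or ERROR, which is what this bug shows).
    while nova.servers.get(server.id).status not in ('VERIFY_RESIZE', 'ERROR'):
        time.sleep(5)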

Expected results:

All steps pass.

Actual result:

Instance reaches ERROR status:
Conflict updating instance b07417fc-da95-4b2f-ae4e-01a916f7066f. Expected: {'task_state': [u'resize_prep']}. Actual: {'task_state': None}
File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 375, in decorated_function
  return function(self, context, *args, **kwargs)
File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 3892, in resize_instance
  instance.save(expected_task_state=task_states.RESIZE_PREP)
File "/usr/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 207, in wrapper
  ctxt, self, fn.__name__, args, kwargs)
File "/usr/lib/python2.7/dist-packages/nova/conductor/rpcapi.py", line 245, in object_action
  objmethod=objmethod, args=args, kwargs=kwargs)
File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 158, in call
  retry=self.retry)
File "/usr/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 91, in _send
  timeout=timeout, retry=retry)
File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 512, in send
  retry=retry)
File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 503, in _send
  raise result
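
For context, the failing call instance.save(expected_task_state=task_states.RESIZE_PREP) is a compare-and-swap style update: the save only goes through if the instance's task_state still matches what the caller expects, otherwise Nova raises the "Conflict updating instance" error quoted above. A simplified illustration of that check (not Nova's actual implementation):

    # Illustrative only; the real check lives in Nova's DB layer and raises
    # UnexpectedTaskStateError with the message format seen in this traceback.
    class TaskStateConflict(Exception):
        pass

    def save_instance(instance, updates, expected_task_state):
        # Reject the update if another actor has already changed task_state.
        if instance['task_state'] not in expected_task_state:
            raise TaskStateConflict(
                "Conflict updating instance %s. Expected: %r. Actual: %r"
                % (instance['uuid'], {'task_state': expected_task_state},
                   {'task_state': instance['task_state']}))
        instance.update(updates)
        return instance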

Reproducibility:

From time to time

Description of the environment:
- Versions of components: MOS ISO 9.0, builds #424 and #443
- Network model: Neutron VLAN

Revision history for this message
Georgy Dyuldin (g-dyuldin) wrote :
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

This is weird: the resize task fails because it finds the instance in an unexpected state, and the reason for that is that the task state was explicitly reset by nova-compute on start (as part of the cleanup for unfinished migrations / resizes):

2016-06-06 12:44:07.650 18216 DEBUG nova.compute.manager [req-f182f0f6-29f2-4a91-9f75-44f51af1f989 - - - - -] [instance: b07417fc-da95-4b2f-ae4e-01a916f7066f] Instance in transitional state resize_prep at start-up clearing task state _init_instance /usr/lib/python2.7/dist-packages/nova/compute/manager.py:1028

which effectively means nova-compute was restarted between two events:

1) nova-api receiving a REST request to resize an instance

and

2) nova-compute actually receiving an RPC call to fulfil the request on the compute node.
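
A toy timeline of this race, just to make the ordering concrete (illustrative Python, not Nova code):

    instance = {'task_state': None}

    # 1) nova-api accepts the resize request, sets the transitional state and
    #    casts an RPC to nova-compute.
    instance['task_state'] = 'resize_prep'

    # 2) nova-compute is restarted before consuming the RPC; _init_instance()
    #    clears the seemingly stale transitional state.
    instance['task_state'] = None

    # 3) The queued resize_instance RPC is finally delivered and saves with
    #    expected_task_state='resize_prep', which no longer matches -> conflict.
    assert instance['task_state'] == 'resize_prep', "Conflict: task_state is None"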

Unfortunately, the logs are very scarce on why this happens; we can only see that the process was restarted:

2016-06-06 12:44:07.117 18216 WARNING oslo_reports.guru_meditation_report [-] Guru mediation now registers SIGUSR1 and SIGUSR2 by default for backward compatibility. SIGUSR1 will no longer be registered in a future release, so please use SIGUSR2 to generate reports.

The previous entry suggests it was not terminated properly:

2016-06-06 12:44:03.463 6303 DEBUG oslo_service.periodic_task [req-e7eaf8d3-bc19-4889-a195-e3b9066f373b - - - - -] Running periodic task ComputeManager._check_instance_build_time run_periodic_tasks /usr/lib/python2.7/dist-packages/oslo_service/periodic_task.py:215

as there is no mention of receiving any of the signals we handle, which are SIGHUP, SIGINT and SIGTERM.

There are no upstart logs in the snapshot, so we can't say for sure.

Did you by any chance kill the nova-compute process with SIGKILL? Please ping us on Slack if this is reproduced again and you have an environment for live debugging.

Changed in mos:
status: New → Incomplete
assignee: nobody → Georgy Dyuldin (g-dyuldin)
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

I checked Georgy's environment where this was reproduced and we found that:

1) Georgy changes the nova-compute configuration and restarts the process right *before* the test

2) the latter actually happens in the middle of the test due to its asynchronous nature

3) from logs we see that nova-compute was stopped abruptly with KILL

root@node-3:~# dmesg | grep -i nova
[ 1876.863558] init: nova-compute main process (25612) killed by KILL signal
[ 2705.071550] init: nova-compute main process (6303) killed by KILL signal

4) tracing with perf_events shows that this happens on "service nova-compute restart" ( http://paste.openstack.org/show/508564/ ): first upstart sends a TERM signal and waits for the process to terminate gracefully, and after $timeout seconds (defaults to 5 - http://upstart.ubuntu.com/cookbook/#kill-timeout) it sends a KILL signal

5) the problem with the latter is that it's not aligned with the graceful shutdown timeout value in oslo.service - https://github.com/openstack/oslo.service/blob/master/oslo_service/_options.py#L49 - which is 60 seconds.

In my opinion we should tweak the upstart scripts of all OpenStack services which use oslo.service (basically every one of them) and set the "kill timeout" value in upstart to (graceful_shutdown_timeout + 5) to make sure we first give services a chance to terminate gracefully and finish all in-flight requests.
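
For illustration, the proposed alignment would look roughly like this (file paths and exact values are examples here, not the final packaging change):

    # /etc/nova/nova.conf -- oslo.service's graceful shutdown window (60 s by default)
    [DEFAULT]
    graceful_shutdown_timeout = 60

    # /etc/init/nova-compute.conf -- upstart job: wait a bit longer than that
    # before escalating from SIGTERM to SIGKILL
    kill timeout 65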

Changed in mos:
status: Incomplete → Confirmed
status: Confirmed → Won't Fix
tags: added: area-packaging
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Not sure the impact is enough to put this into 9.0 at this point, but we could test this on 10.0 and then put it into an MU for 9.0.

tags: added: move-to-mu
Dina Belova (dbelova)
Changed in mos:
status: Won't Fix → Confirmed
milestone: 9.0 → 9.0-updates
Revision history for this message
Ivan Udovichenko (iudovichenko) wrote :
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix proposed to packages/xenial/openstack-pkg-tools (master)

Related fix proposed to branch: master
Change author: Ivan Udovichenko <email address hidden>
Review: https://review.fuel-infra.org/23797

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix merged to packages/xenial/openstack-pkg-tools (master)

Reviewed: https://review.fuel-infra.org/23797
Submitter: Pkgs Jenkins <email address hidden>
Branch: master

Commit: e3aa7e36e72055fd8840a0c5270f8b195102c3a7
Author: Ivan Udovichenko <email address hidden>
Date: Fri Jul 29 14:36:04 2016

Update openstack-pkg-tools package

* Align kill timeout with graceful shutdown timeout of oslo.service (LP: #1589490).

Change-Id: I1373ccdff2579944d88af9b9e81e203efcc6ab08
Related-Bug: #1589490

tags: added: on-verification
Revision history for this message
Dmitry Belyaninov (dbelyaninov) wrote :

Verified on snapshot #152.

tags: removed: on-verification