nova-compute becomes disabled after attempt to live-migrate of instance

Bug #1486070 reported by Alexander Zatserklyany
This bug affects 2 people
Affects             Status        Importance  Assigned to      Milestone
Mirantis OpenStack  Fix Released  High        Roman Podoliaka
9.x                 Fix Released  High        Dmitry Teselkin

Bug Description

Description:

nova-compute becomes disabled after an attempt to live-migrate an instance

Steps to reproduce:

1) Get network id
export NET_ID=$(neutron net-list | grep 'net04 ' | awk '{print $2}')

2) Launch an instance
nova boot --image TestVM --flavor 2 --nic net-id=${NET_ID} test

3) Run live-migration
nova live-migration --block-migrate test

Expected results:

The instance migrated to another host

Actual results:
The instance didn't migrate to another host

On the compute node, /var/log/nova-all.log contains a message like:
Exception during message handling: Connection to the hypervisor is broken on host: node-5.test.domain.local

dmesg shows that libvirt segfaulted:

[ 4799.067827] libvirtd[15028]: segfault at 0 ip 00007fd8ef157e73 sp 00007fd8f926e6b0 error 4 in libvirt_driver_qemu.so[7fd8ef0d6000+117000]

and it hasn't been restarted:

root@node-2:~# service libvirtd status
 * Checking status of libvirt management daemon libvirtd [fail]
root@node-2:~# ps -ef | grep virtd
root 8266 6317 0 10:34 pts/5 00:00:00 grep --color=auto virtd

affects: fuel → mos
Changed in mos:
milestone: 7.0 → none
importance: Undecided → Critical
Changed in mos:
importance: Critical → Undecided
Changed in mos:
milestone: none → 7.0
Changed in mos:
assignee: nobody → MOS Nova (mos-nova)
Revision history for this message
Alexander Zatserklyany (zatserklyany) wrote :

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "7.0"
  openstack_version: "2015.1.0-7.0"
  api: "1.0"
  build_number: "182"
  build_id: "2015-08-17_03-04-59"

Revision history for this message
Timofey Durakov (tdurakov) wrote :

@zatserklyany, could you provide more info? Do you use Ceph for volumes or ephemerals? Was the instance volume-backed or ephemeral? Is node-5 the source or the destination host? What does "nova-compute becomes disabled" mean? Did the whole service fail with an error, or did just the test instance fail to start?

Changed in mos:
status: New → Incomplete
Revision history for this message
Alexander Zatserklyany (zatserklyany) wrote :

Configuration:
node-1, node-2, node-3 - controller, mongo
node-4, node-5 - compute, cinder

Migration was from node-4 to node-5
root@node-1:~# nova-manage service list | grep compute
nova-compute node-5.test.domain.local nova disabled :-) 2015-08-18 13:47:41
nova-compute node-4.test.domain.local nova enabled :-) 2015-08-18 13:47:26

After 'ssh node-5 service libvirtd restart'
nova-compute node-5.test.domain.local nova enabled :-) 2015-08-18 13:49:51

Every time after 'nova live-migration --block-migrate test'
 nova-compute node-5.test.domain.local nova disabled :-)
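
The check above can be scripted. This is a minimal sketch, assuming the `nova-manage service list` column layout shown in this comment (binary, host, zone, status, state, timestamp); `disabled_computes` is a hypothetical helper name, not part of nova.

```shell
# Hypothetical helper: reads `nova-manage service list` output on stdin
# and prints the host name of every nova-compute service marked disabled.
# Assumes the column order seen in this report: binary host zone status ...
disabled_computes() {
    awk '$1 == "nova-compute" && $4 == "disabled" { print $2 }'
}
```

Usage, under the same assumptions: `nova-manage service list | disabled_computes`. An empty result means no compute service is currently disabled.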

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Surprisingly, I can reproduce this on ISO #188. I updated the description a bit to point out the libvirtd segfault.

Packages versions:

root@node-2:~# dpkg -l | grep libvirt
ii libvirt-bin 1.2.9-9~u14.04+mos2 amd64 programs for the libvirt library
ii libvirt-clients 1.2.9-9~u14.04+mos2 amd64 programs for the libvirt library
ii libvirt-daemon 1.2.9-9~u14.04+mos2 amd64 programs for the libvirt library
ii libvirt-daemon-system 1.2.9-9~u14.04+mos2 amd64 Libvirt daemon configuration files
ii libvirt0 1.2.9-9~u14.04+mos2 amd64 library for interfacing with different virtualization systems
ii python-libvirt 1.2.9-1~u14.04+mos1 amd64 libvirt Python bindings

summary: - Nova-compute become disabled after attempt to live-migrate of instance
+ nova-compute become disabled after attempt to live-migrate of instance
description: updated
Changed in mos:
status: Incomplete → Confirmed
assignee: MOS Nova (mos-nova) → MOS Linux (mos-linux)
importance: Undecided → High
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :
tags: added: libvirt
Pavel Boldin (pboldin)
Changed in mos:
assignee: MOS Linux (mos-linux) → Pavel Boldin (pboldin)
Revision history for this message
Pavel Boldin (pboldin) wrote :

Backtrace:

#0 0x00007fe58119be73 in ?? () from /usr/lib/libvirt/connection-driver/libvirt_driver_qemu.so
#1 0x00007fe58119ccc9 in qemuMigrationPrepareTunnel () from /usr/lib/libvirt/connection-driver/libvirt_driver_qemu.so
#2 0x00007fe5811c7a6c in ?? () from /usr/lib/libvirt/connection-driver/libvirt_driver_qemu.so
#3 0x00007fe590939051 in virDomainMigratePrepareTunnel3Params () from /usr/lib/libvirt.so.0
#4 0x00007fe5922b123b in ?? ()
#5 0x00007fe5922e6fa2 in virNetServerProgramDispatch ()
#6 0x00007fe5922e0fbd in ?? ()
#7 0x00007fe590890d55 in ?? () from /usr/lib/libvirt.so.0
#8 0x00007fe59089024e in ?? () from /usr/lib/libvirt.so.0
#9 0x00007fe5905bb182 in start_thread (arg=0x7fe58c2b5700) at pthread_create.c:312
#10 0x00007fe5902e847d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

This code is strcmp inlined:

   81e62: 48 8b 74 24 40 mov 0x40(%rsp),%rsi
   81e67: 48 8d 3d 9a ec 05 00 lea 0x5ec9a(%rip),%rdi # e0b08 <_fini+0x25ec>
   81e6e: b9 05 00 00 00 mov $0x5,%ecx
   81e73: f3 a6 repz cmpsb %es:(%rdi),%ds:(%rsi)

The string at 0xe0b08 is: "rdma"

This code checks for the "rdma" protocol, but only after some process has already been started.

This is the code from `qemuMigrationPrepareAny`, called from `qemuMigrationPrepareTunnel` with protocol=NULL, so the inlined `strcmp` reads through a NULL pointer and causes SIGSEGV (file src/qemu/qemu_migration.c, line 2749).
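
The offset of the `repz cmpsb` instruction in the disassembly (81e73) can be cross-checked against the dmesg line in the bug description: subtracting the module load address from the instruction pointer should give the same offset. A minimal sketch, assuming the kernel segfault message format shown in the description; `segfault_offset` is a hypothetical helper name.

```shell
# Hypothetical helper: reads kernel "segfault at ..." lines on stdin and
# prints "<module> <offset>", where offset = ip - module load address (hex).
# Assumes the message layout seen in this report:
#   ... segfault at ADDR ip IP sp SP error N in MODULE[BASE+SIZE]
segfault_offset() {
    sed -n 's/.*segfault at [0-9a-f]* ip \([0-9a-f]*\) sp [0-9a-f]* error [0-9]* in \([^[]*\)\[\([0-9a-f]*\)+[0-9a-f]*].*/\1 \2 \3/p' |
    while read -r ip module base; do
        # Shell arithmetic accepts 0x-prefixed hex constants.
        printf '%s %x\n' "$module" "$(( 0x$ip - 0x$base ))"
    done
}
```

Usage: `dmesg | segfault_offset`. For the line quoted in the description (ip 00007fd8ef157e73, base 7fd8ef0d6000) this prints `libvirt_driver_qemu.so 81e73`, which matches the address of the faulting compare.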

However, this looks like a misconfiguration on our side: block migration should not use tunnelled migration, and the corresponding nova-compute configuration option, block_migration_flag, should be set accordingly.

Revision history for this message
Pavel Boldin (pboldin) wrote :

Setting the block_migration_flag option to exclude tunnelled migration helps.
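
As an illustration of that workaround, here is a sketch that removes the tunnelled flag from a flag list. It assumes block_migration_flag holds a comma-separated list of libvirt flag names and that the tunnelled flag is spelled VIR_MIGRATE_TUNNELLED; `strip_tunnelled` is a hypothetical helper, and the actual value in a given deployment should be checked before changing it.

```shell
# Hypothetical helper: takes a comma-separated migration flag list as $1
# and prints it without VIR_MIGRATE_TUNNELLED. Whitespace around the
# entries is trimmed; the order of the remaining flags is preserved.
strip_tunnelled() {
    printf '%s\n' "$1" |
        tr ',' '\n' |
        sed 's/^[[:space:]]*//; s/[[:space:]]*$//' |
        grep -v -x 'VIR_MIGRATE_TUNNELLED' |
        paste -sd, -
}
```

Usage, with an illustrative input: `strip_tunnelled 'VIR_MIGRATE_LIVE, VIR_MIGRATE_TUNNELLED'` prints `VIR_MIGRATE_LIVE`. The result would then be written back as the block_migration_flag value and nova-compute restarted.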

Changed in mos:
assignee: Pavel Boldin (pboldin) → Roman Podoliaka (rpodolyaka)
Changed in mos:
status: Confirmed → In Progress
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/nova (openstack-ci/fuel-7.0/2015.1.0)

Fix proposed to branch: openstack-ci/fuel-7.0/2015.1.0
Change author: Roman Podoliaka <email address hidden>
Review: https://review.fuel-infra.org/10586

summary: - nova-compute become disabled after attempt to live-migrate of instance
+ nova-compute becomes disabled after attempt to live-migrate of instance
tags: added: customer-found
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on openstack/nova (openstack-ci/fuel-7.0/2015.1.0)

Change abandoned by Roman Podoliaka <email address hidden> on branch: openstack-ci/fuel-7.0/2015.1.0
Review: https://review.fuel-infra.org/10586
Reason: in favor of https://review.openstack.org/215032

Revision history for this message
Pavel Boldin (pboldin) wrote :

There are actually two bugs:
1. Fuel uses block migration with TUNNELLED
2. libvirt segfaults when TUNNELLED migration is used, due to the code path described in https://bugs.launchpad.net/mos/+bug/1486070/comments/6

The proposed patch fixes only the first, which effectively works around the second.

Revision history for this message
Pavel Boldin (pboldin) wrote :
Changed in mos:
status: In Progress → Fix Committed
Revision history for this message
Alexander Zatserklyany (zatserklyany) wrote :

fuel-7.0-233-2015-08-25_21-41-18.iso
root@node-1:~# export NET_ID=$(neutron net-list | grep 'net04 ' | awk '{print $2}')
root@node-1:~# nova boot --image TestVM --flavor 2 --nic net-id=${NET_ID} test
root@node-1:~# nova show test
+--------------------------------------+----------------------------------------------------------+
| Property | Value |
+--------------------------------------+----------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | node-2.test.domain.local |
| OS-EXT-SRV-ATTR:hypervisor_hostname | node-2.test.domain.local |
root@node-1:~# nova live-migration --block-migrate test
root@node-1:~# nova-manage service list | grep compute
nova-compute node-2.test.domain.local nova enabled :-) 2015-08-26 14:45:56
nova-compute node-5.test.domain.local nova enabled :-) 2015-08-26 14:45:57
root@node-1:~# nova show test
+--------------------------------------+----------------------------------------------------------+
| Property | Value |
+--------------------------------------+----------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | node-5.test.domain.local |
| OS-EXT-SRV-ATTR:hypervisor_hostname | node-5.test.domain.local |

Test passed.

Changed in mos:
status: Fix Committed → Fix Released
Revision history for this message
Sergey Vasilenko (xenolog) wrote :

We can't separate live-migration traffic without enabling tunnelled migrations.

I am reopening this bug and assigning it to 9.0, because this segfault blocks all work on separating live-migration traffic from the admin network.

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix proposed to packages/trusty/libvirt (8.0)

Related fix proposed to branch: 8.0
Change author: Dmitry Teselkin <email address hidden>
Review: https://review.fuel-infra.org/15916

tags: added: area-linux
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to packages/trusty/libvirt (master)

Fix proposed to branch: master
Change author: Dmitry Teselkin <email address hidden>
Review: https://review.fuel-infra.org/17394

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to packages/trusty/libvirt (master)

Reviewed: https://review.fuel-infra.org/17394
Submitter: Pkgs Jenkins <email address hidden>
Branch: master

Commit: 12fb4d63a65174cf717516244bbc8d46f19be093
Author: Dmitry Teselkin <email address hidden>
Date: Wed Feb 24 16:21:52 2016

Fix live migration

Add second patch [1] to fix live migration issue [2].

[1] https://git.centos.org/blob/rpms!libvirt/a7d8b1953536811df23cae15003224d26493e961/SOURCES!libvirt-qemu-Really-fix-crash-in-tunnelled-migration.patch
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1147331

Closes-Bug: 1486070

Change-Id: Ie9ef136f07297457b6fc4d738146dedf7819bb02

Revision history for this message
Anna Babich (ababich) wrote :

Verified on: (see the attachment)
Env with neutron vlan and cinder ceph rbd (3 controllers + 2 computes)

Results of reproducing:
root@node-6:~# nova show ephem | grep '\( OS-EXT-SRV-ATTR:host \| status \)'
| OS-EXT-SRV-ATTR:host | node-9.test.domain.local |
| status | ACTIVE |
root@node-6:~# nova live-migration --block-migrate ephem
root@node-6:~# nova show ephem | grep '\( OS-EXT-SRV-ATTR:host \| status \)'
| OS-EXT-SRV-ATTR:host | node-10.test.domain.local |
| status | ACTIVE |
root@node-6:~#
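
The before/after comparison above can be automated. This is a minimal sketch, assuming the `nova show` table layout seen in this report; `instance_host` and `migrated` are hypothetical helper names.

```shell
# Hypothetical helper: reads `nova show <instance>` table output on stdin
# and prints the value of the OS-EXT-SRV-ATTR:host row only (the
# hypervisor_hostname row is not matched because the key is anchored).
instance_host() {
    awk -F'|' '$2 ~ /^ *OS-EXT-SRV-ATTR:host *$/ { gsub(/ /, "", $3); print $3 }'
}

# Hypothetical helper: succeeds when both host names are non-empty and differ,
# i.e. the instance ended up on another compute node.
migrated() {
    [ -n "$1" ] && [ -n "$2" ] && [ "$1" != "$2" ]
}
```

Usage: capture `nova show ephem | instance_host` before and after `nova live-migration --block-migrate ephem`, then call `migrated "$before" "$after"`. Live migration is asynchronous, so the second capture may need to wait until the instance status returns to ACTIVE.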

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on packages/trusty/libvirt (8.0)

Change abandoned by Dmitry Teselkin <email address hidden> on branch: 8.0
Review: https://review.fuel-infra.org/15916
