Live Migration failure. "Failed to bind socket: Address already in use" from qemu-system-x86_64

Bug #1627476 reported by Leontii Istomin
This bug affects 1 person
Affects: Mirantis OpenStack
Status: Fix Released
Importance: Medium
Assigned to: Nikita Karpin

Bug Description

Detailed bug description:
 During the "boot_attach_live_migrate_and_delete_server_with_secgroups" Rally scenario, several live migration steps failed. From rally logs: http://paste.openstack.org/show/582875/
From nova-compute.log on node-675: http://paste.openstack.org/show/582876/
Steps to reproduce:

1. Deploy Fuel 9.0 from fuel-9.0-mos-495-2016-06-16_18-18-00.iso
2. Update Fuel to 9.1:
yum-config-manager --add-repo http://mirror.fuel-infra.org/mos-repos/centos/mos9.0-centos7/snapshots/proposed-2016-09-15-134324/x86_64/
rpm --import http://mirror.fuel-infra.org/mos-repos/centos/mos9.0-centos7/snapshots/proposed-2016-09-15-134324/RPM-GPG-KEY-mos9.0
yum install -y python-cudet
yum clean all
update-prepare prepare master
update-prepare update master
3. Apply the nailgun patch for bug https://bugs.launchpad.net/fuel/+bug/1596987:
yum -y install patch && curl -s 'https://review.openstack.org/gitweb?p=openstack/fuel-library.git;a=patch;h=15bd6cd3ed4b7b8a42da47d19e9ed1eb4700e7d3' | patch -b -d /etc/puppet/modules/ -p3
4. fuel rel --sync-deployment-tasks --dir /etc/puppet/
5. Create config for nailgun-agent
mkdir /usr/share/fuel_bootstrap_cli/files/trusty/etc/fuel-agent
curl -s 'http://paste.openstack.org/raw/506300/' > /usr/share/fuel_bootstrap_cli/files/trusty/etc/fuel-agent/fuel-agent.conf
6. fuel-bootstrap build --verbose --debug --activate --label 'replaced-nailgun-agent_fixed'
7. Create a cluster with 3 controllers, 20 OSDs (both bare metal) and 1000 computes (qemu-kvm)
8. Create additional repo for cluster "mos9.0-proposed":
name: mos9.0-proposed
uri: deb http://mirror.fuel-infra.org/mos-repos/ubuntu/snapshots/9.0-2016-09-15-162322/ mos9.0-proposed main restricted
priority: 1200
9. Fix syslog NOFILE limit, due to bug https://bugs.launchpad.net/fuel/+bug/1626092
On fuel node create file /etc/systemd/system/rsyslog.service.d/limits.conf with content:
[Service]
LimitNOFILE=16384
10. Deploy the following cluster:
3 hardware controllers, 20 hardware Ceph OSDs, 1000 virtual computes (KVM, 7 VMs per hypervisor host),
vxlan+dvr, Ceph for all, OpenStack and deployment debug enabled
Expected results:
 The test passes
Actual result:
 The test fails
Reproducibility:
 tried once
Workaround:
 not yet
Impact:
 On a 1000-node cluster, live migration can fail when 5 instances are spawned and migrated at the same time.
Description of the environment:
- Operating system: Ubuntu
- Versions of components: MOS 9.1 (9.0-2016-09-15-162322)
- Reference architecture: 3 hardware controllers, 20 hardware Ceph OSDs, 1000 virtual computes (KVM, 7 VMs per hypervisor host), Ceph for all, OpenStack and deployment debug enabled
- Network model: vxlan+dvr
- Related projects installed: -
Additional information:
 The Diagnostic Snapshot feature does not work due to https://bugs.launchpad.net/fuel/+bug/1627477
 Logs from fuel node: http://mos-scale-share.mirantis.com/1627476_fuel_logs.tar.gz
 Logs from compute node-675: http://mos-scale-share.mirantis.com/1627476_node-675_logs.tar.gz
 Logs from compute node-20: http://mos-scale-share.mirantis.com/1627491_node-20_logs.tar.gz
 Logs from controller node-1041: http://mos-scale-share.mirantis.com/1627476_node-1041_logs.tar.gz
 Logs from controller node-1042: http://mos-scale-share.mirantis.com/1627476_node-1042_logs.tar.gz
 Logs from controller node-1043: http://mos-scale-share.mirantis.com/1627476_node-1043_logs.tar.gz

description: updated
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Live migration fails, because qemu can't bind a socket:

2016-09-24T11:25:15.164333Z qemu-system-x86_64: -incoming tcp:[::]:49152: Failed to bind socket: Address already in use

libvirtd allocates a port number within a configured range (we use defaults):

root@node-549:~# grep migration_port /etc/libvirt/qemu.conf
#migration_port_min = 49152
#migration_port_max = 49215

The problem with that is that libvirtd does not actually *bind* a socket, so there is a race condition between allocation done internally in libvirtd and start of a qemu process (which may fail on bind(), as we see in our case).
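
As a rough illustration of the failure mode (a sketch on loopback, not a reproduction of the libvirtd/qemu code path), the snippet below lets a connect() call take an ephemeral local port and then tries to bind() a new socket to that same number. The function name is hypothetical, and the order is simplified: here the port is taken first and "picked" afterwards, whereas in the real race libvirtd picks the number first.

```python
import errno
import socket
from typing import Optional


def bind_errno_when_port_taken() -> Optional[int]:
    """Return the errno raised by bind() on a port already in use, or None."""
    # A peer to connect to, standing in for any remote endpoint.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    # Some other process calls connect(); the kernel gives it an ephemeral
    # local port. Assume this is the number that was picked for migration.
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(listener.getsockname())
    picked_port = client.getsockname()[1]

    # The late bind(), standing in for qemu's "-incoming tcp:..." socket.
    late = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        late.bind(("127.0.0.1", picked_port))
        return None
    except OSError as exc:
        return exc.errno
    finally:
        late.close()
        client.close()
        listener.close()


if __name__ == "__main__":
    err = bind_errno_when_port_taken()
    print(errno.errorcode.get(err, "bind succeeded"))
```

On Linux the late bind() is expected to fail with EADDRINUSE, which is the errno behind the "Address already in use" message above.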

libvirtd *must* handle concurrent live migrations on one target host properly by itself, but the default port range seems to intersect with the ephemeral port range (which is used to establish connections to remote endpoints):

root@node-549:~# cat /proc/sys/net/ipv4/ip_local_port_range
32768 61000

i.e. the kernel might have already given the very same port number to some other process which called connect() after libvirtd picked it, but before qemu actually started and bound it.
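
The overlap can be checked mechanically. The sketch below is only an illustration: the helper names are hypothetical, and it assumes that commented-out settings in qemu.conf mean the defaults shown above (49152 to 49215) are in effect. It parses the two outputs quoted in this comment and tests whether the ranges intersect.

```python
import re
from typing import Tuple

# Defaults as shown in the commented-out lines of /etc/libvirt/qemu.conf.
DEFAULT_MIGRATION_PORTS = (49152, 49215)


def migration_port_range(qemu_conf: str) -> Tuple[int, int]:
    """Return (min, max) from qemu.conf text, ignoring commented-out lines."""
    low, high = DEFAULT_MIGRATION_PORTS
    for line in qemu_conf.splitlines():
        match = re.match(r"\s*migration_port_(min|max)\s*=\s*(\d+)", line)
        if match is None:
            continue  # a comment or an unrelated setting
        if match.group(1) == "min":
            low = int(match.group(2))
        else:
            high = int(match.group(2))
    return low, high


def ephemeral_port_range(proc_value: str) -> Tuple[int, int]:
    """Parse the content of /proc/sys/net/ipv4/ip_local_port_range."""
    low, high = proc_value.split()
    return int(low), int(high)


def ranges_overlap(a: Tuple[int, int], b: Tuple[int, int]) -> bool:
    """True if two inclusive port ranges share at least one port."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    conf = "#migration_port_min = 49152\n#migration_port_max = 49215\n"
    migration = migration_port_range(conf)
    ephemeral = ephemeral_port_range("32768\t61000\n")
    print(migration, ephemeral, ranges_overlap(migration, ephemeral))
```

With the values from node-549, the whole migration range lies inside the ephemeral range, so any migration port can be handed out by the kernel to an outgoing connection.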

IMO, we should make sure these two port ranges do not intersect when deploying a compute node.

Changed in mos:
milestone: 9.1 → 9.2
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

MOS DA, could you please take a look at my comment above and incorporate it into fuel-library/puppet-nova?

Changed in mos:
assignee: MOS Nova (mos-nova) → MOS Deployment Automation Team (mos-da)
tags: added: area-puppet
Nikita Karpin (mkarpin)
Changed in mos:
assignee: MOS Deployment Automation Team (mos-da) → Nikita Karpin (mkarpin)
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Note that we'll also need to update the firewall rules on compute nodes to include the new port range:

https://github.com/openstack/fuel-library/blob/789dad263bd467343b0ef3d9d423ac4c0c85311a/deployment/puppet/osnailyfacter/manifests/firewall/firewall.pp#L37

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Perhaps we could simply reserve the ports in this range by means of "ip_local_reserved_ports" ( https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt ), like it's done here:

https://github.com/openstack/fuel-library/blob/master/deployment/puppet/openstack/manifests/reserved_ports.pp#L20

so that we do not need to modify puppet-nova at all.
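
A minimal sketch of what such a reservation could look like, assuming the default migration range 49152 to 49215. The helper name and the pre-existing reserved port in the example are hypothetical, and this is not the content of the fuel-library manifest. The sysctl takes a comma-separated list of ports and ranges, so the new range is appended to whatever is already reserved.

```python
def reserved_ports_value(existing: str, low: int, high: int) -> str:
    """Append the range low-high to an ip_local_reserved_ports value."""
    # The current value may be empty; drop blanks left by split().
    entries = [entry for entry in existing.strip().split(",") if entry]
    new_entry = f"{low}-{high}"
    if new_entry not in entries:
        entries.append(new_entry)
    return ",".join(entries)


if __name__ == "__main__":
    # "35357" is a made-up example of a port that was reserved earlier.
    value = reserved_ports_value("35357", 49152, 49215)
    print(f"net.ipv4.ip_local_reserved_ports = {value}")
```

Reserved ports are only skipped by automatic port assignment (connect(), or bind() with port 0); an explicit bind() to a reserved port still works, so qemu could keep using the range.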

Revision history for this message
Nikita Karpin (mkarpin) wrote :

The bug was fixed in master here: https://review.openstack.org/#/c/381060/. I am now backporting it to stable/mitaka.

Changed in mos:
status: Confirmed → In Progress
Revision history for this message
Nikita Karpin (mkarpin) wrote :

fix is merged

Changed in mos:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/fuel-library 10.0.0rc1

This issue was fixed in the openstack/fuel-library 10.0.0rc1 release candidate.

tags: added: hard-to-verify
tags: added: scale
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/fuel-library 10.0.0

This issue was fixed in the openstack/fuel-library 10.0.0 release.

Andrew Kalach (akndex)
Changed in mos:
status: Fix Committed → Fix Released