Detailed bug description:
during "boot_attach_live_migrate_and_delete_server_with_secgroups" rally scenario few LiveMigration steps have been failed. From rally logs: http://paste.openstack.org/show/582875/
From nova-compute.log on node-675: http://paste.openstack.org/show/582876/
Steps to reproduce:
1. Deploy Fuel 9.0 from fuel-9.0-mos-495-2016-06-16_18-18-00.iso
2. Update Fuel to 9.1:
yum-config-manager --add-repo http://mirror.fuel-infra.org/mos-repos/centos/mos9.0-centos7/snapshots/proposed-2016-09-15-134324/x86_64/
rpm --import http://mirror.fuel-infra.org/mos-repos/centos/mos9.0-centos7/snapshots/proposed-2016-09-15-134324/RPM-GPG-KEY-mos9.0
yum install -y python-cudet
yum clean all
update-prepare prepare master
update-prepare update master
3. Apply the fuel-library patch for bug https://bugs.launchpad.net/fuel/+bug/1596987:
yum -y install patch && curl -s 'https://review.openstack.org/gitweb?p=openstack/fuel-library.git;a=patch;h=15bd6cd3ed4b7b8a42da47d19e9ed1eb4700e7d3' | patch -b -d /etc/puppet/modules/ -p3
4. fuel rel --sync-deployment-tasks --dir /etc/puppet/
5. Create a fuel-agent config for the bootstrap image:
mkdir /usr/share/fuel_bootstrap_cli/files/trusty/etc/fuel-agent
curl -s 'http://paste.openstack.org/raw/506300/' > /usr/share/fuel_bootstrap_cli/files/trusty/etc/fuel-agent/fuel-agent.conf
6. fuel-bootstrap build --verbose --debug --activate --label 'replaced-nailgun-agent_fixed'
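To confirm the rebuilt image is actually active (an extra check, not part of the original steps), list the bootstrap images; the active one is marked in the output:
fuel-bootstrap list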
7. Create a cluster with 3 controllers and 20 OSDs (both on bare metal), plus 1000 computes (qemu-kvm)
8. Create an additional repo "mos9.0-proposed" for the cluster (a verification command follows the snippet):
name: mos9.0-proposed
uri: deb http://mirror.fuel-infra.org/mos-repos/ubuntu/snapshots/9.0-2016-09-15-162322/ mos9.0-proposed main restricted
priority: 1200
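To confirm that a node picks up the repo with the expected pin priority, a standard apt check can be used (an addition, not in the original steps):
apt-cache policy | grep mos9.0-proposed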
9. Fix the rsyslog NOFILE limit, due to bug https://bugs.launchpad.net/fuel/+bug/1626092
On the Fuel node, create /etc/systemd/system/rsyslog.service.d/limits.conf with the following content, then apply it as shown after the snippet:
[Service]
LimitNOFILE=16384
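The drop-in only takes effect after systemd reloads its unit files and rsyslog restarts (standard systemd steps, not spelled out in the original report):
systemctl daemon-reload
systemctl restart rsyslog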
10. Deploy the following cluster:
3 hardware controllers, 20 hardware Ceph OSDs, 1000 virtual computes (KVM, 7 VMs per hypervisor host),
vxlan+dvr, Ceph for all, OpenStack and deployment debug enabled
Expected results:
All LiveMigration steps of the Rally scenario pass.
Actual result:
Several LiveMigration steps failed (see the Rally logs above).
Reproducibility:
tried once
Workaround:
none yet
Impact:
On a 1000-node cluster, spawning and live-migrating 5 instances at the same time can cause live migrations to fail.
Description of the environment:
- Operating system: Ubuntu
- Versions of components: MOS 9.1 (9.0-2016-09-15-162322)
- Reference architecture: 3 hardware controllers, 20 hardware Ceph OSDs, 1000 virtual computes (KVM, 7 VMs per hypervisor host), Ceph for all, OpenStack and deployment debug enabled
- Network model: vxlan+dvr
- Related projects installed: -
Additional information:
Diagnostic Snapshot feature doesn't work due to https://bugs.launchpad.net/fuel/+bug/1627477
Logs from fuel node: http://mos-scale-share.mirantis.com/1627476_fuel_logs.tar.gz
Logs from compute node-675: http://mos-scale-share.mirantis.com/1627476_node-675_logs.tar.gz
Logs from compute node-20: http://mos-scale-share.mirantis.com/1627491_node-20_logs.tar.gz
Logs from controller node-1041: http://mos-scale-share.mirantis.com/1627476_node-1041_logs.tar.gz
Logs from controller node-1042: http://mos-scale-share.mirantis.com/1627476_node-1042_logs.tar.gz
Logs from controller node-1043: http://mos-scale-share.mirantis.com/1627476_node-1043_logs.tar.gz
Live migration fails because qemu can't bind a socket:
2016-09-24T11:25:15.164333Z qemu-system-x86_64: -incoming tcp:[::]:49152: Failed to bind socket: Address already in use
libvirtd allocates a port number within a configured range (we use defaults):
root@node-549:~# grep migration_port /etc/libvirt/qemu.conf
#migration_port_min = 49152
#migration_port_max = 49215
The problem is that libvirtd does not actually *bind* a socket when it allocates a port, so there is a race between the allocation done internally in libvirtd and the start of the qemu process (which may fail on bind(), as we see in our case).
libvirtd *must* handle concurrent live migrations on one target host properly by itself, but the default port range overlaps with the ephemeral port range (which is used to establish connections to remote endpoints):
root@node-549:~# cat /proc/sys/net/ipv4/ip_local_port_range
32768 61000
i.e. 49152-49215 lies entirely inside 32768-61000, so the kernel might have already given the very same port number to some other process that called connect() after libvirtd picked it, but before qemu actually started and bound it.
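To see which process actually grabbed the port at the moment of failure, one could run a diagnostic like the following (hypothetical check, not taken from the attached logs):
ss -tanp 'sport = :49152'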
IMO, we should make sure these two port ranges do not intersect when deploying a compute node.
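A minimal sketch of two ways to enforce that, assuming we control sysctl and qemu.conf at deployment time (illustrations of the idea only, not an agreed fix; the sysctl file name below is hypothetical):
# Option 1: keep libvirt's defaults and reserve its range out of the ephemeral pool
echo 'net.ipv4.ip_local_reserved_ports = 49152-49215' > /etc/sysctl.d/99-libvirt-migration.conf
sysctl -p /etc/sysctl.d/99-libvirt-migration.conf
# Option 2: move the migration range above the ephemeral one in /etc/libvirt/qemu.conf
#   migration_port_min = 61152
#   migration_port_max = 61215
# and restart libvirt afterwards (service name varies by distro):
service libvirt-bin restart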