Detailed bug description:
during "boot_attach_live_migrate_and_delete_server_with_secgroups" rally scenario few LiveMigration steps have been failed. From rally logs: http://paste.openstack.org/show/582875/
From nova-compute.log on node-675: http://paste.openstack.org/show/582876/
Steps to reproduce:
1. Deploy Fuel 9.0 from fuel-9.0-mos-495-2016-06-16_18-18-00.iso
2. Update Fuel to 9.1:
yum-config-manager --add-repo http://mirror.fuel-infra.org/mos-repos/centos/mos9.0-centos7/snapshots/proposed-2016-09-15-134324/x86_64/
rpm --import http://mirror.fuel-infra.org/mos-repos/centos/mos9.0-centos7/snapshots/proposed-2016-09-15-134324/RPM-GPG-KEY-mos9.0
yum install -y python-cudet
yum clean all
update-prepare prepare master
update-prepare update master
3. Apply the fuel-library patch for bug https://bugs.launchpad.net/fuel/+bug/1596987:
yum -y install patch && curl -s 'https://review.openstack.org/gitweb?p=openstack/fuel-library.git;a=patch;h=15bd6cd3ed4b7b8a42da47d19e9ed1eb4700e7d3' | patch -b -d /etc/puppet/modules/ -p3
4. fuel rel --sync-deployment-tasks --dir /etc/puppet/
5. Create a fuel-agent config for the bootstrap image:
mkdir /usr/share/fuel_bootstrap_cli/files/trusty/etc/fuel-agent
curl -s 'http://paste.openstack.org/raw/506300/' > /usr/share/fuel_bootstrap_cli/files/trusty/etc/fuel-agent/fuel-agent.conf
6. fuel-bootstrap build --verbose --debug --activate --label 'replaced-nailgun-agent_fixed'
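To confirm the rebuilt image is actually active (an extra check, not part of the original steps), list the bootstrap images; the active one is marked in the output:
fuel-bootstrap list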
7. Create a cluster with 3 controllers and 20 OSDs (both on bare metal), plus 1000 computes (qemu-kvm)
8. Create an additional repo "mos9.0-proposed" for the cluster (a verification command follows the snippet):
name: mos9.0-proposed
uri: deb http://mirror.fuel-infra.org/mos-repos/ubuntu/snapshots/9.0-2016-09-15-162322/ mos9.0-proposed main restricted
priority: 1200
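To confirm that a node picks up the repo with the expected pin priority, a standard apt check can be used (an addition, not in the original steps):
apt-cache policy | grep mos9.0-proposed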
9. Fix the rsyslog NOFILE limit, due to bug https://bugs.launchpad.net/fuel/+bug/1626092
On the Fuel node, create /etc/systemd/system/rsyslog.service.d/limits.conf with the following content, then apply it as shown after the snippet:
[Service]
LimitNOFILE=16384
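The drop-in only takes effect after systemd reloads its unit files and rsyslog restarts (standard systemd steps, not spelled out in the original report):
systemctl daemon-reload
systemctl restart rsyslog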
10. Deploy the following cluster:
3 hardware controllers, 20 hardware Ceph OSDs, 1000 virtual computes (KVM, 7 VMs per hypervisor host),
vxlan+dvr, Ceph for all, OpenStack and deployment debug enabled
Expected results:
All LiveMigration steps of the Rally scenario pass.
Actual result:
Several LiveMigration steps failed (see the Rally logs above).
Reproducibility:
tried once
Workaround:
none yet
Impact:
On a 1000-node cluster, spawning and live-migrating 5 instances at the same time can cause live migrations to fail.
Description of the environment:
- Operating system: Ubuntu
- Versions of components: MOS 9.1 (9.0-2016-09-15-162322)
- Reference architecture: 3 hardware controllers, 20 hardware Ceph OSDs, 1000 virtual computes (KVM, 7 VMs per hypervisor host), Ceph for all, OpenStack and deployment debug enabled
- Network model: vxlan+dvr
- Related projects installed: -
Additional information:
Diagnostic Snapshot feature doesn't work due to https://bugs.launchpad.net/fuel/+bug/1627477
Logs from fuel node: http://mos-scale-share.mirantis.com/1627476_fuel_logs.tar.gz
Logs from compute node-675: http://mos-scale-share.mirantis.com/1627476_node-675_logs.tar.gz
Logs from compute node-20: http://mos-scale-share.mirantis.com/1627491_node-20_logs.tar.gz
Logs from controller node-1041: http://mos-scale-share.mirantis.com/1627476_node-1041_logs.tar.gz
Logs from controller node-1042: http://mos-scale-share.mirantis.com/1627476_node-1042_logs.tar.gz
Logs from controller node-1043: http://mos-scale-share.mirantis.com/1627476_node-1043_logs.tar.gz
Live migration fails because qemu can't bind a socket:
2016-09-24T11:25:15.164333Z qemu-system-x86_64: -incoming tcp:[::]:49152: Failed to bind socket: Address already in use
libvirtd allocates a port number within a configured range (we use defaults):
root@node-549:~# grep migration_port /etc/libvirt/qemu.conf
#migration_port_min = 49152
#migration_port_max = 49215
The problem is that libvirtd does not actually *bind* a socket when it allocates a port, so there is a race between the allocation done internally in libvirtd and the start of the qemu process (which may fail on bind(), as we see in our case).
libvirtd *must* handle concurrent live migrations on one target host properly by itself, but the default port range overlaps with the ephemeral port range (which is used to establish connections to remote endpoints):
root@node-549:~# cat /proc/sys/net/ipv4/ip_local_port_range
32768 61000
i.e. 49152-49215 lies entirely inside 32768-61000, so the kernel might have already given the very same port number to some other process that called connect() after libvirtd picked it, but before qemu actually started and bound it.
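To see which process actually grabbed the port at the moment of failure, one could run a diagnostic like the following (hypothetical check, not taken from the attached logs):
ss -tanp 'sport = :49152'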
IMO, we should make sure these two port ranges do not intersect when deploying a compute node.
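A minimal sketch of two ways to enforce that, assuming we control sysctl and qemu.conf at deployment time (illustrations of the idea only, not an agreed fix; the sysctl file name below is hypothetical):
# Option 1: keep libvirt's defaults and reserve its range out of the ephemeral pool
echo 'net.ipv4.ip_local_reserved_ports = 49152-49215' > /etc/sysctl.d/99-libvirt-migration.conf
sysctl -p /etc/sysctl.d/99-libvirt-migration.conf
# Option 2: move the migration range above the ephemeral one in /etc/libvirt/qemu.conf
#   migration_port_min = 61152
#   migration_port_max = 61215
# and restart libvirt afterwards (service name varies by distro):
service libvirt-bin restart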