Detailed bug description:
During the "boot_attach_live_migrate_and_delete_server_with_secgroups" Rally scenario, a few live migration steps failed. From the Rally logs: http://paste.openstack.org/show/583039/
From nova-compute on node-992: http://paste.openstack.org/show/583042/
Manually trying telnet and ping gives: http://paste.openstack.org/show/583044/
As we can see, node-10.domain.tld is resolved to 10.21.10.159, but 10.21.0.0/16 is the admin network. The node-10 record is missing from the hosts file on node-992.
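A quick way to see which network a name resolves into on a given node is sketched below. The helper name classify_ip is hypothetical; the two prefixes are the admin (10.21.0.0/16) and management (10.41.0.0/16) networks named in this report.

```shell
#!/bin/sh
# classify_ip is a hypothetical helper, not part of Fuel.
# Prefixes are taken from this report:
#   10.21.0.0/16 = admin network, 10.41.0.0/16 = management network.
classify_ip() {
    case "$1" in
        10.41.*) echo management ;;
        10.21.*) echo admin ;;
        *)       echo other ;;
    esac
}

# Usage on a node (getent follows the lookup order in /etc/nsswitch.conf):
#   classify_ip "$(getent hosts node-10.domain.tld | awk '{print $1}')"
```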
There are 1023 nodes in the env, so there should be 2046 records (two per node) for the 10.41.0.0/16 network:
root@node-992:~# echo "(" 2046 - `grep -c 10.41 /etc/hosts` ")" / 2 | bc
21
But records for 21 hosts are actually missing.
Checking the listening port and iptables on node-10: http://paste.openstack.org/show/583045/
The service listens on all addresses, but iptables allows only the management network 10.41.0.0/16.
So the issue is that some hosts file records are missing on some nodes.
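The grep count above only says how many records are missing, not which ones. A minimal sketch for listing the missing names follows. It assumes a file with one expected FQDN per line (called expected_nodes.txt here, a hypothetical name) is available, for example built from the node list on the Fuel master. The function name missing_hosts is also hypothetical.

```shell
#!/bin/sh
# missing_hosts HOSTS_FILE EXPECTED_LIST
# Prints every name from EXPECTED_LIST (one hostname per line) that does not
# appear on any management-network (10.41.x.x) line of HOSTS_FILE.
missing_hosts() {
    awk '
        FILENAME == ARGV[1] {                 # first file: the hosts file
            if ($1 ~ /^10\.41\./)             # only management-network records
                for (i = 2; i <= NF; i++) seen[$i] = 1
            next
        }
        NF && !($1 in seen) { print $1 }      # second file: expected names
    ' "$1" "$2"
}

# Usage on a node:
#   missing_hosts /etc/hosts expected_nodes.txt
```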
Steps to reproduce:
1. Deploy Fuel 9.0 from fuel-9.0-mos-495-2016-06-16_18-18-00.iso
2. Update Fuel to 9.1:
yum-config-manager --add-repo http://mirror.fuel-infra.org/mos-repos/centos/mos9.0-centos7/snapshots/proposed-2016-09-15-134324/x86_64/
rpm --import http://mirror.fuel-infra.org/mos-repos/centos/mos9.0-centos7/snapshots/proposed-2016-09-15-134324/RPM-GPG-KEY-mos9.0
yum install -y python-cudet
yum clean all
update-prepare prepare master
update-prepare update master
3. Apply the nailgun patch for bug https://bugs.launchpad.net/fuel/+bug/1596987:
yum -y install patch && curl -s 'https://review.openstack.org/gitweb?p=openstack/fuel-library.git;a=patch;h=15bd6cd3ed4b7b8a42da47d19e9ed1eb4700e7d3' | patch -b -d /etc/puppet/modules/ -p3
4. fuel rel --sync-deployment-tasks --dir /etc/puppet/
5. Create config for nailgun-agent
mkdir /usr/share/fuel_bootstrap_cli/files/trusty/etc/fuel-agent
curl -s 'http://paste.openstack.org/raw/506300/' > /usr/share/fuel_bootstrap_cli/files/trusty/etc/fuel-agent/fuel-agent.conf
6. fuel-bootstrap build --verbose --debug --activate --label 'replaced-nailgun-agent_fixed'
7. Create a cluster with 3 controllers, 20 OSDs (both bare metal) and 1000 computes (qemu-kvm)
8. Create additional repo for cluster "mos9.0-proposed":
name: mos9.0-proposed
uri: deb http://mirror.fuel-infra.org/mos-repos/ubuntu/snapshots/9.0-2016-09-15-162322/ mos9.0-proposed main restricted
priority: 1200
9. Fix the syslog NOFILE limit because of bug https://bugs.launchpad.net/fuel/+bug/1626092
On the Fuel node, create the file /etc/systemd/system/rsyslog.service.d/limits.conf with the content:
[Service]
LimitNOFILE=16384
10. Deploy the following cluster:
3 hardware controllers, 20 hardware Ceph OSDs, 1000 virtual computes (KVM, 7 VMs per hypervisor host),
vxlan+dvr, Ceph for all, OpenStack and deployment debug enabled
11. Apply the workaround for https://bugs.launchpad.net/mos/+bug/1627476:
Change /etc/libvirt/qemu.conf on each compute node:
migration_port_min = 61152
migration_port_max = 61215
Change the iptables rules on each compute node:
sed -i s/"49152:49215"/"61152:61215"/ /etc/iptables/rules.v4
service iptables-persistent restart
12. Regarding https://bugs.launchpad.net/mos/+bug/1627491: the Ceph keyrings on the wrong nodes (those that were not available in the hypervisors list) differed from the keyrings on the right nodes. That probably occurred because of some failures during the deployment step. Worked around it from a working compute node:
root@node-549:~# for i in `cat wrong_hypervisors.list`; do scp /etc/ceph/*.keyring $i:/etc/ceph/; done
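The qemu.conf change in step 11 was made by hand. A minimal sketch of scripting it is shown below. The helper name set_conf_key is hypothetical, and it assumes GNU sed (as on the Ubuntu compute nodes) and simple "key = value" lines.

```shell
#!/bin/sh
# set_conf_key FILE KEY VALUE  (hypothetical helper)
# Rewrites every existing "KEY = ..." line, commented out or not, to
# "KEY = VALUE"; appends the line if the key is not present at all.
set_conf_key() {
    file=$1 key=$2 value=$3
    if grep -Eq "^[#[:space:]]*${key}[[:space:]]*=" "$file"; then
        sed -i -E "s|^[#[:space:]]*${key}[[:space:]]*=.*|${key} = ${value}|" "$file"
    else
        printf '%s = %s\n' "$key" "$value" >> "$file"
    fi
}

# Usage on a compute node, with the values from step 11:
#   set_conf_key /etc/libvirt/qemu.conf migration_port_min 61152
#   set_conf_key /etc/libvirt/qemu.conf migration_port_max 61215
```

libvirt reads qemu.conf when the daemon starts, so the daemon needs a restart after the change.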
Expected results:
The test passes
Actual result:
The test fails
Reproducibility:
tried once
Workaround:
Add the missing records to the hosts file on each node.
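One possible way to add the records is sketched below. It is not the script linked under "Additional information". It assumes a complete hosts file copied from a healthy node is available as the reference, and the function name append_missing_records is hypothetical.

```shell
#!/bin/sh
# append_missing_records REFERENCE_HOSTS TARGET_HOSTS  (hypothetical helper)
# Appends to TARGET_HOSTS every management-network (10.41.x.x) line of
# REFERENCE_HOSTS whose "IP first-name" pair is not already in TARGET_HOSTS.
append_missing_records() {
    # Read the target first, then the reference; collect the output before
    # writing so the target is never read and written at the same time.
    new=$(awk '
        FILENAME == ARGV[1] { have[$1 " " $2] = 1; next }
        $1 ~ /^10\.41\./ {
            key = $1 " " $2
            if (!(key in have)) print
        }
    ' "$2" "$1") || return 1
    if [ -n "$new" ]; then
        printf '%s\n' "$new" >> "$2"
    fi
}

# Usage on an affected node (reference file path is an assumption):
#   append_missing_records /root/hosts.reference /etc/hosts
```

Running it a second time appends nothing, because the records are then already present.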
Impact:
Some nodes can't resolve DNS names to IP addresses in the management network
Description of the environment:
- Operating system: Ubuntu
- Versions of components: MOS 9.1 (9.0-2016-09-15-162322)
- Reference architecture: 3 hardware controllers, 20 hardware Ceph OSDs, 1000 virtual computes (KVM, 7 VMs per hypervisor host), Ceph for all, OpenStack and deployment debug enabled
- Network model: vxlan+dvr
- Related projects installed: -
Additional information:
The Diagnostic Snapshot feature doesn't work due to https://bugs.launchpad.net/fuel/+bug/1627477
Logs from fuel node: http://mos-scale-share.mirantis.com/1627476_fuel_logs.tar.gz
Logs from compute node-675: http://mos-scale-share.mirantis.com/1627476_node-675_logs.tar.gz
Logs from compute node-20: http://mos-scale-share.mirantis.com/1627491_node-20_logs.tar.gz
Logs from controller node-1041: http://mos-scale-share.mirantis.com/1627476_node-1041_logs.tar.gz
Logs from controller node-1042: http://mos-scale-share.mirantis.com/1627476_node-1042_logs.tar.gz
Logs from controller node-1043: http://mos-scale-share.mirantis.com/1627476_node-1043_logs.tar.gz
As a workaround we use this script: http://paste.openstack.org/raw/583077/