some records are missing in hosts file on some hosts

Bug #1628068 reported by Leontii Istomin
This bug affects 1 person
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: Leontii Istomin

Bug Description

Detailed bug description:
 During the "boot_attach_live_migrate_and_delete_server_with_secgroups" rally scenario, several live migration steps failed. From the rally logs: http://paste.openstack.org/show/583039/
From nova-compute on node-992: http://paste.openstack.org/show/583042/
If I manually try telnet and ping: http://paste.openstack.org/show/583044/
As we can see, node-10.domain.tld resolves to 10.21.10.159, but 10.21.0.0/16 is the admin network. The node-10 record is missing from the hosts file on node-992.
There are 1023 nodes in the env, so there should be 2046 records for the 10.41.0.0/16 network:
root@node-992:~# echo "(" 2046 - `grep -c 10.41 /etc/hosts` ")" / 2 | bc
21
But in fact records are missing for 21 hosts.
Checking the listening port and iptables on node-10: http://paste.openstack.org/show/583045/
The service listens on all addresses, but iptables allows only the management network 10.41.0.0/16.
So the issue is that some records are missing from the hosts file on some nodes.
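
The shell arithmetic above can also be sketched in Python. This is only a minimal sketch, assuming two records per node on the management network, as the description states. The function name `count_missing_hosts` and its parameters are illustrative and not part of Fuel. Unlike `grep -c 10.41`, it counts only lines whose address field starts with the prefix.

```python
def count_missing_hosts(hosts_text, total_nodes,
                        network_prefix="10.41.", records_per_node=2):
    """Estimate how many nodes lack hosts-file records on one network.

    hosts_text       -- contents of /etc/hosts as a string
    total_nodes      -- number of nodes in the environment
    network_prefix   -- textual prefix of the addresses to count (assumption)
    records_per_node -- records expected per node (assumption: 2)
    """
    present = 0
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split fields
        if fields and fields[0].startswith(network_prefix):
            present += 1
    expected = total_nodes * records_per_node
    return (expected - present) // records_per_node
```

With 1023 nodes and 2004 matching lines, the function returns the same value as the `bc` expression above, (2046 - 2004) / 2.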

Steps to reproduce:

1. Deploy Fuel 9.0 from fuel-9.0-mos-495-2016-06-16_18-18-00.iso
2. Update Fuel to 9.1:
yum-config-manager --add-repo http://mirror.fuel-infra.org/mos-repos/centos/mos9.0-centos7/snapshots/proposed-2016-09-15-134324/x86_64/
rpm --import http://mirror.fuel-infra.org/mos-repos/centos/mos9.0-centos7/snapshots/proposed-2016-09-15-134324/RPM-GPG-KEY-mos9.0
yum install -y python-cudet
yum clean all
update-prepare prepare master
update-prepare update master
3. Apply the nailgun patch for bug https://bugs.launchpad.net/fuel/+bug/1596987:
yum -y install patch && curl -s 'https://review.openstack.org/gitweb?p=openstack/fuel-library.git;a=patch;h=15bd6cd3ed4b7b8a42da47d19e9ed1eb4700e7d3' | patch -b -d /etc/puppet/modules/ -p3
4. fuel rel --sync-deployment-tasks --dir /etc/puppet/
5. Create config for nailgun-agent
mkdir /usr/share/fuel_bootstrap_cli/files/trusty/etc/fuel-agent
curl -s 'http://paste.openstack.org/raw/506300/' > /usr/share/fuel_bootstrap_cli/files/trusty/etc/fuel-agent/fuel-agent.conf
6. fuel-bootstrap build --verbose --debug --activate --label 'replaced-nailgun-agent_fixed'
7. Create a cluster with 3 controllers, 20 OSDs (both baremetal) and 1000 computes (qemu-kvm)
8. Create additional repo for cluster "mos9.0-proposed":
name: mos9.0-proposed
uri: deb http://mirror.fuel-infra.org/mos-repos/ubuntu/snapshots/9.0-2016-09-15-162322/ mos9.0-proposed main restricted
priority: 1200
9. Fix syslog NOFILE limit, due to bug https://bugs.launchpad.net/fuel/+bug/1626092
On fuel node create file /etc/systemd/system/rsyslog.service.d/limits.conf with content:
[Service]
LimitNOFILE=16384
10. Deploy the following cluster:
3 hardware controllers, 20 hardware Ceph OSDs, 1000 virtual computes (KVM, 7 VMs per hypervisor host),
vxlan+dvr, Ceph for all, OpenStack and deployment debug enabled
11. Applied the workaround for https://bugs.launchpad.net/mos/+bug/1627476:
Changed /etc/libvirt/qemu.conf on each compute node:
migration_port_min = 61152
migration_port_max = 61215
Changed the iptables rules on each compute node:
sed -i s/"49152:49215"/"61152:61215"/ /etc/iptables/rules.v4
service iptables-persistent restart
12. Regarding https://bugs.launchpad.net/mos/+bug/1627491: found that the ceph keyrings on the wrong nodes (those that weren't available in the hypervisors list) differed from the keyrings on the right nodes. That probably occurred because of some failures during the deployment step. Worked around it from a working compute node:
root@node-549:~# for i in `cat wrong_hypervisors.list`; do scp /etc/ceph/*.keyring $i:/etc/ceph/; done

Expected results:
 The test passes
Actual result:
 The test fails
Reproducibility:
 tried once
Workaround:
 Add the missing records to the hosts file on each node.
Impact:
 Some nodes can't resolve hostnames to IP addresses on the management network
Description of the environment:
- Operating system: Ubuntu
- Versions of components: MOS 9.1 (9.0-2016-09-15-162322)
- Reference architecture: 3 hardware controllers, 20 hardware Ceph OSDs, 1000 virtual computes (KVM, 7 VMs per hypervisor host), Ceph for all, OpenStack and deployment debug enabled
- Network model: vxlan+dvr
- Related projects installed: -
Additional information:
 The Diagnostic Snapshot feature doesn't work due to https://bugs.launchpad.net/fuel/+bug/1627477
 Logs from fuel node: http://mos-scale-share.mirantis.com/1627476_fuel_logs.tar.gz
 Logs from compute node-675: http://mos-scale-share.mirantis.com/1627476_node-675_logs.tar.gz
 Logs from compute node-20: http://mos-scale-share.mirantis.com/1627491_node-20_logs.tar.gz
 Logs from controller node-1041: http://mos-scale-share.mirantis.com/1627476_node-1041_logs.tar.gz
 Logs from controller node-1042: http://mos-scale-share.mirantis.com/1627476_node-1042_logs.tar.gz
 Logs from controller node-1043: http://mos-scale-share.mirantis.com/1627476_node-1043_logs.tar.gz

Tags: scale
Changed in fuel:
assignee: nobody → Oleksiy Molchanov (omolchanov)
Leontii Istomin (listomin) wrote :

As a workaround we use this script: http://paste.openstack.org/raw/583077/
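
The linked script is not reproduced here. As a rough illustration only, the Python sketch below shows one way such a workaround could work out which lines to append. It assumes a known mapping of hostnames to management-network addresses. The function name `missing_host_lines`, the input mapping and the sample address in the usage example are hypothetical and not taken from the paste.

```python
def missing_host_lines(hosts_text, expected):
    """Return hosts-file lines for expected names that are absent.

    hosts_text -- current contents of /etc/hosts as a string
    expected   -- dict of fully qualified hostname -> IP (hypothetical input)
    """
    known = set()
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # ignore comments
        known.update(fields[1:])                # names and aliases after the IP
    lines = []
    for fqdn, ip in sorted(expected.items()):
        if fqdn not in known:
            short = fqdn.split(".", 1)[0]       # e.g. node-10
            lines.append(f"{ip}\t{fqdn} {short}")
    return lines


if __name__ == "__main__":
    current = "10.41.0.1 node-1.domain.tld node-1\n"
    wanted = {
        "node-1.domain.tld": "10.41.0.1",
        "node-10.domain.tld": "10.41.0.10",  # made-up address for illustration
    }
    for entry in missing_host_lines(current, wanted):
        print(entry)
```

The sketch only prints the candidate lines; appending them to /etc/hosts on each node is left as a separate, reviewed step.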

Oleksiy Molchanov (omolchanov) wrote :

I took a look at the log from node-1043 and can see that it has 2018 hosts entries created.

http://paste.openstack.org/show/583114/

Please provide me with an environment to check this live; meanwhile I am marking this as Incomplete.

Changed in fuel:
status: New → Incomplete
assignee: Oleksiy Molchanov (omolchanov) → Fuel Sustaining (fuel-sustaining-team)
Leontii Istomin (listomin) wrote :

@Oleksiy,

There should be at least 2023 records:
1023 nodes (1000 computes, 3 controllers, 20 OSDs)
+ 1003 records with prefix "messaging-" for computes and controllers
for example node-10 and node-20 are missing.
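
The tally above can be checked with a short Python sketch. It assumes one plain record per node plus one "messaging-" record per compute and controller, as described in this comment; the function name `expected_hosts_records` is illustrative only.

```python
def expected_hosts_records(computes, controllers, osds):
    """Expected number of hosts records under the stated assumption."""
    node_records = computes + controllers + osds   # one record per node
    messaging_records = computes + controllers     # "messaging-" records
    return node_records + messaging_records


if __name__ == "__main__":
    total = expected_hosts_records(computes=1000, controllers=3, osds=20)
    print(total)  # 1023 + 1003
```

With the figures quoted here the sum is 2026, which is consistent with the "at least 2023" bound and above the 2018 entries seen on node-1043.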

We've already provided you with credentials to the lab.
Please don't mark a bug as Incomplete without specifying what information you need.

Changed in fuel:
status: Incomplete → New
Changed in fuel:
status: New → Confirmed
assignee: Fuel Sustaining (fuel-sustaining-team) → Oleksiy Molchanov (omolchanov)
Changed in fuel:
assignee: Oleksiy Molchanov (omolchanov) → Georgy Kibardin (gkibardin)
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
milestone: none → 9.2
importance: Undecided → High
Roman Rufanov (rrufanov) wrote :

Any update on resolving this?

Georgy Kibardin (gkibardin) wrote :

Roman, I'm just back from vacation and will continue with it soon.

Changed in fuel:
status: Confirmed → In Progress
Georgy Kibardin (gkibardin) wrote :

Are you sure that node-10 was in cluster 2 at the time of deployment? According to what I see in the logs, it wasn't. However, the env now seems to be different: all the nodes (including node-10) are in cluster 1, except nodes 40-43, which are in cluster 2 (where everything was before).

Changed in fuel:
status: In Progress → Incomplete
Changed in fuel:
assignee: Georgy Kibardin (gkibardin) → Leontiy Istomin (listomin)
Denis Meltsaykin (dmeltsaykin) wrote :

Setting as Invalid after a month in Incomplete.

Changed in fuel:
status: Incomplete → Invalid