Live Migration failure: Unable to pre-create chardev file console.log: No such file or directory

Bug #1628652 reported by Leontii Istomin
Affects: Mirantis OpenStack
Status: Invalid
Importance: Medium
Assigned to: MOS Nova
Milestone: 9.2

Bug Description

Detailed bug description:
 During the "boot_attach_live_migrate_and_delete_server_with_secgroups" Rally scenario, a few live migration steps failed. From the Rally logs: http://paste.openstack.org/show/583362/
From nova-compute on node-442: http://paste.openstack.org/show/583363/
As we can see, the directory/file was missing.

From nova-api on node-1042: http://paste.openstack.org/show/583367/
Here we can see that Nova tried to migrate the instance to node-784.

From nova-compute on node-784: http://paste.openstack.org/show/583371/
Here we can see that Nova tried to create the directory and the file.

From nova-compute on node-784: http://paste.openstack.org/show/583372/
And here we can see that the directory was successfully moved later (during the deletion process).
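
A quick way to confirm this state manually on the destination compute, before retrying the migration, is to check the instance directory by hand. This is only a sketch: it assumes the default Nova instances_path (/var/lib/nova/instances, matching the /var/lib/nova mount shown below), and <instance-uuid> is a placeholder for the UUID from the error:

root@node-784:~# ls -ld /var/lib/nova/instances/<instance-uuid>
root@node-784:~# ls -l /var/lib/nova/instances/<instance-uuid>/console.log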

root@node-784:~# df -h
Filesystem Size Used Avail Use% Mounted on
udev 2.0G 12K 2.0G 1% /dev
tmpfs 396M 448K 395M 1% /run
/dev/dm-1 141G 2.7G 131G 2% /
none 4.0K 0 4.0K 0% /sys/fs/cgroup
none 5.0M 0 5.0M 0% /run/lock
none 2.0G 0 2.0G 0% /run/shm
none 100M 0 100M 0% /run/user
/dev/vda3 197M 59M 129M 32% /boot
/dev/mapper/vm-nova 53G 33M 53G 1% /var/lib/nova

Steps to reproduce:

1. Deploy Fuel 9.0 from fuel-9.0-mos-495-2016-06-16_18-18-00.iso
2. Update Fuel to 9.1:
yum-config-manager --add-repo http://mirror.fuel-infra.org/mos-repos/centos/mos9.0-centos7/snapshots/proposed-2016-09-15-134324/x86_64/
rpm --import http://mirror.fuel-infra.org/mos-repos/centos/mos9.0-centos7/snapshots/proposed-2016-09-15-134324/RPM-GPG-KEY-mos9.0
yum install -y python-cudet
yum clean all
update-prepare prepare master
update-prepare update master
3. Apply the nailgun patch due to bug https://bugs.launchpad.net/fuel/+bug/1596987:
yum -y install patch && curl -s 'https://review.openstack.org/gitweb?p=openstack/fuel-library.git;a=patch;h=15bd6cd3ed4b7b8a42da47d19e9ed1eb4700e7d3' | patch -b -d /etc/puppet/modules/ -p3
4. fuel rel --sync-deployment-tasks --dir /etc/puppet/
5. Create config for nailgun-agent
mkdir /usr/share/fuel_bootstrap_cli/files/trusty/etc/fuel-agent
curl -s 'http://paste.openstack.org/raw/506300/' > /usr/share/fuel_bootstrap_cli/files/trusty/etc/fuel-agent/fuel-agent.conf
6. fuel-bootstrap build --verbose --debug --activate --label 'replaced-nailgun-agent_fixed'
7. Create a cluster with 3 controllers, 20 OSDs (both bare metal) and 1000 computes (qemu-kvm)
8. Create an additional repo for the cluster, "mos9.0-proposed":
name: mos9.0-proposed
uri: deb http://mirror.fuel-infra.org/mos-repos/ubuntu/snapshots/9.0-2016-09-15-162322/ mos9.0-proposed main restricted
priority: 1200
9. Fix the rsyslog NOFILE limit due to bug https://bugs.launchpad.net/fuel/+bug/1626092:
On the Fuel node, create the file /etc/systemd/system/rsyslog.service.d/limits.conf with the following content (the service reload commands are sketched after this list):
[Service]
LimitNOFILE=16384
10. Deploy the following cluster:
3 hardware controllers, 20 hardware Ceph OSDs, 1000 virtual computes (KVM, 7 VMs per hypervisor host),
VXLAN+DVR, Ceph for everything, OpenStack and deployment debug enabled
11. Apply the workaround for https://bugs.launchpad.net/mos/+bug/1627476:
Change /etc/libvirt/qemu.conf on each compute node:
migration_port_min = 61152
migration_port_max = 61215
Change the iptables rules on each compute node (a libvirtd restart is also needed; see the sketch after this list):
sed -i s/"49152:49215"/"61152:61215"/ /etc/iptables/rules.v4
service iptables-persistent restart
12. Regarding https://bugs.launchpad.net/mos/+bug/1627491, we found that the Ceph keyrings on the problem nodes (which were not available in the hypervisors list) differed from the keyrings on the healthy nodes. That probably occurred because of failures during the deployment step. We worked around it by copying the keyrings from a working compute node:
root@node-549:~# for i in `cat wrong_hypervisors.list`; do scp /etc/ceph/*.keyring $i:/etc/ceph/; done
13. As a workaround for https://bugs.launchpad.net/fuel/+bug/1628068 we used this script: http://paste.openstack.org/raw/583077/
14. As a workaround for https://bugs.launchpad.net/mos/+bug/1628145:
for i in 1029 1035 162 20 237 259 355 506 541 606 961; do ssh node-$i "virsh secret-set-value a5d0dd94-57c4-ae55-ffe0-7e3732a24455 AQAXg+NXppjkMBAAh4oG4msWbsBY9eQrZeoTdg=="; done
for i in 1029 1035 162 20 237 259 355 506 541 606 961; do ssh node-$i "service libvirtd restart"; done
for i in 1029 1035 162 20 237 259 355 506 541 606 961; do ssh node-$i "service nova-compute restart"; done
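For steps 9 and 11 above, the edited files only take effect after the corresponding services are restarted. A minimal sketch of the expected follow-up commands, assuming the Fuel master runs systemd (CentOS 7) and using the same libvirtd service name as in step 14:

# on the Fuel node, after creating the rsyslog limits drop-in (step 9)
systemctl daemon-reload
systemctl restart rsyslog

# on each compute node, after editing /etc/libvirt/qemu.conf (step 11)
service libvirtd restart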
Expected results:
 The test passes
Actual result:
 The test failed
Reproducibility:
 tried once
Workaround:
 not yet
Impact:
 Live Migration
Description of the environment:
- Operating system: Ubuntu
- Versions of components: MOS 9.1 (9.0-2016-09-15-162322)
- Reference architecture: 3 hardware controllers, 20 hardware Ceph OSDs, 1000 virtual computes (KVM, 7 VMs per hypervisor host), Ceph for everything, OpenStack and deployment debug enabled
- Network model: vxlan+dvr
- Related projects installed: -
Additional information:
 Diagnostic Snapshot feature doesn't work due to https://bugs.launchpad.net/fuel/+bug/1627477
logs from compute node-784: mos-scale-share.mirantis.com/1628652_node-784_logs.tar.gz
logs from compute node-442: mos-scale-share.mirantis.com/1628652_node-442_logs.tar.gz
logs from controller node-1041: mos-scale-share.mirantis.com/1628652_node-1041_logs.tar.gz
logs from controller node-1042: mos-scale-share.mirantis.com/1628652_node-1042_logs.tar.gz
logs from controller node-1043: mos-scale-share.mirantis.com/1628652_node-1043_logs.tar.gz

Tags: area-scale
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote:

It was reproduced only on one compute node out of one thousand, so let's treat this as Medium until we see it has a bigger impact.

Changed in mos:
milestone: none → 9.2
importance: Undecided → Medium
status: New → Confirmed
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote:

Ok, so we took a closer look at this one and it looks like an infrastructure problem.

The problematic node is node-784; live migration to other nodes succeeds. Despite all the migration attempts we made, not a single packet reached libvirtd on node-784:

http://paste.openstack.org/show/583579/

At the same time there are no connection errors on the source nodes (we used node-442 and node-941). When you try to connect manually, it also works:

root@node-941:~# telnet 10.41.0.200 16509
Trying 10.41.0.200...
Connected to 10.41.0.200.
Escape character is '^]'.
HELLO
Connection closed by foreign host.

but the packet counter in iptables for that connection (^) has *not* increased.
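
For reference, this is the kind of check involved, sketched for the libvirtd TCP port 16509 seen in the telnet session above: list the per-rule packet counters on node-784, retry a connection from the source node, and list them again; if the counter does not grow, the traffic is terminating on some other host:

root@node-784:~# iptables -nvL INPUT | grep 16509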

The problem is that there are 2 different nodes with the same IP address:

root@node-941:~# ping node-784
PING node-784.domain.tld (10.41.0.200) 56(84) bytes of data.
64 bytes from node-784.domain.tld (10.41.0.200): icmp_seq=1 ttl=64 time=0.267 ms
^C
--- node-784.domain.tld ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.267/0.267/0.267/0.000 ms
root@node-941:~# arping -I br-mgmt 10.41.0.200
ARPING 10.41.0.200 from 10.41.3.104 br-mgmt
Unicast reply from 10.41.0.200 [3C:FD:FE:9C:74:41] 0.731ms
Unicast reply from 10.41.0.200 [52:54:00:2F:7B:A3] 0.898ms

The "true" node-784 has the NIC with MAC 52:54:00:2F:7B:A3. 3C:FD:FE:9C:74:41 *does not* belong to any node in this Fuel environment:

[root@fuel ~]# ansible -i inventory -m shell -a "ifconfig -a | grep -i '3c:fd:fe:9c:74:41'" all > node-macs

[root@fuel ~]# grep rc=1 node-macs | wc -l
1023
[root@fuel ~]# grep rc=0 node-macs | wc -l
0

nailgun=# select * from ip_addrs where ip_addr = '10.41.0.200';
 id | network | node | ip_addr | vip_name | is_user_defined | vip_namespace
------+---------+------+-------------+----------+-----------------+---------------
5341 | 3 | 784 | 10.41.0.200 | | f |
(1 row)

At the same time libvirtd (without authentication) is running on this "node", so we are able to connect to it, but obviously live migration fails, as all the preparation steps (like the creation of a /var/lib/instances/$uuid directory) were performed by nova-compute on the "true" node-784. Communication between nova-compute and the controller nodes is not affected, as it goes via RabbitMQ and it is nova-compute which initiates the connections, so the controller nodes always have the correct IP/MAC binding in their ARP caches.

I was not able to connect to the problematic node via SSH, as it does not have the master key injected into authorized_keys:

http://paste.openstack.org/show/583584/

I suggest you try to isolate 3C:FD:FE:9C:74:41, find the node it belongs to, and understand why it ended up in the same L2 segment with the very same IP address (perhaps it's a node from another environment, or an old environment that was not deleted properly?).
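
As an illustration of that suggestion (not something that was run in this environment): the offending MAC can be traced through the switch MAC address tables, and, as a stopgap until the rogue host is powered off, the source computes could pin a static neighbour entry for the "true" node-784 MAC. A hedged sketch, assuming br-mgmt is the management bridge as in the arping output above:

root@node-941:~# ip neigh show 10.41.0.200
root@node-941:~# ip neigh replace 10.41.0.200 lladdr 52:54:00:2f:7b:a3 dev br-mgmt nud permanent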

Changed in mos:
status: Confirmed → Invalid
Revision history for this message
Leontii Istomin (listomin) wrote:

The root cause of the issue was that other nodes in the lab (from previous deployments) were powered on and had the same management IP address as node-784, i.e. a duplicated IP address.
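
For similar situations, a compact way to confirm an IP duplication on the management network is to count the distinct MACs answering ARP for the address, relying on the arping output format shown in the comment above; more than one line means the IP is duplicated. A sketch, assuming br-mgmt is the management bridge:

root@node-941:~# arping -I br-mgmt -c 3 10.41.0.200 | awk '/Unicast reply/ {print $5}' | sort -u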
