Network checker failed after rebooting the compute node hosting fuel-master

Bug #1533165 reported by Veronica Krayneva
Affects            | Status  | Importance | Assigned to  | Milestone
-------------------|---------|------------|--------------|----------
Mirantis OpenStack | Invalid | High       | Peter Zhurba |
9.x                | Invalid | High       | Peter Zhurba |

Bug Description

Scenario:
1. Deploy cluster with two computes and three controllers
2. Migrate fuel-master
3. Reboot compute with fuel-master via reboot command
4. Wait till compute and fuel-master come up
5. Run OSTF tests
6. Run Network check
7. Check statuses for master’s services

Actual result:
Network checker failed
[root@nailgun ~]# fuel task
id | status | name | cluster | progress | uuid
---|--------|-------------------------|---------|----------|-------------------------------------
6 | ready | deployment | 1 | 100 | 066b4b61-b27f-4d09-98fe-8af69b4e5e32
1 | ready | deploy | 1 | 100 | a7b0dba3-3a29-4080-b875-15963e4fe702
17 | ready | check_dhcp | 1 | 100 | aea61491-4e3b-4f5b-87c0-87e6126bdbeb
18 | error | check_repo_availability | 1 | 100 | 4960ce85-26e0-4d3c-ab67-5883fbc285dd
19 | ready | create_stats_user | 1 | 100 | ebb22607-e7fc-46d6-8b96-127ce33d8a33
16 | error | verify_networks | 1 | 100 | dde7559d-5953-4781-8715-7dddb600aee1
5 | ready | provision | 1 | 100 | 052c4347-0183-43ee-95d1-c202df273384
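The failed tasks can be picked out of the `fuel task` table mechanically. A minimal sketch, assuming the pipe-separated output format shown above; the sample rows are copied from this report, and on a live master node you would pipe the output of `fuel task` itself:

```shell
# Sample rows copied from the `fuel task` output above; on the master
# node, replace the sample with the live command output.
sample='18 | error | check_repo_availability | 1 | 100 | 4960ce85-26e0-4d3c-ab67-5883fbc285dd
16 | error | verify_networks | 1 | 100 | dde7559d-5953-4781-8715-7dddb600aee1
5 | ready | provision | 1 | 100 | 052c4347-0183-43ee-95d1-c202df273384'

# Print the name of every task whose status column reads "error".
printf '%s\n' "$sample" | awk -F'|' '$2 ~ /error/ { gsub(/ /, "", $3); print $3 }'
```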

Also:
root@node-5:~# nova-manage service list
No handlers could be found for logger "oslo_config.cfg"
2016-01-12 11:39:50.103 8299 DEBUG oslo_db.api [req-841473b8-6817-4992-bfe3-cbf80f15147d - - - - -] Loading backend 'sqlalchemy' from 'nova.db.sqlalchemy.api' _load_backend /usr/lib/python2.7/dist-packages/oslo_db/api.py:230
2016-01-12 11:39:50.182 8299 DEBUG oslo_db.sqlalchemy.engines [req-841473b8-6817-4992-bfe3-cbf80f15147d - - - - -] MySQL server mode set to STRICT_TRANS_TABLES,STRICT_ALL_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,TRADITIONAL,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION _check_effective_sql_mode /usr/lib/python2.7/dist-packages/oslo_db/sqlalchemy/engines.py:256
Binary Host Zone Status State Updated_At
nova-cert node-3.test.domain.local internal enabled :-) 2016-01-12 11:39:40
nova-consoleauth node-3.test.domain.local internal enabled :-) 2016-01-12 11:39:41
nova-scheduler node-3.test.domain.local internal enabled :-) 2016-01-12 11:39:40
nova-conductor node-3.test.domain.local internal enabled :-) 2016-01-12 11:39:41
nova-consoleauth node-5.test.domain.local internal enabled :-) 2016-01-12 11:39:43
nova-scheduler node-5.test.domain.local internal enabled :-) 2016-01-12 11:39:43
nova-conductor node-5.test.domain.local internal enabled :-) 2016-01-12 11:39:40
nova-cert node-5.test.domain.local internal enabled :-) 2016-01-12 11:39:43
nova-consoleauth node-4.test.domain.local internal enabled :-) 2016-01-12 11:39:40
nova-scheduler node-4.test.domain.local internal enabled :-) 2016-01-12 11:39:42
nova-conductor node-4.test.domain.local internal enabled :-) 2016-01-12 11:39:41
nova-cert node-4.test.domain.local internal enabled :-) 2016-01-12 11:39:40
2016-01-12 11:39:50.581 8299 DEBUG nova.servicegroup.drivers.db [req-841473b8-6817-4992-bfe3-cbf80f15147d - - - - -] Seems service is down. Last heartbeat was 2016-01-12 10:27:55. Elapsed time is 4315.581825 is_up /usr/lib/python2.7/dist-packages/nova/servicegroup/drivers/db.py:80
nova-compute node-1.test.domain.local nova enabled XXX 2016-01-12 10:27:55
nova-compute node-2.test.domain.local nova enabled :-) 2016-01-12 11:39:04
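The dead service can also be spotted mechanically: nova prints "XXX" in the State column when the last heartbeat is older than its service-down threshold (as the `Seems service is down` debug line above shows). A minimal sketch that flags such rows; the sample lines are copied from the listing above, and on a live controller you would pipe `nova-manage service list` instead:

```shell
# Two sample rows copied from the service listing above; on a controller,
# replace the sample with the live `nova-manage service list` output.
sample='nova-compute node-1.test.domain.local nova enabled XXX 2016-01-12 10:27:55
nova-compute node-2.test.domain.local nova enabled :-) 2016-01-12 11:39:04'

# Columns: Binary Host Zone Status State Updated_At.
# Flag every row whose State column is "XXX" (stale heartbeat).
printf '%s\n' "$sample" | awk '$5 == "XXX" { print $1, "on", $2, "is down, last heartbeat", $6, $7 }'
```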

ISO #417

Changed in mos:
importance: Undecided → High
milestone: none → 8.0
assignee: nobody → Peter Zhurba (pzhurba)
Ilya Kutukov (ikutukov)
Changed in mos:
status: New → Confirmed
Revision history for this message
Peter Zhurba (pzhurba) wrote :

The main issue is a network interface crash on the compute node running the fuel-master. The full log is attached; search for the marker "[ cut here ]".

<6>Jan 12 08:38:38 node-1 kernel: [ 3288.416028] br-fw-admin: port 2(vfm_enp0s3) entered forwarding state
<4>Jan 12 08:40:34 node-1 kernel: [ 3404.998725] ------------[ cut here ]------------
<4>Jan 12 08:40:34 node-1 kernel: [ 3404.998740] WARNING: CPU: 1 PID: 17331 at /build/linux-_xRakU/linux-3.13.0/net/core/dev.c:2228 skb_warn_bad_offload+0xcd/0xda()
<4>Jan 12 08:40:34 node-1 kernel: [ 3404.998743] e1000: caps=(0x0000000200014b89, 0x0000000000000000) len=9854 data_len=9826 gso_size=1480 gso_type=6 ip_summed=3
<4>Jan 12 08:40:34 node-1 kernel: [ 3404.998745] Modules linked in: vhost_net vhost macvtap macvlan xt_mac xt_physdev

........

Revision history for this message
Peter Zhurba (pzhurba) wrote :

It happens during the compute restart when libvirt performs an autostart of the guest.
It looks like a kernel, qemu, or test-environment related issue.

However, adding a delay before the VM start works around it.
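One way to add such a delay is to disable libvirt autostart for the guest (`virsh autostart --disable fuel_master`) and start it from a service that sleeps first. Below is a hypothetical systemd unit sketching this workaround; the unit name, the guest name fuel_master, and the 180-second delay are assumptions for illustration, not something taken from the report:

```ini
# /etc/systemd/system/fuel-master-guest.service (hypothetical unit name)
[Unit]
Description=Delayed start of the fuel_master guest (workaround sketch)
Requires=libvirtd.service
After=libvirtd.service network.target

[Service]
Type=oneshot
# The delay that avoids the interface crash; 180 s is an assumed value.
ExecStartPre=/bin/sleep 180
ExecStart=/usr/bin/virsh start fuel_master
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
```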

Revision history for this message
Ksenia Svechnikova (kdemina) wrote :

There is an open bug for nova-compute being stuck in the XXX state after a node reboot: https://bugs.launchpad.net/mos/+bug/1529810

A diagnostic snapshot for this issue is needed.

Revision history for this message
Peter Zhurba (pzhurba) wrote :

The script shown below was used to reproduce the bug.
#!/bin/bash

t_pass(){
# compute node IP
cm=10.109.0.12
# show the tail of the kernel log
ssh $cm dmesg | tail -n 20
# destroy the migrated fuel-master guest
ssh $cm virsh destroy fuel_master
# recreate the overlay image on top of the pristine fuel_master base image
ssh $cm qemu-img create -b /var/lib/nova/fuel_master.img -f qcow2 /var/lib/nova/fuel_master-b.img
# reboot the compute node
ssh $cm reboot
# wait for the bug; according to the logs it appears 2-3 minutes after boot
sleep 360
}

while true ; do t_pass; done

To check for occurrences, the messages log was grepped for "[ cut".

After upgrading the lab host (see the aptitude log), the reproduction frequency decreased: the bug appeared 2 times in 500 tries (see messages2.gzip).

Using virtio instead of e1000 helps 100% of the time. In this case many other kernel warnings disappeared as well.
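Switching the NIC model is a domain-XML edit. A minimal sketch, assuming the guest is named fuel_master: only the sed substitution is demonstrated on an inline sample stanza (an assumed shape of the `<interface>` element), while the virsh commands in the trailing comment show roughly how it would be applied on the host:

```shell
# Sample NIC stanza as libvirt might emit it (assumed shape); the real
# one comes from `virsh dumpxml fuel_master`.
nic="<interface type='bridge'><model type='e1000'/></interface>"

# Swap the emulated e1000 for the paravirtual virtio model.
printf '%s\n' "$nic" | sed "s/model type='e1000'/model type='virtio'/"

# On the host, roughly:
#   virsh dumpxml fuel_master > /tmp/fm.xml
#   sed -i "s/model type='e1000'/model type='virtio'/" /tmp/fm.xml
#   virsh define /tmp/fm.xml   # takes effect on the next guest start
```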

There is no possibility to reproduce the bug on hardware because e1000 is too old.

Also, there is not a single confirmed case on other test environments.

All of the above lets me consider the bug non-reproducible.

Revision history for this message
Ksenia Svechnikova (kdemina) wrote :

The behavior is not reproduced on ISO #442.
OSTF tests and the Network check finished successfully. I will mark the issue as Invalid. Please feel free to reopen it if it reproduces again.

Changed in mos:
status: Confirmed → Invalid