Network templates with bond can break OSTF

Bug #1513472 reported by Dmitry Ukov
This bug affects 1 person
Affects             Status     Importance  Assigned to        Milestone
Fuel for OpenStack  Invalid    High        Aleksandr Didenko
7.0.x               Confirmed  High        Aleksandr Didenko
8.0.x               Invalid    High        Aleksandr Didenko

Bug Description

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "7.0"
  openstack_version: "2015.1.0-7.0"
  api: "1.0"
  build_number: "301"
  build_id: "301"
  nailgun_sha: "4162b0c15adb425b37608c787944d1983f543aa8"
  python-fuelclient_sha: "486bde57cda1badb68f915f66c61b544108606f3"
  fuel-agent_sha: "50e90af6e3d560e9085ff71d2950cfbcca91af67"
  fuel-nailgun-agent_sha: "d7027952870a35db8dc52f185bb1158cdd3d1ebd"
  astute_sha: "6c5b73f93e24cc781c809db9159927655ced5012"
  fuel-library_sha: "5d50055aeca1dd0dc53b43825dc4c8f7780be9dd"
  fuel-ostf_sha: "2cd967dccd66cfc3a0abd6af9f31e5b4d150a11c"
  fuelmain_sha: "a65d453215edb0284a2e4761be7a156bb5627677"

Steps to reproduce
1. Create new environment (Neutron with VxLAN, use Ceph as backend)
2. Add 1 controller, 2 ceph-osd nodes
3. Create additional network group
    # fuel --env=32 network-group --create --node-group 32 --name "cluster" --release 2 --cidr "10.50.107.0/24"
4. Create and upload network template (see attachment)
    # fuel --env=32 network-template --upload --dir ./
5. Hit 'Deploy Changes' button
6. Wait until Puppet has configured the network on the controller node (the task called 'netconfig'). All bonds should be configured once netconfig.pp has finished.
7. Artificially put the environment into the Error state (e.g. kill the puppet process multiple times). In the real world this can happen for various reasons: network issues, deployment misconfiguration, HW issues.
8. Wait a minute or two (the nailgun agent should report node information to the nailgun API).
9. Hit the 'Deploy Changes' button one more time.
10. Wait until the environment becomes ready.
11. Execute the 'Request flavor list' OSTF test
     Expected result:
         - Test passes
     Actual result:
         - Test fails

Diagnostic Snapshot: https://drive.google.com/file/d/0B0kV2KAlVj3Na3dhV2VYVm1Xb0k/view?usp=sharing

The problem appears to be related to how node information is updated for nodes in the Error state. Once the bond is created, the nailgun agent reports 3 interfaces with the same MAC address. Nailgun treats this as the node having lost one interface. The network template does not update the node's interface info, and the public network is assigned to the 2nd interface by default. As a result, nailgun does not return information about the public network for the node. This information seems essential for OSTF.
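
To check whether an agent report really contains several interfaces with one MAC address, the reported (name, MAC) pairs can be grouped by MAC. This is a minimal standalone sketch, not nailgun code: the function name `shared_macs`, the interface names and the MAC addresses are all made up for illustration.

```python
from collections import defaultdict


def shared_macs(reported):
    """Return {mac: [names]} for every MAC reported by more than one interface."""
    by_mac = defaultdict(list)
    for name, mac in reported:
        # Normalise case so "AA:BB" and "aa:bb" count as the same address.
        by_mac[mac.lower()].append(name)
    return {mac: names for mac, names in by_mac.items() if len(names) > 1}


# Placeholder report: names and MACs are assumptions, not taken from the snapshot.
reported = [
    ("eth0", "52:54:00:aa:00:01"),
    ("eth1", "52:54:00:aa:00:02"),
    ("eth2", "52:54:00:aa:00:02"),
    ("bond0", "52:54:00:aa:00:02"),
    ("eth3", "52:54:00:aa:00:03"),
]

duplicates = shared_macs(reported)
print(duplicates)
```

Any code that treats the MAC address as a unique key for an interface would see only one of the three entries grouped here.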

[root@fuel ~]# cat 1.py
from nailgun import objects
import sys
n=objects.Node.get_by_uid(int(sys.argv[1]))
for i in n.interfaces:
  print i.name, i.assigned_networks

Before Deployment
[root@fuel ~]# python 1.py 94
2015-11-05 12:31:44.727 DEBUG [7f9d7089a700] (settings) Looking for settings.yaml package config using old style __file__
2015-11-05 12:31:44.727 DEBUG [7f9d7089a700] (settings) Trying to read config file /usr/lib/python2.6/site-packages/nailgun/settings.yaml
2015-11-05 12:31:45.031 DEBUG [7f9d7089a700] (settings) Trying to read config file /etc/nailgun/settings.yaml
2015-11-05 12:31:45.050 DEBUG [7f9d7089a700] (settings) Trying to read config file /etc/fuel/version.yaml
eth0 [{'id': 1, 'name': u'fuelweb_admin'}, {'id': 135, 'name': u'management'}, {'id': 136, 'name': u'storage'}, {'id': 137, 'name': u'private'}]
eth1 [{'id': 134, 'name': u'public'}]
eth2 []
eth3 []
eth4 []

After the node is put into the Error state
[root@fuel ~]# python 1.py 94
2015-11-05 12:32:47.772 DEBUG [7f557a065700] (settings) Looking for settings.yaml package config using old style __file__
2015-11-05 12:32:47.772 DEBUG [7f557a065700] (settings) Trying to read config file /usr/lib/python2.6/site-packages/nailgun/settings.yaml
2015-11-05 12:32:48.083 DEBUG [7f557a065700] (settings) Trying to read config file /etc/nailgun/settings.yaml
2015-11-05 12:32:48.107 DEBUG [7f557a065700] (settings) Trying to read config file /etc/fuel/version.yaml
eth0 [{'id': 1, 'name': u'fuelweb_admin'}, {'id': 135, 'name': u'management'}, {'id': 136, 'name': u'storage'}, {'id': 137, 'name': u'private'}]
eth2 []
eth3 []
eth4 []
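
The difference between the two listings can be made explicit with a small comparison. The sketch below is standalone and does not import nailgun; the dictionaries are copied by hand from the output above (network names only), and `lost_assignments` is a hypothetical helper name.

```python
def lost_assignments(before, after):
    """Return {interface: [networks]} for interfaces present before but absent after."""
    return {name: nets for name, nets in before.items() if name not in after}


# Interface -> assigned network names, transcribed from the two listings above.
before = {
    "eth0": ["fuelweb_admin", "management", "storage", "private"],
    "eth1": ["public"],
    "eth2": [],
    "eth3": [],
    "eth4": [],
}
after = {
    "eth0": ["fuelweb_admin", "management", "storage", "private"],
    "eth2": [],
    "eth3": [],
    "eth4": [],
}

lost = lost_assignments(before, after)
print(lost)
```

For this data the only lost entry is eth1, which carried the public network.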

Revision history for this message
Dmitry Ukov (dukov) wrote :
Dmitry Klenov (dklenov)
Changed in fuel:
milestone: none → 8.0
assignee: nobody → Fuel Library Team (fuel-library)
tags: added: area-library
Changed in fuel:
importance: Undecided → High
status: New → Confirmed
Aleksey Kasatkin (alekseyk-ru) wrote :

Currently, NIC info is locked when a node is deployed or under deployment. AFAICS, we need to lock NIC info when a node is in the error state as well.
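
A rough sketch of the proposed rule, assuming the lock is a simple check on the node status before agent data is applied. This is not the nailgun implementation: `LOCKED_STATUSES` and `may_update_interfaces` are hypothetical names, and the status strings are assumptions that should be checked against the release in use.

```python
# Assumed status names; verify them against the nailgun release in use.
LOCKED_STATUSES = frozenset(["deploying", "ready", "error"])


def may_update_interfaces(node_status):
    """Return True if agent-reported NIC data may overwrite the stored NIC data."""
    return node_status not in LOCKED_STATUSES


for status in ("discover", "deploying", "ready", "error"):
    print(status, may_update_interfaces(status))
```

With "error" in the locked set, a node that failed after netconfig would keep the interface list it had before the bond was created.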

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Fuel Python Team (fuel-python)
tags: added: area-python
removed: area-library
Dmitry Pyzhov (dpyzhov)
tags: added: feature
tags: added: feature-network-template
removed: feature
Dmitry Pyzhov (dpyzhov)
tags: added: team-network
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Aleksandr Didenko (adidenko)
tags: added: tricky
Aleksandr Didenko (adidenko) wrote :

Tried to reproduce on 8.0 - no luck.

First, I tried to simulate this without templates, using a bond, but could not reproduce the bug: the list of interfaces and networks on the node did not change, and everything worked fine.

Then I tried using the template (I had to update it a bit to make it compatible with 8.0 and the virtual lab, see attachment):

List of interfaces before error:
enp0s3 [{'id': 1, 'name': u'fuelweb_admin'}, {'id': 115, 'name': u'management'}, {'id': 116, 'name': u'storage'}, {'id': 117, 'name': u'private'}]
enp0s4 [{'id': 114, 'name': u'public'}]
enp0s5 [{'id': 118, 'name': u'cluster'}]
enp0s6 []
enp0s7 []

List of interfaces after error:
enp0s3 [{'id': 1, 'name': u'fuelweb_admin'}, {'id': 115, 'name': u'management'}, {'id': 116, 'name': u'storage'}, {'id': 117, 'name': u'private'}]
enp0s4 [{'id': 114, 'name': u'public'}]
enp0s5 [{'id': 118, 'name': u'cluster'}]
enp0s6 []
enp0s7 []

List of interfaces after second deploy:
enp0s3 [{'id': 1, 'name': u'fuelweb_admin'}, {'id': 115, 'name': u'management'}, {'id': 116, 'name': u'storage'}, {'id': 117, 'name': u'private'}]
enp0s4 [{'id': 114, 'name': u'public'}]
enp0s5 [{'id': 118, 'name': u'cluster'}]
enp0s6 []
enp0s7 []

I've also compared the configuration data (yamls uploaded by nailgun) on the node between those two deployment attempts. If the problem existed, those configurations should differ (a different network scheme or metadata, something). But they are the same:
http://paste.openstack.org/raw/482332/
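
One way to run such a comparison is a plain text diff of the two YAML dumps. The sketch below uses only the standard library; `config_diff` is a hypothetical helper, and the sample strings are made-up stand-ins for the real files, whose paths are not given in this report.

```python
import difflib


def config_diff(first, second):
    """Return unified-diff lines between two config dumps; an empty list means identical."""
    return list(
        difflib.unified_diff(
            first.splitlines(),
            second.splitlines(),
            fromfile="first_deploy",
            tofile="second_deploy",
            lineterm="",
        )
    )


# Made-up stand-ins for the YAML saved after each deployment attempt.
first_attempt = "network_scheme:\n  version: '1.1'\n"
second_attempt = "network_scheme:\n  version: '1.1'\n"

diff = config_diff(first_attempt, second_attempt)
print("identical" if not diff else "\n".join(diff))
```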

So I'm marking this as incomplete for 8.0.

Changed in fuel:
status: Confirmed → Incomplete
Changed in fuel:
milestone: 8.0 → 9.0
status: Incomplete → New
Changed in fuel:
status: New → Confirmed
status: Confirmed → Incomplete
Aleksandr Didenko (adidenko) wrote :

Marked it as invalid for 8.0. If someone is able to reproduce it on 8.0, please provide detailed instructions here and feel free to set the status to confirmed.

Dmitry Pyzhov (dpyzhov) wrote :

This bug was marked as Incomplete on the 18th of December and there has been no new information since then. Marking as Invalid.

Changed in fuel:
status: Incomplete → Invalid
no longer affects: fuel/future