Network configuration changed during the deployment

Bug #1543535 reported by Timur Nurlygayanov
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
Fuel for OpenStack | Confirmed | Undecided | Timur Nurlygayanov |
Fuel for OpenStack 8.0.x | Confirmed | Medium | Fuel Python (Deprecated) |

Bug Description

This issue is not reproduced in 100% of cases, and it is probably a regression.
It is Critical because Fuel can sometimes randomly change the mapping of ETH interfaces to logical networks, and this will lead to failed deployments / failed network verification after the deployment.

This issue was found by MOS QA automated test suite (Neutron destructive suite)
Template which was used for deployment: https://github.com/Mirantis/mos-ci-deployment-scripts/blob/master/3_controllers_2compute_neutron_env_template.yaml
and https://github.com/Mirantis/mos-ci-deployment-scripts/blob/master/mos_tests.yaml

We used fuel-qa code from commit 87df82b0a7d204d2b43025f50ad232997bf0a35b (master branch)

Steps To Reproduce:
1. Run the Fuel master node and 5 slave nodes
2. Configure an environment with 3 controllers, 2 computes, Neutron VxLAN, and the L2pop and L3 HA features enabled
3. Check the network configuration and make sure that everything works fine
4. Deploy the environment
5. Run the network check once more after the deployment

Expected Result:
Deployment will pass, and the network connectivity checks will pass as well

Observed Result:
Deployment passed (in my case; depending on which interfaces are affected by the issue during the deployment, it can also fail the deployment)
Network verification fails with the error:
AssertionError: Task 'check_networks' has incorrect status. error != ready, 'Some untagged networks are assigned to the same physical interface. You should assign them to different physical interfaces. Affected:
"admin (PXE)", "storage" networks at node "slave-03_controller_cinder"'
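The failing validation boils down to a simple rule: two untagged networks (no VLAN id) must not share one physical interface. A simplified, hypothetical sketch of that check (the function name and data shape are my own, not Nailgun's actual code):

```python
# Hypothetical sketch of the kind of validation the 'check_networks' task
# performs: flag any NIC that carries more than one untagged network.
def find_untagged_conflicts(assignments):
    """assignments: {iface: [(network_name, vlan_id_or_None), ...]}.
    Returns {iface: [untagged network names]} for NICs with >1 untagged net."""
    conflicts = {}
    for iface, nets in assignments.items():
        untagged = [name for name, vlan in nets if vlan is None]
        if len(untagged) > 1:
            conflicts[iface] = untagged
    return conflicts

# The situation reported for slave-03: 'fuelweb_admin' ended up on enp0s7,
# which already carried the untagged 'storage' network.
slave_03 = {
    "enp0s3": [],
    "enp0s7": [("storage", None), ("fuelweb_admin", None)],
}
print(find_untagged_conflicts(slave_03))  # {'enp0s7': ['storage', 'fuelweb_admin']}
```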

Diagnostic snapshot and screenshots are attached.

Tags: area-python
description: updated
Changed in fuel:
milestone: none → 9.0
importance: Undecided → Critical
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Logs from jenkins:

<<< --------------------------------------[ FINISH Step 004. Create Fuel Environment STEP TOOK 0 min 7 sec ]-------------------------------------- >>>

ok
Add nodes to environment ... 2016-02-09 01:06:29,482 - INFO decorators.py:53 --
<<< -------------------------------------------------[ START Step 005. Add nodes to environment ]------------------------------------------------- >>>

2016-02-09 01:06:29,483 - INFO actions_base.py:186 -- Add nodes to env 1
2016-02-09 01:06:29,483 - INFO actions_base.py:200 -- Set roles ['controller', 'cinder'] to node slave-01
2016-02-09 01:06:29,483 - INFO actions_base.py:200 -- Set roles ['controller', 'cinder'] to node slave-02
2016-02-09 01:06:29,483 - INFO actions_base.py:200 -- Set roles ['controller', 'cinder'] to node slave-03
2016-02-09 01:06:29,483 - INFO actions_base.py:200 -- Set roles ['compute'] to node slave-04
2016-02-09 01:06:29,483 - INFO actions_base.py:200 -- Set roles ['compute'] to node slave-05
2016-02-09 01:06:31,023 - INFO fuel_web_client.py:1391 -- Assigned networks are: {'enp0s6': ['private'], 'enp0s7': ['storage'], 'enp0s4': ['public'], 'enp0s5': ['management'], 'enp0s3': ['fuelweb_admin']}
2016-02-09 01:06:31,600 - INFO decorators.py:61 --
<<< -------------------------------------[ FINISH Step 005. Add nodes to environment STEP TOOK 0 min 2 sec ]-------------------------------------- >>>

ok
Run network checker ... 2016-02-09 01:06:31,672 - INFO decorators.py:53 --
<<< ---------------------------------------------------[ START Step 006. Run network checker ]---------------------------------------------------- >>>

2016-02-09 01:06:31,673 - INFO fuel_web_client.py:1044 -- Run network verification on the cluster 1
2016-02-09 01:07:32,483 - INFO fuel_web_client.py:1358 -- Network verification of cluster 1 finished
2016-02-09 01:07:32,483 - INFO decorators.py:61 --
<<< ----------------------------------------[ FINISH Step 006. Run network checker STEP TOOK 1 min 1 sec ]---------------------------------------- >>>

ok
Deploy environment ... 2016-02-09 01:07:32,495 - INFO decorators.py:53 --
<<< ----------------------------------------------------[ START Step 007. Deploy environment ]---------------------------------------------------- >>>

2016-02-09 01:07:32,495 - INFO fuel_web_client.py:741 -- Deploy cluster 1
2016-02-09 01:07:32,495 - INFO fuel_web_client.py:793 -- Launch deployment of a cluster #1
2016-02-09 01:07:36,204 - INFO fuel_web_client.py:315 -- Assert task {u'status': u'pending', u'name': u'deploy', u'cluster': 1, u'result': {}, u'progress': 0, u'message': None, u'id': 5, u'uuid': u'cd2c26f2-dd11-43db-b4b2-2ff569ea8d42'} is success
2016-02-09 01:07:36,204 - INFO fuel_web_client.py:1104 -- Wait for task 10000 seconds:
 status pending
 name deploy
 cluster 1
 result {}
 progress 0
 message None
 id 5
 uuid cd2c26f2-dd11-43db-b4b2-2ff569ea8d42
2016-02-09 02:46:44,996 - INFO fuel_web_client.py:1121 -- Task finished. Took 5948.77952504 seconds.
 status ready
 name deploy
 cluster 1
 result {}
 progress 100
 message Provision of environment 'mos-tests' is done.
Deplo...

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Diagnostic snapshot:

Changed in fuel:
assignee: nobody → Dennis Dmitriev (ddmitriev)
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Screenshot of network interfaces for controller-3

Other controllers configured correctly.

description: updated
tags: added: area-library area-qa system-tests
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

The QA framework team is investigating the issue; it can be an issue in the fuel-qa code, but it can also be a regression in the fuel library / Nailgun / Astute.

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Timur, your snapshot is missing docker-logs, so could you please attach them?

Changed in fuel:
status: New → Incomplete
importance: Critical → Undecided
assignee: Dennis Dmitriev (ddmitriev) → Timur Nurlygayanov (tnurlygayanov)
Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

Looks like it is a product bug.

Here is the data that was submitted to Nailgun for the node slave-03 configuration (node id=4): http://paste.openstack.org/show/486390/

Here is the result after network verification: http://paste.openstack.org/show/486397/

Somehow, the network 'fuelweb_admin' was moved from the interface 'enp0s3' to the interface 'enp0s7'.
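This kind of silent drift can be spotted by diffing the mapping submitted to Nailgun against the mapping observed after deployment. An illustrative sketch (not Fuel code; the helper name is hypothetical), using the interface assignment logged at step 005:

```python
# Diff the submitted network-to-NIC mapping against the observed one
# to catch networks that were silently moved to another interface.
def diff_assignments(submitted, observed):
    """Both args: {network_name: iface}. Returns (net, old_iface, new_iface) moves."""
    return [(net, iface, observed[net])
            for net, iface in submitted.items()
            if net in observed and observed[net] != iface]

# Mapping as submitted (from the fuel_web_client.py:1391 log line).
submitted = {"fuelweb_admin": "enp0s3", "public": "enp0s4",
             "management": "enp0s5", "private": "enp0s6", "storage": "enp0s7"}
# The drift observed on slave-03 after deployment.
observed = dict(submitted, fuelweb_admin="enp0s7")
print(diff_assignments(submitted, observed))  # [('fuelweb_admin', 'enp0s3', 'enp0s7')]
```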

Thanks to @Tatyanka for pointing out the bug that was already filed for this issue: https://bugs.launchpad.net/fuel/+bug/1532823 and fixed in ISO #506.

But here we have a reproduction on ISO #541:
[root@nailgun ~]# cat /etc/fuel/version.yaml
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "541"
  build_id: "541"
  fuel-nailgun_sha: "baec8643ca624e52b37873f2dbd511c135d236d9"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "658be72c4b42d3e1436b86ac4567ab914bfb451b"
  fuel-nailgun-agent_sha: "b2bb466fd5bd92da614cdbd819d6999c510ebfb1"
  astute_sha: "b81577a5b7857c4be8748492bae1dec2fa89b446"
  fuel-library_sha: "e2d79330d5d708796330fac67722c21f85569b87"
  fuel-ostf_sha: "3bc76a63a9e7d195ff34eadc29552f4235fa6c52"
  fuel-mirror_sha: "fb45b80d7bee5899d931f926e5c9512e2b442749"
  fuelmenu_sha: "78ffc73065a9674b707c081d128cb7eea611474f"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "a43cf96cd9532f10794dce736350bf5bed350e9d"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "87dfb6bc25d4650264f09c338ed77c21a3d6fe87"

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Here are the logs from the docker-logs folder ^^

Dennis checked the logs and found that the fuel-qa code sent correct information about the network interface configuration.
This is an issue with Nailgun / Astute / the fuel library.

Status changed to Confirmed, priority changed to Critical: if the network interface mapping is changed randomly, deployments will fail. This issue was reproduced on a virtual lab with 6 VMs, and it can easily be reproduced on large deployments (many nodes means many chances to randomly change an interface on one of them).

A possible (bad) workaround is to redeploy all affected nodes, but then we cannot guarantee that users will not spend several days trying to deploy their environments.

Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

Please do not mark this bug as a duplicate of #1532823.

#1532823 is already fixed, and here we have a regression that should be fixed separately.

tags: added: area-python
removed: area-library area-qa system-tests
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Actually you are wrong; it is not a regression, it means that the previous issue was only partly fixed, so I am moving this to duplicate.

Changed in fuel:
status: Incomplete → Confirmed
