Network verification has failed after successful HA deployment. Second controller has changed network configuration after deploy (not idempotent)

Bug #1532823 reported by Anastasia Palkina on 2016-01-11
This bug affects 4 people
Affects             Importance  Assigned to
Fuel for OpenStack  High        Ihor Kalnytskyi
7.0.x               High        Sergii Rizvan
8.0.x               High        Ihor Kalnytskyi
Mitaka              High        Ihor Kalnytskyi

Bug Description

1. Download and install detach plugins

https://github.com/openstack/fuel-plugin-detach-keystone
https://github.com/openstack/fuel-plugin-detach-database
https://github.com/openstack/fuel-plugin-detach-rabbitmq

2. Create a new environment with default settings
3. Add 3 controllers, 3 database+keystone+rabbitmq, 1 compute, 1 cinder
4. Start network verification. It was successful
5. Start deployment. It was successful
6. Start network verification again. It failed with the error:
Some untagged networks are assigned to the same physical interface. You should assign them to different physical interfaces. Affected: "admin (PXE)", "public" networks at node "Untitled (2d:35)"

I didn't change anything in the network configuration before deployment.
However, after deployment the network configuration of the second controller (node-7) had changed (see the attached screenshot).

Logs are here: https://drive.google.com/a/mirantis.com/file/d/0B6SjzarTGFxaX2ZHY0FhaEhfSU0/view?usp=sharing

Controllers: node-6,7,10
Compute: node-8
Cinder: node-9
Database+keystone+rabbitmq: node-3,4,5

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "417"
  build_id: "417"
  fuel-nailgun_sha: "9ebbaa0473effafa5adee40270da96acf9c7d58a"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "df16d41cd7a9445cf82ad9fd8f0d53824711fcd8"
  fuel-nailgun-agent_sha: "92ebd5ade6fab60897761bfa084aefc320bff246"
  astute_sha: "c7ca63a49216744e0bfdfff5cb527556aad2e2a5"
  fuel-library_sha: "7ef751bdc0e4601310e85b8bf713a62ed4aee305"
  fuel-ostf_sha: "214e794835acc7aa0c1c5de936e93696a90bb57a"
  fuel-mirror_sha: "b62f3cce5321fd570c6589bc2684eab994c3f3f2"
  fuelmenu_sha: "2a0def56276f0fc30fd949605eeefc43e5d7cc6c"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "9f0ba4577915ce1e77f5dc9c639a5ef66ca45896"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "cfeadd34d8d048deeabf0884931708b1d040b8a6"

Anastasia Palkina (apalkina) wrote :
summary: - Network verification has failed after successful HA deployment
+ Network verification has failed after successful HA deployment. Second
+ controller has changed network configuration after deploy
Ilya Kutukov (ikutukov) on 2016-01-11
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
status: New → Confirmed
tags: added: area-library
tags: added: life-cycle-management
summary: Network verification has failed after successful HA deployment. Second
- controller has changed network configuration after deploy
+ controller has changed network configuration after deploy (not
+ idempotent)
Anastasia Palkina (apalkina) wrote :

Bogdan, network verification was successful before deployment.
I added this information to the description of the bug.

description: updated
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Dmitry Bilunov (dbilunov)
tags: added: team-bugfix
Changed in fuel:
status: Confirmed → In Progress
Changed in fuel:
assignee: Dmitry Bilunov (dbilunov) → nobody
tags: added: team-network
removed: team-bugfix
tags: added: area-mos
removed: area-library
Dmitry Bilunov (dbilunov) wrote :

For some reason, the network configuration changed without any entries being created in the "action_logs" table.
I can confirm that this bug is reproducible.
The attached screenshot shows the UI, which fetches its data from /api/nodes/:cluster_id/interfaces, so this does not look like a UI bug.
For some unknown reason, the affected node has a different count of interfaces in "node_nic_interfaces" when grouped by (pxe, state).
Also, I don't see any HTTP requests that could have affected the network configuration of this node.

Dmitry Pyzhov (dpyzhov) on 2016-01-19
Changed in fuel:
status: In Progress → Confirmed
Maciej Relewicz (rlu) on 2016-01-19
Changed in fuel:
assignee: nobody → Fuel Python Team (fuel-python)
Dmitry Pyzhov (dpyzhov) on 2016-01-19
Changed in fuel:
milestone: 8.0 → 9.0
Dmitry Pyzhov (dpyzhov) on 2016-01-21
tags: added: area-python
removed: area-mos
Tatyanka (tatyana-leontovich) wrote :

One more occurrence: https://172.18.160.103/job/8.0.system_test.ubuntu.ha_destructive_ceph_neutron/114/
iso 446
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "466"
  build_id: "466"
  fuel-nailgun_sha: "f81311bbd6fee2665e3f96dcac55f72889b2f38c"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "6823f1d4005a634b8436109ab741a2194e2d32e0"
  fuel-nailgun-agent_sha: "b2bb466fd5bd92da614cdbd819d6999c510ebfb1"
  astute_sha: "b81577a5b7857c4be8748492bae1dec2fa89b446"
  fuel-library_sha: "fe03d887361eb80232e9914eae5b8d54304df781"
  fuel-ostf_sha: "ab5fd151fc6c1aa0b35bc2023631b1f4836ecd61"
  fuel-mirror_sha: "b62f3cce5321fd570c6589bc2684eab994c3f3f2"
  fuelmenu_sha: "fac143f4dfa75785758e72afbdc029693e94ff2b"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "9f0ba4577915ce1e77f5dc9c639a5ef66ca45896"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "727f7076f04cb0caccc9f305b149a2b5b5c2af3a"

Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Aleksey Kasatkin (alekseyk-ru)
Aleksey Kasatkin (alekseyk-ru) wrote :

For some reason, at one point Nailgun received a wrong IP and PXE interface from the agent:

2016-01-22 01:54:29.549 DEBUG [7f6c93c44880] (logger) Request PUT /api/nodes/agent/ from 10.109.12.9:42529 {"mac":"64:14:58:D0:F6:3E","ip":"10.109.13.5",

2016-01-22 01:54:29.718 WARNING [7f6c93c44880] (manager) PXE interface info is not consistent for node "slave-02_controller (id=1, mac=64:14:58:d0:f6:3e)"

After that, the info from nailgun-agent did not contain an IP address for that particular NIC (as it had before the wrong IP was received), only the IP of the node itself (which had the right value).

BTW, the pxe flag is never calculated correctly by nailgun-agent (it is always false for all interfaces).

So:
1. There is a problem with network info acquisition in nailgun-agent or at an earlier stage.
2. Nailgun could use a more reliable (but more sophisticated) algorithm for determining the PXE interface.
3. The wrong info started to arrive after deployment had started, so normally it should have been ignored. The node was probably in the 'provisioned' status. We could prohibit network info changes for this status as well.
4. The receiverd log was not found. It is probably not saved into the snapshot and must be added.

Changed in fuel:
status: Confirmed → Triaged
Aleksey Kasatkin (alekseyk-ru) wrote :

Error message on node:

Node 'node-1' has IP '10.109.13.5' that does not match its own Admin network '10.109.12.0/24'
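
The condition behind this message can be illustrated with Python's standard `ipaddress` module. This is only a sketch using the values from the message above; the function name `ip_in_admin_network` is hypothetical and is not taken from Nailgun's code.

```python
import ipaddress


def ip_in_admin_network(node_ip, admin_cidr):
    """Return True if node_ip falls inside the Admin (PXE) network."""
    return ipaddress.ip_address(node_ip) in ipaddress.ip_network(admin_cidr)


# Values taken from the error message above.
print(ip_in_admin_network("10.109.13.5", "10.109.12.0/24"))  # False

# A hypothetical address inside the Admin network, for comparison.
print(ip_in_admin_network("10.109.12.5", "10.109.12.0/24"))  # True
```

The reported address 10.109.13.5 is one /24 away from the Admin network, which is consistent with the agent having picked up the IP of a different interface.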

Aleksey Kasatkin (alekseyk-ru) wrote :

It seems the node should have been in the 'deploying' state:

2016-01-22 01:53:13 INFO [779] Casting message to Nailgun:
{"method"=>"deploy_resp",
 "args"=>
  {"task_uuid"=>"8979961f-37cc-480b-a4f1-8ca363d18719",
   "nodes"=>
    [{"uid"=>"1",
      "progress"=>0,
      "status"=>"deploying",
      "role"=>"primary-controller",
      "task"=>
       {"priority"=>700,
        "type"=>"puppet",
        "id"=>"netconfig",
        "parameters"=>
         {"puppet_modules"=>"/etc/puppet/modules",
          "puppet_manifest"=>
           "/etc/puppet/modules/osnailyfacter/modular/netconfig/netconfig.pp",
          "timeout"=>3600,
          "cwd"=>"/"},
        "uids"=>["1"]}}]}}

But no receiverd log is present. Let's add it to the snapshot.

Fix proposed to branch: master
Review: https://review.openstack.org/273574

Changed in fuel:
status: Triaged → In Progress

Reviewed: https://review.openstack.org/273574
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=c467fedf9d6907493f60d82043116a5fa36a8796
Submitter: Jenkins
Branch: master

commit c467fedf9d6907493f60d82043116a5fa36a8796
Author: Aleksey Kasatkin <email address hidden>
Date: Thu Jan 28 16:28:35 2016 +0200

    Deny changing of interfaces when node status is 'provisioned'

    It can help in situation when the first deployment task is started
    (task may affect network interfaces) but the report from astute about
    starting of deployment for that node is not received by receiverd yet.

    Change-Id: I47c58ffa54856cb3dd969de08b703fb1bb73973b
    Partial-Bug: #1532823

Ihor Kalnytskyi (ikalnytskyi) wrote :

Since there's nothing more we can do without the receiver logs (and we don't have them in the snapshot), I'm moving this to Fix Committed. It looks like this could be the only cause of such behaviour.

Reviewed: https://review.openstack.org/274677
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=4bf1d56c5f8a105ecd9bec046156328a4a2036f2
Submitter: Jenkins
Branch: stable/8.0

commit 4bf1d56c5f8a105ecd9bec046156328a4a2036f2
Author: Aleksey Kasatkin <email address hidden>
Date: Thu Jan 28 16:28:35 2016 +0200

    Deny changing of interfaces when node status is 'provisioned'

    It can help in situation when the first deployment task is started
    (task may affect network interfaces) but the report from astute about
    starting of deployment for that node is not received by receiverd yet.

    Change-Id: I47c58ffa54856cb3dd969de08b703fb1bb73973b
    Partial-Bug: #1532823
    (cherry picked from commit c467fedf9d6907493f60d82043116a5fa36a8796)

Reviewed: https://review.openstack.org/274681
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=ea3431c90c5ac2e354bf2f383a8a5f17b86da6f0
Submitter: Jenkins
Branch: master

commit ea3431c90c5ac2e354bf2f383a8a5f17b86da6f0
Author: Julia Aranovich <email address hidden>
Date: Mon Feb 1 17:22:02 2016 +0300

    Forbid interfaces configuration for provisioned nodes

    Related-Bug: #1532823

    Change-Id: I6776ec130a93774bc4dc4abd2c2cd9cc98e46d97

Reviewed: https://review.openstack.org/274685
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=4cbc83c7e86bc3636e386b7974f0b48af7d5bcba
Submitter: Jenkins
Branch: stable/8.0

commit 4cbc83c7e86bc3636e386b7974f0b48af7d5bcba
Author: Julia Aranovich <email address hidden>
Date: Mon Feb 1 17:25:13 2016 +0300

    Forbid interfaces configuration for provisioned nodes

    Related-Bug: #1532823

    Change-Id: I6776ec130a93774bc4dc4abd2c2cd9cc98e46d97

Artem Panchenko (apanchenko-8) wrote :

verified

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "506"
  build_id: "506"
  fuel-nailgun_sha: "8e954abd70ef0083109f34289de2553dcda544d4"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "658be72c4b42d3e1436b86ac4567ab914bfb451b"
  fuel-nailgun-agent_sha: "b2bb466fd5bd92da614cdbd819d6999c510ebfb1"
  astute_sha: "b81577a5b7857c4be8748492bae1dec2fa89b446"
  fuel-library_sha: "ec7e212972ead554f21b52b9e165156665f659df"
  fuel-ostf_sha: "ab5fd151fc6c1aa0b35bc2023631b1f4836ecd61"
  fuel-mirror_sha: "351d568fa3b3e4dd062054b91d766aa54d379867"
  fuelmenu_sha: "234cb4cbb30fbd2df00f388c28f31606d9cae15f"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "a43cf96cd9532f10794dce736350bf5bed350e9d"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "94507c5e4dad6d8cfbd8f5d41aa8389d5335990a"

Tatyanka (tatyana-leontovich) wrote :

Looks like the issue was reproduced by MOS-QA on ISO 541:
https://bugs.launchpad.net/fuel/+bug/1543535

But it does not look like High priority, since it reproduced only once across all our tests (acceptance + swarm), and only in the network checker that runs after deployment.

Tatyana, if network interfaces are configured incorrectly, it means that the whole deployment is incorrect.
The root of the issue is in Nailgun / Astute.

It can fail customer deployments.

Please see my detailed comments in https://bugs.launchpad.net/fuel/+bug/1543535.
It looks like this issue was fixed only in the Fuel UI, but the root of the issue is in Fuel Astute / Nailgun.

Status changed to Confirmed for the 8.0 and 9.0 releases because it was reproduced on the latest build #541:
https://bugs.launchpad.net/fuel/+bug/1543535

tags: added: release-notes
Dmitry Pyzhov (dpyzhov) on 2016-02-09
no longer affects: fuel/mitaka
Dmitry Pyzhov (dpyzhov) on 2016-02-11
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Fuel Python Team (fuel-python)
Aleksandr Didenko (adidenko) wrote :

> Tatyana, if network interfaces configured incorrectly it means that whole deployment is incorrect.

I have not seen this in the diagnostic snapshots. Could you please provide the exact snapshot in which the deployment failed? I can see correct info in the serialized data (astute.yaml) and then broken info in the Nailgun DB after deployment. So it did not break the deployment itself, only network verification.

Aleksandr, could you please check snapshots and logs which were attached to https://bugs.launchpad.net/fuel/+bug/1543535 ?

Please let me know if I need to reproduce the issue and provide access to the environment. It is not easy to reproduce.

Thank you!

Aleksandr Didenko (adidenko) wrote :

Timur, yes, I did. Here's what is stored in network transformation in astute.yaml for the problem node-4:
http://paste.openstack.org/show/486682/
As you can see, the networks are assigned correctly, so the deployment is correct.

On the other hand, if we hit this bug on an environment and then try to add or remove nodes, this could possibly break the deployment, because the DB holds wrong network assignments. So I'll raise the priority of this bug to Critical.

Timur, access to the environment would be really helpful; I did not manage to reproduce this bug to troubleshoot it on a live env.

Changed in fuel:
importance: High → Critical
Eugene Bogdanov (ebogdanov) wrote :

Critical bugs must be addressed in the initial release. So, changing milestone to 8.0.

Aleksey Kasatkin (alekseyk-ru) wrote :

After looking into the docker logs: it is the same scenario as on 2016.01.27. We have receiverd.log now, and it says that the node status should have been set to 'deploying', so Nailgun should not have accepted that erroneous info. But another problem is that the log level is set to INFO in our tests, so we cannot see all messages. The picture is still not clear. It seems I can only add more output to the logs and wait for the next occurrence of this issue.
I.e., it is not clear whether the node status was actually set to 'deploying', or why the right PXE interface was not restored after that wrong message.

Mike Scherbakov (mihgen) wrote :

I assume that we cannot reproduce this issue manually, and that it does not reproduce on CI in 100% of runs. So I'd suggest downgrading this issue to High priority, getting better analysis/troubleshooting with all the necessary logs, and fitting it into 9.0 (with a possible fix in 8.0 MU1).

Aleksey Kasatkin (alekseyk-ru) wrote :

2016-02-08 23:01:07.722 INFO [7f3969b2e880] (notification) Notification: topic: error message: Node 'node-4' has IP '10.109.9.5' that does not match its own Admin network '10.109.8.0/24'
2016-02-08 23:01:07.959 WARNING [7f3969b2e880] (manager) PXE interface info is not consistent for node "slave-03_controller_cinder (id=4, mac=64:e7:e3:33:43:c3)"

Aleksandr Didenko (adidenko) wrote :

We can reproduce this issue with a 100% success rate. The problem is in this method: https://github.com/openstack/fuel-nailgun-agent/blob/18289c69ffcb9a34208e51dd6a30de5178928e05/agent#L726-L743

How to reproduce:

1. Go to any operational environment

2. Pick a node, download network info:
fuel node --network --node 1 --download
cp /root/node_1/interfaces.yaml /root/node_1/interfaces.yaml.orig

3. Log in to that node, edit /usr/bin/nailgun-agent and replace this line https://github.com/openstack/fuel-nailgun-agent/blob/18289c69ffcb9a34208e51dd6a30de5178928e05/agent#L737 with some hardcoded value, like:

            return {:ip => '10.110.2.4', :mac => '64:62:6c:0f:31:41'}

4. Download network info again and compare:
fuel node --network --node 1 --download
diff -U3 /root/node_1/interfaces.yaml.orig /root/node_1/interfaces.yaml

You can also find this log pattern in docker-logs/nailgun/app.log in such a case:
2016-02-08 23:01:07.722 INFO [7f3969b2e880] (notification) Notification: topic: error message: Node 'node-4' has IP '10.109.9.5' that does not match its own Admin network '10.109.8.0/24'
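
To look for this pattern in a collected app.log, something like the following Python sketch could be used. The regular expression is an assumption derived from the single sample line above, and `find_admin_ip_mismatches` is a hypothetical helper name.

```python
import re

# Pattern for the Nailgun notification shown above; the regex is an
# assumption based on that one sample line.
MISMATCH_RE = re.compile(
    r"Node '(?P<node>[^']+)' has IP '(?P<ip>[^']+)' "
    r"that does not match its own Admin network '(?P<network>[^']+)'"
)


def find_admin_ip_mismatches(lines):
    """Yield (node, ip, network) for each matching log line."""
    for line in lines:
        match = MISMATCH_RE.search(line)
        if match:
            yield match.group("node"), match.group("ip"), match.group("network")


sample = [
    "2016-02-08 23:01:07.722 INFO [7f3969b2e880] (notification) Notification: "
    "topic: error message: Node 'node-4' has IP '10.109.9.5' that does not "
    "match its own Admin network '10.109.8.0/24'",
]
print(list(find_admin_ip_mismatches(sample)))
# [('node-4', '10.109.9.5', '10.109.8.0/24')]
```

Any iterable of lines works, so an open file handle for docker-logs/nailgun/app.log from a diagnostic snapshot can be passed in place of `sample`.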

Now we need to understand why nailgun-agent reports wrong info and fix it.

Aleksandr Didenko (adidenko) wrote :

OK, so here's what's going on:

1) During the netconfig.pp task, nailgun-agent starts collecting info about the system. This may take some time (around 10-40 seconds).

2) Netconfig reconfigures NICs/IPs; in particular, it moves the admin IP from enp0s3 to the br-fw-admin bridge.

3) If nailgun-agent calls the _master_ip_and_mac() function at exactly the moment when the admin IP is already down on enp0s3 and not yet up on br-fw-admin, it won't be able to find the admin interface and will fall back to the ohai_system_info() defaults:
https://github.com/openstack/fuel-nailgun-agent/blob/76f48ff6c6a3996a7800a34cd97c5bfd4539107f/agent#L775-L778

4) nailgun-agent then sends the wrong MAC and IP to Nailgun. Because of the delay in nailgun-agent's work, br-fw-admin is already configured by that point, so the agent is able to connect to the master node and send the wrong info.

I suggest failing the nailgun-agent run if it can't find the admin MAC and IP. That is much safer than sending a random MAC/IP to Nailgun as the node's new main MAC/IP. If we simply fail, nailgun-agent will be able to collect correct info during the very next run.
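
The fail-instead-of-guess idea can be sketched in Python as follows. This is not the nailgun-agent code (the agent is written in Ruby): the lookup criterion (membership in the Admin network), the exception name, the interface dictionaries, and the MAC addresses are all assumptions made for illustration.

```python
import ipaddress


class AdminInterfaceNotFound(Exception):
    """Raised when no interface currently holds an Admin network address."""


def master_ip_and_mac(interfaces, admin_cidr):
    """Return the admin IP and MAC, or fail instead of guessing.

    `interfaces` is a list of dicts with "name", "mac" and "ip" keys
    ("ip" is None while an address is being moved between devices).
    """
    admin_net = ipaddress.ip_network(admin_cidr)
    for iface in interfaces:
        ip = iface.get("ip")
        if ip and ipaddress.ip_address(ip) in admin_net:
            return {"ip": ip, "mac": iface["mac"]}
    # No fallback to system defaults: skip this report and retry next run.
    raise AdminInterfaceNotFound("admin IP not found on any interface")


ADMIN_CIDR = "10.109.12.0/24"

# Steady state: the admin IP sits on the bridge (hypothetical MACs).
steady = [
    {"name": "br-fw-admin", "mac": "52:54:00:00:00:01", "ip": "10.109.12.9"},
    {"name": "enp0s4", "mac": "52:54:00:00:00:02", "ip": "10.109.13.5"},
]
print(master_ip_and_mac(steady, ADMIN_CIDR))

# Mid-reconfiguration: the admin IP is on neither enp0s3 nor br-fw-admin.
in_flight = [
    {"name": "enp0s3", "mac": "52:54:00:00:00:01", "ip": None},
    {"name": "br-fw-admin", "mac": "52:54:00:00:00:01", "ip": None},
    {"name": "enp0s4", "mac": "52:54:00:00:00:02", "ip": "10.109.13.5"},
]
try:
    master_ip_and_mac(in_flight, ADMIN_CIDR)
except AdminInterfaceNotFound as exc:
    print("skipping report:", exc)
```

Note that the proposed agent-side change (review 279620) was later abandoned in this thread because it would break multi-rack node bootstrap, so the sketch only illustrates the suggestion, not the final fix.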

Fix proposed to branch: master
Review: https://review.openstack.org/279620

Changed in fuel:
status: Confirmed → In Progress
Dmitry Pyzhov (dpyzhov) wrote :

I think the fix can produce a regression, and it is risky to merge it into 8.0. Also, we agreed that this bug is a rare case and can be reduced to High. Let's merge it into 9.0 and backport it in an MU later.

Aleksandr Didenko (adidenko) wrote :

One more thing to take into account: if nailgun-agent reports a random master MAC/IP after deployment (during a network restart, for example), this may change the node status from "ready" to "error" and then to "discover".

Change abandoned by Aleksandr Didenko (<email address hidden>) on branch: master
Review: https://review.openstack.org/279620
Reason: Actually, this fix will break multi-rack nodes bootstrap. We need to sort this out on nailgun side.

Reviewed: https://review.openstack.org/280300
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=5925c52524cf4b1a540c021d96ef3e521bc71228
Submitter: Jenkins
Branch: master

commit 5925c52524cf4b1a540c021d96ef3e521bc71228
Author: Igor Kalnitsky <email address hidden>
Date: Mon Feb 15 16:54:16 2016 +0200

    Accept interfaces changes only in discovery/error

    By our design we accept changes to node interfaces only if node in
    DISCOVERY and ERROR states. There are few reasons why we're doing so,
    including one that IP addresses are assigned to bridges, not to NICs
    and we ignore it in order to be able to redeploy nodes.

    However, there's one case when node status is *artificially* changing
    to ERROR state, implicitly allowing to accept interfaces changes. The
    case is:

    * Admin (PXE) interface is down on a node.
    * nailgun-agent sends wrong node's IP address that will be used to
      check whether it belongs to Admin (PXE) network or not.
    * If it doesn't belong, change node status to ERROR.
    * Since the status is ERROR, changes to interfaces are allowed.
    * Move admin network to another interface, which is wrong since
      physical connection wasn't changed.

    The commit introduces additional condition when the check for belonging
    to Admin (PXE) network is allowed.

    Change-Id: I17e87e27d846921d6f0da535b9446e716449db95
    Closes-Bug: #1532823
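
The chain of events listed in the commit message above can be simulated with a small Python sketch. The status names follow this thread ("ready", "error", "discover"); the guard condition is an assumption, because the exact condition added by the commit is not shown in this report, and `handle_agent_report` and the address 10.109.8.5 are hypothetical.

```python
import ipaddress

# Statuses in which agent-reported interface changes are accepted,
# per the commit message above.
INTERFACE_UPDATE_STATUSES = {"discover", "error"}


def handle_agent_report(node, reported_ip, admin_cidr, guard_admin_check):
    """Simulate one agent report and return the node's resulting state."""
    in_admin_net = (
        ipaddress.ip_address(reported_ip) in ipaddress.ip_network(admin_cidr)
    )
    # Assumed guard: run the Admin-network check only for nodes whose
    # interfaces may legitimately change. The real condition added by
    # the commit is not shown in this report.
    check_allowed = (
        not guard_admin_check or node["status"] in INTERFACE_UPDATE_STATUSES
    )
    if check_allowed and not in_admin_net:
        node["status"] = "error"
    if node["status"] in INTERFACE_UPDATE_STATUSES:
        node["admin_ip"] = reported_ip  # interface data is rewritten
    return node


# Without the guard, a deployed node that reports a wrong IP is flipped
# to "error", which then lets the wrong interface data through.
unguarded = handle_agent_report(
    {"status": "ready", "admin_ip": "10.109.8.5"},
    "10.109.9.5", "10.109.8.0/24", guard_admin_check=False,
)
print(unguarded)  # {'status': 'error', 'admin_ip': '10.109.9.5'}

# With the guard, the same report leaves the node untouched.
guarded = handle_agent_report(
    {"status": "ready", "admin_ip": "10.109.8.5"},
    "10.109.9.5", "10.109.8.0/24", guard_admin_check=True,
)
print(guarded)  # {'status': 'ready', 'admin_ip': '10.109.8.5'}
```

The point of the sketch is that the status check and the interface update share the same gate, so any code path that sets "error" as a side effect implicitly reopens the gate.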

Changed in fuel:
status: In Progress → Fix Committed
tags: added: 8.0 release-notes-done
removed: release-notes
Aleksandr Didenko (adidenko) wrote :

Nominated this bug for 7.0 since we have a duplicate bug for 7.0:
https://bugs.launchpad.net/fuel/7.0.x/+bug/1513472

tags: added: on-verification
tags: removed: on-verification
tags: added: on-verification

Verified on fuel-9.0-mos-487:
Connectivity check was successful after deploying a cluster with the configuration from the description.

[root@nailgun ~]# shotgun2 short-report
cat /etc/fuel_build_id:
 487
cat /etc/fuel_build_number:
 487
cat /etc/fuel_release:
 9.0
cat /etc/fuel_openstack_version:
 mitaka-9.0
rpm -qa | egrep 'fuel|astute|network-checker|nailgun|packetary|shotgun':
 fuel-release-9.0.0-1.mos6349.noarch
 fuel-misc-9.0.0-1.mos8459.noarch
 python-packetary-9.0.0-1.mos140.noarch
 fuel-bootstrap-cli-9.0.0-1.mos285.noarch
 fuel-migrate-9.0.0-1.mos8459.noarch
 shotgun-9.0.0-1.mos90.noarch
 fuel-notify-9.0.0-1.mos8459.noarch
 nailgun-mcagents-9.0.0-1.mos750.noarch
 python-fuelclient-9.0.0-1.mos325.noarch
 fuel-9.0.0-1.mos6349.noarch
 fuel-utils-9.0.0-1.mos8459.noarch
 fuel-setup-9.0.0-1.mos6349.noarch
 fuel-provisioning-scripts-9.0.0-1.mos8742.noarch
 fuel-library9.0-9.0.0-1.mos8459.noarch
 network-checker-9.0.0-1.mos74.x86_64
 fuel-agent-9.0.0-1.mos285.noarch
 fuel-ui-9.0.0-1.mos2717.noarch
 fuel-ostf-9.0.0-1.mos936.noarch
 fuelmenu-9.0.0-1.mos274.noarch
 fuel-nailgun-9.0.0-1.mos8742.noarch
 rubygem-astute-9.0.0-1.mos750.noarch
 fuel-mirror-9.0.0-1.mos140.noarch
 fuel-openstack-metadata-9.0.0-1.mos8742.noarch

tags: removed: on-verification
Sergii Rizvan (srizvan) wrote :

Tried to reproduce the bug on 7.0 with no luck. That's why I'm about to set the status for 7.0 to Invalid.

Sergii Rizvan (srizvan) wrote :

For 8.0 the situation is the same as for 7.0. It seems the bug was fixed by the respective changes in Nailgun. That's why I'm closing the bug for 8.0 as Invalid.

Hi Sergii, this issue is hard to reproduce, but OK, let's leave it in Invalid status until someone reproduces the issue on the previous releases as well.

Change abandoned by Igor Kalnitsky (<email address hidden>) on branch: stable/8.0
Review: https://review.openstack.org/313694
Reason: no one is interest
