Deploy with bonded admin interfaces failed: network is unreachable for nodes that are routed through the master node

Bug #1492147 reported by Dennis Dmitriev
This bug affects 5 people
Affects             Status        Importance  Assigned to                Milestone
Fuel for OpenStack  Fix Released  Critical    Vladimir Kuklin
7.0.x               Fix Released  Critical    Fuel Library (Deprecated)
8.0.x               Fix Released  Critical    Vladimir Kuklin

Bug Description

The issue has been present since at least ISO#246 (https://product-ci.infra.mirantis.net/view/7.0_swarm/job/7.0.system_test.ubuntu.bonding_ha/73/)

Reproduced on CI: https://product-ci.infra.mirantis.net/view/7.0_swarm/job/7.0.system_test.ubuntu.bonding_ha/81/

Scenario:
            1. Create cluster with active-backup bonding and Neutron VXLAN
            2. Add 3 nodes with controller role
            3. Add 1 node with compute role
            4. Add 1 node with cinder role
            5. Setup bonding for all interfaces (including admin interface
               bonding)
            6. Run network verification
            7. Deploy the cluster

Expected result: Deploy passed.

Actual result: Network verification from step 6 passed, but the deploy failed with the following results:

==============================
node-5 2015-09-04T01:12:34.107224 notice: (Scope(Class[main])) MODULAR: connectivity_tests.pp
node-5 2015-09-04T01:12:34.239737 err: ERROR: Unable to fetch url 'http://mirror.seed-cz1.fuel-infra.org/pkgs/ubuntu-2015-09-02-170157/', error 'Network is unreachable - connect(2)'. Please verify node connectivity to this URL, or remove it from the settings page if it is invalid. on node node-5.test.domain.local
...
node-5 2015-09-04T01:13:00.260174 notice: (Scope(Class[main])) MODULAR: configure_default_route.pp
...
node-5 2015-09-04 01:14:52 +0000 Service[cinder-volume](provider=upstart) (debug): Could not find cinder-volume.conf in /etc/init
node-5 2015-09-04 01:14:52 +0000 Service[cinder-volume](provider=upstart) (debug): Could not find cinder-volume.conf in /etc/init.d
node-5 2015-09-04 01:14:52 +0000 Service[cinder-volume](provider=upstart) (debug): Could not find cinder-volume in /etc/init
node-5 2015-09-04 01:14:52 +0000 Service[cinder-volume](provider=upstart) (debug): Could not find cinder-volume in /etc/init.d
node-5 2015-09-04 01:14:52 +0000 Service[cinder-volume](provider=upstart) (debug): Could not find cinder-volume.sh in /etc/init
node-5 2015-09-04 01:14:52 +0000 Service[cinder-volume](provider=upstart) (debug): Could not find cinder-volume.sh in /etc/init.d
node-5 2015-09-04 01:14:52 +0000 /Stage[main]/Main/Service[cinder-volume] (err): Could not evaluate: Could not find init script or upstart conf file for 'cinder-volume'
===============================

There is no package cinder-volume on node-5:

===============================
root@node-5:~# apt-cache policy cinder-volume
cinder-volume:
  Installed: (none)
  Candidate: 1:2015.1.1-1~u14.04+mos3052
  Version table:
     1:2015.1.1-1~u14.04+mos3052 0
       1050 http://10.109.45.2:8080/2015.1.0-7.0/ubuntu/x86_64/ mos7.0/main amd64 Packages
     1:2014.1.5-0ubuntu1 0
       1001 http://mirror.seed-cz1.fuel-infra.org/pkgs/ubuntu-2015-09-02-170157/ trusty-updates/main amd64 Packages
     1:2014.1.3-0ubuntu1.1 0
       1001 http://mirror.seed-cz1.fuel-infra.org/pkgs/ubuntu-2015-09-02-170157/ trusty-security/main amd64 Packages
     1:2014.1-0ubuntu1 0
       1001 http://mirror.seed-cz1.fuel-infra.org/pkgs/ubuntu-2015-09-02-170157/ trusty/main amd64 Packages
root@node-5:~#
===============================

Repositories are available on the environment (after configure_default_route.pp was executed during deployment):
http://paste.openstack.org/show/445134/

[root@nailgun ~]# fuel --fuel-version
DEPRECATION WARNING: /etc/fuel/client/config.yaml exists and will be used as the source for settings. This behavior is deprecated. Please specify the path to your custom settings file in the FUELCLIENT_CUSTOM_SETTINGS environment variable.
api: '1.0'
astute_sha: ad6d59812b775bc12e7bd7aec8f81374595ffa63
auth_required: true
build_id: '268'
build_number: '268'
feature_groups:
- mirantis
fuel-agent_sha: 082a47bf014002e515001be05f99040437281a2d
fuel-library_sha: f3780484874f5f4a1831714710ff552f33522915
fuel-nailgun-agent_sha: d7027952870a35db8dc52f185bb1158cdd3d1ebd
fuel-ostf_sha: 582a81ccaa1e439a3aec4b8b8f6994735de840f4
fuelmain_sha: 9ab01caf960013dc882825dc9b0e11ccf0b81cb0
nailgun_sha: f882c428db97ee1eb93a4871f9d5857c5a7771b2
openstack_version: 2015.1.0-7.0
production: docker
python-fuelclient_sha: 9643fa07f1290071511066804f962f62fe27b512
release: '7.0'
release_versions:
  2015.1.0-7.0:
    VERSION:
      api: '1.0'
      astute_sha: ad6d59812b775bc12e7bd7aec8f81374595ffa63
      build_id: '268'
      build_number: '268'
      feature_groups:
      - mirantis
      fuel-agent_sha: 082a47bf014002e515001be05f99040437281a2d
      fuel-library_sha: f3780484874f5f4a1831714710ff552f33522915
      fuel-nailgun-agent_sha: d7027952870a35db8dc52f185bb1158cdd3d1ebd
      fuel-ostf_sha: 582a81ccaa1e439a3aec4b8b8f6994735de840f4
      fuelmain_sha: 9ab01caf960013dc882825dc9b0e11ccf0b81cb0
      nailgun_sha: f882c428db97ee1eb93a4871f9d5857c5a7771b2
      openstack_version: 2015.1.0-7.0
      production: docker
      python-fuelclient_sha: 9643fa07f1290071511066804f962f62fe27b512
      release: '7.0'

Dennis Dmitriev (ddmitriev) wrote :
description: updated
Dennis Dmitriev (ddmitriev) wrote :

A similar issue (post-deployment hooks are executed even if the deployment fails) was fixed in https://bugs.launchpad.net/fuel/+bug/1422834 , but only for 'critical' nodes.

Here, the node that failed has the 'cinder' role.

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Fuel Python Team (fuel-python)
description: updated
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Vladimir Kuklin (vkuklin)
Vladimir Kuklin (vkuklin) wrote :

According to the logs, this issue occurs because, on nodes that do not have a public address configured, the br-fw-admin bridge enters the forwarding state too late:

2015-09-04T02:41:20.232613+00:00 err: ERROR: Unable to fetch url 'http://mirror.seed-cz1.fuel-infra.org/pkgs/ubuntu-2015-09-02-170157/', error 'Network is unreachable - connect(2)'. Please verify node connectivity to this URL, or remove it from the settings page if it is invalid. on node node-4.test.domain.local

2015-09-04T02:41:24.137105+00:00 err: ERROR: Unable to fetch url 'http://mirror.seed-cz1.fuel-infra.org/pkgs/ubuntu-2015-09-02-170157/', error 'Network is unreachable - connect(2)'. Please verify node connectivity to this URL, or remove it from the settings page if it is invalid. on node node-4.test.domain.local

2015-09-04T02:41:26.553633+00:00 info: [ 3863.584058] br-fw-admin: port 1(lnx-bond1) entered forwarding state

It seems that for each bridge we need to wait for it to get into forwarding state before proceeding.
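
To make the timing in the log excerpt above concrete, here is a small Python sketch that computes the gap between the last failed fetch and the moment the bridge port entered the forwarding state. The two log lines are copied from the excerpt (the error text is abbreviated); the helper name `parse_ts` is only illustrative.

```python
from datetime import datetime


def parse_ts(line: str) -> datetime:
    """Parse the leading ISO-8601 timestamp of a syslog-style line."""
    # The first whitespace-delimited token is the timestamp with a UTC offset.
    return datetime.fromisoformat(line.split()[0])


# Lines taken from the log excerpt in this comment (error text abbreviated).
last_error = "2015-09-04T02:41:24.137105+00:00 err: ERROR: Unable to fetch url"
forwarding = (
    "2015-09-04T02:41:26.553633+00:00 info: [ 3863.584058] "
    "br-fw-admin: port 1(lnx-bond1) entered forwarding state"
)

# Seconds between the last connectivity error and the forwarding transition.
gap = (parse_ts(forwarding) - parse_ts(last_error)).total_seconds()
print(f"bridge became usable {gap:.6f} s after the last failed fetch")
```

In this excerpt the connectivity check ran roughly 2.4 seconds before the bridge port started forwarding.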

Changed in fuel:
status: New → Triaged
Vladimir Kuklin (vkuklin) wrote :

The user impact is that deployment breaks if you configure a bond for the admin interface on the slave nodes. Currently, we do not wait for the admin bridge to enter the forwarding state before we try to send anything through it. This breaks the deployment completely, because the connectivity checks fail.
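
As a rough illustration of "waiting for the admin bridge to enter the forwarding state", the following Python sketch polls the bridge port state until it reports forwarding or a timeout expires. It is not the code used by fuel-library. It assumes the Linux sysfs layout `/sys/class/net/<bridge>/brif/<port>/state`, where the value 3 means forwarding; the function names and the 45-second default are hypothetical choices for this example.

```python
import time
from pathlib import Path

BR_STATE_FORWARDING = 3  # assumed Linux bridge port state value for "forwarding"


def read_port_state(bridge: str, port: str) -> int:
    """Read the bridge port state from sysfs (assumed path layout)."""
    path = Path("/sys/class/net") / bridge / "brif" / port / "state"
    return int(path.read_text().strip())


def wait_for_forwarding(bridge, port, timeout=45.0, interval=1.0,
                        read_state=read_port_state,
                        sleep=time.sleep, clock=time.monotonic) -> bool:
    """Poll until the port is forwarding; return False once the timeout passes."""
    deadline = clock() + timeout
    while True:
        try:
            if read_state(bridge, port) == BR_STATE_FORWARDING:
                return True
        except (OSError, ValueError):
            pass  # port not attached yet, or state file unreadable: keep polling
        if clock() >= deadline:
            return False
        sleep(interval)
```

A caller could run `wait_for_forwarding("br-fw-admin", "lnx-bond1")` before starting the connectivity checks and treat a `False` result as a failed precondition.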

Vladimir Kuklin (vkuklin) wrote :

So far it seems that the bridge did not enter the forwarding state because the active slave of its port was not in the active state on the host node bridge.

summary: - Deploy with bonded interfaces failed: Could not find init script or
- upstart conf file for 'cinder-volume'
+ Deploy with bonded admin interfaces failed: network is unreachable for
+ nodes that are routed through the master node
Dmitry Kalashnik (dkalashnik) wrote :

This bug is a blocker for the telco team.

Changed in fuel:
importance: High → Critical
Stanislav Makar (smakar)
Changed in fuel:
assignee: Vladimir Kuklin (vkuklin) → Stanislav Makar (smakar)
Vladimir Kuklin (vkuklin) wrote :

The initial analysis went wrong: we could not find out why the network was unreachable. The first assumption was that the primary bond interface had been switched to a different one. The idea was to set the primary_reselect bond option to 2 to keep the master the same until it fails. Unfortunately, that did not help. It seems that the issue happens when we hotplug the bond for the first time. In this case Ubuntu may change the bridge MAC address by stopping the interface and cloning the MAC from the first port attached to it, which is lnx-bond1. We also have a post-up sleep 15 command here, which may affect the time at which the interface becomes usable.

Vladimir Kuklin (vkuklin) wrote :

I set the delay_while_up option to 5 and it worked, so this could be a usable workaround for this bug. If we do not find the real fix, we can describe it in the release notes.

Vladimir Kuklin (vkuklin) wrote :

So it seems we have an issue where the bridge interface does not work for 15/45 seconds if a (non-LACP/LACP) bond is plugged into it for the first time. I think there could be an easy workaround: introduce a sleep of 15/45 seconds into the connectivity tests check.
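
For comparison with a fixed sleep, here is a hedged Python sketch of a bounded retry: it probes the repository URL repeatedly and gives up only after a maximum wait, so a node whose bridge comes up early does not pay the full delay. This is an illustration only, not the connectivity test in fuel-library; `fetch_ok`, `wait_until_reachable`, and the 45 s / 3 s defaults are assumptions made for this example.

```python
import time
import urllib.request


def fetch_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL can be opened, False on any network-level error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return False


def wait_until_reachable(url, max_wait=45.0, interval=3.0,
                         probe=fetch_ok,
                         sleep=time.sleep, clock=time.monotonic) -> bool:
    """Probe the URL until it succeeds or max_wait seconds have elapsed."""
    deadline = clock() + max_wait
    while True:
        if probe(url):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The trade-off is that a retry loop hides how long the interface actually took to come up, so logging the elapsed time on success would help keep the underlying delay visible.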

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/220740

Changed in fuel:
assignee: Stanislav Makar (smakar) → Vladimir Kuklin (vkuklin)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/220740
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=f81fdabe6c05be7a3d11d88a7c3a8f3931921c73
Submitter: Jenkins
Branch: master

commit f81fdabe6c05be7a3d11d88a7c3a8f3931921c73
Author: Vladimir Kuklin <email address hidden>
Date: Sat Sep 5 20:15:38 2015 +0300

    Dirty sleep to wait for interfaces

    Introduce 45 sleep as a w/a
    for not ready networking
    when ubuntu decides to respin
    the interface. Look into bug
    comments

    Change-Id: Ia565cd133081a704dca809b5b91c1ec67db0cbb5
    Partial-bug: #1492147

Andrey Maximov (maximov)
tags: added: release-notes
Alexey Shtokolov (ashtokolov) wrote :

W/a was merged and verified on ISO#281

Changed in fuel:
status: In Progress → Fix Committed
Dmitry Kalashnik (dkalashnik) wrote :
Changed in fuel:
status: Fix Committed → In Progress
Alexei Sheplyakov (asheplyakov) wrote :

Why don't we reverse proxy Ubuntu (and MOS) mirrors [1] instead of playing dirty tricks?

[1] https://review.openstack.org/#/c/213771

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/221089

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/221089
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=ddf395c143f88e776c5fc344a427b56c68126675
Submitter: Jenkins
Branch: master

commit ddf395c143f88e776c5fc344a427b56c68126675
Author: Vladimir Kuklin <email address hidden>
Date: Mon Sep 7 21:01:03 2015 +0300

    Fix l3 clear route provider to work with subIFs

    This fix ensures that l3 route has been cleared
    before we try to set default route. This is very
    important for admin interface on the nodes that
    do not have default gateway configured as they
    need to go through admin node during deployment
    stage.

    This issue affects environments with subinterfaces
    for admin interface as we cannot add new route
    through new interface in runtime as we already
    have default gateway through the different interface
    which is eth0/eth1 almost each time.

    E.g.

    default via 10.109.17.1 dev eth1

    should become

    default via 10.109.17.1 dev br-fw-admin

    But l3_clear_route does not understand that these routes
    are different and thus does not clear the first one.
    This in fact leads to inability to set default route
    by puppet l3_ifconfig provider. Look into
    https://bugs.launchpad.net/fuel/+bug/1447638 for more
    details.

    This fix adds 'interface' property into l3_clear_route
    puppet type which is a hacky w/a for l23network ip route
    management.

    It also ensures that if route iface is changed we need to
    recreate the route.

    Change-Id: I44e45ce1e13a4836552b95440cdfb706a5c177c5
    Closes-bug: #1492147
    Related-bug: #1447638
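
The route comparison described in the commit message above can be sketched in Python as follows. This is not the l23network provider code (which is a Puppet type and provider); it only illustrates why two default routes with the same gateway look identical unless the interface is part of the comparison. The names `Route`, `parse_route`, `same_ignoring_interface`, and `needs_recreate` are hypothetical.

```python
from typing import NamedTuple, Optional


class Route(NamedTuple):
    destination: str
    gateway: Optional[str]
    interface: Optional[str]


def parse_route(line: str) -> Route:
    """Parse a line in `ip route` style, e.g. 'default via 10.109.17.1 dev eth1'."""
    tokens = line.split()
    gateway = interface = None
    # Walk adjacent token pairs and pick up the values after 'via' and 'dev'.
    for key, value in zip(tokens[1:], tokens[2:]):
        if key == "via":
            gateway = value
        elif key == "dev":
            interface = value
    return Route(tokens[0], gateway, interface)


def same_ignoring_interface(a: Route, b: Route) -> bool:
    """Comparison without the interface: the behaviour the commit describes as broken."""
    return (a.destination, a.gateway) == (b.destination, b.gateway)


def needs_recreate(current: Route, desired: Route) -> bool:
    """Comparison including the interface: a changed device forces a new route."""
    return current != desired


old = parse_route("default via 10.109.17.1 dev eth1")
new = parse_route("default via 10.109.17.1 dev br-fw-admin")
print(same_ignoring_interface(old, new), needs_recreate(old, new))
```

With the two routes from the commit message, the interface-blind comparison treats them as the same route, while the interface-aware one reports that the route must be recreated.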

Changed in fuel:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/7.0)

Fix proposed to branch: stable/7.0
Review: https://review.openstack.org/221106

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/7.0)

Reviewed: https://review.openstack.org/221106
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=976c4b881f70f2e6eeb87c36d9b6f885696af142
Submitter: Jenkins
Branch: stable/7.0

commit 976c4b881f70f2e6eeb87c36d9b6f885696af142
Author: Vladimir Kuklin <email address hidden>
Date: Mon Sep 7 21:01:03 2015 +0300

    Fix l3 clear route provider to work with subIFs

    This fix ensures that l3 route has been cleared
    before we try to set default route. This is very
    important for admin interface on the nodes that
    do not have default gateway configured as they
    need to go through admin node during deployment
    stage.

    This issue affects environments with subinterfaces
    for admin interface as we cannot add new route
    through new interface in runtime as we already
    have default gateway through the different interface
    which is eth0/eth1 almost each time.

    E.g.

    default via 10.109.17.1 dev eth1

    should become

    default via 10.109.17.1 dev br-fw-admin

    But l3_clear_route does not understand that these routes
    are different and thus does not clear the first one.
    This in fact leads to inability to set default route
    by puppet l3_ifconfig provider. Look into
    https://bugs.launchpad.net/fuel/+bug/1447638 for more
    details.

    This fix adds 'interface' property into l3_clear_route
    puppet type which is a hacky w/a for l23network ip route
    management.

    It also ensures that if route iface is changed we need to
    recreate the route.

    Change-Id: I44e45ce1e13a4836552b95440cdfb706a5c177c5
    Closes-bug: #1492147
    Related-bug: #1447638
    (cherry picked from commit ddf395c143f88e776c5fc344a427b56c68126675)

Stanislav Makar (smakar)
tags: added: on-verification
Stanislav Makar (smakar) wrote :

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "7.0"
  openstack_version: "2015.1.0-7.0"
  api: "1.0"
  build_number: "288"
  build_id: "288"
  nailgun_sha: "93477f9b42c5a5e0506248659f40bebc9ac23943"
  python-fuelclient_sha: "1ce8ecd8beb640f2f62f73435f4e18d1469979ac"
  fuel-agent_sha: "082a47bf014002e515001be05f99040437281a2d"
  fuel-nailgun-agent_sha: "d7027952870a35db8dc52f185bb1158cdd3d1ebd"
  astute_sha: "a717657232721a7fafc67ff5e1c696c9dbeb0b95"
  fuel-library_sha: "121016a09b0e889994118aa3ea42fa67eabb8f25"
  fuel-ostf_sha: "1f08e6e71021179b9881a824d9c999957fcc7045"
  fuelmain_sha: "6b83d6a6a75bf7bca3177fcf63b2eebbf1ad0a85"

tags: removed: on-verification
Vasily Gorin (vgorin) wrote :
Dmitry Pyzhov (dpyzhov)
tags: added: area-library
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
milestone: 7.0 → 8.0
tags: added: 8.0 release-notes-done
removed: release-notes