Network interfaces are down after cluster restart on nodes that aren't connected to public network

Bug #1510072 reported by Andrey Sledzinskiy on 2015-10-26
This bug affects 2 people
Affects: Fuel for OpenStack; Importance: High; Assigned to: Stanislav Makar
Affects: 7.0.x; Importance: High; Assigned to: slava valyavskiy

Bug Description

Scenario:
1. Create cluster with active-backup bonding and Neutron VXLAN
2. Add 3 nodes with controller role
3. Add 1 node with compute role
4. Add 1 node with cinder role
5. Set up bonding for all interfaces (including admin interface bonding)
6. Run network verification
7. Deploy the cluster
8. After deployment restart all cluster nodes

Actual result: node-1 and node-2 aren't reachable; the br-fw-admin interface failed to start on them.

api: '1.0'
astute_sha: eebbb2470cb800e532de19c29673558aeb86aae4
auth_required: true
build_id: '66'
build_number: '66'
feature_groups:
- mirantis
fuel-agent_sha: e4056a7923dd607521d97763d5dfb6de8a33ab5d
fuel-createmirror_sha: 0315aa30aee56e10f142683a25340c3c9d2f1e85
fuel-library_sha: bc044a0562cda204245b2a9136fa4bd6d7ef723e
fuel-nailgun-agent_sha: e377e83268abd406f22b656b76014656077a6a74
fuel-nailgun_sha: 2476325f95f3bbdc0ff5dbd827868f2ab243e1b4
fuel-ostf_sha: 9f500668555292add5d87c942e0cd804aefa6df2
fuelmain_sha: 21b84eb3d09883a7da526ebc4bd21458d2e9844a
openstack_version: 2015.1.0-8.0
production: docker
python-fuelclient_sha: 8ea3b64d21c4d729d1069f3aa5528ede3c76b412
release: '8.0'
release_versions:
  2015.1.0-8.0:
    VERSION:
      api: '1.0'
      astute_sha: eebbb2470cb800e532de19c29673558aeb86aae4
      build_id: '66'
      build_number: '66'
      feature_groups:
      - mirantis
      fuel-agent_sha: e4056a7923dd607521d97763d5dfb6de8a33ab5d
      fuel-createmirror_sha: 0315aa30aee56e10f142683a25340c3c9d2f1e85
      fuel-library_sha: bc044a0562cda204245b2a9136fa4bd6d7ef723e
      fuel-nailgun-agent_sha: e377e83268abd406f22b656b76014656077a6a74
      fuel-nailgun_sha: 2476325f95f3bbdc0ff5dbd827868f2ab243e1b4
      fuel-ostf_sha: 9f500668555292add5d87c942e0cd804aefa6df2
      fuelmain_sha: 21b84eb3d09883a7da526ebc4bd21458d2e9844a
      openstack_version: 2015.1.0-8.0
      production: docker
      python-fuelclient_sha: 8ea3b64d21c4d729d1069f3aa5528ede3c76b412
      release: '8.0'

tags: added: swarm-blocker
Dmitry Pyzhov (dpyzhov) on 2015-10-27
tags: added: area-library
Sergey Vasilenko (xenolog) wrote :

This behavior depends on the physical network topology.

Are you sure that both networks (bridges on the host system) that carry traffic for the two bonded interfaces are merged into one bridge before the traffic is passed to the master node?

Sergey Vasilenko (xenolog) wrote :

Please provide the full physical network topology.

Changed in fuel:
status: New → Incomplete
Dmitry Klenov (dklenov) wrote :

@Andrey, this issue is still considered a swarm blocker - so please provide all the details needed by Sergey.

Artem Panchenko (apanchenko-8) wrote :

@Sergey, it doesn't matter what network architecture you have: after a reboot, non-controller nodes (which don't have access to the public network) lose their network configuration due to broken config files. This bug is a regression caused by https://review.openstack.org/#/c/232479/5 : configure_default_route.pp saves the configs for the admin and management bridges in OVS format, because there are now 2 default providers for l23_stored_config:

2015-11-10 03:18:34 +0000 Scope(Class[main]) (notice): MODULAR: configure_default_route.pp
2015-11-10 03:18:35 +0000 Puppet (warning): Found multiple default providers for l23_stored_config: ovs_ubuntu, lnx_ubuntu; using ovs_ubuntu
2015-11-10 03:18:36 +0000 Puppet (debug): Prefetching ovs_ubuntu resources for l23_stored_config
2015-11-10 03:18:36 +0000 Puppet::Type::L23_stored_config::ProviderOvs_ubuntu (debug): format_file('/etc/network/interfaces.d/ifcfg-br-fw-admin')::properties: {:ovs_type=>"OVSIntPort", :bridge=>:absent, :ipaddr=>"10.109.10.6/24", :bond_slaves=>[:absent]}
2015-11-10 03:18:36 +0000 Puppet::Type::L23_stored_config::ProviderOvs_ubuntu (debug): format_file('/etc/network/interfaces.d/ifcfg-br-fw-admin')::content: ["auto br-fw-admin", "allow-absent br-fw-admin", "iface br-fw-admin inet static", "address 10.109.10.6/24", "ovs_type OVSIntPort"]
2015-11-10 03:18:37 +0000 Puppet::Type::L23_stored_config::ProviderOvs_ubuntu (debug): format_file('/etc/network/interfaces.d/ifcfg-br-mgmt')::properties: {:ovs_type=>"OVSIntPort", :bridge=>:absent, :ipaddr=>"192.168.0.6/24", :gateway=>"192.168.0.1", :bond_slaves=>[:absent]}
2015-11-10 03:18:37 +0000 Puppet::Type::L23_stored_config::ProviderOvs_ubuntu (debug): format_file('/etc/network/interfaces.d/ifcfg-br-mgmt')::content: ["auto br-mgmt", "allow-absent br-mgmt", "iface br-mgmt inet static", "address 192.168.0.6/24", "gateway 192.168.0.1", "ovs_type OVSIntPort"]

Looks like we need to set the provider type explicitly when configuring the default route.

BTW, commenting out this line also solves the problem:

https://github.com/openstack/fuel-library/blob/master/deployment/puppet/l23network/lib/puppet/provider/l23_stored_config/ovs_ubuntu.rb#L9
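
To check whether a node's stored config was written by the wrong provider, something like the following could be used. This is a minimal Python sketch for illustration only, not part of fuel-library. The helper name `ovs_style_lines` and the marker list are assumptions taken from the debug output above (`allow-absent` and `ovs_type` appear in the stanzas written by ovs_ubuntu). The sample stanza is copied from the log entry for br-fw-admin.

```python
# Hypothetical helper for illustration; not part of fuel-library.
# Assumption: these prefixes mark a stanza written by the ovs_ubuntu provider,
# based only on the Puppet debug output quoted in this report.
OVS_MARKERS = ("ovs_type", "allow-absent")


def ovs_style_lines(stanza_lines):
    """Return the lines of an interfaces.d stanza that look OVS-specific."""
    return [line for line in stanza_lines
            if line.strip().startswith(OVS_MARKERS)]


# Stanza generated for br-fw-admin, copied from the debug log above.
br_fw_admin = [
    "auto br-fw-admin",
    "allow-absent br-fw-admin",
    "iface br-fw-admin inet static",
    "address 10.109.10.6/24",
    "ovs_type OVSIntPort",
]

suspicious = ovs_style_lines(br_fw_admin)
print(suspicious)
```

A non-empty result for a bridge that is supposed to be a plain Linux bridge suggests the file was saved in OVS format.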

Changed in fuel:
status: Incomplete → Triaged
summary: - Network is unreachable for nodes that are routed through the master node
- after cluster restart
+ Network interfaces are down after cluster restart on nodes that aren't
+ connected to public network
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Stanislav Makar (smakar)

Fix proposed to branch: master
Review: https://review.openstack.org/244017

Changed in fuel:
status: Triaged → In Progress

Reviewed: https://review.openstack.org/244017
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=f93b87f54d673e7229970c190626086a6cbdb721
Submitter: Jenkins
Branch: master

commit f93b87f54d673e7229970c190626086a6cbdb721
Author: Stanislav Makar <email address hidden>
Date: Wed Nov 11 08:19:59 2015 +0000

    Fix the problem with regression after reboot

    *Add new if_type vport
    *Test coverage

    Change-Id: I65cbbad1c35a34dac86b7331a04468fc0d060d83
    Closes-bug: #1510072

Changed in fuel:
status: In Progress → Fix Committed

8.0.system_test.ubuntu.ceph_ha_one_controller

release_versions:
  2015.1.0-8.0:
    VERSION:
      api: '1.0'
      astute_sha: 959b06c5ef8143125efd1727d350c050a922eb12
      build_id: '152'
      build_number: '152'
      feature_groups:
      - mirantis
      fuel-agent_sha: 07560a9fc3ce5301ace04d2d3e5d68db6ee4f8d5
      fuel-createmirror_sha: a034dcb06520df58a7338816900a431a6b61d83f
      fuel-library_sha: 31f6ae4ced72927287b513e9c4e3a24d367e7736
      fuel-nailgun-agent_sha: 3e9d17211d65c80bf97c8d83979979f6c7feb687
      fuel-nailgun_sha: e72e94138d159308e85a16c382e90b54c7bc7c79
      fuel-ostf_sha: f169d495691ea3d40d3d6d0278265698d3f6ed14
      fuel-upgrade_sha: 1e894e26d4e1423a9b0d66abd6a79505f4175ff6
      fuelmain_sha: b5eb33ca7147dfda7a943a7f8f58c28e86d63992
      fuelmenu_sha: 8a32c53c1fa13b036000f589f96e876277dbd071
      network-checker_sha: a57e1d69acb5e765eb22cab0251c589cd76f51da
      openstack_version: 2015.1.0-8.0
      production: docker
      python-fuelclient_sha: e685d68c1c0d0fa0491a250f07d9c3a8d0f9608c
      release: '8.0'
      shotgun_sha: 25dd78a3118267e3616df0727ce746e7dead2d67

Scenario:
1. Create cluster in HA mode with 1 controller
2. Add 1 node with controller role
3. Add 1 node with compute and Ceph OSD roles
4. Add 1 node with Ceph OSD role
5. Deploy the cluster
6. Check Ceph status
7. Read current partitions
8. Warm-reboot Ceph nodes
9. Read partitions again
10. Check Ceph health
11. Cold-reboot Ceph nodes
12. Read partitions again
13. Check Ceph health

======================================================================
FAIL: Check that Ceph OSD partitions are remounted after reboot
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/proboscis/case.py", line 296, in testng_method_mistake_capture_func
    compatability.capture_type_error(s_func)
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/proboscis/compatability/exceptions_2_6.py", line 27, in capture_type_error
    func()
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/proboscis/case.py", line 350, in func
    func(test_case.state.get_state())
  File "/home/jenkins/workspace/8.0.system_test.ubuntu.ceph_ha_one_controller/fuelweb_test/helpers/decorators.py", line 80, in wrapper
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/8.0.system_test.ubuntu.ceph_ha_one_controller/fuelweb_test/tests/test_ceph.py", line 879, in check_ceph_partitions_after_reboot
    [self.fuel_web.environment.d_env.get_node(name=node)])
  File "/home/jenkins/workspace/8.0.system_test.ubuntu.ceph_ha_one_controller/fuelweb_test/models/fuel_web_client.py", line 1593, in warm_restart_nodes
    self.warm_start_nodes(devops_nodes)
  File "/home/jenkins/workspace/8.0.system_test.ubuntu.ceph_ha_one_controller/fuelweb_test/models/fuel_web_client.py", line 1586, in warm_sta...


Changed in fuel:
status: Fix Committed → Confirmed

Fix proposed to branch: master
Review: https://review.openstack.org/246296

Changed in fuel:
status: Confirmed → In Progress

Reviewed: https://review.openstack.org/246296
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=4c36911a8175fa8f7572513fdc61efde3ad24ad3
Submitter: Jenkins
Branch: master

commit 4c36911a8175fa8f7572513fdc61efde3ad24ad3
Author: Stanislav Makar <email address hidden>
Date: Tue Nov 17 10:10:42 2015 +0000

    Refactor function configure_default_route

    Before the function configure_default_route was a little part of function
    generate_network_config which changed default gateway only, due to the provider
    for interfaces was not picked correctly.
    Now if default route is needed to change we just modify network_scheme
    and call generate_network_config with this network_scheme, if no - do
    nothing.
    Leave only provider lnx as default for l23_stored_config.

    Change-Id: I33e88550af5d5cce2886254444ee5d450e578a1c
    Closes-bug: #1510072
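
The approach described in this commit message can be sketched roughly as follows. This is a Python illustration only; the actual change lives in the fuel-library Puppet code. The function name `scheme_with_default_route` and the layout of `network_scheme` (per-endpoint settings under `endpoints`, with a `gateway` key) are assumptions made for the example.

```python
import copy


def scheme_with_default_route(network_scheme, endpoint, gateway):
    """Return an updated copy of network_scheme, or None if nothing changes.

    Assumption: per-endpoint settings are stored under
    network_scheme["endpoints"][name], with the default route in "gateway".
    """
    current = network_scheme["endpoints"][endpoint].get("gateway")
    if current == gateway:
        return None  # default route is already correct: do nothing
    updated = copy.deepcopy(network_scheme)  # leave the caller's scheme intact
    updated["endpoints"][endpoint]["gateway"] = gateway
    return updated


# Illustrative scheme using the management address and gateway from the log.
scheme = {"endpoints": {"br-mgmt": {"IP": ["192.168.0.6/24"]}}}
new_scheme = scheme_with_default_route(scheme, "br-mgmt", "192.168.0.1")
# A non-None result would then go through the regular config generation,
# so every interface is handled by the same code path and provider choice.
```

The point of the design is that the default route is no longer written by a separate, smaller code path; only the input scheme changes.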

Changed in fuel:
status: In Progress → Fix Committed
tags: added: on-verification
Grigory Mikhailov (gmikhailov) wrote :

Verified on ISO #247.
Environment created via dos.py.
Described bug is not observed.

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  openstack_version: "2015.1.0-8.0"
  api: "1.0"
  build_number: "247"
  build_id: "247"
  fuel-nailgun_sha: "86cebc1d92c7cc9ca25b00f5590954a7c4f880a0"
  python-fuelclient_sha: "91474bd8c526f4f536ab13368feb4a5c1b84d185"
  fuel-agent_sha: "660c6514caa8f5fcd482f1cc4008a6028243e009"
  fuel-nailgun-agent_sha: "a33a58d378c117c0f509b0e7badc6f0910364154"
  astute_sha: "b60624ee2c5f1d6d805619b6c27965a973508da1"
  fuel-library_sha: "032c707ec800f11044b32733dd4d395e06c209d0"
  fuel-ostf_sha: "65de07b5dce50349e7bc414f364505483c34e2b1"
  fuel-mirror_sha: "bfe7af26b7e6fdd46a16480481cc757f67958177"
  fuelmenu_sha: "fcb15df4fd1a790b17dd78cf675c11c279040941"
  shotgun_sha: "a0bd06508067935f2ae9be2523ed0d1717b995ce"
  network-checker_sha: "a3534f8885246afb15609c54f91d3b23d599a5b1"
  fuel-upgrade_sha: "1e894e26d4e1423a9b0d66abd6a79505f4175ff6"
  fuelmain_sha: "fda7c87dea9fb54c08bd3844d277b2e4778924e4"

Changed in fuel:
status: Fix Committed → Fix Released
tags: removed: on-verification
Anton Matveev (amatveev) wrote :

sla1 for MOS 7.0

tags: added: customer-found sla1

Reviewed: https://review.openstack.org/267055
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=3315017b92a0e92243932fbd24c4c1827e75ef36
Submitter: Jenkins
Branch: stable/7.0

commit 3315017b92a0e92243932fbd24c4c1827e75ef36
Author: Stanislav Makar <email address hidden>
Date: Wed Nov 11 08:19:59 2015 +0000

    Fix the problem with regression after reboot

    * Add new if_type vport
    * Test coverage

    Change-Id: I65cbbad1c35a34dac86b7331a04468fc0d060d83
    Closes-bug: #1510072

tags: added: 7.0-mu-2