kernel bug hit by netconfig.pp when deploying HA cluster with ceph and network templates

Bug #1507613 reported by Vasily Gorin
This bug affects 1 person
Affects             Status        Importance  Assigned to      Milestone
Fuel for OpenStack  Fix Released  High        Stanislav Makar
7.0.x               Invalid       High        MOS Maintenance

Bug Description

CI: https://product-ci.infra.mirantis.net/job/8.0.system_test.ubuntu.network_templates/22/testReport/(root)/deploy_ceph_net_tmpl/

Scenario:
1. Revert snapshot with 5 slaves
2. Create cluster (HA) with Neutron VLAN/VXLAN/GRE
3. Add 3 controller + ceph nodes
4. Add 2 compute + ceph nodes
5. Upload 'ceph' network template
6. Create custom network groups based on the template's endpoint assignments
7. Run network verification
8. Deploy cluster

Expected result:
The cluster deploys successfully.

Actual result:
Deployment failed.
All controllers go offline.

Vasily Gorin (vgorin)
Changed in fuel:
importance: Undecided → High
assignee: nobody → Fuel Python Team (fuel-python)
Changed in fuel:
milestone: none → 8.0
Alex Schultz (alex-schultz) wrote :

Looking at the environment, the controllers are sitting at a load of 200+ with many ip commands running. They appear to be blocked while trying to change the network interfaces and have run into a kernel bug. See the screenshot.

Changed in fuel:
status: New → Confirmed
assignee: Fuel Python Team (fuel-python) → Alex Schultz (alex-schultz)
summary: - Deploy HA environment with Ceph, Neutron and network template Failed
+ kernel bug hit by netconfig.pp when deploying HA cluster with ceph and
+ network templates
Changed in fuel:
assignee: Alex Schultz (alex-schultz) → Fuel Library Team (fuel-library)
Matthew Mosesohn (raytrac3r) wrote :

Kernel logs from node-4:
http://paste.openstack.org/show/13Fqyk2Ev2ZpyI6nuRGM/
Maybe the duplicate MAC is partly to blame

Dmitry Pyzhov (dpyzhov)
tags: added: area-library
tags: added: swarm-blocker
Stanislav Makar (smakar)
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Stanislav Makar (smakar)
Stanislav Makar (smakar)
Changed in fuel:
status: Confirmed → In Progress
Stanislav Makar (smakar) wrote :

As we can see, during bond disassembly (removing the last slave) we hit a kernel BUG, but only under some unknown conditions.
I say "unknown conditions" because all other attempts work fine.

1. We have a kernel bug that is reproducible every time.
2. We assemble and disassemble the bond on every Puppet run. This is not the Puppet way; we should do it only if the slaves have changed.

LOGS:

Puppet log:
2015-11-09 13:40:42 +0000 L2_bond[lnxbond0](provider=lnx) (debug): Disassemble bond 'lnxbond0'
2015-11-09 13:40:42 +0000 L2_bond[lnxbond0](provider=lnx) (debug): Remove interface 'eth2.555' from bond 'lnxbond0'
2015-11-09 13:40:42 +0000 Puppet::Type::L2_bond::ProviderLnx (debug): SET sys.property: /sys/class/net/lnxbond0/bonding/slaves << -eth2.555
2015-11-09 13:40:42 +0000 L2_bond[lnxbond0](provider=lnx) (debug): Remove interface 'eth2.666' from bond 'lnxbond0'
2015-11-09 13:40:42 +0000 Puppet::Type::L2_bond::ProviderLnx (debug): SET sys.property: /sys/class/net/lnxbond0/bonding/slaves << -eth2.666
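
The `SET sys.property` lines above write `-<interface>` to the bond's `slaves` file in sysfs, which is how the Linux bonding driver detaches a slave (`+<interface>` attaches one). As a minimal sketch of the "only if the slaves have changed" idea, the writes could be computed from the difference between the current and desired slave lists, so that an unchanged bond produces no writes at all. The helper names `slave_updates` and `apply_slave_updates` are hypothetical, and this is an illustration in Python, not the fuel-library provider (which is written in Ruby):

```python
def slave_updates(current, desired):
    """Return the sysfs tokens needed to turn `current` slaves into `desired`.

    Hypothetical helper: removals ("-name") come first, then additions
    ("+name"). An empty list means the bond already matches and must not
    be touched.
    """
    current_set, desired_set = set(current), set(desired)
    removals = ["-" + name for name in current if name not in desired_set]
    additions = ["+" + name for name in desired if name not in current_set]
    return removals + additions


def apply_slave_updates(bond, updates, sysfs_root="/sys/class/net"):
    """Write each token to <sysfs_root>/<bond>/bonding/slaves.

    One token per write, mirroring the Puppet debug log. Needs root and an
    existing bond, so it is not exercised in the example below.
    """
    path = f"{sysfs_root}/{bond}/bonding/slaves"
    for update in updates:
        with open(path, "w") as handle:
            handle.write(update)


# Unchanged slave list: nothing to write, so the bond is left alone.
unchanged = slave_updates(["eth2.555", "eth2.666"], ["eth2.555", "eth2.666"])

# One slave dropped, one added: only those two writes are produced.
changed = slave_updates(["eth2.555", "eth2.666"], ["eth2.555", "eth2.777"])
```

With this shape, a Puppet run that finds the bond already in the desired state never reaches the slave-removal path where the kernel BUG was triggered.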

<3>Nov 9 13:40:42 node-1 kernel: [ 4078.458077] tried to remove device eth2 from br-fake
<2>Nov 9 13:40:42 node-1 kernel: [ 4078.458173] kernel BUG at /build/linux-XHaR1x/linux-3.13.0/net/core/dev.c:4766!
<1>Nov 9 13:40:42 node-1 kernel: [ 4078.460096] RIP [<ffffffff81629b9c>] __netdev_adjacent_dev_remove+0x14c/0x180

<4>Nov 9 13:40:42 node-1 kernel: [ 4078.458150] ------------[ cut here ]------------
<4>Nov 9 13:40:42 node-1 kernel: [ 4078.458205] invalid opcode: 0000 [#1] SMP
<4>Nov 9 13:40:42 node-1 kernel: [ 4078.458228] Modules linked in: xt_REDIRECT xt_nat xt_mark ip6table_raw iscsi_target_mod target_core_mod configfs ipt_REJECT nf_conntrack_netlink nfnetlink ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat veth xt_conntrack iptable_raw xt_CT xt_comment xt_multiport ip6table_filter ip6_tables 8021q garp mrp bridge stp llc bonding openvswitch(OX) gre vxlan ip_tunnel btrfs ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs libcrc32c xt_CHECKSUM xt_tcpudp iptable_mangle iptable_filter ip_tables x_tables kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd mac_hid serio_raw nf_conntrack_proto_gre nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack nls_utf8 isofs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath psmouse linear pata_acpi e1000
<4>Nov 9 13:40:42 node-1 kernel: [ 4078.458691] CPU: 0 PID: 19413 Comm: puppet Tainted: G OX 3.13.0-67-generic #110-Ubuntu
<4>Nov 9 13:40:42 node-1 kernel: [ 4078.458746] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
<4>Nov 9 13:40:42 node-1 kernel: [ 4078.458800] task: ffff88003b34c800 ti: ffff8800743aa000 task.ti: ffff8800743aa000
<4>Nov 9 13:40:42 node-1 kernel: [ 4078.458831] RIP: 0010:[<ffffffff81629b9c>] [<ffffffff81629b9c>] __netdev_adjacent_dev_remove+0x14c/0x180
<4>Nov 9 13:40:42 node-1 kernel: [ 4078.458888] RSP: 0018:ffff8800743abd48 EFLAGS: 00010286
<4>Nov 9 13:40:42 node-1 kernel: [ 4078.458913] RAX: 0000000000000028 RBX: ffff880036330098 RCX: 0000000000000006
<4>Nov 9 13:40:42 node-1 kernel: [ 4078.458944]...


OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/247498

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/247498
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=40e486a8c078348118eb29334e2a006934434a73
Submitter: Jenkins
Branch: master

commit 40e486a8c078348118eb29334e2a006934434a73
Author: Stanislav Makar <email address hidden>
Date: Thu Nov 19 13:13:02 2015 +0000

    Re-assembling bond

    * Now we only re-assemble bond if we change bond mode. Before we re-assembled
    bond every puppet run, that is not needed and sometimes led to kernel panic.
    * Refactor bond downing and upping behaviour during configuration:
    we only down and up bond if we change bond configuration, if nothing is
    changed - we do nothing.

    Change-Id: Ie9d15c7474c27dd396ae5e46cf0e41ff25786574
    Closes-bug: #1507613
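
The rule described in this commit message can be sketched as a small decision function. This is an illustration in Python rather than the actual Ruby provider change; `BondPlan`, `plan_bond_changes`, and the property names used in the example (`mode`, `miimon`) are assumptions chosen for the sketch:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BondPlan:
    reassemble: bool  # detach and re-attach the slaves
    bounce: bool      # bring the bond down and up again


def plan_bond_changes(current, desired):
    """Decide how disruptive a bond update needs to be.

    `current` and `desired` are dicts of bond properties. Only keys present
    in `desired` are compared, so properties the manifest does not manage
    are ignored.
    """
    changed = {key for key, value in desired.items() if current.get(key) != value}
    return BondPlan(
        reassemble="mode" in changed,  # re-assemble only when the mode changes
        bounce=bool(changed),          # down/up only when something changed
    )


# Nothing changed: the bond is left completely alone.
noop = plan_bond_changes({"mode": "802.3ad", "miimon": "100"},
                         {"mode": "802.3ad", "miimon": "100"})

# Only a non-mode property changed: down/up, but slaves stay attached.
tweak = plan_bond_changes({"mode": "802.3ad", "miimon": "100"},
                          {"mode": "802.3ad", "miimon": "50"})

# Mode changed: full re-assembly.
remode = plan_bond_changes({"mode": "802.3ad"}, {"mode": "active-backup"})
```

Keeping the decision separate from the code that touches the interfaces makes the no-op case easy to check in isolation.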

Changed in fuel:
status: In Progress → Fix Committed
tags: added: on-verification
Dmitriy Kruglov (dkruglov) wrote :

Verification of the issue fix is blocked by https://bugs.launchpad.net/fuel/+bug/1525926

tags: removed: on-verification
Dmitry Tyzhnenko (dtyzhnenko) wrote :

The bug is not reproduced on the latest ISO: https://product-ci.infra.mirantis.net/job/8.0.system_test.ubuntu.network_templates/133/

Fuel version 8.0-510

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "510"
  build_id: "510"
  fuel-nailgun_sha: "41170db11c366af5fe04c1c539c11b2e3e388ef9"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "658be72c4b42d3e1436b86ac4567ab914bfb451b"
  fuel-nailgun-agent_sha: "b2bb466fd5bd92da614cdbd819d6999c510ebfb1"
  astute_sha: "b81577a5b7857c4be8748492bae1dec2fa89b446"
  fuel-library_sha: "ec7e212972ead554f21b52b9e165156665f659df"
  fuel-ostf_sha: "5fe41945c2a49f26c849df1fd46329f6db1ab6b0"
  fuel-mirror_sha: "351d568fa3b3e4dd062054b91d766aa54d379867"
  fuelmenu_sha: "234cb4cbb30fbd2df00f388c28f31606d9cae15f"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "a43cf96cd9532f10794dce736350bf5bed350e9d"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "94507c5e4dad6d8cfbd8f5d41aa8389d5335990a"

Changed in fuel:
status: Fix Committed → Fix Released
Denis Meltsaykin (dmeltsaykin) wrote :

Setting as Invalid because this test doesn't fail on 7.0.
