Compute re-deployment with network bonding and DPDK fails: Puppet (err): Can't add bond 'bond0'

Bug #1580541 reported by Artem Panchenko on 2016-05-11
This bug affects 1 person

Affects: Fuel for OpenStack (Importance: High, Assigned to: Aleksey Kasatkin)
Milestone: Mitaka (Importance: High, Assigned to: Aleksey Kasatkin)

Bug Description

Fuel version info (9.0 build #303): http://paste.openstack.org/show/496680/

When I try to re-deploy the environment after a failure (due to bug #1571763), deployment fails because the compute node with network bonds goes down:

"Deployment has failed. All nodes are finished. Failed tasks: Task[firewall/9] Stopping the deployment process!"

2016-05-11 09:11:48 WARNING [27942] Puppet agent 9 didn't respond within the allotted time
...
2016-05-11 09:24:31 WARNING [27942] Validation of node:
{"uid"=>nil,
 "status"=>"error",
 "error_type"=>"deploy",
 "error_msg"=>
  "All nodes are finished. Failed tasks: Task[firewall/9] Stopping the deployment process!"}
 for report failed: Node uid is not provided

Steps to reproduce:

0. Enable 'experimental' feature group for nailgun
1. Create cluster with VLAN and KVM
2. Add 1 controller and 1 compute node
3. Enable HugePages (256MB) for DPDK on compute node
4. Configure active-backup bond on compute using 2 NICs
5. Assign 'private' network to the bond
6. Enable DPDK for the bond
7. Verify networks
8. Deploy changes
9. Deployment should fail on non-primary controllers after netconfig on computes is done
10. Try to re-deploy cluster w/o reset (deploy changes again)
11. Deployment fails with the same error on the controllers, but the compute node also has errors in puppet.log: http://paste.openstack.org/show/496683/
12. Reset environment
13. Run network verification
14. Deploy environment

Expected result:

Deployment completes, or fails with an error on the controller nodes only.

Actual result:

Deployment fails on the compute node that has a DPDK bond, and the node becomes inaccessible over the network.

Diagnostic snapshot: https://drive.google.com/file/d/0BzaZINLQ8-xkZU81dkUwMFJpSjg/view?usp=sharing

Vladimir Eremin (yottatsa) wrote :

It looks like astute.yaml contains neither the DPDK interfaces nor the interfaces that belong to the bond. This is why deployment fails. From the diagnostic snapshot, node-9/etc/astute.yaml:

  interfaces:
    eno1:
      vendor_specific: {bus_info: '0000:03:00.0', driver: igb}
    enp10s0f0:
      vendor_specific: {bus_info: '0000:0a:00.0', driver: igb}
    enp10s0f1:
      vendor_specific: {bus_info: '0000:0a:00.1', driver: igb}
    ens3f1:
      vendor_specific: {bus_info: '0000:03:00.1', driver: igb}
  ...
  - action: add-bond
    bond_properties: {mode: active-backup}
    bridge: br-prv
    interface_properties:
      vendor_specific: {disable_offloading: true}
    interfaces: []
    name: bond0
    provider: dpdkovs

The database dump also shows that no interfaces are assigned to the bond.
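The broken state above is easy to spot mechanically. Here is a minimal sketch (the function name is illustrative, not nailgun API) that scans an astute.yaml-style transformations list for add-bond entries that have lost their slave interfaces:

```python
# Hedged sketch: a sanity check for the symptom described above. It flags
# any add-bond transformation whose slave interface list is empty, which is
# exactly the broken state seen on node-9. The data shape mirrors the
# astute.yaml fragment above; the helper itself is hypothetical.

def find_broken_bonds(transformations):
    """Return names of bonds that have no slave interfaces assigned."""
    return [
        t.get("name")
        for t in transformations
        if t.get("action") == "add-bond" and not t.get("interfaces")
    ]

# The entry from the diagnostic snapshot: 'interfaces' is an empty list.
scheme = [
    {
        "action": "add-bond",
        "bond_properties": {"mode": "active-backup"},
        "bridge": "br-prv",
        "interfaces": [],          # slaves were lost after the failed deploy
        "name": "bond0",
        "provider": "dpdkovs",
    }
]

print(find_broken_bonds(scheme))  # -> ['bond0']; a healthy scheme prints []
```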

Dmitry Klenov (dklenov) on 2016-05-11
Changed in fuel:
status: New → Confirmed
tags: added: area-python
Changed in fuel:
assignee: nobody → Networking (l23-network)
Aleksey Kasatkin (alekseyk-ru) wrote :

When a node has 'error' status, interfaces info from nailgun-agent is still accepted by nailgun. In this case nailgun-agent cannot find the interfaces that were configured for DPDK (they are no longer visible to the kernel once bound to a userspace driver), so these interfaces are removed from the DB. When the node is bootstrapped again, the interfaces reappear (in the info from nailgun-agent, and therefore in the DB), but their relations with the corresponding bonds are lost.

We can simply stop accepting NICs info from nailgun-agent when a node has 'error' status, so that nailgun accepts this info only for nodes in 'discover' status.
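A rough sketch of that guard (hypothetical names and data shapes; the actual change lives in fuel-web, not in this code):

```python
# Hedged sketch of the proposed fix, not the actual fuel-web code.
# nailgun-agent periodically reports the NICs it can see; once DPDK rebinds
# a NIC to a userspace driver, the agent no longer sees it. If nailgun
# accepts such a report for a node in 'error' status, the missing NICs are
# dropped from the DB and the bond loses its slaves. The guard below
# accepts agent NIC reports only while the node is still in 'discover'
# (bootstrap) status.

NODE_STATUS_DISCOVER = "discover"  # illustrative constant

def should_accept_agent_interfaces(node_status):
    """Accept NIC info from nailgun-agent only for bootstrap nodes."""
    return node_status == NODE_STATUS_DISCOVER

def update_interfaces(node, reported_nics):
    """Apply an agent NIC report unless the node is past bootstrap."""
    if not should_accept_agent_interfaces(node["status"]):
        return node["interfaces"]  # keep the configuration already in the DB
    node["interfaces"] = reported_nics
    return node["interfaces"]

# A node in 'error' status keeps its bond slaves even though the agent
# can no longer see the DPDK-bound NICs:
node = {"status": "error", "interfaces": ["enp10s0f0", "enp10s0f1"]}
print(update_interfaces(node, []))  # -> ['enp10s0f0', 'enp10s0f1']
```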

Changed in fuel:
status: Confirmed → Triaged

Fix proposed to branch: master
Review: https://review.openstack.org/315445

Changed in fuel:
assignee: Networking (l23-network) → Aleksey Kasatkin (alekseyk-ru)
status: Triaged → In Progress
Aleksey Kasatkin (alekseyk-ru) wrote :

The same problem can be observed when bonds are used without DPDK, so it does not actually depend on the DPDK configuration.

tags: added: feature-bonding
removed: feature-dpdk

Reviewed: https://review.openstack.org/315445
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=2576e69e9629859d4d28259ed0885523c122962f
Submitter: Jenkins
Branch: master

commit 2576e69e9629859d4d28259ed0885523c122962f
Author: Aleksey Kasatkin <email address hidden>
Date: Thu May 12 12:18:16 2016 +0300

    Allow update of network configuration for bootstrap nodes only

    Before this change, bonds get broken (their slaves become lost)
    when node gets 'error' status after netconfig is done.
    So, redeployment of such node would not be successful.

    Change-Id: I9de989d0566cf4e5e1e735ec432cb921f43522eb
    Closes-Bug: 1580541

Changed in fuel:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/318083
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=b9f50509baf0fd0816330eded48e9e4bcdc49093
Submitter: Jenkins
Branch: stable/mitaka

commit b9f50509baf0fd0816330eded48e9e4bcdc49093
Author: Aleksey Kasatkin <email address hidden>
Date: Thu May 12 12:18:16 2016 +0300

    Allow update of network configuration for bootstrap nodes only

    Before this change, bonds get broken (their slaves become lost)
    when node gets 'error' status after netconfig is done.
    So, redeployment of such node would not be successful.

    Change-Id: I9de989d0566cf4e5e1e735ec432cb921f43522eb
    Closes-Bug: 1580541

tags: added: on-verification

Verified on
[root@nailgun ~]# shotgun2 short-report
cat /etc/fuel_build_id:
 495
cat /etc/fuel_build_number:
 495
cat /etc/fuel_release:
 9.0
cat /etc/fuel_openstack_version:
 mitaka-9.0
rpm -qa | egrep 'fuel|astute|network-checker|nailgun|packetary|shotgun':
 fuel-release-9.0.0-1.mos6349.noarch
 fuel-misc-9.0.0-1.mos8460.noarch
 python-packetary-9.0.0-1.mos140.noarch
 fuel-bootstrap-cli-9.0.0-1.mos285.noarch
 fuel-migrate-9.0.0-1.mos8460.noarch
 rubygem-astute-9.0.0-1.mos750.noarch
 fuel-mirror-9.0.0-1.mos140.noarch
 shotgun-9.0.0-1.mos90.noarch
 fuel-openstack-metadata-9.0.0-1.mos8743.noarch
 fuel-notify-9.0.0-1.mos8460.noarch
 nailgun-mcagents-9.0.0-1.mos750.noarch
 python-fuelclient-9.0.0-1.mos325.noarch
 fuel-9.0.0-1.mos6349.noarch
 fuel-utils-9.0.0-1.mos8460.noarch
 fuel-setup-9.0.0-1.mos6349.noarch
 fuel-provisioning-scripts-9.0.0-1.mos8743.noarch
 fuel-library9.0-9.0.0-1.mos8460.noarch
 network-checker-9.0.0-1.mos74.x86_64
 fuel-agent-9.0.0-1.mos285.noarch
 fuel-ui-9.0.0-1.mos2717.noarch
 fuel-ostf-9.0.0-1.mos936.noarch
 fuelmenu-9.0.0-1.mos274.noarch
 fuel-nailgun-9.0.0-1.mos8743.noarch
[root@nailgun ~]#

After a failed deployment, the compute nodes retained their DPDK-related information.

I triggered the failure manually, so I have only tested the behaviour "re-deploying the cluster without a reset does not erase the DPDK OVS info".

As can be seen in the code, the erasure is prevented in objects/node. There are probably more scenarios that should be checked; this one is fixed.

tags: removed: on-verification