VMs are losing connectivity and their IPs after HA failover

Bug #1371104 reported by Aviram Bar-Haim
This bug affects 1 person
Affects             Status         Importance  Assigned to        Milestone
Fuel for OpenStack  Fix Committed  High        Sergii Golovatiuk
5.1.x               Invalid        High        Oleksiy Molchanov
6.0.x               Fix Committed  High        Sergii Golovatiuk

Bug Description

Env:
HA cluster with Mellanox SR-IOV enabled.

After an HA installation with the default Neutron mechanism driver (ovs), CirrOS-based VMs launch successfully, can ping 8.8.8.8, and floating IP assignment works (same on CentOS and Ubuntu).

About 5 minutes after powering off the primary controller, the console and the OpenStack API still work, but existing VMs lose their private and floating IPs (same on CentOS and Ubuntu).

In CentOS HA, the DHCP port did not recover and there was no ping to the outside world.
In Ubuntu HA, the DHCP port came up after 15 minutes and the ports got IPs, but there was still no ping to the outside world.

Rebooting the VMs or restarting their networking does not help.

Version:
{"build_id": "2014-09-12_05-20-22", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "24", "auth_required": true, "api": "1.0", "nailgun_sha": "d389bc6489fe296c9c210f7c65ac84e154a8b82b", "production": "docker", "fuelmain_sha": "d899675a5a393625f8166b29099d26f45d527035", "astute_sha": "f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13", "feature_groups": ["experimental"], "release": "5.1", "release_versions": {"2014.1.1-5.1": {"VERSION": {"build_id": "2014-09-12_05-20-22", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "24", "api": "1.0", "nailgun_sha": "d389bc6489fe296c9c210f7c65ac84e154a8b82b", "production": "docker", "fuelmain_sha": "d899675a5a393625f8166b29099d26f45d527035", "astute_sha": "f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13", "feature_groups": ["experimental"], "release": "5.1", "fuellib_sha": "395fd9d20a003603cc9ad26e16cb13c1c45e24e6"}}}, "fuellib_sha": "395fd9d20a003603cc9ad26e16cb13c1c45e24e6"}

CentOS diagnostic snapshot: https://docs.google.com/uc?id=0BzuAt0EZGLAMamZzMGxKZG5VY2M&export=download

Possible workaround:
Apply this patch to Fuel 6.0 https://review.openstack.org/#/c/143996/

Doc team:
This bug should be noted as a known issue in Release notes.

Roman Vyalov (r0mikiam)
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
importance: Undecided → Medium
status: New → Triaged
milestone: none → 6.0
Bogdan Dobrelya (bogdando) wrote :
Changed in fuel:
status: Triaged → Confirmed
importance: Medium → High
Irina Povolotskaya (ipovolotskaya) wrote :

Should this be included in the Release notes?
Is there any workaround?

Mike Scherbakov (mihgen) wrote :

If the behavior is confirmed, then this bug is Critical.

Vladimir Kuklin (vkuklin) wrote :

Do we really have this behaviour confirmed?

Dmitry Borodaenko (angdraug) wrote :

Any update on confirming if this is still present after fix for bug #1370510 was merged (9/22 for 6.0, 10/28 for 5.1.1)?

Gil Meir (gilmeir-d) wrote :

Our QA guys in Mellanox tested Fuel 5.1 GA with a few failovers and it passed ok.
We still haven't tested this on v6.0 / v5.1.1, we'll update as soon as we get to that.

Tomasz 'Zen' Napierala (tzn) wrote :

Thanks Gil! Do you have an ETA for when you could test at least 5.1.1?

Gil Meir (gilmeir-d) wrote :

First, as I said before, it worked on Fuel 5.1 GA; I verified again with QA that this is correct.
I can't really commit to an ETA, but I'll try pushing QA to prioritize this so that it gets done within the next week.

Sergey Vasilenko (xenolog) wrote :

I can't reproduce this behavior on 6.0.

Gil Meir (gilmeir-d) wrote :

Update: re v5.1.1 we couldn't reproduce it, re v6.0 we haven't got to it yet.

Dmitry Borodaenko (angdraug) wrote :

Thanks Gil, marking as Incomplete in 5.1.x.

Oleksiy Molchanov (omolchanov) wrote :

Can't reproduce on 5.1.1 and 6.

Gil Meir (gilmeir-d) wrote :

This was reproduced again here with 6.0 RC3 (build 56); build info is below.
* I can't attach the snapshot due to problems with the Launchpad site, I'll try again later *

The bug doesn't always reproduce, but it happened here more than once with this build.
I noticed this error in the pacemaker log:
<29>Dec 24 15:32:29 node-4 lrmd[26113]: notice: operation_finished: p_neutron-dhcp-agent_start_0:38178:stderr [ cat: /var/run/resource-agents/neutron-agent-dhcp/neutron-agent-dhcp.pid: No such file or directory ]

Once, I saw that there was no dnsmasq running on any of the controllers (not even the master). The second time it happened (this snapshot), the instances got their IP addresses back (and the dnsmasq service on the master controller came back too) after more than 1 hour.
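
For anyone scanning pacemaker logs for the same symptom, here is a minimal Python sketch that pulls the resource name, action, and stderr text out of an lrmd operation_finished line like the one quoted above. The regex and the parse_lrmd_line helper are illustrative assumptions inferred only from that single log line; they are not a documented pacemaker log format.

```python
import re

# Pattern inferred from the one lrmd line quoted in this report (an assumption,
# not an official pacemaker log grammar).
LRMD_RE = re.compile(
    r"lrmd\[\d+\]:\s+\w+:\s+operation_finished:\s+"
    r"(?P<op>\S+):(?P<pid>\d+):(?P<stream>stdout|stderr)\s+\[\s(?P<msg>.*)\s\]\s*$"
)

# The log line from this report, used as sample input.
SAMPLE = (
    "<29>Dec 24 15:32:29 node-4 lrmd[26113]: notice: operation_finished: "
    "p_neutron-dhcp-agent_start_0:38178:stderr [ cat: "
    "/var/run/resource-agents/neutron-agent-dhcp/neutron-agent-dhcp.pid: "
    "No such file or directory ]"
)


def parse_lrmd_line(line):
    """Return a dict describing the operation, or None if the line does not match."""
    match = LRMD_RE.search(line)
    if match is None:
        return None
    # The operation key looks like <resource>_<action>_<interval>.
    parts = match.group("op").rsplit("_", 2)
    if len(parts) != 3:
        return None
    resource, action, interval = parts
    return {
        "resource": resource,
        "action": action,
        "interval": interval,
        "pid": int(match.group("pid")),
        "stream": match.group("stream"),
        "message": match.group("msg"),
    }
```

Calling parse_lrmd_line(SAMPLE) gives the resource p_neutron-dhcp-agent with action start, which makes it easier to filter a long log for failed DHCP agent starts.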

{"build_id": "2014-12-18_01-32-01", "ostf_sha": "a9afb68710d809570460c29d6c3293219d3624d4", "build_number": "56", "auth_required": true, "api": "1.0", "nailgun_sha": "5f91157daa6798ff522ca9f6d34e7e135f150a90", "production": "docker", "fuelmain_sha": "45caacadb878abfbd9d60e134d72229698b469c9", "astute_sha": "16b252d93be6aaa73030b8100cf8c5ca6a970a91", "feature_groups": ["mirantis"], "release": "6.0", "release_versions": {"2014.2-6.0": {"VERSION": {"build_id": "2014-12-18_01-32-01", "ostf_sha": "a9afb68710d809570460c29d6c3293219d3624d4", "build_number": "56", "api": "1.0", "nailgun_sha": "5f91157daa6798ff522ca9f6d34e7e135f150a90", "production": "docker", "fuelmain_sha": "45caacadb878abfbd9d60e134d72229698b469c9", "astute_sha": "16b252d93be6aaa73030b8100cf8c5ca6a970a91", "feature_groups": ["mirantis"], "release": "6.0", "fuellib_sha": "73332192a257ea02c40a39885c502ad1ebdf3eda"}}}, "fuellib_sha": "73332192a257ea02c40a39885c502ad1ebdf3eda"}

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/143996

Changed in fuel:
assignee: Oleksiy Molchanov (omolchanov) → Sergii Golovatiuk (sgolovatiuk)
status: Incomplete → In Progress
tags: added: mellanox partner
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/6.0)

Fix proposed to branch: stable/6.0
Review: https://review.openstack.org/143998

Andrey Danin (gcon-monolake) wrote :

The bug is reproduced on Mellanox-enabled environments with some VM images (it seems CirrOS is not affected but Ubuntu is).
The DHCP lease time in Fuel 6.0 is 60 seconds.
During a Controller failover, Dnsmasq can take up to 10 minutes to start. That's why VMs lose their IP addresses and connectivity. As a workaround, this patch can be applied: https://review.openstack.org/#/c/143996/

Taking all of the above into account, I am decreasing the bug importance to High.
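
To make the timing argument concrete, here is a rough Python sketch of how long a DHCP outage a lease can ride out. It assumes the client renews at T1 = half the lease time (the RFC 2131 default) and drops its address when the lease expires; real clients may behave differently, and the helper names are made up for illustration.

```python
def lease_survival_window(lease_seconds, t1_fraction=0.5):
    """Return (guaranteed, best_case) outage lengths in seconds.

    A client that renewed just before the outage has a full lease left
    (best case). A client that was about to renew at T1 has only the
    remainder of its lease left (guaranteed minimum).
    """
    guaranteed = lease_seconds * (1 - t1_fraction)
    best_case = lease_seconds
    return guaranteed, best_case


def outcome(lease_seconds, outage_seconds):
    """Classify what happens to a VM's lease during a DHCP outage."""
    guaranteed, best_case = lease_survival_window(lease_seconds)
    if outage_seconds <= guaranteed:
        return "survives"
    if outage_seconds < best_case:
        return "may expire"
    return "expires"


# Values from this report: 60 s lease, dnsmasq down for up to 10 minutes.
print(outcome(60, 600))
# Lease of 600 s (the value in the proposed patch) with a 2-minute failover.
print(outcome(600, 120))
```

Under these assumptions, a 60-second lease cannot outlast an outage of several minutes, while a 600-second lease covers a failover of up to 300 seconds for every client.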

tags: added: docs
description: updated
Irina Povolotskaya (ipovolotskaya) wrote :

https://review.openstack.org/#/c/144094/1 - Known Issue is described here.
Gil and Andrey, please review it and put +1 once everything looks good.

tags: added: release-notes
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/143996
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=238a26c07394bf141acdd1db8de28cb6651c60d5
Submitter: Jenkins
Branch: master

commit 238a26c07394bf141acdd1db8de28cb6651c60d5
Author: Sergii Golovatiuk <email address hidden>
Date: Thu Dec 25 16:41:59 2014 +0100

    Increase lease time for dhcp agent

    Usually failover takes a couple of minutes. 120 seconds is not enough, so
    during failover VMs lose their IP addresses, breaking connectivity.
    Increasing the value to 600 allows us to:

    * Minimize the number of DHCPDISCOVER requests during failover
    * Minimize the load on dnsmasq, as it will renew IP addresses for VMs
      less frequently

    Change-Id: I7dc57e3f79ed8ee5e83c26e0c80d6a44e0840b4a
    Closes-Bug: 1371104
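
As a back-of-the-envelope check of the dnsmasq load argument in this commit message, the following Python sketch estimates steady-state renewal traffic. It assumes every client renews at half the lease time (the RFC 2131 default T1); the VM count of 100 is a hypothetical figure, not taken from this report.

```python
def renewals_per_minute(vm_count, lease_seconds, t1_fraction=0.5):
    """Estimate DHCP renewals per minute when each VM renews at T1."""
    renewal_interval = lease_seconds * t1_fraction
    return vm_count * 60.0 / renewal_interval


# Hypothetical cloud with 100 VMs, comparing the two lease values
# mentioned in the commit message.
for lease in (120, 600):
    print(lease, renewals_per_minute(100, lease))
```

With these assumptions, going from 120 to 600 seconds cuts the renewal rate by a factor of five.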

Changed in fuel:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-docs (master)

Reviewed: https://review.openstack.org/144094
Committed: https://git.openstack.org/cgit/stackforge/fuel-docs/commit/?id=e4516d379983af2a933a6ca1030e8138c498a416
Submitter: Jenkins
Branch: master

commit e4516d379983af2a933a6ca1030e8138c498a416
Author: Irina Povolotskaya <email address hidden>
Date: Fri Dec 26 10:56:13 2014 +0300

    Release Notes 6.0 -- a short DHCP timeout issue is discovered

    Related bug: #1371104
    Change-Id: I353c40230e6364b952f89464264655a3d1a97efe

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/6.0)

Reviewed: https://review.openstack.org/143998
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=20f0b5a5201f65ee7fc7a1e5f081c46ca59b4726
Submitter: Jenkins
Branch: stable/6.0

commit 20f0b5a5201f65ee7fc7a1e5f081c46ca59b4726
Author: Sergii Golovatiuk <email address hidden>
Date: Thu Dec 25 16:41:59 2014 +0100

    Increase lease time for dhcp agent

    Usually failover takes a couple of minutes. 120 seconds is not enough, so
    during failover VMs lose their IP addresses, breaking connectivity.
    Increasing the value to 600 allows us to:

    * Minimize the number of DHCPDISCOVER requests during failover
    * Minimize the load on dnsmasq, as it will renew IP addresses for VMs
      less frequently

    Change-Id: I7dc57e3f79ed8ee5e83c26e0c80d6a44e0840b4a
    Closes-Bug: 1371104
    Signed-off-by: Sergii Golovatiuk <email address hidden>
