Floating IP takes too long to update in nova and even longer for multiple VMs

Bug #1262529 reported by Yair Fried
This bug affects 9 people
Affects                   Status         Importance   Assigned to   Milestone
OpenStack Compute (nova)  Fix Released   Undecided    Unassigned
devstack                  Invalid        Undecided    Unassigned
neutron                   Fix Released   Medium       Unassigned
tempest                   Invalid        Undecided    Unassigned

Bug Description

A floating IP associated through neutron takes too long to show up in the VM's details ('nova show' or 'compute_client.servers.get()'), and even longer when more than one VM is involved.

When launching 2 VMs with floating IPs, you can see in the log that the check passes once:
"unchecked floating IPs: {}"
and then fails:
"Timed out while waiting for the floating IP assignments to propagate"

http://logs.openstack.org/01/55101/28/check/check-tempest-dsvm-neutron/0541dff/console.html
http://logs.openstack.org/01/55101/28/check/check-tempest-dsvm-neutron/f383f4b/console.html
http://logs.openstack.org/01/55101/31/check/check-tempest-dsvm-neutron/321413a/console.html
http://logs.openstack.org/97/62697/5/check/check-tempest-dsvm-neutron/960c6ad/console.html

Also, the floating IP is reachable long before it is updated in the nova DB.

How to reproduce:
https://review.openstack.org/#/c/62697/

So the problem is both:
1. the time it takes for nova to get the update
and
2. the timeout defined in the tempest neutron-gate

Since I don't see this in my local setup (rhos-4.0), I don't know whether it is caused by load on neutron or nova, or whether it's a devstack issue.

Revision history for this message
Yair Fried (yfried) wrote :
summary: - Floating IP takes too long to update in nova and even longer
+ Floating IP takes too long to update in nova and even longer for
+ multiple VMs
Yair Fried (yfried)
description: updated
Revision history for this message
Adalberto Medeiros (adalbas) wrote :

Does not seem to be a tempest issue.

Changed in tempest:
status: New → Invalid
Revision history for this message
Yair Fried (yfried) wrote :

@adalbas this is a tempest issue because the test needs to change, as agreed on the mailing list

Changed in tempest:
status: Invalid → Incomplete
Revision history for this message
Feng Ju (jufeng) wrote :

Hi Yair,
As far as I know, this is what happens when you use quantum/neutron to associate a floating IP:
1. First, neutron-server updates the FloatingIP table in the database (so `neutron floatingip-show` already shows the floating IP as associated with the port, although the association has not actually taken effect yet), then notifies the l3-agent to update the qrouter device and NAT rules. (https://github.com/openstack/neutron/blob/master/neutron/db/l3_db.py#L659)
2. The l3-agent periodically updates routers (associating the floating IP with the qrouter device and applying the NAT rules) every 60 seconds, which is the default value. (https://github.com/openstack/neutron/blob/master/neutron/agent/l3_agent.py#L733)
After this step, the floating IP can be pinged.
3. Nova compute also periodically updates the InstanceInfoCache table, which the `nova show <instance>` command queries, every 60 seconds, which is the default value. (https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L4391)
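
To make the timing of the three steps above concrete, here is a minimal sketch of the worst case under the model described in this comment. It assumes the two periodic tasks run one after the other and on schedule; the constant and function names are hypothetical, and only the 60-second defaults come from the comment.

```python
# Hypothetical names; only the 60-second defaults come from the comment above.
L3_AGENT_SYNC_INTERVAL = 60    # step 2: l3-agent periodic router sync
NOVA_CACHE_SYNC_INTERVAL = 60  # step 3: nova-compute InstanceInfoCache refresh


def worst_case_delay(l3_interval=L3_AGENT_SYNC_INTERVAL,
                     cache_interval=NOVA_CACHE_SYNC_INTERVAL):
    """Upper bound, in seconds, before `nova show` lists a new floating IP.

    Assumption: step 3 only picks up the change after step 2 has run,
    and each periodic task fires on schedule.
    """
    return l3_interval + cache_interval
```

With both defaults this evaluates to 120 seconds. A busy agent or a failing periodic task would push the real delay beyond this bound.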

My guess is that something goes wrong when the l3-agent or nova compute runs its periodic tasks:
maybe l3-agent/nova-compute is too busy to run the periodic tasks,
or maybe a periodic task raises an exception.

As for the timeout tempest uses to test floating IPs, I think 120 seconds is the minimum value.
To find out which component keeps the floating IP from appearing in the `nova show` output,
we can first check step 2 (this step is not easy to check directly; one option is to check floating IP connectivity with a timeout of 60 seconds, the default interval),
then check step 3.
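
The two-stage check suggested above could be sketched roughly as follows. This is an illustration only, not tempest code: `wait_until`, `diagnose_floating_ip`, and the two callables are hypothetical names, and the caller would have to supply real checks (for example, a ping of the floating IP and a lookup of the server's addresses through the compute client).

```python
import time


def wait_until(check, timeout, interval=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or `timeout` seconds elapse."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def diagnose_floating_ip(is_reachable, shown_by_nova, step_timeout=60, **poll):
    """Name the stage that did not complete in time, or None if both did.

    is_reachable:  callable, True once the floating IP answers (step 2).
    shown_by_nova: callable, True once nova reports the floating IP (step 3).
    """
    if not wait_until(is_reachable, step_timeout, **poll):
        return "l3-agent"
    if not wait_until(shown_by_nova, step_timeout, **poll):
        return "nova-cache"
    return None
```

Checking connectivity first separates a slow l3-agent from a slow nova cache refresh, since the floating IP can be reachable well before nova reports it.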

Maybe my understanding of floating IPs is not completely correct, so what do you think?

Revision history for this message
Yair Fried (yfried) wrote :

I'm thinking:

Tempest solution:
1. Removing "wait_for_floating_ip" from network_basic_ops should be the first step, as it doesn't check for network connectivity and fails the test for no reason
2. Create a scenario that will test both DBs (neutron and nova) and try to stress them (as described on the mailing list)

Analyzing the new scenario should tell us how to solve the issue in neutron/nova

Revision history for this message
Yair Fried (yfried) wrote :

This patch removes this test from network_basic_ops

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

Hi Yair,

I agree with your suggestion about removing the check in network basic ops.
In principle I also agree with your advice about creating a new scenario for 'association' (it's actually not in my opinion an association, but let's stick to the point!); however this effort should be coupled with a parallel effort aimed at solving the underlying problem in nova, which perhaps is connected to the way neutron floating ips updates are propagated back to nova.

Revision history for this message
Alexander Ignatov (aignatov) wrote :

The Savanna team faced the same issue during integration testing, when comparing novaclient against neutronclient:
https://bugs.launchpad.net/savanna/+bug/1277501

Changed in neutron:
status: New → Triaged
tags: added: nova-neutron
Changed in neutron:
importance: Undecided → Medium
Revision history for this message
Yair Fried (yfried) wrote :

I think the Tempest portion can be closed - it remains a neutron-nova issue

Revision history for this message
Phil Hopkins (phil-hopkins-a) wrote :

I too have been having a problem: it may take up to 15 minutes for a floating IP to appear on an instance when I run a nova list command after associating the floating IP with the instance.

No errors appear in any of the log files.

I am running Havana using Neutron ML2/Open vSwitch on Ubuntu 12.04.

My environment is a controller node, a network node and 7 compute/hypervisor nodes.

The controller, network and 5 hypervisor nodes are dual proc 12 core Xeon processors with 72G of RAM.
The other two hypervisor nodes only have one 12 core proc with 72G RAM.

I have a script that I run on the controller node which creates a project, two users in the project, 4 networks, modifies the security group rules, starts 4 instances that attach to some or all of the networks and then assigns a floating IP to one interface on one VM.

This script may loop to create up to 25 of these environments.

Watching nova list --all-tenants it can take up to 20 to 30 minutes for all of the VMs to have their floating IPs appear. The floating IPs seem to work right away but the delay for them to show on nova list is strange.

Revision history for this message
Craig Anderson (canderso) wrote :

Can confirm that this also affects me, using Neutron OVS on Ubuntu 12.04 (Havana). Had the same problem on Grizzly as well (using quantum linuxbridge on Ubuntu 12.04).

Revision history for this message
Mauro S M Rodrigues (maurorodrigues) wrote :

Closing the tempest portion as per yfried's comment.

Changed in tempest:
status: Incomplete → Invalid
Revision history for this message
Aaron Rosen (arosen) wrote :

Fixed in Icehouse with the addition of nova-neutron events

Changed in nova:
status: New → Invalid
status: Invalid → Fix Released
Changed in devstack:
status: New → Invalid
Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

Changing to Fix Released per Aaron's comment.

Changed in neutron:
status: Triaged → Fix Released