Delay in network access after instance resize/migration using linuxbridge and vlan

Bug #1450874 reported by Jesse Keating
This bug affects 1 person
Affects: neutron
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

Performing an instance resize migrates the instance to another host. When the new instance gets built, the new VIF gets plugged, but connectivity to the instance's IP is delayed: arping from the neutron router gets no response for about a minute, and the same goes for attempts to reach it via a floating IP.
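
For what it's worth, this is roughly the check I mean (the router namespace, qr- interface, and IP below are placeholders; 'ip netns list' on the network node gives the real namespace name):

# from the network node hosting the neutron router
ip netns list | grep qrouter
# arping the instance's fixed IP from inside that router namespace
ip netns exec qrouter-<router-uuid> arping -I qr-<port-id-prefix> -c 5 <instance-fixed-ip>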

If a resize is reverted and the instance goes back to the original host, connectivity is restored almost instantly.

I've included some neutron config below; let me know if more is desired.

This is on Juno.

Neutron.conf (secrets munged):
[DEFAULT]
debug = False
verbose = True

# Logging #
log_dir = /var/log/neutron

agent_down_time = 20

api_workers = 3

auth_strategy = keystone
core_plugin = neutron.plugins.ml2.plugin.Ml2Plugin
service_plugins = neutron.services.l3_router.l3_router_plugin.L3RouterPlugin
allow_overlapping_ips = False

rabbit_host = 10.233.19.1
rabbit_port = 5672
rabbit_userid = openstack
rabbit_password = MUNGE
rpc_backend = neutron.openstack.common.rpc.impl_kombu

bind_host = 0.0.0.0
bind_port = 9696

api_paste_config = api-paste.ini

control_exchange = neutron

notification_driver = neutron.openstack.common.notifier.no_op_notifier

notification_topics = notifications

lock_path = $state_path/lock

# ======== neutron nova interactions ==========
notify_nova_on_port_data_changes = True
notify_nova_on_port_status_changes = True
nova_url = https://bbg-staging-01.openstack.blueboxgrid.com:8777/v2
nova_region_name = RegionOne
nova_admin_username = neutron
nova_admin_tenant_id = MUNGE
nova_admin_password = MUNGE
nova_admin_auth_url = https://bbg-staging-01.openstack.blueboxgrid.com:5001/v2.0
nova_ca_certificates_file = /etc/ssl/certs/ca-certificates.crt

[QUOTAS]

[DEFAULT_SERVICETYPE]

[SECURITYGROUP]

[AGENT]
report_interval = 4

[keystone_authtoken]
identity_uri = https://bbg-staging-01.openstack.blueboxgrid.com:35358
auth_uri = https://bbg-staging-01.openstack.blueboxgrid.com:5001/v2.0
admin_tenant_name = service
admin_user = neutron
admin_password = MUNGE
signing_dir = /var/cache/neutron/api
cafile = /etc/ssl/certs/ca-certificates.crt

[DATABASE]
sqlalchemy_pool_size = 60

l3_agent.ini:
[DEFAULT]
debug = False

state_path = /var/lib/neutron

interface_driver = neutron.agent.linux.interface.BridgeInterfaceDriver

auth_url = https://bbg-staging-01.openstack.blueboxgrid.com:35358/v2.0
admin_tenant_name = service
admin_user = neutron
admin_password = MUNGE
metadata_ip = bbg-staging-01.openstack.blueboxgrid.com
use_namespaces = True
external_network_bridge =

[AGENT]
root_helper = sudo /usr/local/bin/neutron-rootwrap /etc/neutron/rootwrap.conf

Revision history for this message
Kevin Benton (kevinbenton) wrote :

I'm not too familiar with the resizing mechanisms in nova.

Is it possible for you to give some steps to reproduce here? Bonus points if they are all nova CLI commands :)

Revision history for this message
Jesse Keating (jesse-keating) wrote :

nova boot --nic net-id=5c64a309-a86e-482a-b371-a7d68e6aa76c --image cirros --flavor 2 test-resize

ensure instance connectivity

nova resize test-resize 3

ensure instance connectivity; this is where the delay shows up

nova resize-revert test-resize

ensure instance connectivity; this should be very quick.
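
To put a rough number on the connectivity gap, I use a loop like this from a host that can reach the floating IP (the address is a placeholder):

# print a timestamp, wait for the first ping reply after the resize, print another
date
while ! ping -c 1 -W 1 <floating-ip> > /dev/null 2>&1; do sleep 1; done
date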

Revision history for this message
James Denton (james-denton) wrote :

Hi Jesse. I ran the same tests in our environment:

1. Booted a cirros image, landed on compute03
2. Issued continuous ping from DHCP namespace
3. Issued a 'nova resize <instance> <flavor>'. Nova put instance into RESIZE state, new instance landed on compute04. Neutron unplugged tap from brq bridge on compute03. Tap was plugged into brq bridge on compute04.
4. Continuous ping from DHCP namespace resumed after 68 seconds. Nova state was VERIFY_RESIZE.
5. Waited 2 minutes. Issued a 'nova resize-revert <instance>'. Tap removed from brq on compute04. Tap plugged into brq on compute03. Nova put instance immediately into ACTIVE state.
6. Continuous ping from DHCP namespace resumed after 62 seconds.

Considering the time it takes to make a snapshot, copy it from one compute to the other and boot it, 60 seconds could be reasonable. Until you confirm the resize, the old instance remains on the old compute node in /var/lib/nova/instances/<instance>_resize. This is probably why it recovers quickly when you do a 'nova resize-revert'. When you issue a 'nova resize-confirm <instance>', that old data gets removed and the state goes from VERIFY_RESIZE to ACTIVE.
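
If you want to double-check this on the compute nodes themselves, something along these lines works (paths assume the default state_path of /var/lib/nova; the UUID is a placeholder):

# on the source compute node: the original disk stays around until 'nova resize-confirm'
ls /var/lib/nova/instances/<instance-uuid>_resize
# on either compute node: see which brq bridge (if any) the instance tap is plugged into
brctl show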

Revision history for this message
Jesse Keating (jesse-keating) wrote :

What I'm seeing is that the instance goes to VERIFY_RESIZE in about 14 seconds (quite a small instance). Then it's another 45 seconds before arping picks back up.

On the way back the instance goes to ACTIVE in just a few seconds and network picks up right away.

I'll try with a larger image/instance.

Revision history for this message
Jesse Keating (jesse-keating) wrote :

With a larger image/flavor as the base, it takes about 1:45 to reach VERIFY_RESIZE, and another 15 seconds after that for arping to pick up. Reverting took 15 seconds to reach ACTIVE and 5 more seconds for arping.

If that's an acceptable amount of time, then there's no bug here. I honestly don't know what would be reasonable.

Revision history for this message
James Denton (james-denton) wrote :

While it's snapshotting/migrating it will be in RESIZE or a similar state. When it flips to VERIFY_RESIZE, I believe it is booting/ready on the new host. It will not change to ACTIVE on the new host until you confirm the resize with 'nova resize-confirm'. The revert will be fast because the instance data still exists on the former host until you confirm the resize; it simply needs to boot. It's hard to define what's acceptable, since it comes down to a combination of hardware, disk, and network specs. I guess you could do it all manually and compare :)
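
In nova CLI terms (instance name taken from the repro steps above; the hypervisor_hostname field needs admin credentials), the flow is roughly:

# check the state and which host the instance currently lives on
nova show test-resize | grep -E 'status|OS-EXT-SRV-ATTR:hypervisor_hostname'
# the old copy stays on the source host until you either commit or roll back:
nova resize-confirm test-resize    # commit: old data removed, status goes to ACTIVE
nova resize-revert test-resize     # or: go back to the original host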

Revision history for this message
Jesse Keating (jesse-keating) wrote :

Hrm, maybe I misunderstood the VERIFY_RESIZE state. When using the nova client with --wait for the resize, it waits until the instance is in VERIFY_RESIZE. Attempts to access the network at that point to verify will fail, and keep failing, perhaps for longer than ssh connection timeouts.

Anyway, I was asked to file the bug, but it seems like there might not be an actual bug, just another "networking" thing to shrug shoulders at and work around.

Feel free to close.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

This bug is > 365 days without activity. We are unsetting assignee and milestone and setting status to Incomplete in order to allow its expiry in 60 days.

If the bug is still valid, then update the bug status.

Changed in neutron:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired