"Failed to allocate the network" during starting an instance

Bug #1556858 reported by Leontii Istomin
Affects             Status        Importance  Assigned to       Milestone
Mirantis OpenStack  Fix Released  Critical    MOS Maintenance
  8.0.x             Invalid       Critical    Alexey Stupnikov
  9.x               Fix Released  Critical    Kevin Benton

Bug Description

During the boot_and_list_server Rally scenario (http://paste.openstack.org/show/490356/) we hit the following Rally error: http://paste.openstack.org/show/490349/
From MySQL we identified the compute node involved: http://paste.openstack.org/show/490351/
From nova.log on that compute: http://paste.openstack.org/show/490352/

Environment description:
3 controllers, 20 computes+ceph, 176 computes, vxlan, ceph_for_all

[root@fuel ~]# cat /etc/fuel/8.0/version.yaml
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "570"
  build_id: "570"
  fuel-nailgun_sha: "558ca91a854cf29e395940c232911ffb851899c1"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "658be72c4b42d3e1436b86ac4567ab914bfb451b"
  fuel-nailgun-agent_sha: "b2bb466fd5bd92da614cdbd819d6999c510ebfb1"
  astute_sha: "b81577a5b7857c4be8748492bae1dec2fa89b446"
  fuel-library_sha: "c2a335b5b725f1b994f78d4c78723d29fa44685a"
  fuel-ostf_sha: "3bc76a63a9e7d195ff34eadc29552f4235fa6c52"
  fuel-mirror_sha: "fb45b80d7bee5899d931f926e5c9512e2b442749"
  fuelmenu_sha: "78ffc73065a9674b707c081d128cb7eea611474f"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "a43cf96cd9532f10794dce736350bf5bed350e9d"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "d605bcbabf315382d56d0ce8143458be67c53434"

The Rally log and report are attached.
Diagnostic Snapshot: http://mos-scale-share.mirantis.com/fuel-snapshot-2016-03-14_09-37-03.tar.gz

Leontii Istomin (listomin) wrote :

rally report

description: updated
Oleg Bondarev (obondarev) wrote :

Logs show that VM boots are failing with
 2016-03-14 07:16:57.344 15053 WARNING nova.virt.libvirt.driver [req-9ee1b8e5-b37b-41b0-8761-0ecddee83b17 3260af1094fb4622acd45db97be281ed 8c904fa82a8a4cf884e3121c8aa52b7c - - -] [instance: caca76dc-c694-4ddf-910e-f6f584e97e5a] Timeout waiting for vif plugging callback for instance caca76dc-c694-4ddf-910e-f6f584e97e5a

So Neutron does not handle the new VM port in time and Nova simply times out.
The Neutron OVS agent logs are full of:
 Command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'conntrack', '-D', '-f', 'ipv4', '-d', '100.1.1.182', '-w', '6', '-s', '100.1.6.246']

which is related to clearing connections from conntrack once a security group is updated (for example, a member is added or removed).
It seems the OVS agent is overloaded handling such updates and is therefore not responsive to new OVS devices being plugged.

This is an initial analysis; it needs further investigation.
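
To illustrate the scaling problem, here is a rough sketch, not the actual Neutron agent code (the helper name and zone handling are made up for illustration): each security group member change can fan out into one rootwrap'd conntrack call per pair of local address and removed member address, so the work grows with ports times members and every call is a separate external process.

import subprocess

def purge_conntrack_entries(local_ips, removed_member_ips, zone='6'):
    # Every affected port on the node times every member removed from the
    # group: one external process per pair, each of which blocks the agent's
    # processing loop while it runs.
    for local_ip in local_ips:
        for remote_ip in removed_member_ips:
            subprocess.call([
                'sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf',
                'conntrack', '-D', '-f', 'ipv4',
                '-d', local_ip, '-w', zone, '-s', remote_ip,
            ])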

Changed in mos:
importance: Undecided → High
status: New → Confirmed
Dina Belova (dbelova)
tags: added: area-neutron
Jay Pipes (jaypipes) wrote :

Do we know roughly how many VMs were on the compute hosts before this started happening?

Leontii Istomin (listomin) wrote :

Jay, as I can see from the Rally report (comment #2), the first iteration failed. However, this was the second test run on the environment, so there could be some influence from the first test, which is described here: https://bugs.launchpad.net/mos/+bug/1556851. There were 88 iterations before the failure occurred.

Kevin Benton (kevinbenton) wrote :

@Oleg, I believe this is indeed being caused by the conntrack operations. It should be addressed by this patch upstream: https://review.openstack.org/#/c/293239/

Kevin Benton (kevinbenton) wrote :

After analyzing the logs further, I'm convinced that the agent fell behind due to processing these conntrack operations.

The port UUID in question is: 61afaae9-fdd1-40cb-ab97-de502fe1c69a
The port name: qvo61afaae9-fd

If you examine the agent logs from node-184[1], you can see that the ovsdb monitor picks up the interface (search for qvo61afaae9-fd) during a time in which the agent is running thousands of conntrack operations. The interface is removed from OVS 5 minutes later, which is also picked up by the monitor.

If you look at all of the log messages between those two events, the agent was stuck processing security group changes with these conntrack operations the entire time. Based on the terrible growth factor of this cleanup operation, it's possible these were leftover events from the previous test teardown still being processed.

I believe my patch[2] should address the issue (it did in my local scenario that suffered from this), but it would be good to get it on a scale lab.

1. Path in the diagnostic snapshot: var/log/dump/fuel-snapshot-2016-03-14_09-37-03/node-184/var/log/neutron/ovs-agent.log.2.gz
2. https://review.openstack.org/#/c/293239/
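
A quick way to repeat this log analysis is sketched below (assuming the diagnostic snapshot from the bug description has been unpacked into the current directory; the path comes from footnote 1): find the first and last lines mentioning the port and count the conntrack invocations in between.

import gzip

LOG = ('var/log/dump/fuel-snapshot-2016-03-14_09-37-03/node-184/'
       'var/log/neutron/ovs-agent.log.2.gz')
PORT = 'qvo61afaae9-fd'

with gzip.open(LOG, 'rt', errors='replace') as f:
    lines = f.readlines()

# Window between the first and last ovsdb monitor events for this interface.
hits = [i for i, line in enumerate(lines) if PORT in line]
if hits:
    window = lines[hits[0]:hits[-1] + 1]
    conntrack_calls = sum(1 for line in window if 'conntrack' in line)
    print('log lines between first and last %s event: %d' % (PORT, len(window)))
    print('conntrack invocations in that window: %d' % conntrack_calls)
else:
    print('port %s not found in %s' % (PORT, LOG))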

Alexander Ignatov (aignatov)
Changed in mos:
status: Confirmed → In Progress
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/neutron (openstack-ci/fuel-8.0/liberty)

Fix proposed to branch: openstack-ci/fuel-8.0/liberty
Change author: Oleg Bondarev <email address hidden>
Review: https://review.fuel-infra.org/18162

Alexander Ignatov (aignatov) wrote :

Raising this to Critical since it is also Critical upstream.

Changed in mos:
importance: High → Critical
Changed in mos:
assignee: MOS Neutron (mos-neutron) → MOS Maintenance (mos-maintenance)
Alexander Ignatov (aignatov) wrote :

Closing for 9.0 since the fix is picked up by the regular stable/mitaka sync.

Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on openstack/neutron (openstack-ci/fuel-8.0/liberty)

Change abandoned by Alexey Stupnikov <email address hidden> on branch: openstack-ci/fuel-8.0/liberty
Review: https://review.fuel-infra.org/18162
Reason: This patch was already taken from the upstream.
