Nova compute manager failed to create virtual interface

Bug #1761536 reported by Eric Vasquez
This bug affects 1 person
Affects                               Status        Importance  Assigned to  Milestone
OpenStack Compute (nova)              Invalid       Undecided   Unassigned
OpenStack Neutron Gateway Charm       Fix Released  Critical    David Ames   18.05
OpenStack Neutron Open vSwitch Charm  Invalid       Undecided   Unassigned

Bug Description

Rally test scenario: NovaServers.boot_server_associate_and_dissociate_floating_ip fails.
All 5 nova-compute-kvm instances time out:

--------------------------------------------------------------------------------
Task 2ccf3cf6-c252-4e0f-8fdd-ca58ad819aff has 5 error(s)
--------------------------------------------------------------------------------

TimeoutException: Rally tired waiting 300.00 seconds for Server s_rally_504bd98b_fLz3akho:23cbd6ad-67f7-4e0f-9095-390f50897b62 to become ('ACTIVE') current status BUILD

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/rally/task/runner.py", line 71, in _run_scenario_once
    getattr(scenario_inst, method_name)(**scenario_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/rally/plugins/openstack/scenarios/nova/servers.py", line 1116, in run
    server = self._boot_server(image, flavor, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/rally/task/atomic.py", line 91, in func_atomic_actions
    f = func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/rally/plugins/openstack/scenarios/nova/utils.py", line 86, in _boot_server
    check_interval=CONF.openstack.nova_server_boot_poll_interval
  File "/usr/local/lib/python2.7/dist-packages/rally/task/utils.py", line 252, in wait_for_status
    timeout=timeout)
TimeoutException: Rally tired waiting 300.00 seconds for Server s_rally_504bd98b_fLz3akho:23cbd6ad-67f7-4e0f-9095-390f50897b62 to become ('ACTIVE') current status BUILD

Revision history for this message
Eric Vasquez (envas) wrote :

Attaching juju crash dump for this test

Revision history for this message
Eric Vasquez (envas) wrote :

Example of a nova-compute-kvm nova log:
https://pastebin.canonical.com/p/s2pGzKhNXB/

Revision history for this message
Matt Riedemann (mriedem) wrote :

This indicates vif plugging failed in neutron. Check the neutron agent logs for failures. Also, the pastebin isn't accessible.
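
For anyone triaging a similar failure, something like the following could surface the relevant errors on the affected units (a sketch only; the log paths assume the default locations used by the OpenStack charms):

    # On the affected nova-compute and neutron-gateway units
    grep -iE "error|timeout|vif" /var/log/neutron/neutron-openvswitch-agent.log
    grep -i "virtual interface" /var/log/nova/nova-compute.log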

Changed in nova:
status: New → Invalid
Revision history for this message
Eric Vasquez (envas) wrote :

Attaching crash dump of latest failure instance

Revision history for this message
Frode Nordahl (fnordahl) wrote :

The crashdump shows multiple occurrences of RPC Timeouts in neutron-openvswitch-agent.log:
Timeout in RPC method tunnel_sync. Waiting for 54 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 6ecd4578ba6942769d00f39be63017a7

Right before the server build failure this message is logged:
2018-04-05 11:19:04.188 232295 ERROR neutron.agent.linux.openvswitch_firewall.firewall [req-8b07034a-1536-4fb0-a356-5df4be8a2c8c - - - - -] Initializing unfiltered port e84c9f3c-d412-45de-80fc-acf2af4ab56b that does not exist in ovsdb: Port e84c9f3c-d412-45de-80fc-acf2af4ab56b is not managed by this agent..: OVSFWPortNotFound: Port e84c9f3c-d412-45de-80fc-acf2af4ab56b is not managed by this agent.

It seems to me that the Neutron OpenvSwitch Agent on compute node nova-compute-kvm_4 is out of sync and that the worker-multiplier and/or rpc-response-timeout config options of the deployment need adjusting.
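
If that is the case, bumping those options via Juju would be the first thing to try, for example (a sketch only; it assumes a Juju 2.x client, that neutron-api is the application exposing both options in this bundle, and the values shown are placeholders):

    juju config neutron-api rpc-response-timeout=180
    juju config neutron-api worker-multiplier=2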

Changed in charm-neutron-gateway:
status: New → Invalid
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

It seems unlikely that any of those options is to blame, given this happens 100% of the time on Queens and never happens on Pike.

Revision history for this message
Chris Gregan (cgregan) wrote :

This issue is currently blocking all Xenial Queens testing in our lab. The field is not allowed to deploy a solution that has not been tested. We need this issue resolved so FE deployments can continue post-Pike on Xenial. An environment with the issue reproduced is up in our lab now.

Escalated to Critical

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

It's not just Rally that's failing: creating any instance attached to the 'ubuntu-net' tenant network fails.
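
For reference, the failure reproduces without Rally with a plain server create against that network, for example (image and flavor names are placeholders):

    openstack server create --image <image> --flavor <flavor> --network ubuntu-net repro-test
    openstack server show repro-test -c status -c fault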

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

That works fine in Pike and fails every time in Queens.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

bundle from latest reproduce: https://pastebin.canonical.com/p/Y5zzk7Zz8m/

Revision history for this message
David Ames (thedac) wrote :

The root cause of the instance boot failures is AppArmor on the neutron-gateway units blocking the neutron agents from creating temporary directories:

1vlRG/" pid=1412869 comm="neutron-dhcp-ag" requested_mask="c" denied_mask="c" fsuid=115 ouid=115
[76035.437502] audit: type=1400 audit(1524677252.781:36019): apparmor="DENIED" operation="mkdir" profile="/usr/bin/neutron-dhcp-agent" name="/tmp/tmp4AIVtB/" pid=1412869 comm="neutron-dhcp-ag" requested_mask="c" denied_mask="c" fsuid=115 ouid=115

Both the dhcp-agent and the l3-agent show the problem.
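
The denials show up in the kernel log on the gateway units and can be confirmed with, for example (standard Ubuntu log locations assumed):

    # On a neutron-gateway unit
    dmesg | grep 'apparmor="DENIED"'
    grep 'apparmor="DENIED"' /var/log/kern.log | grep neutron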

Assigning this bug to neutron-gateway for the AppArmor issue.

A secondary issue, not yet root-caused, is DBConnection errors from all of the API charms connecting to the Percona cluster. After changing the neutron-gateway aa-profile-mode to complain we saw these errors much less frequently, but they did not go away entirely.

2018-04-25 17:49:33.562 617800 ERROR oslo_db.sqlalchemy.engines [req-dcf87632-8b6e-4071-a336-64b1442dc7fe - - - - -] Database connection was found disconnected; reconnecting: DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'SELECT 1']
2018-04-25 17:49:33.562 617800 ERROR oslo_db.sqlalchemy.engines Traceback (most recent call last):
2018-04-25 17:49:33.562 617800 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python2.7/dist-packages/oslo_db/sqlalchemy/engines.py", line 73, in _connect_ping_listener
2018-04-25 17:49:33.562 617800 ERROR oslo_db.sqlalchemy.engines connection.scalar(select([1]))
2018-04-25 17:49:33.562 617800 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 877, in scalar
2018-04-25 17:49:33.562 617800 ERROR oslo_db.sqlalchemy.engines return self.execute(object, *multiparams, **params).scalar()
2018-04-25 17:49:33.562 617800 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 945, in execute
2018-04-25 17:49:33.562 617800 ERROR oslo_db.sqlalchemy.engines return meth(self, multiparams, params)
2018-04-25 17:49:33.562 617800 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python2.7/dist-packages/sqlalchemy/sql/elements.py", line 263, in _execute_on_connection
2018-04-25 17:49:33.562 617800 ERROR oslo_db.sqlalchemy.engines return connection._execute_clauseelement(self, multiparams, params)
2018-04-25 17:49:33.562 617800 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1053, in _execute_clauseelement
2018-04-25 17:49:33.562 617800 ERROR oslo_db.sqlalchemy.engines compiled_sql, distilled_params
2018-04-25 17:49:33.562 617800 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1189, in _execute_context
2018-04-25 17:49:33.562 617800 ERROR oslo_db.sqlalchemy.engines context)
2018-04-25 17:49:33.562 617800 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1398, in _handle_dbapi_exception
2018-04-25 17:49:33.562 617800 ERROR oslo_db.sqlalchemy.engines util.raise_from_cause(newraise, exc_info)
2018-04-25 17:49...


Changed in charm-neutron-gateway:
status: Invalid → Triaged
importance: Undecided → Critical
assignee: nobody → David Ames (thedac)
milestone: none → 18.05
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-neutron-gateway (master)

Fix proposed to branch: master
Review: https://review.openstack.org/564347

Changed in charm-neutron-gateway:
status: Triaged → In Progress
David Ames (thedac)
Changed in charm-neutron-openvswitch:
status: New → Invalid
Revision history for this message
David Ames (thedac) wrote :

While we wait for the above change to get OSCI validation, review, and approval, please test in QA with:
cs:~thedac/neutron-gateway-1
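
On an existing deployment that can be done with something like (a sketch; assumes a Juju 2.x client):

    juju upgrade-charm neutron-gateway --switch cs:~thedac/neutron-gateway-1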

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Using aa-profile-mode=complain for neutron-gateway got us past this issue, and we finished a successful run on xenial-queens.
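
For anyone else hitting this before the fix lands, the equivalent workaround is (assuming a Juju 2.x client):

    juju config neutron-gateway aa-profile-mode=complain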

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-neutron-gateway (master)

Reviewed: https://review.openstack.org/564347
Committed: https://git.openstack.org/cgit/openstack/charm-neutron-gateway/commit/?id=a59b4d606fcdf647b89906b437fb4e79d74481ee
Submitter: Zuul
Branch: master

commit a59b4d606fcdf647b89906b437fb4e79d74481ee
Author: David Ames <email address hidden>
Date: Wed Apr 25 21:35:40 2018 +0000

    Apparmor profiles for Queens

    Apparmor profiles were limiting queens deployments of neutron-gateway
    when aa-profile-mode was set to enforce. It led to failed instance
    deployments due to neutron agents failing to execute their necessary
    functions.

    This change updates the profiles to be Queens ready.

    Closes-Bug: #1761536

    Change-Id: I2e08a2de9e4ae8139ab8e4be131631883652d029

Changed in charm-neutron-gateway:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-neutron-gateway (stable/18.02)

Fix proposed to branch: stable/18.02
Review: https://review.openstack.org/564538

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-neutron-gateway (stable/18.02)

Reviewed: https://review.openstack.org/564538
Committed: https://git.openstack.org/cgit/openstack/charm-neutron-gateway/commit/?id=36f889424ebb54fd3f092cda4b60a0a19ab232e4
Submitter: Zuul
Branch: stable/18.02

commit 36f889424ebb54fd3f092cda4b60a0a19ab232e4
Author: David Ames <email address hidden>
Date: Wed Apr 25 21:35:40 2018 +0000

    Apparmor profiles for Queens

    Apparmor profiles were limiting queens deployments of neutron-gateway
    when aa-profile-mode was set to enforce. It led to failed instance
    deployments due to neutron agents failing to execute their necessary
    functions.

    This change updates the profiles to be Queens ready.

    Closes-Bug: #1761536

    Change-Id: I2e08a2de9e4ae8139ab8e4be131631883652d029
    (cherry picked from commit a59b4d606fcdf647b89906b437fb4e79d74481ee)

David Ames (thedac)
Changed in charm-neutron-gateway:
status: Fix Committed → Fix Released