neutron can't set up basic network connectivity in gate jobs

Bug #1210664 reported by Sean Dague
Affects    Status         Importance   Assigned to   Milestone
devstack   Fix Released   Undecided    Maru Newby
tempest    Fix Released   Undecided    Maru Newby

Bug Description

During a 4-hour period on Friday when we were accidentally passing failed runs in the gate, something merged that breaks the major connectivity test for neutron in the gate. That test now has a 100% fail rate on neutron smoke runs:

2013-08-09 20:57:18.320 | process-returncode
2013-08-09 20:57:18.320 | process-returncode ... FAIL
2013-08-09 20:57:18.487 |
2013-08-09 20:57:18.487 | ======================================================================
2013-08-09 20:57:18.487 | FAIL: tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_008_check_public_network_connectivity[gate,smoke]
2013-08-09 20:57:18.487 | tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_008_check_public_network_connectivity[gate,smoke]
2013-08-09 20:57:18.487 | ----------------------------------------------------------------------
2013-08-09 20:57:18.487 | _StringException: Empty attachments:
2013-08-09 20:57:18.487 | stderr
2013-08-09 20:57:18.488 | stdout
2013-08-09 20:57:18.488 |
2013-08-09 20:57:18.488 | Traceback (most recent call last):
2013-08-09 20:57:18.488 | File "tempest/scenario/test_network_basic_ops.py", line 255, in test_008_check_public_network_connectivity
2013-08-09 20:57:18.488 | self._check_vm_connectivity(ip_address, ssh_login, private_key)
2013-08-09 20:57:18.488 | File "tempest/scenario/manager.py", line 494, in _check_vm_connectivity
2013-08-09 20:57:18.488 | timeout=self.config.compute.ssh_timeout),
2013-08-09 20:57:18.489 | File "tempest/scenario/manager.py", line 484, in _is_reachable_via_ssh
2013-08-09 20:57:18.489 | return ssh_client.test_connection_auth()
2013-08-09 20:57:18.489 | File "tempest/common/ssh.py", line 144, in test_connection_auth
2013-08-09 20:57:18.489 | connection = self._get_ssh_connection()
2013-08-09 20:57:18.489 | File "tempest/common/ssh.py", line 65, in _get_ssh_connection
2013-08-09 20:57:18.489 | timeout=self.timeout, pkey=self.pkey)
2013-08-09 20:57:18.489 | File "/usr/local/lib/python2.7/dist-packages/paramiko/client.py", line 311, in connect
2013-08-09 20:57:18.490 | t.start_client()
2013-08-09 20:57:18.490 | File "/usr/local/lib/python2.7/dist-packages/paramiko/transport.py", line 465, in start_client
2013-08-09 20:57:18.490 | raise e
2013-08-09 20:57:18.490 | SSHException: Error reading SSH protocol banner

http://logs.openstack.org/95/41195/1/check/gate-tempest-devstack-vm-neutron/6718173/console.html is an example of the console.

This is about as critical a bug as neutron can have. We're now skipping this test entirely for neutron. Fixing this should be the project's top priority, because without this test running we are not really testing neutron end to end in the gate.

Tags: l3-ipam-dhcp
Sean Dague (sdague)
Changed in neutron:
status: New → Confirmed
importance: Undecided → Critical
Maru Newby (maru)
Changed in neutron:
assignee: nobody → Maru Newby (maru)
Salvatore Orlando (salvatore-orlando) wrote :

Just a shot in the dark.
We recently merged a devstack change that creates the db by running migrations rather than letting the models autogenerate it.
This means that if there's something wacky in the migrations, this could be the culprit, and you can blame me as the author of that patch.

I am suspicious because of this error I am seeing in: http://logs.openstack.org/95/41195/1/check/gate-tempest-devstack-vm-neutron/6718173/logs/screen-q-svc.txt.gz

2013-08-09 20:56:11.341 27802 DEBUG neutron.openstack.common.rpc.amqp [-] unpacked context: {'user_id': None, 'roles': [u'admin'], 'tenant_id': None, 'is_admin': True, 'timestamp': u'2013-08-09 20:43:50.109330', 'project_id': None, 'read_deleted': u'no'} _safe_log /opt/stack/new/neutron/neutron/openstack/common/rpc/common.py:276
2013-08-09 20:56:11.357 27802 ERROR neutron.openstack.common.rpc.amqp [-] Exception during message handling
2013-08-09 20:56:11.357 27802 TRACE neutron.openstack.common.rpc.amqp Traceback (most recent call last):
2013-08-09 20:56:11.357 27802 TRACE neutron.openstack.common.rpc.amqp File "/opt/stack/new/neutron/neutron/openstack/common/rpc/amqp.py", line 424, in _process_data
2013-08-09 20:56:11.357 27802 TRACE neutron.openstack.common.rpc.amqp **args)
2013-08-09 20:56:11.357 27802 TRACE neutron.openstack.common.rpc.amqp File "/opt/stack/new/neutron/neutron/common/rpc.py", line 44, in dispatch
2013-08-09 20:56:11.357 27802 TRACE neutron.openstack.common.rpc.amqp neutron_ctxt, version, method, namespace, **kwargs)
2013-08-09 20:56:11.357 27802 TRACE neutron.openstack.common.rpc.amqp File "/opt/stack/new/neutron/neutron/openstack/common/rpc/dispatcher.py", line 172, in dispatch
2013-08-09 20:56:11.357 27802 TRACE neutron.openstack.common.rpc.amqp result = getattr(proxyobj, method)(ctxt, **kwargs)
2013-08-09 20:56:11.357 27802 TRACE neutron.openstack.common.rpc.amqp File "/opt/stack/new/neutron/neutron/db/l3_rpc_base.py", line 47, in sync_routers
2013-08-09 20:56:11.357 27802 TRACE neutron.openstack.common.rpc.amqp plugin.auto_schedule_routers(context, host, router_ids)
2013-08-09 20:56:11.357 27802 TRACE neutron.openstack.common.rpc.amqp File "/opt/stack/new/neutron/neutron/db/agentschedulers_db.py", line 303, in auto_schedule_routers
2013-08-09 20:56:11.357 27802 TRACE neutron.openstack.common.rpc.amqp self, context, host, router_ids)
2013-08-09 20:56:11.357 27802 TRACE neutron.openstack.common.rpc.amqp File "/opt/stack/new/neutron/neutron/scheduler/l3_agent_scheduler.py", line 113, in auto_schedule_routers
2013-08-09 20:56:11.357 27802 TRACE neutron.openstack.common.rpc.amqp context.session.add(binding)
2013-08-09 20:56:11.357 27802 TRACE neutron.openstack.common.rpc.amqp File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 456, in __exit__
2013-08-09 20:56:11.357 27802 TRACE neutron.openstack.common.rpc.amqp self.commit()
2013-08-09 20:56:11.357 27802 TRACE neutron.openstack.common.rpc.amqp File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 368, in commit
2013-08-09 20:56:11.357 27802 TRACE neutron.openstack.common.rpc.amqp self._prepare_impl()
2013-08-09 20:56:11.357 27802 TRAC...

Changed in neutron:
milestone: none → havana-3
Attila Fazekas (afazekas) wrote :

http://logs.openstack.org/95/41195/1/check/gate-tempest-devstack-vm-neutron/6718173/logs/screen-q-meta.txt.gz

The guest does not get its metadata (HTTP 500).

The metadata contains the SSH public key; without the public key, key-based auth will not work.
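
To illustrate the dependency described above, here is a minimal sketch of a guest-side fetch of the public key. It assumes the EC2-compatible metadata path; the URL constant, the function name `fetch_public_key`, and the injectable `opener` parameter are illustrative and are not taken from the logs or from any image's actual boot script.

```python
import urllib.error
import urllib.request

# Assumption: the EC2-compatible path for the first public key.
DEFAULT_METADATA_URL = (
    "http://169.254.169.254/latest/meta-data/public-keys/0/openssh-key"
)


def fetch_public_key(url=DEFAULT_METADATA_URL, timeout=5.0,
                     opener=urllib.request.urlopen):
    """Return (key, http_status); key is None when the service errors out.

    `opener` is a hypothetical hook so the function can be exercised
    without a real metadata service.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8").strip(), resp.status
    except urllib.error.HTTPError as exc:
        # An HTTP 500 from the metadata service lands here: the guest
        # has no key to install, so key-based login cannot succeed.
        return None, exc.code
```

When the service answers with HTTP 500, the function returns `(None, 500)`, which is the situation the guest was in here.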

Maru Newby (maru) wrote :

Salvatore: From what I can tell the db error occurs during cleanup, so it is unlikely to be causing the connectivity problem.

Attila: The problem is clearly deeper than a failure to retrieve metadata. Without the public key an auth failure would occur, not the protocol failure we are seeing.

Maru Newby (maru) wrote :

Attila: I take my previous comment back - you're absolutely correct. The metadata service returning 500s was delaying the vm boot long enough to cause the reported errors.
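
For anyone reproducing this: "Error reading SSH protocol banner" means the TCP connection opened but no SSH identification line was read in time, which fits a guest that is still booting. A rough way to tell "sshd not ready yet" apart from "sshd up, auth failing" is to poll for the banner directly. The sketch below uses only the standard library; the function names, timeouts, and polling interval are illustrative assumptions and this is not what tempest does.

```python
import socket
import time


def read_ssh_banner(host, port=22, timeout=5.0):
    """Return the server's SSH identification line, or None if none arrives."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # A single recv is enough for a sketch; a robust client would
            # keep reading until it sees the end of the line.
            data = sock.recv(255)
    except OSError:
        # Refused, unreachable, or timed out: sshd is not answering yet.
        return None
    line = data.split(b"\n", 1)[0].strip()
    if not line.startswith(b"SSH-"):
        return None
    return line.decode("ascii", "replace")


def wait_for_ssh_banner(host, port=22, deadline=60.0, interval=2.0):
    """Poll until a banner is seen or `deadline` seconds have elapsed."""
    end = time.monotonic() + deadline
    while True:
        banner = read_ssh_banner(host, port, timeout=interval)
        if banner is not None:
            return banner
        if time.monotonic() >= end:
            return None
        time.sleep(interval)
```

If `wait_for_ssh_banner` returns None for the whole SSH timeout, the guest never got as far as serving SSH, and authentication was never attempted.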

Maru Newby (maru) wrote :

Salvatore: I've filed a different bug for the spurious db errors you pointed out: https://bugs.launchpad.net/neutron/+bug/1210877

Attila Fazekas (afazekas) wrote :

/etc/neutron $ grep -R %SERVICE_USER%
metadata_agent.ini:admin_user = %SERVICE_USER%
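
The same check can be generalized to catch any template placeholder that was never substituted. This is a sketch only: the `%NAME%` pattern is an assumption extrapolated from the single `%SERVICE_USER%` hit above, and `find_placeholders` is a hypothetical helper, not part of devstack.

```python
import os
import re

# Assumption: placeholders look like %UPPER_CASE_NAME%, as in %SERVICE_USER%.
PLACEHOLDER = re.compile(r"%[A-Z][A-Z0-9_]*%")


def find_placeholders(root):
    """Return (relative_path, line_number, line) for each unsubstituted placeholder."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="replace") as fh:
                    for lineno, line in enumerate(fh, 1):
                        if PLACEHOLDER.search(line):
                            rel = os.path.relpath(path, root)
                            hits.append((rel, lineno, line.rstrip("\n")))
            except OSError:
                # Unreadable files are skipped rather than aborting the scan.
                continue
    return hits
```

Running `find_placeholders("/etc/neutron")` after configuration is written would flag a config file that still carries template values.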

tags: added: l3-ipam-dhcp
Maru Newby (maru)
Changed in devstack:
assignee: nobody → Maru Newby (maru)
status: New → Fix Committed
status: Fix Committed → Fix Released
Maru Newby (maru)
Changed in tempest:
assignee: nobody → Maru Newby (maru)
no longer affects: neutron
Changed in tempest:
status: New → In Progress
Maru Newby (maru) wrote :

The original problem was fixed, but guest access to the metadata service is still failing: https://bugs.launchpad.net/devstack/+bug/1211829

OpenStack Infra (hudson-openstack) wrote : Fix merged to tempest (master)

Reviewed: https://review.openstack.org/42018
Committed: http://github.com/openstack/tempest/commit/db0560201e1c9de7f2122c958fc04bf51c172d31
Submitter: Jenkins
Branch: master

commit db0560201e1c9de7f2122c958fc04bf51c172d31
Author: Maru Newby <email address hidden>
Date: Wed Aug 14 14:55:45 2013 -0700

    Remove skip of neutron connectivity check

    Devstack has been updated to ensure the metadata proxy is correctly
    configured, which should resolve the test failure.

    Closes-Bug: #1210664

    Change-Id: Ibff5e4146be297180529337683b384768f46cf54

Changed in tempest:
status: In Progress → Fix Committed
Sean Dague (sdague)
Changed in tempest:
status: Fix Committed → Fix Released