test_snapshot_pattern fails because Neutron fails max attempts

Bug #1248757 reported by Matt Riedemann
This bug affects 3 people
Affects                   Status  Importance  Assigned to  Milestone
OpenStack Compute (nova)  New     High        Unassigned
neutron                   New     High        Aaron Rosen
Tags: gate-failure
Matt Riedemann (mriedem)
description: updated
Matt Riedemann (mriedem) wrote :

Removing tempest and adding nova, glance and neutron given what this test impacts:

FAIL: tempest.scenario.test_snapshot_pattern.TestSnapshotPattern.test_snapshot_pattern[compute,image,network]

no longer affects: tempest
tags: added: gate-failure
Matt Riedemann (mriedem) wrote :

I didn't see anything in the glance logs, so glance is probably invalid here. There are a ton of errors in the neutron server log, though:

http://logs.openstack.org/55/55455/1/check/check-tempest-devstack-vm-neutron/28d1ed7/logs/screen-q-svc.txt.gz?level=ERROR

I'm having a hard time finding much in the neutron or nova logs, too. The n-api log does have this, though:

2013-11-06 21:48:41.927 3490 ERROR glanceclient.common.http [-] Request returned failure status.

Matt Riedemann (mriedem) wrote :

I do see these in the console log though:

"Dropping user packet because connection is dead"

That's had 15 hits in the last 7 days:

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiRHJvcHBpbmcgdXNlciBwYWNrZXQgYmVjYXVzZSBjb25uZWN0aW9uIGlzIGRlYWRcIiBBTkQgZmlsZW5hbWU6XCJjb25zb2xlLmh0bWxcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiNjA0ODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTM4Njg4NDM0ODA1NH0=

Also seeing this:

"EOF in transport thread"

That one has 150 hits in the last 7 days, with a 100% failure rate for the builds it shows up in:

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiRU9GIGluIHRyYW5zcG9ydCB0aHJlYWRcIiBBTkQgZmlsZW5hbWU6XCJjb25zb2xlLmh0bWxcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiNjA0ODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTM4Njg4NTAxNzA4NX0=

Looks like that one is probably due to bug 1253896.

When 'EOF in transport thread' shows up, 'Dropping user packet because connection is dead' doesn't, so 'Dropping user packet because connection is dead' might be the best one to write an elastic-recheck (e-r) query against right now.

Alternatively, maybe we could check on "timeout=self.timeout, pkey=self.pkey", which shows up 26 times in the last 7 days, all failures:

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwidGltZW91dD1zZWxmLnRpbWVvdXQsIHBrZXk9c2VsZi5wa2V5XCIgQU5EIGZpbGVuYW1lOlwiY29uc29sZS5odG1sXCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjEzODY4ODUyOTQwNTl9
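
For anyone who wants to read the query behind one of these logstash links without opening the dashboard: judging from the links themselves, the part after the "#" looks like base64-encoded JSON holding the search string and a timeframe in seconds. The sketch below is only an illustration under that assumption. The helper names are made up here and are not part of logstash or elastic-recheck, and the state dict is a cut-down example rather than a copy of a real link.

```python
import base64
import json
from urllib.parse import urlsplit


def decode_dashboard_state(url):
    """Return the dashboard state stored in the URL fragment as a dict."""
    fragment = urlsplit(url).fragment
    # Restore any base64 padding that was trimmed from the link.
    padded = fragment + "=" * (-len(fragment) % 4)
    return json.loads(base64.b64decode(padded).decode("utf-8"))


def encode_dashboard_state(state):
    """Inverse of decode_dashboard_state: dict -> fragment string."""
    raw = json.dumps(state, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


# Cut-down, illustrative state; real links carry more keys.
state = {
    "search": 'message:"EOF in transport thread" AND filename:"console.html"',
    "timeframe": "604800",  # seconds; 7 * 24 * 60 * 60
}
url = "http://logstash.openstack.org/#" + encode_dashboard_state(state)
decoded = decode_dashboard_state(url)
print(decoded["search"])
print(int(decoded["timeframe"]) // 86400, "days")
```

Passing one of the full links above to decode_dashboard_state should give the same kind of dict, if the assumption about the encoding holds.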

Matt Riedemann (mriedem) wrote :

"timeout=self.timeout, pkey=self.pkey" isn't good, because this fails here:

http://logs.openstack.org/63/72263/3/gate/gate-tempest-dsvm-neutron/aec8265/console.html

And the line is different:

"timeout=self.channel_timeout, pkey=self.pkey)"

Also, "EOF in transport thread" and 'Dropping user packet because connection is dead' both show up in that log, so either would probably work in a query.

Comparing queries, message:"Dropping user packet because connection is dead" AND filename:"console.html" yields 376 hits in the last 7 days while message:"EOF in transport thread" AND filename:"console.html" yields 931 hits in the last 7 days.

Need to make sure that one or both of those don't overlap when bug 1253896 happens.
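
The overlap check described here could be sketched roughly as follows. The query format is the one quoted in this comment; everything else is an assumption for illustration: the helper names are hypothetical, and the build IDs are made-up placeholders, not real logstash results.

```python
def build_query(message, filename="console.html"):
    """Build a query string in the form used in this comment."""
    return 'message:"{}" AND filename:"{}"'.format(message, filename)


def overlapping_builds(hits, other_bug_builds):
    """Builds matched by a query that are already known hits of another bug."""
    return sorted(set(hits) & set(other_bug_builds))


# Hypothetical build IDs, only to show the shape of the check.
dropping_hits = ["build-a", "build-b"]
eof_hits = ["build-b", "build-c", "build-d"]
bug_1253896_builds = ["build-c", "build-d"]

print(build_query("Dropping user packet because connection is dead"))
print(overlapping_builds(dropping_hits, bug_1253896_builds))
print(overlapping_builds(eof_hits, bug_1253896_builds))
```

A query whose hits share builds with bug 1253896 would misattribute those failures to this bug, so an empty overlap is what we would want to see before using it.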

Matt Riedemann (mriedem) wrote :

"EOF in transport thread" shows up in the log linked for bug 1253896:

http://logs.openstack.org/74/57774/2/gate/gate-tempest-devstack-vm-full/e592961/console.html.gz

However, "Dropping user packet because connection is dead" doesn't show up there, so that's probably the one to fingerprint on.
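
The selection logic here amounts to keeping only the candidate strings that appear in this bug's console log and not in the log for the other bug. A minimal sketch, with a hypothetical helper name and shortened stand-in log text instead of the real console output:

```python
def usable_fingerprints(candidates, this_bug_log, other_bug_log):
    """Keep candidates present in this bug's log and absent from the other's."""
    return [
        c for c in candidates
        if c in this_bug_log and c not in other_bug_log
    ]


candidates = [
    "EOF in transport thread",
    "Dropping user packet because connection is dead",
]

# Stand-in excerpts; the real console logs are far longer.
this_bug_log = (
    "EOF in transport thread\n"
    "Dropping user packet because connection is dead\n"
)
bug_1253896_log = "EOF in transport thread\n"

print(usable_fingerprints(candidates, this_bug_log, bug_1253896_log))
```

Absence from one other log is a weak test: a string that is printed on every failure, whatever the cause, would pass it as well.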

Matt Riedemann (mriedem) wrote :

Wondering if paramiko bug 567330 could be related, but that's rather old.

Matt Riedemann (mriedem) wrote :
Sean Dague (sdague)
Changed in glance:
importance: Undecided → High
Changed in nova:
importance: Undecided → High
Aaron Rosen (arosen)
Changed in nova:
importance: High → Critical
Changed in neutron:
importance: Undecided → High
Changed in nova:
importance: Critical → High
Changed in neutron:
assignee: nobody → Aaron Rosen (arosen)
Salvatore Orlando (salvatore-orlando) wrote :

Added a neutron milestone for easier tracking

Changed in neutron:
milestone: none → icehouse-rc1
Salvatore Orlando (salvatore-orlando) wrote :

I have noticed these "Dropping user packet because connection is dead" messages in the past, but I thought they were because tempest tests abruptly close the SSH connection.
However, it is interesting that when this message appears there is a 100% failure rate.

On the other hand, I randomly looked at 3 failures with this footprint and found that bug 1283522 (the lock wait timeout thing) was the real cause of the error. The tests appear to go past the "Dropping user packet because connection is dead" message unscathed.

I don't know what to say here because 3 out of 1,124 is a very small sample.

Alex Xu (xuhj) wrote :

http://logs.openstack.org/11/80511/1/check/check-tempest-dsvm-neutron/79a5ce0/console.html
This looks like a similar error, but in a different test case: test_cross_tenant_traffic.

Sean Dague (sdague)
no longer affects: glance
summary: - test_snapshot_pattern fails with paramiko ssh EOFError
+ test_snapshot_pattern fails because Neutron fails max attempts
Sean Dague (sdague) wrote :

The latest hits on this are the deadlock bug. The "Dropping user packet because connection is dead" is completely useless in identifying the failure, because that's normal dropbear output, and will be seen on the nova-console any time we call it. It shows up in every fail just because we dump the console to the screen. It's kind of like matching "Fail".

This bug should be marked as a duplicate of the deadlock bug, and new bugs should be opened for real issues here. The content has changed so much since November that there is little in this bug that is useful for getting to root cause now.
