Timeout while waiting on RPC response - topic: "network", RPC method: "allocate_for_instance" info: "<unknown>"

Bug #1257626 reported by Joe Gordon on 2013-12-04
This bug affects 12 people
Affects: OpenStack Compute (nova)
Importance: Critical
Assigned to: Unassigned

Bug Description

http://logs.openstack.org/21/59121/6/check/gate-tempest-dsvm-large-ops/fdd1002/logs/screen-n-cpu.txt.gz?level=TRACE#_2013-12-04_06_20_16_658

2013-12-04 06:20:16.658 21854 ERROR nova.compute.manager [-] Instance failed network setup after 1 attempt(s)
<...>
2013-12-04 06:20:16.658 21854 TRACE nova.compute.manager Timeout: Timeout while waiting on RPC response - topic: "network", RPC method: "allocate_for_instance" info: "<unknown>"

It appears there has been a performance regression: gate-tempest-dsvm-large-ops is now failing because of RPC timeouts on allocate_for_instance calls.

logstash query: message:"nova.compute.manager Timeout: Timeout while waiting on RPC response - topic: \"network\", RPC method: \"allocate_for_instance\""
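
For anyone checking a downloaded screen-n-cpu log locally rather than through logstash, the same message can be matched with a regular expression. This is only a minimal sketch: the pattern is derived from the single trace line quoted above, and the helper names (parse_rpc_timeout, count_timeouts) are made up for illustration and are not part of nova or elastic-recheck.

```python
import re

# Pattern built from the trace line quoted in this report; other nova
# versions may format the message differently (assumption).
TIMEOUT_RE = re.compile(
    r'Timeout while waiting on RPC response - '
    r'topic: "(?P<topic>[^"]*)", '
    r'RPC method: "(?P<method>[^"]*)" '
    r'info: "(?P<info>[^"]*)"'
)


def parse_rpc_timeout(line):
    """Return a dict with topic, method and info, or None if no match."""
    match = TIMEOUT_RE.search(line)
    return match.groupdict() if match else None


def count_timeouts(lines, topic="network", method="allocate_for_instance"):
    """Count log lines reporting an RPC timeout for the given topic/method."""
    count = 0
    for line in lines:
        parsed = parse_rpc_timeout(line)
        if parsed and parsed["topic"] == topic and parsed["method"] == method:
            count += 1
    return count
```

Passing the lines of a log file to count_timeouts gives a rough per-run hit count for this bug's signature.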

There seems to have been a major rise in this bug on December 3rd.

Joe Gordon (jogo) wrote :

marking as critical since this is hitting us in the gate

Changed in nova:
milestone: none → icehouse-2
importance: Undecided → Critical
Joe Gordon (jogo) wrote :

elastic-recheck query: https://review.openstack.org/59919

Changed in nova:
status: New → Triaged
Matt Riedemann (mriedem) wrote :

The e-r query for this isn't hitting, so opened bug 1267271 against elastic-recheck for that.

Matt Riedemann (mriedem) wrote :

Never mind, it looks like it is hitting; it reported on this patch today: https://review.openstack.org/#/c/57358/

tags: added: gate-failure network testing
Joe Gordon (jogo) wrote :

It looks like the most recent spike in this bug is due to the introduction of RAX high performance nodes in the gate: https://review.openstack.org/#/c/65236/

Joe Gordon (jogo) wrote :

Looks like https://review.openstack.org/#/c/65760/ helped. This hasn't been seen outside of https://review.openstack.org/#/c/65989/

Reviewed: https://review.openstack.org/65784
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=831da3df616c2340f914d56c96c60b0f07cfa496
Submitter: Jenkins
Branch: master

commit 831da3df616c2340f914d56c96c60b0f07cfa496
Author: Dan Smith <email address hidden>
Date: Thu Jan 9 09:24:08 2014 -0800

    Avoid unnecessary use of rootwrap for some network commands

    Every time we run something as root with rootwrap, it takes about
    ten times longer (setup-wise anyway). For things that don't need
    to be run as root, we should avoid this hit. Nova network does
    this a lot and is also slow enough to cause trouble, so this
    patch attempts to address that for a few situations.

    Related-bug: #1257626

    Change-Id: Idc26776bf96ccfd9f50383e9d40aa47397d4e2cf
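
The idea in the commit message above can be sketched roughly as follows: prefix a command with the wrapper only when it actually needs root, and time both paths to see the per-call overhead. This is an illustration under stated assumptions, not the code from the patch. WRAPPER is a hypothetical stand-in that only adds a second Python interpreter startup (it does not escalate privileges or filter commands), and build_command and time_command are made-up helper names.

```python
import subprocess
import sys
import time

# Hypothetical stand-in for a root wrapper: a Python process that simply
# re-executes its arguments. It models the extra interpreter startup only.
WRAPPER = [
    sys.executable,
    "-c",
    "import subprocess, sys; sys.exit(subprocess.call(sys.argv[1:]))",
]


def build_command(cmd, run_as_root=False, wrapper=WRAPPER):
    """Prefix cmd with the wrapper only when root is actually required."""
    cmd = list(cmd)
    return list(wrapper) + cmd if run_as_root else cmd


def time_command(cmd, repeats=5):
    """Return the average wall-clock seconds per invocation of cmd."""
    start = time.perf_counter()
    for _ in range(repeats):
        subprocess.call(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    return (time.perf_counter() - start) / repeats


if __name__ == "__main__":
    # A trivial probe command that exists wherever Python does.
    probe = [sys.executable, "-c", "pass"]
    direct = time_command(build_command(probe))
    wrapped = time_command(build_command(probe, run_as_root=True))
    print("direct:  %.4fs per call" % direct)
    print("wrapped: %.4fs per call" % wrapped)
```

The measured ratio depends on the machine and on what the real wrapper does at startup, so this sketch should not be expected to reproduce the "about ten times" figure quoted in the commit message.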

Russell Bryant (russellb) wrote :

I believe turning large-ops down to 50 from 100 instances was the solution for this. We were just maxing out the test nodes.

Changed in nova:
status: Triaged → Invalid
milestone: icehouse-2 → icehouse-3
Thierry Carrez (ttx) on 2014-01-14
Changed in nova:
milestone: icehouse-3 → none
Christopher Yeoh (cyeoh-0) wrote :

Looks like this has come back again. TEMPEST_LARGE_OPS_NUMBER has not changed from 50 so something else is triggering it.

Ryan Hsu (rhsu) wrote :

VMware Minesweeper CI has been experiencing 100% build failures since around 6 PM PST yesterday due to this error message. Logs from one of the affected runs are here: http://10.148.255.241/logs/nova/67581/5/.

Ryan Hsu (rhsu) wrote :

Sorry, wrong URL. This is the correct link: http://208.91.1.172/logs/nova/67581/5/

Joe Gordon (jogo) wrote :

Christopher, it appeared to come back, but all the hits were in the check queue.

Alan Pevec (apevec) wrote :

Hit in the gate queue: https://review.openstack.org/71230

Attila Fazekas (afazekas) wrote :
Changed in nova:
status: Invalid → Confirmed
Joe Gordon (jogo) wrote :

In your example it looks like nova-net didn't start up

Changed in nova:
status: Confirmed → Invalid
jazeltq (jazeltq-k) wrote :

Can someone also fix this on the Havana release?
