Timeout while waiting on RPC response - topic: "network", RPC method: "allocate_for_instance" info: "<unknown>"

Bug #1257626 reported by Joe Gordon
This bug affects 12 people
Affects: OpenStack Compute (nova)
Status: Invalid
Importance: Critical
Assigned to: Unassigned
Milestone: (none)

Bug Description

http://logs.openstack.org/21/59121/6/check/gate-tempest-dsvm-large-ops/fdd1002/logs/screen-n-cpu.txt.gz?level=TRACE#_2013-12-04_06_20_16_658

2013-12-04 06:20:16.658 21854 ERROR nova.compute.manager [-] Instance failed network setup after 1 attempt(s)
<...>
2013-12-04 06:20:16.658 21854 TRACE nova.compute.manager Timeout: Timeout while waiting on RPC response - topic: "network", RPC method: "allocate_for_instance" info: "<unknown>"

It appears there has been a performance regression, and gate-tempest-dsvm-large-ops is now failing because of RPC timeouts to allocate_for_instance.

logstash query: message:"nova.compute.manager Timeout: Timeout while waiting on RPC response - topic: \"network\", RPC method: \"allocate_for_instance\""

There seems to have been a major rise in this bug on December 3rd.
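
For anyone unfamiliar with the failure mode, here is a minimal sketch of the RPC pattern that produces this error, written against the oslo.messaging API; the names, topic handling and timeout value are illustrative rather than nova's actual code:

    # Illustrative sketch only -- not nova's code. nova-compute makes a
    # blocking call() on the "network" topic; if no reply arrives within
    # the RPC response timeout, a timeout exception is raised, which is
    # what shows up in the compute log as the trace quoted above.
    import oslo_messaging as messaging
    from oslo_config import cfg

    transport = messaging.get_transport(cfg.CONF)
    target = messaging.Target(topic='network')
    client = messaging.RPCClient(transport, target)

    def allocate_for_instance(ctxt, instance_id, **kwargs):
        # prepare(timeout=...) bounds how long call() waits for the reply.
        cctxt = client.prepare(timeout=60)  # assumed timeout value
        try:
            return cctxt.call(ctxt, 'allocate_for_instance',
                              instance_id=instance_id, **kwargs)
        except messaging.MessagingTimeout:
            # Under load (e.g. large-ops booting many instances at once)
            # nova-network may simply not answer in time.
            raise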

Revision history for this message
Joe Gordon (jogo) wrote :

Marking as critical since this is hitting us in the gate.

Changed in nova:
milestone: none → icehouse-2
importance: Undecided → Critical
Revision history for this message
Joe Gordon (jogo) wrote :

elastic-recheck query: https://review.openstack.org/59919
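
For reference, a rough sketch of how one can check hit counts for this signature against the logstash Elasticsearch backend; the endpoint URL and index pattern below are assumptions, not taken from this report:

    # Rough sketch: count recent hits for the failure signature.
    from elasticsearch import Elasticsearch

    QUERY = ('message:"nova.compute.manager Timeout: Timeout while waiting on '
             'RPC response - topic: \\"network\\", RPC method: '
             '\\"allocate_for_instance\\""')

    # Assumed endpoint; adjust to the actual logstash deployment.
    es = Elasticsearch(['http://logstash.openstack.org/elasticsearch'])
    result = es.search(index='logstash-*', q=QUERY, size=0)
    print('hits:', result['hits']['total'])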

Changed in nova:
status: New → Triaged
Revision history for this message
Matt Riedemann (mriedem) wrote :

The e-r query for this isn't hitting, so I opened bug 1267271 against elastic-recheck for that.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Never mind, it looks like it is hitting; it reported on this patch today: https://review.openstack.org/#/c/57358/

tags: added: gate-failure network testing
Revision history for this message
Joe Gordon (jogo) wrote :

It looks like the most recent spike in this bug is due to the introduction of RAX high-performance nodes in the gate: https://review.openstack.org/#/c/65236/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/65784

Revision history for this message
Joe Gordon (jogo) wrote :

Looks like https://review.openstack.org/#/c/65760/ helped. This hasn't been seen outside of https://review.openstack.org/#/c/65989/.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/65784
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=831da3df616c2340f914d56c96c60b0f07cfa496
Submitter: Jenkins
Branch: master

commit 831da3df616c2340f914d56c96c60b0f07cfa496
Author: Dan Smith <email address hidden>
Date: Thu Jan 9 09:24:08 2014 -0800

    Avoid unnecessary use of rootwrap for some network commands

    Every time we run something as root with rootwrap, it takes about
    ten times longer (setup-wise anyway). For things that don't need
    to be run as root, we should avoid this hit. Nova network does
    this a lot and is also slow enough to cause trouble, so this
    patch attempts to address that for a few situations.

    Related-bug: #1257626

    Change-Id: Idc26776bf96ccfd9f50383e9d40aa47397d4e2cf
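
The idea in the commit message, as a hedged sketch (illustrative helpers, not the merged diff): only pass run_as_root=True, and therefore pay the rootwrap setup cost, for commands that genuinely need privileges.

    # Illustrative sketch of the approach described above, not the actual patch.
    # nova.utils.execute goes through rootwrap only when run_as_root=True, so
    # read-only queries should skip it and avoid the setup overhead.
    from nova import utils

    def device_exists(device):
        # Checking whether a link exists needs no privileges: run directly.
        _out, err = utils.execute('ip', 'link', 'show', 'dev', device,
                                  check_exit_code=False)
        return not err

    def set_device_mtu(device, mtu):
        # Changing the MTU does require root, so rootwrap is still used here.
        utils.execute('ip', 'link', 'set', device, 'mtu', str(mtu),
                      run_as_root=True)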

Revision history for this message
Russell Bryant (russellb) wrote :

I believe turning large-ops down to 50 from 100 instances was the solution for this. We were just maxing out the test nodes.
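
For context, a hedged illustration of the kind of load the large-ops job generates; this is not tempest's actual test code, and the credentials, image and flavor below are placeholders. A single boot request for TEMPEST_LARGE_OPS_NUMBER instances drives one allocate_for_instance call per instance to nova-network:

    # Placeholder illustration of the large-ops load pattern, not tempest code.
    from novaclient import client as nova_client

    LARGE_OPS_NUMBER = 50  # turned down from 100, per the comment above

    nova = nova_client.Client('2', 'demo', 'secret', 'demo',
                              'http://127.0.0.1:5000/v2.0')  # placeholders
    nova.servers.create(name='large-ops',
                        image='<image-id>',    # placeholder
                        flavor='<flavor-id>',  # placeholder
                        min_count=LARGE_OPS_NUMBER,
                        max_count=LARGE_OPS_NUMBER)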

Changed in nova:
status: Triaged → Invalid
milestone: icehouse-2 → icehouse-3
Thierry Carrez (ttx)
Changed in nova:
milestone: icehouse-3 → none
Revision history for this message
Christopher Yeoh (cyeoh-0) wrote :

Looks like this has come back again. TEMPEST_LARGE_OPS_NUMBER has not changed from 50, so something else is triggering it.

Revision history for this message
Ryan Hsu (rhsu) wrote :

VMware Minesweeper CI has been experiencing 100% build failures since around 6 PM PST yesterday due to this error message. Logs from one of the afflicted runs are here: http://10.148.255.241/logs/nova/67581/5/.

Revision history for this message
Ryan Hsu (rhsu) wrote :

Sorry, wrong URL. This is the correct link: http://208.91.1.172/logs/nova/67581/5/

Revision history for this message
Joe Gordon (jogo) wrote :

Christopher, it appeared to come back, but all the hits were in the check queue.

Revision history for this message
Alan Pevec (apevec) wrote :

Hit in the gate queue: https://review.openstack.org/71230

Revision history for this message
Attila Fazekas (afazekas) wrote :
Changed in nova:
status: Invalid → Confirmed
Revision history for this message
Joe Gordon (jogo) wrote :

In your example it looks like nova-network didn't start up.

Changed in nova:
status: Confirmed → Invalid
Revision history for this message
jazeltq (jazeltq-k) wrote :

Can someone also fix this on the Havana release?
