puppet overcloud deployment timeouts

Bug #1492028 reported by Dan Prince
This bug affects 1 person
Affects: tripleo
Status: In Progress
Importance: Critical
Assigned to: Dan Prince

Bug Description

We are seeing a large number of gate-tripleo-ironic-overcloud-f21puppet-nonha jobs fail because the overcloud stack creation times out. The console.log failures look like this:

2015-09-03 14:13:18.033 | Waiting for the overcloud stack to be ready
2015-09-03 14:13:18.033 | + wait_for_stack_ready -w 2100 10 overcloud
2015-09-03 14:48:19.163 | Timing out after 2100 seconds:

If you look at the heat resource-list output, most of the resources are still IN_PROGRESS, so clearly a lot of things in the stack have not finished.
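A minimal sketch for tallying resource states from saved `heat resource-list` output, in case it helps when comparing jobs. It assumes the output is the usual bordered text table with a `resource_status` column; the sample rows below are hypothetical placeholders, not taken from any of these jobs.

```python
from collections import Counter


def count_statuses(table_text, column="resource_status"):
    """Count the values of one column in a bordered text table."""
    header = None
    idx = None
    counts = Counter()
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +----+ border lines and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            # The first row that starts with "|" is taken to be the header.
            header = cells
            idx = header.index(column)
            continue
        counts[cells[idx]] += 1
    return counts


# Hypothetical sample; real output has more columns and real resource names.
SAMPLE = """\
+---------------+---------------+--------------------+
| resource_name | resource_type | resource_status    |
+---------------+---------------+--------------------+
| ExampleA      | Example::Type | CREATE_IN_PROGRESS |
| ExampleB      | Example::Type | CREATE_IN_PROGRESS |
| ExampleC      | Example::Type | CREATE_COMPLETE    |
+---------------+---------------+--------------------+
"""

for status, n in count_statuses(SAMPLE).most_common():
    print(status, n)
```

Looking the column up by header name rather than by position keeps the sketch working if the client adds or reorders columns.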

Looking a bit more closely, almost all of the jobs I've looked at today appear to be processing puppet for the compute role. The last thing I see in the compute node's os-collect-config.log file is:

Sep 03 18:02:53 overcloud-controller-0 os-collect-config[942]: [2015-09-03 18:02:53,031] (heat-config) [DEBUG] Running /var/lib/heat-config/hooks/puppet < /var/run/heat-config/deployed/68ee6f7f-9d00-4067-a9cd-646992cfe0d1.json

----

Another thought: This could be a valid timeout issue... (as in our jobs are just taking longer). 35 minutes is a long time for a 2 node overcloud stack to be created though.
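As a sanity check on the 35-minute figure, here is a small sketch that computes the elapsed time between the two console.log timestamps quoted above. The timestamp format string is an assumption based on how those lines are printed; nothing here comes from tripleo-ci itself.

```python
from datetime import datetime

# Assumed format of the console.log timestamps quoted in this report.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(start, end):
    """Return the number of seconds between two timestamp strings."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


start = "2015-09-03 14:13:18.033"  # "Waiting for the overcloud stack to be ready"
end = "2015-09-03 14:48:19.163"    # "Timing out after 2100 seconds:"

secs = elapsed_seconds(start, end)
print(f"{secs:.1f} s = {secs / 60:.1f} min")
```

The result is just over 2100 seconds, so the job used the full wait window before giving up.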

Dan Prince (dan-prince) wrote :

One thing I'm trying to do is get more data about where exactly the compute nodes are in their puppet apply. Trying Martin's patch here might give us that information:

https://review.openstack.org/#/c/188737/

Perhaps we should cherry-pick this into tripleo-ci now?

/me testing this locally

Changed in tripleo:
status: New → In Progress
importance: Undecided → Critical
importance: Critical → High
assignee: nobody → Dan Prince (dan-prince)
Dan Prince (dan-prince) wrote :

Two more tests:

17:11 < openstackgerrit> Dan Prince proposed openstack-infra/tripleo-ci: Cherry
                         pick the improve puppet output patch
                         https://review.openstack.org/220315
17:14 < openstackgerrit> Dan Prince proposed openstack-infra/tripleo-ci:
                         Increase the default overcloud stack timeout
                         https://review.openstack.org/220317

Derek Higgins (derekh) wrote :

I've dug a little into two instances of this
====
http://logs.openstack.org/98/201398/15/check-tripleo/gate-tripleo-ironic-overcloud-f21puppet-nonha/ceda1a6/logs/
root 5776 5148 0 08:27 ? 00:00:00 /bin/systemctl start openstack-nova-compute.service
In this one puppet was stuck starting the nova-compute service for about 24 minutes

=====
http://logs.openstack.org/57/213957/15/check-tripleo/gate-tripleo-ironic-overcloud-f21puppet-nonha/f927ef4/logs/
root 4942 4314 0 08:31 ? 00:00:00 /bin/systemctl start openstack-nova-compute.service
In this one puppet was stuck starting the nova-compute service for about 23 minutes.

In the nova-compute logs we see these every 2 seconds for less than a minute
Sep 04 09:32:15 overcloud-novacompute-0 nova-compute[4955]: 2015-09-04 08:32:15.178 4955 ERROR oslo.messaging._drivers.impl_rabbit [req-bbd45241-28d2-4ba0-ad2f-5565442c29a8 - - - - -] AMQP server on 192.0.2.5:56

then followed by 20 minutes of
Sep 04 09:33:03 overcloud-novacompute-0 nova-compute[4955]: 2015-09-04 08:33:03.491 4955 WARNING nova.conductor.api [req-bbd45241-28d2-4ba0-ad2f-5565442c29a8 - - - - -] Timed out waiting for nova-conductor. Is

I don't see any sign of any nova services running on the controller

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-image-elements (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/220544

Dan Prince (dan-prince) wrote :

Trying to get more Heat logs on the seed here: https://review.openstack.org/220544

Changed in tripleo:
importance: High → Critical
Steve Baker (steve-stevebaker) wrote :

A heat fix for bug #1488366 was merged on 29/8, and that bug would likely cause these symptoms. According to dprince, the delorean URL currently points to a build from 26/8, so the version of heat being used likely lacks this fix.

Steve Baker (steve-stevebaker) wrote :

If this bug turns out to be caused by 1488366, it should probably be marked as a duplicate.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-image-elements (master)

Change abandoned by Dan Prince on branch: master
Review: https://review.openstack.org/220544
