bootstrap of admin node fails sometimes

Bug #1315865 reported by Vladimir Kuklin
This bug affects 1 person
Affects: Fuel for OpenStack
Status: Fix Released
Importance: Critical
Assigned to: Vladimir Sharshov
Milestone: (none shown)

Bug Description

When running the bootstrap of the admin node, the bootstrap sometimes fails because it is unable to load the storage containers, for a reason that is not yet known.

Look into /var/log/puppet/bootstrap_admin_node.log for details.

VERSION:
  mirantis: "yes"
  production: "docker"
  release: "5.0"
  build_number: "179"
  build_id: "2014-05-04_01-00-26"
  astute_sha: "3cffebde1e5452f5dbf8f744c6525fc36c7afbf3"
  api: "1.0"
  fuellib_sha: "c414bd7e49e7cfb6c5d66b37b55ae06f05dbecc3"
  api: "1.0"
  ostf_sha: "134765fcb5a07dce0cd1bb399b2290c988c3c63b"
  api: "1.0"
  nailgun_sha: "2de1dcf9fa3fc1521999bff6377eaa6f01d825aa"
  api: "1.0"
  fuelmain_sha: "95c35c199c2efc03fb105d090c5a42525430b7b3"

This bug was confirmed on at least 2 environments.

The second run of puppet always succeeds.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

The problem seems to be that docker is not ready to accept build commands right after it is started.

The logs show that three containers were not built because the docker daemon was unavailable:

Notice: /Stage[main]/Docker/Exec[build docker containers]/returns: 2014/05/03 22:50:25 Cannot connect to the Docker daemon. Is 'docker -d' running on this host?
Notice: /Stage[main]/Docker/Exec[build docker containers]/returns: 2014/05/03 22:50:25 Cannot connect to the Docker daemon. Is 'docker -d' running on this host?
Notice: /Stage[main]/Docker/Exec[build docker containers]/returns: 2014/05/03 22:50:25 Cannot connect to the Docker daemon. Is 'docker -d' running on this host?

Only the 4th one was built successfully.

The easiest fix is to implement a wait loop using a puppet exec and the docker utility.
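
The wait loop suggested here can be sketched as follows. This is only an illustration in Python, not the puppet exec that was actually merged. The helper names `wait_until` and `docker_daemon_ready` are hypothetical, and the sketch assumes that `docker version` exits with a non-zero code while the daemon is unreachable.

```python
import subprocess
import time


def wait_until(check, timeout=60.0, interval=1.0):
    """Call check() repeatedly until it returns True or `timeout` seconds pass.

    Returns True if the check succeeded, False if the deadline was reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def docker_daemon_ready():
    """Hypothetical probe for the docker daemon.

    Assumption: `docker version` returns a non-zero exit code when the
    client cannot reach the daemon.
    """
    try:
        result = subprocess.run(
            ["docker", "version"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        return False  # docker binary is missing or cannot be executed
    return result.returncode == 0
```

A caller would run something like `wait_until(docker_daemon_ready, timeout=60, interval=2)` before building containers and abort the build if it returns False.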

Revision history for this message
Openstack Gerrit (openstack-gerrit) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/92012
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=2348fae80b21c3ec9e5f520395eea2943a510f3d
Submitter: Jenkins
Branch: master

commit 2348fae80b21c3ec9e5f520395eea2943a510f3d
Author: Vladimir Kuklin <email address hidden>
Date: Sun May 4 18:51:33 2014 +0400

    Add waiting for docker daemon

    Add waiting for docker daemon as it
    still may not start listening on its
    socket before docker container building
    starts

    Change-Id: I76cfd4c1864d6cded633da6942c2958fee546d1a
    Closes-Bug: #1315865

Changed in fuel:
status: Confirmed → Fix Committed
Revision history for this message
Openstack Gerrit (openstack-gerrit) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/92110

Revision history for this message
Openstack Gerrit (openstack-gerrit) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/92110
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=edaecb643f34ca73be3716c5a722bfdd40e06128
Submitter: Jenkins
Branch: master

commit edaecb643f34ca73be3716c5a722bfdd40e06128
Author: Matthew Mosesohn <email address hidden>
Date: Mon May 5 16:50:31 2014 +0400

    Added check_ready test for some docker containers

    This works around the problem where docker containers
    could take more time to deploy than anticipated on
    systems with heavy load. A series of loops are added
    with timeouts to ensure these apps are ready.

    Change-Id: I731177d1c9a6fa35897e1ec592557c27a983166b
    Partial-Bug: #1315865
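
The idea described in this commit message (loops with timeouts that make sure each application is ready) could look roughly like the sketch below. It is not the actual check_ready code from fuel-library. The function `start_in_order` and the `(name, start, is_ready)` tuple layout are assumptions made for illustration.

```python
import time


def start_in_order(services, timeout=120.0, interval=1.0):
    """Start services one by one, waiting for each to become ready.

    `services` is a list of (name, start, is_ready) tuples, where `start`
    and `is_ready` are callables supplied by the caller (hypothetical).
    Returns the list of names that were started.
    Raises TimeoutError if a service is not ready within `timeout` seconds.
    """
    started = []
    for name, start, is_ready in services:
        start()
        deadline = time.monotonic() + timeout
        # Poll the readiness check until it passes or the deadline expires.
        while not is_ready():
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{name} not ready after {timeout} s")
            time.sleep(interval)
        started.append(name)
    return started
```

Gating each start on the previous service's readiness check keeps a slow host from starting a dependent container too early, at the cost of a longer total startup time.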

Revision history for this message
Mike Scherbakov (mihgen) wrote :

Looks like the issue still remains unsolved in #183. Igor M. will attach logs where astute could not connect to the AMQP queue. It might be a different issue, so please analyse the logs.

Changed in fuel:
status: Fix Committed → Confirmed
Revision history for this message
Igor Marnat (imarnat) wrote :
Mike Scherbakov (mihgen)
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Vladimir Sharshov (vsharshov)
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

It seems that the deploy order is broken and the Astute docker container starts before RabbitMQ.

Good news: Astute retries the connection an unlimited number of times, and once RabbitMQ is up, everything works fine. The only inconvenience is large logs full of errors about failed connections.

I will try to reproduce it using a fresh ISO.
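
One way to avoid the noisy retry logs is to wait for the broker's TCP port before starting the consumer. Below is a minimal sketch, assuming the broker listens on a known host and port (5672 is the default AMQP port, but the values used in a Fuel deployment are not stated in this report). `wait_for_port` is a hypothetical helper, and an open port only shows that something is listening, not that the broker is fully initialised.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to (host, port) succeeds.

    Returns False if no connection could be made within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # The connection is closed again right away; only reachability matters.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, a start script could call `wait_for_port("127.0.0.1", 5672, timeout=120)` and launch the consumer only when it returns True.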

Changed in fuel:
importance: Critical → High
Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

Critical → High

Because this problem does not break any functionality at the moment, but may cause problems in the future.

Revision history for this message
Vladimir Sharshov (vsharshov) wrote :

#184 - could not reproduce.

Revision history for this message
Mike Scherbakov (mihgen) wrote :

Confirmed on #189.
I ran the installation on my Mac with 1 GB of RAM for the master node, and experienced the following:
* yes, it needs time to bootstrap the admin node
* As soon as I get the message that the master node is installed, I immediately open 10.20.0.2:8000 and try to generate a diag snapshot, and it fails.

If I wait a few more minutes and try again, it passes.

When I checked the logs, I saw:
* for about a minute, astute was not able to connect to MQ. Does this mean it is still started before MQ is ready?
* shotgun -c /tmp/dump_config >> /var/log/dump.log 2>&1 && cat /var/www/nailgun/dump/last returned 1
When I tried to run diag_snapshot for a second time, the command above succeeded with 0 return code.

So it obviously needs further debugging. In my opinion, even if we need to increase VCPU or RAM, it should be to no more than 2 VCPU / 2 GB.
I'll attach my diag snapshot with logs.

Revision history for this message
Mike Scherbakov (mihgen) wrote :
Changed in fuel:
importance: High → Critical
Revision history for this message
Mike Scherbakov (mihgen) wrote :

Also, this is becoming critical, as we can't get reliable results from automated system tests due to this issue. As the simplest workaround for system tests for now, I believe we can simply add a delay of about 2 minutes after the message that the master node is ready.

Revision history for this message
Mike Scherbakov (mihgen) wrote :

In dump.log (BTW, shouldn't we rename it to shotgun.log?), which I can't find in /var/log/ on the master node (only in the container; why is it not located on the master node?), there are very interesting failures which can shed light on the root cause (for example, it looks like I didn't have network access at the moment I ran it).

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-web (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/93178

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

According to the docker logs, the mcollective container failed to start and was respawned by supervisord.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/93179

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-main (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/93180

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/93179
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=ff4e0182a94f9b17e5a02bcc65faaf4452a0ad35
Submitter: Jenkins
Branch: master

commit ff4e0182a94f9b17e5a02bcc65faaf4452a0ad35
Author: Vladimir Kuklin <email address hidden>
Date: Sat May 10 15:00:48 2014 +0400

    Set mcollective as the last container to start

    Change-Id: I55aa3c93a71386e08abb0e9ee3de01e3daf7dfff
    Related-Bug: #1315865

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/93178
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=42e8af79f1610c690f7621dfaf637bad98060e66
Submitter: Jenkins
Branch: master

commit 42e8af79f1610c690f7621dfaf637bad98060e66
Author: Vladimir Kuklin <email address hidden>
Date: Sat May 10 14:26:04 2014 +0400

    Add docker logs to diagnostic snapshot

    Related-Bug: #1315865

    Change-Id: I7b65eb618e000b395b9203e872579a7a193f50ca

Revision history for this message
Mike Scherbakov (mihgen) wrote :

Neither V. Sharshov nor I can reproduce this anymore. I assume that the fix which starts the mcollective container first, and cobbler at the very end, actually fixed this issue.

Changed in fuel:
status: Confirmed → Fix Committed
Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

The issue is fixed now.

Changed in fuel:
status: Fix Committed → Fix Released