Ubuntu HA deployment fails with puppet error "err: (/Stage[corosync_setup]/Corosync/Package[pacemaker]/ensure) change from purged to present failed"

Bug #1269765 reported by Aleksandr Didenko
This bug affects 1 person
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: Fuel Library (Deprecated)
Milestone: 4.1

Bug Description

Hi,

ISO: {"build_id": "2014-01-14_12-45-02", "ostf_sha": "d5473b3f3bea70b3eecf5910fb1337215bec1f53", "build_number": "32", "nailgun_sha": "86023c668fad368524b3696ecdfa2cc0729e2e8b", "fuelmain_sha": "91b5c989a043769e44faee313007b23af47e4ef8", "astute_sha": "ca787b5b0a3a418e6885b9fd2d795c9fd158ed0a", "release": "4.1", "fuellib_sha": "c8673bb9474ccb0a51fb9077910b009ff2d9034b"}

Environment: bare-metal, Ubuntu, HA, 3 controllers+cephOSD, 1 compute+cephOSD, ceph for images, ceph for volumes, neutron with vlan segmentation. eth3 is used for Fuel admin network (192.168.50.0/24) on all nodes.

First deployment: all nodes failed
Second deployment: 3rd controller failed

Here is part of node-148 puppet-apply.log:

2014-01-16T08:40:36.389756+00:00 info: (/Stage[netconfig]/Advanced_node_netconfig/L23network::L3::Ifconfig[eth3]/File[/etc/network/interfaces.d/ifcfg-eth3]) Scheduling refresh of L3_if_downup[eth3]
2014-01-16T08:40:36.391971+00:00 debug: (Puppet::Type::L3_if_downup::ProviderRuby) Executing '/sbin/ifdown --force eth3'
2014-01-16T08:40:50.443260+00:00 debug: (/Stage[corosync_setup]/Corosync/Exec[rm_corosync_override]) The container Class[Corosync] will propagate my refresh event
2014-01-16T08:40:50.443245+00:00 err: (/Stage[corosync_setup]/Corosync/Package[pacemaker]/ensure) change from purged to present failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install pacemaker' returned 100:
....
Err http://192.168.50.1/ubuntu/fuelweb/x86_64/ precise/main libltdl7 amd64 2.4.2-1ubuntu1
  Could not connect to 192.168.50.1:8080 (192.168.50.1). - connect (113: No route to host)

It looks like eth3 (the Fuel admin network interface, with access to the 192.168.50.0/24 network) was brought down by Puppet::Type::L3_if_downup::ProviderRuby during the "netconfig" stage. As a result, puppet was unable to install the required packages during the "corosync_setup" stage.
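
As a rough workaround sketch (this is not fuel-library code; the master address 192.168.50.1 and the pacemaker package name are taken from the log above), package installation could be ordered behind an exec that waits for the admin network to come back after the ifdown/ifup cycle:

    # Hypothetical guard: wait until the Fuel master repository host is
    # reachable again before attempting to install packages.
    exec { 'wait_for_admin_network':
      command   => '/bin/ping -c 1 -W 2 192.168.50.1',
      path      => ['/bin', '/usr/bin'],
      tries     => 30,        # keep retrying for a couple of minutes
      try_sleep => 5,
    }

    # In fuel-library the pacemaker package is declared by the corosync class;
    # it is declared here only to keep the sketch self-contained.
    package { 'pacemaker':
      ensure  => present,
      require => Exec['wait_for_admin_network'],
    }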

Diagnostic snapshot attached for "Second deployment: 3rd controller failed".

Tags: library
Revision history for this message
Aleksandr Didenko (adidenko) wrote :

Diagnostic snapshot attached for "First deployment: all nodes failed". Please see "fuel-snapshot-2014-01-16_05-03-58/localhost/var/log/remote/node-142.local.int/puppet-apply.log" - it has the same error.

Revision history for this message
Miroslav Anashkin (manashkin) wrote :

The very same issue was encountered with the released Mirantis OpenStack 4.0, using eth0 as the Admin Network.
Here is the error message:
http://paste.openstack.org/show/61592/

And diagnostic snapshot:
https://docs.google.com/a/mirantis.com/file/d/0BwII9gsxwO6UbjkzNHQyRzg3aVU/edit

Changed in fuel:
status: New → Confirmed
Revision history for this message
Miroslav Anashkin (manashkin) wrote :

Sasha,

Please try to collect and share all the logs that remained locally on the failed node.

Currently it looks like there are 2 issues in this bug.

1. Puppet switches the admin network NIC down in the case of VLAN and GRE segmentation. While the admin NIC is down, syslog is not able to deliver part of the puppet apply log to the master node.
https://bugs.launchpad.net/fuel/+bug/1271176

2. It looks like puppet incorrectly determines whether the admin network is up and continues the deployment process without the admin network available. This leads to the failed deployment (see the sketch below).
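
An illustrative check only (the interface name eth3 and the master address 192.168.50.1 are assumed from the original report; this is not actual fuel-library code): verify that the admin NIC is operationally up and the master is reachable right after the ifdown/ifup cycle, so the run fails early instead of continuing without the admin network:

    # Hypothetical sanity check: confirm the admin NIC really came back up and
    # the master node answers; retry briefly, then fail the catalog run.
    exec { 'assert_admin_network_up':
      command   => '/bin/sh -c "/bin/grep -q up /sys/class/net/eth3/operstate && /bin/ping -c 1 -W 2 192.168.50.1"',
      path      => ['/bin', '/usr/bin'],
      tries     => 12,
      try_sleep => 5,
      logoutput => on_failure,
    }

Later stages could then require Exec['assert_admin_network_up'], so the failure would surface at netconfig time rather than during package installation.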

Changed in fuel:
importance: Undecided → High
milestone: none → 4.1
Revision history for this message
Aleksandr Didenko (adidenko) wrote :

I've run a bunch of test deployments on 41+ ISOs and have not hit this bug yet. If I get this error again, I'll gather the requested info.

Mike Scherbakov (mihgen)
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
Revision history for this message
Sergey Vasilenko (xenolog) wrote :

Looks like an issue of network connectivity between the master and the deploying node.
If this issue does not reproduce on fresh ISOs, I believe there is nothing to fix.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Closing for now as non-reproducible.

Changed in fuel:
status: Confirmed → Invalid