Ubuntu HA deployment fails with puppet error "err: (/Stage[corosync_setup]/Corosync/Package[pacemaker]/ensure) change from purged to present failed"

Bug #1269765 reported by Aleksandr Didenko
This bug affects 1 person
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: Fuel Library (Deprecated)
Milestone: 4.1

Bug Description

Hi,

ISO: {"build_id": "2014-01-14_12-45-02", "ostf_sha": "d5473b3f3bea70b3eecf5910fb1337215bec1f53", "build_number": "32", "nailgun_sha": "86023c668fad368524b3696ecdfa2cc0729e2e8b", "fuelmain_sha": "91b5c989a043769e44faee313007b23af47e4ef8", "astute_sha": "ca787b5b0a3a418e6885b9fd2d795c9fd158ed0a", "release": "4.1", "fuellib_sha": "c8673bb9474ccb0a51fb9077910b009ff2d9034b"}

Environment: bare-metal, Ubuntu, HA, 3 controllers+cephOSD, 1 compute+cephOSD, ceph for images, ceph for volumes, neutron with vlan segmentation. eth3 is used for Fuel admin network (192.168.50.0/24) on all nodes.

First deployment: all nodes failed
Second deployment: 3rd controller failed

Here is part of node-148 puppet-apply.log:

2014-01-16T08:40:36.389756+00:00 info: (/Stage[netconfig]/Advanced_node_netconfig/L23network::L3::Ifconfig[eth3]/File[/etc/network/interfaces.d/ifcfg-eth3]) Scheduling refresh of L3_if_downup[eth3]
2014-01-16T08:40:36.391971+00:00 debug: (Puppet::Type::L3_if_downup::ProviderRuby) Executing '/sbin/ifdown --force eth3'
2014-01-16T08:40:50.443260+00:00 debug: (/Stage[corosync_setup]/Corosync/Exec[rm_corosync_override]) The container Class[Corosync] will propagate my refresh event
2014-01-16T08:40:50.443245+00:00 err: (/Stage[corosync_setup]/Corosync/Package[pacemaker]/ensure) change from purged to present failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install pacemaker' returned 100:
....
Err http://192.168.50.1/ubuntu/fuelweb/x86_64/ precise/main libltdl7 amd64 2.4.2-1ubuntu1
  Could not connect to 192.168.50.1:8080 (192.168.50.1). - connect (113: No route to host)

It looks like eth3 (the Fuel admin network interface, with access to the 192.168.50.0/24 network) was brought down by Puppet::Type::L3_if_downup::ProviderRuby during the "netconfig" stage. As a result, puppet was unable to install the required packages during the "corosync_setup" stage.
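
As a rough workaround sketch (this is not fuel-library code; the master address 192.168.50.1 and the pacemaker package name are taken from the log above), package installation could be ordered behind an exec that waits for the admin network to come back after the ifdown/ifup cycle:

    # Hypothetical guard: wait until the Fuel master repository host is
    # reachable again before attempting to install packages.
    exec { 'wait_for_admin_network':
      command   => '/bin/ping -c 1 -W 2 192.168.50.1',
      path      => ['/bin', '/usr/bin'],
      tries     => 30,        # keep retrying for a couple of minutes
      try_sleep => 5,
    }

    # In fuel-library the pacemaker package is declared by the corosync class;
    # it is declared here only to keep the sketch self-contained.
    package { 'pacemaker':
      ensure  => present,
      require => Exec['wait_for_admin_network'],
    }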

Diagnostic snapshot attached for "Second deployment: 3rd controller failed".

Tags: library
Revision history for this message
Aleksandr Didenko (adidenko) wrote :

Diagnostic snapshot attached for "First deployment: all nodes failed". Please see "fuel-snapshot-2014-01-16_05-03-58/localhost/var/log/remote/node-142.local.int/puppet-apply.log" - it has the same error.

Revision history for this message
Miroslav Anashkin (manashkin) wrote :

The very same issue was encountered with the released Mirantis OpenStack 4.0, using eth0 as the Admin Network.
Here is the error message:
http://paste.openstack.org/show/61592/

And diagnostic snapshot:
https://docs.google.com/a/mirantis.com/file/d/0BwII9gsxwO6UbjkzNHQyRzg3aVU/edit

Changed in fuel:
status: New → Confirmed
Revision history for this message
Miroslav Anashkin (manashkin) wrote :

Sasha,

Please try to collect and share all the logs that remained locally on the failed node.

Currently it looks like there are 2 issues in this bug.

1. Puppet switches the admin network NIC down in the case of VLAN and GRE segmentation. While the admin NIC is down, syslog is not able to deliver part of the puppet apply log to the master node.
https://bugs.launchpad.net/fuel/+bug/1271176

2. It looks like puppet incorrectly determines whether the admin network is up and continues the deployment process without the admin network available. This leads to the failed deployment (see the sketch below).
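
An illustrative check only (the interface name eth3 and the master address 192.168.50.1 are assumed from the original report; this is not actual fuel-library code): verify that the admin NIC is operationally up and the master is reachable right after the ifdown/ifup cycle, so the run fails early instead of continuing without the admin network:

    # Hypothetical sanity check: confirm the admin NIC really came back up and
    # the master node answers; retry briefly, then fail the catalog run.
    exec { 'assert_admin_network_up':
      command   => '/bin/sh -c "/bin/grep -q up /sys/class/net/eth3/operstate && /bin/ping -c 1 -W 2 192.168.50.1"',
      path      => ['/bin', '/usr/bin'],
      tries     => 12,
      try_sleep => 5,
      logoutput => on_failure,
    }

Later stages could then require Exec['assert_admin_network_up'], so the failure would surface at netconfig time rather than during package installation.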

Changed in fuel:
importance: Undecided → High
milestone: none → 4.1
Revision history for this message
Aleksandr Didenko (adidenko) wrote :

I've run a bunch of test deployments on 41+ ISOs and have not hit this bug yet. If I get this error again, I'll gather the requested info.

Mike Scherbakov (mihgen)
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
Revision history for this message
Sergey Vasilenko (xenolog) wrote :

Looks like an issue of network connectivity between the master and the deploying node.
If this issue does not reproduce on fresh ISOs, I believe there is nothing to fix.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Closing for now as non-reproducible.

Changed in fuel:
status: Confirmed → Invalid