Deploying a new compute node caused the whole cluster to fail with Err status in the Fuel web UI

Bug #1555932 reported by JohnsonYi
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: JohnsonYi
Milestone: 8.0-updates

Bug Description

Deployed environment:
MOS 8.0
3 controllers
2 compute + Cinder with LVM
1 Ironic
This environment had been running fine for weeks.

What I did was deploy a new compute node (compute + cinder). That caused the whole cluster to be marked as Error in the Fuel web UI, so I tried the deploy again, and the whole cluster began redeploying!

Logs:
2016-03-11 02:22:20 +0000 Puppet (debug): Executing '/usr/bin/dpkg-query -W --showformat '${Status} ${Package} ${Version}\n' python-nova'
2016-03-11 02:22:20 +0000 Puppet (debug): Executing '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install python-nova'
2016-03-11 02:22:20 +0000 Puppet (err): Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install python-nova' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 python-nova : Depends: websockify (>= 0.6.1) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.
/usr/lib/ruby/vendor_ruby/puppet/util/execution.rb:219:in `execute'
/usr/lib/ruby/vendor_ruby/puppet/provider/command.rb:23:in `execute'
/usr/lib/ruby/vendor_ruby/puppet/provider.rb:237:in `block in has_command'
/usr/lib/ruby/vendor_ruby/puppet/provider.rb:463:in `block in create_class_and_instance_method'
/usr/lib/ruby/vendor_ruby/puppet/provider/package/apt.rb:73:in `install'
/etc/puppet/modules/osnailyfacter/lib/puppet/provider/package/apt_fuel.rb:49:in `install'

root@node-9:~# apt-get -q -y -o DPkg::Options::=--force-confold install python-nova
Reading package lists...
Building dependency tree...
Reading state information...
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 python-nova : Depends: websockify (>= 0.6.1) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.
root@node-9:~# apt-cache policy websockify
websockify:
  Installed: (none)
  Candidate: 0.6.1+dfsg1-1~u14.04+mos1
  Version table:
     0.6.1+dfsg1-1~u14.04+mos1 0
       1000 http://10.0.20.2:8080/mirrors/mos-repos/ubuntu/8.0/ mos8.0/main amd64 Packages
     0.5.1+dfsg1-3ubuntu0.14.04.1 0
        500 http://10.0.64.242/ubuntu/ trusty-updates/universe amd64 Packages
     0.5.1+dfsg1-3 0
        500 http://10.0.64.242/ubuntu/ trusty/universe amd64 Packages
root@node-9:~# dpkg -l | grep websockify
root@node-9:~#

I use a local Ubuntu & MOS repo; there is no connectivity issue.
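
To see which dependency of websockify was actually blocking, one could have run the following (hypothetical diagnostic commands, not part of the session above):

apt-get install websockify      # apt prints the unmet dependency chain
apt-cache depends websockify    # lists the direct dependencies (e.g. python-numpy)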

Updates:
2016-05-17
Today I emulated this issue by disabling python-numpy, a dependency of websockify, in the local Ubuntu repo:
root@node-3:~# apt-cache policy python-numpy
python-numpy:
  Installed: 1:1.8.2-0ubuntu0.1
  Candidate: 1:1.8.2-0ubuntu0.1
  Version table:
 *** 1:1.8.2-0ubuntu0.1 0
        500 http://10.14.64.242/ubuntu/ trusty-updates/main amd64 Packages
        100 /var/lib/dpkg/status
     1:1.8.1-1ubuntu1 0
        500 http://10.14.64.242/ubuntu/ trusty/main amd64 Packages

In the local mirror's pool directory, rename the .deb files so apt's fetch will return 404:
mv python-numpy_1.8.1-1ubuntu1_amd64.deb python-numpy_1.8.1-1ubuntu1_amd64.deb.bak
mv python-numpy_1.8.2-0ubuntu0.1_amd64.deb python-numpy_1.8.2-0ubuntu0.1_amd64.deb.bak

Then I deployed a new compute node.

Installing python-nova failed as expected:
Get:90 http://10.14.20.2:8080/liberty-8.0/ubuntu/x86_64/ mos8.0/main python-sqlalchemy-ext amd64 1.0.8~u14.04+mos1 [18.4 kB]
Get:91 http://10.14.20.2:8080/liberty-8.0/ubuntu/x86_64/ mos8.0/main websockify amd64 0.6.1+dfsg1-1~u14.04+mos1 [37.6 kB]
Fetched 12.3 MB in 0s (26.7 MB/s)
E: Failed to fetch http://10.14.64.242/ubuntu/pool/main/p/python-numpy/python-numpy_1.8.2-0ubuntu0.1_amd64.deb 404 Not Found

E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
/usr/lib/ruby/vendor_ruby/puppet/util/execution.rb:219:in `execute'
/usr/lib/ruby/vendor_ruby/puppet/provider/command.rb:23:in `execute'
/usr/lib/ruby/vendor_ruby/puppet/provider.rb:237:in `block in has_command'
/usr/lib/ruby/vendor_ruby/puppet/provider.rb:463:in `block in create_class_and_instance_method'

The new compute node was marked as "Err". All controllers and other nodes then went through their puppet jobs and finished smoothly, but a bit later all nodes were marked as "Err", exactly as reported in this bug.

Then I restored the python-numpy packages and deployed again. The controllers went through the puppet scripts again and again, stuck at 11% (OpenStack was not actually redeployed). About an hour and a half later, all nodes returned to the Ready status.

So the OpenStack environment was not actually redeployed, and VM creation still worked. I may have misunderstood why the controllers went into the "Deploy" status. But robustness may still be a problem: deploying a new node can impact the whole environment, so a potential risk remains.

Changed in fuel:
assignee: nobody → Sachin Yede (yede-sachin45)
Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

Please attach a diagnostic snapshot.

Changed in fuel:
milestone: none → 8.0-updates
status: New → Incomplete
Changed in fuel:
importance: Undecided → High
Revision history for this message
JohnsonYi (yichengli) wrote :

@Oleksiy,

I forgot to create a snapshot for this. Judging from the logs, it seems to be a problem with the local repo, which was synced with the command "fuel-createmirror -M":

Candidate: 0.6.1+dfsg1-1~u14.04+mos1
  Version table:
     0.6.1+dfsg1-1~u14.04+mos1 0
       1000 http://10.0.20.2:8080/mirrors/mos-repos/ubuntu/8.0/ mos8.0/main amd64

Even if the local repo on the Fuel master is incorrect, it should not affect the already deployed environment, and there is no way to specify a single node for re-deploy in Fuel.
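
To rule the mirror out, one could re-sync it and re-check the candidate from the node (a sketch based only on the commands already mentioned in this report):

fuel-createmirror -M                             # on the Fuel master: re-sync the local MOS mirror
apt-get update && apt-cache policy websockify    # on the node: confirm the candidate resolves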

Changed in fuel:
status: Incomplete → Confirmed
Changed in fuel:
assignee: Sachin Yede (yede-sachin45) → nobody
Changed in fuel:
assignee: nobody → MOS Linux (mos-linux)
Revision history for this message
JohnsonYi (yichengli) wrote :

The case mentioned in the link below is very similar to this ticket:
http://askubuntu.com/questions/529217/not-able-to-fix-the-error-unable-to-correct-problems-you-have-held-broken-pack

There must be unmet dependencies for the websockify package. I should have made one more attempt with the command "apt-get install websockify" to find out which package had an unmet version requirement for websockify.

So you can check the dependencies of websockify, install a version of a dependency that does not satisfy them (a higher version is better), and then deploy a compute node to simulate this case.
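
For example (a hypothetical sketch only; the pinned version and the apt-mark hold are my assumption, not the exact procedure used):

apt-cache depends websockify                     # find a direct dependency, e.g. python-numpy
apt-get install python-numpy=1:1.8.1-1ubuntu1    # pin an older version; if websockify needs a newer one, the chain breaks
apt-mark hold python-numpy                       # keep apt from upgrading it during deployment
# then deploy a new compute node; installing python-nova should fail with
# "you have held broken packages"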

Revision history for this message
JohnsonYi (yichengli) wrote :

I updated the bug description with what I tested on my MOS 8.0 (with MU1); please refer to the updates in the bug description.

description: updated
Revision history for this message
JohnsonYi (yichengli) wrote :

BTW, all tests were done on a bare-metal environment, not VMs.

Revision history for this message
Dmitry Teselkin (teselkin-d) wrote :

I don't see any reason why this bug should be on mos-linux: websockify wasn't built by mos-linux, and python-nova is a mos-packaging area. Given that the bug is for 8.0 updates, I'm reassigning it to mos-maintenance.

Changed in fuel:
assignee: MOS Linux (mos-linux) → MOS Maintenance (mos-maintenance)
Revision history for this message
JohnsonYi (yichengli) wrote :

@Dmitry
This is not simply a bug in a single package like websockify. If something goes wrong with the repo during new-node deployment (say the repo service is down, which I also tested), the new node's deployment fails and the other nodes are impacted, and it takes far too long to bring the controllers back to the "Ready" status. Meanwhile, I don't think that recovery is guaranteed to succeed either. A more atomic design for the new-node deployment process would be a good choice.

Changed in fuel:
assignee: MOS Maintenance (mos-maintenance) → Rodion Tikunov (rtikunov)
Revision history for this message
Rodion Tikunov (rtikunov) wrote :

I can't reproduce it.
If you have a local Ubuntu mirror and a package with unmet dependencies, you can add this package to the packages section of /usr/share/fuel-mirror/ubuntu.yaml and recreate the local repo with the following commands:
fuel-mirror create -G mos -I /usr/share/fuel-mirror/ubuntu.yaml
fuel-mirror create -G ubuntu -I /usr/share/fuel-mirror/ubuntu.yaml
fuel-mirror apply -G mos -I /usr/share/fuel-mirror/ubuntu.yaml
fuel-mirror apply -G ubuntu -I /usr/share/fuel-mirror/ubuntu.yaml
Please try adding websockify to your local repo as described above.
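
For illustration, the edit might look like this (a sketch only; I am assuming the config lists package names under a packages: key, so check your ubuntu.yaml for the exact layout):

packages:
  - websockify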

Changed in fuel:
status: Confirmed → Incomplete
assignee: Rodion Tikunov (rtikunov) → JohnsonYi (yichengli)
Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

The bug has been in Incomplete status for a month without a response. Marking as Invalid. Please reopen if you have more data.

Changed in fuel:
status: Incomplete → Invalid
Revision history for this message
JohnsonYi (yichengli) wrote :

A similar issue can be fixed by repairing the dependencies and re-running the puppet scripts with the command "fuel node --node NODE_ID --deploy".
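
For example (a hypothetical recovery sketch; the node ID is taken from the node-9 in the logs above):

apt-get update                  # on the failed node
apt-get -y install python-nova  # verify the dependency chain now resolves
fuel node --node 9 --deploy     # on the Fuel master: re-run deployment for that node only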

BTW, the actions defined in the Fuel puppet scripts always perform strict consistency verification, so don't touch the default settings deployed by them; create new ones instead (networks, VLANs, default flavors, etc.).

Revision history for this message
JohnsonYi (yichengli) wrote :

When one node goes through the puppet jobs and hits an error, the whole cluster may go to the "ERR" status. You have to find and fix the issue, then re-run the puppet jobs to recover the cluster status to "Ready".

Revision history for this message
JohnsonYi (yichengli) wrote :

BTW, even when Fuel goes to the ERR status, the OpenStack cluster is still running healthily; what you have to do is get the puppet jobs to run through cleanly, after which you can deploy new nodes.
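
To confirm that, one could check the services from a controller (a hypothetical check; /root/openrc is the usual Fuel default, not something from this report):

source /root/openrc
nova service-list    # compute and scheduler services should report state "up"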
