[scale] Remove primary controller and another one after scaling - failed with error on granular_deploy task

Bug #1539003 reported by Dmitry Tyzhnenko
This bug affects 1 person
Affects               Status   Importance  Assigned to   Milestone
Fuel for OpenStack    Invalid  High        Fuel QA Team
  8.0.x               Invalid  High        Fuel QA Team
  Mitaka              Invalid  High        Fuel QA Team

Bug Description

Redeploying after deleting controllers failed with the error "Error Deployment has failed. Method granular_deploy. Deployment failed on nodes 4. Inspect Astute logs for the details".

1. Deploy cluster: 1 controller, Neutron VXLAN, Ceph for volumes and images, Ceph for ephemeral, and Rados
2. Add 2 Ceph nodes, verify networks, set the replication factor to 2
3. Deploy. Run OSTF
4. Add 2 controllers
5. Redeploy the cluster
6. Verify networks
7. Run OSTF
8. Add 2 controllers and 1 compute
9. Redeploy the cluster
10. Verify networks
11. Run OSTF
12. Delete the primary controller and the last added controller
13. Redeploy the cluster
14. Verify networks
15. Run OSTF
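A quick tally of node roles through the steps above, as a Python sketch. It assumes the cluster contains only the nodes the steps mention (no extra compute before step 8). Under that reading the cluster peaks at 8 nodes and step 13 redeploys 6, which matches the six node IDs passed to granular_deploy in the Astute log excerpts later in this report.

```python
from collections import Counter

# (step number, role change) for each step that adds or deletes nodes.
# Positive counts add nodes, negative counts delete them.
# Assumption: no nodes exist other than the ones the steps mention.
STEPS = [
    (1, {"controller": 1}),
    (2, {"ceph": 2}),
    (4, {"controller": 2}),
    (8, {"controller": 2, "compute": 1}),
    (12, {"controller": -2}),
]


def tally(steps):
    """Return the role counts after applying every step in order."""
    roles = Counter()
    for _, change in steps:
        roles.update(change)  # Counter.update adds counts, negatives included
    return dict(roles)


def node_totals(steps):
    """Return (step, total node count) after each step."""
    roles = Counter()
    totals = []
    for step, change in steps:
        roles.update(change)
        totals.append((step, sum(roles.values())))
    return totals


print(tally(STEPS))        # {'controller': 3, 'ceph': 2, 'compute': 1}
print(node_totals(STEPS))  # [(1, 1), (2, 3), (4, 5), (8, 8), (12, 6)]
```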

Expected result:
 All steps pass.

Actual result:
 Fails at step 13.

Astute errors - http://paste.openstack.org/show/485240/

Snapshot log - https://drive.google.com/a/mirantis.com/file/d/0B8U7EvTbuAOlbEEwN1Y1M1NsdTA/view?usp=sharing

Fuel 8.0-478
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "478"
  build_id: "478"
  fuel-nailgun_sha: "ae949905142507f2cb446071783731468f34a572"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "481ed135de2cb5060cac3795428625befdd1d814"
  fuel-nailgun-agent_sha: "b2bb466fd5bd92da614cdbd819d6999c510ebfb1"
  astute_sha: "b81577a5b7857c4be8748492bae1dec2fa89b446"
  fuel-library_sha: "420c6fa5f8cb51f3322d95113f783967bde9836e"
  fuel-ostf_sha: "ab5fd151fc6c1aa0b35bc2023631b1f4836ecd61"
  fuel-mirror_sha: "b62f3cce5321fd570c6589bc2684eab994c3f3f2"
  fuelmenu_sha: "fac143f4dfa75785758e72afbdc029693e94ff2b"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "9f0ba4577915ce1e77f5dc9c639a5ef66ca45896"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "6c6b088a3d52dd0eaf43d59f3a3a149c93a07e7e"

description: updated
Dmitry Klenov (dklenov)
tags: added: area-library
Changed in fuel:
status: New → Confirmed
tags: added: life-cycle-management
tags: added: area-python
removed: area-library
tags: added: area-astute
tags: added: team-bugfix
Vladimir Kuklin (vkuklin) wrote :

The deployment fails because the setup_repositories task times out. This is a fairly straightforward task that generates a set of simple files from ERB templates, so it may be an issue with the amount of RAM and CPU you have allocated to these VMs. Could you please share your environment details?

Dmitry Tyzhnenko (dtyzhnenko) wrote :

@vkuklin: 2 vCPUs and 6 GB RAM per VM

root@node-4:~# free
             total       used       free     shared    buffers     cached
Mem:       6112660    5950516     162144     173836     130768     774288
-/+ buffers/cache:    5045460    1067200
Swap:      6291452      67864    6223588

root@node-4:~# cat /proc/cpuinfo | grep processor
processor : 0
processor : 1

Vladimir Kuklin (vkuklin) wrote :

Dima, sorry, my question was ambiguous. I was actually asking about the parameters of the host machine.

Dmitry Tyzhnenko (dtyzhnenko) wrote :

@vkuklin:

$ free
             total       used       free     shared    buffers     cached
Mem:      65941356   65494040     447316       2820     202336   35545060
-/+ buffers/cache:   29746644   36194712
Swap:     67072328     820284   66252044

Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz

$ cat /proc/cpuinfo | grep processor | wc -l
24
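Both `free` outputs above use the older procps format, whose `-/+ buffers/cache` row treats buffers and page cache as reclaimable memory. A short Python check of that arithmetic (the function names are illustrative) reproduces the row for node-4 and for the host: applications held about 82% of node-4's RAM, while more than half of the host's RAM was free or reclaimable.

```python
def buffers_cache_row(used, free, buffers, cached):
    """Reproduce the '-/+ buffers/cache' row of the older procps `free` output.

    Buffers and page cache can be reclaimed by the kernel, so they are
    subtracted from "used" and added to "free". All values are in KiB.
    """
    return used - buffers - cached, free + buffers + cached


def app_used_share(total, used, free, buffers, cached):
    """Fraction of total RAM held by applications (excluding buffers/cache)."""
    app_used, _ = buffers_cache_row(used, free, buffers, cached)
    return app_used / total


# "Mem:" rows from the two `free` outputs in this report.
NODE_4 = dict(total=6112660, used=5950516, free=162144, buffers=130768, cached=774288)
HOST = dict(total=65941356, used=65494040, free=447316, buffers=202336, cached=35545060)

for name, mem in (("node-4", NODE_4), ("host", HOST)):
    row = buffers_cache_row(mem["used"], mem["free"], mem["buffers"], mem["cached"])
    print(name, row, f"{app_used_share(**mem):.1%}")
# node-4 (5045460, 1067200) 82.5%
# host (29746644, 36194712) 45.1%
```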

Vladimir Sharshov (vsharshov) wrote :

I've checked this env with Vladimir Kuklin. For some reason, the deployment task was sent twice, about 10 minutes apart.

Investigating...

Bulat Gaifullin (bulat.gaifullin) wrote :

The setup_repositories task calls the function 'generate_apt_pins', which tries to download the Release information for each repository that does not have the priority field [1]. This function could be the cause of the timeout error.

[1] https://github.com/openstack/fuel-library/blob/2f446c986a76ba48104a5a1bda88f481244e8157/deployment/puppet/osnailyfacter/lib/puppet/parser/functions/generate_apt_pins.rb#L23-L28
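A minimal Python sketch of the behaviour described here, not the Ruby function in fuel-library: for each selected repository it builds the standard Debian Release URL (`<uri>/dists/<suite>/Release`) and downloads it with a timeout. The repository keys (`name`, `uri`, `suite`, `priority`) are assumptions, and so is the selection rule: the sketch follows Bartłomiej's reply below (repositories that carry a priority), while Bulat's comment above states the opposite. Because the downloads run one after another, each unreachable mirror adds its own wait, which is one way a slow network could push setup_repositories past its time limit.

```python
import urllib.request
from typing import Any, Dict, List

# Hypothetical repository record; the key names are an assumption.
Repo = Dict[str, Any]


def release_url(repo: Repo) -> str:
    """URL of the Release file for a standard (non-flat) Debian repository."""
    return f"{repo['uri'].rstrip('/')}/dists/{repo['suite']}/Release"


def repos_to_pin(repos: List[Repo]) -> List[Repo]:
    """Assumed selection rule: only repositories with a priority are pinned."""
    return [repo for repo in repos if repo.get("priority")]


def fetch_releases(repos: List[Repo], timeout: float = 10.0) -> Dict[str, str]:
    """Download each selected Release file in turn.

    `timeout` bounds each blocking socket operation (connect, read), not the
    whole loop, so the worst case grows with the number of repositories.
    """
    releases = {}
    for repo in repos_to_pin(repos):
        with urllib.request.urlopen(release_url(repo), timeout=timeout) as resp:
            releases[repo["name"]] = resp.read().decode("utf-8", errors="replace")
    return releases


# Hypothetical repositories on a placeholder host; no network access needed here.
REPOS = [
    {"name": "ubuntu", "uri": "http://mirror.example/ubuntu/", "suite": "trusty", "priority": None},
    {"name": "mos", "uri": "http://mirror.example/mos", "suite": "mos8.0", "priority": 1050},
]
print([release_url(repo) for repo in repos_to_pin(REPOS)])
# ['http://mirror.example/mos/dists/mos8.0/Release']
```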

summary: - [scale] Remove primary contorller and another one after scaling - failed
+ [scale] Remove primary controller and another one after scaling - failed
with error on granular_deploy task
Bartłomiej Piotrowski (bpiotrowski) wrote :

setup_repositories runs for tasks that do have the priority field. It does no heavy processing apart from downloading the Release file for parsing, so if it times out, I would rather suspect a connectivity issue.

Ihor Kalnytskyi (ikalnytskyi) wrote :

@Bulat, thanks for investigation.

@Bartłomiej, yes, we have encountered some connectivity issues recently.

--

Since the issue is not in the test case itself but an accidental failure caused by connectivity issues with Fuel Infra, I am moving this bug to Fuel QA and setting it to Incomplete so that it can be rechecked to ensure the problem no longer occurs.

Vladimir Sharshov (vsharshov) wrote :

Guys, we have 2 tasks running in parallel on the same nodes. This looks like the same problem as https://bugs.launchpad.net/fuel/+bug/1496411; in short, the problem is in the Fuel tests.

[776] 5feb2d7e-8c40-4bec-81b2-fa8e85936119
10:56:54 INFO [776] 'granular_deploy' method called with data:
[\"5\", \"8\", \"6\", \"3\", \"4\", \"2\"]}

[782] 'granular_deploy' method called with data
2016-01-27 11:06:04 DEBUG [782] Process message from worker queue:
[\"3\", \"4\", \"2\", \"8\", \"6\", \"5\"]

We also have another task, as in the bug about controller removal:
2016-01-27 11:08:48 INFO [800] Casting message to Nailgun:
{"method"=>"remove_nodes_resp",
 "args"=>
  {"task_uuid"=>"fad3e6e4-9e3c-4feb-a31b-3490ca55f87b",
   "status"=>"ready",
   "progress"=>100,
   "nodes"=>[{"uid"=>"1"}, {"uid"=>"7"}]}}

A short summary of the node actions:
2016-01-27 09:44:22 INFO [793] Processing RPC call 'verify_networks'
2016-01-27 10:56:52 INFO [780] Processing RPC call 'remove_nodes'
2016-01-27 10:56:53 INFO [776] Processing RPC call 'granular_deploy'
2016-01-27 11:06:04 INFO [800] Processing RPC call 'remove_nodes'
2016-01-27 11:06:05 INFO [782] Processing RPC call 'granular_deploy'
2016-01-27 11:36:16 INFO [795] Processing RPC call 'dump_environment'
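Vladimir Sharshov's two observations (a repeated deploy roughly ten minutes apart, targeting the same nodes) can be checked directly against the excerpts above. The Python sketch below assumes every summary line has the `date time LEVEL [id] Processing RPC call 'method'` shape shown here, and copies the two node lists by hand. It finds 9 min 12 s between the two `granular_deploy` calls, the same gap between the two `remove_nodes` calls that each precede them by one second, and identical node sets in the two deploy calls.

```python
from datetime import datetime

# Summary lines copied from the Astute log above.
LOG = """\
2016-01-27 09:44:22 INFO [793] Processing RPC call 'verify_networks'
2016-01-27 10:56:52 INFO [780] Processing RPC call 'remove_nodes'
2016-01-27 10:56:53 INFO [776] Processing RPC call 'granular_deploy'
2016-01-27 11:06:04 INFO [800] Processing RPC call 'remove_nodes'
2016-01-27 11:06:05 INFO [782] Processing RPC call 'granular_deploy'
2016-01-27 11:36:16 INFO [795] Processing RPC call 'dump_environment'
"""

# Node IDs passed to the two granular_deploy calls ([776] and [782] above).
FIRST_DEPLOY = ["5", "8", "6", "3", "4", "2"]
SECOND_DEPLOY = ["3", "4", "2", "8", "6", "5"]


def rpc_calls(log):
    """Yield (timestamp, bracketed id, RPC method) for each summary line."""
    for line in log.splitlines():
        date, time, _level, call_id, *rest = line.split()
        stamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
        yield stamp, call_id.strip("[]"), rest[-1].strip("'")


def gaps_between(log, method):
    """Time elapsed between consecutive calls of one RPC method."""
    stamps = [stamp for stamp, _, name in rpc_calls(log) if name == method]
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


print([str(gap) for gap in gaps_between(LOG, "granular_deploy")])  # ['0:09:12']
print([str(gap) for gap in gaps_between(LOG, "remove_nodes")])     # ['0:09:12']
# Same nodes, different order: comparing as sets ignores the order.
print(set(FIRST_DEPLOY) == set(SECOND_DEPLOY))                     # True
```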

Vladimir Sharshov (vsharshov) wrote :

This is also similar to bug https://bugs.launchpad.net/fuel/+bug/1539693, so I will mark it as a duplicate.

Ihor Kalnytskyi (ikalnytskyi) wrote :

Vladimir S, you're wrong here. It's not about a DELETE request. The flow described in this issue goes through the Deploy Changes button, so it has nothing to do with parallel tasks.

Ihor Kalnytskyi (ikalnytskyi) wrote :

And Bulat discovered that the problem is in setup_repos, which makes sense.

Dmitry Tyzhnenko (dtyzhnenko) wrote :

The bug is not reproduced on 8.0-506. Moving it to Invalid.
