Bug #1546604 “If one node goes offline during provisioning step,...” : Bugs : Fuel for OpenStack

Sergey Galkin (sgalkin) wrote on 2016-02-17:

#1

controller-1.png (11.1 KiB, image/png)

Sergey Galkin (sgalkin) wrote on 2016-02-17:

#2

controller-2.png (13.6 KiB, image/png)

Sergey Galkin (sgalkin) wrote on 2016-02-17:

#3

controller-3.png (13.1 KiB, image/png)

Sergey Galkin (sgalkin) wrote on 2016-02-17:

#4

offline.png (122.1 KiB, image/png)

Sergey Galkin (sgalkin) wrote on 2016-02-17:

#5

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "573"
  build_id: "573"
  fuel-nailgun_sha: "558ca91a854cf29e395940c232911ffb851899c1"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "658be72c4b42d3e1436b86ac4567ab914bfb451b"
  fuel-nailgun-agent_sha: "b2bb466fd5bd92da614cdbd819d6999c510ebfb1"
  astute_sha: "b81577a5b7857c4be8748492bae1dec2fa89b446"
  fuel-library_sha: "c2a335b5b725f1b994f78d4c78723d29fa44685a"
  fuel-ostf_sha: "3bc76a63a9e7d195ff34eadc29552f4235fa6c52"
  fuel-mirror_sha: "fb45b80d7bee5899d931f926e5c9512e2b442749"
  fuelmenu_sha: "78ffc73065a9674b707c081d128cb7eea611474f"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "a43cf96cd9532f10794dce736350bf5bed350e9d"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "643a1ef27c7dccc1c2a2ad26b85c09226b35a67d"

Sergey Galkin (sgalkin) wrote on 2016-02-17:

#6

Patches from the following bugs were applied to the env:
https://bugs.launchpad.net/fuel/+bug/1543221
https://bugs.launchpad.net/fuel/+bug/1543233

Krzysztof Szukiełojć (kszukielojc) on 2016-02-17

Changed in fuel:
status:	New → Confirmed
importance:	Undecided → Medium
assignee:	nobody → Fuel Library Team (fuel-library)
milestone:	none → 9.0
tags:	added: area-library

Sergey Galkin (sgalkin) wrote on 2016-02-18:

#7

fuel-offline-52.png (147.8 KiB, image/png)

Reproduced on the same env after redeployment, but this time:
1. Ubuntu was installed on some of the nodes.
2. 52 compute-ceph nodes switched to offline.

Alexander Gordeev (a-gordeev) wrote on 2016-02-18:

#8

Assigning to fuel-python team.

In short: the target nodes were provisioned and then rebooted. None of them was able to boot.

At first glance it looks like an issue with the bootloader installation, which is done during provisioning.

So, to find the root cause, one needs to analyze the fuel-agent and nailgun-agent logs, as well as the syslog/kernel messages from any of the target nodes.

Changed in fuel:
assignee:	Fuel Library Team (fuel-library) → Fuel Python Team (fuel-python)

Alexander Gordeev (a-gordeev) wrote on 2016-02-18:

#9

Setting importance to High, since a major feature is broken.

Changed in fuel:
importance:	Medium → High

Alexander Gordeev (a-gordeev) on 2016-02-18

tags:

added: area-python
removed: area-library

Alexander Gordeev (a-gordeev) on 2016-02-18

tags:

added: tricky

Alexander Gordeev (a-gordeev) on 2016-02-18

tags:

added: module-astute

Alexander Gordeev (a-gordeev) wrote on 2016-02-18:

#10

http://paste.openstack.org/show/487455/

long story short, what actually happened:

1) provisioning of 50 target nodes started.

2016-02-17 17:57:25 INFO [1071] Starting OS provisioning for nodes: 102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152

2) It went smoothly. All changes were successfully applied to the cobbler node profiles. Then the upload of the provisioning data (provision.json) started. Technically, that upload is implemented via the mcollective service.

3) provision.json was uploaded to nodes 102,103,104,105,106,107,108,109,110,111.

4) For some reason, the next target node, 112, was offline at that moment, so the upload failed.

The last entries in its log files are from 17:30:30:

2016-02-17T17:30:30.578712+00:00 debug: 17:30:30.400765 #2746] DEBUG -- : runnerstats.rb:56:in `block in sent' Incrementing replies stat
2016-02-17T17:30:30.578844+00:00 warning: 17:30:30.405476 #2746]  WARN -- : netio.rb:387:in `_init_line_read' PLMC7: Exiting after signal: SignalException: SIGTERM
2016-02-17T17:30:30.578844+00:00 debug: 17:30:30.405615 #2746] DEBUG -- : rabbitmq.rb:350:in `disconnect' Disconnecting from RabbitMQ
2016-02-17T17:30:30.578968+00:00 info: 17:30:30.405943 #2746]  INFO -- : rabbitmq.rb:20:in `on_disconnect' Disconnected from stomp://mcollective@10.20.0.2:61613

5) astute did 10 retries with no luck.
2016-02-17 17:58:38 DEBUG [1071] Retry #1 to run mcollective agent on nodes: '112'
2016-02-17 17:59:41 DEBUG [1071] Retry #2 to run mcollective agent on nodes: '112'
2016-02-17 18:00:43 DEBUG [1071] Retry #3 to run mcollective agent on nodes: '112'
2016-02-17 18:01:46 DEBUG [1071] Retry #4 to run mcollective agent on nodes: '112'
2016-02-17 18:02:49 DEBUG [1071] Retry #5 to run mcollective agent on nodes: '112'
2016-02-17 18:03:51 DEBUG [1071] Retry #6 to run mcollective agent on nodes: '112'
2016-02-17 18:04:54 DEBUG [1071] Retry #7 to run mcollective agent on nodes: '112'
2016-02-17 18:05:56 DEBUG [1071] Retry #8 to run mcollective agent on nodes: '112'
2016-02-17 18:06:59 DEBUG [1071] Retry #9 to run mcollective agent on nodes: '112'
2016-02-17 18:08:02 DEBUG [1071] Retry #10 to run mcollective agent on nodes: '112'

6) astute gave up with trace:
2016-02-17 18:09:04 ERROR [1071] MCollective agents 'uploadfile' '112' didn't respond within the allotted time.
 trace: 
["/usr/share/gems/gems/astute-8.0.0/lib/astute/mclient.rb:114:in `check_results_with_retries'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/mclient.rb:60:in `method_missing'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/image_provision.rb:46:in `upload_provision'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/image_provision.rb:22:in `block in provision'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/image_provision.rb:22:in `each'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/image_provision.rb:22:in `provision'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/provision.rb:296:in `image_provision'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/provision.rb:241:in `block in provision_piece'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/provision.rb:288:in `call'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/provision.rb:288:in `report_image_provision'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/provision.rb:240:in `provision_piece'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/provision.rb:336:in `call'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/provision.rb:336:in `sleep_not_greater_than'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/provision.rb:115:in `loop'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/provision.rb:114:in `catch'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/provision.rb:46:in `provision'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/orchestrator.rb:123:in `provision'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/server/dispatcher.rb:51:in `provision'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/server/dispatcher.rb:37:in `image_provision'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/server/server.rb:189:in `dispatch_message'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/server/server.rb:146:in `block in dispatch'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/server/task_queue.rb:64:in `call'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/server/task_queue.rb:64:in `block in each'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/server/task_queue.rb:56:in `each'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/server/task_queue.rb:56:in `each'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/server/server.rb:144:in `each_with_index'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/server/server.rb:144:in `dispatch'",
 "/usr/share/gems/gems/astute-8.0.0/lib/astute/server/server.rb:123:in `block in perform_main_job'"]
{"status"=>"error",
 "error"=>
2016-02-17 18:09:04 DEBUG [1071] Data send by DeploymentProxyReporter to report it up:
{"status"=>"error",
 "error"=>
2016-02-17 18:09:04 INFO [1071] Changing node netboot state node-102
2016-02-17 18:09:04 INFO [1071] Casting message to Nailgun:
{"method"=>"provision_resp",
 "args"=>
   "status"=>"error",
   "error"=>

7) However, the provisioning task proceeded, ignoring that error, because of

https://github.com/openstack/fuel-astute/blob/stable/8.0/lib/astute/image_provision.rb#L23

upload_provision() failed, so run_provision() was not executed, and failed_uids was never set correctly.

8) failed_uids was empty. This, in turn, led to a false positive result for run_provision(). So astute mistakenly assumed that all target nodes had been provisioned without errors.

9) astute changed netboot to false for all target nodes and tried to reboot them into the target OS, because of these lines:

https://github.com/openstack/fuel-astute/blob/stable/8.0/lib/astute/provision.rb#L243-L254

10) The target nodes tried to boot from their local disks. Since all disks had been wiped prior to provisioning, there was no valid boot sector, so they threw a 'boot sector signature not found' error.
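
The failure mode in steps 7-9 can be modelled with a short Python sketch. This is not the astute code (which is Ruby). The names UploadError, provision_all_or_nothing and provision_fault_tolerant, and the upload/run callables, are hypothetical and exist only for illustration. The first function mimics the behaviour described above: one upload error skips the run step and leaves failed_uids empty. The second shows one possible fault-tolerant shape, where upload errors are recorded per node and only the nodes that received provision.json continue.

```python
from typing import Callable, List, Tuple


class UploadError(Exception):
    """Hypothetical error: the node did not answer the upload request."""


def provision_all_or_nothing(
    uids: List[int],
    upload: Callable[[int], None],
    run: Callable[[List[int]], List[int]],
) -> List[int]:
    """Simplified model of the failure mode; returns the failed uids."""
    failed_uids: List[int] = []
    try:
        for uid in uids:
            upload(uid)          # raises on the first offline node
        failed_uids = run(uids)  # never reached after an upload error
    except UploadError:
        pass                     # error is reported, but failed_uids stays []
    return failed_uids


def provision_fault_tolerant(
    uids: List[int],
    upload: Callable[[int], None],
    run: Callable[[List[int]], List[int]],
) -> Tuple[List[int], List[int]]:
    """Assumed alternative: record upload failures per node and carry on."""
    uploaded: List[int] = []
    failed: List[int] = []
    for uid in uids:
        try:
            upload(uid)
            uploaded.append(uid)
        except UploadError:
            failed.append(uid)
    # Only nodes that actually received the data are provisioned.
    failed.extend(run(uploaded))
    ok = [uid for uid in uids if uid not in failed]
    return ok, failed


if __name__ == "__main__":
    def upload(uid: int) -> None:
        if uid == 112:           # node 112 is offline, as in the log above
            raise UploadError(uid)

    def run(uids: List[int]) -> List[int]:
        return []                # pretend every uploaded node provisions fine

    nodes = [110, 111, 112, 113]
    print(provision_all_or_nothing(nodes, upload, run))   # []
    print(provision_fault_tolerant(nodes, upload, run))   # ([110, 111, 113], [112])
```

With the first variant the caller sees an empty failure list and would reboot every node. With the second, only the nodes in the ok list would have netboot switched off and be rebooted, while 112 is reported as failed.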

Changed in fuel:
status:	Confirmed → Triaged

Vladimir Sharshov (vsharshov) on 2016-02-18

Changed in fuel:
assignee:	Fuel Python Team (fuel-python) → Vladimir Sharshov (vsharshov)

Leontii Istomin (listomin) wrote on 2016-02-18:

#11

The nodes failed due to network connectivity issues. But, as @agorgeev mentioned earlier, when some nodes go offline we shouldn't fail the deployment at all.

summary:	- Controllers fail to boot during deployment
	+ If one node goes offline during provisioning step, all deployment will be failed
description:	updated

OpenStack Infra (hudson-openstack) wrote on 2016-03-03: Fix proposed to fuel-astute (master)

#12

Fix proposed to branch: master
Review: https://review.openstack.org/288113

Changed in fuel:
status:	Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote on 2016-03-17: Fix merged to fuel-astute (master)

#13

Reviewed: https://review.openstack.org/288113
Committed: https://git.openstack.org/cgit/openstack/fuel-astute/commit/?id=79f99adf48de37d33b5e089472f91b2f7e614e55
Submitter: Jenkins
Branch: master

commit 79f99adf48de37d33b5e089472f91b2f7e614e55
Author: Vladimir Sharshov (warpc) <email address hidden>
Date: Thu Mar 3 23:25:26 2016 +0300

Flexible way to work with node provision

Changes:

    - use upload file task instead of magent directly;
    - fault tolerance for uploading errors;
    - big refactoring of image provision;
    - add missing tests for image provision.

Change-Id: I70169855082c899cb287ff5a10c907d90b3f81b5
Closes-Bug: #1546604

Changed in fuel:
status:	In Progress → Fix Committed

OpenStack Infra (hudson-openstack) wrote on 2016-05-30: Fix proposed to fuel-astute (stable/8.0)

#14

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/322770

Andrew Kalach (akndex) on 2016-06-15

Changed in fuel:
status:	Fix Committed → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2016-06-22: Fix merged to fuel-astute (stable/8.0)

#15

Reviewed: https://review.openstack.org/322770
Committed: https://git.openstack.org/cgit/openstack/fuel-astute/commit/?id=17ddf0ecac92475287266179828a6cc03967c876
Submitter: Jenkins
Branch: stable/8.0

commit 17ddf0ecac92475287266179828a6cc03967c876
Author: Michael Polenchuk <email address hidden>
Date: Mon May 30 13:58:40 2016 +0300

Prevent unexpected exception if provision fail

    Squashed commits from the 9.0:
    - 79f99adf48de37d33b5e089472f91b2f7e614e55
      - fault tolerance for uploading errors
      - use upload file task instead of magnet directly
    - e07e74eb5980421b47fbc64b6d6f50a955e7cad1
      - do not fail if no nodes were sent to reboot

    Change-Id: I5b806f3d1411c4445a58b899b73eca035f5931b9
    Closes-Bug: #1546604
    Related-Bug: #1540360

Affects		Status	Importance	Assigned to	Milestone
	Fuel for OpenStack	Fix Released	High	Vladimir Sharshov	Fuel for OpenStack 9.0
	8.0.x	Fix Committed	High	Michael Polenchuk	Fuel for OpenStack 8.0-mu-2

Fuel for OpenStack

If one node goes offline during provisioning step, all deployment will be failed
