Fuel for OpenStack

Unexpected error\nActual checksum baca80a92c9d7458f9ae8b45151fed83 mismatches with expected eccea71379f90ad0bd8d933f801d812b for file /dev/sda3\n

Bug #1538645 reported by Anastasia Palkina on 2016-01-27

This bug affects 2 people

	Status	Importance	Assigned to	Milestone
Fuel for OpenStack	Confirmed	High	Vladimir Sharshov	Fuel for OpenStack 10.0
8.0.x	Won't Fix	Medium	Julia Aranovich	Fuel for OpenStack 8.0
Mitaka	Won't Fix	High	Vladimir Sharshov	Fuel for OpenStack 9.0

Bug Description

1. Create new environment
2. Choose Neutron, tunnelling segmentation
3. Choose Ceph for images
4. Choose Ceph RadosGW for objects
5. Choose Sahara, Murano, Ceilometer
6. Add 3 controller, 1 compute, 1 cinder+mongo, 3 ceph, 2 mongo
7. Move Management network to eth1
8. Move Storage network to eth2 and untag it
9. Start deployment
10. Stop deployment during provisioning
11. Wait until nodes become 'Pending addition'
12. Deploy the environment again. The deployment fails.

The cause is:
2016-01-27 16:07:58 DEBUG [727] 1559e3f3-ae83-4b4b-959e-a798a8fa57cc: MC agent 'execute_shell_command', method 'execute', results:
{:sender=>"9",
:statuscode=>0,
:statusmsg=>"OK",
:data=>
  {:stdout=>"",
   :stderr=>
    "Unexpected error\nActual checksum baca80a92c9d7458f9ae8b45151fed83 mismatches with expected eccea71379f90ad0bd8d933f801d812b for file /dev/sda3\n",
   :exit_code=>255}}
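
The stderr text above appears to come from an image checksum check: the data read back from /dev/sda3 did not hash to the value recorded for the image. A minimal sketch of that kind of check follows. It is not fuel-agent's actual code: the function names are hypothetical, and MD5 is an assumption based only on the 32-hex-digit checksums in the log. Only the exception name `ImageChecksumMismatchError` and the message wording come from the logs in this report.

```python
import hashlib


class ImageChecksumMismatchError(Exception):
    """Raised when the written data does not hash to the expected value."""


def md5_of_prefix(path, size, chunk_size=1024 * 1024):
    """Return the hex MD5 of the first `size` bytes of `path` (hypothetical helper)."""
    digest = hashlib.md5()
    remaining = size
    with open(path, "rb") as f:
        while remaining > 0:
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:
                break  # file or device is shorter than expected
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()


def verify_image(path, size, expected):
    """Compare the checksum of what was written against the recorded one."""
    actual = md5_of_prefix(path, size)
    if actual != expected:
        raise ImageChecksumMismatchError(
            "Actual checksum %s mismatches with expected %s for file %s"
            % (actual, expected, path))
    return actual
```

If the image is rebuilt by a second process while the first one is still writing it, the recorded checksum and the bytes that end up on the target can disagree, which would produce exactly this kind of mismatch.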

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "478"
  build_id: "478"
  fuel-nailgun_sha: "ae949905142507f2cb446071783731468f34a572"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "481ed135de2cb5060cac3795428625befdd1d814"
  fuel-nailgun-agent_sha: "b2bb466fd5bd92da614cdbd819d6999c510ebfb1"
  astute_sha: "b81577a5b7857c4be8748492bae1dec2fa89b446"
  fuel-library_sha: "420c6fa5f8cb51f3322d95113f783967bde9836e"
  fuel-ostf_sha: "ab5fd151fc6c1aa0b35bc2023631b1f4836ecd61"
  fuel-mirror_sha: "b62f3cce5321fd570c6589bc2684eab994c3f3f2"
  fuelmenu_sha: "fac143f4dfa75785758e72afbdc029693e94ff2b"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "9f0ba4577915ce1e77f5dc9c639a5ef66ca45896"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "6c6b088a3d52dd0eaf43d59f3a3a149c93a07e7e"

Tags:

Anastasia Palkina (apalkina) wrote on 2016-01-27:

fuel-snapshot-2016-01-27_16-32-23.tar.xz (10.1 MiB, application/octet-stream)

Dmitry Klenov (dklenov) on 2016-01-28

tags:

added: area-python

Alexander Kislitsky (akislitsky) on 2016-01-28

tags:

added: team-enhancements

Vladimir Sharshov (vsharshov) wrote on 2016-02-02:

What happened?

We started provisioning:

2016-01-27 15:33:44 INFO [733] Processing RPC call 'image_provision'

We sent the stop command about 5 minutes later:

2016-01-27 15:38:16 INFO [733] Processing RPC call 'stop_deploy_task'
2016-01-27 15:38:17 INFO [733] Processing RPC call 'stop_deploy_task'

Then we ran image_provision again, roughly 5 minutes after that:

2016-01-27 15:42:50 INFO [727] Processing RPC call 'image_provision'
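
For reference, the gaps between these RPC calls can be computed directly from the log timestamps quoted above. A small sketch (the helper name `gap_seconds` is made up for illustration):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def gap_seconds(start, end):
    """Whole seconds between two log timestamps in the format used above."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


first_provision = "2016-01-27 15:33:44"
first_stop = "2016-01-27 15:38:16"
second_provision = "2016-01-27 15:42:50"

print(gap_seconds(first_provision, first_stop))    # 272
print(gap_seconds(first_stop, second_provision))   # 274
```

Both gaps are a little over four and a half minutes.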

Why did it fail?

We kill the main provisioning process, but that does not kill the fuel-agent process that runs on the master node and builds the image.

How to fix?

I can add extra behavior to the image_provision stop handling so that it also kills the fuel-agent process.

Andrey Maximov (maximov) on 2016-02-02

tags:

added: move-to-9.0

Vladimir Sharshov (vsharshov) wrote on 2016-02-02:

After discussing this with Alexander Gordeev, it looks like we cannot solve it quickly, because fuel-agent does not currently support interruption. Alternatively, we can make it possible to run several fuel-agent processes without any problem. So the solution looks like changing the image and YAML destination to something unique for every generation (for example, including the task id in the path). That would save us from problems when several fuel-agent processes run for one cluster.
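
A rough sketch of the "unique destination per generation" idea, under stated assumptions: the directory layout, the file names and the function `build_task_paths` are all hypothetical and are not how Fuel actually names these files. The only point is that the task id is part of the path, so two image builds for the same cluster never write to the same files.

```python
import posixpath


def build_task_paths(base_dir, cluster_id, task_uuid):
    """Return per-task image and YAML paths (hypothetical layout)."""
    task_dir = posixpath.join(base_dir, "env_%s" % cluster_id, task_uuid)
    return {
        "image": posixpath.join(task_dir, "image.img.gz"),
        "metadata": posixpath.join(task_dir, "image.yaml"),
    }


# Two provisioning attempts for the same cluster get disjoint destinations.
first = build_task_paths("/tmp/images", 1, "task-a")
second = build_task_paths("/tmp/images", 1, "task-b")
print(first["image"])   # /tmp/images/env_1/task-a/image.img.gz
print(second["image"])  # /tmp/images/env_1/task-b/image.img.gz
```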

Also, we cannot guarantee that a cluster works reliably after a stopped deployment in 8.0 or earlier versions, so this bug should be High; it can easily be worked around by using 'reset' and deploying again.

Dmitriy Novakovskiy (dnovakovskiy) wrote on 2016-02-02:

The root cause and user impact are the following: if the user hits "Stop deployment" right after "Deploy changes", some of the processes triggered while IBP builds the image may keep running for some time. If the user quickly fixes the config (or does whatever else prompted him to stop the deployment so soon) and hits "Deploy changes" again, it will fail.

We need to add an alert to the UI on the "Stop Deployment" screen: "You're stopping deployment at OS provisioning stage. Please allow $N minutes before pressing "Deploy changes" again". Please consult with Igor Kalnitsky about an appropriate $N value.

With this condition met we can move the bug to 9.0

Dmitry Pyzhov (dpyzhov) wrote on 2016-02-03:

The UI team is working on a warning message.

tags:

added: ui

Alexander Gordeev (a-gordeev) wrote on 2016-02-03:

fuel-agent-env-22222.log (12.4 KiB, text/plain)

It turned out that fuel-agent does support interruption (SIGINT, but not SIGTERM).

Moreover, it performs cleanup: it kills all processes in the chroot (if any), stops the currently running process, does all the umounts, and deletes all image leftovers from the temporary directory.

So fuel-agent itself is interruptible and doesn't need additional signal handlers.

Therefore, if astute needs to stop the image building task (which literally means sending SIGINT to the fuel-agent process involved in image building), it can do that without any worries.

Logs attached.

The last thing to ensure is that processes in the chroot are killed as expected.

Ihor Kalnytskyi (ikalnytskyi) wrote on 2016-02-03:

@Alex,

Why don't you want to handle SIGTERM the same way? AFAIU, Astute sends SIGTERM as a universal signal to terminate something. It doesn't know about a lot of custom signals. I believe that fuel-agent should handle SIGTERM the same way it handles SIGINT.

Is there any reason why you think it shouldn't be this way?
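
One way to get the same behaviour for both signals, sketched under assumptions: Python already turns SIGINT into KeyboardInterrupt, so a SIGTERM handler that raises the same exception sends both signals down one cleanup path. The names below (`run_interruptibly`, `_sigterm_to_interrupt`) are hypothetical and this is not taken from fuel-agent.

```python
import signal


def _sigterm_to_interrupt(signum, frame):
    # Treat SIGTERM like Ctrl+C so one except/finally path covers both.
    raise KeyboardInterrupt()


def run_interruptibly(work, cleanup):
    """Run work(); on SIGINT or SIGTERM stop and always run cleanup().

    Must be called from the main thread, because signal.signal() only
    works there.
    """
    previous = signal.signal(signal.SIGTERM, _sigterm_to_interrupt)
    try:
        work()
        return "finished"
    except KeyboardInterrupt:
        return "interrupted"
    finally:
        cleanup()
        # Restore whatever handler was installed before (None means it was
        # not installed from Python and cannot be restored this way).
        if previous is not None:
            signal.signal(signal.SIGTERM, previous)
```

Note that this only covers the Python process itself. Child processes started by it are a separate question and do not see the signal unless it is forwarded to them.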

OpenStack Infra (hudson-openstack) wrote on 2016-02-03: Related fix proposed to fuel-web (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/275702

OpenStack Infra (hudson-openstack) wrote on 2016-02-03: Fix proposed to fuel-web (stable/8.0)

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/275732

Alexander Gordeev (a-gordeev) wrote on 2016-02-03:

#10

@Igor,

Yes, it makes sense.

Also, I was wrong. SIGINT only works if sent from a shell, because the shell propagates the SIGINT signal to all subprocesses too. So I need to implement the same for SIGTERM, which means fuel-agent needs to be fixed.

OpenStack Infra (hudson-openstack) wrote on 2016-02-03: Related fix proposed to fuel-agent (master)

#11

Related fix proposed to branch: master
Review: https://review.openstack.org/275820

OpenStack Infra (hudson-openstack) wrote on 2016-02-04: Related fix merged to fuel-web (master)

#12

Reviewed: https://review.openstack.org/275702
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=9dfe300e7e6acbeed883134f0a8c72f204ee14c8
Submitter: Jenkins
Branch: master

commit 9dfe300e7e6acbeed883134f0a8c72f204ee14c8
Author: Julia Aranovich <email address hidden>
Date: Wed Feb 3 16:42:56 2016 +0300

Fix warning when stopping deployment on provisioning stage

Related-Bug: #1538645

Change-Id: If085bf10c2eaaa74f0ea6a864dbefb55021c97a9

OpenStack Infra (hudson-openstack) wrote on 2016-02-04: Fix merged to fuel-web (stable/8.0)

#13

Reviewed: https://review.openstack.org/275732
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=741bf0ca72e9c4916257dc1c5b32489ec6ed36c7
Submitter: Jenkins
Branch: stable/8.0

commit 741bf0ca72e9c4916257dc1c5b32489ec6ed36c7
Author: Julia Aranovich <email address hidden>
Date: Wed Feb 3 17:45:31 2016 +0300

Fix warning when stopping deployment on provisioning stage

Closes-Bug: #1538645

Change-Id: If085bf10c2eaaa74f0ea6a864dbefb55021c97a9

Bogdan Dobrelya (bogdando) on 2016-02-04

tags:

added: life-cycle-management

Ihor Kalnytskyi (ikalnytskyi) wrote on 2016-02-04:

#14

The warning message is merged to stable/8.0, so I am lowering the priority to Medium and closing it as Won't Fix.

For 9.0, there should be an automatic way to terminate the fuel-agent process by sending a SIGTERM signal.

tags:

removed: move-to-9.0 ui

OpenStack Infra (hudson-openstack) wrote on 2016-02-10: Related fix merged to fuel-agent (master)

#15

Reviewed: https://review.openstack.org/275820
Committed: https://git.openstack.org/cgit/openstack/fuel-agent/commit/?id=c726948f17c948a5307bbb7241f4b4e1f85f1d23
Submitter: Jenkins
Branch: master

commit c726948f17c948a5307bbb7241f4b4e1f85f1d23
Author: Alexander Gordeev <email address hidden>
Date: Wed Feb 3 19:17:29 2016 +0300

Handle SIGTERM to shut down gracefully

Apparently, fuel-agent doesn't handle any signal received except for SIGINT
which is automatically converted by python to KeyboardInterrupt() exception.

fuel-agent is unable to send signal for spawned processes, just because
utils.execute doesn't know PIDs of opened subprocesses. To mitigate that flaw,
fuel-agent will use process group to distribute signals.

    Process groups are used to control the distribution of signals.
    A signal directed to a process group is delivered individually to all of the
    processes that are members of the group.

That allows fuel-agent to send signals to subprocesses without knowing their
exact PIDs.

Change-Id: Ie59c0425f031fa94e517b79df0a0fc3d0c3e7a07
Related-Bug: #1538645
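
The process-group mechanism described in the commit message can be illustrated with a short sketch. This is not the code from the commit: `spawn_in_own_group` and `terminate_group` are hypothetical helpers, and starting the child in a new session is just one way to give it its own process group. It shows that a signal sent with `os.killpg` reaches every member of the group without the sender tracking individual PIDs.

```python
import os
import signal
import subprocess


def spawn_in_own_group(cmd):
    """Start cmd as the leader of a new session and process group.

    Anything cmd spawns later stays in that group unless it deliberately
    moves itself elsewhere.
    """
    return subprocess.Popen(cmd, start_new_session=True)


def terminate_group(proc, sig=signal.SIGTERM, timeout=10):
    """Send sig to the whole process group of proc and wait for proc."""
    try:
        os.killpg(os.getpgid(proc.pid), sig)
    except ProcessLookupError:
        pass  # the group is already gone
    return proc.wait(timeout=timeout)
```

A caller would start the image build with `spawn_in_own_group(...)` and call `terminate_group(proc)` when a stop request arrives. On POSIX, `Popen` reports a child killed by a signal as a negative return code, so a child that does not handle SIGTERM ends with `-signal.SIGTERM`.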

Dmitry Pyzhov (dpyzhov) on 2016-03-02

tags:

added: feature-stop-deployment module-astute

Bug Checker Bot (bug-checker) wrote on 2016-03-28: Autochecker

#16

(This check performed automatically)
Please, make sure that bug description contains the following sections filled in with the appropriate data related to the bug you are describing:

steps to reproduce

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags:

added: need-info

Dmitry Pyzhov (dpyzhov) on 2016-04-13

Changed in fuel:
milestone:	9.0 → 10.0

Dmitry Pyzhov (dpyzhov) wrote on 2016-05-06:

#17

Do we need anything else to close the bug?

Vladimir Sharshov (vsharshov) on 2016-05-11

tags:

added: feature

Bug Checker Bot (bug-checker) on 2016-05-17

tags:

removed: need-info

Alexis Pachas (alexis.pachas) wrote on 2016-09-15:

#18

Hello,

I have the same problem. I'm deploying a Mirantis OpenStack 9.0 Fuel environment with 1 controller node and 1 compute node. I'm using the Cinder LVM over iSCSI storage backend for volumes, but the same message appears: "Provision has failed. Too many nodes failed to provision". Checking the logs shows:

2016-09-15 04:16:58 INFO fuel_agent.cmd.agent
2016-09-15 04:16:58 INFO fuel_agent.cmd.agent ImageChecksumMismatchError: Actual checksum b2cca87926503d6f24abf00a4220306f mismatches with expected 97b72fc6d95eca32c5cda6570d827b87 for file /dev/sda3

Please help, I'm new to OpenStack and I want to deploy my own cloud.

Thanks.