Unexpected error\nActual checksum baca80a92c9d7458f9ae8b45151fed83 mismatches with expected eccea71379f90ad0bd8d933f801d812b for file /dev/sda3\n

Bug #1538645 reported by Anastasia Palkina on 2016-01-27
This bug affects 2 people
Affects              Importance  Assigned to
Fuel for OpenStack   High        Vladimir Sharshov
8.0.x                Medium      Julia Aranovich
Mitaka               High        Vladimir Sharshov

Bug Description

1. Create new environment
2. Choose Neutron, tunnelling segmentation
3. Choose Ceph for images
4. Choose Ceph RadosGW for objects
5. Choose Sahara, Murano, Ceilometer
6. Add 3 controller, 1 compute, 1 cinder+mongo, 3 ceph, 2 mongo
7. Move Management network to eth1
8. Move Storage network to eth2 and untag it
9. Start deployment
10. Stop deployment during provisioning
11. Wait until nodes become 'Pending addition'
12. Deploy the environment again. The deployment fails.

The cause is:
2016-01-27 16:07:58 DEBUG [727] 1559e3f3-ae83-4b4b-959e-a798a8fa57cc: MC agent 'execute_shell_command', method 'execute', results:
{:sender=>"9",
 :statuscode=>0,
 :statusmsg=>"OK",
 :data=>
  {:stdout=>"",
   :stderr=>
    "Unexpected error\nActual checksum baca80a92c9d7458f9ae8b45151fed83 mismatches with expected eccea71379f90ad0bd8d933f801d812b for file /dev/sda3\n",
   :exit_code=>255}}
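
The error above is reported when a checksum of the data on the target device does not match the checksum expected for the image. A minimal sketch of such a check follows. It assumes an MD5 digest over the first `size` bytes (the 32-hex-digit values in the log are consistent with MD5). The function names and the chunk size are hypothetical and are not taken from fuel-agent.

```python
import hashlib


def md5_of_prefix(path, size, chunk_size=1024 * 1024):
    """Return the hex MD5 of the first `size` bytes of `path`."""
    digest = hashlib.md5()
    remaining = size
    with open(path, "rb") as f:
        while remaining > 0:
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:
                break  # the file is shorter than `size`
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()


def verify_checksum(path, size, expected):
    """Raise ValueError if the data at `path` does not match `expected`."""
    actual = md5_of_prefix(path, size)
    if actual != expected:
        raise ValueError(
            "Actual checksum %s mismatches with expected %s for file %s"
            % (actual, expected, path)
        )
```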

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "478"
  build_id: "478"
  fuel-nailgun_sha: "ae949905142507f2cb446071783731468f34a572"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "481ed135de2cb5060cac3795428625befdd1d814"
  fuel-nailgun-agent_sha: "b2bb466fd5bd92da614cdbd819d6999c510ebfb1"
  astute_sha: "b81577a5b7857c4be8748492bae1dec2fa89b446"
  fuel-library_sha: "420c6fa5f8cb51f3322d95113f783967bde9836e"
  fuel-ostf_sha: "ab5fd151fc6c1aa0b35bc2023631b1f4836ecd61"
  fuel-mirror_sha: "b62f3cce5321fd570c6589bc2684eab994c3f3f2"
  fuelmenu_sha: "fac143f4dfa75785758e72afbdc029693e94ff2b"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "9f0ba4577915ce1e77f5dc9c639a5ef66ca45896"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "6c6b088a3d52dd0eaf43d59f3a3a149c93a07e7e"

Anastasia Palkina (apalkina) wrote :
Dmitry Klenov (dklenov) on 2016-01-28
tags: added: area-python
tags: added: team-enhancements
Vladimir Sharshov (vsharshov) wrote :

What happened?

We ran provisioning:

2016-01-27 15:33:44 INFO [733] Processing RPC call 'image_provision'

We sent the stop command about 5 minutes later:

2016-01-27 15:38:16 INFO [733] Processing RPC call 'stop_deploy_task'
2016-01-27 15:38:17 INFO [733] Processing RPC call 'stop_deploy_task'

After that, we ran image_provision again, about 5 minutes later:

2016-01-27 15:42:50 INFO [727] Processing RPC call 'image_provision'

Why did it fail?

We kill the main provisioning process, but this does not kill fuel-agent, which runs on the master node and builds the image.

How to fix?

I can add additional behavior to the image_provision stop so that it also kills the fuel-agent process.

tags: added: move-to-9.0
Vladimir Sharshov (vsharshov) wrote :

After a discussion with Alexander Gordeev, it looks like we cannot solve this quickly, because fuel-agent currently does not support interruption. On the other hand, we can run several fuel-agent instances without any problem. So the solution seems to be to make the image and yaml destination unique for every generation (for example, by including the task id in the path). That will save us from problems with several fuel-agent instances running for one cluster.
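
The idea of including the task id in the path could look roughly like the sketch below. The directory layout, the file names and the function `build_output_paths` are hypothetical. The only point is that two builds for the same cluster never write to the same files.

```python
import os


def build_output_paths(base_dir, cluster_id, task_id):
    """Return per-task image and metadata paths (hypothetical layout)."""
    # The task id is part of the directory name, so a build that is still
    # running for an old task cannot overwrite the output of a new task.
    task_dir = os.path.join(
        base_dir, "env_%s" % cluster_id, "task_%s" % task_id
    )
    return {
        "image": os.path.join(task_dir, "image.img.gz"),
        "metadata": os.path.join(task_dir, "image.yaml"),
    }
```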

Also, we cannot guarantee stable operation of a cluster after a stopped deployment in 8.0 or earlier versions, so this bug should be High, because it can easily be worked around by using 'reset' and deploying afterwards.

The root cause + user impact is the following: if user hits "Stop deployment" right after "Deploy changes" - some of the processes triggered while IBP builds image may keep running for some time. If user fixes the config (or does whatever else prompted him to Stop Deployment so soon) quickly and hits "Deploy changes" again - it will fail.

We need to add an alert in the UI on the "Stop Deployment" screen: "You're stopping deployment at OS provisioning stage. Please allow $N minutes before pressing "Deploy changes" again". Please consult with Igor Kalnitsky about an appropriate $N value.

With this condition met we can move the bug to 9.0

Dmitry Pyzhov (dpyzhov) wrote :

UI team is working on warning message

tags: added: ui
Alexander Gordeev (a-gordeev) wrote :

It turned out that fuel-agent does support interruption (SIGINT, but not SIGTERM).

Moreover, it performs cleanup: it kills all processes in the chroot (if any), stops the currently running process, performs all unmounts, and deletes all image leftovers from the temporary directory.

So fuel-agent itself is interruptible and doesn't need additional signal handlers.

Therefore, if astute needs to stop an image building task (which literally means sending SIGINT to the fuel-agent process involved in image building), it can do so without any worries.

Logs attached.

The last thing to ensure is that processes in the chroot will be killed as expected.

Ihor Kalnytskyi (ikalnytskyi) wrote :

@Alex,

Why don't you want to handle SIGTERM the same way? AFAIU, Astute sends SIGTERM as a universal signal to terminate something. It doesn't know about a lot of custom signals. I believe that fuel-agent should handle SIGTERM the same way it handles SIGINT.

Is there any reason why you think it shouldn't be this way?

Alexander Gordeev (a-gordeev) wrote :

@Igor,

Yes, it makes sense.

Also, I was wrong. SIGINT only works if sent from a shell, because the shell propagates the SIGINT signal to all subprocesses too. So I need to implement the same for SIGTERM, which means fuel-agent needs to be fixed.
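
As a rough illustration of handling SIGTERM the same way as SIGINT: Python turns SIGINT into KeyboardInterrupt by default, and a handler can do the same for SIGTERM so that one cleanup path serves both. This is a minimal sketch and not fuel-agent's code. `run_with_cleanup`, `work` and `cleanup` are hypothetical names, and the handler only affects the current process, not its children.

```python
import signal


def raise_keyboard_interrupt(signum, frame):
    """Treat the received signal like Ctrl-C."""
    raise KeyboardInterrupt()


def run_with_cleanup(work, cleanup):
    """Run `work`; on SIGINT or SIGTERM call `cleanup` instead of dying."""
    # signal.signal() must be called from the main thread.
    previous = signal.signal(signal.SIGTERM, raise_keyboard_interrupt)
    try:
        work()
        return "finished"
    except KeyboardInterrupt:
        cleanup()
        return "interrupted"
    finally:
        # Restore the earlier handler if it was one Python knows about.
        if previous is not None:
            signal.signal(signal.SIGTERM, previous)
```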

Reviewed: https://review.openstack.org/275702
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=9dfe300e7e6acbeed883134f0a8c72f204ee14c8
Submitter: Jenkins
Branch: master

commit 9dfe300e7e6acbeed883134f0a8c72f204ee14c8
Author: Julia Aranovich <email address hidden>
Date: Wed Feb 3 16:42:56 2016 +0300

    Fix warning when stopping deployment on provisioning stage

    Related-Bug: #1538645

    Change-Id: If085bf10c2eaaa74f0ea6a864dbefb55021c97a9

Reviewed: https://review.openstack.org/275732
Committed: https://git.openstack.org/cgit/openstack/fuel-web/commit/?id=741bf0ca72e9c4916257dc1c5b32489ec6ed36c7
Submitter: Jenkins
Branch: stable/8.0

commit 741bf0ca72e9c4916257dc1c5b32489ec6ed36c7
Author: Julia Aranovich <email address hidden>
Date: Wed Feb 3 17:45:31 2016 +0300

    Fix warning when stopping deployment on provisioning stage

    Closes-Bug: #1538645

    Change-Id: If085bf10c2eaaa74f0ea6a864dbefb55021c97a9

tags: added: life-cycle-management
Ihor Kalnytskyi (ikalnytskyi) wrote :

The warning message is merged to stable/8.0, so I am lowering the priority to Medium and closing it as Won't Fix.

For 9.0, there should be an automatic way to terminate the fuel-agent process by sending a SIGTERM signal.

tags: removed: move-to-9.0 ui

Reviewed: https://review.openstack.org/275820
Committed: https://git.openstack.org/cgit/openstack/fuel-agent/commit/?id=c726948f17c948a5307bbb7241f4b4e1f85f1d23
Submitter: Jenkins
Branch: master

commit c726948f17c948a5307bbb7241f4b4e1f85f1d23
Author: Alexander Gordeev <email address hidden>
Date: Wed Feb 3 19:17:29 2016 +0300

    Handle SIGTERM to shut down gracefully

    Apparently, fuel-agent doesn't handle any signal received except for SIGINT
    which is automatically converted by python to KeyboardInterrupt() exception.

    fuel-agent is unable to send signals to spawned processes, just because
    utils.execute doesn't know PIDs of opened subprocesses. To mitigate that flaw,
    fuel-agent will use process group to distribute signals.

    Process groups are used to control the distribution of signals.
    A signal directed to a process group is delivered individually to all of the
    processes that are members of the group.

    That allows fuel-agent to send signals to subprocesses without knowing their
    exact PIDs.

    Change-Id: Ie59c0425f031fa94e517b79df0a0fc3d0c3e7a07
    Related-Bug: #1538645
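
The process-group idea from the commit message can be sketched as follows. This is an illustration under assumptions, not the code from the commit: `spawn_in_new_group` and `terminate_group` are hypothetical helpers, and how the commit itself arranges the groups is not shown here. The sketch is POSIX-only.

```python
import os
import signal
import subprocess


def spawn_in_new_group(cmd):
    """Start `cmd` as the leader of a new process group."""
    # start_new_session=True calls setsid() in the child, which makes it
    # the leader of a new session and of a new process group. Processes it
    # spawns later stay in that group unless they change it themselves.
    return subprocess.Popen(cmd, start_new_session=True)


def terminate_group(proc, sig=signal.SIGTERM):
    """Send `sig` to every process in the group led by `proc`."""
    # No PIDs of grandchildren are needed: the kernel delivers the
    # signal to each member of the group.
    os.killpg(os.getpgid(proc.pid), sig)
```

For example, `terminate_group(spawn_in_new_group(["sh", "-c", "sleep 30 & wait"]))` signals both the shell and the `sleep` it started.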

Dmitry Pyzhov (dpyzhov) on 2016-03-02
tags: added: feature-stop-deployment module-astute

(This check performed automatically)
Please, make sure that bug description contains the following sections filled in with the appropriate data related to the bug you are describing:

steps to reproduce

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags: added: need-info
Dmitry Pyzhov (dpyzhov) on 2016-04-13
Changed in fuel:
milestone: 9.0 → 10.0
Dmitry Pyzhov (dpyzhov) wrote :

Do we need anything else to close the bug?

tags: added: feature
tags: removed: need-info
Alexis Pachas (alexis.pachas) wrote :

Hello,

I have the same problem. I'm deploying a Mirantis OpenStack 9.0 Fuel environment with 1 controller node and 1 compute node. I'm using the storage backend Cinder LVM over iSCSI for volumes, but the same message appears: "Provision has failed. Too many nodes failed to provision". Checking the logs shows:

2016-09-15 04:16:58 INFO fuel_agent.cmd.agent
2016-09-15 04:16:58 INFO fuel_agent.cmd.agent ImageChecksumMismatchError: Actual checksum b2cca87926503d6f24abf00a4220306f mismatches with expected 97b72fc6d95eca32c5cda6570d827b87 for file /dev/sda3

Please, I'm new to OpenStack and I want to deploy my own cloud.

Thanks.
