Controller was not recovered after hard power off - admin interface's config on the destroyed node did not contain an IP address

Bug #1536041 reported by Dmitry Belyaninov
This bug affects 1 person
Affects              Status    Importance  Assigned to
Fuel for OpenStack   Invalid   High        Fuel Library (Deprecated)
8.0.x                Invalid   High        Fuel Library (Deprecated)
Mitaka               Invalid   High        Fuel Library (Deprecated)

Bug Description

Scenario:

1. Create and deploy a cluster: Neutron VXLAN, Ceph for all, Ceph replication factor 3, 3 controllers, 2 computes, 3 Ceph nodes
2. Run OSTF
3. Verify networks
4. Create 2 volumes and 2 instances with the volumes attached
5. Fill the Ceph storages up to 30%
6. Cold-shutdown all nodes (destroy the virtual machines in the KVM case)
7. Wait 5 minutes
8. Start the cluster nodes one by one <- Error
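For the KVM case, the cold-shutdown and staggered restart above can be sketched as a small helper; the domain names are hypothetical, and the function only prints the commands so the plan can be reviewed before piping it to sh on the virtualization host. Note that virsh destroy is a hard power-off with no guest shutdown, matching the "cold shutdown" in the scenario:

```shell
#!/bin/sh
# Sketch of the cold-shutdown/restart steps for the KVM case.
# Domain names are hypothetical; the function only prints commands.
plan_power_cycle() {
    for n in "$@"; do
        echo "virsh destroy $n"   # hard power-off, no guest shutdown
    done
    echo "sleep 300"              # wait 5 minutes before powering back on
    for n in "$@"; do
        echo "virsh start $n"     # bring the nodes back one by one
    done
}

plan_power_cycle node-1 node-2 node-3
```

To actually execute on the host: `plan_power_cycle node-1 node-2 node-3 | sh`.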

[root@nailgun ~]# cat /etc/fuel/version.yaml
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "429"
  build_id: "429"
  fuel-nailgun_sha: "12b15b2351e250af41cc0b10d63a50c198fe77d8"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "df16d41cd7a9445cf82ad9fd8f0d53824711fcd8"
  fuel-nailgun-agent_sha: "92ebd5ade6fab60897761bfa084aefc320bff246"
  astute_sha: "c7ca63a49216744e0bfdfff5cb527556aad2e2a5"
  fuel-library_sha: "3eaf4f4a9b88b287a10cc19e9ce6a62298cc4013"
  fuel-ostf_sha: "214e794835acc7aa0c1c5de936e93696a90bb57a"
  fuel-mirror_sha: "b62f3cce5321fd570c6589bc2684eab994c3f3f2"
  fuelmenu_sha: "85de57080a18fda18e5325f06eaf654b1b931592"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "9f0ba4577915ce1e77f5dc9c639a5ef66ca45896"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "e8e36cff332644576d7853c80b8a53d5b955420a"

Diagnostic snapshot:
https://drive.google.com/a/mirantis.com/file/d/0B1CktchMwAXHeWxUcVlTekdGVDQ/view?usp=sharing
But I suppose that the logs are not very useful.

Please contact me if you need the cluster snapshot (error_recover) ASAP.

Maciej Relewicz (rlu)
tags: added: area-library
Changed in fuel:
importance: Undecided → High
assignee: nobody → Fuel Library Team (fuel-library)
milestone: none → 8.0
status: New → Confirmed
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Which disk cache mode do you use for the VMs?
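For libvirt/KVM hosts, the cache mode can be read from the domain XML; a minimal sketch, where the domain name and the sample driver element are illustrative (cache='unsafe' or 'writeback' can lose un-synced writes on a hard power-off):

```shell
#!/bin/sh
# extract_cache prints every cache='...' attribute found in libvirt
# domain XML fed on stdin.
extract_cache() {
    grep -o "cache='[^']*'"
}

# On a real host you would run: virsh dumpxml node-1 | extract_cache
# Demo on an illustrative <driver> element:
echo "<driver name='qemu' type='qcow2' cache='unsafe'/>" | extract_cache
```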

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

How long did you wait after the nodes were powered on?
Please try to reproduce the case with the sync command issued on all nodes prior to the virsh destroy.
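That check can be sketched as follows: flush buffers on every node right before the hard power-off. Node and domain names are hypothetical, root SSH access from the master is assumed, and the function prints the commands so the sequence can be reviewed before piping it to sh:

```shell
#!/bin/sh
# Print a sync-then-destroy sequence; node/domain names are hypothetical.
plan_sync_and_destroy() {
    for n in "$@"; do
        echo "ssh $n sync"        # flush dirty pages to disk first
    done
    for n in "$@"; do
        echo "virsh destroy $n"   # then hard power-off
    done
}

plan_sync_and_destroy node-1 node-2 node-3
```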

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

And please elaborate on what the "Error" exactly means.

Revision history for this message
Vladimir Khlyunev (vkhlyunev) wrote :

@Bogdan, the main issue is that the admin interface's config on the destroyed node did not contain an IP address. There were around 30 minutes before the shutdown and ~1.5 hours after power-up. BTW, after a second power cycle the node came online and then failed with another error unrelated to this issue. Moving to Incomplete; we will reopen it if the issue is reproduced.

Revision history for this message
Dmitry Belyaninov (dbelyaninov) wrote :

Reverting to the Incomplete state until the issue is successfully reproduced.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Note, these errors do not look good:
2016-01-18T22:17:44.757310+00:00 node-7 kernel info: [ 4.675577] EXT4-fs (dm-0): re-mounted. Opts: errors=panic
2016-01-18T22:17:45.854814+00:00 node-8 kernel info: [ 5.325040] EXT4-fs (dm-0): re-mounted. Opts: errors=panic
2016-01-18T22:18:32.023989+00:00 node-1 kernel info: [ 4.621264] EXT4-fs (dm-3): re-mounted. Opts: errors=panic
2016-01-18T22:18:34.094786+00:00 node-2 kernel info: [ 4.018672] EXT4-fs (dm-3): re-mounted. Opts: errors=panic
2016-01-18T22:18:35.840098+00:00 node-4 kernel info: [ 4.312123] EXT4-fs (dm-1): re-mounted. Opts: errors=panic
2016-01-19T10:17:09.820294+00:00 node-1 kernel info: [ 2.653363] EXT4-fs (dm-3): re-mounted. Opts: errors=panic
2016-01-19T10:35:26.175298+00:00 node-3 kernel info: [ 2.843008] EXT4-fs (dm-3): re-mounted. Opts: errors=panic
2016-01-19T10:42:43.552903+00:00 node-2 kernel info: [ 2.822818] EXT4-fs (dm-3): re-mounted. Opts: errors=panic
2016-01-19T12:16:25.322168+00:00 node-1 kernel info: [ 2.778047] EXT4-fs (dm-3): re-mounted. Opts: errors=panic
2016-01-19T12:37:16.865503+00:00 node-8 kernel info: [ 2.807707] EXT4-fs (dm-0): re-mounted. Opts: errors=panic

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Another log record points to something being broken in the Cinder DB:
2016-01-18T22:38:55.538378+00:00 node-1 cinder-volume crit: 2016-01-18 22:38:55.530 13639 CRITICAL cinder [req-f04d6484-6b43-4170-ac02-a12731f00eec - - - - -] ProgrammingError: (_mysql_exceptions.ProgrammingError) (1146, "Table 'cinder.volumes' doesn't exist")
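To check whether the schema is really gone, and to rebuild it if so, something like the following could be run on a controller. This is a sketch: it only prints the commands, the MySQL credentials are assumed to come from the usual defaults file on the node, and whether a resync is appropriate is environment-specific:

```shell
#!/bin/sh
# Print diagnostic/repair commands for a missing Cinder schema.
# Review the output, then pipe it to sh on a controller node.
cinder_db_commands() {
    echo "mysql cinder -e \"SHOW TABLES LIKE 'volumes';\""  # is the table there?
    # db sync recreates missing tables; it cannot restore lost rows.
    echo "cinder-manage db sync"
}

cinder_db_commands
```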

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Here are the filtered events related to the Galera cluster.

tags: added: team-bugfix
Revision history for this message
Nastya Urlapova (aurlapova) wrote :

According to Bogdan's comment, moved to Confirmed.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I only provided log snippets. I believe the test case with an unsafe cache mode (that is only an assumption; I asked for elaboration) may be incorrect and may have resulted in the corrupted DB. But this may be unrelated to the reported issue, which is that the admin interface's config on the destroyed node did not contain an IP address. So I can neither confirm nor reject this; it is incomplete.

summary: - controller was not recovered after hard power off
+ controller was not recovered after hard power off - admin interface's
+ config on the destroyed node did not contain an IP address
Revision history for this message
Nastya Urlapova (aurlapova) wrote :

@Bogdan, how can we confirm the issue?

Revision history for this message
Dmitry Belyaninov (dbelyaninov) wrote :

I tried to reproduce the error.
All controllers started successfully and the HC tests pass, but there is a message in the output of "pcs status":
Failed actions:
    p_mysql_start_0 on node-1.test.domain.local 'unknown error' (1): call=71, status=Timed Out, last-rc-change='Wed Jan 20 20:46:38 2016', queued=0ms, exec=300001ms
    p_mysql_start_0 on node-3.test.domain.local 'unknown error' (1): call=55, status=Timed Out, last-rc-change='Wed Jan 20 20:46:38 2016', queued=0ms, exec=300002ms

Cluster is alive. Contact me if needed.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

This looks like another bug now?

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Note, MySQL failing to start due to timeouts is a known issue; there is a docs guide on how to deal with such cases (just increase the start timeout so that syncing large replicas can finish in time).
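That tuning can be sketched with the pcs CLI (some Fuel deployments use crmsh instead; the 600s value is illustrative, and the resource name p_mysql comes from the pcs status output above). As elsewhere, the function only prints the commands for review:

```shell
#!/bin/sh
# Print commands to raise the Pacemaker start timeout for p_mysql so a
# long Galera state transfer can complete, then clear old failures.
# Review and pipe to sh on a controller; 600s is an illustrative value.
mysql_timeout_commands() {
    echo "pcs resource update p_mysql op start timeout=600s"
    echo "pcs resource cleanup p_mysql"   # reset the recorded failed actions
}

mysql_timeout_commands
```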

Revision history for this message
Nastya Urlapova (aurlapova) wrote :

@Dima, I've moved the issue to Invalid; please move it back to Confirmed if you are able to reproduce it.
