Controller was not recovered after hard power off - admin interface's config on the destroyed node did not contain an IP address

Bug #1536041 reported by Dmitry Belyaninov
This bug affects 1 person
Affects              Status    Importance  Assigned to
Fuel for OpenStack   Invalid   High        Fuel Library (Deprecated)
8.0.x                Invalid   High        Fuel Library (Deprecated)
Mitaka               Invalid   High        Fuel Library (Deprecated)

Bug Description

Scenario:

1. Create and deploy a cluster: Neutron VXLAN, Ceph for all, Ceph replication factor 3, 3 controllers, 2 computes, 3 Ceph nodes
2. Run OSTF
3. Verify networks
4. Create 2 volumes and 2 instances with the volumes attached
5. Fill the Ceph storages up to 30%
6. Cold-shutdown all nodes (destroy the virtual machines in the KVM case)
7. Wait 5 minutes
8. Start the cluster nodes one by one <- Error
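For the KVM case, the cold-shutdown and staggered restart above can be sketched as a small helper; the domain names are hypothetical, and the function only prints the commands so the plan can be reviewed before piping it to sh on the virtualization host. Note that virsh destroy is a hard power-off with no guest shutdown, matching the "cold shutdown" in the scenario:

```shell
#!/bin/sh
# Sketch of the cold-shutdown/restart steps for the KVM case.
# Domain names are hypothetical; the function only prints commands.
plan_power_cycle() {
    for n in "$@"; do
        echo "virsh destroy $n"   # hard power-off, no guest shutdown
    done
    echo "sleep 300"              # wait 5 minutes before powering back on
    for n in "$@"; do
        echo "virsh start $n"     # bring the nodes back one by one
    done
}

plan_power_cycle node-1 node-2 node-3
```

To actually execute on the host: `plan_power_cycle node-1 node-2 node-3 | sh`.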

[root@nailgun ~]# cat /etc/fuel/version.yaml
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "429"
  build_id: "429"
  fuel-nailgun_sha: "12b15b2351e250af41cc0b10d63a50c198fe77d8"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "df16d41cd7a9445cf82ad9fd8f0d53824711fcd8"
  fuel-nailgun-agent_sha: "92ebd5ade6fab60897761bfa084aefc320bff246"
  astute_sha: "c7ca63a49216744e0bfdfff5cb527556aad2e2a5"
  fuel-library_sha: "3eaf4f4a9b88b287a10cc19e9ce6a62298cc4013"
  fuel-ostf_sha: "214e794835acc7aa0c1c5de936e93696a90bb57a"
  fuel-mirror_sha: "b62f3cce5321fd570c6589bc2684eab994c3f3f2"
  fuelmenu_sha: "85de57080a18fda18e5325f06eaf654b1b931592"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "9f0ba4577915ce1e77f5dc9c639a5ef66ca45896"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "e8e36cff332644576d7853c80b8a53d5b955420a"

Diagnostic snapshot:
https://drive.google.com/a/mirantis.com/file/d/0B1CktchMwAXHeWxUcVlTekdGVDQ/view?usp=sharing
But I suppose that the logs are not very useful.

Please contact me if you need the cluster snapshot (error_recover) ASAP.

Maciej Relewicz (rlu)
tags: added: area-library
Changed in fuel:
importance: Undecided → High
assignee: nobody → Fuel Library Team (fuel-library)
milestone: none → 8.0
status: New → Confirmed
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Which disk cache mode do you use for the VMs?
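For libvirt/KVM hosts, the cache mode can be read from the domain XML; a minimal sketch, where the domain name and the sample driver element are illustrative (cache='unsafe' or 'writeback' can lose un-synced writes on a hard power-off):

```shell
#!/bin/sh
# extract_cache prints every cache='...' attribute found in libvirt
# domain XML fed on stdin.
extract_cache() {
    grep -o "cache='[^']*'"
}

# On a real host you would run: virsh dumpxml node-1 | extract_cache
# Demo on an illustrative <driver> element:
echo "<driver name='qemu' type='qcow2' cache='unsafe'/>" | extract_cache
```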

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

How long did you wait after the nodes were powered on?
Please try to reproduce the case with the sync command issued on all nodes prior to the virsh destroy.
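That check can be sketched as follows: flush buffers on every node right before the hard power-off. Node and domain names are hypothetical, root SSH access from the master is assumed, and the function prints the commands so the sequence can be reviewed before piping it to sh:

```shell
#!/bin/sh
# Print a sync-then-destroy sequence; node/domain names are hypothetical.
plan_sync_and_destroy() {
    for n in "$@"; do
        echo "ssh $n sync"        # flush dirty pages to disk first
    done
    for n in "$@"; do
        echo "virsh destroy $n"   # then hard power-off
    done
}

plan_sync_and_destroy node-1 node-2 node-3
```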

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

And please elaborate on what the "Error" exactly means.

Revision history for this message
Vladimir Khlyunev (vkhlyunev) wrote :

@Bogdan, the main issue is that the admin interface's config on the destroyed node did not contain an IP address. There were around 30 minutes before the shutdown and ~1.5 hours after power-up. BTW, after a second power cycle the node came online and then failed with another error unrelated to this issue. Moving to Incomplete; we will reopen it if the issue is reproduced.

Revision history for this message
Dmitry Belyaninov (dbelyaninov) wrote :

Reverting to the Incomplete state until the issue is successfully reproduced.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Note, these errors do not look good:
2016-01-18T22:17:44.757310+00:00 node-7 kernel info: [ 4.675577] EXT4-fs (dm-0): re-mounted. Opts: errors=panic
2016-01-18T22:17:45.854814+00:00 node-8 kernel info: [ 5.325040] EXT4-fs (dm-0): re-mounted. Opts: errors=panic
2016-01-18T22:18:32.023989+00:00 node-1 kernel info: [ 4.621264] EXT4-fs (dm-3): re-mounted. Opts: errors=panic
2016-01-18T22:18:34.094786+00:00 node-2 kernel info: [ 4.018672] EXT4-fs (dm-3): re-mounted. Opts: errors=panic
2016-01-18T22:18:35.840098+00:00 node-4 kernel info: [ 4.312123] EXT4-fs (dm-1): re-mounted. Opts: errors=panic
2016-01-19T10:17:09.820294+00:00 node-1 kernel info: [ 2.653363] EXT4-fs (dm-3): re-mounted. Opts: errors=panic
2016-01-19T10:35:26.175298+00:00 node-3 kernel info: [ 2.843008] EXT4-fs (dm-3): re-mounted. Opts: errors=panic
2016-01-19T10:42:43.552903+00:00 node-2 kernel info: [ 2.822818] EXT4-fs (dm-3): re-mounted. Opts: errors=panic
2016-01-19T12:16:25.322168+00:00 node-1 kernel info: [ 2.778047] EXT4-fs (dm-3): re-mounted. Opts: errors=panic
2016-01-19T12:37:16.865503+00:00 node-8 kernel info: [ 2.807707] EXT4-fs (dm-0): re-mounted. Opts: errors=panic

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Another log record points to something being broken in the Cinder DB:
2016-01-18T22:38:55.538378+00:00 node-1 cinder-volume crit: 2016-01-18 22:38:55.530 13639 CRITICAL cinder [req-f04d6484-6b43-4170-ac02-a12731f00eec - - - - -] ProgrammingError: (_mysql_exceptions.ProgrammingError) (1146, "Table 'cinder.volumes' doesn't exist")
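To check whether the schema is really gone, and to rebuild it if so, something like the following could be run on a controller. This is a sketch: it only prints the commands, the MySQL credentials are assumed to come from the usual defaults file on the node, and whether a resync is appropriate is environment-specific:

```shell
#!/bin/sh
# Print diagnostic/repair commands for a missing Cinder schema.
# Review the output, then pipe it to sh on a controller node.
cinder_db_commands() {
    echo "mysql cinder -e \"SHOW TABLES LIKE 'volumes';\""  # is the table there?
    # db sync recreates missing tables; it cannot restore lost rows.
    echo "cinder-manage db sync"
}

cinder_db_commands
```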

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Here are the filtered events related to the Galera cluster.

tags: added: team-bugfix
Revision history for this message
Nastya Urlapova (aurlapova) wrote :

According to Bogdan's comment, moved to Confirmed.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I only provided log snippets. I believe the test case with an unsafe cache mode (that is only an assumption; I asked for elaboration) may be incorrect and may have resulted in the corrupted DB. But this may be unrelated to the reported issue, which is that the admin interface's config on the destroyed node did not contain an IP address. So I can neither confirm nor reject this; it is incomplete.

summary: - controller was not recovered after hard power off
+ controller was not recovered after hard power off - admin interface's
+ config on the destroyed node did not contain an IP address
Revision history for this message
Nastya Urlapova (aurlapova) wrote :

@Bogdan, how can we confirm the issue?

Revision history for this message
Dmitry Belyaninov (dbelyaninov) wrote :

I tried to reproduce the error.
All controllers started successfully and the HC tests pass, but there is a message in the output of "pcs status":
Failed actions:
    p_mysql_start_0 on node-1.test.domain.local 'unknown error' (1): call=71, status=Timed Out, last-rc-change='Wed Jan 20 20:46:38 2016', queued=0ms, exec=300001ms
    p_mysql_start_0 on node-3.test.domain.local 'unknown error' (1): call=55, status=Timed Out, last-rc-change='Wed Jan 20 20:46:38 2016', queued=0ms, exec=300002ms

Cluster is alive. Contact me if needed.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

This looks like another bug now?

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Note, MySQL failing to start due to timeouts is a known issue; there is a docs guide on how to deal with such cases (just increase the start timeout so that syncing large replicas can finish in time).
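That tuning can be sketched with the pcs CLI (some Fuel deployments use crmsh instead; the 600s value is illustrative, and the resource name p_mysql comes from the pcs status output above). As elsewhere, the function only prints the commands for review:

```shell
#!/bin/sh
# Print commands to raise the Pacemaker start timeout for p_mysql so a
# long Galera state transfer can complete, then clear old failures.
# Review and pipe to sh on a controller; 600s is an illustrative value.
mysql_timeout_commands() {
    echo "pcs resource update p_mysql op start timeout=600s"
    echo "pcs resource cleanup p_mysql"   # reset the recorded failed actions
}

mysql_timeout_commands
```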

Revision history for this message
Nastya Urlapova (aurlapova) wrote :

@Dima, I've moved the issue to Invalid; please move it back to Confirmed if you are able to reproduce it.
