[fuel] HA deployment with primary-controller+ceph+neutron has failed

Bug #1354384 reported by Anastasia Palkina
Affects: Fuel for OpenStack
Status: Incomplete
Importance: Critical
Assigned to: MOS Linux
Milestone: (none listed)

Bug Description

"build_id": "2014-08-07_02-01-17",
"ostf_sha": "e33390c275e225d648b36997460dc29b1a3c20ae",
"build_number": "408",
"auth_required": true,
"api": "1.0",
"nailgun_sha": "67c4f1c18ab0833175f6dc7f0f9c49c3eb722287",
"production": "docker",
"fuelmain_sha": "7b2e7ef083f239bd47b5c47aecb1f815c009521f",
"astute_sha": "b52910642d6de941444901b0f20e95ebbcb2b2e9",
"feature_groups": ["mirantis"],
"release": "5.1",
"fuellib_sha": "53633cd9bb149f6c1b9d5ee8321efc85c71cee68"

1. Create new environment (Ubuntu, HA mode)
2. Choose VLAN segmentation
3. Choose both Ceph
4. Add 3 controllers+ceph, compute
5. Start deployment. It fails.
6. The controllers go offline. There are errors in astute.log:

2014-08-07 17:19:26 ERR

[395] 53f45b9a-25e3-4084-b2e1-41019eaff6ae: cmd: ruby -r 'yaml' -e 'y = YAML.load_file("/etc/astute.yaml"); y["nodes"] = YAML.load_file("/tmp/astute.yaml"); File.open("/etc/astute.yaml", "w") { |f| f.write y.to_yaml }'; puppet apply --logdest syslog --debug -e '$settings=parseyaml($::astute_settings_yaml) $nodes_hash=$settings["nodes"] class {"l23network::hosts_file": nodes => $nodes_hash }'
                                               mcollective error: 53f45b9a-25e3-4084-b2e1-41019eaff6ae: MCollective agents '14' didn't respond within the allotted time.

2014-08-07 17:19:26 ERR

[395] MCollective agents '14' didn't respond within the allotted time.

2014-08-07 17:17:26 ERR

[395] 53f45b9a-25e3-4084-b2e1-41019eaff6ae: mcollective upload_file agent error: 53f45b9a-25e3-4084-b2e1-41019eaff6ae: MCollective agents '14' didn't respond within the allotted time.

2014-08-07 17:17:26 ERR

[395] MCollective agents '14' didn't respond within the allotted time.

Logs are here https://drive.google.com/a/mirantis.com/file/d/0B6SjzarTGFxaMlVZMnVkZmtuMzQ/edit?usp=sharing

Tags: ceph puppet
Vladimir Sharshov (vsharshov) wrote :

Waiting for Nastya to reproduce the bug. Potentially connected with https://bugs.launchpad.net/fuel/+bug/1353389 (the two may share a common root cause in MCollective).

summary: - HA deployment with controllers+ceph has failed
+ [astute] HA deployment with controllers+ceph has failed
tags: added: astute mco
Changed in fuel:
status: New → Triaged
summary: - [astute] HA deployment with controllers+ceph has failed
+ [fuel] HA deployment with primary-controller+ceph+neutron has failed
tags: added: ceph puppet
removed: astute mco
Vladimir Sharshov (vsharshov) wrote :

Results of investigating HA deployments:
* controller + neutron vlan -> success
* controller + ceph + simple network -> success
* controller + ceph + neutron vlan -> fail

Symptoms: the node with the primary-controller and ceph roles goes offline in the middle of the ceph deployment. After that, the node is unreachable via mco or ssh.

Logs: https://drive.google.com/file/d/0Bz7Fsls9aSjkOVhWdlVITEktY0k/edit?usp=sharing

Anastasia Palkina (apalkina) wrote :
Vladimir Sharshov (vsharshov) wrote :

Summary of errors extracted from the Puppet log:

Fri Aug 08 16:09:13 +0000 2014 /Stage[main]/Ceph::Mon/Exec[ceph-deploy mon create]/unless (debug): 2014-08-08 16:09:13.812275 7f2172418700 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
Fri Aug 08 16:09:35 +0000 2014 /Stage[main]/Swift::Ringbuilder/Swift::Ringbuilder::Rebalance[container]/Exec[hours_passed_container]/returns (notice): TypeError: 'NoneType' object does not support item assignment
Fri Aug 08 16:09:37 +0000 2014 /Stage[main]/Swift::Ringbuilder/Swift::Ringbuilder::Rebalance[account]/Exec[hours_passed_account]/returns (notice): TypeError: 'NoneType' object does not support item assignment
Fri Aug 08 16:09:39 +0000 2014 /Stage[main]/Swift::Ringbuilder/Swift::Ringbuilder::Rebalance[object]/Exec[hours_passed_object]/returns (notice): TypeError: 'NoneType' object does not support item assignment
Fri Aug 08 16:17:18 +0000 2014 /Stage[main]/Mysql::Password/Exec[set_mysql_rootpw]/unless (debug): error: 'Access denied for user 'root'@'localhost' (using password: YES)'
Fri Aug 08 18:43:40 +0000 2014 /Stage[main]/Ceph::Osd/Exec[ceph-deploy osd prepare]/returns (notice): [node-3][WARNING] ceph-disk: Error: Device is mounted: /dev/sdb2
Fri Aug 08 18:43:40 +0000 2014 /Stage[main]/Ceph::Osd/Exec[ceph-deploy osd prepare]/returns (notice): [node-3][ERROR ] RuntimeError: command returned non-zero exit status: 1
Fri Aug 08 18:43:44 +0000 2014 /Stage[main]/Ceph::Osd/Exec[ceph-deploy osd prepare]/returns (notice): [node-3][WARNING] OSError: [Errno 16] Device or resource busy: '/var/lib/ceph/tmp/mnt.gH3HgL'
Fri Aug 08 18:43:44 +0000 2014 /Stage[main]/Ceph::Osd/Exec[ceph-deploy osd prepare]/returns (notice): [node-3][ERROR ] RuntimeError: command returned non-zero exit status: 1
Fri Aug 08 18:43:44 +0000 2014 /Stage[main]/Ceph::Osd/Exec[ceph-deploy osd prepare]/returns (notice): [ceph_deploy][ERROR ] GenericError: Failed to create 2 OSDs

The keys are present in /var/lib/astute/ceph/ (both public and private).

Vladimir Sharshov (vsharshov) wrote :

English version:

Error message from Puppet logs:

Fri Aug 08 16:09:13 +0000 2014 /Stage[main]/Ceph::Mon/Exec[ceph-deploy mon create]/unless (debug): 2014-08-08 16:09:13.812275 7f2172418700 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
Fri Aug 08 16:09:35 +0000 2014 /Stage[main]/Swift::Ringbuilder/Swift::Ringbuilder::Rebalance[container]/Exec[hours_passed_container]/returns (notice): TypeError: 'NoneType' object does not support item assignment
Fri Aug 08 16:09:37 +0000 2014 /Stage[main]/Swift::Ringbuilder/Swift::Ringbuilder::Rebalance[account]/Exec[hours_passed_account]/returns (notice): TypeError: 'NoneType' object does not support item assignment
Fri Aug 08 16:09:39 +0000 2014 /Stage[main]/Swift::Ringbuilder/Swift::Ringbuilder::Rebalance[object]/Exec[hours_passed_object]/returns (notice): TypeError: 'NoneType' object does not support item assignment
Fri Aug 08 16:17:18 +0000 2014 /Stage[main]/Mysql::Password/Exec[set_mysql_rootpw]/unless (debug): error: 'Access denied for user 'root'@'localhost' (using password: YES)'
Fri Aug 08 18:43:40 +0000 2014 /Stage[main]/Ceph::Osd/Exec[ceph-deploy osd prepare]/returns (notice): [node-3][WARNING] ceph-disk: Error: Device is mounted: /dev/sdb2
Fri Aug 08 18:43:40 +0000 2014 /Stage[main]/Ceph::Osd/Exec[ceph-deploy osd prepare]/returns (notice): [node-3][ERROR ] RuntimeError: command returned non-zero exit status: 1
Fri Aug 08 18:43:44 +0000 2014 /Stage[main]/Ceph::Osd/Exec[ceph-deploy osd prepare]/returns (notice): [node-3][WARNING] OSError: [Errno 16] Device or resource busy: '/var/lib/ceph/tmp/mnt.gH3HgL'
Fri Aug 08 18:43:44 +0000 2014 /Stage[main]/Ceph::Osd/Exec[ceph-deploy osd prepare]/returns (notice): [node-3][ERROR ] RuntimeError: command returned non-zero exit status: 1
Fri Aug 08 18:43:44 +0000 2014 /Stage[main]/Ceph::Osd/Exec[ceph-deploy osd prepare]/returns (notice): [ceph_deploy][ERROR ] GenericError: Failed to create 2 OSDs

Both keys are present in /var/lib/astute/ceph/ on the problem node.
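
For anyone repeating this kind of triage, here is a minimal Python sketch of how error lines like the ones above could be pulled out of a Puppet log. It assumes the line layout seen in this report (timestamp, resource path, level in parentheses, message). The names `LINE_RE`, `ERROR_RE` and `extract_errors` are hypothetical and not part of Fuel or Puppet, and the third sample line is made up to show a non-matching entry.

```python
import re
from datetime import datetime

# Layout assumed from the lines quoted in this report:
# "<timestamp> <resource path> (<level>): <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \w{3} \d{2} \d{2}:\d{2}:\d{2} [+-]\d{4} \d{4}) "
    r"(?P<resource>/.+?) \((?P<level>\w+)\): (?P<message>.*)$"
)
# Case-insensitive, so it also catches RuntimeError, OSError, "error:" etc.
ERROR_RE = re.compile(r"error", re.IGNORECASE)


def extract_errors(lines):
    """Return (timestamp, resource, level, message) for lines mentioning an error."""
    found = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and ERROR_RE.search(match["message"]):
            ts = datetime.strptime(match["ts"], "%a %b %d %H:%M:%S %z %Y")
            found.append((ts, match["resource"], match["level"], match["message"]))
    return found


sample = [
    "Fri Aug 08 18:43:40 +0000 2014 /Stage[main]/Ceph::Osd/Exec[ceph-deploy osd prepare]/returns (notice): [node-3][WARNING] ceph-disk: Error: Device is mounted: /dev/sdb2",
    "Fri Aug 08 18:43:44 +0000 2014 /Stage[main]/Ceph::Osd/Exec[ceph-deploy osd prepare]/returns (notice): [ceph_deploy][ERROR ] GenericError: Failed to create 2 OSDs",
    # Hypothetical non-error line, not taken from the real log.
    "Fri Aug 08 18:43:45 +0000 2014 /Stage[main]/Example/Exec[noop]/returns (notice): finished",
]

for ts, resource, level, message in extract_errors(sample):
    print(ts.isoformat(), level, resource, message, sep=" | ")
```

The match is deliberately broad, so it will also flag benign lines that merely contain the word "error".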

Vladimir Sharshov (vsharshov) wrote :

The fix for Neutron https://launchpad.net/bugs/1352203 did not help here. The last cluster was deployed with it.

Changed in fuel:
assignee: Vladimir Sharshov (vsharshov) → Fuel Library Team (fuel-library)
Dmitry Ilyin (idv1985)
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Dmitry Ilyin (idv1985)
Changed in fuel:
assignee: Dmitry Ilyin (idv1985) → Dmitry Borodaenko (dborodaenko)
Dmitry Ilyin (idv1985) wrote :

Research shows that we have the same problem that we previously had with the nova and cinder APIs.
https://bugs.launchpad.net/oslo/+bug/1101404

The problem is that when rsyslog is restarted, for example when a new log config is added, neutron keeps sending messages to the old unix socket. The sends fail, and neutron enters an endless loop that consumes 100% of the CPU, so the deployment fails. Sometimes nodes even go offline.

This can be fixed by something like this: http://paste.openstack.org/show/93408/
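
To illustrate the failure mode (a logger holding a unix socket that the restarted syslog daemon no longer listens on), here is a minimal Python sketch. It is not the patch from the paste above, nor the upstream Python or eventlet fix. `ReconnectingUnixLogHandler`, `max_retries` and `dropped` are hypothetical names, and for brevity the handler sends the formatted text as-is, without a syslog priority prefix. The idea is to reopen the socket a bounded number of times when a send fails and then drop the message, so a stale socket cannot turn into an endless retry loop.

```python
import logging
import socket


class ReconnectingUnixLogHandler(logging.Handler):
    """Send records to a unix datagram socket, reconnecting if it goes stale."""

    def __init__(self, address, max_retries=1):
        super().__init__()
        self.address = address          # path of the unix socket, e.g. the syslog socket
        self.max_retries = max_retries  # bounded, so a dead daemon cannot cause a busy loop
        self.sock = None
        self.dropped = 0                # messages given up on after all retries

    def _connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        try:
            sock.connect(self.address)
        except OSError:
            sock.close()
            raise
        self.sock = sock

    def emit(self, record):
        data = self.format(record).encode("utf-8")
        for _attempt in range(self.max_retries + 1):
            try:
                if self.sock is None:
                    self._connect()
                self.sock.send(data)
                return
            except OSError:
                # The peer went away (for example the daemon was restarted):
                # discard the old socket so the next attempt reconnects.
                if self.sock is not None:
                    self.sock.close()
                    self.sock = None
        # Give up on this message instead of retrying forever.
        self.dropped += 1

    def close(self):
        if self.sock is not None:
            self.sock.close()
            self.sock = None
        super().close()
```

With `max_retries=1`, a message sent right after the daemon restarts is delivered on the second attempt. If the socket path is gone entirely, the message is counted in `dropped` and `emit` returns.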

But the MOS guys previously decided that it's a Python bug and should be fixed there: http://bugs.python.org/issue15179
https://github.com/eventlet/eventlet/issues/63

There are rumors that it's already fixed in CentOS:
http://hg.python.org/cpython/rev/99f0c0207faa
But this deployment is on Ubuntu.

Please port this fix either to Python or to Eventlet. It can also affect other OpenStack packages.

Changed in fuel:
status: Triaged → Confirmed
assignee: Dmitry Borodaenko (dborodaenko) → MOS Neutron (mos-neutron)
Changed in fuel:
assignee: MOS Neutron (mos-neutron) → MOS Linux (mos-linux)
Roman Podoliaka (rpodolyaka) wrote :

If Dima is right, this should be marked as a duplicate of https://bugs.launchpad.net/mos/+bug/1342068

Andrew Woodward (xarses) wrote :

does not reproduce with 423

Changed in fuel:
status: Confirmed → Incomplete