Overcloud deployment with Ceph fails during ControllerDeployment_Step4: "ObjectNotFound: error opening pool 'metrics'"

Bug #1749544 reported by John Fulton on 2018-02-14

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
tripleo	Fix Released	High	John Fulton	tripleo queens-rc1

Bug Description

When deploying TripleO with Ceph Luminous, the deployment fails at step 4 because the Ceph 'metrics' pool was not created.

(undercloud) [stack@undercloud74 ~]$ echo -e `heat deployment-show 37bc65f5-6986-4f27-b583-986333b648a4`|grep -i error
WARNING (shell) "heat deployment-show" is deprecated, please use "openstack software deployment show" instead
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:344: SubjectAltNameWarning: Certificate for 192.168.0.2 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
SubjectAltNameWarning
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:344: SubjectAltNameWarning: Certificate for 192.168.0.2 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
SubjectAltNameWarning
\"Error running ['docker', 'run', '--name', 'gnocchi_db_sync', '--label', 'config_id=tripleo_step4', '--label', 'container_name=gnocchi_db_sync', '--label', 'managed_by=paunch', '--label', 'config_data={\\"environment\\": [\\"KOLLA_CONFIG_STRATEGY=COPY_ALWAYS\\", \\"TRIPLEO_CONFIG_HASH=1a569d012dc804939398b671bf257703\\"], \\"user\\": \\"root\\", \\"volumes\\": [\\"/etc/hosts:/etc/hosts:ro\\", \\"/etc/localtime:/etc/localtime:ro\\", \\"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro\\", \\"/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro\\", \\"/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro\\", \\"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\\", \\"/dev/log:/dev/log\\", \\"/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro\\", \\"/etc/puppet:/etc/puppet:ro\\", \\"/var/lib/kolla/config_files/gnocchi_db_sync.json:/var/lib/kolla/config_files/config.json:ro\\", \\"/var/lib/config-data/puppet-generated/gnocchi/:/var/lib/kolla/config_files/src:ro\\", \\"/var/log/containers/gnocchi:/var/log/gnocchi\\", \\"/var/log/containers/httpd/gnocchi-api:/var/log/httpd\\", \\"/etc/ceph:/var/lib/kolla/config_files/src-ceph:ro\\"], \\"image\\": \\"192.168.0.1:8787/rhosp13/openstack-gnocchi-api:13.0-20180112.1\\", \\"detach\\": false, \\"net\\": \\"host\\", \\"privileged\\": false}', '--env=KOLLA_CONFIG_STRATEGY=COPY_ALWAYS', '--env=TRIPLEO_CONFIG_HASH=1a569d012dc804939398b671bf257703', '--net=host', '--privileged=false', '--user=root', '--volume=/etc/hosts:/etc/hosts:ro', '--volume=/etc/localtime:/etc/localtime:ro', '--volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '--volume=/dev/log:/dev/log', '--volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro', 
'--volume=/etc/puppet:/etc/puppet:ro', '--volume=/var/lib/kolla/config_files/gnocchi_db_sync.json:/var/lib/kolla/config_files/config.json:ro', '--volume=/var/lib/config-data/puppet-generated/gnocchi/:/var/lib/kolla/config_files/src:ro', '--volume=/var/log/containers/gnocchi:/var/log/gnocchi', '--volume=/var/log/containers/httpd/gnocchi-api:/var/log/httpd', '--volume=/etc/ceph:/var/lib/kolla/config_files/src-ceph:ro', '192.168.0.1:8787/rhosp13/openstack-gnocchi-api:13.0-20180112.1']. [1]\",
\"ObjectNotFound: error opening pool 'metrics'\",
(undercloud) [stack@undercloud74 ~]$

John Fulton (jfulton-org) wrote on 2018-02-14:

Root Cause:

The pools were not created, and Ansible [1] returned the following message from Ceph:

"Error ERANGE: pg_num 128 size 3 would mean 768 total pgs, which exceeds max 600 (mon_max_pg_per_osd 200 * num_in_osds 3)"

The workaround is to change any of the three variables above (pg_num, size, or mon_max_pg_per_osd) so that the following function is satisfied when creating the seven pools that OpenStack uses by default:

https://github.com/ceph/ceph/blob/e59258943bcfe3e52d40a59ff30df55e1e6a3865/src/mon/OSDMonitor.cc#L5670-L5698

This is new in Queens because it uses Luminous, which has the above check. The problem is that EVERY Queens deployment that doesn't override the defaults will hit this.
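To make the check concrete, here is a minimal Python sketch of it. It is a simplified model reconstructed from the quantities named in the error message above (projected PG replicas versus mon_max_pg_per_osd * num_in_osds), not a copy of the Ceph code; the function names and the assumption that all pools share the same pg_num and size are mine.

```python
def check_pg_num(pg_num, size, existing_pools, mon_max_pg_per_osd=200, num_in_osds=3):
    """Model the pool-creation check for one new pool.

    existing_pools is a list of (pg_num, size) tuples for pools already created.
    Returns (projected, max_pgs, allowed).
    """
    # Every PG is stored `size` times, so each pool contributes pg_num * size.
    projected = pg_num * size + sum(p * s for p, s in existing_pools)
    max_pgs = mon_max_pg_per_osd * num_in_osds
    return projected, max_pgs, projected <= max_pgs


def pools_created(count, pg_num, size, mon_max_pg_per_osd=200, num_in_osds=3):
    """Create `count` identical pools one at a time and return how many succeed."""
    existing = []
    for _ in range(count):
        _, _, allowed = check_pg_num(
            pg_num, size, existing, mon_max_pg_per_osd, num_in_osds
        )
        if not allowed:
            # The pools are identical, so every later attempt would be rejected too.
            break
        existing.append((pg_num, size))
    return len(existing)


if __name__ == "__main__":
    # Second pool with the defaults: reproduces the 768 > 600 from the error message.
    print(check_pg_num(128, 3, [(128, 3)]))  # (768, 600, False)
    # Only the first of the seven pools fits under the default limit.
    print(pools_created(7, 128, 3))  # 1
```

Under this model the first pool (384 PG replicas) is accepted and the second is rejected, which matches the "768 total pgs" figure in the log.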

Here's one workaround which satisfies the function above:

parameter_defaults:
  CephPoolDefaultSize: 3
  CephPoolDefaultPgNum: 128
  CephConfigOverrides:
    mon_max_pg_per_osd: 3072

In the above case I raised mon_max_pg_per_osd to 3072, a round value above (* 128 3 7) = 2688, the total PG count for seven pools with these settings.
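The arithmetic behind that override can be checked with a few lines of Python. This is only a sketch: the helper names are hypothetical, and the value of 3 "in" OSDs is taken from the error message above. It shows that seven pools at pg_num 128 and size 3 need 2688 PG replicas, that the next power of two above that is 4096 (so 3072 is simply a round value above the total), and that because the limit is multiplied by num_in_osds, a per-OSD value of 896 would already pass with 3 OSDs.

```python
def total_pgs(pg_num, size, pool_count):
    """PG replicas needed for pool_count identical pools."""
    return pg_num * size * pool_count


def min_mon_max_pg_per_osd(total, num_in_osds):
    """Smallest per-OSD limit for which total <= limit * num_in_osds."""
    return -(-total // num_in_osds)  # integer ceiling division


def next_power_of_two(n):
    """Smallest power of two strictly greater than n (n >= 0)."""
    return 1 << n.bit_length()


if __name__ == "__main__":
    total = total_pgs(128, 3, 7)
    print(total)                             # 2688
    print(next_power_of_two(total))          # 4096
    print(min_mon_max_pg_per_osd(total, 3))  # 896
    print(total <= 3072 * 3)                 # True
```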

[1] grep Error /var/log/mistral/ceph-install-workflow.log | grep 128
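If you want the numbers out of that log line rather than just the matching text, a small parser helps. The following is a sketch that assumes the message has exactly the wording quoted above; the regular expression and the function name are illustrative, and other Ceph versions may word the message differently.

```python
import re

# Pattern for the ERANGE message as quoted in this report.
ERANGE_RE = re.compile(
    r"pg_num (?P<pg_num>\d+) size (?P<size>\d+) "
    r"would mean (?P<projected>\d+) total pgs, "
    r"which exceeds max (?P<max_pgs>\d+) "
    r"\(mon_max_pg_per_osd (?P<mon_max_pg_per_osd>\d+) "
    r"\* num_in_osds (?P<num_in_osds>\d+)\)"
)


def parse_erange(line):
    """Return the numeric fields of an ERANGE pool-creation error, or None."""
    match = ERANGE_RE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


if __name__ == "__main__":
    sample = (
        "Error ERANGE: pg_num 128 size 3 would mean 768 total pgs, which "
        "exceeds max 600 (mon_max_pg_per_osd 200 * num_in_osds 3)"
    )
    fields = parse_erange(sample)
    print(fields["projected"] - fields["max_pgs"])  # 168 PG replicas over the limit
```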

John Fulton (jfulton-org) wrote on 2018-02-14:

THT's low-memory-usage.yaml [1] fits this pattern so I will put something like the workaround from my last comment there. Reasoning:

- The defaults should fit the minimum supported production deployment and those testing with less than that should have an easy way to override them provided they understand it's not for production.

Next steps:
- upstream code change to THT low-memory-usage.yaml
- In addition, a solution to https://bugs.launchpad.net/tripleo/+bug/1721817 would spare the user from discovering the issue at step 4, since it could have been caught at step 2 instead. It might be better to solve this directly in ceph-ansible.

[1] https://github.com/openstack/tripleo-heat-templates/blob/master/environments/low-memory-usage.yaml

OpenStack Infra (hudson-openstack) wrote on 2018-02-14: Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/544588

Changed in tripleo:
status:	Triaged → In Progress

John Fulton (jfulton-org) wrote on 2018-02-14:

To avoid this issue, either:

1. Use hardware that complies with ceph recommended practices
2. Override the defaults if using a development or test-only environment

To make #2 easier, pass '-e environments/low-memory-usage.yaml' to your deployment; once the proposed change to that file merges, the issue should go away.

https://review.openstack.org/#/c/544588/

OpenStack Infra (hudson-openstack) wrote on 2018-02-17: Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/544588
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=fa026d64408e5a158f2f03de440f6c854de8045b
Submitter: Zuul
Branch: master

commit fa026d64408e5a158f2f03de440f6c854de8045b
Author: John Fulton <email address hidden>
Date: Wed Feb 14 18:06:28 2018 +0000

Add non-production ceph defaults to low-memory-usage.yaml

    Ceph Luminous does not create a pool if the pg_number,
    pool size, and mon_max_pg_per_osd are outside of Ceph
    recommended practice for production clusters. TripleO
    development environments which use low-memory-usage.yaml
    may not meet these criteria and fail a deployment with
    Luminous unless the defaults for these values are overridden
    as in this change.

Change-Id: I12ee495b780f29fc098c5c3bd57c46fd946146ae
Closes-Bug: #1749544

Changed in tripleo:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2018-03-03: Fix included in openstack/tripleo-heat-templates 8.0.0.0rc1

This issue was fixed in the openstack/tripleo-heat-templates 8.0.0.0rc1 release candidate.

John Fulton (jfulton-org) wrote on 2020-12-04:

I want to emphasize that you shouldn't do the following hack if you care about your data:

CephConfigOverrides:
  mon_max_pg_per_osd: 3072

It was only meant as a way to bypass the overdose protection check (which is trying to help you protect your data).

This value should stay near 200.
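Since pg_num and pool size are the other two variables in the check, a test environment can stay under the default limit by lowering those instead. The sketch below estimates the largest power-of-two pg_num that fits; the function name is hypothetical, it assumes all pools share one pg_num and size, and it assumes you want a power-of-two pg_num (a common Ceph convention, not something the check itself requires). Treat the result as a starting point for development setups only, not as sizing advice for production.

```python
def largest_pg_num(pool_count, size, mon_max_pg_per_osd=200, num_in_osds=3):
    """Largest power-of-two pg_num so pool_count identical pools pass the check.

    Returns 0 if even pg_num = 1 would exceed the limit.
    """
    budget = mon_max_pg_per_osd * num_in_osds
    per_pool = budget // (pool_count * size)
    if per_pool < 1:
        return 0
    # Round down to the nearest power of two.
    return 1 << (per_pool.bit_length() - 1)


if __name__ == "__main__":
    # Seven pools, 3 OSDs, default limit of 200 PGs per OSD.
    print(largest_pg_num(7, 3))  # 16 (16 * 3 * 7 = 336 <= 600)
    print(largest_pg_num(7, 1))  # 64 (64 * 1 * 7 = 448 <= 600)
```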
