permission denied: tripleo_ceph_work_dir : symbolic link to tripleo inventory from ceph-ansible work directory

Bug #1880579 reported by John Fulton
This bug affects 3 people
Affects   Status      Importance   Assigned to   Milestone
tripleo   Confirmed   High         Unassigned

Bug Description

When deploying 1 controller, 1 compute, 1 ceph-storage my deployment fails like this:

TASK [tripleo_ceph_work_dir : symbolic link to tripleo inventory from ceph-ansible work directory] ***
task path: /usr/share/ansible/roles/tripleo_ceph_work_dir/tasks/prepare.yml:29
Monday 25 May 2020 14:35:28 +0000 (0:00:00.867) 0:03:52.188 ************
fatal: [undercloud]: FAILED! => changed=false
  msg: 'Error while linking: [Errno 13] Permission denied: b''/home/stack/config-download/overcloud/tripleo-ansible-inventory.yaml'' -> b''/home/stack/config-download/overcloud/ceph-ansible/inventory.yml'''
  path: /home/stack/config-download/overcloud/ceph-ansible/inventory.yml

I'm using: tripleo-ansible-1.3.1-0.20200515212929.8517eb7.el8.noarch

The location of the inventory is:

 /home/stack/config-download/config-download-latest

The permissions are:

(undercloud) [CentOS-8.1 - stack@undercloud ~]$ ls -ld config-download
drwxrwxr-x+ 3 root root 53 May 25 14:59 config-download
(undercloud) [CentOS-8.1 - stack@undercloud ~]$ ls -ld config-download/config-download-latest/
drwxrwxr-x+ 12 stack stack 4096 May 25 14:35 config-download/config-download-latest/
(undercloud) [CentOS-8.1 - stack@undercloud ~]$ ls -ld config-download/config-download-latest/tripleo-ansible-inventory.yaml
-rw-rwxr--+ 1 stack stack 8358 May 25 14:59 config-download/config-download-latest/tripleo-ansible-inventory.yaml
(undercloud) [CentOS-8.1 - stack@undercloud ~]$
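A note on reading the listing above: in an `ls -l` mode string the first character is the file type, the next nine are the owner, group, and other permissions, and a trailing `+` indicates that an ACL is attached, so the mode bits alone may not tell the whole story. The following is a minimal Python sketch that splits such a string; `describe_mode` and `other_can_write` are hypothetical helper names, not part of TripleO.

```python
def describe_mode(mode):
    """Split an `ls -l` mode string such as 'drwxrwxr-x+' into its parts."""
    bits = mode[1:10]
    return {
        "type": mode[0],                # 'd' for directory, '-' for regular file
        "owner": bits[0:3],
        "group": bits[3:6],
        "other": bits[6:9],
        "has_acl": mode.endswith("+"),  # '+' means an ACL is also present
    }


def other_can_write(mode):
    """True if users outside the owner and group get write access from the mode bits."""
    return "w" in describe_mode(mode)["other"]


# Mode strings copied from the listing above.
print(describe_mode("drwxrwxr-x+"))
print(other_can_write("drwxrwxr-x+"))
```

For config-download (owned by root:root), the mode bits give write access only to root and the root group. Whether the stack user can create entries there depends on group membership or on what the ACL grants, which the listing does not show.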

The code to make this symlink is:

https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/roles/tripleo_ceph_work_dir/tasks/prepare.yml#L29-L41
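The format of the failure message (an errno followed by 'source' -> 'destination') matches what Python's os.symlink raises when the link cannot be created. The sketch below only illustrates that failure mode and is not the role's code; `try_symlink` is a hypothetical helper and the paths in the usage lines are simply the ones from the error.

```python
import os


def try_symlink(src, dest):
    """Try to create dest as a symlink pointing at src.

    Returns None on success, or an error string on failure.
    """
    try:
        os.symlink(src, dest)
    except OSError as exc:
        # PermissionError (Errno 13) is a subclass of OSError and is what
        # a user without write access to dest's directory would hit.
        return f"Error while linking: {exc}"
    return None


result = try_symlink(
    "/home/stack/config-download/overcloud/tripleo-ansible-inventory.yaml",
    "/home/stack/config-download/overcloud/ceph-ansible/inventory.yml",
)
print(result)
```

Running it as the same user that Ansible uses for the task should show whether the problem can be reproduced outside of Ansible.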

I'm using a virtual undercloud deployed by tripleo-operator-ansible (tripleo-lab) with:

 tripleo_repos_version: tripleo-ci-testing

which I built yesterday, and I am deploying as described in:

https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/provisioning/baremetal_provision.html

I'm not using the standalone method to deploy.

Giulio Fidente (gfidente) wrote :

I guess when using --no-config-download the ansible user doesn't have privileges to write into the config-download dir or doesn't have +x on one of the dirs in the leading path (like it was for /var/lib/mistral)
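One way to test this guess is to walk the leading path and check each directory for search (+x) permission, then check the target directory for write permission. This is a rough sketch, assuming it is run as the same user that Ansible uses for the task; `find_blocking_components` is a hypothetical name.

```python
import os
from pathlib import Path


def find_blocking_components(target_dir):
    """Return (path, problem) pairs that would stop the current user
    from creating a file or symlink inside target_dir."""
    problems = []
    target = Path(target_dir).absolute()
    # Every directory in the leading path must be searchable (+x).
    for ancestor in reversed(target.parents):
        if not os.access(ancestor, os.X_OK):
            problems.append((str(ancestor), "not searchable (no +x, or missing)"))
    # The directory itself needs both +w and +x to add an entry.
    if not os.access(target, os.W_OK | os.X_OK):
        problems.append((str(target), "not writable (no +w/+x, or missing)"))
    return problems


for path, problem in find_blocking_components(
    "/home/stack/config-download/overcloud/ceph-ansible"
):
    print(f"{path}: {problem}")
```

os.access asks the kernel, so ACL entries are taken into account, but it checks the real user ID and therefore has to be run as the affected user.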

John Fulton (jfulton-org) wrote :

Adding 'become: true' gets me past this error and onto the next one...

 http://paste.openstack.org/show/793960/

I guess we need yet another 'become: true'. I'm only discovering this because I'm not using --stack-only, because of:

 https://bugs.launchpad.net/tripleo/+bug/1880577

Vasileios Baousis (bbaous) wrote :

I have exactly the same problem. Adding become: true creates the link, but not properly, because the link has the wrong permissions.
Even if the permissions on the entire ~/config-download/overcloud/ceph-ansible/ directory are set to allow all, creation of all the files in this directory still fails, for example:
/ceph-ansible/group_vars/all.yml
/ceph-ansible/extra_vars.yml
/ceph-ansible/external_{{ item.cluster }}_extra_vars.yml

and so does the rest of the code in /usr/share/ansible/roles/tripleo_ceph_work_dir/tasks/prepare.yml

John Fulton (jfulton-org) wrote :

WORKAROUND:

Run config-download manually as described in the following docs. This will work around the permissions issue, as the same user running ansible will have write access to the directory. However, we still need to fix this bug.

https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/deployment/ansible_config_download.html#manual-config-download

wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-rc3 → victoria-1
Vasileios Baousis (bbaous) wrote :

The workaround does not seem to work either.
I now get the following error:
 ./ansible-playbook-command.sh
Running Ansible command
ERROR! the role 'ceph' was not found in /home/stack/config-download/roles:/home/stack/.ansible/roles:/usr/share/ansible/roles:/etc/ansible/roles:/home/stack/config-download

The error appears to be in '/home/stack/config-download/external_deploy_steps_tasks.yaml': line 13, column 13, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

  - import_role:
      role: ceph
            ^ here

John Fulton (jfulton-org) wrote :

For this error:

 ERROR! the role 'ceph' was not found in /home/stack/config-download/roles:/home/stack/.ansible/roles:/usr/share/ansible/roles:/etc/ansible/roles:/home/stack/config-download

It looks like config-download was run manually without an ansible.cfg which references all the ansible modules on the undercloud. TripleO can generate an appropriate ansible.cfg in your home directory with:

 openstack tripleo config generate ansible

You can then move it to the config-download directory and re-run.
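Before re-running, it may help to confirm which directories on the search path from the error actually contain the role. The sketch below is a simplification: Ansible also resolves roles in other ways (for example relative to the playbook), and `find_role` is a hypothetical helper, not an Ansible API. The search path is copied from the error message above.

```python
import os


def find_role(role, roles_path):
    """Return the directories in a colon-separated roles path that
    contain a subdirectory named after the role."""
    return [
        directory
        for directory in roles_path.split(":")
        if directory and os.path.isdir(os.path.join(directory, role))
    ]


roles_path = (
    "/home/stack/config-download/roles:/home/stack/.ansible/roles:"
    "/usr/share/ansible/roles:/etc/ansible/roles:/home/stack/config-download"
)
print(find_role("ceph", roles_path))
```

An empty list would be consistent with the error; a non-empty one would suggest the problem lies elsewhere, such as in which ansible.cfg is being picked up.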

Changed in tripleo:
importance: Critical → High
Vasileios Baousis (bbaous) wrote :

Another variable needs to be defined. ceph-ansible is installed, but it cannot be found:

PLAY [External deployment step 1] *****************************************************************************************************************************************************************************

TASK [External deployment step 1] *****************************************************************************************************************************************************************************
Thursday 28 May 2020 20:31:06 +0000 (0:00:01.747) 0:05:08.648 **********
ok: [undercloud -> localhost] => {
    "msg": "Use --start-at-task 'External deployment step 1' to resume from this task"
}

TASK [ensure ceph-ansible is installed] ***********************************************************************************************************************************************************************
Thursday 28 May 2020 20:31:06 +0000 (0:00:00.089) 0:05:08.738 **********

TASK [ceph : Check if ceph-ansible is installed] **************************************************************************************************************************************************************
Thursday 28 May 2020 20:31:06 +0000 (0:00:00.136) 0:05:08.875 **********
ok: [undercloud]

TASK [ceph : Warn about missing ceph-ansible] *****************************************************************************************************************************************************************
Thursday 28 May 2020 20:31:06 +0000 (0:00:00.240) 0:05:09.115 **********
skipping: [undercloud]

TASK [ceph : Fail if ceph-ansible is missing] *****************************************************************************************************************************************************************
Thursday 28 May 2020 20:31:06 +0000 (0:00:00.088) 0:05:09.203 **********
skipping: [undercloud]

TASK [ceph : Get ceph-ansible repository] *********************************************************************************************************************************************************************
Thursday 28 May 2020 20:31:06 +0000 (0:00:00.087) 0:05:09.291 **********
[WARNING]: Consider using the yum module rather than running 'yum'. If you need to use command because yum is insufficient you can add 'warn: false' to this command task or set 'command_warnings=False' in
ansible.cfg to get rid of this message.
ok: [undercloud]

TASK [ceph : Fail if ceph-ansible doesn't belong to the specified repo] ***************************************************************************************************************************************
Thursday 28 May 2020 20:31:27 +0000 (0:00:21.073) 0:05:30.365 **********
[WARNING]: conditional statements should not include jinja2 templating delimiters such as {{ }} or {% %}. Found: (repo.stdout | length == 0 or repo.stdout != "{{ ceph_ansible_repo }}")
fatal: [undercloud]: FAILED! => {"changed": false, "msg": "Make sure ceph-ansible package is installed from centos-ceph-nautilus or configure the repo name you intend to install it from using the 'CephAnsibleRepo' variable provided by tripleo...


Vasileios Baousis (bbaous) wrote :

In my template
/home/stack/templates/ceph-config.yaml I added the variable

CephAnsibleRepo: "tripleo-centos-ceph-nautilus" and that resolved the problem above.
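The conditional quoted in the Ansible warning earlier in the thread, `repo.stdout | length == 0 or repo.stdout != "{{ ceph_ansible_repo }}"`, can be restated in Python to show why changing the variable helps. This is an illustration only. It assumes the package was installed from a repo named tripleo-centos-ceph-nautilus (inferred from the fix above), and that the expected name was centos-ceph-nautilus (taken from the error message).

```python
def repo_check_fails(repo_stdout, ceph_ansible_repo):
    """Mirror of the quoted conditional: fail when no repo was detected
    or when the detected repo differs from the expected one."""
    return len(repo_stdout) == 0 or repo_stdout != ceph_ansible_repo


installed_from = "tripleo-centos-ceph-nautilus"  # assumed, inferred from the fix

# Expected name from the error message: the check fails.
print(repo_check_fails(installed_from, "centos-ceph-nautilus"))

# With CephAnsibleRepo set to match the installed repo: the check passes.
print(repo_check_fails(installed_from, "tripleo-centos-ceph-nautilus"))
```

The comparison is an exact string match, so a repo name that merely contains the expected name still fails the check.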

Vasileios Baousis (bbaous) wrote :

UPDATE
The external ceph configuration completes now but it fails right after with the following error. Any idea how to proceed?

TASK [tripleo_ceph_run_ansible : search output of ceph-ansible run(s) non-zero return codes] ******************************************************************************************************************
Friday 29 May 2020 12:21:35 +0000 (0:01:58.262) 0:19:24.058 ************
skipping: [undercloud] => (item=None)
skipping: [undercloud]

TASK [tripleo_ceph_run_ansible : print ceph-ansible output in case of failure] ********************************************************************************************************************************
Friday 29 May 2020 12:21:36 +0000 (0:00:00.837) 0:19:24.896 ************
skipping: [undercloud]

TASK [ceph : Check if ceph_mon is deployed] *******************************************************************************************************************************************************************
Friday 29 May 2020 12:21:36 +0000 (0:00:00.200) 0:19:25.096 ************
fatal: [undercloud]: UNREACHABLE! => {"changed": false, "msg": "Data could not be sent to remote host \"undercloud\". Make sure this host can be reached over ssh: ssh: Could not resolve hostname undercloud: Name or service not known\r\n", "unreachable": true}

PLAY RECAP ****************************************************************************************************************************************************************************************************
overcloud-controller-0 : ok=417 changed=196 unreachable=0 failed=0 skipped=208 rescued=0 ignored=0
overcloud-controller-1 : ok=411 changed=196 unreachable=0 failed=0 skipped=208 rescued=0 ignored=0
overcloud-controller-2 : ok=411 changed=196 unreachable=0 failed=0 skipped=208 rescued=0 ignored=0
overcloud-novacompute-0 : ok=291 changed=140 unreachable=0 failed=0 skipped=154 rescued=0 ignored=0
overcloud-novacompute-1 : ok=287 changed=140 unreachable=0 failed=0 skipped=154 rescued=0 ignored=0
overcloud-novacompute-2 : ok=287 changed=140 unreachable=0 failed=0 skipped=154 rescued=0 ignored=0
undercloud : ok=58 changed=19 unreachable=1 failed=0 skipped=75 rescued=0 ignored=0

Friday 29 May 2020 12:23:32 +0000 (0:01:56.284) 0:21:21.381 ************
===============================================================================
tripleo_ceph_run_ansible : run ceph-ansible ---------------------------------------------------------------------------------------------------------------------------------------------------------- 118.26s
ceph : Check if ceph_mon is deployed -----------------------------------------
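In a long recap like the one above, the host to look at is the one with a non-zero failed or unreachable counter. Here is a small Python sketch for pulling that out of saved output; `parse_recap` and `problem_hosts` are hypothetical helpers, and the regular expression only targets the recap line shape shown in this thread.

```python
import re

# Matches lines such as: "undercloud : ok=58 changed=19 unreachable=1 ..."
RECAP_LINE = re.compile(r"^(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")


def parse_recap(text):
    """Map each host in a PLAY RECAP to a dict of its integer counters."""
    results = {}
    for line in text.splitlines():
        match = RECAP_LINE.match(line.strip())
        if match:
            pairs = (item.split("=") for item in match.group("counters").split())
            results[match.group("host")] = {key: int(value) for key, value in pairs}
    return results


def problem_hosts(text):
    """Return the hosts whose failed or unreachable counter is non-zero."""
    return [
        host
        for host, counters in parse_recap(text).items()
        if counters.get("failed", 0) or counters.get("unreachable", 0)
    ]


recap = """\
overcloud-controller-0 : ok=417 changed=196 unreachable=0 failed=0 skipped=208 rescued=0 ignored=0
undercloud : ok=58 changed=19 unreachable=1 failed=0 skipped=75 rescued=0 ignored=0
"""
print(problem_hosts(recap))
```

For the recap above, only the undercloud has a non-zero counter (unreachable=1).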

John Fulton (jfulton-org) wrote :

The following is a tripleo validation failing to connect to the ceph cluster. To work around it, try running config-download with a tag to skip validations:

TASK [ceph : Check if ceph_mon is deployed] *******************************************************************************************************************************************************************
Friday 29 May 2020 12:21:36 +0000 (0:00:00.200) 0:19:25.096 ************
fatal: [undercloud]: UNREACHABLE! => {"changed": false, "msg": "Data could not be sent to remote host \"undercloud\". Make sure this host can be reached over ssh: ssh: Could not resolve hostname undercloud: Name or service not known\r\n", "unreachable": true}
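The message itself says that ssh could not resolve the name "undercloud". A quick way to confirm this from the machine running Ansible is a plain name lookup, sketched below in Python. `can_resolve` is a hypothetical helper, and port 22 is only passed because the failing connection is ssh.

```python
import socket


def can_resolve(hostname):
    """Return True if the local resolver can turn hostname into an address."""
    try:
        socket.getaddrinfo(hostname, 22)
    except socket.gaierror:
        return False
    return True


print(can_resolve("undercloud"))
```

If this prints False, the name is missing from /etc/hosts and DNS on that machine, which would be consistent with the UNREACHABLE result.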

Vasileios Baousis (bbaous) wrote :

Thanks for the help and directions.

Update.

a. The second workaround worked, but then another problem surfaced (see below). The main problem is the ansible user's permissions for the external Ceph cluster configuration, which is an important component of any OpenStack deployment. Our Ceph cluster has 1 PB of storage and is currently used by our OpenStack infrastructure (Rocky) with ~1500 vCPUs. ==> A ceph-ansible fix for the external Ceph configuration is very important and urgent IMHO.

b. Deployment without external ceph cluster passes deployment steps 1-4 but it fails in the beginning of step 5. I will raise another report about that shortly.

The above workarounds solve some of the problems, but others remain in the deployment. So far I haven't managed to complete any deployment in Ussuri.

TASK [Pre-cache facts for puppet containers] ******************************************************************************************************************************************************************
task path: /home/stack/config-download/common_deploy_steps_tasks.yaml:67
Friday 29 May 2020 14:57:03 +0000 (0:00:00.144) 0:01:51.574 ************

TASK [tripleo_puppet_cache : Gather variables for each operating system] **************************************************************************************************************************************
task path: /usr/share/ansible/roles/tripleo_puppet_cache/tasks/main.yml:21
Friday 29 May 2020 14:57:03 +0000 (0:00:00.185) 0:01:51.760 ************
fatal: [overcloud-controller-0]: FAILED! => {
    "msg": "No file was found when using first_found. Use errors='ignore' to allow this task to be skipped if no files are found"
}
fatal: [overcloud-controller-1]: FAILED! => {
    "msg": "No file was found when using first_found. Use errors='ignore' to allow this task to be skipped if no files are found"
}
fatal: [overcloud-controller-2]: FAILED! => {
    "msg": "No file was found when using first_found. Use errors='ignore' to allow this task to be skipped if no files are found"
}
fatal: [overcloud-novacompute-0]: FAILED! => {
    "msg": "No file was found when using first_found. Use errors='ignore' to allow this task to be skipped if no files are found"
}
fatal: [overcloud-novacompute-1]: FAILED! => {
    "msg": "No file was found when using first_found. Use errors='ignore' to allow this task to be skipped if no files are found"
}
fatal: [overcloud-novacompute-2]: FAILED! => {
    "msg": "No file was found when using first_found. Use errors='ignore' to allow this task to be skipped if no files are found"
}

NO MORE HOSTS LEFT ********************************************************************************************************************************************************************************************

PLAY RECAP ****************************************************************************************************************************************************************************************************
overcloud-controller-0 : ok=11 changed=3 unreachable=0 failed=1 skipped=1 rescued=0 ignored=0
overcloud-controller-1 : ok=9 changed=2 unreachable=0 ...
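For readers unfamiliar with the error above: a first_found lookup takes an ordered list of candidate files and returns the first one that exists, failing when none do. The following is a rough Python sketch of that behaviour, not the plugin's implementation. The candidate file names and directory are hypothetical, since the log does not show which files the role searched for.

```python
import os


def first_found(candidates, search_dir):
    """Return the path of the first candidate that exists in search_dir,
    or raise if none of them exist."""
    for name in candidates:
        path = os.path.join(search_dir, name)
        if os.path.isfile(path):
            return path
    raise FileNotFoundError("No file was found when using first_found.")


# Hypothetical candidates, ordered from most to least specific.
candidates = ["centos-8.yml", "centos.yml", "redhat.yml"]
try:
    print(first_found(candidates, "/hypothetical/role/vars"))
except FileNotFoundError as exc:
    print(exc)
```

Every overcloud node failing the same way suggests that none of the candidates matched, which may point to missing facts or a missing vars file rather than a per-host problem.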


Vasileios Baousis (bbaous) wrote :

Please see a related Ansible problem I reported, which occurs when deploying with Octavia: https://bugs.launchpad.net/tripleo/+bug/1881420

Juan Badia Payno (jbadiapa) wrote :

My workaround was to change ansible_ssh_user to "stack" and then execute /home/stack/config-download/overcloud/ansible-playbook-command.sh, as suggested in the error.

Vasileios Baousis (bbaous) wrote :

We set ansible_ssh_user to "stack", but now we run into another known issue:

TASK [Write octavia inventory] ***************************************************************************************************************************************************************
Thursday 04 June 2020 12:30:18 +0000 (0:00:00.512) 0:00:04.344 *********
fatal: [undercloud]: FAILED! =>
  msg: |-
    The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'ansible_hostname'

    The error appears to be in '/home/stack/config-download/overcloud/external_deploy_steps_tasks.yaml': line 381, column 5, but may
    be elsewhere in the file depending on the exact syntax problem.

    The offending line appears to be:

        name: Write group_vars file
      - copy:
        ^ here

NO MORE HOSTS LEFT ******

John Fulton (jfulton-org) wrote :

WORKAROUND 2 (even easier).

ssh stack@undercloud
cd /usr/lib/python3.6/site-packages/tripleo_common
sudo vi inventory.py +408

Delete the following line so that the default of UNDERCLOUD_CONNECTION_LOCAL is used:

        undercloud_connection=UNDERCLOUD_CONNECTION_SSH,

Here that line is again in more context:

https://github.com/openstack/tripleo-common/blob/d0c60a280fa7d7277165e4ddea88a1c4891dea53/tripleo_common/inventory.py#L408
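Deleting the line works because the keyword argument then falls back to its default. The toy Python sketch below shows that mechanism only. The two constant names come from the comment above, but their string values, the `make_inventory` function, and the shape of the returned dict are assumptions for illustration and are not copied from tripleo-common.

```python
UNDERCLOUD_CONNECTION_SSH = "ssh"      # assumed value
UNDERCLOUD_CONNECTION_LOCAL = "local"  # assumed value


def make_inventory(undercloud_connection=UNDERCLOUD_CONNECTION_LOCAL):
    """Hypothetical stand-in for the inventory constructor: records
    which connection type would be used for the undercloud host."""
    return {"Undercloud": {"vars": {"ansible_connection": undercloud_connection}}}


# With the explicit argument (the line the workaround deletes):
print(make_inventory(undercloud_connection=UNDERCLOUD_CONNECTION_SSH))

# Without it, the default applies:
print(make_inventory())
```

With a local connection, Ansible would run undercloud tasks directly as the invoking user instead of connecting over ssh, which presumably avoids both the user mismatch and the hostname lookup seen earlier in the thread.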

I've verified that the above works in my environment, even for an existing deployment where I've reproduced this problem.

I've marked this bug as a duplicate of the following:

 https://bugs.launchpad.net/tripleo/+bug/1884123

and will be posting a patch today to fix the duplicate (and, implicitly, this bug).
