Intermittent failure adding user 'ceph-admin', exit code: 9

Bug #1979093 reported by John Fulton
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: John Fulton

Bug Description

The tripleo_create_admin role (which creates the tripleo-admin user for all tripleo deploys) fails [0] intermittently in our CI.
It failed on the "create user" task [1], which really just calls the Ansible user module [2]
When correlating the failed task time [3] with the journal log [4] I see "useradd[31970]: failed adding user 'ceph-admin', exit code: 9" [5]
Exit code 9 means "username already in use" [6]
The journal [4] shows "new user: name=ceph-admin" only once
We're using Ansible Version: 2.8.20 per the job output [7]
The user module for this version already has a user_exists function so idempotence is attempted [8]
The user module in 2.8.20 hasn't been updated in 3 years, though the one in version 2.13 [9] was last updated 3 months ago
Perhaps for now, as a workaround, we should assume that [1] is not idempotent and add a check before calling it.
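
As a rough illustration of the "check before calling it" idea, here is a minimal Python sketch of an existence check using the standard library pwd module. The function name user_exists is hypothetical and this is not the code of the Ansible user module or of the role in [1]. Note that a check followed by a separate create step is only safe when nothing else can create the same user in between.

```python
import pwd


def user_exists(name):
    """Return True if the local passwd database has an entry for name."""
    try:
        pwd.getpwnam(name)
    except KeyError:
        # getpwnam raises KeyError when no such entry exists.
        return False
    return True


if __name__ == "__main__":
    # Only report; creating the user is left to the caller.
    print("ceph-admin exists:", user_exists("ceph-admin"))
```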

[0] https://e7fbb04e1f1bc4fc681d-d7024b4a9df5a3cf5a52965a2b5a469a.ssl.cf1.rackcdn.com/834352/81/gate/tripleo-ci-centos-9-scenario004-standalone/574cdfb/logs/undercloud/home/zuul/ansible.log

[1] https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/roles/tripleo_create_admin/tasks/create_user.yml#L18-L21

[2] https://docs.ansible.com/ansible/latest/collections/ansible/builtin/user_module.html

[3] 2022-06-14 08:15:45,377 p=31908 u=zuul n=ansible | 2022-06-14 08:15:45.376184 | bc764e20-14a4-6973-e52d-000000000016 | FATAL | create user ceph-admin | undercloud | error={"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}

[4] https://e7fbb04e1f1bc4fc681d-d7024b4a9df5a3cf5a52965a2b5a469a.ssl.cf1.rackcdn.com/834352/81/gate/tripleo-ci-centos-9-scenario004-standalone/574cdfb/logs/undercloud/var/log/extra/journal.txt

[5]
Jun 14 08:15:45 standalone.localdomain useradd[31969]: new group: name=ceph-admin, GID=1002
Jun 14 08:15:45 standalone.localdomain useradd[31969]: new user: name=ceph-admin, UID=1001, GID=1002, home=/home/ceph-admin, shell=/bin/bash, from=none
Jun 14 08:15:45 standalone.localdomain useradd[31970]: failed adding user 'ceph-admin', exit code: 9

[6] https://linux.die.net/man/8/useradd

[7] https://e7fbb04e1f1bc4fc681d-d7024b4a9df5a3cf5a52965a2b5a469a.ssl.cf1.rackcdn.com/834352/81/gate/tripleo-ci-centos-9-scenario004-standalone/574cdfb/job-output.txt

[8] https://github.com/ansible/ansible/blob/v2.8.20/lib/ansible/modules/system/user.py#L848

[9] https://github.com/ansible/ansible/blob/v2.13.0/lib/ansible/modules/user.py

summary: - failed adding user 'ceph-admin', exit code: 9
+ Intermittent failure adding user 'ceph-admin', exit code: 9
John Fulton (jfulton-org) wrote :

Another intermittent failure was observed in CI once but not again. Perhaps when this bug is patched an extra task could be added to ensure authorized keys have the correct selinux context.

https://review.rdoproject.org/r/c/testproject/+/36256

001 standalone testing periodic. New failure:

First time I've seen this timeout:

fatal: [undercloud]: FAILED! => {"attempts": 12, "changed": true, "cmd": "ssh -i /home/zuul/.ssh/ceph-admin-id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null ceph-admin@192.168.42.1 'echo good'", "delta": "0:00:00.063750", "end": "2022-06-18 22:10:39.793079", "msg": "non-zero return code", "rc": 255, "start": "2022-06-18 22:10:39.729329", "stderr": "Warning: Permanently added '192.168.42.1' (ED25519) to the list of known hosts.\r\nceph-admin@192.168.42.1: Permission denied (publickey).", "stderr_lines": ["Warning: Permanently added '192.168.42.1' (ED25519) to the list of known hosts.", "ceph-admin@192.168.42.1: Permission denied (publickey)."], "stdout": "", "stdout_lines": []}

The failed task [1] confirms that the ceph-admin account cannot be used for SSH login.
The account was created [2].
The journal [3] shows that SELinux denied sshd read access on the file authorized_keys [4].
I could add an extra task after creating the authorized_keys file [5] which ensures the SELinux context is correct.
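
A minimal sketch of what such an extra step could do, written in Python for illustration rather than as the actual Ansible task. It assumes the keys live under /home/ceph-admin/.ssh (the home directory comes from the journal above; the .ssh location is the sshd default and is an assumption here) and that the standard restorecon tool is installed. The function names are hypothetical.

```python
import shutil
import subprocess


def restorecon_argv(path, recursive=True):
    """Build the restorecon command line that resets SELinux labels on path."""
    argv = ["restorecon", "-v"]
    if recursive:
        argv.append("-R")
    argv.append(path)
    return argv


def restore_context(path):
    """Run restorecon on path; return its exit code, or None if unavailable."""
    if shutil.which("restorecon") is None:
        return None  # tool not installed, e.g. not an SELinux host
    done = subprocess.run(restorecon_argv(path), check=False)
    return done.returncode


if __name__ == "__main__":
    print(restorecon_argv("/home/ceph-admin/.ssh"))
```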

[1] https://github.com/openstack/tripleo-quickstart-extras/blob/master/roles/standalone/tasks/ceph-install.yml#L95

[2] https://logserver.rdoproject.org/56/36256/98/check/periodic-tripleo-ci-centos-9-scenario001-standalone-master/a453b28/logs/undercloud/etc/passwd.txt.gz

[3] https://logserver.rdoproject.org/56/36256/98/check/periodic-tripleo-ci-centos-9-scenario001-standalone-master/a453b28/logs/undercloud/var/log/extra/journal.txt.gz

[4]
Jun 18 22:10:41 standalone.localdomain setroubleshoot[86399]: SELinux is preventing /usr/sbin/sshd from read access on the file authorized_keys. For complete SELinux messages run: sealert -l fa5095ff-f00d-4445-bacf-6c7ddf9e24d5
Jun 18 22:10:41 standalone.localdomain setroubleshoot[86399]: SELinux is preventing /usr/sbin/sshd from read access on the file authorized_keys.

                                                              ***** Plugin catchall (100. confidence) suggests **************************

                                                              If you believe that sshd should be allowed read access on the authorized_keys file by default.
                                                              Then you should report this as a bug.
                                                              You can generate a local policy module to allow this access.
                                                              Do
                                                              allow this access for now by executing:
                                                              # ausearch -c 'sshd' --raw | audit2allow -M my-sshd
                                                              # semodule -X 300 -i my-...


John Fulton (jfulton-org) wrote :
Ronelle Landy (rlandy)
Changed in tripleo:
importance: Medium → Critical
tags: added: promotion promotion-blocker
OpenStack Infra (hudson-openstack) wrote : Fix proposed to python-tripleoclient (master)
Changed in tripleo:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-ansible (master)

Change abandoned by "John Fulton <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/tripleo-ansible/+/846530
Reason: https://review.opendev.org/c/openstack/python-tripleoclient/+/847844

John Fulton (jfulton-org) wrote :

This patch is tested in standalone 001/004/010 by https://review.opendev.org/c/openstack/tripleo-heat-templates/+/834354

OpenStack Infra (hudson-openstack) wrote : Fix proposed to python-tripleoclient (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/python-tripleoclient/+/847980

OpenStack Infra (hudson-openstack) wrote : Change abandoned on python-tripleoclient (stable/wallaby)

Change abandoned by "John Fulton <email address hidden>" on branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/python-tripleoclient/+/847980
Reason: Created new patch by accident, instead wanted to update https://review.opendev.org/c/openstack/python-tripleoclient/+/845482

OpenStack Infra (hudson-openstack) wrote : Fix merged to python-tripleoclient (master)

Reviewed: https://review.opendev.org/c/openstack/python-tripleoclient/+/847844
Committed: https://opendev.org/openstack/python-tripleoclient/commit/5ee23cf83def70b541858958659dc33a6bb5b0b6
Submitter: "Zuul (22348)"
Branch: master

commit 5ee23cf83def70b541858958659dc33a6bb5b0b6
Author: John Fulton <email address hidden>
Date: Mon Jun 27 14:45:35 2022 -0400

    Limit standalone ceph-admin user creation to a single host

    When 'openstack overcloud ceph user enable --standalone' is
    run, call Ansible with '--limit undercloud'.

    Bug #1979093 happened because Ansible was running the user
    module on the same host as if it were two hosts. The module
    is idempotent but not race safe. E.g. when user execution A
    and user execution B are run on the same host, A's check that
    the user does not exist might be true but before A goes on to
    create the user, B could have created it first depending on
    scheduling.

    The python-tripleoclient uses Ansible --limit when creating
    the ceph-admin user so only _admin nodes get the private key.
    This works for multinode but standalone only has one node, so
    for that condition redefine the limit list to that single node.

    Change-Id: I2f62cdfcb88edb5552cbd7351b6240f78376c93d
    Closes-Bug: #1979093
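
The "idempotent but not race safe" behaviour described in the commit message can be reproduced in miniature. The following Python sketch is a simulation only: FakeUserDb is a hypothetical in-memory stand-in for the passwd database, not the Ansible user module, and a barrier is used to force the unlucky scheduling in which both executions finish their check before either one creates the user.

```python
import threading


class FakeUserDb:
    """In-memory stand-in for the passwd database (illustration only)."""

    def __init__(self):
        self._users = set()
        self._lock = threading.Lock()

    def exists(self, name):
        with self._lock:
            return name in self._users

    def create(self, name):
        """Mimic useradd: return 0 on success, 9 if the name is in use."""
        with self._lock:
            if name in self._users:
                return 9
            self._users.add(name)
            return 0


def ensure_user(db, name, barrier, results):
    # Step 1: the idempotence check.
    missing = not db.exists(name)
    # Force both executions to complete the check before either creates.
    barrier.wait()
    # Step 2: act on a result that may now be stale.
    results.append(db.create(name) if missing else 0)


def run_race(name="ceph-admin"):
    db = FakeUserDb()
    barrier = threading.Barrier(2)
    results = []
    workers = [
        threading.Thread(target=ensure_user, args=(db, name, barrier, results))
        for _ in range(2)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return sorted(results)


if __name__ == "__main__":
    print(run_race())  # one execution succeeds, the other gets exit code 9
```

Running the task against a single host entry, as the fix does with the limit, means there is only one execution and so no second check to go stale.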

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to python-tripleoclient (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/python-tripleoclient/+/847980
Committed: https://opendev.org/openstack/python-tripleoclient/commit/186d7f4e4cd1b931918bbb375f62b4fb9d48375e
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 186d7f4e4cd1b931918bbb375f62b4fb9d48375e
Author: John Fulton <email address hidden>
Date: Mon Jun 27 14:45:35 2022 -0400

    Limit standalone ceph-admin user creation to a single host

    When 'openstack overcloud ceph user enable --standalone' is
    run, call Ansible with '--limit undercloud'.

    Bug #1979093 happened because Ansible was running the user
    module on the same host as if it were two hosts. The module
    is idempotent but not race safe. E.g. when user execution A
    and user execution B are run on the same host, A's check that
    the user does not exist might be true but before A goes on to
    create the user, B could have created it first depending on
    scheduling.

    The python-tripleoclient uses Ansible --limit when creating
    the ceph-admin user so only _admin nodes get the private key.
    This works for multinode but standalone only has one node, so
    for that condition redefine the limit list to that single node.

    Change-Id: I2f62cdfcb88edb5552cbd7351b6240f78376c93d
    Closes-Bug: #1979093
    (cherry picked from commit 5ee23cf83def70b541858958659dc33a6bb5b0b6)

tags: added: in-stable-wallaby
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/python-tripleoclient 19.0.0

This issue was fixed in the openstack/python-tripleoclient 19.0.0 release.
