Intermittent failure adding user 'ceph-admin', exit code: 9
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
tripleo | Fix Released | Critical | John Fulton |
Bug Description
The tripleo_
It failed on the "create user" task [1], which just calls the Ansible user module [2].
When correlating the failed task time [3] with the journal log [4], I see "useradd[31970]: failed adding user 'ceph-admin', exit code: 9" [5].
Exit code 9 means "username already in use" [6].
The journal [4] shows "new user: name=ceph-admin" only once.
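One failure with exit code 9 next to a single "new user" entry is consistent with useradd being invoked a second time after the account already existed. A minimal Python sketch for making that correlation mechanical: it scans journal lines for the two useradd messages quoted above. The regexes are modelled only on those quoted strings, and the sample lines in the usage note (including PID 12345) are hypothetical, not taken from the job logs.

```python
import re
from collections import Counter

# Patterns modelled on the useradd messages quoted in this report:
#   "useradd[31970]: failed adding user 'ceph-admin', exit code: 9"
#   "new user: name=ceph-admin"
FAILED_RE = re.compile(r"useradd\[\d+\]: failed adding user '([^']+)', exit code: (\d+)")
NEW_USER_RE = re.compile(r"new user: name=([^,\s]+)")


def summarize_useradd(journal_lines):
    """Count successful creations per user and collect (user, exit_code) failures."""
    created = Counter()
    failures = []
    for line in journal_lines:
        failed = FAILED_RE.search(line)
        if failed:
            failures.append((failed.group(1), int(failed.group(2))))
            continue
        new_user = NEW_USER_RE.search(line)
        if new_user:
            created[new_user.group(1)] += 1
    return created, failures
```

With hypothetical input such as `["... useradd[12345]: new user: name=ceph-admin", "... useradd[31970]: failed adding user 'ceph-admin', exit code: 9"]`, the function reports one creation and one failure with code 9 for `ceph-admin`.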
We're using Ansible version 2.8.20, per the job output [7].
The user module in this version already has a user_exists function, so idempotence is attempted [8].
The user module in 2.8.20 hasn't been updated in 3 years, though the 2.13 version was last updated 3 months ago.
Perhaps for now, as a workaround, we should assume that [1] is not idempotent and add a check before calling it.
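A minimal sketch of the "check before calling it" idea, in Python rather than as an Ansible task and not the user module's own code. It looks the name up with `pwd.getpwnam` and only runs `useradd` when the user is missing. Because check-then-create is itself racy, it also treats exit code 9 as success if the user exists afterwards. The helper names `user_exists` and `ensure_user` are hypothetical; `run` and `exists` are injectable so the logic can be exercised without root.

```python
import pwd
import subprocess

USERADD_NAME_IN_USE = 9  # useradd(8) exit code: username already in use


def user_exists(name):
    """Return True if the local account database knows `name`."""
    try:
        pwd.getpwnam(name)
        return True
    except KeyError:
        return False


def ensure_user(name, exists=user_exists, run=subprocess.run):
    """Create `name` only if missing, tolerating a lost race (exit code 9).

    Returns "present" if the user already existed, "created" if useradd
    succeeded, or "raced" if useradd said the name is in use and the
    user does exist now.
    """
    if exists(name):
        return "present"
    result = run(["useradd", name], check=False)
    if result.returncode == 0:
        return "created"
    if result.returncode == USERADD_NAME_IN_USE and exists(name):
        # Someone else created the user between our check and useradd.
        return "raced"
    raise RuntimeError(f"useradd {name} failed with exit code {result.returncode}")
```

In a playbook, the same shape would be a lookup task guarding the user task; the sketch only illustrates the control flow.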
[2] https:/
[3] 2022-06-14 08:15:45,377 p=31908 u=zuul n=ansible | 2022-06-14 08:15:45.376184 | bc764e20-
[5]
Jun 14 08:15:45 standalone.
Jun 14 08:15:45 standalone.
Jun 14 08:15:45 standalone.
[6] https:/
[8] https:/
[9] https:/
summary: failed adding user 'ceph-admin', exit code: 9 → Intermittent failure adding user 'ceph-admin', exit code: 9
Changed in tripleo:
importance: Medium → Critical
tags: added: promotion, promotion-blocker
Another intermittent failure was observed in CI once but has not recurred. Perhaps when this bug is patched, an extra task could be added to ensure authorized_keys has the correct SELinux context.
https://review.rdoproject.org/r/c/testproject/+/36256
001 standalone testing periodic. New failure:
First time I've seen this timeout:
fatal: [undercloud]: FAILED! => {"attempts": 12, "changed": true, "cmd": "ssh -i /home/zuul/.ssh/ceph-admin-id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null ceph-admin@192.168.42.1 'echo good'", "delta": "0:00:00.063750", "end": "2022-06-18 22:10:39.793079", "msg": "non-zero return code", "rc": 255, "start": "2022-06-18 22:10:39.729329", "stderr": "Warning: Permanently added '192.168.42.1' (ED25519) to the list of known hosts.\r\nceph-admin@192.168.42.1: Permission denied (publickey).", "stderr_lines": ["Warning: Permanently added '192.168.42.1' (ED25519) to the list of known hosts.", "ceph-admin@192.168.42.1: Permission denied (publickey)."], "stdout": "", "stdout_lines": []}
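The failing check retries `ssh ... 'echo good'` and gave up after 12 attempts with rc 255. A minimal Python sketch of that retry loop, useful for reproducing the probe by hand. The command line mirrors the `cmd` in the output above; the function name `probe_ssh` and the 5-second `delay` are hypothetical, since the task's actual delay is not shown. Note that retries only help with timing problems: a "Permission denied (publickey)" that persists on every attempt points at configuration (here, SELinux labelling) rather than a slow start.

```python
import subprocess
import time


def probe_ssh(host, key, user="ceph-admin", attempts=12, delay=5.0,
              run=subprocess.run, sleep=time.sleep):
    """Retry `ssh ... 'echo good'` until it prints "good" or attempts run out.

    Returns the 1-based attempt number that succeeded, or None.
    """
    cmd = [
        "ssh", "-i", key,
        "-o", "StrictHostKeyChecking=no",
        "-o", "UserKnownHostsFile=/dev/null",
        f"{user}@{host}", "echo good",
    ]
    for attempt in range(1, attempts + 1):
        result = run(cmd, capture_output=True, text=True, check=False)
        if result.returncode == 0 and result.stdout.strip() == "good":
            return attempt
        if attempt < attempts:  # no pointless sleep after the last try
            sleep(delay)
    return None
```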
The failed task [1] confirms that the ceph-admin account cannot be used for SSH login.
The account was created [2].
The journal [3] shows that SELinux denied sshd read access to the authorized_keys file.
I could add an extra task after adding the key to authorized_keys [5] that ensures the file's SELinux context is correct.
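A minimal sketch of what such a task would do, written in Python rather than as an Ansible task: it runs `restorecon -R -v` on the user's `~/.ssh` directory, which resets files to the default labels from the loaded policy (normally `ssh_home_t` for `~/.ssh` content). The helper name `restore_ssh_context` and the home path `/home/ceph-admin` used in the usage note are hypothetical; the real fix would live in the ceph-install tasks and may differ.

```python
import os
import subprocess


def restore_ssh_context(home, run=subprocess.run):
    """Relabel <home>/.ssh so sshd is allowed to read authorized_keys.

    Returns the command that was run; raises if restorecon fails.
    """
    ssh_dir = os.path.join(home, ".ssh")
    cmd = ["restorecon", "-R", "-v", ssh_dir]
    result = run(cmd, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        raise RuntimeError(f"restorecon failed ({result.returncode}): {result.stderr}")
    return cmd
```

For example, `restore_ssh_context("/home/ceph-admin")` would run `restorecon -R -v /home/ceph-admin/.ssh` on an SELinux-enabled host.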
[1] https://github.com/openstack/tripleo-quickstart-extras/blob/master/roles/standalone/tasks/ceph-install.yml#L95
[2] https://logserver.rdoproject.org/56/36256/98/check/periodic-tripleo-ci-centos-9-scenario001-standalone-master/a453b28/logs/undercloud/etc/passwd.txt.gz
[3] https://logserver.rdoproject.org/56/36256/98/check/periodic-tripleo-ci-centos-9-scenario001-standalone-master/a453b28/logs/undercloud/var/log/extra/journal.txt.gz
[4] localdomain setroubleshoot[86399]: SELinux is preventing /usr/sbin/sshd from read access on the file authorized_keys. For complete SELinux messages run: sealert -l fa5095ff-f00d-4445-bacf-6c7ddf9e24d5
localdomain setroubleshoot[86399]: SELinux is preventing /usr/sbin/sshd from read access on the file authorized_keys.
Jun 18 22:10:41 standalone.
Jun 18 22:10:41 standalone.