Sssd doesn't clean up PIDfile after crash

Bug #1777860 reported by Jurjen Bokma on 2018-06-20
This bug affects 1 person
Affects          Importance   Assigned to
sssd (Ubuntu)    Undecided    Unassigned
Xenial           Medium       Karl Stenerud

Bug Description

[Impact]

sssd doesn't check whether an existing PID file is still valid when starting. As a result, a stale PID file left behind by a crashed sssd process prevents sssd from launching again until the PID file is manually removed.

[Test Case]

$ lxc launch ubuntu:xenial tester
$ lxc exec tester bash

# apt update
# apt dist-upgrade -y
# apt install -y sssd
# echo "[nss]
filter_groups = root
filter_users = root
reconnection_retries = 3

[pam]
reconnection_retries = 3

[sssd]
config_file_version = 2
reconnection_retries = 3
sbus_timeout = 30
services = nss, pam
domains = europe.example.com,asia.example.com

[domain/europe.example.com]
#With this as false, a simple "getent passwd" for testing won't work. You must do getent passwd <email address hidden>
enumerate = false
cache_credentials = true

id_provider = ldap
access_provider = ldap
auth_provider = krb5
chpass_provider = krb5

ldap_uri = ldaps://dc1.europe.example.com,ldaps://dc2.europe.example.com
ldap_search_base = dc=europe,dc=example,dc=com
ldap_tls_cacert = /etc/ssl/certs/ca-certificates.crt

#This parameter requires that the DC present a completely validated certificate chain. If you're testing or don't care, use 'allow' or 'never'.
ldap_tls_reqcert = demand

krb5_realm = EUROPE.EXAMPLE.COM
dns_discovery_domain = EUROPE.EXAMPLE.COM

ldap_schema = rfc2307bis
ldap_access_order = expire
ldap_account_expire_policy = ad
ldap_force_upper_case_realm = true

ldap_user_search_base = dc=europe,dc=example,dc=com
ldap_group_search_base = dc=europe,dc=example,dc=com
ldap_user_object_class = user
ldap_user_name = sAMAccountName
ldap_user_fullname = displayName
ldap_user_home_directory = unixHomeDirectory
ldap_user_principal = userPrincipalName
ldap_group_object_class = group
ldap_group_name = sAMAccountName

#Bind credentials
ldap_default_bind_dn = cn=europe-ldap-reader,cn=Users,dc=europe,dc=example,dc=com
ldap_default_authtok = secret

[domain/asia.example.com]
#With this as false, a simple "getent passwd" for testing won't work. You must do getent passwd <email address hidden>
enumerate = false
cache_credentials = true

id_provider = ldap
access_provider = ldap
auth_provider = krb5
chpass_provider = krb5

ldap_uri = ldaps://dc1.asia.example.com,ldaps://dc2.asia.example.com
ldap_search_base = dc=asia,dc=example,dc=com
ldap_tls_cacert = /etc/ssl/certs/ca-certificates.crt

#This parameter requires that the DC present a completely validated certificate chain. If you're testing or don't care, use 'allow' or 'never'.
ldap_tls_reqcert = demand

krb5_realm = ASIA.EXAMPLE.COM
dns_discovery_domain = ASIA.EXAMPLE.COM

ldap_schema = rfc2307bis
ldap_access_order = expire
ldap_account_expire_policy = ad
ldap_force_upper_case_realm = true

ldap_user_search_base = dc=asia,dc=example,dc=com
ldap_group_search_base = dc=asia,dc=example,dc=com
ldap_user_object_class = user
ldap_user_name = sAMAccountName
ldap_user_fullname = displayName
ldap_user_home_directory = unixHomeDirectory
ldap_user_principal = userPrincipalName
ldap_group_object_class = group
ldap_group_name = sAMAccountName

#Bind credentials
ldap_default_bind_dn = cn=asia-ldap-reader,cn=Users,dc=asia,dc=example,dc=com
ldap_default_authtok = secret" >/etc/sssd/sssd.conf
# chmod 600 /etc/sssd/sssd.conf
# service sssd start
# pkill -KILL -F /var/run/sssd.pid
# service sssd start
# journalctl -xe
Oct 30 10:25:46 xtest sssd[7110]: SSSD is already running

[Regression Potential]

The change would be to check if the pid in the file is still active, which shouldn't cause regressions.
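
To illustrate the kind of check meant here, a minimal shell sketch follows. It is not the actual change made to the package, and the function name pidfile_is_live is made up for this example. It assumes the check runs as the same user as the daemon (or as root), because kill -0 also fails when the caller lacks permission to signal the process. It cannot tell whether a recycled PID now belongs to an unrelated process.

```shell
#!/bin/sh
# Hypothetical helper, not part of sssd: succeed (exit status 0) only if the
# given PID file names a process that currently exists.
pidfile_is_live() {
    pidfile=$1

    # A missing or unreadable file cannot point at a live process.
    [ -r "$pidfile" ] || return 1

    # Take the first line and strip whitespace.
    pid=$(head -n 1 "$pidfile" | tr -d '[:space:]')

    # Reject empty or non-numeric content.
    case $pid in
        ''|*[!0-9]*) return 1 ;;
    esac

    # Signal 0 delivers nothing; it only tests whether the PID can be signalled.
    kill -0 "$pid" 2>/dev/null
}
```

With such a helper, a wrapper could remove /var/run/sssd.pid only when pidfile_is_live reports failure, and leave the file alone while the daemon is really running.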

[Original Description]

After having crashed, sssd will not start, because the old PIDfile is still present. The fact that the process does not exist any more does not cause the PIDfile to be cleaned up.
This bug is already known, but not fixed, upstream: https://pagure.io/SSSD/sssd/issue/3528
(also contains instructions for reproduction).

In our environment, with hundreds of computers running Ubuntu, the 'solution' brought forth in that discussion, to investigate and handle the issue manually, is not a serious option.

So I propose that we make systemd handle the PIDfile in case of a crash. With the attached one-line patch applied, systemd will clean up the PIDfile after a crash. That way, sssd doesn't have to make assumptions about namespaces, but the package still handles the issue.

Mandatory data:

Ubuntu version:
  Ubuntu 16.04.4 LTS

Package version:
  apt-cache policy $(dpkg -S /lib/systemd/system/sssd.service )
   sssd-common: Installed: 1.13.4-1ubuntu1.11

What I expect to happen:
After
  kill -9 $(cat /var/run/sssd.pid)
the command
  systemctl start sssd
results in a running sssd.

What happens instead:
No sssd is running. Only after
  rm /var/run/sssd.pid
  systemctl start sssd
does it run again.

Jurjen Bokma (j-bokma-t) wrote :
description: updated

The attachment "Add PIDFile setting in sssd.service" seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]

tags: added: patch
Robie Basak (racb) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better.

It looks like upstream added a PIDFile entry to the systemd service definition in 1.16.1, which is included in Bionic and Cosmic. So it's likely that this bug is fixed there.

Therefore this bug affects only Xenial and possibly Artful.

> In our environment, with hundreds of computers running Ubuntu, the 'solution' brought forth in that discussion, to investigate and handle the issue manually, is not a serious option.

Did you know that you can work around the problem by creating an override file at /etc/systemd/system/sssd.service?

Nevertheless we should fix this in Xenial for the benefit of all other users who hit this, but now that you know the workaround, you should be able to stop the bug from affecting you easily enough.
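
A rough sketch of that workaround, for anyone who wants to script it: copy the packaged unit to the override location and add a PIDFile= line under [Service]. The function name add_pidfile_setting is made up for this example, and the exact line is an assumption based on the PID file path used in this report. The contents of the attached patch are not reproduced here.

```shell
#!/bin/sh
# Hypothetical helper: copy unit file SRC to DEST and add a "PIDFile=" line
# right after the [Service] header, unless SRC already sets PIDFile.
add_pidfile_setting() {
    src=$1
    dest=$2
    pidfile=$3

    if grep -q '^PIDFile=' "$src"; then
        # Already configured: keep the unit unchanged.
        cp "$src" "$dest"
    else
        awk -v line="PIDFile=$pidfile" '
            { print }
            $0 == "[Service]" { print line }
        ' "$src" > "$dest"
    fi
}

# Intended use on an affected Xenial host, as root, with the paths mentioned
# in this report:
#   add_pidfile_setting /lib/systemd/system/sssd.service \
#       /etc/systemd/system/sssd.service /var/run/sssd.pid
#   systemctl daemon-reload
```

Once systemd knows the PID file, it removes the file after the service has shut down, so a crash no longer leaves a stale file behind to block the next start.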

Changed in sssd (Ubuntu):
status: New → Fix Released
Changed in sssd (Ubuntu Xenial):
status: New → Triaged
importance: Undecided → Medium
tags: added: server-next

On 06/21/2018 04:39 PM, Robie Basak wrote:
> Thank you for taking the time to report this bug and helping to make
> Ubuntu better.
>
> It looks like upstream added a PIDFile entry to the systemd service
> definition in 1.16.1, which is included in Bionic and Cosmic. So it's
> likely that this bug is fixed there.
>
> Therefore this bug affects only Xenial and possibly Artful.
>
>> In our environment, with hundreds of computers running Ubuntu, the
> 'solution' brought forth in that discussion, to investigate and handle
> the issue manually, is not a serious option.
>
> Did you know that you can work around the problem by creating an
> override file at /etc/systemd/system/sssd.service?
>
Thanks for pointing that out! I should've thought of it, but I guess my
brain stopped once I had fixed it in /lib.

'gards
Jurjen

> Nevertheless we should fix this in Xenial for the benefit of all other
> users who hit this, but now that you know the workaround should stop the
> bug affecting you easily enough.
>
> ** Also affects: sssd (Ubuntu Xenial)
> Importance: Undecided
> Status: New
>
> ** Changed in: sssd (Ubuntu)
> Status: New => Fix Released
>
> ** Changed in: sssd (Ubuntu Xenial)
> Status: New => Triaged
>
> ** Changed in: sssd (Ubuntu Xenial)
> Importance: Undecided => Medium
>
> ** Tags added: server-next
>

Changed in sssd (Ubuntu Xenial):
assignee: nobody → Karl Stenerud (kstenerud)
description: updated
Changed in sssd (Ubuntu Xenial):
status: Triaged → In Progress

Hello Jurjen, or anyone else affected,

Accepted sssd into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/sssd/1.13.4-1ubuntu1.12 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in sssd (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-xenial
Karl Stenerud (kstenerud) wrote :

Verified working:

Setup:

# lxc launch ubuntu-daily:xenial tester && lxc exec tester bash

Failure Case:

# apt update && apt dist-upgrade -y && apt install -y sssd
# echo "[nss]
filter_groups = root
filter_users = root
reconnection_retries = 3
[pam]
reconnection_retries = 3
[sssd]
config_file_version = 2
reconnection_retries = 3
sbus_timeout = 30
services = nss, pam
domains = europe.example.com,asia.example.com
[domain/europe.example.com]
#With this as false, a simple "getent passwd" for testing won't work. You must do getent passwd <email address hidden>
enumerate = false
cache_credentials = true
id_provider = ldap
access_provider = ldap
auth_provider = krb5
chpass_provider = krb5
ldap_uri = ldaps://dc1.europe.example.com,ldaps://dc2.europe.example.com
ldap_search_base = dc=europe,dc=example,dc=com
ldap_tls_cacert = /etc/ssl/certs/ca-certificates.crt
#This parameter requires that the DC present a completely validated certificate chain. If you're testing or don't care, use 'allow' or 'never'.
ldap_tls_reqcert = demand
krb5_realm = EUROPE.EXAMPLE.COM
dns_discovery_domain = EUROPE.EXAMPLE.COM
ldap_schema = rfc2307bis
ldap_access_order = expire
ldap_account_expire_policy = ad
ldap_force_upper_case_realm = true
ldap_user_search_base = dc=europe,dc=example,dc=com
ldap_group_search_base = dc=europe,dc=example,dc=com
ldap_user_object_class = user
ldap_user_name = sAMAccountName
ldap_user_fullname = displayName
ldap_user_home_directory = unixHomeDirectory
ldap_user_principal = userPrincipalName
ldap_group_object_class = group
ldap_group_name = sAMAccountName
#Bind credentials
ldap_default_bind_dn = cn=europe-ldap-reader,cn=Users,dc=europe,dc=example,dc=com
ldap_default_authtok = secret
[domain/asia.example.com]
#With this as false, a simple "getent passwd" for testing won't work. You must do getent passwd <email address hidden>
enumerate = false
cache_credentials = true
id_provider = ldap
access_provider = ldap
auth_provider = krb5
chpass_provider = krb5
ldap_uri = ldaps://dc1.asia.example.com,ldaps://dc2.asia.example.com
ldap_search_base = dc=asia,dc=example,dc=com
ldap_tls_cacert = /etc/ssl/certs/ca-certificates.crt
#This parameter requires that the DC present a completely validated certificate chain. If you're testing or don't care, use 'allow' or 'never'.
ldap_tls_reqcert = demand
krb5_realm = ASIA.EXAMPLE.COM
dns_discovery_domain = ASIA.EXAMPLE.COM
ldap_schema = rfc2307bis
ldap_access_order = expire
ldap_account_expire_policy = ad
ldap_force_upper_case_realm = true
ldap_user_search_base = dc=asia,dc=example,dc=com
ldap_group_search_base = dc=asia,dc=example,dc=com
ldap_user_object_class = user
ldap_user_name = sAMAccountName
ldap_user_fullname = displayName
ldap_user_home_directory = unixHomeDirectory
ldap_user_principal = userPrincipalName
ldap_group_object_class = group
ldap_group_name = sAMAccountName
#Bind credentials
ldap_default_bind_dn = cn=asia-ldap-reader,cn=Users,dc=asia,dc=example,dc=com
ldap_default_authtok = secret" >/etc/sssd/sssd.conf
# chmod 600 /etc/sssd/sssd.conf
# service sssd start
# pkill -KILL -F /var/run/sssd.pid
# service sssd start
Job for sssd.service failed because the control process exited with error code. See "systemctl status ...

tags: added: verification-done verification-done-xenial
removed: verification-needed verification-needed-xenial