sssd offline on boot, stays offline forever

Bug #1723350 reported by Marian Rainer-Harbach on 2017-10-13
This bug affects 9 people
Affects               Status        Importance  Assigned to  Milestone
sssd (Ubuntu)         Fix Released  Medium      Unassigned
Artful                Triaged       Medium      Unassigned

Bug Description

SSSD 1.15.3-2ubuntu1 on 17.10/artful (previous versions on artful were also affected) is offline on boot and seems to stay offline forever (I waited over 20 minutes).

sssd_nss.log:
(Fri Oct 13 09:49:50 2017) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(Fri Oct 13 09:49:51 2017) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(Fri Oct 13 09:49:51 2017) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(Fri Oct 13 09:49:51 2017) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
...

SSSD immediately returns to normal operation after restarting it or after sending SIGUSR2.

A workaround for the problem is creating the file /etc/systemd/system/sssd.service.d/override.conf with the following contents:
[Unit]
Requires=network-online.target
After=network-online.target
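(A minimal sketch of putting that drop-in in place, assuming the standard systemd paths; systemctl edit sssd.service would create the same file interactively:)

sudo mkdir -p /etc/systemd/system/sssd.service.d
sudoedit /etc/systemd/system/sssd.service.d/override.conf   # paste the [Unit] snippet above
sudo systemctl daemon-reload
sudo systemctl restart sssd.service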

renbag (renbag) wrote :

I confirm the bug and that the workaround solves the problem.
If the sssd configuration doesn't include
cache_credentials = true
it is not even possible to log in with an Active Directory account.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in sssd (Ubuntu):
status: New → Confirmed
Andreas Hasenack (ahasenack) wrote :

I can confirm this is happening after a reboot, and that a simple USR2 signal fixes it, but I wonder why sssd doesn't get itself back into online mode on its own in this case.

For example, I tried the following:
- login via kerberos using pam_sss while it was online
- got my ticket
- kdestroy, logout
- in another terminal, bring down the network on this client workstation
- try to log in again
- login failed, and sssd log showed:
(Fri Oct 13 21:23:15 2017) [sssd[pam]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]

I then brought the network back up, and after a while (a minute or less), login was working again using krb5. This shows sssd recovered from the induced "outage".
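(A rough sketch of that sequence as shell commands; the nmcli calls and the user name are illustrative assumptions, the comment does not show the exact commands used:)

ssh testuser@localhost       # log in via pam_sss while sssd is online; a ticket is acquired
kdestroy && exit             # drop the ticket and log out
sudo nmcli networking off    # in another terminal: take the client's network down
ssh testuser@localhost       # this login fails; sssd_pam logs the DataProvider.Offline error
sudo nmcli networking on     # within a minute or so, krb5 logins work again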

The systemd unit file we are using in Ubuntu comes straight from upstream.

(some searching)

Found these upstream tickets:

https://pagure.io/SSSD/sssd/issue/3467
https://pagure.io/SSSD/sssd/issue/3294

Changed in sssd (Ubuntu):
status: Confirmed → Triaged
importance: Undecided → Medium

Also affects Ubuntu Bionic (18.04)

A nicer workaround, for those using NetworkManager & systemd, would be to tell sssd to try going online when an interface comes up.

Create /etc/NetworkManager/dispatcher.d/sssd_notify with the contents:
#!/bin/bash
case "$2" in
  up|vpn-up) systemctl kill --signal=USR2 sssd.service ;;
esac
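(Note: NetworkManager only executes dispatcher scripts that are owned by root, executable and not writable by group or others, so something along these lines is also needed; a sketch assuming the path above:)

sudo chown root:root /etc/NetworkManager/dispatcher.d/sssd_notify
sudo chmod 755 /etc/NetworkManager/dispatcher.d/sssd_notify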

summary: - sssd offline on boot, stays offline forever (artful)
+ sssd offline on boot, stays offline forever (artful, bionic)

@Simon: Interesting – I can't reproduce the problem in 18.04 anymore. I tested a VM and a laptop using ethernet and wifi, and sssd was online on boot without any workarounds.

Andreas Hasenack (ahasenack) wrote :

The test in comment #4 is more thorough, as it checks that sssd can detect that the network is up again on its own. Can you try that on bionic please?

I tried the steps in #4. As expected, login failed while the network was offline. It worked again immediately after enabling the network. So to me it seems the problem really does not occur in bionic anymore.

Andreas Hasenack (ahasenack) wrote :

I also tried again in bionic, including rebooting the sssd client while the krb5 server was offline, and it automatically switched back to an online status after I restored the network on my krb5 server. I then tried this again on xenial, and there it also worked.

Let me try artful now.

Andreas Hasenack (ahasenack) wrote :

Reading what I wrote in comment #4, it seems like I'm contradicting myself. First I say I can reproduce the problem, but later I say that after I bring the network back up it starts working again on its own. That's exactly what I'm seeing now, i.e., no bug. I tried xenial, artful and bionic. Are you guys on a desktop by any chance, with just a wifi connection perhaps?

Changed in sssd (Ubuntu):
status: Triaged → Incomplete
renbag (renbag) wrote :

I confirmed the bug in artful, but now, with the same sssd.conf in bionic, I do not see the bug anymore.

Andreas Hasenack (ahasenack) wrote :

I'll mark this as fix released for bionic then, and open up an artful task.

Changed in sssd (Ubuntu):
status: Incomplete → Fix Released
summary: - sssd offline on boot, stays offline forever (artful, bionic)
+ sssd offline on boot, stays offline forever
Changed in sssd (Ubuntu Artful):
status: New → Triaged
importance: Undecided → Medium
Andreas Hasenack (ahasenack) wrote :

@renbag, can you share your network configuration on this artful machine? These files:

/etc/network/interfaces
/etc/network/interfaces.d/*
/etc/netplan/*.yaml

And is this a desktop or a server?

renbag (renbag) wrote :

The artful machine is a workstation and has only a wired network interface managed by network-manager that gets configured by a DHCP server on our LAN.
The files you requested are not present or have little or no information:

/etc/network/interfaces
--------------------------------------------------
# interfaces(5) file used by ifup(8) and ifdown(8)
auto lo
iface lo inet loopback
--------------------------------------------------

/etc/netplan/01-network-manager-all.yaml
--------------------------------------------------
# Let NetworkManager manage all devices on this system
network:
  version: 2
  renderer: NetworkManager
--------------------------------------------------

Andreas Hasenack (ahasenack) wrote :

@renbag could you please:

- add debug_level = 4 to [sssd] and your [domain/] sections
- reboot
- post the output of these commands right after:

systemd-analyze blame

systemd-analyze critical-chain

- log in as a kerberos/ldap user (the logins that have been failing)
- check the new logs in /var/log/sssd/*, trying to catch the moment it decides it is offline. You could attach them to this bug, but your username and realm will show up and it might be too much to sanitize

In my artful-desktop VM, set up to use krb5 and ldap from another machine, logins after reboots work just fine. I can also take the network down, fail a login, bring it back up, and without restarting anything logins work just fine afterwards. So we haven't gotten to the bottom of this yet.

renbag (renbag) wrote :

@Andreas: thank you for your suggestions. I removed the workaround from my artful machine and did what you asked, then, after some minutes, I reactivated the workaround and did the same. All the logs are attached for comparison (the real username and domain were redacted).

renbag (renbag) wrote :

/var/log/sssd/* attachment.

Andreas Hasenack (ahasenack) wrote :

The workaround you speak of, is that the sssd option to cache credentials, or the systemd unit file change to require network-online.target?

renbag (renbag) wrote :

The workaround is the systemd unit override given in the Bug Description.
The cache_credentials config option does not have any effect on the bug, and I disabled it for these tests.

Jakub Hrozek (jakub-hrozek) wrote :

Are the sssd logs captured with a high enough debug_level in the [sssd] section of sssd.conf? Normally sssd should detect notifications from libnetlink and reset the offline status. It would be nice to see those and correlate them with the system logs.

OliFre (freyermuth) wrote :

Could somebody please add Bionic back?
We observe this on Bionic; it's pretty random, and mostly machines with fast root devices (e.g. NVMe) are affected. The backend is offline on boot and stays offline.
Since it's so random and I don't want to enable debug output on all of our machines, I can't promise that we will be able to collect debug data.

OliFre (freyermuth) wrote :

@jakub-hrozek: When activating debug_level on some machines, which level is high enough?

Andreas Hasenack (ahasenack) wrote :

Maybe start with 6. From https://docs.pagure.org/SSSD.sssd/users/troubleshooting.html:
"""
To enable debugging persistently across SSSD service restarts, put the directive debug_level=N, where N typically stands for a number between 1 and 10 into the particular section. Levels up to 3 should log mostly failures (although we haven’t really been consistent especially earlier in the SSSD development) and anything above level 8 provides a large number of log messages. Level 6 might be a good starting point for debugging problems.
"""

OliFre (freyermuth) wrote :

@ahasenack: Even in case things "work", I observe the following:

Jun 11 14:41:15 cip000 systemd[1]: Starting System Security Services Daemon...
Jun 11 14:41:24 cip000 sssd[920]: Starting up
Jun 11 14:41:25 cip000 sssd[be[998]: Starting up
Jun 11 14:41:28 cip000 sssd[1021]: Starting up
Jun 11 14:41:28 cip000 sssd[1022]: Starting up
Jun 11 14:41:29 cip000 systemd[1]: Started System Security Services Daemon.
Jun 11 14:42:38 cip000 sssd[be[998]: Backend is online

Users trying to log in between 14:41 and 14:42 get an error message.
In the logfiles, I only see tons of:
(Mon Jun 11 14:42:32 2018) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]

in the nss-logs, and a few:
(Mon Jun 11 14:41:49 2018) [sssd[pam]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]

in the sssd_pam log for each failed attempt.

Even though I get "Backend is online" in this case, the logs don't tell me why it actually recovered. So maybe 6 is not high enough?
Is it really sufficient to enable debugging only for sssd itself, or do I need to enable it for the provider, too?

Also, shouldn't this debug_level be a bitmask?

In any case, network connectivity was available well before 14:42:38. I actually logged in via ssh at 14:41:52 as user, but since the backend was still offline, no kerberos ticket was initialized.
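(For illustration of the per-section question above: debug_level is set per section, so a sketch covering both the monitor/responders and the backend/provider would look like the following; example.com is only a placeholder domain:)

[sssd]
debug_level = 6

[pam]
debug_level = 6

[domain/example.com]
debug_level = 6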

OliFre (freyermuth) wrote :

I've now had a machine with actual failure and debug_level=9 in sections sssd, pam and nss.
No more information was produced, just tons of org.freedesktop.sssd.Error.DataProvider.Offline .

It seems the official manual is outdated; I will try with bitmasks next.

OliFre (freyermuth) wrote :

I tried with 0x7770 now in all sections, but I don't get more info.
Now, sssd.log shows:
(Mon Jun 11 15:10:34:313118 2018) [sssd] [sss_ini_call_validators] (0x0020): [rule/allowed_sssd_options]: Attribute 'debug_devel' is not allowed in section 'sssd'. Check for typos.
for each section :-(.

So apparently the manpage is also wrong? I specified:
debug_devel = 0x7770

So I have no idea how to provide debugging info. But I have a bionic machine (for 1-2 days) on which I can reproduce this well, since it has an SSD.
It seems to be highly reproducible on any SSD/NVMe Bionic machine.

OliFre (freyermuth) wrote :

Argh! My fault, with the typo fixed, I see the following:

Jun 11 15:36:27 exp006 sssd[be[918]: Backend is offline
[...]
(Mon Jun 11 15:36:31 2018) [sssd] [message_type] (0x0200): netlink Message type: 24
(Mon Jun 11 15:36:31 2018) [sssd] [route_msg_debug_print] (0x1000): route idx 2 flags 0 family 2 addr correct_ip_address/32
(Mon Jun 11 15:36:31 2018) [sssd] [network_status_change_cb] (0x2000): A networking status change detected signaling providers to reset offline status
(Mon Jun 11 15:36:31 2018) [sssd] [sbus_add_timeout] (0x2000): 0x5616e2ab79f0
[...]
(Mon Jun 11 15:37:50 2018) [sssd[be[uni-bonn.de]]] [resolv_gethostbyname_step] (0x2000): Querying DNS
(Mon Jun 11 15:37:50 2018) [sssd[be[uni-bonn.de]]] [resolv_gethostbyname_dns_query] (0x0100): Trying to resolve A record of 'kdc1.OUR.INSTITUTION' in DNS
(Mon Jun 11 15:37:50 2018) [sssd[be[uni-bonn.de]]] [schedule_request_timeout] (0x2000): Scheduling a timeout of 6 seconds
(Mon Jun 11 15:37:50 2018) [sssd[be[uni-bonn.de]]] [schedule_timeout_watcher] (0x2000): Scheduling DNS timeout watcher
(Mon Jun 11 15:37:50 2018) [sssd[be[uni-bonn.de]]] [unschedule_timeout_watcher] (0x4000): Unscheduling DNS timeout watcher
(Mon Jun 11 15:37:50 2018) [sssd[be[uni-bonn.de]]] [request_watch_destructor] (0x0400): Deleting request watch
(Mon Jun 11 15:37:50 2018) [sssd[be[uni-bonn.de]]] [resolv_gethostbyname_done] (0x0040): querying hosts database failed [5]: Input/output error
(Mon Jun 11 15:37:50 2018) [sssd[be[uni-bonn.de]]] [fo_resolve_service_done] (0x0020): Failed to resolve server 'kdc1.OUR_INSTITUTION': Could not contact DNS servers
(Mon Jun 11 15:37:50 2018) [sssd[be[uni-bonn.de]]] [set_server_common_status] (0x0100): Marking server 'kdc1.OUR_INSTITUTION' as 'not working'

So it seems that even though the network runs fine, DNS fails.
dig kdc1.OUR_INSTITUTION A
works fine, though.

OliFre (freyermuth) wrote :

That's interesting.
Stracing the sssd_be process, it seems that it's trying to contact 127.0.0.1 as the DNS server, and failing.
systemd-resolved is running on 127.0.0.53 and works just fine.
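(For anyone trying to reproduce that observation, a rough sketch of this kind of strace session; the exact invocation is not shown in the comment and the syscall filter is only illustrative:)

pgrep -a sssd_be                 # find the backend PID
sudo strace -f -e trace=connect,sendto,sendmsg -p <sssd_be-pid>

Watch where the DNS traffic goes: 127.0.0.53 is systemd-resolved's stub listener, so queries to 127.0.0.1 point at a stale resolver configuration inside sssd_be.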

OliFre (freyermuth) wrote :

Is this an inotify race due to the symlink-magic of systemd-resolved?

sssd.log:(Mon Jun 11 15:36:23 2018) [sssd] [_snotify_create] (0x0400): Added a watch for /etc/resolv.conf with inotify flags 0x8D88 internal flags 0x1 using function resolv_conf_inotify_cb after delay 1.0

# ls -la /etc/resolv.conf
lrwxrwxrwx 1 root root 39 Jun 11 13:12 /etc/resolv.conf -> ../run/systemd/resolve/stub-resolv.conf

# stat /run/systemd/resolve/stub-resolv.conf
  File: /run/systemd/resolve/stub-resolv.conf
  Size: 727 Blocks: 8 IO Block: 4096 regular file
Device: 17h/23d Inode: 950 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 101/systemd-resolve) Gid: ( 103/systemd-resolve)
Access: 2018-06-11 15:36:31.937607730 +0200
Modify: 2018-06-11 15:36:31.925607676 +0200
Change: 2018-06-11 15:36:31.925607676 +0200
 Birth: -

So it seems /run/systemd/resolve/stub-resolv.conf was modified slightly after sssd_be set up its inotify watch, but I don't see anything in the logs indicating that sssd was notified about this modification. Maybe because it's a symlink?

OliFre (freyermuth) wrote :

Needless to say, resolv.conf is:
nameserver 127.0.0.53
search OUR.INSTITUTION

OliFre (freyermuth) wrote :

I can confirm that after adding the described override, everything works fine.

Andreas Hasenack (ahasenack) wrote :

The inotify callback would log this:

    DEBUG(SSSDBG_TRACE_INTERNAL,
          "Received inotify notification for %s\n", filename);

That is level 8 debugging:
./src/util/debug.h:#define SSSDBG_TRACE_INTERNAL 0x2000 /* level 8 */

OliFre (freyermuth) wrote :

Thanks for searching out this info!
I grepped through all the logs for "inot" and found no match (only the initial inotify setup). So it seems sssd was not notified via inotify about the change.
That matches the strace observation that "127.0.0.1" instead of "127.0.0.53" was being queried.

So either there is a race (the resolv.conf symlink target changed right between sssd reading it and the inotify watch being activated, if that is the order in which things happen?) or systemd-resolved somehow manages to change resolv.conf in a way that inotify does not alert on.

For example, it seems that if I do the following:
$ touch old_resolv
$ touch new_resolv
$ ln -s old_resolv resolv
and now set up an inotify watch on resolv, and then perform:
$ ln -sf new_resolv resolv
then, effectively, resolv has changed (since the symlink target changed), but inotify does not see it.

Not sure if this is what systemd-resolved is doing, but it's not unlikely...
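(The effect described above can be reproduced with inotifywait from inotify-tools; a sketch relying on the fact that inotify_add_watch() follows symlinks by default and therefore watches the target's inode:)

$ touch old_resolv new_resolv
$ ln -s old_resolv resolv
$ inotifywait -m resolv &     # the watch lands on old_resolv's inode
$ ln -sf new_resolv resolv    # repoint the symlink

No event is reported, even though resolv now resolves to a different file.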

OliFre (freyermuth) wrote :

Another issue appears to be...
When starting sssd at the highest debug level and then changing resolv.conf once by doing:
$ touch foo
$ mv foo /run/systemd/resolve/stub-resolv.conf
(that's where the symlink target points), I see the expected:
(Tue Jun 12 00:16:57 2018) [sssd] [resolv_conf_inotify_cb] (0x2000): Received inotify notification for /etc/resolv.conf
(Tue Jun 12 00:16:57 2018) [sssd] [monitor_update_resolv] (0x1000): Resolv.conf has been updated. Reloading.

Doing that *again*, i.e.
$ touch bar
$ mv bar /run/systemd/resolve/stub-resolv.conf
does *not* trigger sssd's inotify watch again. That's apparently because sssd does not re-register an inotify watch for the new inode of the file that was moved in.

It seems like a fundamental rework would be needed in sssd; maybe the easiest approach would be to default to polling instead of inotify until all of this is fixed?

OliFre (freyermuth) wrote :

I'll test tomorrow if try_inotify=false in the [sssd] section also works around this issue, unless someone beats me to it...
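(For reference, per sssd.conf(5) the option belongs in the [sssd] section; a minimal sketch:)

[sssd]
# fall back to polling resolv.conf instead of relying on inotify
try_inotify = false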

renbag (renbag) wrote :

Since my last comment on this bug report (#16), I upgraded my workstation from artful to bionic. The machine also has an NVME SSD, but I do not see this sssd offline bug anymore.

However, I had intermittent problems browsing the shares inside my AD domain and, when I read here about possible race problems in sssd, I tried to solve them by putting the systemd workaround from this bug description back in place. The result is that with the workaround I can always browse the network shares (I'm using thunar in XFCE4, which in turn uses gvfsd-smb-browse), while without it I almost never see the shares. This is consistent behaviour that I have noticed over the last few days and across several reboots of the workstation.
Should this be reported as a different bug?

Note: the samba browsing problem is not related to the Security Improvements described in the Bionic Release Notes (https://wiki.ubuntu.com/BionicBeaver/ReleaseNotes).

OliFre (freyermuth) wrote :

@renbag: Do you see "sssd[be[896]: Backend is offline" in "systemctl status sssd"?
Is your backend server given as hostname / fqdn in sssd.conf?

@ahasenack: try_inotify=false did not change the issue at all. With strace, I can still see 127.0.0.1 being queried. IMHO this can only be an ugly race between c_ares channel creation and resolv.conf changing. In addition, inotify support is borked, since no new watch is created when the inode of the watched file changes, but this should be a different issue.

OliFre (freyermuth) wrote :

@renbag: It might be that, in terms of timing, the DHCP server is also involved, since it defines how long network-online.target takes to be reached. Our setup is relatively slow, since we have a DHCP server under high load (several tens of thousands of leases) and are using a dhcrelay in between.
Potentially, you can still force the issue to appear by inverting the workaround to force sssd to start before network-online.target.

Andreas Hasenack (ahasenack) wrote :

The sssd.conf manpage from sssd 1.13.4 and 1.16.1 (xenial and bionic) says that without inotify sssd will fall back to polling resolv.conf every 5s:

try_inotify (boolean)
    SSSD monitors the state of resolv.conf to identify when it needs to update its internal DNS resolver. By default, we will attempt to use inotify for this, and will fall back to polling resolv.conf every five seconds if inotify cannot be used.

Did you observe that when setting try_inotify to false?

Andreas Hasenack (ahasenack) wrote :

Interesting. I set up a watch on a directory to see what happens when a symlink is changed:

https://pastebin.ubuntu.com/p/wKpMVY4dVz/

The watch on the "old" symlink definitely is lost via that operation, as far as I can see.

OliFre (freyermuth) wrote :

After setting try_inotify to false, I see that the message saying an inotify watch was set up no longer appears, but no message saying that polling is used instead shows up in the logs either. If I understand the code correctly, that's just right: a message will only be shown if try_inotify is true and it actually fails.

The polling, though, appears to be pretty useless when checking the code. It checks whether the "mtime" of "/etc/resolv.conf" has changed. That seems to rarely change for the symlink itself, only for the symlink target, so actual resolv-file changes are not noticed...

I think your observation concerning inotify is expected, since inotify works on inode numbers, IIUC.

Andrew Conway (acubuntuone) wrote :

I have also observed this problem on bionic, on several different computers. The workaround always solves the problem; without the workaround I cannot log in. What the computers had in common:

* Using bionic (either the version from the DVD or with network updates)
* Using networkd rather than NetworkManager with a static ipv4 address
* sssd.conf uses fqdn to refer to the name servers
* A backup ldap and kerberos server were both provided.

OliFre (freyermuth) wrote :

This also mostly applies to us:
* Bionic, installed via network installer.
* We did not configure the network explicitly; AFAICT this means networkd is used. However, we run with DHCP.
* sssd.conf uses FQDNs for the kerberos and ldap servers (I guess that's what you meant)?
* We also provide backup servers for both.
