sssd offline on boot, stays offline forever

Bug #1723350 reported by Marian Rainer-Harbach
88
This bug affects 14 people
Affects Status Importance Assigned to Milestone
sssd (Ubuntu)
Fix Released
Medium
Unassigned
Artful
Won't Fix
Medium
Unassigned
Bionic
Fix Released
High
Andreas Hasenack
Eoan
Fix Released
High
Andreas Hasenack

Bug Description

[Impact]
sssd can switch to an offline mode of operation when it cannot reach the authentication or id backend. It uses several methods to assess the situation, and one of them is monitoring the /etc/resolv.conf file for changes.

In ubuntu that file is a symlink to /run/systemd/resolve/stub-resolv.conf, but the target doesn't exist at all times during boot. It's expected that symlink to be broken for a while during boot.

Turns out that the monitoring that sssd was doing on /etc/resolv.conf didn't take into consideration that what could change was the *target* of the symlink. it completely ignored that fact, and didn't notice when the resolv.conf contents actually changed in this scenario, which resulted in sssd staying in the offline mode when it shouldn't.

There are two fixes being pulled in for this SRU:
a) fix the monitoring of the target of the /etc/resolv.conf symlink
b) change the fallback polling code to keep trying, instead of giving up right away

[Test Case]
It's recommended to test this in a lxd container, or a vm.

Preparation steps. When prompted for an openldap/slapd password, chose any password you want. It won't be needed again:
$ sudo apt update && sudo apt dist-upgrade
$ sudo apt install sssd-ldap sssd-tools sssd-dbus slapd dnsmasq

Become root:
$ sudo su -

Detect your ip:
# export interface=$(ip route | grep default | sed -r 's,^default via .* dev ([a-z0-9]+) .*,\1,')
# export ip=$(ip addr show dev $interface | grep "inet [0-9]" | awk '{print $2}' | cut -d / -f 1)

Confirm the $ip variable is correct for your case:
# echo $ip

Create /etc/dnsmasq.d/sssd-test.conf using your real ip:
# cat > /etc/dnsmasq.d/sssd-test.conf <<EOF
host-record=ldap01.example.com,$ip
listen-address=$ip
interface=$interface
except-interface=lo
bind-interfaces
EOF

restart dnsmasq
# systemctl restart dnsmasq

a) inotify test
Create /etc/sssd/sssd.conf:
# cat > /etc/sssd/sssd.conf <<EOF
[sssd]
config_file_version = 2
services = nss, pam, ifp
domains = LDAP
debug_level = 6

[domain/LDAP]
id_provider = ldap
ldap_uri = ldap://ldap01.example.com
cache_credentials = True
ldap_search_base = dc=example,dc=com
EOF

# chmod 0600 /etc/sssd/sssd.conf

# rm /etc/resolv.conf
# ln -s /etc/resolv.conf.target /etc/resolv.conf

create good resolv.conf:
# echo "nameserver $ip" > /etc/resolv.conf.good

Confirm /etc/resolv.conf is a broken symlink:
# ll /etc/resolv.conf*
lrwxrwxrwx 1 root root 23 May 13 20:48 /etc/resolv.conf -> /etc/resolv.conf.target
-rw-r--r-- 1 root root 24 May 13 20:48 /etc/resolv.conf.good

Open another terminal/screen and tail the sssd logs with a grep:
# tail -F /var/log/sssd/sssd.log | grep -i resolv

Start sssd
# systemctl restart sssd

The tail output from that other terminal should say sssd is monitoring /etc/resolv.conf (that's the broken symlink):
(Mon May 18 17:32:34 2020) [sssd] [_snotify_create] (0x0400): Added a watch for /etc/resolv.conf with inotify flags 0x8D88 internal flags 0x1 using function resolv_conf_inotify_cb after delay 1.0

Repeat this sssctl call until it shows the offline mode persistently:
# sssctl domain-status LDAP
Online status: Offline

Active servers:
LDAP: not connected

Discovered LDAP servers:
- ldap01.example.com

"Unbreak" the symlink:
# cp /etc/resolv.conf.good /etc/resolv.conf.target

The log should now say:
(Mon May 18 17:33:30 2020) [sssd] [process_dir_event] (0x0400): Not interested in resolv.conf.target

This shows that sssd didn't pick up that resolv.conf changed, via the target of the symlink.

Run sssctl again, and the online status will remain offline.

With the fixed packages, the sssd startup log will say:
(Mon May 18 17:17:06 2020) [sssd] [_snotify_create] (0x0400): Added a watch for /etc/resolv.conf.target with inotify flags 0x8D88 internal flags 0x1 using function resolv_conf_inotify_cb after delay 1.0

Showing that it's monitoring the symlink target.

And after fixing the broken symlink, it will say:
(Mon May 18 17:18:06 2020) [sssd] [process_dir_event] (0x0400): received notification for watched file [resolv.conf.target] under /etc

Run sssctl again, it should almost immediately switch to online:
# sssctl domain-status LDAP
Online status: Online

Active servers:
LDAP: ldap01.example.com

Discovered LDAP servers:
- ldap01.example.com

b) polling test
Repeat the previous test, but with "try_inotify = false" in sssd.conf, like this:
# cat > /etc/sssd/sssd.conf <<EOF
[sssd]
config_file_version = 2
services = nss, pam, ifp
domains = LDAP
debug_level = 6
try_inotify = false

[domain/LDAP]
id_provider = ldap
ldap_uri = ldap://ldap01.example.com
cache_credentials = True
ldap_search_base = dc=example,dc=com
EOF

Then follow the other steps of test case (a).

Upon startup, the sssd log will show:
(Mon May 18 17:57:04 2020) [sssd] [monitor_config_file_fallback] (0x0080): file [/etc/resolv.conf] is missing. Will not update online status based on watching the file

Note how it says it will not watch it for updates.

With the fixed package, the log will show this instead:
(Mon May 18 18:02:56 2020) [sssd] [monitor_config_file_fallback] (0x0020): file [/etc/resolv.conf] is missing. Will try again later.

Notice how it says it will try again, and multiple times if you keep watching it (once every 10s).

"Unbreak" the symlink:
# cp /etc/resolv.conf.good /etc/resolv.conf.target

And the log output will eventually show this:
(Mon May 18 18:04:06 2020) [sssd] [monitor_config_file_fallback] (0x0020): file [/etc/resolv.conf] is missing. Will try again later.
(Mon May 18 18:04:17 2020) [sssd] [signal_res_init] (0x0040): Reloading Resolv.conf.

And sssctl will show the now online status:
# sssctl domain-status LDAP
Online status: Online

Active servers:
LDAP: ldap01.example.com

Discovered LDAP servers:
- ldap01.example.com

[Regression Potential]
Breaking sssd can mean preventing logins of network remote users on a system, but should still allow local users (and root) to login. Another possible source of regressions, but not related to the code changes in this update, if there are pre-existing incorrect changes to sssd.conf, they might prevent the service from restarting when this update is applied.

[Other Info]
Not at this time.

[Original Description]

SSSD 1.15.3-2ubuntu1 on 17.10/artful (previous versions on artful were also affected) is offline on boot and seems to stay offline forever (I waited over 20 minutes).

sssd_nss.log:
(Fri Oct 13 09:49:50 2017) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(Fri Oct 13 09:49:51 2017) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(Fri Oct 13 09:49:51 2017) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(Fri Oct 13 09:49:51 2017) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
...

SSSD immediately returns to normal operation after restarting it or after sending SIGUSR2.

A workaround for the problem is creating the file /etc/systemd/system/sssd.service.d/override.conf with contents
[Unit]
Requires=network-online.target
After=network-online.target

Related branches

Revision history for this message
Marian Rainer-Harbach (marianrh) wrote :
Revision history for this message
renbag (renbag) wrote :

I confirm the bug and that the workaround solves the problem.
If the sssd configuration doesn't include:
cache_credentials = true
it is not even possible to do an Active Directory login.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in sssd (Ubuntu):
status: New → Confirmed
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I can confirm this is happening after a reboot, and that a simple USR2 signal fixes it, but I wonder why sssd doesn't get itself back into online mode on its own in this case.

For example, I tried the following:
- login via kerberos using pam_sss while it was online
- got my ticket
- kdestroy, logout
- in another terminal, bring down the network on this client workstation
- try go login again
- login failed, and sssd log showed:
(Fri Oct 13 21:23:15 2017) [sssd[pam]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]

I then brought the network back up, and after a while (a minute or less), login was working again using krb5. This shows sssd recovered from the induced "outage".

The systemd unit file we are using in ubuntu comes straight from upstream.

(some searching)

Found these upstream tickets:

https://pagure.io/SSSD/sssd/issue/3467
https://pagure.io/SSSD/sssd/issue/3294

Changed in sssd (Ubuntu):
status: Confirmed → Triaged
importance: Undecided → Medium
Revision history for this message
Simon Elmir (nerd65536) wrote : Re: sssd offline on boot, stays offline forever (artful, bionic)

Also affects Ubuntu Bionic (18.04)

A nicer workaround, for those using NetworkManager & systemd would be to tell sssd to try going online when an interface comes up.

Create /etc/NetworkManager/dispatcher.d/sssd_notify with the contents:
#!/bin/bash
case "$2" in
  up|vpn-up) systemctl kill --signal=USR2 sssd.service; ;;
esac

summary: - sssd offline on boot, stays offline forever (artful)
+ sssd offline on boot, stays offline forever (artful, bionic)
Revision history for this message
Marian Rainer-Harbach (marianrh) wrote :

@Simon: Interesting – I can't reproduce the problem in 18.04 anymore. I tested a VM and a laptop using ethernet and wifi, and sssd was online on boot without any workarounds.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

The test in comment #4 is more thorough, as it checks that sssd can detect that the network is up again on its own. Can you try that on bionic please?

Revision history for this message
Marian Rainer-Harbach (marianrh) wrote :

I tried the steps in #4. As expected, login failed while the network was offline. It worked again immediately after enabling the network. So to me it seems the problem really does not occur in bionic anymore.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I also tried again in bionic, including rebooting the sssd client while the krb5 server was offline, and it automatically switched back to an online status after i restored the network on my krb5 server. I then tried this again on xenial, and there it also worked.

let me try artful now.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Reading what I wrote in comment #4, seems like I'm contradicting myself. First I say I can reproduce the problem, but later I say that after I bring the network back up it starts working again on its own. That's exactly what I'm seeing now, i.e., no bug. Tried xenial, artful and bionic. Are you guys on a desktop by any chance, with just a wifi connection perhaps?

Changed in sssd (Ubuntu):
status: Triaged → Incomplete
Revision history for this message
renbag (renbag) wrote :

I confirmed the bug in artful, but now, with the same sssd.conf in bionic, I do not see the bug anymore.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I'll mark this as fix released for bionic then, and open up an artful task.

Changed in sssd (Ubuntu):
status: Incomplete → Fix Released
summary: - sssd offline on boot, stays offline forever (artful, bionic)
+ sssd offline on boot, stays offline forever
Changed in sssd (Ubuntu Artful):
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

@renbag, can you share your network configuration on this artful machine? These files:

/etc/network/interfaces
/etc/network/interfaces.d/*
/etc/netplan/*.yaml

And is this a desktop or a server?

Revision history for this message
renbag (renbag) wrote :

The artful machine is a workstation and has only a wired network interface managed by network-manager that gets configured by a DHCP server on our LAN.
The files you requested are not present or have little or no information:

/etc/network/interfaces
--------------------------------------------------
# interfaces(5) file used by ifup(8) and ifdown(8)
auto lo
iface lo inet loopback
--------------------------------------------------

/etc/netplan/01-network-manager-all.yaml
--------------------------------------------------
# Let NetworkManager manage all devices on this system
network:
  version: 2
  renderer: NetworkManager
--------------------------------------------------

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

@renbag could you please:

- add debug_level = 4 to [sssd] and your [domain/] sections
- reboot
- post the output of these commands right after:

systemd-analyze blame

systemd-analyze critical-chain

- login as a kerberos/ldap user, which has been failing
- check the new logs in /var/log/sssd/* trying to catch the moment it decides it is offline. You could attach them to this bug, your the username and realm will show up and it might be too much to sanitize

In my artful-desktop VM, setup to use krb5 and ldap from another machine, logins after reboots work just fine. I can also take the network down, fail a login, take it backup, and without restarting anything login just fine afterwards. So we haven't gotten to the bottom of this yet.

Revision history for this message
renbag (renbag) wrote :

@Andreas: thank you for your suggestions. I removed the workaround from my artful machine and did what you asked, then, after some minutes, I reactivated the workaround and did the same. All the logs are attached for comparison (the real username and domain were redacted).

Revision history for this message
renbag (renbag) wrote :

/var/log/sssd/* attachment.

Revision history for this message
renbag (renbag) wrote :
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

The workaround you speak of, is that the sssd option to cache credentials, or the systemd unit file change to require network-online.target?

Revision history for this message
renbag (renbag) wrote :

The workaround is the systemd unit file written in the Bug Description.
The 'cache credentials' config option does not have any effect on the bug and for the tests I have disabled it.

Revision history for this message
Jakub Hrozek (jakub-hrozek) wrote :

Are the sssd logs captured with a high enough debug_level in the [sssd] section of sssd.conf? Normally sssd should detect notifications from libnetlink and reset the offline status.. It would be nice to see those and correlate with system logs..

Revision history for this message
OliFre (freyermuth) wrote :

Could somebody please add Bionic back?
We observe this on Bionic, it's pretty random, mostly machines with fast root devices (e.g. NVMEs) are affected. Backend is offline on boot and stays offline.
Since it's so random and I don't want to enable debug information on all machines we have, I can't promise that we will be able to collect debug data.

Revision history for this message
OliFre (freyermuth) wrote :

@jakub-hrozek: When activating debug_level on some machines, which level is high enough?

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Maybe start with 6. From https://docs.pagure.org/SSSD.sssd/users/troubleshooting.html:
"""
To enable debugging persistently across SSSD service restarts, put the directive debug_level=N, where N typically stands for a number between 1 and 10 into the particular section. Levels up to 3 should log mostly failures (although we haven’t really been consistent especially earlier in the SSSD development) and anything above level 8 provides a large number of log messages. Level 6 might be a good starting point for debugging problems.
"""

Revision history for this message
OliFre (freyermuth) wrote :

@ahasenack: Even in case things "work", I observe the following:

Jun 11 14:41:15 cip000 systemd[1]: Starting System Security Services Daemon...
Jun 11 14:41:24 cip000 sssd[920]: Starting up
Jun 11 14:41:25 cip000 sssd[be[998]: Starting up
Jun 11 14:41:28 cip000 sssd[1021]: Starting up
Jun 11 14:41:28 cip000 sssd[1022]: Starting up
Jun 11 14:41:29 cip000 systemd[1]: Started System Security Services Daemon.
Jun 11 14:42:38 cip000 sssd[be[998]: Backend is online

Users trying to log in between 14:41 and 14:42 get an error message.
In the logfiles, I only see tons of:
(Mon Jun 11 14:42:32 2018) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]

in the nss-logs, and a few:
(Mon Jun 11 14:41:49 2018) [sssd[pam]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]

in the sssd_pam log for each failed attempt.

Even though I get "Backend is online" in this case, the logs don't tell me why it actually recovered. So maybe 6 is not high enough?
Is it really sufficient to enable debugging only for sssd itself, don't I need to enable it for the provider, too?

Also, shouldn't this debug_level be a bitmask?

In any case, network connectivity was available well before 14:42:38. I actually logged in via ssh at 14:41:52 as user, but since the backend was still offline, no kerberos ticket was initialized.

Revision history for this message
OliFre (freyermuth) wrote :

I've now had a machine with actual failure and debug_level=9 in sections sssd, pam and nss.
No more information was produced, just tons of org.freedesktop.sssd.Error.DataProvider.Offline .

Seems that official manual is outdated, will try with bitmasks next.

Revision history for this message
OliFre (freyermuth) wrote :

I tried wiht 0x7770 now in all sections, but I don't get more info.
Now, sssd.log shows:
(Mon Jun 11 15:10:34:313118 2018) [sssd] [sss_ini_call_validators] (0x0020): [rule/allowed_sssd_options]: Attribute 'debug_devel' is not allowed in section 'sssd'. Check for typos.
for each section :-(.

So apparently the manpage is also wrong? I specified:
debug_devel = 0x7770

So I've no idea on how to provide debugging info. But I have a bionic machine (for 1-2 days) on which I can reproduce well, since it has an SSD.
Seems to be highly reproducible on any SSD / NVME Bionic machine.

Revision history for this message
OliFre (freyermuth) wrote :

Argh! My fault, with the typo fixed, I see the following:

Jun 11 15:36:27 exp006 sssd[be[918]: Backend is offline
[...]
(Mon Jun 11 15:36:31 2018) [sssd] [message_type] (0x0200): netlink Message type: 24
(Mon Jun 11 15:36:31 2018) [sssd] [route_msg_debug_print] (0x1000): route idx 2 flags 0 family 2 addr correct_ip_address/32
(Mon Jun 11 15:36:31 2018) [sssd] [network_status_change_cb] (0x2000): A networking status change detected signaling providers to reset offline status
(Mon Jun 11 15:36:31 2018) [sssd] [sbus_add_timeout] (0x2000): 0x5616e2ab79f0
[...]
(Mon Jun 11 15:37:50 2018) [sssd[be[uni-bonn.de]]] [resolv_gethostbyname_step] (0x2000): Querying DNS
(Mon Jun 11 15:37:50 2018) [sssd[be[uni-bonn.de]]] [resolv_gethostbyname_dns_query] (0x0100): Trying to resolve A record of 'kdc1.OUR.INSTITUTION' in DNS
(Mon Jun 11 15:37:50 2018) [sssd[be[uni-bonn.de]]] [schedule_request_timeout] (0x2000): Scheduling a timeout of 6 seconds
(Mon Jun 11 15:37:50 2018) [sssd[be[uni-bonn.de]]] [schedule_timeout_watcher] (0x2000): Scheduling DNS timeout watcher
(Mon Jun 11 15:37:50 2018) [sssd[be[uni-bonn.de]]] [unschedule_timeout_watcher] (0x4000): Unscheduling DNS timeout watcher
(Mon Jun 11 15:37:50 2018) [sssd[be[uni-bonn.de]]] [request_watch_destructor] (0x0400): Deleting request watch
(Mon Jun 11 15:37:50 2018) [sssd[be[uni-bonn.de]]] [resolv_gethostbyname_done] (0x0040): querying hosts database failed [5]: Input/output error
(Mon Jun 11 15:37:50 2018) [sssd[be[uni-bonn.de]]] [fo_resolve_service_done] (0x0020): Failed to resolve server 'kdc1.OUR_INSTITUTION': Could not contact DNS servers
(Mon Jun 11 15:37:50 2018) [sssd[be[uni-bonn.de]]] [set_server_common_status] (0x0100): Marking server 'kdc1.OUR_INSTITUTION' as 'not working'

So it seems even though network runs fine, DNS fails.
dig kdc1.OUR_INSTITUTION A
works fine, though.

Revision history for this message
OliFre (freyermuth) wrote :

That's interesting.
strace'ing the sssd_be process, it seems that it's trying to contact 127.0.0.1 as DNS server, and fails.
systemd-resolved is running on 127.0.0.53 and works just fine.

Revision history for this message
OliFre (freyermuth) wrote :

Is this an inotify race due to the symlink-magic of systemd-resolved?

sssd.log:(Mon Jun 11 15:36:23 2018) [sssd] [_snotify_create] (0x0400): Added a watch for /etc/resolv.conf with inotify flags 0x8D88 internal flags 0x1 using function resolv_conf_inotify_cb after delay 1.0

# ls -la /etc/resolv.conf
lrwxrwxrwx 1 root root 39 Jun 11 13:12 /etc/resolv.conf -> ../run/systemd/resolve/stub-resolv.conf

# stat /run/systemd/resolve/stub-resolv.conf
  File: /run/systemd/resolve/stub-resolv.conf
  Size: 727 Blocks: 8 IO Block: 4096 regular file
Device: 17h/23d Inode: 950 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 101/systemd-resolve) Gid: ( 103/systemd-resolve)
Access: 2018-06-11 15:36:31.937607730 +0200
Modify: 2018-06-11 15:36:31.925607676 +0200
Change: 2018-06-11 15:36:31.925607676 +0200
 Birth: -

So it seems /run/systemd/resolve/stub-resolv.conf was modified slightly after sssd_be set up it's inotify watch, but I don't see anything in the logs that sssd was notified about this modification. Maybe since it's a symlink?

Revision history for this message
OliFre (freyermuth) wrote :

Needless to say, resolv.conf is:
nameserver 127.0.0.53
search OUR.INSTITUTION

Revision history for this message
OliFre (freyermuth) wrote :

I can confirm that adding the described override, all works fine.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

The inotify callback would log this:

    DEBUG(SSSDBG_TRACE_INTERNAL,
          "Received inotify notification for %s\n", filename);

That is level 8 debugging:
./src/util/debug.h:#define SSSDBG_TRACE_INTERNAL 0x2000 /* level 8 */

Revision history for this message
OliFre (freyermuth) wrote :

Thanks for searching out this info!
I grepped through all logs for "inot" and found no match (only the initial inotify setup). So it seems sssd was not notified via inotify of the change.
That matches the behaviour that "strace" showed that "127.0.0.1" instead of "127.0.0.53" was queried.

So either, there is a race (resolv.conf symlink target changed right between sssd reading it and inotify watch being activated, if this is the order in which things happen?) or systemd-resolved somehow manages to change resolv.conf in a way that inotify does not alert on.

For example, it seems that if I do the following:
$ touch old_resolv
$ touch new_resolv
$ ln -s old_resolv resolv
and now I setup an inotify watch on resolv, and perform:
$ ln -sf new_resolv resolv
Then effectively, resolv is changed (since the symlink target changed) but inotify does not see it.

Not sure if this is what systemd-resolved is doing, but it's not unlikely...

Revision history for this message
OliFre (freyermuth) wrote :

Another issue appears to be...
When starting sssd in highest debug level, and then changing resolv.conf once by doing:
$ touch foo
$ mv foo /run/systemd/resolve/stub-resolv.conf
(that's where the link target is pointing to), I see the expected:
(Tue Jun 12 00:16:57 2018) [sssd] [resolv_conf_inotify_cb] (0x2000): Received inotify notification for /etc/resolv.conf
(Tue Jun 12 00:16:57 2018) [sssd] [monitor_update_resolv] (0x1000): Resolv.conf has been updated. Reloading.

Doing that *again*, i.e.
$ touch bar
$ mv bar /run/systemd/resolve/stub-resolv.conf
Does *not* trigger sssd's inotify watch again. That's apparently since sssd does not re-register a new inotify watch for the new inode number of the file which was moved in.

This seems like a fundamental rework would have to go into sssd, maybe the easiest would be to default to polling instead of inotify until all this is fixed?

Revision history for this message
OliFre (freyermuth) wrote :

I'll test tomorrow if try_inotify=false in the [sssd] section also works around this issue, unless someone beats me to it...

Revision history for this message
renbag (renbag) wrote :

Since my last comment on this bug report (#16), I upgraded my workstation from artful to bionic. The machine also has an NVME SSD, but I do not see this sssd offline bug anymore.

However, I had intermittent problems in browsing the shares inside my AD domain and, when I read here about possible race problems in sssd, I tried to solve them by putting back in place the systemd workaround of this bug description. The result is that with the workaround I can always browse the network shares (I'm using thunar in XFCE4, which in turn uses gvfsd-smb-browse), while without it I can't see the shares almost all the times. This is a consistent behaviour that I noticed in the last days and after several reboots of the workstation.
Should this be reported as a different bug?

Note: the samba browsing problem is not related to the Security Improvements described in the Bionic Release Notes (https://wiki.ubuntu.com/BionicBeaver/ReleaseNotes).

Revision history for this message
OliFre (freyermuth) wrote :

@renbag: Do you see "sssd[be[896]: Backend is offline" in "systemctl status sssd"?
Is your backend server given as hostname / fqdn in sssd.conf?

@ahasenack: try_inotify=false did not change the issue at all. With strace, I can still see 127.0.0.1 being queried. IMHO this can only be an ugly race between c_ares channel creation and resolv.conf changing. In addition, inotify support is borked, since no new watch is created when the inode of the watched file changes, but this should be a different issue.

Revision history for this message
OliFre (freyermuth) wrote :

@renbag: It might be that in terms of timing, the DHCP server is also involved, since it defines the time network-online.target takes to start. Our setup is relatively slow, since we have a DHCP server under high load (several 10000 leases) and are using a dhcrelay in between.
Potentially, you can still force the issue to appear by inverting the workaround to force sssd to start before network-online.target.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

The sssd.conf manpage from sssd 1.13.4 and 1.16.1 (xenial and bionic) says that without inotify sssd will fall back to polling resolv.conf every 5s:

try_inotify (boolean)
    SSSD monitors the state of resolv.conf to identify when it needs to update its internal DNS resolver. By default, we will attempt to use inotify for this, and will fall back to polling resolv.conf every five seconds if inotify cannot be used.

Did you observe that when setting try_inotify to false?

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Interesting, I setup a watch on a directory to see what happens when a symlink is changed:

https://pastebin.ubuntu.com/p/wKpMVY4dVz/

The watch on the "old" symlink definitely is lost via that operation, as far as I can see.

Revision history for this message
OliFre (freyermuth) wrote :

After setting try_inotify to false, I see that the message that an inotify was set up does not show up anymore, but no message that polling is used instead is shown in the logs. If I understand the code correctly, that's just right - a message will only be shown if try_inotify is true and it actually fails.

The polling, though, appears to be pretty useless when checking the code. It checks whether "mtime" was changed, on "/etc/resolv.conf". That seems to rarely change for the symlink, but only for the symlink target, so actual resolv-file changes are not noticed...

I think your observation concerning inotify is expected, since inotify works with on inode numbers, IIUC.

Revision history for this message
Andrew Conway (acubuntuone) wrote :

I also have observed this problem on bionic for several different computers. The workaround always solves the problem; without the workaround I cannot log in. What the computers had in common:

* Using bionic (either the version from the DVD or with network updates)
* Using networkd rather than NetworkManager with a static ipv4 address
* sssd.conf uses fqdn to refer to the name servers
* A backup ldap and kerberos server were both provided.

Revision history for this message
OliFre (freyermuth) wrote :

This also mostly applies to us:
* Bionic, installed via network installer.
* We did not configure network explicitly, AFAICT, this means networkd is used. However, we run with DHCP.
* sssd.conf uses FQDNs for the kerberos and ldap servers (I guess you meant that)?
* We also provide backup servers for both.

Revision history for this message
Mark Foster (fostermarkd) wrote :

We are seeing this too, on Bionic servers installed from netboot / PXE.

Bionic is (still) missing from the Affected
Status says "Fix Released" but where is that? I don't think it's true.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Sorry, "fix released" is for the current development of Ubuntu which is 18.10. I'll add a task for bionic, and remove the artful one.

Changed in sssd (Ubuntu Artful):
status: Triaged → Won't Fix
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in sssd (Ubuntu Bionic):
status: New → Confirmed
Revision history for this message
Jani Jaakkola (jj-lousa) wrote :

I hit this bug too with Ubuntu 18.04. On boot sssd fails to find AD servers from DNS and then continues to query 127.0.0.1 port 53, which will never answer. The log looks like this:

(Wed Oct 10 13:04:13 2018) [sssd[be[ad.helsinki.fi]]] [resolv_getsrv_send] (0x0100): Trying to resolve SRV record of '_ldap._tcp.helsinki._sites.ad.helsinki.fi'
(Wed Oct 10 13:04:13 2018) [sssd[be[ad.helsinki.fi]]] [resolv_discover_srv_done] (0x0040): SRV query failed [11]: Could not contact DNS servers

which does not help much, since it doesn't mention the non existing DNS server (127.0.0.1) in the logs (you need to run strace to see it).

Restarting sssd makes it go back online.

Revision history for this message
Ron Terwilliger (ront2018) wrote :

I can confirm the same issue with Ubuntu 18.04 (used the DVD). SSSD cannot resolve the the backend and then says it's offline. Restarting SSSD after a reboot always resolves the issue.

I've found two workarounds in 18.04:
1) Use your own resolv.conf and not rely on systemd-resolved.service.
"unlink /etc/resolv.conf" and then create your own resolv.conf that doesn't rely on systemd-resolved.

2) Make SSSD start after systemd-resolved.service. I prefer this method because setting up the nameservers is still automatic (in case they change).
cp -p /lib/systemd/system/sssd.service /etc/systemd/system/
sed -i 's/Wants.*$/Wants &\nAfter=systemd-resolved.service/g' /etc/systemd/system/sssd.service

Hope this helps someone.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

A PR which might address this was posted to the sssd mailing list: https://github.com/SSSD/sssd/pull/864

Revision history for this message
meskaya (meskaya) wrote :

The code in the PR above has now been pushed can we get a new build of SSSD including it ?

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I agree. Do you have a way to reproduce this somewhat reliably?

Changed in sssd (Ubuntu Bionic):
importance: Undecided → High
Revision history for this message
meskaya (meskaya) wrote :

I am trying to seed SSSD cache. The setup is working fine, I can do a successful offline authentication but once I am connected to my company network using VPN, SSSD cache is never refreshed unless I restart SSSD or replace /etc/resolv.conf symlink (I am using systemd-resolved) by a regular /etc/resolv.conf file.

tags: added: server-next
Revision history for this message
meskaya (meskaya) wrote :

Hi all.
@Andreas Hasenack any progress on this :) ?
Anything we can do to help.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I'll get to it. Being able to test the fix would help a lot, since I never got it reproduced locally.

Revision history for this message
Jens Elkner (jelmd) wrote :

Same here - it is a nightmare:

(Wed Mar 4 16:19:36 2020) [sssd[be[LDAP]]] [resolv_gethostbyname_done] (0x0040): querying hosts database failed [5]: Input/output error
(Wed Mar 4 16:19:36 2020) [sssd[be[LDAP]]] [fo_resolve_service_done] (0x0020): Failed to resolve server 'ldap....': Could not contact DNS servers
(Wed Mar 4 16:19:36 2020) [sssd[be[LDAP]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (ldap....), resolver returned [5]: Input/output error

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

This is still on my list, I'm just very busy with focal fossa right now, but the feature freeze is past, and now I can concentrate on only fixing bugs.

Changed in sssd (Ubuntu Bionic):
assignee: nobody → Andreas Hasenack (ahasenack)
status: Confirmed → In Progress
Revision history for this message
Andrew Conway (acubuntuone) wrote :

It has been a constant problem for me in 18.04 but seems to work fine in 20.04 on the couple of computers I have tried it on.

However I have had reliability issues with it in 20.04 on one computer with multiple ethernet adapters with only one plugged in and bridging set up (for VMs). That may be just my misconfiguration though. Disabling one adapter made it work fine.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Ok, I have a simple way to reproduce this, and was able to test the upstream patch. I'll polish it up and put it up for review, then prepare the SRU paperwork.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Just tested and eoan is affected, focal is clear.

Changed in sssd (Ubuntu Eoan):
assignee: nobody → Andreas Hasenack (ahasenack)
status: New → In Progress
description: updated
description: updated
description: updated
description: updated
description: updated
description: updated
description: updated
description: updated
description: updated
description: updated
description: updated
Changed in sssd (Ubuntu Eoan):
importance: Undecided → High
description: updated
Revision history for this message
Robie Basak (racb) wrote :

> b) change the fallback polling code to keep trying, instead of giving up right away

AFAICT, this will only happen if /etc/resolv.conf is a broken symlink.

What happens if the symlink exists but doesn't provide something valid, sssd goes into offline, and then the symlink is changed to something that does work? I get the impression that this case hasn't been fixed. Am I right?

Nevertheless the fix you have is correct and would be useful to land for the common case failure. Thank you to everyone involved in identifying the problem. So +1 for the SRU review, and this is ready to accept. I only wait in case you want to do something different if my edge case above is correct. Let me know.

description: updated
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

> AFAICT, this will only happen if /etc/resolv.conf is a broken symlink.

Correct. I would have expected the default inotify case to handle that, though, as it will see that the contents of resolv.conf (or the symlink target) changed, and in fact it notices that, but doesn't seem to switch back to the online status.

I find it hard to test dns resolution issues in ubuntu, though, something unexpected always happens, like bug #1765477 which introduces a 5s delay and might be what is causing sssd to think even the new dns isn't working. I tested on bionic your case above.

In any case, the fixes are very localized to the bug reported here, and I believe fix them. Trying to handle the case you mentioned would introduce more regression potential, and investigating it further would add more delays. I think we should proceed with the SRU as is.

Thanks!

Revision history for this message
Robie Basak (racb) wrote : Please test proposed package

Hello Marian, or anyone else affected,

Accepted sssd into eoan-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/sssd/2.2.0-4ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-eoan to verification-done-eoan. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-eoan. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in sssd (Ubuntu Eoan):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-eoan
Changed in sssd (Ubuntu Bionic):
status: In Progress → Fix Committed
tags: added: verification-needed-bionic
Revision history for this message
Robie Basak (racb) wrote :

Hello Marian, or anyone else affected,

Accepted sssd into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/sssd/1.16.1-1ubuntu1.6 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

eoan verification

With package from proposed:
 *** 2.2.0-4ubuntu1.1 500
        500 http://archive.ubuntu.com/ubuntu eoan-proposed/main amd64 Packages

a) inotify test
Since I have the proposed package, logs show sssd is monitoring the symlink's target:
(Tue Jun 2 13:37:27 2020) [sssd] [_snotify_create] (0x0400): Added a watch for /etc/resolv.conf.target with inotify flags 0x8D88 internal flags 0x1 using function resolv_conf_inotify_cb after delay 1.0

Domain is offline as expected:
root@eoan-sssd-1723350:~# sssctl domain-status LDAP
Online status: Online

Active servers:
LDAP: ldap01.example.com

Discovered LDAP servers:
- ldap01.example.com

After unbreaking the symlink, logs show sssd picked the change up:
(Tue Jun 2 13:39:21 2020) [sssd] [process_dir_event] (0x0400): received notification for watched file [resolv.conf.target] under /etc

And the domain is now online:
root@eoan-sssd-1723350:~# sssctl domain-status LDAP
Online status: Online

Active servers:
LDAP: ldap01.example.com

Discovered LDAP servers:
- ldap01.example.com

b) polling test

After restarting with the proposed sssd package, I see it detected /etc/resolv.conf is broken, and since it's not using inotify, it starts the polling:
(Tue Jun 2 14:13:57 2020) [sssd] [monitor_config_file_fallback] (0x0020): file [/etc/resolv.conf] is missing. Will try again later.
(Tue Jun 2 14:14:07 2020) [sssd] [monitor_config_file_fallback] (0x0020): file [/etc/resolv.conf] is missing. Will try again later.
...
sssctl shows domain is offline, as expected:
root@eoan-sssd-1723350:~# sssctl domain-status LDAP
Online status: Offline

Active servers:
LDAP: not connected

Discovered LDAP servers:
- ldap01.example.com

After unbreaking the symlink and waiting a few seconds, sssd picks the change up:
(Tue Jun 2 14:20:12 2020) [sssd] [monitor_config_file_fallback] (0x0020): file [/etc/resolv.conf] is missing. Will try again later.
(Tue Jun 2 14:20:22 2020) [sssd] [signal_res_init] (0x0040): Reloading Resolv.conf.

And the domain is now online:
root@eoan-sssd-1723350:~# sssctl domain-status LDAP
Online status: Online

Active servers:
LDAP: ldap01.example.com

Discovered LDAP servers:
- ldap01.example.com

eoan verification successful.

description: updated
tags: added: verification-done-eoan
removed: verification-needed-eoan
description: updated
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

bionic verification

With package from proposed:
 *** 1.16.1-1ubuntu1.6 500
        500 http://archive.ubuntu.com/ubuntu bionic-proposed/main amd64 Packages

a) inotify test
Since I have the proposed package, logs show sssd is monitoring the symlink's target:
(Tue Jun 2 14:39:32 2020) [sssd] [_snotify_create] (0x0400): Added a watch for /etc/resolv.conf.target with inotify flags 0x8D88 internal flags 0x1 using function resolv_conf_inotify_cb after delay 1.0

Domain is offline as expected:
root@bionic-sssd-1723350:~# sssctl domain-status LDAP
Online status: Offline

Active servers:
LDAP: not connected

Discovered LDAP servers:
- ldap01.example.com

root@bionic-sssd-1723350:~#

After unbreaking the symlink, logs show sssd picked the change up:
(Tue Jun 2 14:40:17 2020) [sssd] [process_dir_event] (0x0400): received notification for watched file [resolv.conf.target] under /etc

And the domain is now online:
root@bionic-sssd-1723350:~# sssctl domain-status LDAP
Online status: Online

Active servers:
LDAP: ldap01.example.com

Discovered LDAP servers:
- ldap01.example.com

b) polling test

After restarting with the proposed sssd package, I see it detected /etc/resolv.conf is broken, and since it's not using inotify, it starts the polling:
(Tue Jun 2 14:44:59 2020) [sssd] [monitor_config_file_fallback] (0x0020): file [/etc/resolv.conf] is missing. Will try again later.
(Tue Jun 2 14:45:09 2020) [sssd] [monitor_config_file_fallback] (0x0020): file [/etc/resolv.conf] is missing. Will try again later.
...

sssctl shows domain is offline, as expected:
root@bionic-sssd-1723350:~# sssctl domain-status LDAP
Online status: Offline

Active servers:
LDAP: not connected

Discovered LDAP servers:
- ldap01.example.com

After unbreaking the symlink and waiting a few seconds, sssd picks the change up:
(Tue Jun 2 14:45:49 2020) [sssd] [monitor_config_file_fallback] (0x0020): file [/etc/resolv.conf] is missing. Will try again later.
(Tue Jun 2 14:45:59 2020) [sssd] [signal_res_init] (0x0040): Reloading Resolv.conf.

And the domain is now online:
root@bionic-sssd-1723350:~# sssctl domain-status LDAP
Online status: Online

Active servers:
LDAP: ldap01.example.com

Discovered LDAP servers:
- ldap01.example.com

bionic verification successful.

tags: added: verification-done-bionic
removed: verification-needed-bionic
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package sssd - 2.2.0-4ubuntu1.1

---------------
sssd (2.2.0-4ubuntu1.1) eoan; urgency=medium

  * d/p/monitor-propagate-error.patch,
    d/p/monitor-resolve-symlinks.patch: correctly monitor the
    /etc/resolv.conf symlink when its target changes from a non-existent
    file to one that exists (LP: #1723350)

 -- Andreas Hasenack <email address hidden> Tue, 12 May 2020 16:58:37 -0300

Changed in sssd (Ubuntu Eoan):
status: Fix Committed → Fix Released
Revision history for this message
Robie Basak (racb) wrote : Update Released

The verification of the Stable Release Update for sssd has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package sssd - 1.16.1-1ubuntu1.6

---------------
sssd (1.16.1-1ubuntu1.6) bionic; urgency=medium

  * d/p/monitor-propagate-error.patch,
    d/p/monitor-resolve-symlinks.patch: correctly monitor the
    /etc/resolv.conf symlink when its target changes from a non-existent
    file to one that exists (LP: #1723350)

 -- Andreas Hasenack <email address hidden> Mon, 11 May 2020 17:42:04 -0300

Changed in sssd (Ubuntu Bionic):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.