sssd offline on boot, stays offline forever

Bug #1723350 reported by Marian Rainer-Harbach on 2017-10-13
This bug affects 14 people
Affects / Importance / Assigned to:
sssd (Ubuntu): Medium, Unassigned
sssd (Ubuntu Artful): Medium, Unassigned
sssd (Ubuntu Bionic): High, Andreas Hasenack
sssd (Ubuntu Eoan): High, Andreas Hasenack

Bug Description

[Impact]
sssd can switch to an offline mode of operation when it cannot reach the authentication or id backend. It uses several methods to assess the situation, and one of them is monitoring the /etc/resolv.conf file for changes.

In Ubuntu that file is a symlink to /run/systemd/resolve/stub-resolv.conf, but the target doesn't exist at all times during boot: the symlink is expected to be dangling for a while early in the boot sequence.
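The dangling-symlink state is easy to reproduce outside of boot. A minimal simulation in a scratch directory (the paths below are stand-ins for the real /etc and /run, not sssd's code):

```shell
# Simulate the early-boot state: /etc/resolv.conf exists as a symlink,
# but systemd-resolved has not yet written its target.
tmp=$(mktemp -d); cd "$tmp"
mkdir -p run/systemd/resolve etc
ln -s ../run/systemd/resolve/stub-resolv.conf etc/resolv.conf
test -e etc/resolv.conf || echo "resolv.conf is dangling (early boot)"

# Later in boot, systemd-resolved creates the target and the path resolves.
echo "nameserver 127.0.0.53" > run/systemd/resolve/stub-resolv.conf
test -e etc/resolv.conf && echo "resolv.conf resolves now"
```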

It turns out that the monitoring sssd was doing on /etc/resolv.conf didn't take into account that what could change was the *target* of the symlink. It completely ignored that possibility and didn't notice when the resolv.conf contents actually changed in this scenario, which resulted in sssd staying in offline mode when it shouldn't.

There are two fixes being pulled in for this SRU:
a) fix the monitoring of the target of the /etc/resolv.conf symlink
b) change the fallback polling code to keep trying, instead of giving up right away

[Test Case]
It's recommended to test this in a LXD container or a VM.

Preparation steps. When prompted for an openldap/slapd password, choose any password you want; it won't be needed again:
$ sudo apt update && sudo apt dist-upgrade
$ sudo apt install sssd-ldap sssd-tools sssd-dbus slapd dnsmasq

Become root:
$ sudo su -

Detect your ip:
# export interface=$(ip route | grep default | sed -r 's,^default via .* dev ([a-z0-9]+) .*,\1,')
# export ip=$(ip addr show dev $interface | grep "inet [0-9]" | awk '{print $2}' | cut -d / -f 1)

Confirm the $ip variable is correct for your case:
# echo $ip

Create /etc/dnsmasq.d/sssd-test.conf using your real ip:
# cat > /etc/dnsmasq.d/sssd-test.conf <<EOF
host-record=ldap01.example.com,$ip
listen-address=$ip
interface=$interface
except-interface=lo
bind-interfaces
EOF

Restart dnsmasq:
# systemctl restart dnsmasq

a) inotify test
Create /etc/sssd/sssd.conf:
# cat > /etc/sssd/sssd.conf <<EOF
[sssd]
config_file_version = 2
services = nss, pam, ifp
domains = LDAP
debug_level = 6

[domain/LDAP]
id_provider = ldap
ldap_uri = ldap://ldap01.example.com
cache_credentials = True
ldap_search_base = dc=example,dc=com
EOF

# chmod 0600 /etc/sssd/sssd.conf

# rm /etc/resolv.conf
# ln -s /etc/resolv.conf.target /etc/resolv.conf

Create a good resolv.conf:
# echo "nameserver $ip" > /etc/resolv.conf.good

Confirm /etc/resolv.conf is a broken symlink:
# ls -l /etc/resolv.conf*
lrwxrwxrwx 1 root root 23 May 13 20:48 /etc/resolv.conf -> /etc/resolv.conf.target
-rw-r--r-- 1 root root 24 May 13 20:48 /etc/resolv.conf.good

Open another terminal/screen and tail the sssd logs with a grep:
# tail -F /var/log/sssd/sssd.log | grep -i resolv

Start sssd:
# systemctl restart sssd

The tail output from that other terminal should say sssd is monitoring /etc/resolv.conf (that's the broken symlink):
(Mon May 18 17:32:34 2020) [sssd] [_snotify_create] (0x0400): Added a watch for /etc/resolv.conf with inotify flags 0x8D88 internal flags 0x1 using function resolv_conf_inotify_cb after delay 1.0

Repeat this sssctl call until it shows the offline mode persistently:
# sssctl domain-status LDAP
Online status: Offline

Active servers:
LDAP: not connected

Discovered LDAP servers:
- ldap01.example.com

"Unbreak" the symlink:
# cp /etc/resolv.conf.good /etc/resolv.conf.target

The log should now say:
(Mon May 18 17:33:30 2020) [sssd] [process_dir_event] (0x0400): Not interested in resolv.conf.target

This shows that sssd didn't pick up that resolv.conf changed, via the target of the symlink.

Run sssctl again, and the online status will remain offline.

With the fixed packages, the sssd startup log will say:
(Mon May 18 17:17:06 2020) [sssd] [_snotify_create] (0x0400): Added a watch for /etc/resolv.conf.target with inotify flags 0x8D88 internal flags 0x1 using function resolv_conf_inotify_cb after delay 1.0

Showing that it's monitoring the symlink target.

And after fixing the broken symlink, it will say:
(Mon May 18 17:18:06 2020) [sssd] [process_dir_event] (0x0400): received notification for watched file [resolv.conf.target] under /etc

Run sssctl again; it should almost immediately switch to online:
# sssctl domain-status LDAP
Online status: Online

Active servers:
LDAP: ldap01.example.com

Discovered LDAP servers:
- ldap01.example.com

b) polling test
Repeat the previous test, but with "try_inotify = false" in sssd.conf, like this:
# cat > /etc/sssd/sssd.conf <<EOF
[sssd]
config_file_version = 2
services = nss, pam, ifp
domains = LDAP
debug_level = 6
try_inotify = false

[domain/LDAP]
id_provider = ldap
ldap_uri = ldap://ldap01.example.com
cache_credentials = True
ldap_search_base = dc=example,dc=com
EOF

Then follow the other steps of test case (a).

Upon startup, the sssd log will show:
(Mon May 18 17:57:04 2020) [sssd] [monitor_config_file_fallback] (0x0080): file [/etc/resolv.conf] is missing. Will not update online status based on watching the file

Note how it says it will not watch it for updates.

With the fixed package, the log will show this instead:
(Mon May 18 18:02:56 2020) [sssd] [monitor_config_file_fallback] (0x0020): file [/etc/resolv.conf] is missing. Will try again later.

Notice how it says it will try again, and it keeps trying if you continue watching the log (once every 10s).
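The fallback poller's view of the situation can be sketched with stat(1): while the symlink is dangling, a dereferencing stat of the path fails (the "file is missing" log line), and it starts succeeding as soon as the target appears. This is a scratch-directory simulation, not sssd's actual code:

```shell
tmp=$(mktemp -d); cd "$tmp"
ln -s does-not-exist resolv.conf

# Dangling symlink: a dereferencing stat fails, as sssd's poll does.
stat -L resolv.conf >/dev/null 2>&1 || echo "poll: file is missing"

# "Unbreak" the symlink by creating the target; the next poll succeeds.
echo "nameserver 127.0.0.53" > does-not-exist
stat -L resolv.conf >/dev/null 2>&1 && echo "poll: file found, reload resolver"
```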

"Unbreak" the symlink:
# cp /etc/resolv.conf.good /etc/resolv.conf.target

And the log output will eventually show this:
(Mon May 18 18:04:06 2020) [sssd] [monitor_config_file_fallback] (0x0020): file [/etc/resolv.conf] is missing. Will try again later.
(Mon May 18 18:04:17 2020) [sssd] [signal_res_init] (0x0040): Reloading Resolv.conf.

And sssctl will show the now online status:
# sssctl domain-status LDAP
Online status: Online

Active servers:
LDAP: ldap01.example.com

Discovered LDAP servers:
- ldap01.example.com

[Regression Potential]
Breaking sssd can mean preventing logins of remote network users on a system, but local users (and root) should still be able to log in. Another possible source of regressions, although not related to the code changes in this update: if there are pre-existing incorrect changes to sssd.conf, they might prevent the service from restarting when this update is applied.

[Other Info]
Not at this time.

[Original Description]

SSSD 1.15.3-2ubuntu1 on 17.10/artful (previous versions on artful were also affected) is offline on boot and seems to stay offline forever (I waited over 20 minutes).

sssd_nss.log:
(Fri Oct 13 09:49:50 2017) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(Fri Oct 13 09:49:51 2017) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(Fri Oct 13 09:49:51 2017) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
(Fri Oct 13 09:49:51 2017) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]
...

SSSD immediately returns to normal operation after restarting it or after sending SIGUSR2.

A workaround for the problem is creating the file /etc/systemd/system/sssd.service.d/override.conf with the contents:
[Unit]
Requires=network-online.target
After=network-online.target
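For reference, a sketch of installing that drop-in. DEST defaults to a scratch directory here so the sketch is a dry run; point it at /etc and run systemctl daemon-reload to apply it for real:

```shell
DEST="${DEST:-$(mktemp -d)}"   # use DEST=/etc for a real installation
mkdir -p "$DEST/systemd/system/sssd.service.d"
cat > "$DEST/systemd/system/sssd.service.d/override.conf" <<'EOF'
[Unit]
Requires=network-online.target
After=network-online.target
EOF
# systemctl daemon-reload && systemctl restart sssd   # when DEST=/etc
grep -c network-online.target "$DEST/systemd/system/sssd.service.d/override.conf"
```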

Related branches

renbag (renbag) wrote :

I confirm the bug and that the workaround solves the problem.
If the sssd configuration doesn't include:
cache_credentials = true
it is not even possible to do an Active Directory login.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in sssd (Ubuntu):
status: New → Confirmed
Andreas Hasenack (ahasenack) wrote :

I can confirm this is happening after a reboot, and that a simple USR2 signal fixes it, but I wonder why sssd doesn't get itself back into online mode on its own in this case.

For example, I tried the following:
- login via kerberos using pam_sss while it was online
- got my ticket
- kdestroy, logout
- in another terminal, bring down the network on this client workstation
- try go login again
- login failed, and sssd log showed:
(Fri Oct 13 21:23:15 2017) [sssd[pam]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]

I then brought the network back up, and after a while (a minute or less), login was working again using krb5. This shows sssd recovered from the induced "outage".

The systemd unit file we are using in ubuntu comes straight from upstream.

(some searching)

Found these upstream tickets:

https://pagure.io/SSSD/sssd/issue/3467
https://pagure.io/SSSD/sssd/issue/3294

Changed in sssd (Ubuntu):
status: Confirmed → Triaged
importance: Undecided → Medium

Also affects Ubuntu Bionic (18.04)

A nicer workaround, for those using NetworkManager & systemd would be to tell sssd to try going online when an interface comes up.

Create /etc/NetworkManager/dispatcher.d/sssd_notify (and make it executable with chmod +x) with the contents:
#!/bin/bash
case "$2" in
  up|vpn-up) systemctl kill --signal=USR2 sssd.service ;;
esac

summary: - sssd offline on boot, stays offline forever (artful)
+ sssd offline on boot, stays offline forever (artful, bionic)

@Simon: Interesting – I can't reproduce the problem in 18.04 anymore. I tested a VM and a laptop using ethernet and wifi, and sssd was online on boot without any workarounds.

Andreas Hasenack (ahasenack) wrote :

The test in comment #4 is more thorough, as it checks that sssd can detect that the network is up again on its own. Can you try that on bionic please?

I tried the steps in #4. As expected, login failed while the network was offline. It worked again immediately after enabling the network. So to me it seems the problem really does not occur in bionic anymore.

Andreas Hasenack (ahasenack) wrote :

I also tried again in bionic, including rebooting the sssd client while the krb5 server was offline, and it automatically switched back to an online status after I restored the network on my krb5 server. I then tried this again on xenial, and there it also worked.

Let me try artful now.

Andreas Hasenack (ahasenack) wrote :

Reading what I wrote in comment #4, it seems like I'm contradicting myself: first I say I can reproduce the problem, but later I say that after I bring the network back up it starts working again on its own. That's exactly what I'm seeing now, i.e., no bug. I tried xenial, artful and bionic. Are you guys on a desktop by any chance, with just a wifi connection perhaps?

Changed in sssd (Ubuntu):
status: Triaged → Incomplete
renbag (renbag) wrote :

I confirmed the bug in artful, but now, with the same sssd.conf in bionic, I do not see the bug anymore.

Andreas Hasenack (ahasenack) wrote :

I'll mark this as fix released for bionic then, and open up an artful task.

Changed in sssd (Ubuntu):
status: Incomplete → Fix Released
summary: - sssd offline on boot, stays offline forever (artful, bionic)
+ sssd offline on boot, stays offline forever
Changed in sssd (Ubuntu Artful):
status: New → Triaged
importance: Undecided → Medium
Andreas Hasenack (ahasenack) wrote :

@renbag, can you share your network configuration on this artful machine? These files:

/etc/network/interfaces
/etc/network/interfaces.d/*
/etc/netplan/*.yaml

And is this a desktop or a server?

renbag (renbag) wrote :

The artful machine is a workstation and has only a wired network interface managed by network-manager that gets configured by a DHCP server on our LAN.
The files you requested are not present or have little or no information:

/etc/network/interfaces
--------------------------------------------------
# interfaces(5) file used by ifup(8) and ifdown(8)
auto lo
iface lo inet loopback
--------------------------------------------------

/etc/netplan/01-network-manager-all.yaml
--------------------------------------------------
# Let NetworkManager manage all devices on this system
network:
  version: 2
  renderer: NetworkManager
--------------------------------------------------

Andreas Hasenack (ahasenack) wrote :

@renbag could you please:

- add debug_level = 4 to [sssd] and your [domain/] sections
- reboot
- post the output of these commands right after:

systemd-analyze blame

systemd-analyze critical-chain

- login as a kerberos/ldap user, which has been failing
- check the new logs in /var/log/sssd/* trying to catch the moment it decides it is offline. You could attach them to this bug, though your username and realm will show up and it might be too much to sanitize

In my artful-desktop VM, set up to use krb5 and ldap from another machine, logins after reboots work just fine. I can also take the network down, fail a login, bring the network back up, and without restarting anything log in just fine afterwards. So we haven't gotten to the bottom of this yet.

renbag (renbag) wrote :

@Andreas: thank you for your suggestions. I removed the workaround from my artful machine and did what you asked, then, after some minutes, I reactivated the workaround and did the same. All the logs are attached for comparison (the real username and domain were redacted).

renbag (renbag) wrote :

/var/log/sssd/* attachment.

renbag (renbag) wrote :
Andreas Hasenack (ahasenack) wrote :

The workaround you speak of, is that the sssd option to cache credentials, or the systemd unit file change to require network-online.target?

renbag (renbag) wrote :

The workaround is the systemd unit file written in the Bug Description.
The 'cache credentials' config option does not have any effect on the bug and for the tests I have disabled it.

Jakub Hrozek (jakub-hrozek) wrote :

Are the sssd logs captured with a high enough debug_level in the [sssd] section of sssd.conf? Normally sssd should detect notifications from libnetlink and reset the offline status. It would be nice to see those and correlate them with the system logs.

OliFre (freyermuth) wrote :

Could somebody please add Bionic back?
We observe this on Bionic; it's pretty random: mostly machines with fast root devices (e.g. NVMe) are affected. The backend is offline on boot and stays offline.
Since it's so random and I don't want to enable debug information on all machines we have, I can't promise that we will be able to collect debug data.

OliFre (freyermuth) wrote :

@jakub-hrozek: When activating debug_level on some machines, which level is high enough?

Andreas Hasenack (ahasenack) wrote :

Maybe start with 6. From https://docs.pagure.org/SSSD.sssd/users/troubleshooting.html:
"""
To enable debugging persistently across SSSD service restarts, put the directive debug_level=N, where N typically stands for a number between 1 and 10 into the particular section. Levels up to 3 should log mostly failures (although we haven’t really been consistent especially earlier in the SSSD development) and anything above level 8 provides a large number of log messages. Level 6 might be a good starting point for debugging problems.
"""

OliFre (freyermuth) wrote :

@ahasenack: Even in case things "work", I observe the following:

Jun 11 14:41:15 cip000 systemd[1]: Starting System Security Services Daemon...
Jun 11 14:41:24 cip000 sssd[920]: Starting up
Jun 11 14:41:25 cip000 sssd[be[998]: Starting up
Jun 11 14:41:28 cip000 sssd[1021]: Starting up
Jun 11 14:41:28 cip000 sssd[1022]: Starting up
Jun 11 14:41:29 cip000 systemd[1]: Started System Security Services Daemon.
Jun 11 14:42:38 cip000 sssd[be[998]: Backend is online

Users trying to log in between 14:41 and 14:42 get an error message.
In the logfiles, I only see tons of:
(Mon Jun 11 14:42:32 2018) [sssd[nss]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]

in the nss-logs, and a few:
(Mon Jun 11 14:41:49 2018) [sssd[pam]] [sss_dp_get_reply] (0x0010): The Data Provider returned an error [org.freedesktop.sssd.Error.DataProvider.Offline]

in the sssd_pam log for each failed attempt.

Even though I get "Backend is online" in this case, the logs don't tell me why it actually recovered. So maybe 6 is not high enough?
Is it really sufficient to enable debugging only for sssd itself, or do I need to enable it for the provider, too?

Also, shouldn't this debug_level be a bitmask?

In any case, network connectivity was available well before 14:42:38. I actually logged in via ssh at 14:41:52 as user, but since the backend was still offline, no kerberos ticket was initialized.

OliFre (freyermuth) wrote :

I've now had a machine with actual failure and debug_level=9 in sections sssd, pam and nss.
No more information was produced, just tons of org.freedesktop.sssd.Error.DataProvider.Offline .

Seems that the official manual is outdated; will try with bitmasks next.

OliFre (freyermuth) wrote :

I tried with 0x7770 now in all sections, but I don't get more info.
Now, sssd.log shows:
(Mon Jun 11 15:10:34:313118 2018) [sssd] [sss_ini_call_validators] (0x0020): [rule/allowed_sssd_options]: Attribute 'debug_devel' is not allowed in section 'sssd'. Check for typos.
for each section :-(.

So apparently the manpage is also wrong? I specified:
debug_devel = 0x7770

So I have no idea how to provide debugging info. But I have a bionic machine (for 1-2 days) on which I can reproduce it well, since it has an SSD.
It seems to be highly reproducible on any SSD/NVMe Bionic machine.

OliFre (freyermuth) wrote :

Argh! My fault, with the typo fixed, I see the following:

Jun 11 15:36:27 exp006 sssd[be[918]: Backend is offline
[...]
(Mon Jun 11 15:36:31 2018) [sssd] [message_type] (0x0200): netlink Message type: 24
(Mon Jun 11 15:36:31 2018) [sssd] [route_msg_debug_print] (0x1000): route idx 2 flags 0 family 2 addr correct_ip_address/32
(Mon Jun 11 15:36:31 2018) [sssd] [network_status_change_cb] (0x2000): A networking status change detected signaling providers to reset offline status
(Mon Jun 11 15:36:31 2018) [sssd] [sbus_add_timeout] (0x2000): 0x5616e2ab79f0
[...]
(Mon Jun 11 15:37:50 2018) [sssd[be[uni-bonn.de]]] [resolv_gethostbyname_step] (0x2000): Querying DNS
(Mon Jun 11 15:37:50 2018) [sssd[be[uni-bonn.de]]] [resolv_gethostbyname_dns_query] (0x0100): Trying to resolve A record of 'kdc1.OUR.INSTITUTION' in DNS
(Mon Jun 11 15:37:50 2018) [sssd[be[uni-bonn.de]]] [schedule_request_timeout] (0x2000): Scheduling a timeout of 6 seconds
(Mon Jun 11 15:37:50 2018) [sssd[be[uni-bonn.de]]] [schedule_timeout_watcher] (0x2000): Scheduling DNS timeout watcher
(Mon Jun 11 15:37:50 2018) [sssd[be[uni-bonn.de]]] [unschedule_timeout_watcher] (0x4000): Unscheduling DNS timeout watcher
(Mon Jun 11 15:37:50 2018) [sssd[be[uni-bonn.de]]] [request_watch_destructor] (0x0400): Deleting request watch
(Mon Jun 11 15:37:50 2018) [sssd[be[uni-bonn.de]]] [resolv_gethostbyname_done] (0x0040): querying hosts database failed [5]: Input/output error
(Mon Jun 11 15:37:50 2018) [sssd[be[uni-bonn.de]]] [fo_resolve_service_done] (0x0020): Failed to resolve server 'kdc1.OUR_INSTITUTION': Could not contact DNS servers
(Mon Jun 11 15:37:50 2018) [sssd[be[uni-bonn.de]]] [set_server_common_status] (0x0100): Marking server 'kdc1.OUR_INSTITUTION' as 'not working'

So it seems that even though the network runs fine, DNS fails.
dig kdc1.OUR_INSTITUTION A
works fine, though.

OliFre (freyermuth) wrote :

That's interesting.
strace'ing the sssd_be process, it seems that it's trying to contact 127.0.0.1 as DNS server, and fails.
systemd-resolved is running on 127.0.0.53 and works just fine.

OliFre (freyermuth) wrote :

Is this an inotify race due to the symlink-magic of systemd-resolved?

sssd.log:(Mon Jun 11 15:36:23 2018) [sssd] [_snotify_create] (0x0400): Added a watch for /etc/resolv.conf with inotify flags 0x8D88 internal flags 0x1 using function resolv_conf_inotify_cb after delay 1.0

# ls -la /etc/resolv.conf
lrwxrwxrwx 1 root root 39 Jun 11 13:12 /etc/resolv.conf -> ../run/systemd/resolve/stub-resolv.conf

# stat /run/systemd/resolve/stub-resolv.conf
  File: /run/systemd/resolve/stub-resolv.conf
  Size: 727 Blocks: 8 IO Block: 4096 regular file
Device: 17h/23d Inode: 950 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 101/systemd-resolve) Gid: ( 103/systemd-resolve)
Access: 2018-06-11 15:36:31.937607730 +0200
Modify: 2018-06-11 15:36:31.925607676 +0200
Change: 2018-06-11 15:36:31.925607676 +0200
 Birth: -

So it seems /run/systemd/resolve/stub-resolv.conf was modified slightly after sssd_be set up its inotify watch, but I don't see anything in the logs indicating that sssd was notified about this modification. Maybe because it's a symlink?

OliFre (freyermuth) wrote :

Needless to say, resolv.conf is:
nameserver 127.0.0.53
search OUR.INSTITUTION

OliFre (freyermuth) wrote :

I can confirm that adding the described override, all works fine.

Andreas Hasenack (ahasenack) wrote :

The inotify callback would log this:

    DEBUG(SSSDBG_TRACE_INTERNAL,
          "Received inotify notification for %s\n", filename);

That is level 8 debugging:
./src/util/debug.h:#define SSSDBG_TRACE_INTERNAL 0x2000 /* level 8 */

OliFre (freyermuth) wrote :

Thanks for searching out this info!
I grepped through all the logs for "inot" and found no match (only the initial inotify setup). So it seems sssd was not notified of the change via inotify.
That matches the behaviour "strace" showed: "127.0.0.1" was queried instead of "127.0.0.53".

So either, there is a race (resolv.conf symlink target changed right between sssd reading it and inotify watch being activated, if this is the order in which things happen?) or systemd-resolved somehow manages to change resolv.conf in a way that inotify does not alert on.

For example, it seems that if I do the following:
$ touch old_resolv
$ touch new_resolv
$ ln -s old_resolv resolv
and now I setup an inotify watch on resolv, and perform:
$ ln -sf new_resolv resolv
Then effectively, resolv is changed (since the symlink target changed) but inotify does not see it.

Not sure if this is what systemd-resolved is doing, but it's not unlikely...
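The observation above can be checked without inotify tooling: inotify watches are bound to inodes, and `ln -sf` re-points the link at a different inode, so a watch set up through the old link never fires again. A quick scratch-directory check (GNU stat; the file names are the hypothetical ones from the example above):

```shell
tmp=$(mktemp -d); cd "$tmp"
echo "nameserver 10.0.0.1" > old_resolv
echo "nameserver 10.0.0.2" > new_resolv
ln -s old_resolv resolv

before=$(stat -Lc %i resolv)   # inode of old_resolv (dereferenced)
ln -sf new_resolv resolv       # re-point the symlink
after=$(stat -Lc %i resolv)    # inode of new_resolv

# The path now leads to a different inode; a watch on the old one is stale.
[ "$before" != "$after" ] && echo "inode behind resolv changed"
```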

OliFre (freyermuth) wrote :

Another issue appears to be...
When starting sssd in highest debug level, and then changing resolv.conf once by doing:
$ touch foo
$ mv foo /run/systemd/resolve/stub-resolv.conf
(that's where the link target is pointing to), I see the expected:
(Tue Jun 12 00:16:57 2018) [sssd] [resolv_conf_inotify_cb] (0x2000): Received inotify notification for /etc/resolv.conf
(Tue Jun 12 00:16:57 2018) [sssd] [monitor_update_resolv] (0x1000): Resolv.conf has been updated. Reloading.

Doing that *again*, i.e.
$ touch bar
$ mv bar /run/systemd/resolve/stub-resolv.conf
Does *not* trigger sssd's inotify watch again. That's apparently because sssd does not re-register a new inotify watch for the new inode number of the file that was moved in.

It seems like a fundamental rework would have to go into sssd; maybe the easiest fix would be to default to polling instead of inotify until all this is sorted out?
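The same inode point applies to the mv case above: mv renames a different file over the path, so the path's inode changes and the single watch sssd registered goes stale. A scratch-directory check (hypothetical file names, GNU stat):

```shell
tmp=$(mktemp -d); cd "$tmp"
echo "nameserver 10.0.0.1" > stub-resolv.conf
i1=$(stat -c %i stub-resolv.conf)

echo "nameserver 10.0.0.2" > foo
mv foo stub-resolv.conf        # rename(2): the path now has foo's inode
i2=$(stat -c %i stub-resolv.conf)

[ "$i1" != "$i2" ] && echo "watched inode replaced"
```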

OliFre (freyermuth) wrote :

I'll test tomorrow if try_inotify=false in the [sssd] section also works around this issue, unless someone beats me to it...

renbag (renbag) wrote :

Since my last comment on this bug report (#16), I upgraded my workstation from artful to bionic. The machine also has an NVME SSD, but I do not see this sssd offline bug anymore.

However, I had intermittent problems browsing the shares inside my AD domain and, when I read here about possible race problems in sssd, I tried to solve them by putting the systemd workaround from this bug description back in place. The result is that with the workaround I can always browse the network shares (I'm using Thunar in XFCE4, which in turn uses gvfsd-smb-browse), while without it I can't see the shares almost all the time. This is consistent behaviour that I have noticed over the last days and across several reboots of the workstation.
Should this be reported as a different bug?

Note: the samba browsing problem is not related to the Security Improvements described in the Bionic Release Notes (https://wiki.ubuntu.com/BionicBeaver/ReleaseNotes).

OliFre (freyermuth) wrote :

@renbag: Do you see "sssd[be[896]: Backend is offline" in "systemctl status sssd"?
Is your backend server given as hostname / fqdn in sssd.conf?

@ahasenack: try_inotify=false did not change the issue at all. With strace, I can still see 127.0.0.1 being queried. IMHO this can only be an ugly race between c-ares channel creation and resolv.conf changing. In addition, inotify support is broken, since no new watch is created when the inode of the watched file changes, but that should be a different issue.

OliFre (freyermuth) wrote :

@renbag: It might be that in terms of timing, the DHCP server is also involved, since it defines the time network-online.target takes to start. Our setup is relatively slow, since we have a DHCP server under high load (several 10000 leases) and are using a dhcrelay in between.
Potentially, you can still force the issue to appear by inverting the workaround to force sssd to start before network-online.target.

Andreas Hasenack (ahasenack) wrote :

The sssd.conf manpage from sssd 1.13.4 and 1.16.1 (xenial and bionic) says that without inotify sssd will fall back to polling resolv.conf every 5s:

try_inotify (boolean)
    SSSD monitors the state of resolv.conf to identify when it needs to update its internal DNS resolver. By default, we will attempt to use inotify for this, and will fall back to polling resolv.conf every five seconds if inotify cannot be used.

Did you observe that when setting try_inotify to false?

Andreas Hasenack (ahasenack) wrote :

Interesting, I setup a watch on a directory to see what happens when a symlink is changed:

https://pastebin.ubuntu.com/p/wKpMVY4dVz/

The watch on the "old" symlink definitely is lost via that operation, as far as I can see.

OliFre (freyermuth) wrote :

After setting try_inotify to false, I see that the message that an inotify was set up does not show up anymore, but no message that polling is used instead is shown in the logs. If I understand the code correctly, that's just right - a message will only be shown if try_inotify is true and it actually fails.

The polling, though, appears to be pretty useless when checking the code. It checks whether the mtime of "/etc/resolv.conf" changed. That rarely changes for the symlink itself, only for the symlink target, so actual resolv.conf changes are not noticed...

I think your observation concerning inotify is expected, since inotify works on inode numbers, IIUC.
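The mtime observation above is easy to demonstrate: the symlink's own timestamp doesn't move when the target's contents change, while a dereferencing stat sees the update. A scratch-directory sketch (GNU stat; not sssd's code):

```shell
tmp=$(mktemp -d); cd "$tmp"
echo "nameserver 10.0.0.1" > target
ln -s target link

link_t1=$(stat -c %Y link)     # mtime of the symlink itself (no -L)
deref_t1=$(stat -Lc %Y link)   # mtime of the target
sleep 2
echo "nameserver 10.0.0.2" > target

link_t2=$(stat -c %Y link)
deref_t2=$(stat -Lc %Y link)
[ "$link_t1" = "$link_t2" ]    && echo "symlink mtime: unchanged"
[ "$deref_t1" != "$deref_t2" ] && echo "target mtime: changed"
```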

Andrew Conway (acubuntuone) wrote :

I also have observed this problem on bionic for several different computers. The workaround always solves the problem; without the workaround I cannot log in. What the computers had in common:

* Using bionic (either the version from the DVD or with network updates)
* Using networkd rather than NetworkManager with a static ipv4 address
* sssd.conf uses fqdn to refer to the name servers
* A backup ldap and kerberos server were both provided.

OliFre (freyermuth) wrote :

This also mostly applies to us:
* Bionic, installed via network installer.
* We did not configure network explicitly, AFAICT, this means networkd is used. However, we run with DHCP.
* sssd.conf uses FQDNs for the kerberos and ldap servers (I guess that's what you meant?)
* We also provide backup servers for both.

Mark Foster (fostermarkd) wrote :

We are seeing this too, on Bionic servers installed from netboot / PXE.

Bionic is (still) missing from the Affected list.
Status says "Fix Released", but where is that? I don't think it's true.

Andreas Hasenack (ahasenack) wrote :

Sorry, "fix released" is for the current development of Ubuntu which is 18.10. I'll add a task for bionic, and remove the artful one.

Changed in sssd (Ubuntu Artful):
status: Triaged → Won't Fix
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in sssd (Ubuntu Bionic):
status: New → Confirmed
Jani Jaakkola (jj-lousa) wrote :

I hit this bug too with Ubuntu 18.04. On boot sssd fails to find AD servers from DNS and then continues to query 127.0.0.1 port 53, which will never answer. The log looks like this:

(Wed Oct 10 13:04:13 2018) [sssd[be[ad.helsinki.fi]]] [resolv_getsrv_send] (0x0100): Trying to resolve SRV record of '_ldap._tcp.helsinki._sites.ad.helsinki.fi'
(Wed Oct 10 13:04:13 2018) [sssd[be[ad.helsinki.fi]]] [resolv_discover_srv_done] (0x0040): SRV query failed [11]: Could not contact DNS servers

which does not help much, since it doesn't mention the non-existent DNS server (127.0.0.1) in the logs (you need to run strace to see it).

Restarting sssd makes it go back online.

Ron Terwilliger (ront2018) wrote :

I can confirm the same issue with Ubuntu 18.04 (used the DVD). SSSD cannot resolve the backend and then says it's offline. Restarting SSSD after a reboot always resolves the issue.

I've found two workarounds in 18.04:
1) Use your own resolv.conf and not rely on systemd-resolved.service.
"unlink /etc/resolv.conf" and then create your own resolv.conf that doesn't rely on systemd-resolved.

2) Make SSSD start after systemd-resolved.service. I prefer this method because setting up the nameservers is still automatic (in case they change).
cp -p /lib/systemd/system/sssd.service /etc/systemd/system/
sed -i 's/^Wants=.*$/&\nAfter=systemd-resolved.service/' /etc/systemd/system/sssd.service

Hope this helps someone.

Andreas Hasenack (ahasenack) wrote :

A PR which might address this was posted to the sssd mailing list: https://github.com/SSSD/sssd/pull/864

meskaya (meskaya) wrote :

The code in the PR above has now been pushed; can we get a new build of SSSD including it?

Andreas Hasenack (ahasenack) wrote :

I agree. Do you have a way to reproduce this somewhat reliably?

Changed in sssd (Ubuntu Bionic):
importance: Undecided → High
meskaya (meskaya) wrote :

I am trying to seed the SSSD cache. The setup is working fine; I can do a successful offline authentication, but once I am connected to my company network using a VPN, the SSSD cache is never refreshed unless I restart SSSD or replace the /etc/resolv.conf symlink (I am using systemd-resolved) with a regular /etc/resolv.conf file.

tags: added: server-next
meskaya (meskaya) wrote :

Hi all.
@Andreas Hasenack: any progress on this? :)
Anything we can do to help?

Andreas Hasenack (ahasenack) wrote :

I'll get to it. Being able to test the fix would help a lot, since I never got it reproduced locally.

Jens Elkner (jelmd) wrote :

Same here - it is a nightmare:

(Wed Mar 4 16:19:36 2020) [sssd[be[LDAP]]] [resolv_gethostbyname_done] (0x0040): querying hosts database failed [5]: Input/output error
(Wed Mar 4 16:19:36 2020) [sssd[be[LDAP]]] [fo_resolve_service_done] (0x0020): Failed to resolve server 'ldap....': Could not contact DNS servers
(Wed Mar 4 16:19:36 2020) [sssd[be[LDAP]]] [be_resolve_server_process] (0x0080): Couldn't resolve server (ldap....), resolver returned [5]: Input/output error

Andreas Hasenack (ahasenack) wrote :

This is still on my list, I'm just very busy with focal fossa right now, but the feature freeze is past, and now I can concentrate on only fixing bugs.

Changed in sssd (Ubuntu Bionic):
assignee: nobody → Andreas Hasenack (ahasenack)
status: Confirmed → In Progress
Andrew Conway (acubuntuone) wrote :

It has been a constant problem for me in 18.04 but seems to work fine in 20.04 on the couple of computers I have tried it on.

However, I have had reliability issues with it in 20.04 on one computer with multiple ethernet adapters, only one of them plugged in, and bridging set up (for VMs). That may just be my misconfiguration, though. Disabling one adapter made it work fine.

Andreas Hasenack (ahasenack) wrote :

Ok, I have a simple way to reproduce this, and was able to test the upstream patch. I'll polish it up and put it up for review, then prepare the SRU paperwork.

Andreas Hasenack (ahasenack) wrote :

Just tested: eoan is affected; focal is clear.

Changed in sssd (Ubuntu Eoan):
assignee: nobody → Andreas Hasenack (ahasenack)
status: New → In Progress
description: updated
Changed in sssd (Ubuntu Eoan):
importance: Undecided → High
description: updated
Robie Basak (racb) wrote :

> b) change the fallback polling code to keep trying, instead of giving up right away

AFAICT, this will only happen if /etc/resolv.conf is a broken symlink.

What happens if the symlink exists but doesn't point to something valid, sssd goes offline, and then the symlink is changed to something that does work? I get the impression that this case hasn't been fixed. Am I right?

Nevertheless, the fix you have is correct and would be useful to land for the common failure case. Thank you to everyone involved in identifying the problem. So +1 for the SRU review; this is ready to accept. I'm only waiting in case you want to do something different if my edge case above is real. Let me know.

description: updated
Andreas Hasenack (ahasenack) wrote :

> AFAICT, this will only happen if /etc/resolv.conf is a broken symlink.

Correct. I would have expected the default inotify case to handle that, though, since it can see that the contents of resolv.conf (or of the symlink target) changed. In fact it does notice the change, but it doesn't seem to switch back to online status.

I find it hard to test DNS resolution issues in Ubuntu, though; something unexpected always happens, like bug #1765477, which introduces a 5s delay and might be what makes sssd think even the new DNS server isn't working. I tested your case above on bionic.

In any case, the fixes are very localized to the bug reported here, and I believe they fix it. Trying to handle the case you mentioned would add more regression potential, and investigating it further would add more delay. I think we should proceed with the SRU as is.

Thanks!

Hello Marian, or anyone else affected,

Accepted sssd into eoan-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/sssd/2.2.0-4ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-eoan to verification-done-eoan. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-eoan. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in sssd (Ubuntu Eoan):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-eoan
Changed in sssd (Ubuntu Bionic):
status: In Progress → Fix Committed
tags: added: verification-needed-bionic
Robie Basak (racb) wrote :

Hello Marian, or anyone else affected,

Accepted sssd into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/sssd/1.16.1-1ubuntu1.6 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Andreas Hasenack (ahasenack) wrote :

eoan verification

With package from proposed:
 *** 2.2.0-4ubuntu1.1 500
        500 http://archive.ubuntu.com/ubuntu eoan-proposed/main amd64 Packages

a) inotify test
Since I have the proposed package, logs show sssd is monitoring the symlink's target:
(Tue Jun 2 13:37:27 2020) [sssd] [_snotify_create] (0x0400): Added a watch for /etc/resolv.conf.target with inotify flags 0x8D88 internal flags 0x1 using function resolv_conf_inotify_cb after delay 1.0

Domain is offline as expected:
root@eoan-sssd-1723350:~# sssctl domain-status LDAP
Online status: Offline

Active servers:
LDAP: not connected

Discovered LDAP servers:
- ldap01.example.com

After unbreaking the symlink, logs show sssd picked the change up:
(Tue Jun 2 13:39:21 2020) [sssd] [process_dir_event] (0x0400): received notification for watched file [resolv.conf.target] under /etc

And the domain is now online:
root@eoan-sssd-1723350:~# sssctl domain-status LDAP
Online status: Online

Active servers:
LDAP: ldap01.example.com

Discovered LDAP servers:
- ldap01.example.com

b) polling test

After restarting with the proposed sssd package, I see it detected that /etc/resolv.conf is broken and, since it's not using inotify, it starts polling:
(Tue Jun 2 14:13:57 2020) [sssd] [monitor_config_file_fallback] (0x0020): file [/etc/resolv.conf] is missing. Will try again later.
(Tue Jun 2 14:14:07 2020) [sssd] [monitor_config_file_fallback] (0x0020): file [/etc/resolv.conf] is missing. Will try again later.
...
sssctl shows domain is offline, as expected:
root@eoan-sssd-1723350:~# sssctl domain-status LDAP
Online status: Offline

Active servers:
LDAP: not connected

Discovered LDAP servers:
- ldap01.example.com

After unbreaking the symlink and waiting a few seconds, sssd picks the change up:
(Tue Jun 2 14:20:12 2020) [sssd] [monitor_config_file_fallback] (0x0020): file [/etc/resolv.conf] is missing. Will try again later.
(Tue Jun 2 14:20:22 2020) [sssd] [signal_res_init] (0x0040): Reloading Resolv.conf.

And the domain is now online:
root@eoan-sssd-1723350:~# sssctl domain-status LDAP
Online status: Online

Active servers:
LDAP: ldap01.example.com

Discovered LDAP servers:
- ldap01.example.com

eoan verification successful.
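
For reference, breaking and unbreaking the symlink for a test like the one above can be done along these lines (illustrative; the exact commands used in the verification aren't shown, and the dangling target name is made up):

```shell
# Break: point /etc/resolv.conf at a target that doesn't exist.
ln -sfn /run/systemd/resolve/does-not-exist /etc/resolv.conf
# ...check 'sssctl domain-status LDAP' and the sssd logs...
# Unbreak: restore the stock Ubuntu target.
ln -sfn /run/systemd/resolve/stub-resolv.conf /etc/resolv.conf
```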

description: updated
tags: added: verification-done-eoan
removed: verification-needed-eoan
description: updated
Andreas Hasenack (ahasenack) wrote :

bionic verification

With package from proposed:
 *** 1.16.1-1ubuntu1.6 500
        500 http://archive.ubuntu.com/ubuntu bionic-proposed/main amd64 Packages

a) inotify test
Since I have the proposed package, logs show sssd is monitoring the symlink's target:
(Tue Jun 2 14:39:32 2020) [sssd] [_snotify_create] (0x0400): Added a watch for /etc/resolv.conf.target with inotify flags 0x8D88 internal flags 0x1 using function resolv_conf_inotify_cb after delay 1.0

Domain is offline as expected:
root@bionic-sssd-1723350:~# sssctl domain-status LDAP
Online status: Offline

Active servers:
LDAP: not connected

Discovered LDAP servers:
- ldap01.example.com

root@bionic-sssd-1723350:~#

After unbreaking the symlink, logs show sssd picked the change up:
(Tue Jun 2 14:40:17 2020) [sssd] [process_dir_event] (0x0400): received notification for watched file [resolv.conf.target] under /etc

And the domain is now online:
root@bionic-sssd-1723350:~# sssctl domain-status LDAP
Online status: Online

Active servers:
LDAP: ldap01.example.com

Discovered LDAP servers:
- ldap01.example.com

b) polling test

After restarting with the proposed sssd package, I see it detected that /etc/resolv.conf is broken and, since it's not using inotify, it starts polling:
(Tue Jun 2 14:44:59 2020) [sssd] [monitor_config_file_fallback] (0x0020): file [/etc/resolv.conf] is missing. Will try again later.
(Tue Jun 2 14:45:09 2020) [sssd] [monitor_config_file_fallback] (0x0020): file [/etc/resolv.conf] is missing. Will try again later.
...

sssctl shows domain is offline, as expected:
root@bionic-sssd-1723350:~# sssctl domain-status LDAP
Online status: Offline

Active servers:
LDAP: not connected

Discovered LDAP servers:
- ldap01.example.com

After unbreaking the symlink and waiting a few seconds, sssd picks the change up:
(Tue Jun 2 14:45:49 2020) [sssd] [monitor_config_file_fallback] (0x0020): file [/etc/resolv.conf] is missing. Will try again later.
(Tue Jun 2 14:45:59 2020) [sssd] [signal_res_init] (0x0040): Reloading Resolv.conf.

And the domain is now online:
root@bionic-sssd-1723350:~# sssctl domain-status LDAP
Online status: Online

Active servers:
LDAP: ldap01.example.com

Discovered LDAP servers:
- ldap01.example.com

bionic verification successful.

tags: added: verification-done-bionic
removed: verification-needed-bionic