winbind returns PAM_AUTHINFO_UNAVAIL on first login after reboot

Bug #1764853 reported by msaxl
This bug affects 1 person
Affects         Status      Importance   Assigned to   Milestone
samba (Ubuntu)  Won't Fix   Undecided    Unassigned

Bug Description

The following issue exists only on Ubuntu 18.04.

I've upgraded Ubuntu from 17.10 and noticed that winbind does not work well.
About 90% of the time after I reboot my system, I get PAM_AUTHINFO_UNAVAIL when trying to log in with a domain account.
Clicking login again on the login screen succeeds most of the time (so the password is correct).

I checked whether it works if I wait 10 minutes before logging in: no success, so it is not a timing issue.
I also checked that winbind is working (logged in over ssh using a local account):
getent passwd xy and wbinfo -K user%pwd both always work.

My current workaround is putting
winbind request timeout = 3
in smb.conf, since PAM_AUTHINFO_UNAVAIL is returned about 60 seconds after trying to log in. This workaround solves nothing; it only makes the failure come faster. (Now the login mostly fails twice, but waiting 6 seconds is better than waiting 60.)

To me it looks like a deadlock, but I was unable to trace it since it happens only on the first login after a reboot. To trigger it again I have to reboot (restarting winbind does not trigger it a second time, and neither does removing all caches in /run/samba).

Andreas Hasenack (ahasenack) wrote :

Thanks for filing this bug in Ubuntu.

Is this perhaps a desktop system, where the network is only available after you log in, because of NetworkManager? Or is it a server?

Is it a fresh install of Ubuntu bionic 18.04, or did you upgrade from a previous release? This matters because 18.04 uses netplan for networking by default on a fresh install.

Changed in samba (Ubuntu):
status: New → Incomplete

msaxl (saxl) wrote :

1) Yes, it is a desktop system, but not a wireless one, so the network is available (NetworkManager).
I checked that by ssh-ing into this machine with a local account. Both wbinfo -p and wbinfo -P showed everything is online. But even in this case the first domain login fails.

2) It is an upgraded installation (without netplan).

3) Quick question: is there documentation on how to manually migrate a desktop system to netplan?

Andreas Hasenack (ahasenack) wrote :

Here are many examples of netplan configurations to get you started: https://netplan.io/examples

Andreas Hasenack (ahasenack) wrote :

Can you share your /etc/samba/smb.conf, /etc/network/interfaces and /etc/network/interfaces.d/* please? And /var/log/samba/log* files

Andreas Hasenack (ahasenack) wrote :

And just to be sure, please check if /etc/netplan/ is empty :)

msaxl (saxl) wrote :

/etc/netplan/ contains 01-network-manager-all.yaml; if I remove it I get no network connection.
This system seems to be already migrated to netplan.

/etc/network/interfaces.d/ is empty, /etc/network/interfaces contains only the default lo interface.

smb.conf:
[global]
        workgroup = JDW
        realm = JDW.CONET
        security = ads
        idmap config * : backend = tdb
        idmap config * : range = 1000000-1999999
        idmap config JDW : backend = rid
        idmap config JDW : range = 1266900000-2000000000
        template homedir = /home/%D/%U
        template shell = /bin/bash
        winbind use default domain = Yes
        winbind refresh tickets = Yes
        winbind offline logon = Yes
        winbind request timeout = 3
        kerberos method = secrets and keytab
# winbind rpc only = yes
        client signing = yes
        client use spnego = yes
        store dos attributes = yes
        ea support = yes

Andreas Hasenack (ahasenack) wrote :

Ok, what are the contents of the netplan file?

On Fri, Apr 20, 2018, 04:31 msaxl <email address hidden> wrote:

> ** Attachment added: "logfiles"
>
> https://bugs.launchpad.net/ubuntu/+source/samba/+bug/1764853/+attachment/5123369/+files/sambalog.tar.xz

msaxl (saxl) wrote :

The content is:

# Let NetworkManager manage all devices on this system
network:
  version: 2
  renderer: NetworkManager

Andreas Hasenack (ahasenack) wrote :

Please attach these logs from when the login failure happens:
- /var/log/auth.log
- /var/log/syslog
- /var/log/samba/log*

I configured a VM with your smb.conf, joined a Windows 2016 AD server via net ads join -k, and AD users can log in just fine immediately after a reboot.

I also used your netplan file, and I have no /etc/network/interfaces or interfaces.d/* content. I didn't upgrade from xenial, though, this was a fresh bionic install.

Since this is a VM I provisioned with uvt-kvm, there are a few differences from a normal desktop install:
- I did "apt install ubuntu-desktop" after provisioning the vm
- I removed the cloud-init package after provisioning
- my dns and dhcp server is not the AD server, although I used its DNS server temporarily when doing "net ads join" so that the server could be found

I could try a fresh bionic *desktop* install, since it might configure networking a bit differently, or even install a xenial desktop and then upgrade, but let's start with the logs I requested above.

Andreas Hasenack (ahasenack) wrote :

In fact, I was even able to log in after I shut down the Windows server, because of the "winbind offline logon = Yes" setting. I got a notice saying that the controller was offline, but logged in to the desktop without further problems, launched applications, etc.

msaxl (saxl) wrote :

Requested logs.
The failed first authentication is at Apr 21 11:05:28; immediately afterwards the second attempt succeeds.

Before I logged in with the domain account I checked that networking of the machine worked:
wbinfo -P and wbinfo -p both showed online, wbinfo -u displayed every user.
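
The pre-login checks described above can be scripted so they are easy to repeat right after each reboot. The following is only a sketch: the helper name run_checks and the 10-second timeout are assumptions, and the only commands used are the wbinfo calls already named in this comment.

```python
import subprocess
import time

# Commands taken from the comment above: ping winbindd, ping the DC, list domain users.
CHECKS = [["wbinfo", "-p"], ["wbinfo", "-P"], ["wbinfo", "-u"]]


def run_checks(checks=CHECKS, runner=subprocess.run, timeout=10):
    """Run each check and return (command, succeeded, seconds) tuples.

    `runner` is injectable so the function can be exercised without winbind;
    the 10 s timeout is an arbitrary assumption, not a value from the thread.
    """
    results = []
    for cmd in checks:
        start = time.monotonic()
        try:
            proc = runner(cmd, capture_output=True, text=True, timeout=timeout)
            ok = proc.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            # Binary missing or the command hung: count it as a failed check.
            ok = False
        results.append((" ".join(cmd), ok, time.monotonic() - start))
    return results
```

Printing the tuples returned by run_checks() from an ssh session with a local account, before the first domain login, shows whether winbind itself is reachable at that point and how long each call took.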

The DC is an Ubuntu 16.04 Samba Active Directory DC. In a similar setup where I have the same problem I use an Ubuntu 18.04 Samba DC, but let's stay with this machine since I can reproduce the problem very reliably and the machine reboots quickly.

/etc/nsswitch.conf has the following setup:
passwd: compat winbind systemd
group: compat winbind systemd
shadow: compat
gshadow: files

hosts: files resolve dns mdns_minimal
networks: files

protocols: db files
services: db files
ethers: db files
rpc: db files

netgroup: nis

msaxl (saxl) wrote :

Some additions:

I discovered that if I symlink /etc/resolv.conf to /etc/resolvconf/resolv.conf instead of /lib/systemd/resolv.conf
and add
dns=dnsmasq
rc-manager=resolvconf

to /etc/NetworkManager/NetworkManager.conf,

the problem is gone.

Additionally I re-added the 127.0.1.1 entry in /etc/hosts (it should not be required with systemd-resolved).

This entry is the source of the problem: if it is missing, getaddrinfo in source3/lib/util.c should get the domain name from systemd-resolved (hostname -f does, and so does getent hosts <hostname>), but on the first call after reboot it returns only the hostname, not the FQDN. Very strange. I will check whether I can find something in systemd-resolved; maybe there is a regression.
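
To watch the hostname-versus-FQDN behaviour outside of winbind, the canonical name that getaddrinfo reports can be queried directly. This is a minimal sketch, not what Samba does internally: the helper names are made up for illustration, and the "contains a dot" test is only a rough heuristic for "fully qualified".

```python
import socket


def canonical_name(host, resolver=socket.getaddrinfo):
    """Return the canonical name getaddrinfo reports for host, or None.

    `resolver` is injectable so the function can be tried without touching NSS.
    """
    try:
        infos = resolver(host, None, flags=socket.AI_CANONNAME)
    except socket.gaierror:
        return None
    for _family, _type, _proto, canonname, _sockaddr in infos:
        if canonname:  # only the first entry normally carries the canonical name
            return canonname
    return None


def looks_fully_qualified(name):
    """Heuristic (assumption): a name with an inner dot is treated as an FQDN."""
    return name is not None and "." in name.strip(".")
```

Calling canonical_name(socket.gethostname()) once right after boot and again later, and comparing the two results with looks_fully_qualified, would show whether the first lookup really returns only the short hostname.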

Andreas Hasenack (ahasenack) wrote :

Does your normal resolv.conf contain 127.0.0.53, or your actual DNS IP?

Andreas Hasenack (ahasenack) wrote :

I've seen a 5s delay in DNS resolution upon first boot in bionic. I don't have the bug at hand now, but I filed it against systemd two days ago or so.

msaxl (saxl) wrote :

Some test results:

resolv.conf DNS server* | nsswitch hosts setting | /etc/hosts has 127.0.1.1 entry | result
------------------------+------------------------+--------------------------------+-------
127.0.0.53              | files resolve dns      | no                             | fails
127.0.1.1               | files resolve dns      | no                             | fails
127.0.0.53              | files dns              | no                             | works
127.0.1.1               | files dns              | no                             | works
127.0.0.53              | files resolve dns      | yes                            | works
127.0.1.1               | files resolve dns      | yes                            | works

* if 127.0.0.53, the symlink to /lib/systemd/resolv.conf is in use

Conclusion: the problem is in nss_resolve
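
The pattern in these results (failure only when "resolve" is among the hosts sources and /etc/hosts has no entry for the machine) can be written down as a small check. This is a sketch under that reading of the test results only; the function names and the sample hostname "myhost" are hypothetical.

```python
def hosts_sources(nsswitch_text):
    """Return the sources listed on the 'hosts:' line of nsswitch.conf."""
    for line in nsswitch_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line.startswith("hosts:"):
            tokens = line[len("hosts:"):].split()
            # Drop action items such as [NOTFOUND=return].
            return [tok for tok in tokens if not tok.startswith("[")]
    return []


def hosts_file_names(hosts_text):
    """Return every name (not address) that appears in an /etc/hosts text."""
    names = set()
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()
        names.update(fields[1:])
    return names


def matches_failing_rows(nsswitch_text, hosts_text, hostname):
    """True when the configuration looks like the failing test cases:
    'resolve' is a hosts source and /etc/hosts does not name this machine."""
    short = hostname.split(".", 1)[0]
    named = any(n == hostname or n.split(".", 1)[0] == short
                for n in hosts_file_names(hosts_text))
    return "resolve" in hosts_sources(nsswitch_text) and not named
```

Feeding it the contents of /etc/nsswitch.conf and /etc/hosts plus the local hostname gives a quick indication of whether a machine is in one of the failing combinations; it does not test winbind itself.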

Since nss_resolve should use dbus, I checked with dbus-monitor --system what is sent.
If you are able to reproduce this problem: to me it seems that the request is sent only after the timeout has already happened. Also, while the login attempt is running, systemd-resolve does not work. Do you know of a situation in which dbus-daemon blocks? If this proves true, what could cause it?

msaxl (saxl) wrote :

I guess I found the problem.

At some point winbindd changes its uid to the target uid to create the user's Kerberos cache.
If kerberos method includes the system keytab (it does in my configuration), fill_mem_keytab_from_system_keytab in gse_krb5.c calls name_to_fqdn. This function uses getaddrinfo to get the machine's FQDN. This in turn connects to the system dbus (not as uid 0!). The system dbus has not cached this uid's "credentials" (there seems to be a hash table, see dbus-userdb.c line 148), so it uses the nsswitch configuration to look them up. The system dbus now connects to winbind. But winbind seems to be blocking in this case (and the system dbus is now blocked too).
As soon as pam_winbind times out, the deadlock is broken, the needed information is returned to the system dbus, the info is put into the hash table, and dbus is no longer blocked.

The second time, the info is in dbus's hash table, so the deadlock does not happen (this also explains why I get the system's FQDN the second time but not the first).
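
The sequence described above can be condensed into a toy model. It is purely illustrative and single-threaded: the class and function names are invented, no real dbus or winbind code is involved, and the "timeout" is simply returned instead of waited for.

```python
class SystemBus:
    """Toy stand-in for the system dbus-daemon and its per-uid hash table."""

    def __init__(self):
        # Assumption from the analysis above: connections as uid 0 are unproblematic.
        self.known_uids = {0}

    def connect(self, uid, winbind_busy):
        if uid in self.known_uids:
            return "ok"
        if winbind_busy:
            # The bus asks NSS (and therefore winbind) about the uid, while winbind
            # is itself waiting for this connection: nothing moves until
            # pam_winbind times out. Afterwards winbind answers and the bus
            # remembers the uid.
            self.known_uids.add(uid)
            return "timeout"
        self.known_uids.add(uid)
        return "ok"


def login(bus, uid):
    """Model one domain login: winbind calls getaddrinfo as `uid` while busy."""
    outcome = bus.connect(uid, winbind_busy=True)
    return "PAM_SUCCESS" if outcome == "ok" else "PAM_AUTHINFO_UNAVAIL"
```

With a fresh SystemBus, the first login(bus, uid) for a domain uid yields PAM_AUTHINFO_UNAVAIL and the second yields PAM_SUCCESS, which mirrors the "fails once after reboot, then works" behaviour reported in this bug.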

Keep in mind that this means calling getaddrinfo in winbind is only safe as uid 0. I suggest the following (maybe better discussed upstream):

insert an if(getuid()==0){ .. } around lines 597 and 602 in gse_krb5.c (https://git.samba.org/?p=samba.git;a=blob;f=source3/librpc/crypto/gse_krb5.c;h=4dd39eaf08d8f492b6b332cfb5b2f30e4c1ab575;hb=4dd39eaf08d8f492b6b332cfb5b2f30e4c1ab575#l597)
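
The shape of the suggested guard, expressed in Python rather than C, looks roughly as follows. This is not the Samba patch itself: fill_keytab and load_system_keytab are hypothetical stand-ins for the code around the referenced lines.

```python
import os


def fill_keytab(load_system_keytab, getuid=os.getuid):
    """Load the system keytab only when running as uid 0.

    `load_system_keytab` is a hypothetical callable standing in for the step
    that ends up calling name_to_fqdn / getaddrinfo; as any other uid the step
    is skipped, so no NSS lookup over dbus happens from inside winbind.
    """
    if getuid() != 0:
        return False  # skipped
    load_system_keytab()
    return True
```

The point of the guard is only that the lookup never runs under a non-root uid; whether skipping the step is acceptable in every configuration is exactly the question to raise upstream.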

msaxl (saxl) wrote :

I've tested whether my suggested workaround works; see ppa:saxl/ppa.

It works :)

Summary: A default 18.04 installation should not be affected, since /etc/hosts contains an entry with the local hostname. If Ubuntu removes this line by default, the default installation will break (afaik systemd-resolved is meant to make /etc/hosts entries unnecessary, since it also resolves localhost).

Technical problem: the winbindd process must not use dbus with uid != 0.

My workaround makes sure this will not happen (in this case, kerberos method = system keytab will still deadlock). The impact of this patch should be zero, since on a correctly configured system only uid 0 is able to use /etc/krb5.keytab, so the workaround merely skips a step that would try to load the system keytab and fail in doing so.

Andreas Hasenack (ahasenack) wrote :

How did you get "resolve" in the hosts line in your /etc/nsswitch.conf? Was that a default in < bionic perhaps? The libnss-resolve package does exist in bionic, but I don't have it installed in my test bionic-desktop machine. There, the hosts line from /etc/nsswitch.conf is this:

hosts: files mdns4_minimal [NOTFOUND=return] dns myhostname

myhostname comes from libnss-myhostname, a systemd deb

msaxl (saxl) wrote :

Yes, I think a release between 16.04 and 18.04 added this (I don't remember which one).

If someone installs libnss-resolve, it modifies nsswitch.conf automatically.

I think we can close this ticket since it does not apply to a default configuration.
Also, I think /etc/hosts is not empty by default but still contains localhost and the hostname.
Just keep in mind that myhostname would also allow removing localhost and the hostname from /etc/hosts. I expect Ubuntu to do that at some point in the future.
In that case installing libnss-resolve could cause problems that are not easy to track down.

Andreas Hasenack (ahasenack) wrote :

Can you remove libnss-resolve without further incident? It looks like only an openvpn package depends on it in bionic:

root@nsnx:~# apt-cache rdepends libnss-resolve
libnss-resolve
Reverse Depends:
  openvpn-systemd-resolved

I also removed libnss-myhostname, which was installed because of a Recommends from gnome-control-center. I'm not a big fan of these magic resolvers if one has a proper DNS setup.

msaxl (saxl) wrote :

Yes, it seems apt remove libnss-resolve would remove only that single package.

Well, I'm not the one who decides what gets recommended, but systemd also has nss-mymachines, which uses dbus as well. That could also be recommended some day, e.g. by systemd-nspawn :)

Again: I now consider this bug to be in state wontfix. It is, however, important to know that NSS dbus backends and winbind don't work well together. Let's hope that anyone who suffers from a similar issue finds this and knows how to resolve the problem.

Andreas Hasenack (ahasenack) wrote :

Thanks for your detailed analysis. It was helpful to learn about all the things that can nowadays alter one's /etc/nsswitch.conf and change the behavior in surprising ways.

There is no "wontfix" state in launchpad, so I will mark this as "invalid" since it was caused by a local configuration issue. Still, as you said, there is useful information in this bug, and we can always reopen it if other scenarios where this could happen show up.

Changed in samba (Ubuntu):
status: Incomplete → Won't Fix

Andreas Hasenack (ahasenack) wrote :

And just after I clicked "post", I found "wontfix" :)
