Samba 2:4.3.8+dfsg-0ubuntu0.14.04.2 Reversion: CPU Soft Lock

Bug #1572608 reported by Ed Huyer on 2016-04-20
This bug affects 8 people
Affects: samba (Ubuntu)
Importance: Undecided
Assigned to: Unassigned

Bug Description

Upon upgrading to Samba 4.3.8, I encounter crippling soft lockup problems with smbd. The kernel will start throwing errors (see below) about what appears to be the root smbd process, and the locked process eventually cripples all smb connections to the server. Further, the process is sufficiently locked that SIGKILL has no effect. While the process is locked, the system shows steadily increasing iowait on the CPU.

/var/log/error excerpt:
Apr 20 09:34:17 banner kernel: BUG: soft lockup - CPU#7 stuck for 22s! [smbd:9842]
Apr 20 09:34:45 banner kernel: BUG: soft lockup - CPU#7 stuck for 22s! [smbd:9842]
Apr 20 09:35:13 banner kernel: BUG: soft lockup - CPU#7 stuck for 22s! [smbd:9842]
Apr 20 09:35:41 banner kernel: BUG: soft lockup - CPU#7 stuck for 22s! [smbd:9842]
Apr 20 09:36:09 banner kernel: BUG: soft lockup - CPU#7 stuck for 22s! [smbd:9842]
Apr 20 09:36:37 banner kernel: BUG: soft lockup - CPU#7 stuck for 22s! [smbd:9842]
Apr 20 09:36:51 banner kernel: INFO: rcu_sched self-detected stall on CPU { 7}(t=240025 jiffies g=7144731 c=7144730 q=0)
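
To see which process and CPU the lockup reports point at, lines like the ones above can be tallied with a short script. This is only a minimal sketch: the regular expression is modelled on the message format in this excerpt, and the function name `count_soft_lockups` is made up for illustration. Other kernel versions may word the message differently.

```python
import re
from collections import Counter

# Pattern modelled on the "BUG: soft lockup" lines in the excerpt above
# (an assumption: other kernels may format the message differently).
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def count_soft_lockups(lines):
    """Count soft lockup reports per (process name, PID, CPU)."""
    counts = Counter()
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            key = (match.group("comm"), int(match.group("pid")), int(match.group("cpu")))
            counts[key] += 1
    return counts


# Three lines copied from the excerpt; the rcu_sched line is deliberately not counted.
sample = [
    "Apr 20 09:34:17 banner kernel: BUG: soft lockup - CPU#7 stuck for 22s! [smbd:9842]",
    "Apr 20 09:34:45 banner kernel: BUG: soft lockup - CPU#7 stuck for 22s! [smbd:9842]",
    "Apr 20 09:36:51 banner kernel: INFO: rcu_sched self-detected stall on CPU { 7}"
    "(t=240025 jiffies g=7144731 c=7144730 q=0)",
]

for (comm, pid, cpu), n in count_soft_lockups(sample).most_common():
    print(f"{comm} pid={pid} cpu={cpu}: {n} soft lockup reports")
```

To run it against a real log, pass an open file object (for example the kernel log file) to `count_soft_lockups` instead of `sample`.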

smb.conf global options:
[global]
        workgroup = [redacted]
        server string = Samba Server Version %v
        interfaces = lo eth0
        hosts allow = [redacted]
        socket options = SO_SNDBUF=16384 SO_RCVBUF=16384 TCP_NODELAY
        syslog only = yes
        security = ads
        passdb backend = tdbsam
        realm = [redacted]
        winbind nss info = rfc2307
        idmap config * :backend = tdb
        idmap config * :range = 10000000-11000000
        idmap config [redacted]:schema_mode = rfc2307
        idmap config [redacted]:backend = ad
        idmap config [redacted]:range = 1000-9999999
        idmap config [redacted]:default = yes
        winbind use default domain = true
        winbind offline logon = false
        winbind enum users = no
        winbind enum groups = no
        winbind nested groups = yes
        winbind expand groups = 4
        winbind separator = /
        winbind refresh tickets = yes
        kerberos method = secrets and keytab
        template homedir = /home/%U
        template shell = /bin/bash
        vfs objects = acl_xattr
        map acl inherit = yes
        store dos attributes = yes
        server signing = auto
        winbind sealed pipes = false
        require strong key = false
        winbind sealed pipes:[redacted] = true
        require strong key:[redacted] = true
        preferred master = no
        wins support = no
        wins proxy = no
        dns proxy = no
        load printers = no
        cups options = raw

Release Info:
Description: Ubuntu 14.04.4 LTS
Release: 14.04

I will also note that the Samba shares are living on Ceph RBD volumes formatted as XFS. It seems unlikely, but it's conceivable that Samba 4.3.8 introduced some new conflict with the kernel RBD module.

Ed Huyer (arcanum3000) wrote :

Sorry, I neglected to mention: Reverting to Samba 4.1.6 (along with all the associated libraries) resolves the problem.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in samba (Ubuntu):
status: New → Confirmed

Hi,

We have problems with this new version too. After a while the system stops working (it runs out of memory), and I can't get the user list from the domain, as in this bug report:
http://www.spinics.net/lists/samba/msg133470.html

Note 1: wbinfo -u does not work, but wbinfo -g works perfectly.
Note 2: in my case, rejoining the domain does not help; only downgrading to the previous version (2:4.1.6+dfsg-1ubuntu2) makes it work again.

In log-wb.DOMAIN I see:

====================
[2016/04/21 21:02:50.111459, 1] ../auth/gensec/spnego.c:664(gensec_spnego_create_negTokenInit)
  Failed to setup SPNEGO negTokenInit request: NT_STATUS_INTERNAL_ERROR
[2016/04/21 21:02:50.697369, 1] ../source3/libads/ldap_utils.c:91(ads_do_search_retry_internal)
  Reducing LDAP page size from 1000 to 500 due to IO_TIMEOUT
[2016/04/21 21:02:51.231447, 1] ../source3/libads/ldap_utils.c:91(ads_do_search_retry_internal)
  Reducing LDAP page size from 500 to 250 due to IO_TIMEOUT
[2016/04/21 21:02:51.632155, 1] ../source3/libads/ldap_utils.c:135(ads_do_search_retry_internal)
  ads reopen failed after error Time limit exceeded
[2016/04/21 21:02:51.632204, 1] ../source3/winbindd/winbindd_ads.c:320(query_user_list)
  query_user_list ads_search: Time limit exceeded
=====================

Thanks. I hope this information helps in finding a solution.

Marc Deslauriers (mdeslaur) wrote :

Today's Samba update may contain the fix for this issue:

http://www.ubuntu.com/usn/usn-2950-2/

Could the original bug reporter please test the update and comment here? Thanks!

Ed Huyer (arcanum3000) wrote :

Thanks, I'll give it a try and report back when I am able. Probably in a few weeks. The server I encountered the problem on is heavily used currently, even during what are normally "off hours". Usage will drop off dramatically later this month.

MoD (lluna-nova) wrote :

We are having the same issue today, with Samba 4.3.9 on Ubuntu 14.04. We started having this issue on Wednesday night.

Linux helisrv 3.16.0-70-generic #90~14.04.1-Ubuntu SMP Wed Apr 6 22:56:34 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

(If I can add more information, let me know; I'm not very used to Launchpad yet.)

MoD (lluna-nova) wrote :

I suspect the problem was due to low RAM. I believe smbd was just the first to suffer, not the culprit.

Hi,

I don't think so, because this is the same VM (CPU, RAM, etc.) that has worked very well for the last 3 years and only started having this problem after the upgrade!

MoD (lluna-nova) wrote :

Yes, this only happened after some upgrade last week. But I suspect the upgrade might have introduced a memory leak that eats up the RAM under some conditions. I still have to pinpoint what is happening, but that is my guess at the moment. Let me know if you find otherwise! I am very concerned by this.

Ed Huyer (arcanum3000) wrote :

I'm pretty sure it's not a RAM thing. The system I encountered the CPU soft lockup on has 192GB of RAM, and Zabbix shows it as mostly being free when the problem occurred.

As an aside, Victor, are you sure this is the same problem you're having? Looking at your error log, it appears to be something else? In the release notes for the Badlock fixes there were a few new options in smb.conf that can affect authentication.

Gavin Chappell (g-a-c) wrote :

I'm seeing similar behaviour too, on an Ubuntu 14.04.4 box, completely up to date. I just had it freeze, which marks the third or fourth time I've either had to Magic-SysRq it, or pull the power, since the Badlock date.

The box has "plenty" of RAM (it was recently upgraded from 8GB to 16GB in fact, some time in March) but I don't believe this issue ever happened with 8GB and Samba 4.1.6. It acts as a file server for a small office of 12 people, which I think would count as lightly loaded in terms of Samba installations?

All I have so far is a snippet from /var/log/kernel.log of the call trace which I've attached. If there is more information that would be useful next time this happens, let me know what you need and I'll get it. Unfortunately there is (currently) no load monitoring on this box, so I don't know whether we're getting a similar increase in iowait time before the crash - although the box is logging soft lockup errors, from a user perspective the box is working perfectly fine until it freezes altogether and stops responding to pings or console activity.
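
For a box without load monitoring, iowait can be logged from /proc/stat with a few lines of Python. This is a minimal sketch, not a replacement for a real monitoring tool. It assumes the usual Linux field order on the aggregate `cpu` line (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice), and the function names and the 10-second interval are chosen here only for illustration.

```python
import time


def parse_cpu_times(stat_text):
    """Return the counters of the aggregate 'cpu' line of /proc/stat as ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two snapshots."""
    deltas = [b - a for a, b in zip(before, after)]
    # Sum only the first eight fields: guest time is already counted in user/nice.
    total = sum(deltas[:8])
    # Field index 4 is iowait (user=0, nice=1, system=2, idle=3, iowait=4).
    return 100.0 * deltas[4] / total if total else 0.0


def log_iowait(interval=10.0):
    """Print a timestamped iowait percentage every `interval` seconds (Linux only)."""
    with open("/proc/stat") as fh:
        previous = parse_cpu_times(fh.read())
    while True:
        time.sleep(interval)
        with open("/proc/stat") as fh:
            current = parse_cpu_times(fh.read())
        print(time.strftime("%Y-%m-%d %H:%M:%S"),
              f"iowait {iowait_percent(previous, current):.1f}%")
        previous = current
```

Calling `log_iowait()` and redirecting its output to a file would leave a record to compare against the timestamps of the soft lockup messages after the next freeze.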

MoD (lluna-nova) wrote :

OK, the RAM theory is ruled out. It just happened again a few minutes ago, and the RAM logging I had set up didn't show a problem there.

I have no idea how to replicate the issue or what is causing it. So far I've only noticed the soft lockup messages linked to smbd from dmesg, syslog or kern.log, which start a few hours before the halt.

Hi

In my case, both servers are joined to a Windows domain. Maybe that is the difference?

MoD (lluna-nova) wrote :

I haven't had any issues in the last two weeks. Looks like the bug is resolved?

Chris Lynch (chrislynch8) wrote :

Hi,

I'm having this issue, but it's happening for me on Samba 2:4.3.9+dfsg-0ubuntu0.14.04.3. It's the same CPU soft lockup. It's been happening for a couple of weeks now. The only way to get the server back is to power it off and back on.

Do we know what is causing the issue?

Robie Basak (racb) on 2016-06-06
tags: added: regression-update

Ed Huyer (arcanum3000) wrote :

I was finally able to try 4.3.9. It *seems* to have cleared up the problem, but the system is lightly loaded right now. If the problem was somehow load-related, only time will tell if it is really fixed. Still no idea what actually caused it.

Stefan Metzmacher (metze) wrote :

The soft lockup triggered by Samba was a regression in the kernel.
See https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1543980
So it's fixed because you installed a fixed kernel.

Per the previous comment, marking this bug as a duplicate of bug 1543980.
