pam_motd needs a module option to disable in-line dynamic updates

Reported by Jan K. on 2011-07-04
This bug affects 6 people
Affects            Importance   Assigned to
Landscape Client   Undecided    Unassigned
pam (Ubuntu)       Medium       Unassigned

Bug Description

1) lsb_release -rd
Description: Ubuntu 10.04.2 LTS
Release: 10.04

2) Installed: 1.1.1-2ubuntu5.3
  Candidate: 1.1.1-2ubuntu5.3
  Version table:
 *** 1.1.1-2ubuntu5.3 0
        500 http://de.archive.ubuntu.com/ubuntu/ lucid-updates/main Packages
        500 http://security.ubuntu.com/ubuntu/ lucid-security/main Packages
        100 /var/lib/dpkg/status
     1.1.1-2ubuntu2 0
        500 http://de.archive.ubuntu.com/ubuntu/ lucid/main Packages

3) Logging in via ssh on systems with high load or heavy I/O wait should still be possible.

4) On servers with high load or heavy I/O wait, the login times out because pam_motd does I/O-intensive work during login. This hurts even more when using Nagios check_by_ssh. There should be a way to use a cron job again (as update-motd did). Logging into a system is more important than the motd.

Steve Langasek (vorlon) wrote :

Thank you for reporting this issue and helping to improve Ubuntu.

I understand your concern about the performance impact; however, we are not going to change the default behavior of pam_motd in Ubuntu. There is consensus that the dynamic motd should be enabled by default, and the current behavior is the best way to implement this: the cronjob you refer to was abandoned because it was very wasteful in the common case.

You are right that the ability to log in is more important than presenting a motd. If this behavior is a problem for you, there are several ways that you can disable it:

 - comment out the 'pam_motd' line in /etc/pam.d/sshd if you don't want to display a motd.
 - delete the contents of the /etc/update-motd.d directory.
 - chmod -x the scripts in /etc/update-motd.d that you don't want to run.

Given this existing array of options, I don't think there's anything further that we can do here short of changing the default behavior, which we won't do; so marking this bug "wontfix".

Changed in pam (Ubuntu):
status: New → Won't Fix
Jan K. (jan-launchpad-kantert) wrote :

Hi,

I understand that you don't want to change the default. However, I think the motd is really useful because it points you to updates etc., so disabling it is rather a hack.

We would like to have an option to run the cron jobs again. This should clearly be optional. The current behaviour is problematic if you run busy servers (especially with heavy I/O, such as MapReduce jobs). It is a real problem to use Nagios check_by_ssh on some boxes, and logging in manually takes minutes.

Steve Langasek (vorlon) wrote :

> We would like to have an option to run the cronjobs again.

Well, you could install a cronjob by hand, which is what you would need to do anyway (we're not going to ship a disabled cron job in the package). The harder part is making pam_motd not call run-parts itself in-line. You could achieve that by moving /etc/update-motd.d to a different location where pam_motd won't look for it but then you run the risk that any new package installs that add a new update-motd.d hook will reintroduce the problem (and, furthermore, clobber the work of your cronjob).

So I can see a case here after all for a 'noupdate' module option that disables the dynamic updating.

BTW, does check_by_ssh open a tty? I believe there's another outstanding bug about fixing pam_motd not to do the update when the motd won't actually be displayed.
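
As a rough illustration of the hand-installed cron job discussed above, the following Python sketch runs the executable files of a hook directory in name order and writes their combined output to a motd file in one atomic step. It is not what pam_motd or run-parts does internally; the function name `regenerate_motd` and the paths in the usage note are assumptions made for illustration.

```python
import os
import subprocess
import tempfile


def regenerate_motd(hook_dir, motd_path):
    """Run every executable file in hook_dir (sorted by name) and
    write the concatenated stdout to motd_path.

    Hypothetical helper for a cron job; not part of pam_motd.
    """
    chunks = []
    for name in sorted(os.listdir(hook_dir)):
        path = os.path.join(hook_dir, name)
        # Skip directories and scripts disabled with chmod -x.
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            continue
        result = subprocess.run([path], stdout=subprocess.PIPE,
                                stderr=subprocess.DEVNULL)
        chunks.append(result.stdout)

    # Write to a temporary file in the same directory, then rename it
    # into place so a login never reads a half-written motd.
    target_dir = os.path.dirname(os.path.abspath(motd_path))
    fd, tmp_path = tempfile.mkstemp(dir=target_dir)
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(b"".join(chunks))
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, motd_path)
```

A crontab entry could call something like `regenerate_motd("/etc/update-motd.d", "/run/motd")` every few minutes (both paths are assumptions here). This only helps if pam_motd is also kept from running the hooks in-line, which is the harder part described above.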

Steve Langasek (vorlon) on 2011-07-08
Changed in pam (Ubuntu):
status: Won't Fix → Triaged
summary: - pam_motd performance problems
+ pam_motd needs a module option to disable in-line dynamic updates
Changed in pam (Ubuntu):
importance: Undecided → Medium
tags: added: bitesize
Jan K. (jan-launchpad-kantert) wrote :

check_by_ssh does not open a tty, so it should not calculate the motd. I did not think about that. Maybe it could solve the Nagios problem, too.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package pam - 1.1.3-5ubuntu1

---------------
pam (1.1.3-5ubuntu1) precise; urgency=low

  * Merge from Debian unstable. Remaining changes:
    - debian/libpam-modules.postinst: Add PATH to /etc/environment if it's
      not present there or in /etc/security/pam_env.conf. (should send to
      Debian).
    - debian/libpam0g.postinst: only ask questions during update-manager when
      there are non-default services running.
    - Change Vcs-Bzr to point at the Ubuntu branch.
    - debian/patches-applied/series: Ubuntu patches are as below ...
    - debian/patches-applied/ubuntu-rlimit_nice_correction: Explicitly
      initialise RLIMIT_NICE rather than relying on the kernel limits.
    - debian/patches-applied/pam_motd-legal-notice: display the contents of
      /etc/legal once, then set a flag in the user's homedir to prevent
      showing it again.
    - debian/update-motd.5, debian/libpam-modules.manpages: add a manpage
      for update-motd, with some best practices and notes of explanation.
    - debian/patches/update-motd-manpage-ref: add a reference in pam_motd(8)
      to update-motd(5)
    - debian/libpam0g.postinst: drop kdm from the list of services to
      restart.
    - debian/libpam0g.postinst: check if gdm is actually running before
      trying to reload it.
    - debian/local/common-session{,-noninteractive}: Enable pam_umask by
      default, now that the umask setting is gone from /etc/profile.
    - debian/local/pam-auth-update: Add the new md5sums for pam_umask addition.
    - add debian/patches-applied/pam_umask_usergroups_from_login.defs.patch:
      Deprecate pam_unix' explicit "usergroups" option and instead read it
      from /etc/login.def's "USERGROUP_ENAB" option if umask is only defined
      there. This restores compatibility with the pre-PAM behaviour of login.
      (Closes: #583958)
  * Dropped changes, included in Debian:
    - debian/patches-applied/CVE-2011-3148.patch
    - debian/patches-applied/CVE-2011-3149.patch
    - debian/patches-applied/update-motd: updated to use clean environment
      and absolute paths in modules/pam_motd/pam_motd.c.
  * debian/libpam0g.postinst: the init script for 'samba' is now named 'smbd'
    in Ubuntu, so fix the restart handling.
  * debian/patches-applied/update-motd: set a sane umask before calling
    run-parts, and restore the old mask afterwards, so /run/motd gets
    consistent permissions. LP: #871943.
  * debian/patches-applied/update-motd: new module option for pam_motd,
    'noupdate', which suppresses the call to run-parts /etc/update-motd.d.
    LP: #805423.

pam (1.1.3-5) unstable; urgency=low

  [ Kees Cook ]
  * debian/patches-applied/pam_unix_dont_trust_chkpwd_caller.patch: use
    setresgid() to wipe out saved-gid just in case.
  * debian/patches-applied/008_modules_pam_limits_chroot:
    - fix off-by-one when parsing configuration file.
    - when using chroot, chdir() to root to lose links to old tree.
  * debian/patches-applied/022_pam_unix_group_time_miscfixes,
    debian/patches-applied/026_pam_unix_passwd_unknown_user,
    debian/patches-applied/054_pam_security_abstract_securetty_handling:
    improve descriptions.
  *...


Changed in pam (Ubuntu):
status: Triaged → Fix Released
Till Klampaeckel (till-php) wrote :

I'm on 10.04.4 (latest kernel, everything) and I just spent an entire work-day debugging pam_motd behavior.

For some reason, one of the scripts fails (defuncts) when I try to log into a server. Add to that, this server is on EC2 so there is no way to use the terminal either.

Anyhow, for a summary, I've posted everything here: http://askubuntu.com/a/162373/11244

The solution was to disable pam_motd in these files:

 /etc/pam.d/sshd
 /etc/pam.d/login

The lack of debugging facilities here is one of the reasons why this should be removed, period. I don't really care if some people don't get a pretty MOTD then.

The larger issue here is the potential blocking of the login process, which makes it "severe". There seems to be no way to figure out what exactly is wrong because you are literally locked out of the instance, which is IMHO unacceptable behavior for an LTS.

There should be at least a timeout which will eventually make the scripts fail if they cannot complete.
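
To make the timeout idea concrete, here is a minimal Python sketch of running a single hook with a hard time limit. It is an illustration only: the helper name `run_hook_with_timeout` and the choice of returning empty output on timeout are assumptions, and nothing in this report suggests pam_motd works this way.

```python
import subprocess


def run_hook_with_timeout(argv, timeout_seconds):
    """Run one hook command and return its stdout as bytes, or b"" if
    it does not finish within timeout_seconds.

    Hypothetical helper, not part of pam_motd.
    """
    try:
        result = subprocess.run(argv, stdout=subprocess.PIPE,
                                stderr=subprocess.DEVNULL,
                                timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before raising, so a
        # stalled hook does not keep the caller waiting.
        return b""
    return result.stdout
```

Note that only the direct child is killed; a hook that spawns its own long-lived children would need process-group handling on top of this.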

Till Klampaeckel (till-php) wrote :

I want to add that it seems like the following removes the files as well:

apt-get purge landscape-client

I did this to avoid having to maintain the pam configuration. Maybe someone can escalate it there. I still fail to understand how you can add something to the login process that might block the user from logging in.

Andreas Hasenack (ahasenack) wrote :

landscape-common installs a script in /etc/update-motd.d to display a banner with some basic system information.

It will not run the main landscape-sysinfo binary if the load is higher than the number of cores, see /usr/share/landscape/landscape-sysinfo.wrapper:

#!/bin/sh
cores=$(grep -c ^processor /proc/cpuinfo 2>/dev/null)
[ "$cores" -eq "0" ] && cores=1
threshold="${cores:-1}.0"
if [ $(echo "`cut -f1 -d ' ' /proc/loadavg` < $threshold" | bc) -eq 1 ]; then
    echo
    echo -n " System information as of "
    /bin/date
    echo
    /usr/bin/landscape-sysinfo
else
    echo
    echo " System information disabled due to load higher than $threshold"
fi
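
For readers less used to shell arithmetic via bc, the load check in the wrapper can be restated in Python roughly as follows. This is a paraphrase of the script above, not code shipped by Landscape; the function name `sysinfo_allowed` is made up for this sketch.

```python
import os


def sysinfo_allowed(load1, cores):
    """Mirror the wrapper's check: run only while the 1-minute load
    average is strictly below the number of cores (minimum 1)."""
    threshold = float(cores) if cores and cores > 0 else 1.0
    return load1 < threshold


if __name__ == "__main__":
    # os.getloadavg() returns the 1-, 5- and 15-minute load averages.
    load1 = os.getloadavg()[0]
    cores = os.cpu_count()
    if sysinfo_allowed(load1, cores):
        print("would run landscape-sysinfo")
    else:
        print("System information disabled due to load")
```

The check only guards against a busy machine at the moment of login; it does nothing for a sysinfo run that starts under low load and then stalls.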

Do you think it was landscape-sysinfo running that made your login take more than 60s and thus timeout?

Till Klampaeckel (till-php) wrote :

My login session never timed out; I was actually authenticated but never saw a prompt.

I left it running (sitting there) overnight and the shell was still 'active' after 8 hours, but with no prompt.

The load had nothing to do with this. I booted the server (pretty blank), logged in, and then further attempts failed right away. If I waited too long (and I don't have an exact time), I could not log in at all.

I *think* it stalled at trying to find out how many people are logged in to the system. I saw a "[who] defunct" in my process list. But I have no idea why that caused my login process to block.

I looked at this script and also ran it while I was logged in, and it completed within reason. Though I would say that it still adds too much time to the login; it's a noticeable delay.

Btw, check out the 'ask ubuntu' link I left in my comment; it contains the process list with the defunct who and the sysinfo script running.

Andreas Hasenack (ahasenack) wrote :

Do you have any reason to believe "who" would stall? Do you have network users, stored in ldap or nis? Doesn't look like it, but it doesn't hurt to ask.

sysinfo uses this to get the logged in users:

def get_logged_in_users():
    result = getProcessOutputAndValue("who", ["-q"], env=os.environ)

So basically it calls "who -q".
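
For comparison, a blocking, standard-library-only version of the same lookup might look like the sketch below. This is not the Landscape implementation (which goes through Twisted's getProcessOutputAndValue); the function names and the 5-second default are assumptions for illustration.

```python
import subprocess


def parse_who_q(output):
    """Return the login names listed by `who -q`.

    `who -q` prints the names on one or more lines, followed by a
    summary line such as "# users=2", which is skipped here.
    """
    users = []
    for line in output.splitlines():
        if line.startswith("#"):
            continue
        users.extend(line.split())
    return users


def logged_in_users_stdlib(timeout_seconds=5):
    """Call `who -q`, giving up after timeout_seconds (an assumed value).

    Raises subprocess.TimeoutExpired if `who` stalls and
    subprocess.CalledProcessError if it exits with a non-zero status.
    """
    result = subprocess.run(["who", "-q"], stdout=subprocess.PIPE,
                            universal_newlines=True, check=True,
                            timeout=timeout_seconds)
    return parse_who_q(result.stdout)
```

Keeping the parsing separate from the process call makes it possible to check the parser against captured output without a live `who`.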

What are all the scripts you have in /etc/update-motd.d/*?

Till Klampaeckel (till-php) wrote :

I have no such thing – my users are 'local'.

It's a stock 10.04.4; this is what it looks like right now:

    till@statsd1:~$ ls -lah /etc/update-motd.d/
    total 48K
    drwxr-xr-x 2 root root 4.0K 2012-07-12 15:19 .
    drwxr-xr-x 89 root root 4.0K 2012-07-13 10:25 ..
    -rwxr-xr-x 1 root root 57 2010-04-23 09:45 00-header
    -rwxr-xr-x 1 root root 248 2010-04-23 09:45 10-help-text
    -rwxr-xr-x 1 root root 65 2010-04-13 20:45 20-cpu-checker
    -rwxr-xr-x 1 root root 627 2011-09-30 07:16 51_update-motd
    -rwxr-xr-x 1 root root 71 2010-04-13 20:45 90-updates-available
    -rwxr-xr-x 1 root root 61 2010-10-13 07:40 91-release-upgrade
    -rwxr-xr-x 1 root root 1.3K 2010-12-03 15:50 92-uec-upgrade-available
    -rwxr-xr-x 1 root root 306 2011-09-30 07:16 98-cloudguest
    -rwxr-xr-x 1 root root 69 2010-04-13 20:45 98-reboot-required
    -rwxr-xr-x 1 root root 261 2010-04-23 09:45 99-footer

Note: I removed landscape-client (which in turn removes the offending script from that directory) and the issues are gone.

@Andreas: Did you check out my askubuntu link (http://askubuntu.com/a/162373/11244)? It has the process list and all that. It definitely stalls at the sysinfo script, and it seems like the call to "who" broke.

Andreas Hasenack (ahasenack) wrote :

Hi Till,

yes, I saw that askubuntu process list.

What is the AMI you used, and in which region?

Till Klampaeckel (till-php) wrote :

I believe it's this: ami-6936fb00

Till Klampaeckel (till-php) wrote :

Region is 'east1'

Andreas Hasenack (ahasenack) wrote :

I tried to reproduce it a few times, but no luck. Does it happen often to you? If yes, we may be able to come up with a debug strategy.

Till Klampaeckel (till-php) wrote :

Well, I am 100% sure it's /usr/bin/landscape-sysinfo. It happened fairly consistently last week, to the point where I couldn't log in at all unless I rebooted the instance and logged in right away. We have now removed landscape-client to ensure login always works.

When I ran this script while logged in, it always worked. Only during login would it block.

I'm not too familiar with Python and could not step through the twistd code. Do you have a tip on what I could do to figure out why it blocks and doesn't let me log in when I re-install landscape-client on a test instance? I am guessing there is no general error log that I could get errors from.

Andreas Hasenack (ahasenack) wrote :

If you can reproduce it at least some times, I have some suggestions:
- hack the script to only run if you log in as a certain user. That way, you can trigger it by logging in as, say, "ubuntu", and it won't run if you log in as someone else. Something like this (untested):

--- /usr/share/landscape/landscape-sysinfo.wrapper 2012-06-13 18:10:15.000000000 -0300
+++ landscape-sysinfo.wrapper 2012-07-18 09:21:29.868717152 -0300
@@ -2,6 +2,10 @@
 cores=$(grep -c ^processor /proc/cpuinfo 2>/dev/null)
 [ "$cores" -eq "0" ] && cores=1
 threshold="${cores:-1}.0"
+if [ "$USER" = "safeuser" ]; then
+    echo "Not running landscape-sysinfo because logging in as user 'safeuser'"
+    exit 0
+fi
 if [ $(echo "`cut -f1 -d ' ' /proc/loadavg` < $threshold" | bc) -eq 1 ]; then
     echo
     echo -n " System information as of "

Then you can observe what happens when it stalls.

You could also use the above plus strace the landscape-sysinfo script, but that *may* prevent the bug from happening. Something like this:

--- /usr/share/landscape/landscape-sysinfo.wrapper 2012-06-13 18:10:15.000000000 -0300
+++ landscape-sysinfo.wrapper 2012-07-18 09:22:56.466115731 -0300
@@ -2,12 +2,16 @@
 cores=$(grep -c ^processor /proc/cpuinfo 2>/dev/null)
 [ "$cores" -eq "0" ] && cores=1
 threshold="${cores:-1}.0"
+if [ "$USER" = "safeuser" ]; then
+    echo "Not running landscape-sysinfo because logging in as user 'safeuser'"
+    exit 0
+fi
 if [ $(echo "`cut -f1 -d ' ' /proc/loadavg` < $threshold" | bc) -eq 1 ]; then
     echo
     echo -n " System information as of "
     /bin/date
     echo
-    /usr/bin/landscape-sysinfo
+    strace -f -o /tmp/sysinfo.strace /usr/bin/landscape-sysinfo
 else
     echo
     echo " System information disabled due to load higher than $threshold"

Till Klampaeckel (till-php) wrote :

Hey,

I tried your code:

1. $USER doesn't seem to be set in this context, so it runs regardless. ;-) At least it doesn't seem to be set during the login.

2. Whenever I add strace, it magically works; without it, it defuncts.

I looked around further, and it took me a while to figure this out. The offending code in my case is in here:
/usr/lib/python2.6/dist-packages/landscape/lib/sysstats.py

and it imports getProcessOutputAndValue from here:
/usr/share/pyshared/twisted/internet/utils.py

I couldn't figure out how this would block, but essentially it looks like you could pass callbacks to the function, which is currently not happening.

I was wondering if there is a way to debug this in Python. My skills are currently limited to putting "print" into files, but that doesn't work too well. ;-)

Till

Till Klampaeckel (till-php) wrote :

Here is an idea for a patch – but I am not sure if it actually works:

--- /usr/share/pyshared/landscape/lib/sysstats.py.orig 2012-07-26 15:26:42.000000000 +0000
+++ /usr/share/pyshared/landscape/lib/sysstats.py 2012-07-26 15:32:38.000000000 +0000
@@ -55,6 +55,13 @@ class MemoryStats(object):
 def get_logged_in_users():
     result = getProcessOutputAndValue("who", ["-q"], env=os.environ)

+    def logged_in_users_err_callback(result):
+        """ Errback from getProcessOutputAndValue """
+        out, err, code = result
+        raise Exception("Error getting users exited %d with error: %s (%s)" % (code, err, out))
+
+    result.addErrback(logged_in_users_err_callback)
+
     def parse_output((stdout_data, stderr_data, status)):
         if status != 0:
             raise CommandError(stderr_data)

Till Klampaeckel (till-php) wrote :

Any update, help?
