Comment 3 for bug 62139

Sebastien Bacher (seb128) wrote :

Comment from upstream:

"Typically, when the GDM daemon becomes this unresponsive, it is because it
has crashed. Note that GDM works like this:

1. The main GDM daemon starts. This is the process whose PID is listed in
   /var/run/gdm.pid. Note that most programs try to connect to this process
   when they want to communicate. For example, gdmflexiserver looks up the
   daemon process PID in this file.

2. For each display, a forked daemon (aka the slave daemon) is started.

3. The forked daemon starts the GUI program (gdmlogin/gdmgreeter).

Note it can look like daemons are still running even when the main daemon has
crashed, since the forked slave daemons (#2 above) may still be running.
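
One way to check whether the main daemon itself is still alive, rather than
just its slaves, is to read the PID file and probe that PID. The following is
a minimal sketch in Python, not part of GDM; the helper names read_pid and
is_running are made up for illustration, and only the /var/run/gdm.pid path
comes from the description above.

```python
import os


def read_pid(pid_file="/var/run/gdm.pid"):
    """Return the PID stored in pid_file, or None if missing or malformed."""
    try:
        with open(pid_file) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def is_running(pid):
    """Return True if a process with this PID currently exists."""
    if pid is None or pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 delivers nothing; it only checks the PID
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


if __name__ == "__main__":
    pid = read_pid()
    print("main daemon PID:", pid, "alive:", is_running(pid))
```

A stale PID file can point at an unrelated process that reused the number, so
a positive result is only a hint; comparing the process name as well (for
example with ps) is safer.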

If the main daemon is crashing, has it left a core file somewhere? Look for
GDM-related core files in /, /var/lib/gdm, /var/lib/log/gdm, or the GDM
user's $HOME directory (if any).
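
The search can be scripted. This is a rough sketch that assumes core files
have names starting with "core" (the actual name depends on the system's core
pattern setting, so this is an assumption); find_core_files is a hypothetical
helper. The directory list is the one given above, and the GDM user's home
directory would need to be added by hand.

```python
import os


def find_core_files(directories):
    """List regular files named core* directly inside each directory."""
    found = []
    for directory in directories:
        try:
            names = os.listdir(directory)
        except OSError:
            continue  # directory missing or unreadable; skip it
        for name in sorted(names):
            path = os.path.join(directory, name)
            if name.startswith("core") and os.path.isfile(path):
                found.append(path)
    return found


if __name__ == "__main__":
    for path in find_core_files(["/", "/var/lib/gdm", "/var/lib/log/gdm"]):
        print(path)
```

The check is deliberately not recursive, so scanning / only looks at the top
level.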

I suspect all of the above won't help you since it sounds like the main daemon
isn't crashing. I just mention this so you can double check, since this is
typically the problem for this sort of symptom.

GDM uses a couple of different mechanisms to communicate between the main
daemon and other processes. There are the GDM_SUP* messages, which are used by
various programs to query the daemon about different things. What happens when
you run this command? It tests the GDM_SUP* messages. Does it work?

   gdmflexiserver --command=VERSION

The other kind of messages are GDM_SOP_* messages. For these messages, the
forked slave process sends the main daemon a SIGUSR1 signal and then uses a
fifo protocol for passing the messages. Note the function create_connections
in daemon/gdm.c and note gdm_slave_send in daemon/slave.c. In your syslog, it
seems like messages such as GDM_SOP_LOGGED_IN (aka LOGGED_IN) are failing. So
this is probably where it is breaking. I'm wondering if this is the *only*
kind of message that is breaking or if all GDM_SOP messages are broken for
you.
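
To see whether only one message type shows up in the failures, the log can be
tallied by message name. The sketch below assumes the log lines contain the
full GDM_SOP_* name; if your syslog only shows the short form (for example
LOGGED_IN), the pattern would need adjusting. tally_sop_messages and its
keyword parameter are hypothetical names, not anything GDM provides.

```python
import re
from collections import Counter

# Matches names such as GDM_SOP_LOGGED_IN anywhere in a log line.
SOP_NAME = re.compile(r"\bGDM_SOP_[A-Z0-9_]+\b")


def tally_sop_messages(lines, keyword=None):
    """Count GDM_SOP_* names, optionally only in lines containing keyword."""
    counts = Counter()
    for line in lines:
        if keyword is not None and keyword.lower() not in line.lower():
            continue
        counts.update(SOP_NAME.findall(line))
    return counts
```

Passing the lines of the syslog file, with keyword set to whatever word your
log uses for a failure, gives a per-message count. If only one name appears,
the problem is probably specific to that message; if many do, the channel
itself is suspect.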

If I were to guess what the problem might be, it could be that GDM is
executing some code in the SIGUSR1 signal handler that is not signal-handler
safe. Since it works most of the time, but fails in weird situations where
many users log out at once, this might be triggering a normally rare event that
makes the signal handler mad.
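
The usual way to avoid this class of bug is to have the handler do almost
nothing and defer the real work to the main loop. The sketch below shows that
shape with a self-pipe. It is an illustration only, not GDM's code: Python
runs its handlers between bytecodes and so does not have C's restriction, and
the names on_sigusr1, process_pending and handled are invented for the
example.

```python
import os
import select
import signal

# Self-pipe: the handler writes one byte here and the main loop reads it.
_read_fd, _write_fd = os.pipe()
os.set_blocking(_read_fd, False)
os.set_blocking(_write_fd, False)

handled = []  # filled in by the main loop, never by the handler


def on_sigusr1(signum, frame):
    """Do the bare minimum: record that a notification arrived."""
    try:
        os.write(_write_fd, b"\x01")
    except BlockingIOError:
        pass  # pipe is full, so a wakeup is already pending


def process_pending(timeout=1.0):
    """Run in normal context: drain the pipe, then do the real work."""
    ready, _, _ = select.select([_read_fd], [], [], timeout)
    if not ready:
        return 0
    count = len(os.read(_read_fd, 4096))
    for _ in range(count):
        handled.append("notification")  # placeholder for message handling
    return count


signal.signal(signal.SIGUSR1, on_sigusr1)

if __name__ == "__main__":
    os.kill(os.getpid(), signal.SIGUSR1)
    print("processed:", process_pending())
```

In C the equivalent handler would be limited to calls such as write(), which
is async-signal-safe; logging, memory allocation and list manipulation would
all move into the main loop.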

Note the gdm_slave_usr2_handler function in slave.c. It might be worth looking
into this function more closely.

At this point, I'm just giving you some pointers to hopefully narrow down what
the problem may be. Perhaps you could investigate, and if possible add some
more gdm_debug messages so we can have a better understanding of where the
failure may be happening?"