login console 0 in user namespace container is not configured right

Bug #1263738 reported by Serge Hallyn
Affects            Status         Importance   Assigned to    Milestone
upstart            Fix Released   Undecided    James Hunt
linux (Ubuntu)     Confirmed      High         Seth Forshee
  Trusty           Confirmed      High         Seth Forshee
lxc (Ubuntu)       Invalid        High         Unassigned
  Trusty           Invalid        High         Unassigned
upstart (Ubuntu)   Fix Released   High         Unassigned
  Trusty           Fix Released   High         Unassigned

Bug Description

When you create a container in a private user namespace and start it
without the '-d' flag, the console is not properly set up. Logging in
gives you

-bash: no job control in this shell

and hitting ctrl-c reboots the container.

Consoles from 'lxc-console -n $container' behave correctly.

This may be a kernel issue, as discussed here:

http://lists.linuxcontainers.org/pipermail/lxc-devel/2013-October/005843.html

so also marking this as affecting the kernel.

This can be worked around, but really needs to be fixed before trusty is frozen.

Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1263738

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: bot-stop-nagging
Changed in linux (Ubuntu):
status: Incomplete → New
Brad Figg (brad-figg)
Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-key trusty
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Do you happen to know if this is also an issue in the latest mainline kernel[0]?

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.13-rc6-trusty/

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
tags: added: kernel-da-key
removed: kernel-key
Serge Hallyn (serge-hallyn) wrote : Re: [Bug 1263738] Re: login console 0 in user namespace container is not configured right

Yup, just tested, same still happens.

ls -l /proc/1/* shows everything belonging to root. But picking
my login process,

root 631 1 0 19:43 ? 00:00:00 /bin/login --

I get:
dr-xr-xr-x 2 root ubuntu 0 Jan 2 19:43 attr
-rw-r--r-- 1 nobody nogroup 0 Jan 2 19:43 autogroup
-r-------- 1 nobody nogroup 0 Jan 2 19:43 auxv
-r--r--r-- 1 nobody nogroup 0 Jan 2 19:43 cgroup
--w------- 1 nobody nogroup 0 Jan 2 19:43 clear_refs
-r--r--r-- 1 nobody nogroup 0 Jan 2 19:43 cmdline
-rw-r--r-- 1 nobody nogroup 0 Jan 2 19:43 comm
-rw-r--r-- 1 nobody nogroup 0 Jan 2 19:43 coredump_filter
-r--r--r-- 1 nobody nogroup 0 Jan 2 19:43 cpuset
lrwxrwxrwx 1 nobody nogroup 0 Jan 2 19:43 cwd
-r-------- 1 nobody nogroup 0 Jan 2 19:43 environ
lrwxrwxrwx 1 nobody nogroup 0 Jan 2 19:43 exe
dr-x------ 2 nobody nogroup 0 Jan 2 19:43 fd
dr-x------ 2 nobody nogroup 0 Jan 2 19:43 fdinfo
-rw-r--r-- 1 nobody nogroup 0 Jan 2 19:43 gid_map
-r-------- 1 nobody nogroup 0 Jan 2 19:43 io
-r--r--r-- 1 nobody nogroup 0 Jan 2 19:43 latency
-r--r--r-- 1 nobody nogroup 0 Jan 2 19:43 limits
-rw-r--r-- 1 nobody nogroup 0 Jan 2 19:43 loginuid
dr-x------ 2 nobody nogroup 0 Jan 2 19:43 map_files
-r--r--r-- 1 nobody nogroup 0 Jan 2 19:43 maps
-rw------- 1 nobody nogroup 0 Jan 2 19:43 mem
-r--r--r-- 1 nobody nogroup 0 Jan 2 19:43 mountinfo
-r--r--r-- 1 nobody nogroup 0 Jan 2 19:43 mounts
-r-------- 1 nobody nogroup 0 Jan 2 19:43 mountstats
dr-xr-xr-x 5 root ubuntu 0 Jan 2 19:43 net
dr-x--x--x 2 nobody nogroup 0 Jan 2 19:43 ns
-r--r--r-- 1 nobody nogroup 0 Jan 2 19:43 numa_maps
-rw-r--r-- 1 nobody nogroup 0 Jan 2 19:43 oom_adj
-r--r--r-- 1 nobody nogroup 0 Jan 2 19:43 oom_score
-rw-r--r-- 1 nobody nogroup 0 Jan 2 19:43 oom_score_adj
-r--r--r-- 1 nobody nogroup 0 Jan 2 19:43 pagemap
-r--r--r-- 1 nobody nogroup 0 Jan 2 19:43 personality
-rw-r--r-- 1 nobody nogroup 0 Jan 2 19:43 projid_map
lrwxrwxrwx 1 nobody nogroup 0 Jan 2 19:43 root
-rw-r--r-- 1 nobody nogroup 0 Jan 2 19:43 sched
-r--r--r-- 1 nobody nogroup 0 Jan 2 19:43 schedstat
-r--r--r-- 1 nobody nogroup 0 Jan 2 19:43 sessionid
-r--r--r-- 1 nobody nogroup 0 Jan 2 19:43 smaps
-r--r--r-- 1 nobody nogroup 0 Jan 2 19:43 stack
-r--r--r-- 1 nobody nogroup 0 Jan 2 19:43 stat
-r--r--r-- 1 nobody nogroup 0 Jan 2 19:43 statm
-r--r--r-- 1 nobody nogroup 0 Jan 2 19:43 status
-r--r--r-- 1 nobody nogroup 0 Jan 2 19:43 syscall
dr-xr-xr-x 3 root ubuntu 0 Jan 2 19:43 task
-r--r--r-- 1 nobody nogroup 0 Jan 2 19:43 timers
-rw-r--r-- 1 nobody nogroup 0 Jan 2 19:43 uid_map
-r--r--r-- 1 nobody nogroup 0 Jan 2 19:43 wchan

And, on login I got '-bash: no job control in this shell'
and ctrl-c reboots the container.

 status: confirmed

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Joseph Salisbury (jsalisbury) wrote :

One additional question, do you happen to know if this is a regression? Did this not happen with previous releases/kernels?

tags: added: kernel-key
removed: kernel-da-key
Serge Hallyn (serge-hallyn) wrote :

Quoting Joseph Salisbury (<email address hidden>):
> One additional question, do you happen to know if this is a regression?
> Did this not happen with previous releases/kernels?

This is not a regression; it has never worked right.

We believe the problem is that if a task is !dumpable, then the kernel
marks some of its /proc/pid files as owned by the global host root,
which is not mapped into a user namespace. If that is the case, then
the question is whether it is safe to mark them owned by the container
root; or whether we can distinguish between tasks which became dumpable
before switching namespaces; or whether there is something else we can
do.
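
The 'nobody nogroup' owners in the listing above fit that theory: an ID that has no mapping in the viewer's user namespace is reported as the overflow ID, 65534 by default. A minimal Python sketch of that translation follows. The map used here (container IDs 0 to 65535 backed by host IDs starting at 100000) is a hypothetical example, since the bug does not show the container's actual uid_map, and the function names are made up for illustration.

```python
OVERFLOW_ID = 65534  # default value of /proc/sys/kernel/overflowuid


def parse_id_map(text):
    """Parse uid_map/gid_map text into (inside_start, outside_start, length) tuples."""
    extents = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            extents.append(tuple(int(f) for f in fields))
    return extents


def id_seen_inside(outside_id, extents):
    """Translate an ID of the parent namespace into the namespace's own view."""
    for inside_start, outside_start, length in extents:
        if outside_start <= outside_id < outside_start + length:
            return inside_start + (outside_id - outside_start)
    return OVERFLOW_ID  # unmapped IDs are displayed as nobody/nogroup


# Hypothetical map (assumption, not taken from the bug report).
EXAMPLE_MAP = parse_id_map("         0     100000      65536\n")
```

With this map, host root (ID 0) has no mapping and is shown as 65534, while host ID 100000 appears as root inside the container.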

Seth Forshee (sforshee) wrote :

I tried the kernel patch from the mailing list, but that doesn't fix the problem. It does fix permissions for most /proc/pid/* files in setuid processes, but the console problems remain.

Serge Hallyn (serge-hallyn) wrote :

Quoting Seth Forshee (<email address hidden>):
> I tried the kernel patch from the mailing list, but that doesn't fix the
> problem. It does fix permissions for most /proc/pid/* files in setuid
> processes, but the console problems remain.

That's interesting! Thanks for testing.

Seth Forshee (sforshee) wrote :

I straced bash, and I think this is what ends up causing job control to be disabled:

ioctl(255, SNDRV_TIMER_IOCTL_SELECT or TIOCSPGRP, [1144]) = -1 ENOTTY (Inappropriate ioctl for device)

255 is stderr duped to a high fd, so it looks like whatever stderr is mapped to is not a tty.

Seth Forshee (sforshee) wrote :

stderr actually is mapped to a pty. The problem seems to be that getty can't set /dev/console as its controlling terminal because it's already the controlling tty for init, which is in a different process group. Thus getty ends up with no controlling tty, this is inherited by bash, and thus bash cannot set up job control.
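
The same condition can be probed from user space without bash. The Python sketch below is a minimal illustration that assumes only the standard `os` module; the helper name `has_job_control_tty` is made up. It issues the read-only counterpart of the failing request (TIOCGPGRP, via `os.tcgetpgrp`) and treats any OSError, such as ENOTTY, as "job control unavailable".

```python
import os


def has_job_control_tty(fd):
    """Return True when the kernel accepts a foreground-process-group query on fd."""
    try:
        os.tcgetpgrp(fd)  # fails with ENOTTY if fd is not usable for job control
        return True
    except OSError:
        return False
```

In the affected container shell, `has_job_control_tty(2)` would be expected to return False, matching the ENOTTY seen in the strace output.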

Serge Hallyn (serge-hallyn) wrote :

Quoting Seth Forshee (<email address hidden>):
> stderr actually is mapped to a pty. The problem seems to be that getty
> can't set /dev/console as its controlling terminal because it's already
> the controlling tty for init, which is in a different process group.
> Thus getty ends up with no controlling tty, this is inherited by bash,
> and thus bash cannot set up job control.

Interesting.

Note that what you describe should also be the case if using a regular
container

 sudo lxc-create -t ubuntu-cloud -n u1
 sudo lxc-start -n u1

Is the process group of init somehow ending up different in the user
namespace case? Or else why would this only be a problem in the
user namespace case?

Seth Forshee (sforshee) wrote :

On Tue, Jan 14, 2014 at 08:42:06PM -0000, Serge Hallyn wrote:
> Note that what you describe should also be the case if using a regular
> container
>
> sudo lxc-create -t ubuntu-cloud -n u1
> sudo lxc-start -n u1
>
> Is the process group of init somehow ending up different in the user
> namespace case? Or else why would this only be a problem in the
> user namespace case?

It is different. Here are the controlling ttys without user namespaces:

ubuntu@u1:~$ cat /proc/$$/stat | cut -d' ' -f7
34826
ubuntu@u1:~$ cat /proc/1/stat | cut -d' ' -f7
0

and with user namespaces:

ubuntu@c1:~$ cat /proc/$$/stat | cut -d' ' -f7
0
ubuntu@c1:~$ cat /proc/1/stat | cut -d' ' -f7
34826

init should have its controlling terminal cleared when it calls
setsid(), so either it isn't calling setsid() or else setsid() is
failing. The reasons setsid() would fail are that the process is already
a session leader or else a process group with the same id already
exists. I haven't found how user namespaces would have any effect on
those things, however.
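
Field 7 of /proc/<pid>/stat is tty_nr, the device number of the controlling terminal, with 0 meaning none. A minimal Python sketch for extracting and decoding it follows; the function names are illustrative. It splits after the last ')' because the command name in field 2 may itself contain spaces, which would throw off a plain `cut -d' '`. The decoding assumes the kernel's usual device-number layout for this field: low minor byte in bits 0 to 7, major from bit 8 up, remaining minor bits from bit 20 up.

```python
def tty_nr_from_stat(stat_text):
    """Extract tty_nr (field 7) from the contents of /proc/<pid>/stat."""
    # Fields 1-2 are "pid (comm)"; comm may contain spaces or parentheses,
    # so split on the last ')' and count the remaining fields from field 3.
    rest = stat_text.rsplit(")", 1)[1].split()
    return int(rest[4])  # rest[0] is field 3, so rest[4] is field 7


def decode_tty_nr(tty_nr):
    """Split tty_nr into (major, minor)."""
    major = (tty_nr >> 8) & 0xFFF
    minor = (tty_nr & 0xFF) | ((tty_nr >> 12) & 0xFFF00)
    return major, minor
```

The value 34826 from the output above decodes to major 136, minor 10, which is a Unix98 pty slave (/dev/pts/10).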

Seth Forshee (sforshee) wrote :

The same basic sequence of events happens with and without user namespaces. init sheds its tty with setsid() but then opens /dev/console, which has the effect of making /dev/console its controlling tty. Later getty also opens /dev/console and tries the TIOCSCTTY ioctl on the fd. At this point I think the following code in the kernel handling of that ioctl comes into play:

        if (tty->session) {
                /*
                 * This tty is already the controlling
                 * tty for another session group!
                 */
                if (arg == 1 && capable(CAP_SYS_ADMIN)) {
                        /*
                         * Steal it away
                         */
                        read_lock(&tasklist_lock);
                        session_clear_tty(tty->session);
                        read_unlock(&tasklist_lock);
                } else {
                        ret = -EPERM;
                        goto unlock;
                }
        }

I.e. getty doesn't have CAP_SYS_ADMIN and thus can't steal the console from init. I'm not sure what the fix is yet, whether there's something we can do here which can allow root within a namespace to steal the console or whether upstart just needs to explicitly shed the console after opening it.
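
The branch quoted above can be read as a small decision table. The Python sketch below is only a userspace model of that excerpt, not kernel code: the function and parameter names are invented for illustration, and the other checks the real ioctl performs are left out.

```python
import errno


def tiocsctty_result(tty_has_session, arg, caller_capable_sys_admin):
    """Model the quoted branch: return 0 on success or a negative errno."""
    if tty_has_session:
        # The tty is already the controlling tty of another session.
        if arg == 1 and caller_capable_sys_admin:
            return 0         # steal: the old session loses its tty
        return -errno.EPERM  # the case getty hits inside the user namespace
    return 0                 # tty is free, nothing to steal
```

In this model getty's call corresponds to `tiocsctty_result(True, 1, False)`, which yields -EPERM because `capable()` only succeeds for a task privileged in the initial user namespace.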

Serge Hallyn (serge-hallyn) wrote :

Quoting Seth Forshee (<email address hidden>):
> The same basic sequence of events happens with and without user
> namespaces. init sheds its tty with setsid() but then opens
> /dev/console, which has the effect of making /dev/console its
> controlling tty. Later getty also opens /dev/console and tries the
> TIOCSCTTY ioctl on the fd. At this point I think the following code in
> the kernel handling of that ioctl comes into play:
>
>         if (tty->session) {
>                 /*
>                  * This tty is already the controlling
>                  * tty for another session group!
>                  */
>                 if (arg == 1 && capable(CAP_SYS_ADMIN)) {
>                         /*
>                          * Steal it away
>                          */
>                         read_lock(&tasklist_lock);
>                         session_clear_tty(tty->session);
>                         read_unlock(&tasklist_lock);
>                 } else {
>                         ret = -EPERM;
>                         goto unlock;
>                 }
>         }
>
> I.e. getty doesn't have CAP_SYS_ADMIN and thus can't steal the console
> from init. I'm not sure what the fix is yet, whether there's something
> we can do here which can allow root within a namespace to steal the
> console or whether upstart just needs to explicitly shed the console
> after opening it.

If it is possible to get to the inode backing the tty at this point
then we should be able to do inode_capable(tty_inode(tty),
CAP_SYS_ADMIN), which should be safe and adequate, right?

But I don't think we can get the inode from the tty. However we can get the
tty->session which is a struct pid*. So we can check whether we have
ns_capable(ns_of_pid(tty->session), CAP_SYS_ADMIN)

-serge

Seth Forshee (sforshee) wrote :

On Wed, Jan 15, 2014 at 06:37:41PM -0000, Serge Hallyn wrote:
> If it is possible to get to the inode backing the tty at this point
> then we should be able to do inode_capable(tty_inode(tty),
> CAP_SYS_ADMIN), which should be safe and adequate, right?
>
> But I don't think we can get the inode from the tty. However we can get the

I'm new to how capabilities are handled with user namespaces, but at a
glance I think inode_capable() looks sufficient. We can't get the inode
from the tty but it could easily be passed as an argument to the function
containing that code.

> tty->session which is a struct pid*. So we can check whether we have
> ns_capable(ns_of_pid(tty->session), CAP_SYS_ADMIN)

Except that we're not interested in the capabilities of tty->session but
of current since current is the one doing the stealing. So that should
probably be ns_capable(current_user_ns(), CAP_SYS_ADMIN).

I'm thinking though we also need to verify that tty->session is in the
same namespace, otherwise nothing seems to prevent a less privileged
namespace from doing mknod and stealing any tty from another namespace,
which seems like a serious security issue. So something along the lines
of:

  if (arg == 1 &&
      (capable(CAP_SYS_ADMIN) ||
       (current_user_ns() == ns_of_pid(tty->session) &&
        ns_capable(current_user_ns(), CAP_SYS_ADMIN)))) {
          /* steal tty */
  }

Or am I being too paranoid?

Stéphane Graber (stgraber) wrote :

On Wed, Jan 15, 2014 at 07:53:54PM -0000, Seth Forshee wrote:
> On Wed, Jan 15, 2014 at 06:37:41PM -0000, Serge Hallyn wrote:
> > If it is possible to get to the inode backing the tty at this point
> > then we should be able to do inode_capable(tty_inode(tty),
> > CAP_SYS_ADMIN), which should be safe and adequate, right?
> >
> > But I don't think we can get the inode from the tty. However we can get the
>
> I'm new to how capabilities are handled with user namespaces, but at a
> glance I think inode_capable() looks sufficient. We can't get the inode
> from the tty but it could easily be passed as an argument to the function
> containing that code.
>
> > tty->session which is a struct pid*. So we can check whether we have
> > ns_capable(ns_of_pid(tty->session), CAP_SYS_ADMIN)
>
> Except that we're not interested in the capabilities of tty->session but
> of current since current is the one doing the stealing. So that should
> probably be ns_capable(current_user_ns(), CAP_SYS_ADMIN).
>
> I'm thinking though we also need to verify that tty->session is in the
> same namespace, otherwise nothing seems to prevent a less privileged
> namespace from doing mknod and stealing any tty from another namespace,
> which seems like a serious security issue. So something along the lines
> of:
>
>   if (arg == 1 &&
>       (capable(CAP_SYS_ADMIN) ||
>        (current_user_ns() == ns_of_pid(tty->session) &&
>         ns_capable(current_user_ns(), CAP_SYS_ADMIN)))) {
>           /* steal tty */
>   }
>
> Or am I being too paranoid?

mknod isn't possible from a userns, otherwise we'd have a lot more
problems than just tty devices (think what would happen if I could mknod
sda in a container).

--
Stéphane Graber
Ubuntu developer
http://www.ubuntu.com

Serge Hallyn (serge-hallyn) wrote :

Quoting Seth Forshee (<email address hidden>):
> On Wed, Jan 15, 2014 at 06:37:41PM -0000, Serge Hallyn wrote:
> > If it is possible to get to the inode backing the tty at this point
> > then we should be able to do inode_capable(tty_inode(tty),
> > CAP_SYS_ADMIN), which should be safe and adequate, right?
> >
> > But I don't think we can get the inode from the tty. However we can get the
>
> I'm new to how capabilities are handled with user namespaces, but at a
> glance I think inode_capable() looks sufficient. We can't get the inode
> from the tty but it could easily be passed as an argument to the function
> containing that code.

The question actually remains: what do we need privilege toward? If
user A has file F open, and we are going to steal F from A... IIUC we
should already have checked for permission to access F, right? So now the
question is only whether we can take something from A, or whether A is
more privileged than us.

> > tty->session which is a struct pid*. So we can check whether we have
> > ns_capable(ns_of_pid(tty->session), CAP_SYS_ADMIN)
>
> Except that we're not interested in the capabilities of tty->session but

The ns_capable line doesn't check the capabilities of tty->session,
but rather current's capabilities targeted toward the user namespace
which owns tty->session.

> of current since current is the one doing the stealing. So that should
> probably be ns_capable(current_user_ns(), CAP_SYS_ADMIN).

That would check the privilege of current toward his own userns. Any
unprivileged user can clone(CLONE_NEWUSER) and have that test evaluate
to true.

> I'm thinking though we also need to verify that tty->session is in the
> same namespace, otherwise nothing seems to prevent a less privileged
> namespace from doing mknod and stealing any tty from another namespace,
> which seems like a serious security issue. So something along the lines
> of:
>
>   if (arg == 1 &&
>       (capable(CAP_SYS_ADMIN) ||
>        (current_user_ns() == ns_of_pid(tty->session) &&
>         ns_capable(current_user_ns(), CAP_SYS_ADMIN)))) {
>           /* steal tty */
>   }
>
> Or am I being too paranoid?

That would be the point of doing:

 ns_capable(ns_of_pid(tty->session), CAP_SYS_ADMIN)

If you are in a child userns of init, you do not have CAP_SYS_ADMIN toward
init's pidns.
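
A toy model may help with the distinction drawn here. The Python sketch below is a deliberate simplification of the rule that a task holding a capability in a user namespace also holds it toward that namespace's descendants, but not toward its ancestors or unrelated namespaces. The classes and the helper are hypothetical, and the model ignores details such as the implicit capabilities of a namespace's owner.

```python
class UserNS:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


class Task:
    def __init__(self, user_ns, caps):
        self.user_ns = user_ns  # namespace the task lives in
        self.caps = set(caps)   # effective capabilities in that namespace


def ns_capable(task, target_ns, cap):
    """True if task holds cap toward target_ns (simplified model)."""
    ns = target_ns
    while ns is not None:
        if ns is task.user_ns:
            return cap in task.caps
        ns = ns.parent  # walk up toward the initial namespace
    return False        # target is neither the task's namespace nor a descendant


init_ns = UserNS("init")
container_ns = UserNS("container", parent=init_ns)
container_root = Task(container_ns, {"CAP_SYS_ADMIN"})
host_root = Task(init_ns, {"CAP_SYS_ADMIN"})
```

In this model container root passes the check against its own namespace, which is why a test against the caller's own namespace proves little, but fails it against the initial namespace, so it could not steal a tty from a session that lives there.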

Seth Forshee (sforshee) wrote :

On Wed, Jan 15, 2014 at 08:18:04PM -0000, Serge Hallyn wrote:
> The ns_capable line doesn't check the capabilities of tty->session,
> but rather current's capabilities targeted toward the user namespace
> which owns tty->session.

Okay, this was my fundamental misunderstanding. It makes sense now. This
plus the fact that mknod isn't allowed from a user ns alleviates my
concerns.

I'll try this out.

Seth Forshee (sforshee) wrote :

Serge: I've got a patch that fixes the problem. I've uploaded a test build along with the patch to:

http://people.canonical.com/~sforshee/lp1263738/linux-3.13.0-3.18~lp1263738v201401152110/

I still want to verify that it's impossible to steal a tty from a process in a parent namespace, but if that checks out and the patch looks good to you I'll send it upstream.

I do think however that upstart should also be issuing TIOCNOTTY after opening /dev/console. It seems fairly clear from the code that the intention is to not own the console device.

Seth Forshee (sforshee) wrote :

I've added an upstart task to the bug. After looking a bit more it seems upstart is trying to always open terminal devices with O_NOCTTY, so the tty ownership by init is likely unintentional and therefore a bug. I haven't been able to find where in upstart this is happening, but on the kernel side I can tell that it's due to an open() without O_NOCTTY. So while I think the kernel change makes sense it seems like it's more of a workaround for a bug in upstart.

James Hunt (jamesodhunt) wrote :

> I haven't been able to find where in upstart this is happening, but on the kernel side I can tell that it's due to an open() without
> O_NOCTTY.

Upstart does not open /dev/console without O_NOCTTY afaics. Are you sure your kernel debug is showing pid 1 is doing this?

Seth Forshee (sforshee) wrote :

On Mon, Jan 20, 2014 at 06:30:24PM -0000, James Hunt wrote:
> Upstart does not open /dev/console without O_NOCTTY afaics. Are you sure
> your kernel debug is showing pid 1 is doing this?

Yes, pid 1 within the namespace at least. I couldn't find anywhere where
upstart opened /dev/console without O_NOCTTY set either, but evidently
it is happening somehow.

I just looked though, and this doesn't seem to be happening for actual
pid 1, only pid 1 in containers. The container is saucy and the host is
trusty, but they both have upstart 1.11-0ubuntu1.

tags: removed: kernel-key
Seth Forshee (sforshee) wrote :

I figured out what's happening. lxc sets up /dev/kmsg as a symlink to /dev/console, init fopens kmsg, and suddenly it owns the console. Not sure whether the fix is to handle kmsg differently or special-case it in upstart to be opened with O_NOCTTY. I'll leave it to Serge and James to figure that out, and in the meantime I'll attend to the kernel patch.
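
The second option comes down to a common pattern: standard fopen() offers no way to pass O_NOCTTY, so the file is opened with the low-level call and then wrapped in a stream. A minimal Python sketch of that pattern follows. It is not upstart's actual fix (upstart is written in C), and `open_log_noctty` is a made-up name.

```python
import os


def open_log_noctty(path):
    """Open path for writing without acquiring it as the controlling tty."""
    # O_NOCTTY matters when path is, or is a symlink to, a terminal device,
    # as with /dev/kmsg pointing at /dev/console in the container above.
    fd = os.open(path, os.O_WRONLY | os.O_NOCTTY)
    return os.fdopen(fd, "w")
```

For regular files and non-terminal devices the flag has no effect, so using it unconditionally for log targets should be harmless.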

Changed in lxc (Ubuntu Trusty):
assignee: nobody → Seth Forshee (sforshee)
assignee: Seth Forshee (sforshee) → nobody
Changed in linux (Ubuntu Trusty):
assignee: nobody → Seth Forshee (sforshee)
Serge Hallyn (serge-hallyn) wrote :

Hi James Hunt,

can you comment on how you would feel about opening /dev/kmsg with
O_NOCTTY ?

James Hunt (jamesodhunt) wrote :

@Serge - perfectly happy to change that, in fact...

@Stéphane - could you review and test lp:~jamesodhunt/upstart/kmsg-noctty? Thanks!

James Hunt (jamesodhunt)
Changed in upstart:
assignee: nobody → James Hunt (jamesodhunt)
status: New → Fix Committed
Changed in upstart (Ubuntu Trusty):
status: New → Confirmed
Stéphane Graber (stgraber) wrote :

I'm closing the lxc task as there's nothing we can do in lxc itself to avoid this; the upstart and kernel patches will solve this for us.

Btw, the branch proposed by James above does work fine for me and has since been accepted upstream, so the next upload should include this fix.

Changed in lxc (Ubuntu Trusty):
status: Triaged → Invalid
Changed in upstart (Ubuntu Trusty):
importance: Undecided → High
Brian Murray (brian-murray) wrote :

This seems to be fixed in the trusty version of upstart, although no debian/changelog entry was created for this bug.

Changed in upstart:
status: Fix Committed → Fix Released
Changed in upstart (Ubuntu Trusty):
status: Confirmed → Fix Released