Canonical Juju

jujud panic output does not appear in log

Bug #1735120 reported by Roger Peppe on 2017-11-29

6

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Canonical Juju	Triaged	Low	Unassigned

Bug Description

Using Juju 2.2.6, when I kill the machine 0 machine agent with -QUIT, I do not see any panic output in /var/log/juju/machine-0.log (or anywhere else for that matter).

If confirmed, this is a significant issue because sometimes the panic stack trace can be the only thing that allows us to diagnose an issue.

The script that starts jujud (/var/lib/juju/init/jujud-machine-0/exec-start.sh) looks like this:

#!/usr/bin/env bash

# Set up logging.
touch '/var/log/juju/machine-0.log'
chown syslog:syslog '/var/log/juju/machine-0.log'
chmod 0600 '/var/log/juju/machine-0.log'
exec >> '/var/log/juju/machine-0.log'
exec 2>&1

# Run the script.
'/var/lib/juju/tools/machine-0/jujud' machine --data-dir '/var/lib/juju' --machine-id 0 --debug

Since jujud no longer prints its logging output to stdout, but instead opens and writes the machine-0.log file itself, I wonder if there is some raciness between that output and the stderr output which is going to the same file but using a different file descriptor.

Tags:

Revision history for this message

John A Meinel (jameinel) wrote on 2017-11-29:

#1

I did a quick test, bootstrapping juju 2.2.6 on lxd, and I can't reproduce this.
I see the traceback in the machine-0.log file as expected
...
goroutine 3501 [select, 1 minutes]:
github.com/juju/juju/worker/catacomb.(*Catacomb).add.func2(0xc420a9f4a0, 0x7ff97c6216d0, 0xc4208b1320, 0xc4208b13e0)
/tmp/tmpwONtpt/juju-core_2.2.6/src/github.com/juju/juju/worker/catacomb/catacomb.go:181 +0x152
created by github.com/juju/juju/worker/catacomb.(*Catacomb).add
/tmp/tmpwONtpt/juju-core_2.2.6/src/github.com/juju/juju/worker/catacomb/catacomb.go:187 +0xe1

Is it possible you passed -QUIT to the wrong process?
I killed the 'jujud' process, rather than the "exec-start.sh" process.

Changed in juju:
status:	New → Incomplete

Revision history for this message

John A Meinel (jameinel) wrote on 2017-11-29:

#2

I tried doing a -QUIT to the exec-start.sh process, but that seemed to just ignore the kill. I can see it being easy to try to kill the wrong process, given:
root 3341 3337 0 11:24 ? 00:00:06 /var/lib/juju/tools/machine-0/jujud machine --data-dir /var/lib/juju --machine-id 0 --debug

it would be easy to try to kill 3337 when 3341 is the correct process to kill.

Did you see a line with "running jujud" which is the first thing that a new process logs?
I see the QUIT goroutine listing just before that everytime I tried (I did it about 4 times).

Revision history for this message

John A Meinel (jameinel) wrote on 2017-11-29:

#3

(just to complete, I also see the traceback using 2.3rc2)

Revision history for this message

Roger Peppe (rogpeppe) wrote on 2017-11-29:

#4

I definitely killed the right process, but when I tried to reproduce the issue again, I didn't see it.

ISTM that this might only apply after juju has rolled over the log file. The exec-start.sh script redirects output to machine-0.log. When lumberjack rotates the files, it will close that file and remove it, but jujud's fd 2 will still refer to the original file, even though that file has now been compressed and removed, so that's where the panic output will go. If that's true, then there's another problem too, because the original file will still be taking up disk space. That problem might be mitigated by getting lumberjack to truncate the file before removing it, but that still leaves this current issue.

In support of this theory, the controller I killed had rolled its log files over.

Revision history for this message

John A Meinel (jameinel) wrote on 2017-11-29:

#5

That would make sense. Not too hard to test if you leave a controller running at DEBUG or TRACE log level and fill up the log file. (You can add a lot of empty models, or deploy a lot of dummy charms to one machine if you want to speed it up a bit.)

I don't think we'd want to restart the controller when we rotate files, as that causes all connections to get closed, etc.
I'm not sure what the best answer here is. One possibility is to just keep a different log file for stdout/stderr related content, rather than mixing it with machine-X.log.

Revision history for this message

Roger Peppe (rogpeppe) wrote on 2017-11-29:

#6

That seems like a plausible solution, although it might be awkward trying to debug and not knowing where in the log file a given panic happened (note that a panic print doesn't necessarily imply a restart - the HTTP handler prints panic stack traces too AFAIR). Another possibility might be to move log rotation out of juju into another program that reads juju's and writes it to the appropriate file, rotating/compressing logs as appropriate.

Anastasia (anastasia-macmood) on 2017-11-29

Changed in juju:
status:	Incomplete → New

Revision history for this message

John A Meinel (jameinel) wrote on 2017-12-12:

#7

I think I'd rather just have the stdout/stderr always pointed to a single file (eg, /var/log/juju/jujud-machine-0-errors.log) or something to that effect.
Having a different log rotation mechanism mostly just complicates the system. (you have to somehow sighup juju itself to allow it to reopen its file handles, without also restarting all of the 1000s of units that are actively communicating with juju.)

Changed in juju:
importance:	Undecided → Medium
status:	New → Triaged

Revision history for this message

Roger Peppe (rogpeppe) wrote on 2017-12-15:

#8

> Having a different log rotation mechanism mostly just complicates the system. (you have to somehow sighup juju itself to allow it to reopen its file handles, without also restarting all of the 1000s of units that are actively communicating with juju.)

I was suggesting having juju print all output to a pipe that's processed by the log rotator. That way you don't have to restart jujud to rotate the log file, nor does juju need to know about log file rotation. A couple of possible down sides: it becomes harder for juju

Revision history for this message

Roger Peppe (rogpeppe) wrote on 2017-12-15:

#9

[sorry - it posted without me finishing for some reason]

... it becomes harder for juju to dynamically change the log rotation parameters;
it also might impose some performance penalty because we're not writing directly to disk.

I still think this is better than having stack dumps out of line from the log.

Another couple of possibilities:
- use the current scheme but redirect stderr to the current log file (using syscall.Dup2).
- send stderr to a new file each time (continue relying on juju to rotate logs) but on startup, look to see if there's a new stack trace and if so, insert a line in the log referring to the file that it's in (and keep the file). At least then you can correlate panics with log events.

Revision history for this message

Canonical Juju QA Bot (juju-qa-bot) wrote on 2022-11-03:

#10

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance:	Medium → Low
tags:	added: expirebugs-bot

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.