Rsyslog constantly restarts after upgrade to 1.25.11

Bug #1683075 reported by Jill Rouleau on 2017-04-15
40
This bug affects 9 people
Affects Status Importance Assigned to Milestone
juju-core
Undecided
Unassigned

Bug Description

Trusty, Mitaka, Juju 1.25.11

Juju is restarting rsyslog as often as every 1 minute. This seems to have started approximately when juju was upgraded from 1.25.10 to .11, on 2017-03-31.

When we first began investigating, units were falling over with the message https://pastebin.canonical.com/185611/ - looks very much like LP1662272 in which it is clearly suggested to mgopurge the db. We ran mgopuge, results looked happy https://pastebin.canonical.com/185614/
All units returned from 'lost' or 'failed' to 'idle'.

syslog shows frequent queue errors, queue and queue index files are left on disks and never processed. Contents of queue files at times at least appear uncorrupted, afaict. Clearing the spool only temporarily alleviates the problem.

sosreports are uploading to https://private-fileshare.canonical.com/~jillr/sf00137831/

I'm downloading the SOS report (taking a long time).

In the mean time can you take a look in /var/log/syslog on the controller(s) and check if rsyslog is falling over? We have had situations in 1.25 where rsyslog got into a bad state and kept crashing.

Changed in juju-core:
status: New → Incomplete

I've now got the SOS report and I can see that rsyslog has corrupt data files as indicated by these messages:

Apr 15 21:15:06 jala-os-cs-2 rsyslogd-2040: fatal error on disk queue 'action 4 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Apr 15 21:15:06 jala-os-cs-2 rsyslogd-2040: fatal error on disk queue 'action 5 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Apr 15 21:15:06 jala-os-cs-2 rsyslogd-2040: fatal error on disk queue 'action 6 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Apr 15 21:15:06 jala-os-cs-2 rsyslogd-2040: fatal error on disk queue 'action 7 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Apr 15 21:15:06 jala-os-cs-2 rsyslogd-2040: fatal error on disk queue 'action 8 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Apr 15 21:15:06 jala-os-cs-2 rsyslogd-2040: fatal error on disk queue 'action 9 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Apr 15 21:15:06 jala-os-cs-2 rsyslogd-2040: fatal error on disk queue 'action 10 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Apr 15 21:15:06 jala-os-cs-2 rsyslogd-2040: fatal error on disk queue 'action 11 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Apr 15 21:15:06 jala-os-cs-2 rsyslogd-2040: fatal error on disk queue 'action 12 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Apr 15 21:15:06 jala-os-cs-2 rsyslogd-2040: fatal error on disk queue 'action 13 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]

rsyslogd is repeatedly restarting because of this.

Can you check what's in /var/spool/rsyslog? It might be worth stopping rsyslogd and moving the files in that directory out of the way to see if that resolves the issue.

Xav Paice (xavpaice) wrote :

We've done that on multiple occasions, and unfortunately the errors just return. The dir mentioned contains multiple spool files plus .qi files.

Xav Paice (xavpaice) wrote :

Marked back as 'confirmed' because 'incomplete' implies waiting on more info - if that's the case, could you please clarify what's needed?

Changed in juju-core:
status: Incomplete → Confirmed
Anastasia (anastasia-macmood) wrote :

It was marked Incomplete because we had a question. As you've answered it, you could flip it back to New or Confirmed.

As a follow up, I have another question based on your answer :) you said that the errors do come back after the files are cleaned. Does the failure happen immediately? And if not, is there a possibility you could setup a cron job to do the clean up periodically?

We are not likely to be able to address much in Juju 1 series due to various constraints. It'd be great if we could find a workaround here.

Changed in juju-core:
status: Confirmed → Incomplete
Gareth Woolridge (moon127) wrote :

With the errors and apparently corrupt queues in place we get rsyslogd restarts every minute.

If we clear out the queue spool dir and restart rsyslogd the errors and restart behavior return within a few minutes.

Changed in juju-core:
status: Incomplete → Confirmed
Anastasia (anastasia-macmood) wrote :

I wonder if it is pure rsyslog failure. At this stage, I am not convinced that we can do much on Juju side either way...

According to a search on similar failures in the wild, the actual cause could be a corrupt .qi file, see for example https://github.com/rsyslog/rsyslog/issues/868.

Corruption may happen when, for example, the recorded queue size is larger than the actual number of messages inside the disk queue. Juju does not explicitly control that, although there may be some configuration that you could to set for rsyslog. I suspect the rsyslog itself may need to be upgraded... Is that a possibility?

From what you describe above, it looks like that ever after removing the corrupt queue files and even on rsyslog restart, the disk queue cannot be properly initialised. I am not too sure if there is a way to process the corrupted queue files. I do wish we had a test environment where this failure could be easily re-produced to confirm the fix but I cannot see what Juju can do here.

Gareth Woolridge (moon127) wrote :

As requested I also checked for any sign of rsyslog being updated on the underlying OS around the timeframe, and can find no evidence of rsyslog package updated in apt history logs across the 3 HA state servers exhibiting this behaviour.

Gareth Woolridge (moon127) wrote :

rsyslog:
  Installed: 7.4.4-1ubuntu2.6
  Candidate: 7.4.4-1ubuntu2.6

Anastasia (anastasia-macmood) wrote :

@Gareth,

So the same rsyslog version was used when this environment was running 1.25.6? And this failures just started to appear once you've upgraded to 1.25.11?

Changed in juju-core:
status: Confirmed → Incomplete
Drew Freiberger (afreiberger) wrote :

We have had this same version of rsyslog on trusty on this node since Feb 17, 2016.

Unfortunately, logs have rotated off the server from prior to the charm upgrades, so we can't tell if this issue was certainly happening or not before the charm upgrade date. I believe that the correllation with charm upgrades was made during issue triage based on when the issue started exhibiting alerts in our event history.

Drew Freiberger (afreiberger) wrote :

sorry, mild confusion there. jujud upgrade, not charm upgrade.

Changed in juju-core:
status: Incomplete → Confirmed
Anastasia (anastasia-macmood) wrote :

I cannot find anything obvious related to rsyslog behavior described between 1.25.6 and 1.25.11. I had a look at milestones and the code differences.

The only thing that is remotely related to logs is bug # 1654528 but it was done to ensure that handshake takes place.

Is there a potential to patch rsyslog itself?

Changed in juju-core:
status: Confirmed → Incomplete

Was the behavior of rsyslog as described in Bug #1683075, by any chance?
I'm assuming not, as after the resolution of this bug no other problems
were reported(?).

Just trying to confirm if *3075 is still an unresolved issue.

Sorry, ignore above comment #14, I meant to post it in the *1683076 bug.

I'm trying to determine if this bug is undiagnosed, unresolved.

Xav Paice (xavpaice) wrote :
Download full text (8.4 KiB)

Another instance, no upgrades have happened recently (in fact, I'm trying to prep the site for one).

Site is running 1.25.13 on Trusty. Juju is HA on machines 0 1 and 2, and there are several other machines. Cloud is Maas 1.9.

The units residing on machine 2 (not in LXC's but on the machine itself) are in state 'failed', I have tried restarting the machine and unit agents, the machines on 0 and 1 as well, all the juju-db's, and all the rsyslog daemons.

I ran mgopurge (1.6) with all the state servers stopped.

In the logs for the unit (with the log set to TRACE) I see the following when I try to run the following:

juju run --unit ceph/1 'uptime'

2018-03-06 03:10:12 DEBUG juju.worker.uniter runlistener.go:61 RunCommands: {Commands:uptime RelationId:-1 RemoteUnitName: ForceRemoteUnit:false}
2018-03-06 03:10:12 TRACE juju.worker.uniter uniter.go:336 run commands: uptime

However the command never returns, the agents don't move away from failed status, and hooks don't run. I don't see anything in the machine log that looks related at all (can attach but there's potentially sensitive info would need scrubbing).

Also, I note there's a number of rsyslog connection attempts and frequent disconnects which could be a red herring or could be significant - e.g.
2018-03-06 03:15:08 INFO juju.worker.dependency engine.go:352 "rsyslog-config-updater" manifold worker stopped: dial tcp 10.28.16.13:6514: getsockopt: connection refused
2018-03-06 03:15:08 DEBUG juju.worker.dependency engine.go:444 restarting dependents of "rsyslog-config-updater" manifold
2018-03-06 03:15:08 INFO juju.worker.dependency engine.go:294 starting "rsyslog-config-updater" manifold worker in 3s...
2018-03-06 03:15:11 DEBUG juju.worker.dependency engine.go:302 starting "rsyslog-config-updater" manifold worker
2018-03-06 03:15:11 DEBUG juju.worker.dependency engine.go:268 "rsyslog-config-updater" manifold requested "agent" resource
2018-03-06 03:15:11 DEBUG juju.worker.dependency engine.go:268 "rsyslog-config-updater" manifold requested "api-caller" resource
2018-03-06 03:15:11 DEBUG juju.worker.rsyslog worker.go:108 starting rsyslog worker mode 1 for "unit-os-cs-1" ""
2018-03-06 03:15:11 DEBUG juju.worker.dependency engine.go:309 running "rsyslog-config-updater" manifold worker
2018-03-06 03:15:11 DEBUG juju.worker.dependency engine.go:315 registered "rsyslog-config-updater" manifold worker
2018-03-06 03:15:11 INFO juju.worker.dependency engine.go:339 "rsyslog-config-updater" manifold worker started
2018-03-06 03:15:11 DEBUG juju.worker.dependency engine.go:444 restarting dependents of "rsyslog-config-updater" manifold
2018-03-06 03:15:11 DEBUG juju.worker.rsyslog worker.go:225 making syslog connection for "juju-unit-os-cs-1" to 10.28.16.13:6514
2018-03-06 03:15:11 INFO juju.worker.dependency engine.go:352 "rsyslog-config-updater" manifold worker stopped: dial tcp 10.28.16.13:6514: getsockopt: connection refused
2018-03-06 03:15:11 DEBUG juju.worker.dependency engine.go:444 restarting dependents of "rsyslog-config-updater" manifold
2018-03-06 03:15:11 INFO juju.worker.dependency engine.go:294 starting "rsyslog-config-updater" manifold worker in 3s...
2018-03-06 03:15:12 DEBUG j...

Read more...

Xav Paice (xavpaice) on 2018-03-06
Changed in juju-core:
status: Incomplete → Confirmed

Can you test with the latest version of rsyslog which has a fix for a similar issue as noted in Comment #7?

https://github.com/rsyslog/rsyslog/issues/868

Fix went in to v8.18 release of rsyslog:

https://github.com/rsyslog/rsyslog/issues/921

Note that this is a specific and limited fix. It did not work for at least one other manifestation of such symptoms (Fatal error on disk queue, queue not found).

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.