Bug #1683075 “Rsyslog constantly restarts after upgrade to 1.25....” : Bugs : juju-core

Revision history for this message

Menno Finlay-Smits (menno.smits) wrote on 2017-04-20:

#1

I'm downloading the SOS report (taking a long time).

In the mean time can you take a look in /var/log/syslog on the controller(s) and check if rsyslog is falling over? We have had situations in 1.25 where rsyslog got into a bad state and kept crashing.

Anastasia (anastasia-macmood) on 2017-04-20

Changed in juju-core:
status:	New → Incomplete

Revision history for this message

Menno Finlay-Smits (menno.smits) wrote on 2017-04-20:

#2

I've now got the SOS report and I can see that rsyslog has corrupt data files as indicated by these messages:

Apr 15 21:15:06 jala-os-cs-2 rsyslogd-2040: fatal error on disk queue 'action 4 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Apr 15 21:15:06 jala-os-cs-2 rsyslogd-2040: fatal error on disk queue 'action 5 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Apr 15 21:15:06 jala-os-cs-2 rsyslogd-2040: fatal error on disk queue 'action 6 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Apr 15 21:15:06 jala-os-cs-2 rsyslogd-2040: fatal error on disk queue 'action 7 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Apr 15 21:15:06 jala-os-cs-2 rsyslogd-2040: fatal error on disk queue 'action 8 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Apr 15 21:15:06 jala-os-cs-2 rsyslogd-2040: fatal error on disk queue 'action 9 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Apr 15 21:15:06 jala-os-cs-2 rsyslogd-2040: fatal error on disk queue 'action 10 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Apr 15 21:15:06 jala-os-cs-2 rsyslogd-2040: fatal error on disk queue 'action 11 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Apr 15 21:15:06 jala-os-cs-2 rsyslogd-2040: fatal error on disk queue 'action 12 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Apr 15 21:15:06 jala-os-cs-2 rsyslogd-2040: fatal error on disk queue 'action 13 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]

rsyslogd is repeatedly restarting because of this.

Can you check what's in /var/spool/rsyslog? It might be worth stopping rsyslogd and moving the files in that directory out of the way to see if that resolves the issue.

Revision history for this message

Xav Paice (xavpaice) wrote on 2017-04-20:

#3

We've done that on multiple occasions, and unfortunately the errors just return. The dir mentioned contains multiple spool files plus .qi files.

Revision history for this message

Xav Paice (xavpaice) wrote on 2017-04-21:

#4

Marked back as 'confirmed' because 'incomplete' implies waiting on more info - if that's the case, could you please clarify what's needed?

Changed in juju-core:
status:	Incomplete → Confirmed

Revision history for this message

Anastasia (anastasia-macmood) wrote on 2017-04-21:

#5

It was marked Incomplete because we had a question. As you've answered it, you could flip it back to New or Confirmed.

As a follow up, I have another question based on your answer :) you said that the errors do come back after the files are cleaned. Does the failure happen immediately? And if not, is there a possibility you could setup a cron job to do the clean up periodically?

We are not likely to be able to address much in Juju 1 series due to various constraints. It'd be great if we could find a workaround here.

Changed in juju-core:
status:	Confirmed → Incomplete

Revision history for this message

Gareth Woolridge (moon127) wrote on 2017-04-21:

#6

With the errors and apparently corrupt queues in place we get rsyslogd restarts every minute.

If we clear out the queue spool dir and restart rsyslogd the errors and restart behavior return within a few minutes.

Changed in juju-core:
status:	Incomplete → Confirmed

Revision history for this message

Anastasia (anastasia-macmood) wrote on 2017-04-21:

#7

I wonder if it is pure rsyslog failure. At this stage, I am not convinced that we can do much on Juju side either way...

According to a search on similar failures in the wild, the actual cause could be a corrupt .qi file, see for example https://github.com/rsyslog/rsyslog/issues/868.

Corruption may happen when, for example, the recorded queue size is larger than the actual number of messages inside the disk queue. Juju does not explicitly control that, although there may be some configuration that you could to set for rsyslog. I suspect the rsyslog itself may need to be upgraded... Is that a possibility?

From what you describe above, it looks like that ever after removing the corrupt queue files and even on rsyslog restart, the disk queue cannot be properly initialised. I am not too sure if there is a way to process the corrupted queue files. I do wish we had a test environment where this failure could be easily re-produced to confirm the fix but I cannot see what Juju can do here.

Revision history for this message

Gareth Woolridge (moon127) wrote on 2017-04-21:

#8

As requested I also checked for any sign of rsyslog being updated on the underlying OS around the timeframe, and can find no evidence of rsyslog package updated in apt history logs across the 3 HA state servers exhibiting this behaviour.

Revision history for this message

Gareth Woolridge (moon127) wrote on 2017-04-21:

#9

rsyslog:
Installed: 7.4.4-1ubuntu2.6
Candidate: 7.4.4-1ubuntu2.6

Revision history for this message

Anastasia (anastasia-macmood) wrote on 2017-04-21:

#10

@Gareth,

So the same rsyslog version was used when this environment was running 1.25.6? And this failures just started to appear once you've upgraded to 1.25.11?

Changed in juju-core:
status:	Confirmed → Incomplete

Revision history for this message

Drew Freiberger (afreiberger) wrote on 2017-04-21:

#11

We have had this same version of rsyslog on trusty on this node since Feb 17, 2016.

Unfortunately, logs have rotated off the server from prior to the charm upgrades, so we can't tell if this issue was certainly happening or not before the charm upgrade date. I believe that the correllation with charm upgrades was made during issue triage based on when the issue started exhibiting alerts in our event history.

Revision history for this message

Drew Freiberger (afreiberger) wrote on 2017-04-21:

#12

sorry, mild confusion there. jujud upgrade, not charm upgrade.

Changed in juju-core:
status:	Incomplete → Confirmed

Revision history for this message

Anastasia (anastasia-macmood) wrote on 2017-04-26:

#13

I cannot find anything obvious related to rsyslog behavior described between 1.25.6 and 1.25.11. I had a look at milestones and the code differences.

The only thing that is remotely related to logs is bug # 1654528 but it was done to ensure that handshake takes place.

Is there a potential to patch rsyslog itself?

Changed in juju-core:
status:	Confirmed → Incomplete

Revision history for this message

Nivedita Singhvi (niveditasinghvi) wrote on 2018-02-16:

#14

Was the behavior of rsyslog as described in Bug #1683075, by any chance?
I'm assuming not, as after the resolution of this bug no other problems
were reported(?).

Just trying to confirm if *3075 is still an unresolved issue.

Revision history for this message

Nivedita Singhvi (niveditasinghvi) wrote on 2018-02-16:

#15

Sorry, ignore above comment #14, I meant to post it in the *1683076 bug.

I'm trying to determine if this bug is undiagnosed, unresolved.

Revision history for this message

Xav Paice (xavpaice) wrote on 2018-03-06:

#16

Download full text (8.4 KiB)

Another instance, no upgrades have happened recently (in fact, I'm trying to prep the site for one).

Site is running 1.25.13 on Trusty. Juju is HA on machines 0 1 and 2, and there are several other machines. Cloud is Maas 1.9.

The units residing on machine 2 (not in LXC's but on the machine itself) are in state 'failed', I have tried restarting the machine and unit agents, the machines on 0 and 1 as well, all the juju-db's, and all the rsyslog daemons.

I ran mgopurge (1.6) with all the state servers stopped.

In the logs for the unit (with the log set to TRACE) I see the following when I try to run the following:

juju run --unit ceph/1 'uptime'

2018-03-06 03:10:12 DEBUG juju.worker.uniter runlistener.go:61 RunCommands: {Commands:uptime RelationId:-1 RemoteUnitName: ForceRemoteUnit:false}
2018-03-06 03:10:12 TRACE juju.worker.uniter uniter.go:336 run commands: uptime

However the command never returns, the agents don't move away from failed status, and hooks don't run. I don't see anything in the machine log that looks related at all (can attach but there's potentially sensitive info would need scrubbing).

Also, I note there's a number of rsyslog connection attempts and frequent disconnects which could be a red herring or could be significant - e.g.
2018-03-06 03:15:08 INFO juju.worker.dependency engine.go:352 "rsyslog-config-updater" manifold worker stopped: dial tcp 10.28.16.13:6514: getsockopt: connection refused
2018-03-06 03:15:08 DEBUG juju.worker.dependency engine.go:444 restarting dependents of "rsyslog-config-updater" manifold
2018-03-06 03:15:08 INFO juju.worker.dependency engine.go:294 starting "rsyslog-config-updater" manifold worker in 3s...
2018-03-06 03:15:11 DEBUG juju.worker.dependency engine.go:302 starting "rsyslog-config-updater" manifold worker
2018-03-06 03:15:11 DEBUG juju.worker.dependency engine.go:268 "rsyslog-config-updater" manifold requested "agent" resource
2018-03-06 03:15:11 DEBUG juju.worker.dependency engine.go:268 "rsyslog-config-updater" manifold requested "api-caller" resource
2018-03-06 03:15:11 DEBUG juju.worker.rsyslog worker.go:108 starting rsyslog worker mode 1 for "unit-os-cs-1" ""
2018-03-06 03:15:11 DEBUG juju.worker.dependency engine.go:309 running "rsyslog-config-updater" manifold worker
2018-03-06 03:15:11 DEBUG juju.worker.dependency engine.go:315 registered "rsyslog-config-updater" manifold worker
2018-03-06 03:15:11 INFO juju.worker.dependency engine.go:339 "rsyslog-config-updater" manifold worker started
2018-03-06 03:15:11 DEBUG juju.worker.dependency engine.go:444 restarting dependents of "rsyslog-config-updater" manifold
2018-03-06 03:15:11 DEBUG juju.worker.rsyslog worker.go:225 making syslog connection for "juju-unit-os-cs-1" to 10.28.16.13:6514
2018-03-06 03:15:11 INFO juju.worker.dependency engine.go:352 "rsyslog-config-updater" manifold worker stopped: dial tcp 10.28.16.13:6514: getsockopt: connection refused
2018-03-06 03:15:11 DEBUG juju.worker.dependency engine.go:444 restarting dependents of "rsyslog-config-updater" manifold
2018-03-06 03:15:11 INFO juju.worker.dependency engine.go:294 starting "rsyslog-config-updater" manifold worker in 3s...
2018-03-06 03:15:12 DEBUG j...

Another instance, no upgrades have happened recently (in fact, I'm trying to prep the site for one).

Site is running 1.25.13 on Trusty.  Juju is HA on machines 0 1 and 2, and there are several other machines.  Cloud is Maas 1.9.

The units residing on machine 2 (not in LXC's but on the machine itself) are in state 'failed', I have tried restarting the machine and unit agents, the machines on 0 and 1 as well, all the juju-db's, and all the rsyslog daemons.

I ran mgopurge (1.6) with all the state servers stopped.

In the logs for the unit (with the log set to TRACE) I see the following when I try to run the following:

juju run --unit ceph/1 'uptime'

2018-03-06 03:10:12 DEBUG juju.worker.uniter runlistener.go:61 RunCommands: {Commands:uptime RelationId:-1 RemoteUnitName: ForceRemoteUnit:false}
2018-03-06 03:10:12 TRACE juju.worker.uniter uniter.go:336 run commands: uptime

However the command never returns, the agents don't move away from failed status, and hooks don't run.  I don't see anything in the machine log that looks related at all (can attach but there's potentially sensitive info would need scrubbing).

Also, I note there's a number of rsyslog connection attempts and frequent disconnects which could be a red herring or could be significant - e.g. 
2018-03-06 03:15:08 INFO juju.worker.dependency engine.go:352 "rsyslog-config-updater" manifold worker stopped: dial tcp 10.28.16.13:6514: getsockopt: connection refused
2018-03-06 03:15:08 DEBUG juju.worker.dependency engine.go:444 restarting dependents of "rsyslog-config-updater" manifold
2018-03-06 03:15:08 INFO juju.worker.dependency engine.go:294 starting "rsyslog-config-updater" manifold worker in 3s...
2018-03-06 03:15:11 DEBUG juju.worker.dependency engine.go:302 starting "rsyslog-config-updater" manifold worker
2018-03-06 03:15:11 DEBUG juju.worker.dependency engine.go:268 "rsyslog-config-updater" manifold requested "agent" resource
2018-03-06 03:15:11 DEBUG juju.worker.dependency engine.go:268 "rsyslog-config-updater" manifold requested "api-caller" resource
2018-03-06 03:15:11 DEBUG juju.worker.rsyslog worker.go:108 starting rsyslog worker mode 1 for "unit-os-cs-1" ""
2018-03-06 03:15:11 DEBUG juju.worker.dependency engine.go:309 running "rsyslog-config-updater" manifold worker
2018-03-06 03:15:11 DEBUG juju.worker.dependency engine.go:315 registered "rsyslog-config-updater" manifold worker
2018-03-06 03:15:11 INFO juju.worker.dependency engine.go:339 "rsyslog-config-updater" manifold worker started
2018-03-06 03:15:11 DEBUG juju.worker.dependency engine.go:444 restarting dependents of "rsyslog-config-updater" manifold
2018-03-06 03:15:11 DEBUG juju.worker.rsyslog worker.go:225 making syslog connection for "juju-unit-os-cs-1" to 10.28.16.13:6514
2018-03-06 03:15:11 INFO juju.worker.dependency engine.go:352 "rsyslog-config-updater" manifold worker stopped: dial tcp 10.28.16.13:6514: getsockopt: connection refused
2018-03-06 03:15:11 DEBUG juju.worker.dependency engine.go:444 restarting dependents of "rsyslog-config-updater" manifold
2018-03-06 03:15:11 INFO juju.worker.dependency engine.go:294 starting "rsyslog-config-updater" manifold worker in 3s...
2018-03-06 03:15:12 DEBUG juju.worker.leadership tracker.go:138 os-cs/1 renewing lease for os-cs leadership
2018-03-06 03:15:12 DEBUG juju.worker.leadership tracker.go:165 checking os-cs/1 for os-cs leadership
2018-03-06 03:15:13 DEBUG juju.worker.leadership tracker.go:180 os-cs/1 confirmed for os-cs leadership until 2018-03-06 03:16:12.552651545 +0000 UTC
2018-03-06 03:15:13 INFO juju.worker.leadership tracker.go:182 os-cs/1 will renew os-cs leadership at 2018-03-06 03:15:42.552651545 +0000 UTC
2018-03-06 03:15:14 DEBUG juju.worker.dependency engine.go:302 starting "rsyslog-config-updater" manifold worker
2018-03-06 03:15:14 DEBUG juju.worker.dependency engine.go:268 "rsyslog-config-updater" manifold requested "agent" resource
2018-03-06 03:15:14 DEBUG juju.worker.dependency engine.go:268 "rsyslog-config-updater" manifold requested "api-caller" resource
2018-03-06 03:15:14 DEBUG juju.worker.rsyslog worker.go:108 starting rsyslog worker mode 1 for "unit-os-cs-1" ""
2018-03-06 03:15:14 DEBUG juju.worker.dependency engine.go:309 running "rsyslog-config-updater" manifold worker
2018-03-06 03:15:14 DEBUG juju.worker.dependency engine.go:315 registered "rsyslog-config-updater" manifold worker
2018-03-06 03:15:14 INFO juju.worker.dependency engine.go:339 "rsyslog-config-updater" manifold worker started
2018-03-06 03:15:14 DEBUG juju.worker.dependency engine.go:444 restarting dependents of "rsyslog-config-updater" manifold
2018-03-06 03:15:14 DEBUG juju.worker.rsyslog worker.go:225 making syslog connection for "juju-unit-os-cs-1" to 10.28.16.13:6514
2018-03-06 03:15:14 DEBUG juju.worker.rsyslog worker.go:225 making syslog connection for "juju-unit-os-cs-1" to 10.28.2.22:6514
2018-03-06 03:15:14 DEBUG juju.worker.rsyslog worker.go:225 making syslog connection for "juju-unit-os-cs-1" to 10.28.24.13:6514
2018-03-06 03:15:14 DEBUG juju.worker.rsyslog worker.go:225 making syslog connection for "juju-unit-os-cs-1" to 10.28.6.13:6514
2018-03-06 03:15:14 DEBUG juju.worker.rsyslog worker.go:225 making syslog connection for "juju-unit-os-cs-1" to 10.28.8.13:6514
2018-03-06 03:15:14 DEBUG juju.worker.rsyslog worker.go:225 making syslog connection for "juju-unit-os-cs-1" to 10.28.2.20:6514
2018-03-06 03:15:14 DEBUG juju.worker.rsyslog worker.go:225 making syslog connection for "juju-unit-os-cs-1" to 10.28.16.12:6514

At a similar time in syslog:
Mar  6 03:15:08 hostname rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="1778857" x-info="http://www.rsyslog.com"] exiting on signal 15.
Mar  6 03:15:12 hostname rsyslogd-2040: fatal error on disk queue 'action 14 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Mar  6 03:15:12 hostname rsyslogd: [origin software="rsyslogd" swVersion="7.4.4" x-pid="1788531" x-info="http://www.rsyslog.com"] start
Mar  6 03:15:12 hostname rsyslogd-2307: warning: ~ action is deprecated, consider using the 'stop' statement instead [try http://www.rsyslog.com/e/2307 ]
Mar  6 03:15:12 hostname rsyslogd-2221: module 'imuxsock' already in this config, cannot be added
 [try http://www.rsyslog.com/e/2221 ]
Mar  6 03:15:12 hostname rsyslogd: rsyslogd's groupid changed to 104 
Mar  6 03:15:12 hostname rsyslogd: rsyslogd's userid changed to 101 
Mar  6 03:15:12 hostname rsyslogd-2040: fatal error on disk queue 'action 4 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Mar  6 03:15:12 hostname rsyslogd-2040: fatal error on disk queue 'action 5 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Mar  6 03:15:12 hostname rsyslogd-2040: fatal error on disk queue 'action 6 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Mar  6 03:15:12 hostname rsyslogd-2040: fatal error on disk queue 'action 7 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Mar  6 03:15:12 hostname rsyslogd-2040: fatal error on disk queue 'action 8 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Mar  6 03:15:12 hostname rsyslogd-2040: fatal error on disk queue 'action 9 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Mar  6 03:15:12 hostname rsyslogd-2040: fatal error on disk queue 'action 11 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Mar  6 03:15:12 hostname rsyslogd-2040: fatal error on disk queue 'action 10 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Mar  6 03:15:12 hostname rsyslogd-2040: fatal error on disk queue 'action 12 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Mar  6 03:15:12 hostname rsyslogd-2040: fatal error on disk queue 'action 13 queue[DA]', emergency switch to direct mode [try http://www.rsyslog.com/e/2040 ]
Mar  6 03:15:15 hostname rsyslogd-2083: gnutls returned error on handshake: A TLS warning alert has been received.
 [try http://www.rsyslog.com/e/2083 ]
Mar  6 03:15:22 hostname rsyslogd-2027: imfile: could not persist state file machine-2 - data may be repeated on next startup. Is WorkDirectory set? [try http://www.rsyslog.com/e/2027 ]

I tried clearing out the rsyslog config from /etc/rsyslog.d/25-juju.conf, emptying out /var/spool/rsyslog to clean out any broken files (with rsyslog stopped), and restarting the machine agent, but the .qi etc files all came back immediately as did these errors.

Xav Paice (xavpaice) on 2018-03-06

Changed in juju-core:
status:	Incomplete → Confirmed

Revision history for this message

Nivedita Singhvi (niveditasinghvi) wrote on 2018-03-08:

#17

Can you test with the latest version of rsyslog which has a fix for a similar issue as noted in Comment #7?

https://github.com/rsyslog/rsyslog/issues/868

Fix went in to v8.18 release of rsyslog:

https://github.com/rsyslog/rsyslog/issues/921

Note that this is a specific and limited fix. It did not work for at least one other manifestation of such symptoms (Fatal error on disk queue, queue not found).

juju-core

Rsyslog constantly restarts after upgrade to 1.25.11

Bug Description

Other bug subscribers

Remote bug watches