all-machine logging stopped, x509: certificate signed by unknown authority
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| juju-core | | High | Unassigned | |
| juju-core 1.25 | | Critical | Unassigned | |
## Bug Description
1.24.5, Landscape OpenStack Autopilot install. Not an upgrade.
Logging and the deploy itself proceeded normally through this whole thing, but the big symptom was that all-machines.log stopped receiving updates. Also, every unit log was getting an error every 5 seconds, leading to some very large logs. There are about 50 machines in total for this deployment (counting all the lxcs involved).
## The error messages in the unit logs:
2015-09-02 20:13:17 ERROR juju.worker runner.go:223 exited "rsyslog": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "juju-generated CA for environment \"rsyslog\"")
2015-09-02 20:13:20 ERROR juju.worker runner.go:223 exited "rsyslog": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "juju-generated CA for environment \"rsyslog\"")
2015-09-02 20:13:23 ERROR juju.worker runner.go:223 exited "rsyslog": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "juju-generated CA for environment \"rsyslog\"")
| David Britton (davidpbritton) wrote : | #1 |
| David Britton (davidpbritton) wrote : | #2 |
| David Britton (davidpbritton) wrote : | #3 |
| David Britton (davidpbritton) wrote : | #4 |
| David Britton (davidpbritton) wrote : | #5 |
| David Britton (davidpbritton) wrote : | #6 |
| tags: | added: logging |
| tags: | added: rsyslog |
| Changed in juju-core: | |
| status: | New → Triaged |
| importance: | Undecided → High |
| milestone: | none → 1.26.0 |
| tags: | added: sts |
| Felipe Reyes (freyes) wrote : | #7 |
Maybe a regression of bug 1474614.
| Changed in juju-core: | |
| milestone: | 1.26.0 → 1.25-beta2 |
| Changed in juju-core: | |
| assignee: | nobody → Wayne Witzel III (wwitzel3) |
| Changed in juju-core: | |
| assignee: | Wayne Witzel III (wwitzel3) → nobody |
| Changed in juju-core: | |
| importance: | High → Critical |
| milestone: | 1.25-beta2 → 1.26-alpha1 |
| Changed in juju-core: | |
| status: | Triaged → In Progress |
| Andrew Wilkins (axwalk) wrote : | #8 |
From machine-0.log:
2015-09-02 19:54:53 ERROR juju.worker runner.go:223 exited "rsyslog": failed to write rsyslog certificates: cannot write settings: EOF
I guess we're not recovering from that properly. Will dig in.
| Changed in juju-core: | |
| status: | In Progress → Triaged |
| Andrew Wilkins (axwalk) wrote : | #9 |
We do recover from that normally. It's hard to tell exactly what's going on from the logs, because they're at WARNING level. It would be most helpful if I could get access to the broken environment.
| Andrew Wilkins (axwalk) wrote : | #10 |
Only other thing of interest in the logs that I can pick out is that machine-0's logs are repeating in all-machines.log. I'm out of clues.
| Andrew Wilkins (axwalk) wrote : | #11 |
Marking incomplete as I cannot reproduce the error. If anyone is able to produce steps to trigger this, please reopen.
| Changed in juju-core: | |
| status: | Triaged → Incomplete |
| tags: | added: kanban-cross-team |
| tags: | removed: kanban-cross-team |
| Andreas Hasenack (ahasenack) wrote : | #12 |
I just got this error:
2015-10-29 20:45:22 ERROR juju.worker runner.go:223 exited "rsyslog": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "juju-generated CA for environment \"rsyslog\"")
2015-10-29 20:45:26 ERROR juju.worker runner.go:223 exited "rsyslog": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "juju-generated CA for environment \"rsyslog\"")
It's repeating every 3s in this particular unit's logs.
All clocks are fine: http://
For a while the agent was reporting this:
landscape-
That's when I logged in and spotted the rsyslog error. While I was poking around, without changing anything, the deployment eventually finished, but the error is still going on every 3s in that particular unit log.
| Cheryl Jennings (cherylj) wrote : | #13 |
Andreas - were you able to get DEBUG logs when recreating this issue?
| Andreas Hasenack (ahasenack) wrote : Re: [Bug 1491688] Re: all-machine logging stopped, x509: certificate signed by unknown authority | #14 |
No, I wasn't trying to reproduce this specifically. The environment is
still up, though. Nothing was restarted.
On Oct 29, 2015 19:11, "Cheryl Jennings" <email address hidden>
wrote:
> Andreas - were you able to get DEBUG logs when recreating this issue?
| Cheryl Jennings (cherylj) wrote : | #15 |
As per the discussion on IRC, could you grab the following from each state server:
- the machine logs
- /var/log/syslog
And the unit logs from the machines experiencing this error?
| Andreas Hasenack (ahasenack) wrote : | #16 |
-rw-r--r-- 1 root root 883 Oct 29 22:47 /etc/juju/
| Andreas Hasenack (ahasenack) wrote : | #17 |
-rw------- 1 syslog syslog 908 Oct 29 19:55 /etc/juju/
| Andreas Hasenack (ahasenack) wrote : | #18 |
-rw-r--r-- 1 root root 883 Oct 29 22:49 /etc/juju/
| Andreas Hasenack (ahasenack) wrote : | #19 |
Interestingly, the ca-cert.pem file on both the bootstrap node and the other nodes keeps getting rewritten. The contents are always the same, but the timestamp of the file changes:
root@node-6:~# stat /etc/juju/
File: ‘/etc/juju/
Size: 883 Blocks: 8 IO Block: 4096 regular file
Device: 801h/2049d Inode: 834784 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2015-10-29 23:04:55.726720392 +0000
Modify: 2015-10-29 23:04:55.726720392 +0000
Change: 2015-10-29 23:04:55.726720392 +0000
Birth: -
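A quick way to confirm this rewrite-with-identical-bytes pattern is to hash the file across two observations: if the checksum stays the same while the stat timestamps move forward, something is rewriting the same content in place. The sketch below runs against a throwaway temp file rather than the real /etc/juju path:

```shell
# Stand-in for ca-cert.pem: rewrite a file with identical bytes and
# show that only the timestamps change, not the content hash.
f=$(mktemp)
printf 'same certificate bytes\n' > "$f"
before=$(sha256sum "$f" | awk '{print $1}')
sleep 1
printf 'same certificate bytes\n' > "$f"   # the "rewrite"
after=$(sha256sum "$f" | awk '{print $1}')
stat -c 'Modify: %y' "$f"                  # mtime has moved forward
[ "$before" = "$after" ] && echo 'content unchanged'
rm -f "$f"
```

In this bug that stable hash plus a moving mtime is what pointed at a worker continually restarting and rewriting its certificate on each start attempt.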
| Andreas Hasenack (ahasenack) wrote : | #20 |
curl, wget, openssl's s_client, and even gnutls-cli all complain that the server certificate is "bad".
End part of a gnutls-cli run with -d 255:
root@node-1:~# gnutls-cli 10.245.203.1 -p 6514 --x509cafile /etc/juju/
(...)
|<3>| HSK[0x1e154b0]: CERTIFICATE was received [639 bytes]
|<6>| BUF[REC][HD]: Read 635 bytes of Data(22)
|<6>| BUF[HSK]: Peeked 218 bytes of Data
|<6>| BUF[HSK]: Emptied buffer
|<6>| BUF[HSK]: Inserted 4 bytes of Data
|<6>| BUF[HSK]: Inserted 635 bytes of Data
|<2>| ASSERT: ext_signature.c:393
|<2>| ASSERT: verify.c:306
|<2>| ASSERT: verify.c:357
|<2>| ASSERT: verify.c:595
*** Verifying server certificate failed...
|<2>| ASSERT: gnutls_kx.c:705
|<2>| ASSERT: gnutls_
|<6>| BUF[HSK]: Cleared Data from buffer
*** Fatal error: Error in the certificate.
|<4>| REC: Sending Alert[2|42] - Certificate is bad
|<4>| REC[0x1e154b0]: Sending Packet[1] Alert(21) with length: 2
|<7>| WRITE: enqueued 7 bytes for 0x4. Total 7 bytes.
|<7>| WRITE FLUSH: 7 bytes in buffer.
|<7>| WRITE: wrote 7 bytes, 0 bytes left.
|<4>| REC[0x1e154b0]: Sent Packet[2] Alert(21) with length: 7
*** Handshake has failed
GnuTLS error: Error in the certificate.
|<6>| BUF[HSK]: Cleared Data from buffer
|<4>| REC[0x1e154b0]: Epoch #0 freed
|<4>| REC[0x1e154b0]: Epoch #1 freed
openssl s_client run: http://
| Andreas Hasenack (ahasenack) wrote : | #21 |
machine-0.log from the stateserver
| Cheryl Jennings (cherylj) wrote : | #22 |
From Menno:
I've had a good look around but due to the machine-0 log level being at the default of ERROR it's hard to know what happened.
One unusual thing is the various EOF errors that caused a bunch of workers to die at 19:55:36. This was caused by the MongoDB node briefly dropping from PRIMARY to SECONDARY and back again. MongoDB drops all its connections when the replicaset status changes. I'm not sure exactly why or how a node can become SECONDARY when there's only one host in the replicaset but I've seen MongoDB do this before.
Juju's workers are designed to be able to cope with MongoDB dropping connections (it happens whenever the MongoDB master changes) but it's possible that something didn't recover properly when this happened. This is further evidenced by the fact that ca-cert.pem was being regularly updated. The rsyslog worker in machine-0 writes out ca-cert.pem when it starts up, indicating that it was continually failing and restarting. What's curious is that there was no evidence of this in machine-0.log. There should have been at least one ERROR line for each failed start attempt.
What's also weird is that the MongoDB blip didn't seem to upset as many workers as I would have expected. It only seems to have affected workers that tried to do something with the DB at the time of the blip. Specifically I would have expected to see the "state" worker restart.
I restarted jujud on machine-0 and everything recovered. The Juju agents that weren't able to talk to rsyslog were now able to, presumably because the rsyslog worker on machine-0 was no longer broken and was able to correctly configure rsyslogd.
So my best theory is that the MongoDB replicaset blip somehow broke the rsyslog worker (and possibly others too) in such a way that they couldn't recover. Without more detailed logs that's just a guess, however.
It might be worth seeing if we can trigger this situation in a dev environment by manually forcing MongoDB to flip to SECONDARY mode at just the right time. There are MongoDB commands for manipulating the replicaset like this.
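The replicaset manipulation Menno mentions can be sketched in the mongo shell. This is a sketch only, assuming an already-authenticated shell session against juju's mongod (credentials come from the agent config and are not shown here); `rs.conf()`, `rs.reconfig()`, and `rs.status()` are the stock MongoDB replica-set helpers, not juju APIs. Bumping a member's priority is a significant enough config change that even a single-node set briefly steps down from PRIMARY to SECONDARY:

```javascript
// Sketch: force the replica-set "blip" by hand from the mongo shell.
var cfg = rs.conf();
cfg.members[0].priority = 2;  // any significant config change will do
rs.reconfig(cfg);             // node drops PRIMARY -> SECONDARY -> PRIMARY
rs.status().members[0].stateStr;  // briefly "SECONDARY" around the reconfig
```

Timing the reconfig to coincide with a worker touching the DB is the hard part of reproducing the bug this way.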
| Menno Finlay-Smits (menno.smits) wrote : | #23 |
The contents of /var/log/juju and /etc/juju for all nodes in andreas' and dpb's environment.
| tags: | added: bug-squad |
| Changed in juju-core: | |
| status: | Incomplete → Triaged |
| Changed in juju-core: | |
| milestone: | 1.26-alpha1 → 1.26-alpha2 |
| no longer affects: | juju-core/1.24 |
| Nate Finch (natefinch) wrote : | #24 |
has this been reproduced on 1.25 or later?
| Changed in juju-core: | |
| milestone: | 1.26-alpha2 → 1.26-beta1 |
| Menno Finlay-Smits (menno.smits) wrote : | #25 |
I've been digging further into this. I now understand what causes the mongodb "blip". When the replicaset config is changed in any significant way, the MongoDB instance temporarily drops from PRIMARY to SECONDARY, even when there's just one node.
I could reproduce the blip by introducing a machine agent worker like this:
a.startWork
return worker.
logger.
config, err := replicaset.
if err != nil {
return err
}
n := len(config.Members)
if n != 1 {
return errors.
}
m := config.Members[0]
v := float64(2)
m.Priority = &v
config.
logger.
<
close(
return replicaset.
}), nil
})
I've been investigating all sorts of possibilities in terms of if and how the blip might have contributed to the rsyslog issue. I feel like I might be close but I'm not there yet.
| Changed in juju-core: | |
| milestone: | 1.26-beta1 → 2.0-alpha2 |
| Changed in juju-core: | |
| importance: | Critical → High |
| Cheryl Jennings (cherylj) wrote : | #26 |
As of 2.0-alpha1, juju logs to the database by default, and in 2.0-alpha2, the rsyslog worker is removed entirely. Marking as invalid for master.
This problem will still exist on 1.25.
| Changed in juju-core: | |
| milestone: | 2.0-alpha2 → 2.0-alpha3 |
| status: | Triaged → Invalid |
| Changed in juju-core: | |
| milestone: | 2.0-alpha3 → none |
| tags: | removed: sts |


Sample unit log file showing the spewing error messages.