all-machine logging stopped, x509: certificate signed by unknown authority

Bug #1491688 reported by David Britton
32
This bug affects 6 people
Affects Status Importance Assigned to Milestone
juju-core
Invalid
High
Unassigned
1.25
Triaged
Critical
Unassigned

Bug Description

1.24.5, landscape openstack autopilot install. Not an upgrade.

Logging, and the deploy itself proceeded normally through this whole thing, but the big symptom was all-machines.log stopped receiving updates. Also, every log was getting an error every 5 seconds, leading to some very large logs. There are about 50 machines in total for this deployment (with all the lxcs involved).

## The error messages in the unit logs:

2015-09-02 20:13:17 ERROR juju.worker runner.go:223 exited "rsyslog": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "juju-generated CA for environment \"rsyslog\"")
2015-09-02 20:13:20 ERROR juju.worker runner.go:223 exited "rsyslog": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "juju-generated CA for environment \"rsyslog\"")
2015-09-02 20:13:23 ERROR juju.worker runner.go:223 exited "rsyslog": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "juju-generated CA for environment \"rsyslog\"")

Revision history for this message
David Britton (dpb) wrote :
Revision history for this message
David Britton (dpb) wrote :
Revision history for this message
David Britton (dpb) wrote :
Revision history for this message
David Britton (dpb) wrote :

Sample unit log file showing the spewing error messages.

Revision history for this message
David Britton (dpb) wrote :
Revision history for this message
David Britton (dpb) wrote :
Curtis Hovey (sinzui)
tags: added: logging
tags: added: rsyslog
Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.26.0
tags: added: sts
Revision history for this message
Felipe Reyes (freyes) wrote :

Maybe regression of bug 1474614

Changed in juju-core:
milestone: 1.26.0 → 1.25-beta2
Changed in juju-core:
assignee: nobody → Wayne Witzel III (wwitzel3)
Changed in juju-core:
assignee: Wayne Witzel III (wwitzel3) → nobody
Ian Booth (wallyworld)
Changed in juju-core:
importance: High → Critical
milestone: 1.25-beta2 → 1.26-alpha1
Andrew Wilkins (axwalk)
Changed in juju-core:
status: Triaged → In Progress
Revision history for this message
Andrew Wilkins (axwalk) wrote :

From machine-0.log:
2015-09-02 19:54:53 ERROR juju.worker runner.go:223 exited "rsyslog": failed to write rsyslog certificates: cannot write settings: EOF

I guess we're not recovering from that properly. Will dig in.

Changed in juju-core:
status: In Progress → Triaged
Revision history for this message
Andrew Wilkins (axwalk) wrote :

We do recover from that normally. It's hard to tell exactly what's going on from the logs, because they're at WARNING level. It would be most helpful if I can get access to the broken environment.

Revision history for this message
Andrew Wilkins (axwalk) wrote :

Only other thing of interest in the logs that I can pick out is that machine-0's logs are repeating in all-machines.log. I'm out of clues.

Revision history for this message
Andrew Wilkins (axwalk) wrote :

Marking incomplete as I cannot reproduce the error. If anyone is able to produce steps to trigger this, please reopen.

Changed in juju-core:
status: Triaged → Incomplete
tags: added: kanban-cross-team
tags: removed: kanban-cross-team
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I just got this error:
2015-10-29 20:45:22 ERROR juju.worker runner.go:223 exited "rsyslog": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "juju-generated CA for environment \"rsyslog\"")
2015-10-29 20:45:26 ERROR juju.worker runner.go:223 exited "rsyslog": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "juju-generated CA for environment \"rsyslog\"")

It's repeating every 3s in this particular unit's logs.

All clocks are fine: http://pastebin.ubuntu.com/13002468/

For a wile the agent was reporting this:
  landscape-client/25 unknown allocating 1.24.6 10.245.201.10 Waiting for agent initialization to finish

That's when I logged in and spotted the rsyslog error. While I was poking around, without changing anything, the deployment eventually finished, but the error is still going on every 3s in that particular unit log.

Revision history for this message
Cheryl Jennings (cherylj) wrote :

Andreas - were you able to get DEBUG logs when recreating this issue?

Revision history for this message
Andreas Hasenack (ahasenack) wrote : Re: [Bug 1491688] Re: all-machine logging stopped, x509: certificate signed by unknown authority

No, I wasn't trying to reproduce this specifically. The environment is
still up, though. Nothing was restarted.
On Oct 29, 2015 19:11, "Cheryl Jennings" <email address hidden>
wrote:

> Andreas - were you able to get DEBUG logs when recreating this issue?
>
> --
> You received this bug notification because you are a member of
> Landscape, which is subscribed to the bug report.
> https://bugs.launchpad.net/bugs/1491688
>
> Title:
> all-machine logging stopped, x509: certificate signed by unknown
> authority
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/juju-core/+bug/1491688/+subscriptions
>

Revision history for this message
Cheryl Jennings (cherylj) wrote :

As per the discussion on IRC, could you grab the following from each state server:
- the machine logs
- /var/log/syslog

And the unit logs from the machines experiencing this error?

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

-rw-r--r-- 1 root root 883 Oct 29 22:47 /etc/juju/rsyslog/ca-cert.pem

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

-rw------- 1 syslog syslog 908 Oct 29 19:55 /etc/juju/rsyslog/rsyslog-cert.pem

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

-rw-r--r-- 1 root root 883 Oct 29 22:49 /etc/juju/rsyslog/ca-cert.pem

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Interestingly, the ca-cert.pem file on both the bootstrap node and the other nodes keeps getting rewritten. The contents are always the same, but the timestamp of the file changes:

root@node-6:~# stat /etc/juju/rsyslog/ca-cert.pem
  File: ‘/etc/juju/rsyslog/ca-cert.pem’
  Size: 883 Blocks: 8 IO Block: 4096 regular file
Device: 801h/2049d Inode: 834784 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2015-10-29 23:04:55.726720392 +0000
Modify: 2015-10-29 23:04:55.726720392 +0000
Change: 2015-10-29 23:04:55.726720392 +0000
 Birth: -

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Using curl, or wget, or openssl's s_client or even gnutls-cli, all complain that the server certificate is "bad".

End part of a gnutls-cli run with -d 255:
root@node-1:~# gnutls-cli 10.245.203.1 -p 6514 --x509cafile /etc/juju/rsyslog/ca-cert.pem -d 255
(...)
|<3>| HSK[0x1e154b0]: CERTIFICATE was received [639 bytes]
|<6>| BUF[REC][HD]: Read 635 bytes of Data(22)
|<6>| BUF[HSK]: Peeked 218 bytes of Data
|<6>| BUF[HSK]: Emptied buffer
|<6>| BUF[HSK]: Inserted 4 bytes of Data
|<6>| BUF[HSK]: Inserted 635 bytes of Data
|<2>| ASSERT: ext_signature.c:393
|<2>| ASSERT: verify.c:306
|<2>| ASSERT: verify.c:357
|<2>| ASSERT: verify.c:595
*** Verifying server certificate failed...
|<2>| ASSERT: gnutls_kx.c:705
|<2>| ASSERT: gnutls_handshake.c:2777
|<6>| BUF[HSK]: Cleared Data from buffer
*** Fatal error: Error in the certificate.
|<4>| REC: Sending Alert[2|42] - Certificate is bad
|<4>| REC[0x1e154b0]: Sending Packet[1] Alert(21) with length: 2
|<7>| WRITE: enqueued 7 bytes for 0x4. Total 7 bytes.
|<7>| WRITE FLUSH: 7 bytes in buffer.
|<7>| WRITE: wrote 7 bytes, 0 bytes left.
|<4>| REC[0x1e154b0]: Sent Packet[2] Alert(21) with length: 7
*** Handshake has failed
GnuTLS error: Error in the certificate.
|<6>| BUF[HSK]: Cleared Data from buffer
|<4>| REC[0x1e154b0]: Epoch #0 freed
|<4>| REC[0x1e154b0]: Epoch #1 freed

openssl s_client run: http://pastebin.ubuntu.com/13003972/

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

machine-0.log from the stateserver

Revision history for this message
Cheryl Jennings (cherylj) wrote :

From Menno:

I've had a good look around but due to the machine-0 log level being at the default of ERROR it's hard to know what happened.

One unusual thing is the various EOF errors that caused a bunch of workers to die at 19:55:36. This was caused by the MongoDB node briefly dropping from PRIMARY to SECONDARY and back again. MongoDB drops all its connections when the replicaset status changes. I'm not sure exactly why or how a node can become SECONDARY when there's only one host in the replicaset but I've seen MongoDB do this before.

Juju's workers are designed to able to cope with MongoDB dropping connections (it happens whenever the MonogDB master changes) but it's possible that something didn't recover properly when this happened. This is further evidenced by the fact that ca-cert.pem was being regularly updated. The rsyslog worker in machine-0 writes out ca-cert.pem when it starts up indicating that it was continually failing and restarting. What's curious is that there was no evidence of this in machine-0.log. There should have been at least one ERROR line for each failed start attempt.

What's also weird is that the MongoDB blip didn't seem to upset as many workers as I would have expected. It only seems to have affected workers that tried to do something with the DB at the time of the blip. Specifically I would have expected to see the "state" worker restart.

I restarted jujud on machine-0 and everything recovered. The Juju agents that weren't able to talk to rsyslog were now able to, presumably because the rsyslog worker on machine-0 was no longer broken and was able to correctly configure rsyslogd.

So my best theory is that the MongoDB replicaset blip, somehow broke the rsyslog worker (and possibly others too) in such a way that they couldn't recover. Without more detailed logs that's just a guess however.

It might be worth seeing if we can trigger this situation in a dev environment by manually forcing MongoDB to flip to SECONDARY mode at just the right time. There are MongoDB commands for manipulating the replicaset like this.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

The contents of /var/log/juju and /etc/juju for all nodes in andreas' and dpb's environment.

tags: added: bug-squad
Changed in juju-core:
status: Incomplete → Triaged
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.26-alpha1 → 1.26-alpha2
no longer affects: juju-core/1.24
Revision history for this message
Nate Finch (natefinch) wrote :

has this been reproduced on 1.25 or later?

Changed in juju-core:
milestone: 1.26-alpha2 → 1.26-beta1
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

I've been digging further into this. I now understand what causes the mongodb "blip". When the replicaset config is changed in any significant way, the MongoDB instance temporarily drops from PRIMARY to SECONDARY, even when there's just one node.

I could reproduce the blip by introducing a machine agent worker like this:

    a.startWorkerAfterUpgrade(runner, "breaker", func() (worker.Worker, error) {
     return worker.NewSimpleWorker(func(<-chan struct{}) error {
      logger.Infof("breaker: getting current replicaset config")
      config, err := replicaset.CurrentConfig(st.MongoSession())
      if err != nil {
       return err
      }
      n := len(config.Members)
      if n != 1 {
       return errors.Errorf("breaker: expected 1 member, saw %d", n)
      }
      m := config.Members[0]
      v := float64(2)
      m.Priority = &v
      config.Members[0] = m
      logger.Infof("breaker: setting new replicaset config")

      <-rsyslog.BreakCh
      close(rsyslog.BreakCh)

      return replicaset.Set(st.MongoSession(), config.Members)
     }), nil
    })

I've been investigating all sorts of possibilities in terms of if and how the blip might have contributed to the rsyslog issue. I feel like I might be close but I'm not there yet.

Changed in juju-core:
milestone: 1.26-beta1 → 2.0-alpha2
Curtis Hovey (sinzui)
Changed in juju-core:
importance: Critical → High
Revision history for this message
Cheryl Jennings (cherylj) wrote :

As of 2.0-alpha1, juju logs to the database by default, and in 2.0-alpha2, the rsyslog worker is removed entirely. Marking as invalid for master.

This problem will still exist on 1.25.

Changed in juju-core:
milestone: 2.0-alpha2 → 2.0-alpha3
status: Triaged → Invalid
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-alpha3 → none
tags: removed: sts
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.