Add note about system load to some error messages

Bug #1980115 reported by Leon
This bug affects 1 person
Affects: Canonical Juju
Status: Triaged
Importance: Wishlist
Assigned to: Unassigned

Bug Description

I'm running a load test for our observability stack. Every two hours the system comes under high load because Prometheus flushes data to disk at that interval.

As a result, Juju has the following log entries:

controller-0: 05:01:34 ERROR juju.worker.caasapplicationprovisioner.runner exited "alertmanager": Get "https://10.152.183.1:443/apis/apps/v1/namespaces/cos-lite-load-test/statefulsets/alertmanager-operator": net/http: TLS handshake timeout
controller-0: 05:01:34 ERROR juju.worker.caasapplicationprovisioner.runner exited "prometheus": Get "https://10.152.183.1:443/apis/apps/v1/namespaces/cos-lite-load-test/statefulsets/prometheus-operator": net/http: TLS handshake timeout
controller-0: 05:01:34 ERROR juju.worker.caasapplicationprovisioner.runner exited "grafana": Get "https://10.152.183.1:443/apis/apps/v1/namespaces/cos-lite-load-test/statefulsets/grafana-operator": net/http: TLS handshake timeout
controller-0: 05:01:34 ERROR juju.worker.caasapplicationprovisioner.runner exited "loki": Get "https://10.152.183.1:443/apis/apps/v1/namespaces/cos-lite-load-test/statefulsets/loki-operator": net/http: TLS handshake timeout

Followed by:

controller-0: 05:02:37 ERROR juju.worker.caasapplicationprovisioner.runner exited "alertmanager": Get "https://10.152.183.1:443/apis/apps/v1/namespaces/cos-lite-load-test/statefulsets/alertmanager-operator": dial tcp 10.152.183.1:443: connect: connection refused
controller-0: 05:02:37 ERROR juju.worker.caasapplicationprovisioner.runner exited "prometheus": Get "https://10.152.183.1:443/apis/apps/v1/namespaces/cos-lite-load-test/statefulsets/prometheus-operator": dial tcp 10.152.183.1:443: connect: connection refused
controller-0: 05:02:37 ERROR juju.worker.caasapplicationprovisioner.runner exited "loki": Get "https://10.152.183.1:443/apis/apps/v1/namespaces/cos-lite-load-test/statefulsets/loki-operator": dial tcp 10.152.183.1:443: connect: connection refused
controller-0: 05:02:37 ERROR juju.worker.caasapplicationprovisioner.runner exited "grafana": Get "https://10.152.183.1:443/apis/apps/v1/namespaces/cos-lite-load-test/statefulsets/grafana-operator": dial tcp 10.152.183.1:443: connect: connection refused

This happens exactly every two hours and seems to be the result of temporary high system load.
It could be handy if Juju included a note, e.g.: "Note: this could be because system load is such-and-such".
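
To illustrate the kind of note meant here, below is a minimal Python sketch. It is not Juju code (Juju is written in Go), and the names `load_note`, `annotate` and `HIGH_LOAD_PER_CPU`, as well as the threshold of 1.0 load per CPU, are assumptions made up for this example. It appends a hint to an error message only when the 1-minute load average per CPU is above the threshold.

```python
import os

# Hypothetical threshold: 1-minute load average per CPU above which the
# host is treated as "under high load". This is not a Juju setting.
HIGH_LOAD_PER_CPU = 1.0


def load_note(load1, cpu_count, threshold=HIGH_LOAD_PER_CPU):
    """Return a hint string if load per CPU exceeds the threshold, else ''."""
    cpu_count = max(cpu_count, 1)  # guard against a zero or negative count
    if load1 / cpu_count <= threshold:
        return ""
    return (
        "Note: this could be because system load is high "
        f"(1-min load average {load1:.2f} on {cpu_count} CPUs)"
    )


def annotate(message, load1=None, cpu_count=None):
    """Append the load hint to an error message when the host is loaded.

    When load1 or cpu_count is not given, it is read from the local
    machine (os.getloadavg is only available on Unix-like systems).
    """
    if load1 is None:
        load1 = os.getloadavg()[0]
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    note = load_note(load1, cpu_count)
    return f"{message}. {note}" if note else message


if __name__ == "__main__":
    # Example using one of the error strings from the log above.
    print(annotate("net/http: TLS handshake timeout"))
```

The load check is kept separate from the message formatting so that the threshold logic can be exercised with fixed values, without depending on the state of the machine running it.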

Ian Booth (wallyworld) wrote:

As the controller charm gains the capability to integrate with our observability stack, this sort of info is probably best surfaced as part of that work.

tags: added: observability
Changed in juju:
importance: Undecided → Wishlist
Changed in juju:
status: New → Triaged