telegraf restarted unnecessarily in update-status hook

Bug #1686638 reported by Adam Collard
This bug affects 4 people

Affects: Telegraf Charm
Status: Fix Released
Importance: High
Assigned to: Xav Paice

Bug Description

The update-status hook (when related to certain other charms, e.g. haproxy) causes the configuration file to be rewritten in a non-idempotent way, which in turn triggers a restart of the telegraf service (see https://pastebin.canonical.com/186853/ for an example Juju unit log).
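A non-idempotent rewrite can be avoided with a content-comparison guard, so the service is only restarted when the rendered config actually differs. A minimal sketch of that pattern (the helper name and structure are hypothetical, not the charm's actual code):

```python
import hashlib
from pathlib import Path


def write_if_changed(path: str, content: str) -> bool:
    """Write `content` to `path` only when it differs from what is on disk.

    Returns True when the file was rewritten (i.e. a restart is warranted),
    False when the existing file already matches and the restart can be
    skipped. Illustrative helper only.
    """
    target = Path(path)
    new_digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if target.exists():
        old_digest = hashlib.sha256(target.read_bytes()).hexdigest()
        if old_digest == new_digest:
            return False  # rendered config is unchanged: skip the restart
    target.write_text(content)
    return True
```

With a guard like this, an update-status run that renders byte-identical output leaves the service alone; only a genuine config change triggers a reload.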

Revision history for this message
Paul Gear (paulgear) wrote :

In addition, the telegraf charm logs considerable noise on every update-status hook run: https://pastebin.canonical.com/199361/

Changed in telegraf-charm:
assignee: Adam Collard (adam-collard) → nobody
status: In Progress → Confirmed
Changed in charm-telegraf:
assignee: nobody → Peter Sabaini (peter-sabaini)
Revision history for this message
Peter Sabaini (peter-sabaini) wrote :

Hey, I wasn't able to reproduce the restart on update-status. I realize this is quite an old issue, but would you have any more details around this issue? Ideally a bundle and full logs, or maybe even an sosreport. Thanks!

Changed in charm-telegraf:
status: Confirmed → Incomplete
Revision history for this message
Drew Freiberger (afreiberger) wrote :

@peter-sabaini

I've actually found that the update_status hook in the latest master continually reports data_changed = True, which clears the flags and causes a full reconfigure and restart.

It appears to me that the data returned from get_base_inputs() is not an ordered dictionary, so its sha256 digest changes each time the hook runs. The data needs to be passed through some sort of filter that sorts the keys throughout the returned dictionary before it is run through data_changed.
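The failure mode is easy to demonstrate: if the serialized form of the data depends on key or item ordering, the digest differs between hook runs even when the content is identical. A minimal sketch (the digest helper is illustrative, not the charm's data_changed implementation):

```python
import hashlib
import json


def digest(data, sort_keys=False):
    """Hash a JSON-serialized structure, as a change-detection helper might."""
    payload = json.dumps(data, sort_keys=sort_keys).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Two logically identical configs whose keys arrive in different orders,
# e.g. because they were assembled from relation data on different hook runs.
run_a = {"inputs": {"cpu": {}, "mem": {}}, "interval": "10s"}
run_b = {"interval": "10s", "inputs": {"mem": {}, "cpu": {}}}

# Without sorting, the serialized form (and hence the hash) depends on
# insertion order, so change detection fires even though nothing changed.
assert digest(run_a) != digest(run_b)

# Sorting keys during serialization makes the hash order-independent.
assert digest(run_a, sort_keys=True) == digest(run_b, sort_keys=True)
```

Serializing with sorted keys (recursively, for every nested dictionary) before hashing makes the comparison stable across hook runs.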

Here's a hint: you can find this happening in the log files of any telegraf unit:

egrep 'cleared flag telegraf.configured|update_status' unit-telegraf-13.log |tail -200

tracer: ++ queue handler reactive/telegraf.py:1388:update_status
2020-09-12 00:57:43 INFO juju-log Invoking reactive handler: reactive/telegraf.py:1388:update_status
2020-09-12 00:57:43 DEBUG juju-log tracer: cleared flag telegraf.configured
tracer: ++ queue handler reactive/telegraf.py:1388:update_status
2020-09-12 01:02:02 INFO juju-log Invoking reactive handler: reactive/telegraf.py:1388:update_status
2020-09-12 01:02:02 DEBUG juju-log tracer: cleared flag telegraf.configured
tracer: ++ queue handler reactive/telegraf.py:1388:update_status
2020-09-12 01:07:19 INFO juju-log Invoking reactive handler: reactive/telegraf.py:1388:update_status
2020-09-12 01:07:19 DEBUG juju-log tracer: cleared flag telegraf.configured
tracer: ++ queue handler reactive/telegraf.py:1388:update_status
2020-09-12 01:12:49 INFO juju-log Invoking reactive handler: reactive/telegraf.py:1388:update_status
2020-09-12 01:12:49 DEBUG juju-log tracer: cleared flag telegraf.configured
tracer: ++ queue handler reactive/telegraf.py:1388:update_status
2020-09-12 01:17:29 INFO juju-log Invoking reactive handler: reactive/telegraf.py:1388:update_status
2020-09-12 01:17:29 DEBUG juju-log tracer: cleared flag telegraf.configured
tracer: ++ queue handler reactive/telegraf.py:1388:update_status
2020-09-12 01:22:34 INFO juju-log Invoking reactive handler: reactive/telegraf.py:1388:update_status
2020-09-12 01:22:34 DEBUG juju-log tracer: cleared flag telegraf.configured
tracer: ++ queue handler reactive/telegraf.py:1388:update_status
2020-09-12 01:26:48 INFO juju-log Invoking reactive handler: reactive/telegraf.py:1388:update_status
...

Specifically cs:telegraf-38 and master branch at commit ee1a92f9c0787413dea69a9a8c0d08a111a2d4be

I'm seeing debug logs for telegraf units hitting 124MB in about 1 month due to this.

Changed in charm-telegraf:
status: Incomplete → Confirmed
Revision history for this message
Vladimir Grevtsev (vlgrevtsev) wrote :

Can confirm, this is still happening:

ubuntu@prometheus-3:~$ grep -RiP 'Hang' /var/log/telegraf
/var/log/telegraf/telegraf.log:2020-09-22T19:34:11Z I! [agent] Hang on, flushing any cached metrics before shutdown
/var/log/telegraf/telegraf.log:2020-09-22T19:38:51Z I! [agent] Hang on, flushing any cached metrics before shutdown
/var/log/telegraf/telegraf.log:2020-09-22T19:44:41Z I! [agent] Hang on, flushing any cached metrics before shutdown
/var/log/telegraf/telegraf.log:2020-09-22T19:49:18Z I! [agent] Hang on, flushing any cached metrics before shutdown
/var/log/telegraf/telegraf.log:2020-09-22T19:54:40Z I! [agent] Hang on, flushing any cached metrics before shutdown
<250 more log entries are omitted>

<extract from telegraf.log>
2020-09-23T15:38:34Z I! Reloading Telegraf config
2020-09-23T15:38:34Z I! [agent] Hang on, flushing any cached metrics before shutdown
2020-09-23T15:38:34Z I! Starting Telegraf 1.13.3

Revision history for this message
Nikolay Vinogradov (nikolay.vinogradov) wrote :

Seeing this bug in a current bionic/ussuri deployment; it is still present:

ubuntu@landscapeha-1:~$ service telegraf status
● telegraf.service - The plugin-driven server agent for reporting metrics into InfluxDB
   Loaded: loaded (/lib/systemd/system/telegraf.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2020-09-22 16:54:03 UTC; 22h ago
     Docs: https://github.com/influxdata/telegraf
 Main PID: 1882 (telegraf)
    Tasks: 17 (limit: 3505)
   CGroup: /system.slice/telegraf.service
           └─1882 /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d

Sep 23 15:09:45 landscapeha-1 systemd[1]: Reloading The plugin-driven server agent for reporting metrics into InfluxDB.
Sep 23 15:09:45 landscapeha-1 systemd[1]: Reloaded The plugin-driven server agent for reporting metrics into InfluxDB.
Sep 23 15:14:16 landscapeha-1 systemd[1]: Reloading The plugin-driven server agent for reporting metrics into InfluxDB.
Sep 23 15:14:16 landscapeha-1 systemd[1]: Reloaded The plugin-driven server agent for reporting metrics into InfluxDB.
Sep 23 15:18:24 landscapeha-1 systemd[1]: Reloading The plugin-driven server agent for reporting metrics into InfluxDB.
Sep 23 15:18:24 landscapeha-1 systemd[1]: Reloaded The plugin-driven server agent for reporting metrics into InfluxDB.
Sep 23 15:23:00 landscapeha-1 systemd[1]: Reloading The plugin-driven server agent for reporting metrics into InfluxDB.
Sep 23 15:23:00 landscapeha-1 systemd[1]: Reloaded The plugin-driven server agent for reporting metrics into InfluxDB.
Sep 23 15:27:13 landscapeha-1 systemd[1]: Reloading The plugin-driven server agent for reporting metrics into InfluxDB.
Sep 23 15:27:13 landscapeha-1 systemd[1]: Reloaded The plugin-driven server agent for reporting metrics into InfluxDB.

ubuntu@landscapeha-1:~$ grep 'Hang on, flushing any cached metrics before shutdown' /var/log/telegraf/telegraf.log | tail
2020-09-23T14:44:42Z I! [agent] Hang on, flushing any cached metrics before shutdown
2020-09-23T14:49:27Z I! [agent] Hang on, flushing any cached metrics before shutdown
2020-09-23T14:53:56Z I! [agent] Hang on, flushing any cached metrics before shutdown
2020-09-23T14:59:13Z I! [agent] Hang on, flushing any cached metrics before shutdown
2020-09-23T15:04:12Z I! [agent] Hang on, flushing any cached metrics before shutdown
2020-09-23T15:09:45Z I! [agent] Hang on, flushing any cached metrics before shutdown
2020-09-23T15:14:16Z I! [agent] Hang on, flushing any cached metrics before shutdown
2020-09-23T15:18:24Z I! [agent] Hang on, flushing any cached metrics before shutdown
2020-09-23T15:23:01Z I! [agent] Hang on, flushing any cached metrics before shutdown
2020-09-23T15:27:13Z I! [agent] Hang on, flushing any cached metrics before shutdown

Changed in charm-telegraf:
assignee: Peter Sabaini (peter-sabaini) → nobody
Revision history for this message
Xav Paice (xavpaice) wrote :

I'm currently unable to reproduce this, could you share charm versions and Juju agent versions as well as any more details on the bundle?

Changed in charm-telegraf:
status: Confirmed → Incomplete
Revision history for this message
Xav Paice (xavpaice) wrote :

Telegraf log:

2021-04-08T06:10:20Z I! Reloading Telegraf config
2021-04-08T06:10:20Z I! [agent] Hang on, flushing any cached metrics before shutdown
2021-04-08T06:10:20Z I! Starting Telegraf 1.18.0
2021-04-08T06:10:20Z I! Loaded inputs: bond cgroup conntrack cpu disk diskio exec (7x) internal kernel_vmstat mem mysql net netstat nstat processes socket_listener swap system
2021-04-08T06:10:20Z I! Loaded aggregators:
2021-04-08T06:10:20Z I! Loaded processors:
2021-04-08T06:10:20Z I! Loaded outputs: prometheus_client
2021-04-08T06:10:20Z I! Tags enabled: host=zaza-da504c10178b:percona-cluster-0 juju_application=percona-cluster juju_model=zaza-da504c10178b juju_unit=percona-cluster-0
2021-04-08T06:10:20Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"zaza-da504c10178b:percona-cluster-0", Flush Interval:10s
2021-04-08T06:10:20Z W! [outputs.prometheus_client] Use of deprecated configuration: metric_version = 1; please update to metric_version = 2
2021-04-08T06:10:20Z E! [agent] Failed to connect to [outputs.prometheus_client], retrying in 15s, error was 'listen tcp :9103: bind: address already in use'
2021-04-08T06:10:35Z E! [telegraf] Error running agent: connecting output outputs.prometheus_client: Error connecting to output "outputs.prometheus_client": listen tcp :9103: bind: address already in use
2021-04-08T06:10:35Z I! Loaded inputs: bond cgroup conntrack cpu disk diskio exec (7x) internal kernel_vmstat mem mysql net netstat nstat processes socket_listener swap system
2021-04-08T06:10:35Z I! Loaded aggregators:
2021-04-08T06:10:35Z I! Loaded processors:
2021-04-08T06:10:35Z I! Loaded outputs: prometheus_client
2021-04-08T06:10:35Z I! Tags enabled: host=zaza-da504c10178b:percona-cluster-0 juju_application=percona-cluster juju_model=zaza-da504c10178b juju_unit=percona-cluster-0
2021-04-08T06:10:35Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"zaza-da504c10178b:percona-cluster-0", Flush Interval:10s
2021-04-08T06:10:35Z W! [outputs.prometheus_client] Use of deprecated configuration: metric_version = 1; please update to metric_version = 2
2021-04-08T06:10:35Z I! [outputs.prometheus_client] Listening on http://[::]:9103/metrics
2021-04-08T06:10:35Z I! [inputs.socket_listener] Listening on tcp://127.0.0.1:8094

Juju unit log:

2021-04-08 06:10:17 INFO juju-log Reactive main running for hook update-status
2021-04-08 06:10:19 INFO juju-log Initializing Snap Layer
2021-04-08 06:10:19 INFO juju-log Initializing Apt Layer
2021-04-08 06:10:19 INFO juju-log Initializing Leadership Layer (is leader)
2021-04-08 06:10:19 INFO juju-log Invoking reactive handler: reactive/telegraf.py:1653:update_status
2021-04-08 06:10:19 INFO juju-log Invoking reactive handler: reactive/apt.py:50:ensure_package_status
2021-04-08 06:10:19 INFO juju-log Invoking reactive handler: reactive/telegraf.py:754:configure_telegraf_deb
2021-04-08 06:10:19 INFO juju-log No output plugins in main config.
2021-04-08 06:10:20 INFO juju-log disabling the following metrics, since OVS is not available: ['ovs_dump_flows', 'ovs_dpctl']
2021-04-08 06:10:20 INFO juju-log Rendering exec metrics config files: /var/lib/juju/agents/unit-telegraf-0/charm/files/telegraf_exec_met...


Changed in charm-telegraf:
status: Incomplete → Confirmed
Revision history for this message
Xav Paice (xavpaice) wrote :

Fixed for LXD; we need to consider whether get_base_inputs() also needs the same change.

Eric Chen (eric-chen)
Changed in charm-telegraf:
status: Confirmed → Fix Committed
milestone: none → 22.08
Changed in charm-telegraf:
status: Fix Committed → Fix Released
Changed in charm-telegraf:
assignee: nobody → Xav Paice (xavpaice)