Grafana Charm

Changing the prometheus web-listen-port leaves the charm in a permanent error state

Bug #1893320 reported by Trent Lloyd on 2020-08-28

This bug affects 5 people

Affects		Status	Importance	Assigned to	Milestone
	Grafana Charm	Fix Released	Undecided	Brett Milford	Grafana Charm 21.07

Bug Description

* Problem Description *

Changing the related prometheus charm's web-listen-port gets the grafana charm stuck in an error state in two different ways, (1) and (3) below; (2) describes how the first failure delays the second.

(1) If an update-status hook is scheduled after the port has changed, but before the grafana-source-relation-changed hook has run to update the URL, the update-status hook gets stuck in error - it tries to query the old URL and the failure to connect exception bubbles up to a hook error

File "charm/reactive/grafana.py", line 580, in configure_sources
generate_prometheus_dashboards(gf_adminpasswd, ds)
File "charm/reactive/grafana.py", line 853, in generate_prometheus_dashboards
response = requests.get("{}/api/v1/label/__name__/values".format(ds["url"]))
requests.exceptions.ConnectionError: HTTPConnectionPool(host='10.5.1.43', port=9090): Max retries exceeded with url: /api/v1/label/__name__/values (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb1be6beda0>: Failed to establish a new connection: [Errno 111] Connection refused',))
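
One way to keep this failure from surfacing as a hook error is to catch the connection exception and let the caller report a status instead. The following is only a minimal sketch, not the charm's actual code: the charm uses requests, while this sketch uses the standard library so that it runs on its own, and fetch_metric_names and its opener parameter are hypothetical names introduced for illustration.

```python
import json
import urllib.error
import urllib.request


def fetch_metric_names(base_url, timeout=5, opener=urllib.request.urlopen):
    """Return the metric names known to Prometheus, or None if unreachable.

    Hypothetical helper: `opener` exists only so the network call can be
    swapped out when testing.
    """
    url = "{}/api/v1/label/__name__/values".format(base_url.rstrip("/"))
    try:
        with opener(url, timeout=timeout) as response:
            payload = json.load(response)
    except (urllib.error.URLError, OSError, ValueError) as exc:
        # Connection refused, timeout, or a non-JSON body: report it to the
        # caller instead of letting the exception fail the hook.
        print("Could not query {}: {}".format(url, exc))
        return None
    return payload.get("data", [])
```

A caller could treat a None result as "Prometheus not reachable yet" and skip dashboard generation until the relation data has been refreshed.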

(2) Because Juju hooks are ordered, this update-status failure prevents the grafana-source-relation-changed hook from running: Juju keeps re-executing the update-status hook with the same context and the old view of the configuration it had before it failed. We can bypass this with "juju resolved --no-retry grafana/0", which should allow it to progress to the changed hook.

(3) Once the grafana-source-relation-changed hook runs (regardless of whether you got stuck in an update-status hook or not), we get a different error:

File "/var/lib/juju/agents/unit-grafana-0/charm/reactive/grafana.py", line 578, in configure_sources
check_datasource(ds)
File "/var/lib/juju/agents/unit-grafana-0/charm/reactive/grafana.py", line 687, in check_datasource
cur.execute(stmt, values)
sqlite3.IntegrityError: UNIQUE constraint failed: data_source.org_id, data_source.name

This happens because the relevant code is keying off of the URL to update entries in the database:
"if row[1] == ds["type"] and row[3] == ds["url"]:" from https://git.launchpad.net/charm-grafana/tree/src/reactive/grafana.py?h=stable/20.08#n677

Since that check fails, it attempts to add a new data source; however, the grafana sqlite database has a UNIQUE constraint on (org_id, name), so this also fails.
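
The failure mode can be reproduced outside the charm with an in-memory database. This is a sketch under assumptions: the table below is a heavily simplified stand-in for Grafana's data_source table (only the UNIQUE constraint is taken from the error message), and upsert_by_url and upsert_by_name are hypothetical names, not functions from the charm.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for Grafana's data_source table (assumption: the real
# table has many more columns; only the UNIQUE constraint matters here).
conn.execute(
    "CREATE TABLE data_source ("
    " id INTEGER PRIMARY KEY, org_id INTEGER, name TEXT, type TEXT, url TEXT,"
    " UNIQUE (org_id, name))"
)
NAME = "prometheus - Juju generated source"
conn.execute(
    "INSERT INTO data_source (org_id, name, type, url) VALUES (1, ?, ?, ?)",
    (NAME, "prometheus", "http://10.5.1.43:9090"),
)
# Same data source after web-listen-port was changed to 80.
ds = {"name": NAME, "type": "prometheus", "url": "http://10.5.1.43:80"}


def insert(conn, ds):
    conn.execute(
        "INSERT INTO data_source (org_id, name, type, url) VALUES (1, ?, ?, ?)",
        (ds["name"], ds["type"], ds["url"]),
    )


def upsert_by_url(conn, ds):
    """Keyed on (type, url): no row has the new URL yet, so it inserts."""
    row = conn.execute(
        "SELECT id FROM data_source WHERE type = ? AND url = ?",
        (ds["type"], ds["url"]),
    ).fetchone()
    if row is None:
        insert(conn, ds)


def upsert_by_name(conn, ds):
    """Keyed on (org_id, name): finds the existing row and updates its URL."""
    cur = conn.execute(
        "UPDATE data_source SET url = ? WHERE org_id = 1 AND name = ?",
        (ds["url"], ds["name"]),
    )
    if cur.rowcount == 0:
        insert(conn, ds)


try:
    upsert_by_url(conn, ds)
except sqlite3.IntegrityError as exc:
    print("URL-keyed check failed:", exc)

upsert_by_name(conn, ds)
print(conn.execute("SELECT name, url FROM data_source").fetchall())
```

Keying on the name avoids the constraint violation, at the cost that a data source renamed in the Grafana UI would no longer be matched.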

* Reproducer *

juju deploy cs:grafana --config port=3000 --config install_method=snap
juju deploy cs:prometheus2 prometheus
juju add-relation prometheus:grafana-source grafana:grafana-source
juju add-relation telegraf:prometheus-client prometheus:target

juju config prometheus web-listen-port=80

* Suggested Solution *

This code was previously modified NOT to check the name, to allow users to change the name in the Grafana configuration editor to a friendly name they prefer:
https://git.launchpad.net/charm-grafana/commit?h=stable/20.08&id=7540bfadb1cd717ad1c3b44872aa142e97e8308a

So to fix this we will need to either revert that ability or find some other way to key the change. Possible ideas:
- storing some kind of tag/metadata
- using the data source description (currently set to "name - Juju generated source")
- Using the charmhelpers that store the old configuration data to check the Old URL

* Workaround *
(1) Resolve the broken update-status hook that is failing with "Failed to establish a new connection: [Errno 111] Connection refused"

juju resolved grafana/0 --no-retry

(2) Watch "juju debug-log grafana/0" and wait for grafana-source-relation-changed to fail with the error "sqlite3.IntegrityError: UNIQUE constraint failed: data_source.org_id, data_source.name" instead

If you get more hook failures with the "Connection refused" error, re-run the resolved command and wait again and hopefully you will get to the UNIQUE constraint error.

(3) Manually update the grafana source list to use the new URL

You can attempt to do this through the Grafana UI. Settings Menu -> Data Sources -> Click the relevant entry.

However, for some reason, if you set up a very simple reproduction environment, this page throws an error in the Grafana UI: "TypeError: Cannot read property 'timeInterval' of undefined". I assume this is because the reproduction environment only has prometheus2/grafana and no data source like telegraf that triggers this. In such a case, we can update the sqlite database file manually:

juju ssh grafana/0 sudo -i
apt-get install sqlite3
sqlite3 /var/snap/grafana/common/data/grafana.db
SELECT * FROM data_source;
UPDATE data_source SET url='http://IP_HOST:PORT' WHERE url='http://IP_HOST:OLD_PORT';
.quit

(4) Mark the error resolved WITHOUT specifying --no-retry, so that the hook retries and should succeed.

juju resolved grafana/0
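
The manual database edit in step (3) can also be scripted with Python's built-in sqlite3 module, which avoids installing the sqlite3 command-line tool. This is a sketch only: replace_datasource_url is a hypothetical helper, and it assumes the same data_source table and url column used in the SQL statements above.

```python
import sqlite3


def replace_datasource_url(db_path, old_url, new_url):
    """Point rows that still use old_url at new_url; return rows changed."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                "UPDATE data_source SET url = ? WHERE url = ?",
                (new_url, old_url),
            )
            return cur.rowcount
    finally:
        conn.close()
```

For example, replace_datasource_url("/var/snap/grafana/common/data/grafana.db", "http://IP_HOST:OLD_PORT", "http://IP_HOST:PORT"), with the placeholders filled in as in step (3). A return value of 0 means no row matched the old URL. Taking a copy of the database file first is prudent.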

Tags:

Related branches

~brettmilford/charm-grafana:lp-1893320

Merged into charm-grafana:master at revision 247b88d3f2a0f1735f5091f560c4c61843778daa

Celia Wang: Approve on 2021-07-12

Chris Johnston (community): Approve on 2021-06-18

Paul Goins: Needs Fixing on 2021-05-28

~brettmilford/charm-grafana:lp-1893320

Merged into charm-grafana:master at revision 7929c5b8ef30a3f4e61b5aa55d6600dca62110a0

Drew Freiberger (community): Approve on 2021-01-28

Paul Goins: Approve on 2021-01-18

Chris Johnston (cjohnston) on 2020-08-28

Changed in charm-grafana:
status:	New → Confirmed
tags:	added: sts

Brett Milford (brettmilford) wrote on 2020-11-03:

So it's interesting to note that the code doesn't traverse the UPDATE path because we're comparing URLs including the port number, which by this point has changed.

If it's possible to capture and compare the previous URL to be sure we're updating the same entry, this would be ideal.

Otherwise, I think it's sufficient to compare the rest of the URL except the port.
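
Comparing URLs while ignoring the port could look roughly like the sketch below. The helper name same_endpoint_ignoring_port is hypothetical and this is not the change that was eventually merged.

```python
from urllib.parse import urlsplit


def same_endpoint_ignoring_port(url_a, url_b):
    """True when both URLs share scheme, host and path, whatever the port."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    return (a.scheme, a.hostname, a.path.rstrip("/")) == (
        b.scheme,
        b.hostname,
        b.path.rstrip("/"),
    )


print(same_endpoint_ignoring_port("http://10.5.1.43:9090", "http://10.5.1.43:80"))
print(same_endpoint_ignoring_port("http://10.5.1.43:9090", "http://10.5.1.44:9090"))
```

One caveat with this approach: two data sources served from the same host on different ports would be treated as the same entry.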

Brett Milford (brettmilford) wrote on 2020-11-03:

Another option might be to separate out joined/change hooks to trigger different flags for insert vs update.

https://git.launchpad.net/interface-grafana-source/tree/requires.py#n9

Brett Milford (brettmilford) on 2020-11-05

Changed in charm-grafana:
assignee:	nobody → Brett Milford (brettmilford)

Drew Freiberger (afreiberger) wrote on 2020-11-10:

Given that datasource names can't be changed with the new dashboard relations still having hard-coded 'prometheus - juju configured datasource' as the datasource name, I'm okay with the revert of the prior commit, since data source names can't actually be changed successfully any longer.

Celia Wang (ziyiwang) on 2021-02-03

Changed in charm-grafana:
milestone:	none → 21.01
status:	Confirmed → Fix Committed
status:	Fix Committed → Fix Released

Chris Johnston (cjohnston) wrote on 2021-02-25:

After this fix I'm now running into a scenario where grafana goes into an indefinite blocked state.

I've deployed prometheus-21 and grafana-39 with my Kubernetes deployment. After things settle I change the web-listen-port to 80 and see:

grafana/0* blocked idle 4 10.5.1.77 3000/tcp Exception reaching prometheus API whilst updating dashboards

I took a look at the logs for grafana and I see that it looks like it's getting the updated port:

2021-02-25 00:59:06 INFO juju-log Invoking reactive handler: reactive/grafana.py:578:wipe_nrpe_checks
2021-02-25 00:59:06 INFO juju-log Invoking reactive handler: reactive/grafana.py:596:configure_sources
2021-02-25 00:59:06 INFO juju-log Found datasource: {'service_name': 'prometheus', 'type': 'prometheus', 'url': 'http://10.5.2.227:80', 'description': 'Juju generated source'}
2021-02-25 00:59:06 INFO juju-log Datasource already exist, updating: prometheus - Juju generated source
2021-02-25 00:59:06 INFO juju-log Checking Dashboard Template: CephCluster.json.j2
2021-02-25 00:59:06 DEBUG juju-log Skipping Dashboard Template: CephCluster.json.j2 missing 31 metrics.Missing: ceph_client_io_read_ops, ceph_osds, ceph_osds_down, ceph_osd_perf_apply_latency_seconds, ceph_cluster_used_bytes, ceph_cluster_capacity_bytes, ceph_osd_perf_commit_latency_seconds, ceph_misplaced_objects, ceph_monitor_quorum_count, ceph_stale_pgs, ceph_undersized_pgs, ceph_degraded_pgs, ceph_osd_up, ceph_stuck_stale_pgs, ceph_client_io_write_bytes, ceph_degraded_objects, ceph_pool_available_bytes, ceph_unclean_pgs, ceph_client_io_write_ops, ceph_health_status, ceph_recovery_io_bytes, ceph_osds_in, ceph_recovery_io_keys, ceph_recovery_io_objects, ceph_cluster_available_bytes, ceph_cluster_objects, ceph_stuck_unclean_pgs, ceph_stuck_degraded_pgs, ceph_osd_pgs, ceph_client_io_read_bytes, ceph_stuck_undersized_pgs
2021-02-25 00:59:06 INFO juju-log Checking Dashboard Template: Swift.json.j2
2021-02-25 00:59:06 DEBUG juju-log Skipping Dashboard Template: Swift.json.j2 missing 13 metrics.Missing: exec_swiftparts_object_handoff, exec_swiftparts_account_handoff, exec_swiftparts_object_primary, object_server_async_pendings, swift_disk_usage_bytes, exec_swiftparts_container_misplaced, swift_replication_stats, swift_replication_duration_seconds, exec_swiftparts_account_misplaced, exec_swiftparts_account_primary, exec_swiftparts_container_primary, exec_swiftparts_container_handoff, exec_swiftparts_object_misplaced
2021-02-25 00:59:06 INFO juju-log Checking Dashboard Template: OpenStackCloud.json.j2
2021-02-25 00:59:06 DEBUG juju-log Skipping Dashboard Template: OpenStackCloud.json.j2 missing 16 metrics.Missing: nova_resources_ram_mbs, hypervisor_disk_gbs_total, nova_resources_disk_gbs, hypervisor_vcpus_used, hypervisor_disk_gbs_used, hypervisor_memory_mbs_used, neutron_net_size, nova_resources_vcpus, nova_instances, hypervisor_memory_mbs_total, hypervisor_running_vms, openstack_allocation_ratio, openstack_exporter_cache_age_seconds, hypervisor_vcpus_total, openstack_exporter_cache_refresh_duration_seconds, hypervisor_schedulable_instances
2021-02-25 00:59:06 INFO juju-log Checking Dashboard Template: CephOSD.json.j2
2021-02-25 00:59:06 DEBUG juju-log Skipping Dashboard Template: CephOSD.json.j2 missing 10 metrics.Missing: ceph_osd_used_bytes, ceph_osd_in, ceph_osds, ceph_osd_perf_apply_latency_seconds, ceph_osd_perf_commit_latency_seconds, ceph_osd_avail_bytes, ceph_osd_variance, ceph_osd_up, ceph_osd_pgs, ceph_osd_utilization
2021-02-25 00:59:06 INFO juju-log Checking Dashboard Template: RabbitMQ.json.j2
2021-02-25 00:59:06 DEBUG juju-log Skipping Dashboard Template: RabbitMQ.json.j2 missing 18 metrics.Missing: rabbitmq_node_fd_total, rabbitmq_overview_messages_acked, rabbitmq_overview_exchanges, rabbitmq_overview_channels, rabbitmq_node_sockets_used, rabbitmq_overview_messages_ready, rabbitmq_overview_messages_published, rabbitmq_node_fd_used, rabbitmq_node_sockets_total, rabbitmq_overview_consumers, rabbitmq_overview_messages_unacked, rabbitmq_overview_connections, rabbitmq_node_mem_limit, rabbitmq_node_mem_used, rabbitmq_node_proc_total, rabbitmq_node_proc_used, rabbitmq_overview_queues, rabbitmq_overview_messages_delivered
2021-02-25 00:59:06 INFO juju-log Checking Dashboard Template: CephPools.json.j2
2021-02-25 00:59:06 DEBUG juju-log Skipping Dashboard Template: CephPools.json.j2 missing 9 metrics.Missing: ceph_pool_raw_used_bytes, ceph_pool_read_total, ceph_pool_read_bytes_total, ceph_pool_used_bytes, ceph_pool_objects_total, ceph_pool_available_bytes, ceph_pool_dirty_objects_total, ceph_pool_write_total, ceph_pool_write_bytes_total
2021-02-25 00:59:06 INFO juju-log Invoking reactive handler: reactive/grafana.py:1211:import_dashboards
2021-02-25 00:59:06 INFO juju-log import_dashboards: telegraf, digest None, is_new: False
2021-02-25 00:59:06 INFO juju-log import_dashboards: kubernetes, digest None, is_new: False
2021-02-25 00:59:06 INFO juju-log import_dashboards: prometheus, digest None, is_new: False
2021-02-25 00:59:06 INFO juju-log Invoking reactive handler: hooks/relations/http/provides.py:15:broken:website
2021-02-25 00:59:06 DEBUG update-status UPDATE DATA_SOURCE SET basic_auth_user = ?, basic_auth_password = ?, basic_auth = 0 ('', '')

But when looking at the db, something is preventing it from being properly updated:

sqlite> SELECT * FROM data_source;
1|1|0|prometheus|prometheus - Juju generated source|proxy|http://10.5.2.227:9090||||0|||0|{}|2021-02-25 00:06:19|2021-02-25 00:06:19|0|{}|0|2459839366
sqlite> .quit

Brett Milford (brettmilford) wrote on 2021-05-12:

@cjohnston I've added a commit to address this issue. Can you please help test it in your environment?

Joe Guo (guoqiao) wrote on 2021-07-13:

I also hit this issue, with prometheus2 rev 22:

juju status
...
prometheus active 1 prometheus2 jujucharms 22 ubuntu
...
grafana/0* blocked idle 0/lxd/2 10.98.160.214 3000/tcp Exception reaching prometheus API whilst updating dashboards

Joe Guo (guoqiao) wrote on 2021-07-13:

Ok, I can confirm the issue is fixed in latest master, with the patch from @brettmilford. Thank you!

Celia Wang (ziyiwang) on 2021-07-28

Changed in charm-grafana:
milestone:	21.01 → 21.07

sahul (buddy001) wrote on 2021-11-12:

Hi,
Error: Exception reaching prometheus API whilst updating dashboards

I get this error. Please let me know how to apply the patch in an existing environment.

Duplicates of this bug

Bug #1893282
