wal-e wal-push monitoring
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
PostgreSQL Charm | Triaged | High | Unassigned |
Bug Description
Hi,
We've had WAL-E wal-push stuck for weeks, and it was only noticed by a monitoring check that predicted the disk would be full in X days.
postgres 20209 0.0 0.0 142312 4248 ? Ss Apr15 25:35 \_ postgres: archiver process archiving 000000030000B78
postgres 29424 0.0 0.0 4456 676 ? S Jul16 0:00 | \_ sh -c /var/lib/
postgres 29425 0.0 0.0 132964 4200 ? S Jul16 0:00 | \_ /var/lib/
postgres 29431 0.0 0.0 0 0 ? Z Jul16 0:00 | \_ [lzop] <defunct>
I think the wal-e process and the postgresql-charm should do two things:
1. Update a "last updated" file (a simple touch?) on each successful push, and ship a monitoring check that alerts when it hasn't been updated recently.
2. Wrap wal-push in a 'timeout' so stuck processes are killed off if they run too long, letting the next run recover where the previous one left off. A sketch covering both points follows this list.
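Something along these lines could cover both points. This is only a minimal sketch, not the charm's implementation: the wrapper path, the heartbeat file location, and the timeout value are all assumptions, and the wrapper would be hooked in via something like `archive_command = '/usr/local/bin/wal_push_wrapper.py %p'`.

```python
#!/usr/bin/env python3
"""Hypothetical archive_command wrapper (wal_push_wrapper.py): kills a
stuck wal-push after a timeout and touches a heartbeat file on success."""
import pathlib
import subprocess
import sys

# Both values are assumptions; tune them for the deployment.
HEARTBEAT = pathlib.Path("/var/lib/postgresql/wal-e.last-push")
TIMEOUT_SECONDS = 600


def main() -> int:
    if len(sys.argv) != 2:
        return 2
    wal_path = sys.argv[1]  # PostgreSQL substitutes this via %p
    try:
        # subprocess.run kills the child and raises if the timeout expires.
        subprocess.run(["wal-e", "wal-push", wal_path],
                       check=True, timeout=TIMEOUT_SECONDS)
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        # A non-zero exit tells the archiver the segment was NOT archived,
        # so its .ready file survives and the next run retries it.
        return 1
    HEARTBEAT.touch()  # "last updated" marker for the staleness check
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Because the wrapper exits non-zero on timeout or failure, PostgreSQL keeps the segment's .ready file and retries it on the next archiver cycle, so killing a stuck push loses nothing. One caveat: `subprocess.run` only kills its direct child on timeout; in practice you may want `start_new_session=True` plus `os.killpg` so wal-e's own children (like the defunct lzop above) are cleaned up too.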
The number of .ready files in $PGDATA/pg_wal/archive_status would be a good thing to monitor, along with the age of the oldest .ready file. That would catch all stuck WAL archiving, not just WAL-E. It could live in the telegraf subordinate, or the PostgreSQL charm could bring up its own Prometheus scrape target.
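As a rough illustration of such a check, here is a minimal Nagios-style sketch. The archive_status path and both thresholds are assumptions; a telegraf exec input or a Prometheus textfile exporter could consume the same numbers.

```python
#!/usr/bin/env python3
"""Sketch of a stuck-archiving check: counts .ready files and reports the
age of the oldest one. The path and thresholds below are assumptions."""
import pathlib
import sys
import time

ARCHIVE_STATUS = pathlib.Path(
    "/var/lib/postgresql/12/main/pg_wal/archive_status")  # assumed $PGDATA
WARN_READY_COUNT = 50      # archiver falling behind
CRIT_AGE_SECONDS = 3600    # oldest segment waiting more than an hour


def main() -> int:
    ready = list(ARCHIVE_STATUS.glob("*.ready"))
    oldest_age = 0.0
    if ready:
        oldest_age = time.time() - min(p.stat().st_mtime for p in ready)
    print(f"ready_files={len(ready)} oldest_ready_age={oldest_age:.0f}s")
    if oldest_age > CRIT_AGE_SECONDS:
        return 2  # Nagios CRITICAL: archiving looks stuck
    if len(ready) > WARN_READY_COUNT:
        return 1  # Nagios WARNING
    return 0


if __name__ == "__main__":
    sys.exit(main())
```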