wal-e wal-push monitoring
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
PostgreSQL Charm | Triaged | High | Unassigned |
Bug Description
Hi,
We've had WAL-E wal-push stuck for weeks, and it was only noticed by a monitoring check that predicted the disk would be full in X days.
postgres 20209 0.0 0.0 142312 4248 ? Ss Apr15 25:35 \_ postgres: archiver process archiving 000000030000B78
postgres 29424 0.0 0.0 4456 676 ? S Jul16 0:00 | \_ sh -c /var/lib/
postgres 29425 0.0 0.0 132964 4200 ? S Jul16 0:00 | \_ /var/lib/
postgres 29431 0.0 0.0 0 0 ? Z Jul16 0:00 | \_ [lzop] <defunct>
I think the wal-e process and the postgresql-charm should do two things:
1. Update a "last updated" file (a simple touch?) on each successful push, and ship a monitoring check that alerts when it hasn't been updated recently.
2. Wrap wal-push in a 'timeout' so stuck processes are killed off if they run too long, letting the next run recover where the previous one left off. A sketch covering both points follows this list.
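Something along these lines could cover both points. This is only a minimal sketch, not the charm's implementation: the wrapper path, the heartbeat file location, and the timeout value are all assumptions, and the wrapper would be hooked in via something like `archive_command = '/usr/local/bin/wal_push_wrapper.py %p'`.

```python
#!/usr/bin/env python3
"""Hypothetical archive_command wrapper (wal_push_wrapper.py): kills a
stuck wal-push after a timeout and touches a heartbeat file on success."""
import pathlib
import subprocess
import sys

# Both values are assumptions; tune them for the deployment.
HEARTBEAT = pathlib.Path("/var/lib/postgresql/wal-e.last-push")
TIMEOUT_SECONDS = 600


def main() -> int:
    if len(sys.argv) != 2:
        return 2
    wal_path = sys.argv[1]  # PostgreSQL substitutes this via %p
    try:
        # subprocess.run kills the child and raises if the timeout expires.
        subprocess.run(["wal-e", "wal-push", wal_path],
                       check=True, timeout=TIMEOUT_SECONDS)
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        # A non-zero exit tells the archiver the segment was NOT archived,
        # so its .ready file survives and the next run retries it.
        return 1
    HEARTBEAT.touch()  # "last updated" marker for the staleness check
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Because the wrapper exits non-zero on timeout or failure, PostgreSQL keeps the segment's .ready file and retries it on the next archiver cycle, so killing a stuck push loses nothing. One caveat: `subprocess.run` only kills its direct child on timeout; in practice you may want `start_new_session=True` plus `os.killpg` so wal-e's own children (like the defunct lzop above) are cleaned up too.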
The number of .ready files in $PGDATA/pg_wal/archive_status would be a good thing to monitor, along with the age of the oldest .ready file. That would catch all stuck WAL archiving, not just WAL-E. It could live in the telegraf subordinate, or the PostgreSQL charm could bring up its own Prometheus scrape target.
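As a rough illustration of such a check, here is a minimal Nagios-style sketch. The archive_status path and both thresholds are assumptions; a telegraf exec input or a Prometheus textfile exporter could consume the same numbers.

```python
#!/usr/bin/env python3
"""Sketch of a stuck-archiving check: counts .ready files and reports the
age of the oldest one. The path and thresholds below are assumptions."""
import pathlib
import sys
import time

ARCHIVE_STATUS = pathlib.Path(
    "/var/lib/postgresql/12/main/pg_wal/archive_status")  # assumed $PGDATA
WARN_READY_COUNT = 50      # archiver falling behind
CRIT_AGE_SECONDS = 3600    # oldest segment waiting more than an hour


def main() -> int:
    ready = list(ARCHIVE_STATUS.glob("*.ready"))
    oldest_age = 0.0
    if ready:
        oldest_age = time.time() - min(p.stat().st_mtime for p in ready)
    print(f"ready_files={len(ready)} oldest_ready_age={oldest_age:.0f}s")
    if oldest_age > CRIT_AGE_SECONDS:
        return 2  # Nagios CRITICAL: archiving looks stuck
    if len(ready) > WARN_READY_COUNT:
        return 1  # Nagios WARNING
    return 0


if __name__ == "__main__":
    sys.exit(main())
```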