Add a cronjob deployed by the charm to cleanup metrics for deleted resources

Bug #2013187 reported by Diko Parvanov
Affects: Gnocchi Charm
Status: Triaged
Importance: Wishlist
Assigned to: Unassigned

Bug Description

We are in a situation where the metric db has more than 40 million records, many of them for deleted resources. This [0] states that nothing will be done to automatically clean those up, so we need to use the scripts from [1] to do so. Running the instance-related cleanup script takes about 3 seconds per record to compare and mark the metrics as "delete", so working through the current backlog would amount to roughly 4 years of SQL execution.

We need a cronjob implemented in and deployed by the charm (with a config option to enable it on a specific interval or to disable it) that runs the queries from [1] on a regular basis so that this situation cannot recur.

[0] https://github.com/gnocchixyz/gnocchi/issues/1050
[1] https://gist.github.com/sombrafam/a3259f2f8855194bdb4ffae743ddc0be

Tags: ps5
Diko Parvanov (dparv)
tags: added: ps5
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

I agree that something along these lines is useful, but I was curious about this bit:

> Running the instance related cleanup script takes about 3 seconds per record to compare and mark as delete to metrics, so this is about 4 years of sql execution for the current situation.

That seems a little slow. Looking at the SQL, I'm wondering if joins might perform better than sub-queries (as the MySQL optimiser is a bit more mature for joins)? However, it's been a while since I brushed up on my SQL!

Changed in charm-gnocchi:
status: New → Triaged
importance: Undecided → Wishlist
Revision history for this message
Paul Goins (vultaire) wrote :

I've been looking at this a bit.

In one particular environment where we're hitting this, I see we're doing a substring match on a table containing 5,714 records, checking against 196,834 different deleted instance UUIDs, for a worst case of 1,124,709,476 substring comparisons. That is insane.

I'm hoping to spend some time on my side experimenting (since I happen to like these types of SQL issues), but there are a few things we can try in the queries from the gist:

* We can reduce the load by running in a loop against subsets of the deleted instance UUIDs, e.g. 1,000 at a time. That brings the load of a single query down by orders of magnitude, and seems perhaps the lowest-risk path to making the scripts feasible to run regularly for cleanup purposes.

* We can consider dropping particularly old instance records from the nova instances table. If we create and destroy instances all the time, this table will grow without bound and make the problem progressively worse unless we trim old records. That being said, I'm not sure of all the references elsewhere to these stale instance IDs - the Gnocchi DB is one case where it's obvious we should leverage the deleted instance IDs to identify unneeded data, but there may be other cases... I'm not enough of an expert here to say whether this is a good path to go down.
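For what it's worth, the first option could look something like this. This is a sketch only - the table and column names are assumed from the databases discussed in this bug, not copied from the gist:

```sql
-- Number the deleted instance UUIDs once, then process them in
-- batches instead of one giant sub-query.
CREATE TEMPORARY TABLE deleted_uuids (
    n    INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    uuid CHAR(36) NOT NULL
);
INSERT INTO deleted_uuids (uuid)
SELECT uuid FROM nova.instances WHERE deleted != 0;

-- One batch: rows 0 < n <= 1000. A driver (e.g. a stored procedure
-- with a WHILE loop) would advance the bounds in steps of 1,000.
UPDATE gnocchi.metric m
JOIN gnocchi.resource r ON m.resource_id = r.id
JOIN deleted_uuids d    ON r.original_resource_id = d.uuid
SET m.status = 'delete'
WHERE d.n > 0 AND d.n <= 1000;
```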

Revision history for this message
Paul Goins (vultaire) wrote :

Another thought: if we don't want to touch the nova instances table, we could create a separate table for tracking instance IDs we've done such cleanup for already, and use that to further limit the number of deleted instances we try to perform this cleanup for while keeping the nova.instances table otherwise intact.
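A minimal sketch of that idea (the tracking table name and columns here are hypothetical):

```sql
-- Hypothetical tracking table: remember which deleted instances have
-- already been cleaned, so each run only touches new ones.
CREATE TABLE IF NOT EXISTS gnocchi.cleaned_instances (
    uuid       CHAR(36) NOT NULL PRIMARY KEY,
    cleaned_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- Candidates for cleanup = deleted instances not yet tracked.
SELECT i.uuid
FROM nova.instances i
LEFT JOIN gnocchi.cleaned_instances c ON c.uuid = i.uuid
WHERE i.deleted != 0 AND c.uuid IS NULL;
```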

Revision history for this message
Paul Goins (vultaire) wrote :

Ended up working on this again for cleaning up things on the Prodstack clouds.

1. gnocchi.resource has a type column, and depending on the type, we can optimize how we search for substrings.

2. gnocchi.resource lacks indices on its original_resource_id column; copying select columns into a temporary table with additional indices can allow us to find records for update/deletion much more quickly.
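To illustrate point 2 (a sketch; the column types are guesses, not the actual gnocchi schema):

```sql
-- Copy the columns we search on into a temporary table that *does*
-- have an index on original_resource_id.
CREATE TEMPORARY TABLE tmp_resource (
    id                   BINARY(16)   NOT NULL,
    type                 VARCHAR(255),
    original_resource_id VARCHAR(255),
    INDEX idx_orig (original_resource_id)
) SELECT id, type, original_resource_id FROM gnocchi.resource;

-- DESCRIBE/EXPLAIN confirms the index is actually used
-- ('some-uuid' is a placeholder value).
EXPLAIN SELECT id FROM tmp_resource
WHERE original_resource_id = 'some-uuid';
```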

I have a script which is currently running on one of the larger Prodstack clouds. I will see if I can get the +1 to share it here, as it does not have any Canonical-internal details, but it does use code from two other engineers who haven't given me permission to publish it externally yet. Currently this code is on pace to clean up metrics for ~2.7 million deleted instances in probably less than 4 hours. I have spent time adding indices and using the "DESCRIBE" command to ensure those indices are actually being used by the queries in question, and it is unquestionably faster than the old method.

Regarding the actual cleanup method: I should call out the methodology used. It is:

* Identify gnocchi.resource records related to nova.instances, glance.images, cinder.volumes and cinder.snapshots.

* Find gnocchi.metrics records related to those resources, and set their status column to "delete" - apparently allowing Gnocchi to perform some sort of cleanup.

* Directly delete the gnocchi.resource records in question.

I've made improvements to the first part. The other 2 parts are essentially as they were written by other engineer(s), and I am not 100% sure regarding their correctness; those on the OpenStack team are likely in a better position to judge this.
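In sketch form, simplified to nova instances only (table and column names are assumed from the description above, not copied from the script):

```sql
-- 1. Resources tied to deleted instances.
CREATE TEMPORARY TABLE doomed AS
SELECT r.id
FROM gnocchi.resource r
JOIN nova.instances i ON r.original_resource_id = i.uuid
WHERE i.deleted != 0;

-- 2. Hand the metrics over to Gnocchi's own cleanup by flipping
--    their status to 'delete'.
UPDATE gnocchi.metric m
JOIN doomed d ON m.resource_id = d.id
SET m.status = 'delete';

-- 3. Directly delete the resource records themselves.
DELETE r FROM gnocchi.resource r
JOIN doomed d ON r.id = d.id;
```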

Revision history for this message
Paul Goins (vultaire) wrote :

I've received permission to share this.

Original methodology from Erlon Cruz: https://github.com/gnocchixyz/gnocchi/issues/1050#issuecomment-542753802

Credit to Nikita Koltsov for the initial rework to use temporary tables and stored procedures to speed up some of the code.

I built upon Nikita's code and added additional cleanup/reworks/optimizations to bring the code to where it is now.

The code is not perfect, and there's some low-hanging optimization regarding volume/image metric cleanup which I've hinted at in one of the comments in the code, but which I simply have not gotten around to. Nonetheless, submitting as-is in case it can give a good starting point for something baked into the charm.

Revision history for this message
Paul Goins (vultaire) wrote :

Also, copying from my team's internal wiki: this is how we set this up to run daily in MySQL:

/* Note: this causes the job to run daily at 5am UTC; if you want a different time, replace the "5" below with a different hour. */
DELIMITER //
CREATE EVENT gnocchi.gnocchi_daily_cleanup
 ON SCHEDULE
   EVERY 1 DAY
   STARTS (CURRENT_DATE() + INTERVAL '5' HOUR)
 DO
 CALL gnocchi.gnocchi_full_cleanup();
//
DELIMITER ;
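One operational note, in case it helps: MySQL events only fire when the event scheduler is running, so it's worth checking that it's enabled (it can also be set in my.cnf):

```sql
-- Check whether the event scheduler is running, and enable it if not.
SHOW VARIABLES LIKE 'event_scheduler';
SET GLOBAL event_scheduler = ON;

-- Verify the event was registered.
SHOW EVENTS FROM gnocchi;
```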

Revision history for this message
Paul Goins (vultaire) wrote :

Code moved to a public repository, albeit under my personal Launchpad account for now: https://code.launchpad.net/~vultaire/+git/gnocchi_cleaner

Revision history for this message
Felipe Reyes (freyes) wrote : Re: [Bug 2013187] Re: Add a cronjob deployed by the charm to cleanup metrics for deleted resources

On Tue, 2023-10-24 at 15:30 +0000, Paul Goins wrote:
> Code moved to a public repository, albeit under my personal Launchpad
> account for now:
> https://code.launchpad.net/~vultaire/+git/gnocchi_cleaner
>

Hi Paul, I was taking a look at the sql file[0], and it looks amazing. It's really great that this was
implemented as a stored procedure instead of Python (avoiding the back and forth of data). Do you
(or your team) have plans to upstream this into the gnocchi charm?

[0] https://git.launchpad.net/~vultaire/+git/gnocchi_cleaner/tree/gnocchi.sql

Revision history for this message
Paul Goins (vultaire) wrote :

Hi Felipe,

I wasn't planning to submit an MR myself; there's always the chance of that, but my bandwidth is limited for this type of work during normal work hours.

I would be thrilled if someone took this code and filed an MR against the gnocchi charm, or if the devs of that charm decided to pull that code to address this bug.

To be clear, it may still need some tuning. It may be necessary to reduce the chunk sizes further to minimize locks on the metrics/resources tables which may impair the Gnocchi APIs while the cleanup is run. Granted, for more common environments it may not be an issue, but in some of our internal clouds we are seeing some alerts which are suspected to be connected to this cleanup.
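One possible direction for that tuning (a sketch, not taken from the actual script; it assumes a numbered temporary table `deleted_uuids(n, uuid)` of deleted instance UUIDs): wrap each batch in its own short transaction, so row locks on the metric/resource tables are released at every COMMIT rather than held for the whole run.

```sql
-- Hypothetical loop body: one small batch per transaction. Locks
-- taken by the UPDATE are dropped at each COMMIT, limiting how long
-- the Gnocchi API is blocked.
START TRANSACTION;
UPDATE gnocchi.metric m
JOIN gnocchi.resource r ON m.resource_id = r.id
JOIN deleted_uuids d    ON r.original_resource_id = d.uuid
SET m.status = 'delete'
WHERE d.n > @lo AND d.n <= @lo + 100;   -- smaller chunk size
COMMIT;
SET @lo = @lo + 100;
```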

Revision history for this message
Paul Goins (vultaire) wrote :

Note: I've merged updates to the procedure today. It's been significantly rewritten and no longer depends on adding a tracking table; it can operate purely based upon the existing database tables.
