Asynchronous RPM validation task in ansible-hardening produces RPM database corruption

Bug #1921292 reported by Jeff Albert
Affects: OpenStack-Ansible
Status: Fix Released
Importance: Medium
Assigned to: Dmitriy Rabotyagov

Bug Description

The ansible-hardening role starts a background process early in its execution to run `rpm -Va` and record its output, then returns later in the run to review the results: https://opendev.org/openstack/ansible-hardening/src/branch/master/tasks/rhel7stig/async_tasks.yml#L18-L34
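The pattern in question looks roughly like the following (a minimal sketch; the task names, log path, and polling values are placeholders rather than the role's exact code):

```yaml
# A minimal sketch of the fire-and-forget pattern; task names, the log
# path, and the polling values are placeholders, not the role's code.
- name: Verify packages with rpm (runs in the background)
  shell: rpm -Va > /var/tmp/rpm_verify.log 2>&1
  async: 300          # hard cap: Ansible kills the job after 300 seconds
  poll: 0             # do not wait here; carry on with other tasks
  register: rpm_verify_job
  failed_when: false
  changed_when: false

# ... many other hardening tasks run in between ...

- name: Collect the rpm verification results
  async_status:
    jid: "{{ rpm_verify_job.ansible_job_id }}"
  register: rpm_verify_result
  until: rpm_verify_result.finished
  retries: 30
  delay: 10
```

The `async: 300` value is the hard cap at issue: Ansible's async wrapper terminates the background job once it has run that long, regardless of whether `rpm -Va` has finished.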

However, a full verify, including checksums for every package-provided file, can take a very long time on a system carrying any significant number of packages: on our fairly generously specced bare-metal control nodes, the task runs for up to 13 minutes.

Unfortunately, the task that starts the verify command specifies a hard, non-configurable timeout of 300 seconds. When the task runs longer than this, Ansible terminates it forcefully, leaving RPM database locks open and corrupting the RPM database. Worse, the termination doesn't fail the role run, since `failed_when` is false, so operators won't realize the corruption has happened until several steps later, when another RPM task in the hardening role fails. Manual intervention is then necessary to repair the database on every affected node (see the sketch below). During an upgrade of a large site, this issue has been a major stumbling block for us.
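For anyone else hitting this, the manual repair is the standard RPM database rebuild; a sketch as an ad-hoc task, assuming the Berkeley DB-backed rpmdb used on RHEL/CentOS 7 (adjust for other platforms):

```yaml
# Sketch of the manual cleanup on a RHEL/CentOS 7 node: remove the
# stale Berkeley DB lock/region files, then rebuild the rpmdb indexes.
- name: Repair a corrupted RPM database
  become: true
  shell: |
    rm -f /var/lib/rpm/__db.*    # drop stale lock/region files
    rpm --rebuilddb              # rebuild the package database indexes
```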

I would propose either extending the timeout significantly and making it configurable, or changing the command to call rpm with the `--nofiles --nodigest` flags, which skip the most I/O-intensive parts of the verification and keep the task's run time reasonable.
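Concretely, the two options could look something like this (a sketch only; `security_rpm_verify_timeout` is a suggested variable name, not an existing role variable):

```yaml
# Option 1: raise the cap and expose it as a variable
# (security_rpm_verify_timeout is only a suggested name).
- name: Verify packages with rpm (runs in the background)
  shell: rpm -Va > /var/tmp/rpm_verify.log 2>&1
  async: "{{ security_rpm_verify_timeout | default(3600) }}"
  poll: 0

# Option 2: skip the per-file and digest checks so the default
# 300-second window is comfortably enough.
- name: Verify package metadata only
  shell: rpm -Va --nofiles --nodigest > /var/tmp/rpm_verify.log 2>&1
  async: 300
  poll: 0
```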

I'll be happy to submit a pull request for either fix strategy, depending on what the maintainers feel is appropriate.

Revision history for this message
Dmitriy Rabotyagov (noonedeadpunk) wrote :

Pushed patch https://review.opendev.org/c/openstack/ansible-hardening/+/782909 that extends the timeout to 1 hour, which should be more than enough for any deployment I can imagine.
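In effect (a sketch of the change, not the literal diff), the async cap on the verify task goes from 300 to 3600 seconds:

```yaml
# Sketch of the effect of the change: the hard cap on the background
# rpm -Va job is raised from 300 seconds to one hour.
- name: Verify packages with rpm (runs in the background)
  shell: rpm -Va > /var/tmp/rpm_verify.log 2>&1
  async: 3600   # one hour; was 300
  poll: 0
```

Note that the later `async_status` collection loop also needs its retries × delay to cover the longer window, or it will give up before a slow verify finishes.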

Changed in openstack-ansible:
status: New → Fix Committed
importance: Undecided → Medium
assignee: nobody → Dmitriy Rabotyagov (noonedeadpunk)
Changed in openstack-ansible:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/ansible-hardening yoga-eom

This issue was fixed in the openstack/ansible-hardening yoga-eom release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/ansible-hardening victoria-eom

This issue was fixed in the openstack/ansible-hardening victoria-eom release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/ansible-hardening wallaby-eom

This issue was fixed in the openstack/ansible-hardening wallaby-eom release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/ansible-hardening xena-eom

This issue was fixed in the openstack/ansible-hardening xena-eom release.
