I've made some modifications to the script (see attached). The changes include:
1. Kill with ALRM first, then with KILL if that does not work after a short grace period. Also report on unkillable stressors.
2. Bump up the async I/O threshold for machines with lots of CPUs.
3. Force hdd to do sync writes, so that we don't build up a backlog of gazillions of pending I/Os on machines with a lot of memory and many CPUs.
4. Limit the readahead file size so that this stressor does not spend most of its time generating a test file before it can start testing readaheads.
I've run this through several times with the latest stress-ng and it runs through to completion.
So I think the problem was that loads of pending I/Os from stressors, plus bad cleanup on nuked stressors, were causing massive I/O backlogs, which made the system clag up.
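For anyone reading along without the attachment, the ALRM-then-KILL escalation from change 1 can be sketched roughly as below. This is only an illustration in Python, not the code from the attached script: the function name `stop_stressor`, the `grace` and `kill_wait` parameters, and their default values are all made up for the example.

```python
import signal
import subprocess
import sys


def stop_stressor(proc, grace=5.0, kill_wait=5.0):
    """Stop a child process politely first, then forcibly.

    Hypothetical helper: sends SIGALRM, waits up to `grace` seconds,
    then sends SIGKILL and waits up to `kill_wait` seconds.
    Returns "exited", "alrm", "kill" or "unkillable".
    """
    if proc.poll() is not None:
        return "exited"  # already gone, nothing to do

    # Step 1: ask the process to stop with SIGALRM.
    proc.send_signal(signal.SIGALRM)
    try:
        proc.wait(timeout=grace)
        return "alrm"
    except subprocess.TimeoutExpired:
        pass  # still running after the grace period

    # Step 2: escalate to SIGKILL.
    proc.kill()
    try:
        proc.wait(timeout=kill_wait)
        return "kill"
    except subprocess.TimeoutExpired:
        # Typically a process stuck in uninterruptible I/O wait.
        print(f"unkillable stressor: pid {proc.pid}", file=sys.stderr)
        return "unkillable"
```

Given a `subprocess.Popen` handle for a running stressor, `stop_stressor(proc, grace=10)` returns which step ended it, so the caller can log any process that needed KILL or could not be killed at all.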