Bug #1918340 “Fault Injection #1 - improve unit test effectivene...” : Bugs : OpenStack Compute (nova)

Balazs Gibizer (balazs-gibizer) wrote on 2021-03-11:

#1

@Henrique: Thanks for the reports. Do you intend to propose fixes for these? I'm asking because it is hard to judge whether these are real faults in the system causing user-visible problems, or just missing test coverage.

If you are planning to propose fixes, then I suggest opening only those bugs for which you are also ready to propose the fix. I also suggest reproducing the faults on master, as bugs, if they exist, need to be fixed on master first and then backported to stable branches.

If you are not planning to propose fixes, then please only open those bugs that cause user-facing faults.

I will mark this bug Incomplete until you answer my question above. Please set it back to New once you have answered.
I will also mark the rest of the fault injection bugs as Invalid until we clarify these questions.

Changed in nova:
status:	New → Incomplete

sean mooney (sean-k-mooney) wrote on 2021-03-11:

#2

Hi, when mass-opening bugs like this it's generally considered polite to reach out to the project in question beforehand.

Nova does not typically consider any bug caused by fault injection to be valid.
If you cannot recreate the same condition with the existing code using the public API, then it's not a valid bug.

Setting this to Incomplete.
Please go through all the other fault injection bugs and ensure the same fault can be created without modifying code, and close them if it cannot, before opening any other fault injection bugs.

A better way to approach hardening testing would be to compile a list of gaps in an etherpad,
then discuss it on the mailing list and likely group it into a single bug with multiple patches to resolve the errors.

Mass filing bugs like this is strongly discouraged, as it makes tracking the real issues users are facing much, much harder.

Feel free to reach out to us on IRC (#openstack-nova) or the openstack-discuss mailing list to talk about this more and what your goals are.
regards
sean

Henrique Marques (hmdmarques) wrote on 2021-03-13:

#3

Thank you for your time analysing these issues.

My intention is solely to pass information to the OpenStack community so that the tests can be improved and the test suite made more effective (my apologies if any information was supplied in the wrong manner). The end goal is a test suite that is more capable of catching future (probable) bugs.

The fault injection on compute/api.py was done on stable/ussuri because, when we started, it was the most recent released version.
I must say that I cannot repeat the process on the master branch in a timely manner, because the faults we injected (defined based on closed and resolved bug reports) lead to too many faulty versions to test (11309 versions, to be specific). With the setup I have available, testing all faulty versions takes over 200 days, and testing only the faulty versions that pass undetected through the tests takes nearly 50 days.

With this said, I must emphasize that I am reporting just part (72 cases) of what I found during the experiments; these mostly require trivial changes to the test cases but allow for more effective unit tests. In total we have found 290 probable bugs that are not detected by any of the unit, functional, or integration tests (note that these are probable bugs, representative of what OpenStack has already experienced and fixed in the past).

Fixing these issues would improve the test coverage and overall effectiveness.

I will highlight just some of the most relevant bugs detected:
- Removing @check_instance_lock allows operations to be executed on instances that are locked
- Changing condition expressions results in operations being performed when they are not supposed to be (e.g. cache reset)
- Removing exception handling results in unexpected behaviour
- Many other fault types result in incorrect values being returned by the called functions. The reason is that the mock functions in the tests do not validate the received parameters and return a fixed expected value. This masks some of the issues that occur before that function call.
None of the problems described above is detected by the test suite.
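
To illustrate the last point in the list, here is a minimal sketch of how a mock that returns a fixed value can hide a wrong-parameter fault unless the test also validates the arguments the mock received. All names (Backend, reboot_instance, the "SOFT"/"HARD" values) are hypothetical and chosen for illustration only; this is not nova code.

```python
from unittest import mock


class Backend:
    """Hypothetical collaborator that a unit test would replace with a mock."""

    def reboot(self, instance_id, reboot_type):
        raise NotImplementedError


def reboot_instance(backend, instance_id):
    # Intended behaviour (assumed for this sketch): request a soft reboot.
    return backend.reboot(instance_id, reboot_type="SOFT")


def reboot_instance_faulty(backend, instance_id):
    # Injected fault: the call is made with a wrong parameter value.
    return backend.reboot(instance_id, reboot_type="HARD")


def weak_check(func):
    """Only checks the return value, which the mock fixes in advance."""
    backend = mock.Mock(spec=Backend)
    backend.reboot.return_value = "ok"
    return func(backend, "uuid-1") == "ok"


def strict_check(func):
    """Also validates the parameters the mock received."""
    backend = mock.Mock(spec=Backend)
    backend.reboot.return_value = "ok"
    result = func(backend, "uuid-1")
    try:
        backend.reboot.assert_called_once_with("uuid-1", reboot_type="SOFT")
    except AssertionError:
        return False
    return result == "ok"
```

In this sketch weak_check passes for both the correct and the faulty function, while strict_check rejects the faulty one, which is the kind of small test change meant above.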

At the moment I am unable to propose a fix due to time constraints (I am working full-time while doing an MSc), but I would like to report these issues so that they benefit the community.

Regards
Henrique

Changed in nova:
status:	Incomplete → New

Lee Yarwood (lyarwood) wrote on 2021-03-17:

#4

Apologies if I'm being overly simplistic here but how is changing a _constant_ fault injection?

The value of the constant isn't something we assert, just the behaviour it causes when used.
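
A minimal sketch of what this means in practice, with a hypothetical constant and function (not nova code): a test written against the behaviour the constant drives keeps passing when the constant's value changes, because it never asserts the value itself.

```python
MAX_NAME_LENGTH = 255  # hypothetical constant, not taken from nova


def validate_name(name, limit=MAX_NAME_LENGTH):
    """Reject names longer than the configured limit."""
    if len(name) > limit:
        raise ValueError("name too long")
    return name


def boundary_behaviour_holds(limit):
    """Behavioural check: accept at the limit, reject one character past it."""
    validate_name("a" * limit, limit)
    try:
        validate_name("a" * (limit + 1), limit)
    except ValueError:
        return True
    return False
```

Here boundary_behaviour_holds succeeds for any limit, so a mutated constant value would go unnoticed by design.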

Henrique Marques (hmdmarques) wrote on 2021-03-30:

#5

Dear All,

Regardless of whether there is interest in covering fault injection cases, I must emphasize that the point of this message is solely to improve the current battery of tests, which clearly do not cover a few situations. This holds regardless of the usefulness of the fault injection process itself (which is applied to systems where reliability is of utmost concern). That process usually includes different kinds of faults that may represent typical developer mistakes (e.g., setting a constant to an incorrect value is a well-known case, but it is just one example among many others, such as calling functions with wrong parameters, having extraneous code in an if statement, etc.). Depending on the goals, sometimes even faults that are not directly related to programmers' actions are injected, e.g., bit-flips in memory.

Anyway, I think the point is not to discuss the merits of fault injection. Let me try to summarise:
- If you are content for the current tests not to cover certain cases, they should be kept as is. If you are interested in improving the current tests, they should be augmented, preferably in the direction I am pointing out (the options are obviously immense, but the ones identified are based on the analysis of previous mistakes made by developers and reported on Launchpad).

Best regards,
Henrique

Sylvain Bauza (sylvain-bauza) wrote on 2021-04-06:

#6

Fixing unit tests or tech debt concerns doesn't really need bug reports. That's also why we have Gerrit, for discussing whether the debt fix is good or not.
So, instead of discussing here what to do, please upload a new change fixing what you want and ask us to review it on #openstack-nova; we'll do it.

Changed in nova:
status:	New → Invalid

OpenStack Compute (nova)

Fault Injection #1 - improve unit test effectiveness
