check_commissioning fails test: test_check_with_commissioning_but_not_expired_node

Bug #1354511 reported by Jeroen T. Vermeulen
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MAAS
Fix Released
Critical
Jeroen T. Vermeulen
1.6
Fix Released
Undecided
Jeroen T. Vermeulen

Bug Description

This test fails consistently for me in trunk, and Julian mentioned seeing it fail as well: src/maasserver/tests/test_api_commissioning.py:TestCommissioningTimeout.test_check_with_commissioning_but_not_expired_node

A node is created in Commissioning state, then the check_commissioning API call is invoked, and the node is immediately marked Failed. Julian has reported seeing a similar test failure.

Looking at the code, it seems obvious: check_commissioning compares the node's last-update timestamp, which is based on the database's now() transaction clock, to Python's datetime.now() minus one hour — instead of datetime.utcnow() as you'd expect. Julian and I observed the failures in large positive timezone offsets, so datetime.now() would be well over an hour ahead of the database now().

What still puzzles me is that the same test consistently passes for me in 1.6 or earlier, and consistently fails in trunk. The check_commissioning code hasn't changed, and neither has the TimestampedModel code that sets the update timestamp. The test is unchanged as well. Could we be faking the clock for testing, but in a different way than before? Or has there been a change in some dependency that our dev systems don't take from an installed package?

(On a sidenote, can we really count on the node's latest update timestamp being an accurate representation of when it started commissioning? How do we know there are no periodic updates to refresh the timestamp before the timeout can elapse?)

I'm treating this as a potential production problem, not as a spurious test failure, until we know better.

Related branches

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

I scanned the diff of the “src” directory between 1.6 and trunk for “[Tt]ime” and “[Cc]lock” but found nothing that looked like it might cause this. :-|

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

Found it! The database cluster in my trunk branch was still on UTC+1. Setting it up afresh made the test pass.

This does reveal a bug in the API code, however: it shouldn't be mixing its client-side clock with the database clock. It'd be great if it could just issue the SQL expression “now() - <interval>” instead of computing the value itself.

Changed in maas:
milestone: none → 1.6.1
Changed in maas:
assignee: nobody → Jeroen T. Vermeulen (jtv)
status: Triaged → In Progress
Changed in maas:
status: In Progress → Fix Committed
Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

The backport to 1.2 is nontrivial. Not sure it's worth the trouble — the conditions for the failure aren't all that common.

Changed in maas:
milestone: 1.6.1 → none
Changed in maas:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.