Change already reported error resulting in pipeline processing error loop

Bug #1349486 reported by Clark Boylan
This bug affects 1 person
Affects: Zuul
Status: Fix Committed
Importance: High
Assigned to: Unassigned

Bug Description

This happened:

2014-07-26 18:08:19,961 DEBUG zuul.Scheduler: Run handler sleeping
2014-07-26 18:08:19,961 DEBUG zuul.Scheduler: Run handler awake
2014-07-26 18:08:19,961 DEBUG zuul.IndependentPipelineManager: Starting queue processor: check
2014-07-26 18:08:19,961 DEBUG zuul.IndependentPipelineManager: <QueueItem 0x7f6fd56c2790 for <Change 0x7f6f8e963b10 108825,27> in check> is a failing item because ['at least one job failed']
2014-07-26 18:08:19,963 DEBUG zuul.IndependentPipelineManager: Finished queue processor: check (changed: False)
2014-07-26 18:08:19,963 DEBUG zuul.DependentPipelineManager: Starting queue processor: gate
2014-07-26 18:08:19,963 DEBUG zuul.DependentPipelineManager: Checking for changes needed by <Change 0x7f6fd4f53e90 109738,1>:
2014-07-26 18:08:19,963 DEBUG zuul.DependentPipelineManager: No changes needed
2014-07-26 18:08:19,963 ERROR zuul.Scheduler: Exception in run handler:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/zuul/scheduler.py", line 775, in run
    while pipeline.manager.processQueue():
  File "/usr/local/lib/python2.7/dist-packages/zuul/scheduler.py", line 1353, in processQueue
    item, nnfi, ready_ahead)
  File "/usr/local/lib/python2.7/dist-packages/zuul/scheduler.py", line 1325, in _processOneItem
    self.reportItem(item)
  File "/usr/local/lib/python2.7/dist-packages/zuul/scheduler.py", line 1409, in reportItem
    raise Exception("Already reported change %s" % item.change)
Exception: Already reported change <Change 0x7f6fd4f53e90 109738,1>

The change in question appeared to have been properly reported on the Gerrit side, and it was merged. But Zuul would not remove it from its gate pipeline: it would error while processing the gate pipeline, then start the run handler again and error again. This loop produced many gigabytes of logs in a short period of time.

Looking at https://git.openstack.org/cgit/openstack-infra/zuul/tree/zuul/scheduler.py#n1409 we are just checking a Zuul-internal flag that is set at https://git.openstack.org/cgit/openstack-infra/zuul/tree/zuul/scheduler.py#n1447. So an exception must be escaping the first time we report the item, after the reported flag has already been set to true; the uncaught exception makes the scheduler process the item again, hit the "already reported" check, and the loop starts.
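
For illustration, here is a minimal sketch of that failure mode. The names loosely mirror the scheduler code, but the bodies are simplified assumptions based on the description above, not the actual Zuul source:

    # Hypothetical sketch of the loop, assuming the reported flag is set
    # before the report to Gerrit actually completes.

    class Item(object):
        def __init__(self, change):
            self.change = change
            self.reported = False

    def report_item(item, send_report):
        if item.reported:
            # The guard at scheduler.py#n1409.
            raise Exception("Already reported change %s" % item.change)
        item.reported = True   # flag set (around #n1447) before reporting finishes
        send_report(item)      # if this raises (e.g. an SSH failure talking to
                               # Gerrit), the item stays in the pipeline with
                               # reported=True

    def run_handler(items, send_report):
        # Every subsequent pass over the queue hits the guard for the stuck
        # item and raises; the run handler logs the exception, wakes up again,
        # and repeats indefinitely.
        for item in items:
            report_item(item, send_report)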

Revision history for this message
Clark Boylan (cboylan) wrote :

This line https://git.openstack.org/cgit/openstack-infra/zuul/tree/zuul/scheduler.py#n1415 threw an SSHException when a paramiko operation failed.

I think I can fix this by changing our retry logic slightly so that we retry gracefully on SSH failures.
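
Roughly, the kind of retry wrapper that would absorb a transient SSH failure instead of letting it escape after the reported flag is set might look like the following. This is only a sketch; `send_report` and the retry parameters are hypothetical and this is not the code in the proposed review:

    import time
    import paramiko

    def report_with_retry(send_report, item, attempts=3, delay=5):
        """Retry the Gerrit report on transient SSH errors so a single
        paramiko failure does not leave the item stuck as reported."""
        for attempt in range(1, attempts + 1):
            try:
                return send_report(item)
            except paramiko.SSHException:
                if attempt == attempts:
                    raise          # give up; let the caller handle a real failure
                time.sleep(delay)  # back off briefly before trying again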

Revision history for this message
Antoine "hashar" Musso (hashar) wrote :

Fix proposed by Clark Boylan at https://review.openstack.org/110066 has been merged.

Changed in zuul:
status: Triaged → Fix Committed