health checks being queued - multinode

Bug #1220414 reported by Tyler Baker
This bug affects 1 person
Affects: LAVA Scheduler (deprecated)
Status: Fix Released
Importance: High
Assigned to: Neil Williams
Milestone: (none)

Bug Description

When bringing boards back online after the multinode upgrade, I noticed a small issue: any jobs queued before the health check will run _FIRST_.

For example:

http://validation.linaro.org/scheduler/job/70590 - This is the queued job

http://validation.linaro.org/scheduler/job/70672 - This is the health check

The queued job was run before the health check had passed.

Neil Williams (codehelp) wrote :

http://playground.validation.linaro.org/scheduler/job/156 - submitted whilst all arndales were offline on playground.

http://playground.validation.linaro.org/scheduler/job/157 - the health check which should have been run first.

Changed in lava-scheduler:
status: New → Confirmed
assignee: nobody → Neil Williams (codehelp)
Neil Williams (codehelp) wrote :

The code only distinguishes between a newly generated health check and submitted jobs. If the health check cannot run immediately, it just becomes another submitted job and the sequence is then calculated by submit_time.

Need to check the job to see if it is a health check.

Also, the job_list is an unordered set, so the ordering from the DB query is being lost. Patch in development.
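A minimal sketch of the ordering described above, not the actual patch. The `Job` class, its `is_health_check` flag and the `order_jobs` helper are hypothetical names for illustration, and the timestamps are made up; only the job numbers come from this report. It sorts health checks ahead of other submitted jobs, falls back to submit_time, and returns a list so the order is not lost the way it is with a set.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Job:
    # Hypothetical stand-in for a scheduler job record.
    job_id: int
    submit_time: datetime
    is_health_check: bool = False


def order_jobs(jobs):
    """Return jobs as a list: health checks first, then by submit_time."""
    # False sorts before True, so `not is_health_check` puts health checks first.
    return sorted(
        jobs,
        key=lambda j: (not j.is_health_check, j.submit_time, j.job_id),
    )


# Illustrative timestamps: the ordinary job was submitted before the health check.
queued = Job(70590, datetime(2013, 9, 3, 10, 0))
health = Job(70672, datetime(2013, 9, 3, 11, 0), is_health_check=True)

# Even when the candidates arrive as an unordered set, the result is an ordered list.
ordered = order_jobs({queued, health})
print([j.job_id for j in ordered])
```

Sorting on a tuple keeps submit_time as the tie-breaker within each group, so ordinary jobs still run in submission order once the health check has gone first.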

Changed in lava-scheduler:
status: Confirmed → In Progress
Neil Williams (codehelp) wrote : Re: [Bug 1220414] [NEW] health checks being queued - multinode

In addition to the fix currently being tested, Dave requested a new device
state of Unavailable, for devices which are currently being
fixed and are unlikely to be put back online imminently but are not fully
retired. This allows working boards to be taken offline for general tasks
on the server or in the lab, whilst allowing admins to manually move
boards to a state in which they are not available for newly submitted
jobs.

--

Neil Williams
=============
http://www.linux.codehelp.co.uk/

Dave Pigott (dpigott)
Changed in lava-scheduler:
importance: Undecided → High
Neil Williams (codehelp) wrote :

Further tests on community.validation.linaro.org show that a change of Priority would be insufficient. At the point where the code was calculating which devices could be assigned to which jobs, there were *no* health check jobs in the list returned by the status=TestJob.SUBMITTED filter. So the dispatcher, running lava_scheduler_daemon and with write access to the DB, saw that the devices were IDLE and assigned the MultiNode jobs. The very next run of the Refreshing Jobs() loop showed the health checks, but by then it was too late. The greedy scheduler model had assigned jobs to the devices, just as it should.

So the problem is that the device needs to be marked such that the first job assigned to the device *must* be a health check. A new health status of HEALTH_ASSIGN is proposed. _fix_device will refuse to assign a job to a device in health state Looping or Assign unless that job is a health check. On completion, the health status is updated and normal scheduling proceeds.
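The proposed rule can be sketched roughly as below. This is an illustration under assumptions, not the code that was committed: the `HealthStatus` enum, the `Device` and `Job` classes, `can_assign`, `finish_health_check` and the hostname are all hypothetical, and only the Looping and Assign states are named in this report.

```python
from dataclasses import dataclass
from enum import Enum


class HealthStatus(Enum):
    # Illustrative members; only Looping and Assign are named in this report.
    UNKNOWN = "unknown"
    PASS = "pass"
    FAIL = "fail"
    LOOPING = "looping"
    ASSIGN = "assign"  # stands in for the proposed HEALTH_ASSIGN


@dataclass
class Device:
    hostname: str
    health: HealthStatus = HealthStatus.UNKNOWN


@dataclass
class Job:
    job_id: int
    is_health_check: bool = False


# Health states in which the next job on the device must be a health check.
HEALTH_CHECK_ONLY = {HealthStatus.LOOPING, HealthStatus.ASSIGN}


def can_assign(device, job):
    """Refuse ordinary jobs while the device is waiting for a health check."""
    if device.health in HEALTH_CHECK_ONLY:
        return job.is_health_check
    return True


def finish_health_check(device, passed):
    # Assumption: a device held in ASSIGN moves to PASS or FAIL on completion.
    if device.health is HealthStatus.ASSIGN:
        device.health = HealthStatus.PASS if passed else HealthStatus.FAIL


board = Device("arndale01", HealthStatus.ASSIGN)  # hypothetical hostname
multinode_job = Job(156)
health_check = Job(157, is_health_check=True)

before = (can_assign(board, multinode_job), can_assign(board, health_check))
finish_health_check(board, passed=True)
after = can_assign(board, multinode_job)
print(before, after)
```

Putting the guard on the device rather than on job priority means it holds even when no health check job is yet visible in the submitted list, which is the race described above.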

Dave Pigott (dpigott) wrote :

Can someone do the code review as a matter of urgency?

Changed in lava-scheduler:
status: In Progress → Fix Committed
Dave Pigott (dpigott)
Changed in lava-scheduler:
status: Fix Committed → Fix Released