LAVA Scheduler (deprecated)

health checks being queued - multinode

Bug #1220414 reported by Tyler Baker on 2013-09-03

This bug affects 1 person

Affects                       Status         Importance   Assigned to     Milestone
LAVA Scheduler (deprecated)   Fix Released   High         Neil Williams

Bug Description

When bringing boards back online after the multinode upgrade, I noticed a small issue: any jobs queued before the health check will run _FIRST_.

For example:

http://validation.linaro.org/scheduler/job/70590 - This is the queued job

http://validation.linaro.org/scheduler/job/70672 - This is the health check

The queued job was run before the health check had passed.

Neil Williams (codehelp) wrote on 2013-09-04:

#1

http://playground.validation.linaro.org/scheduler/job/156 - submitted whilst all arndales were offline on playground.

http://playground.validation.linaro.org/scheduler/job/157 - the health check which should have been run first.

Changed in lava-scheduler:
status:	New → Confirmed
assignee:	nobody → Neil Williams (codehelp)

Neil Williams (codehelp) wrote on 2013-09-04:

#2

The code only distinguishes between a newly generated health check and submitted jobs. If the health check cannot run immediately, it just becomes another submitted job and the sequence is then calculated by submit_time.

Need to check the job to see if it is a health check.

Also, the job_list is an unordered set, so the ordering from the DB query is being lost. Patch in development.
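The ordering problem described above can be sketched as follows. This is only an illustration, not the scheduler's actual code: the `Job` class, its `is_health_check` field, and the `order_jobs` helper are hypothetical names, and the submit times are made up. It assumes health checks should sort ahead of ordinary jobs, with submit_time as the tie-breaker, and that the result is kept in a list so the order survives.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Job:
    # Hypothetical stand-in for a scheduler job record.
    job_id: int
    submit_time: datetime
    is_health_check: bool = False


def order_jobs(jobs):
    """Return jobs as a list: health checks first, then by submit_time."""
    # False sorts before True, so negating the flag puts health checks first.
    return sorted(
        jobs,
        key=lambda j: (not j.is_health_check, j.submit_time, j.job_id),
    )


# Illustrative timestamps only: the queued job is submitted before the health check.
queued = Job(70590, datetime(2013, 9, 3, 9, 0))
health = Job(70672, datetime(2013, 9, 3, 9, 30), is_health_check=True)

# A set has no defined order, so the sorted result is kept as a list.
ordered = order_jobs({queued, health})
print([j.job_id for j in ordered])
```

Sorting by submit_time alone would put job 70590 first; including the health check flag in the sort key is what moves 70672 ahead of it.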

Changed in lava-scheduler:
status:	Confirmed → In Progress

Neil Williams (codehelp) wrote on 2013-09-05: Re: [Bug 1220414] [NEW] health checks being queued - multinode

#3

In addition to the fix currently being tested, Dave requested a new device
state of Unavailable, used for devices which are currently being
fixed and are unlikely to be put back online imminently but are not fully
retired. This allows working boards to be put offline for general tasks
on the server or in the lab, whilst allowing admins to manually move
boards to a state in which they are not available for newly submitted
jobs.

--

Neil Williams
=============
http://www.linux.codehelp.co.uk/

Dave Pigott (dpigott) on 2013-09-05

Changed in lava-scheduler:
importance:	Undecided → High

Neil Williams (codehelp) wrote on 2013-09-09:

#4

Further tests on community.validation.linaro.org show that a change of Priority would be insufficient. At the point where the code was calculating which devices could be assigned to which jobs, there were *no* health check jobs in the list returned by the status=TestJob.SUBMITTED filter. So the dispatcher, running lava_scheduler_daemon and with write access to the DB, saw that the devices were IDLE and assigned the MultiNode jobs. The very next run of the Refreshing Jobs() loop showed the health checks, but by then it was too late. The greedy scheduler model had assigned jobs to the devices, just as it should.

So the problem is that the device needs to be marked such that the first job assigned to the device *must* be a health check. A new health status of HEALTH_ASSIGN is proposed. _fix_device will refuse to assign a job to a device in health state Looping or Assign unless that job is a health check. On completion, the health status is updated and normal scheduling proceeds.
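A minimal sketch of the proposed gate, under stated assumptions: the `Health` enum, `can_assign`, and `after_health_check` are hypothetical names chosen for illustration, and the real `_fix_device` logic is not reproduced here. The sketch only shows the rule that a device in health state Looping or Assign accepts nothing but a health check, and that the state is updated once the check completes.

```python
from enum import Enum


class Health(Enum):
    # Hypothetical health states; only LOOPING and ASSIGN matter for the gate.
    UNKNOWN = "unknown"
    PASS = "pass"
    FAIL = "fail"
    LOOPING = "looping"
    ASSIGN = "assign"


def can_assign(device_health, job_is_health_check):
    """Return True if a job may be assigned to a device in this health state."""
    if device_health in (Health.LOOPING, Health.ASSIGN):
        # The first job on such a device must be a health check.
        return job_is_health_check
    return True


def after_health_check(passed):
    """Return the health state to record when a health check completes."""
    return Health.PASS if passed else Health.FAIL


# A board coming back online is marked ASSIGN, so a queued job is refused...
print(can_assign(Health.ASSIGN, job_is_health_check=False))
# ...while the health check is accepted.
print(can_assign(Health.ASSIGN, job_is_health_check=True))
# Once the check passes, normal scheduling proceeds.
print(can_assign(after_health_check(True), job_is_health_check=False))
```

Marking the device rather than raising the job's priority avoids the race described in this comment: the gate holds even when the health check job has not yet appeared in the submitted list.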

Dave Pigott (dpigott) wrote on 2013-09-27:

#5

Can someone do the code review as a matter of urgency?

Antonio Terceiro (terceiro) on 2013-10-04

Changed in lava-scheduler:
status:	In Progress → Fix Committed

Dave Pigott (dpigott) on 2013-10-18

Changed in lava-scheduler:
status:	Fix Committed → Fix Released
