Bug #589521 “nagios monitoring of package imports needed” : Bugs : Ubuntu Distributed Development

Revision history for this message

John A Meinel (jameinel) wrote on 2011-01-26:

#1

Download full text (5.4 KiB)

The original rt #39614 that related to this was about nagios integration, but grew into moving the package importer under losa control. So I'm posting a bit of 'why this can and cannot work today' to this bug. Though really there are probably several bugs that should be split out of my post.

We talked a fair amount of this at the recent sprint. I wasn't aware of this rt, though I was aware of trying to get the importer under LOSA control.

For a quick summary:
I think we can migrate to another machine with minimal fuss.
We'll still need direct login to the new machine for the foreseeable
future because most maintenance tasks (restarting a failing import)
require manual intervention.
I would like to see at least a little nagios integration, so that we
can move polling the state of the import from being manually done to
being automated.

At the moment, there are a few steps of this, which I think are relevant.

1) package-import is currently monitored manually. Which prior to this
week basically meant whenever James Westby got around to checking
on it. (Or someone complained sufficiently about a failure.)

It would be nice to get some level of nagios warning/critical so
that we don't have to manually poll the service.

   Since the imports aren't perfect yet, we can't just say "we have any
   failing imports", but we could say "we normally have 500 failed
   imports, and now we have 1000". Which would help catch the "can no
   longer reach archive.debian.org through Canonical's firewall" cases.

   As we improve the UDD workflow, eventually this sort of
   infrastructure either becomes critical, or becomes obsolete. (People
   start depending on the branches to exist, but they may also start
   creating the branches directly, rather than having the importer
   doing the work.)

2) Jubany is a powerful server which is meant to be assigned to another task.
  a) We shouldn't need this much hardware. It really depends on the QoS
     we want to provide after major updates. Most of the time there
     aren't huge numbers of packages getting .deb updates. Except when
     we open up a new release series, etc. Also notable here are when
     we fix a major bug and suddenly 600 packages need to be re-scanned.
  b) Load on the system can probably be easily tuned by how many
     parallel imports we run. On Jubany it is 8. This relates to how
     many CPUs, how much peak memory, etc.
  c) The code isn't particularly optimized for low load per import yet.
     Depends on whether it is better to tweak that, or just spend $ for
     more hardware.
  d) The system doesn't scale to multiple machines particularly well.
     It currently uses an sqlite database for tracking its state. We
     could probably migrate it to a postgres db, etc, and then have a
     clearer way to scale it horizontally. (Ideally you could run it as
     a cloud-ish service, and then on a new release just fire up 20
     instances to churn through the queue.)

  e) Anyway, no real blockers *today* to just hosting the service on a
     new machine, as long as the state gets copied over correctly.
     (just copying the /srv/package-imp...

The original rt #39614 that related to this was about nagios integration, but grew into moving the package importer under losa control. So I'm posting a bit of 'why this can and cannot work today' to this bug. Though really there are probably several bugs that should be split out of my post.

We talked a fair amount of this at the recent sprint. I wasn't aware of this rt, though I was aware of trying to get the importer under LOSA control.

For a quick summary:
 I think we can migrate to another machine with minimal fuss.
 We'll still need direct login to the new machine for the foreseeable 
 future because most maintenance tasks (restarting a failing import) 
 require manual intervention.
 I would like to see at least a little nagios integration, so that we 
 can move polling the state of the import from being manually done to 
 being automated.
 
At the moment, there are a few steps of this, which I think are relevant.

1) package-import is currently monitored manually. Which prior to this  
   week basically meant whenever James Westby got around to checking
   on it. (Or someone complained sufficiently about a failure.)

It would be nice to get some level of nagios warning/critical so 
   that we don't have to manually poll the service.

Since the imports aren't perfect yet, we can't just say "we have any 
   failing imports", but we could say "we normally have 500 failed 
   imports, and now we have 1000". Which would help catch the "can no
   longer reach archive.debian.org through Canonical's firewall" cases.

As we improve the UDD workflow, eventually this sort of 
   infrastructure either becomes critical, or becomes obsolete. (People 
   start depending on the branches to exist, but they may also start 
   creating the branches directly, rather than having the importer 
   doing the work.)

2) Jubany is a powerful server which is meant to be assigned to another task.
  a) We shouldn't need this much hardware. It really depends on the QoS 
     we want to provide after major updates. Most of the time there 
     aren't huge numbers of packages getting .deb updates. Except when 
     we open up a new release series, etc. Also notable here are when 
     we fix a major bug and suddenly 600 packages need to be re-scanned.
  b) Load on the system can probably be easily tuned by how many 
     parallel imports we run. On Jubany it is 8. This relates to how 
     many CPUs, how much peak memory, etc.
  c) The code isn't particularly optimized for low load per import yet. 
     Depends on whether it is better to tweak that, or just spend $ for 
     more hardware.
  d) The system doesn't scale to multiple machines particularly well. 
     It currently uses an sqlite database for tracking its state. We 
     could probably migrate it to a postgres db, etc, and then have a 
     clearer way to scale it horizontally. (Ideally you could run it as 
     a cloud-ish service, and then on a new release just fire up 20 
     instances to churn through the queue.)

e) Anyway, no real blockers *today* to just hosting the service on a 
     new machine, as long as the state gets copied over correctly.
     (just copying the /srv/package-import.canonical.com/new directory 
     is probably sufficient.)

3) Restricting access to the machine, so that LOSAs are the only ones with direct access.

a) The way the process works today, this is not feasible.
     At least partially because of the sqlite state db. We would need 
     to expose a fair amount of functionality in order to do regular 
     maintenance.

b) For example, a package can get a temporary failure (connection 
     reset, etc). If this failure hasn't been seen before, it gets 
     marked as failing immediately, and needs manual intervention to 
     get going again.
     It would be possible to add automatic retry for all failures. 
     James was concerned about data corruption, but has state that he 
     hasn't seen any failures that would corrupt anything if they were 
     tried again. Stuff that would cause inconsistency at least 
     consistently fails.

c) On the other hand, there are still some race conditions, which 
     means that a package can get wedged if someone is adding new data 
     to the packaging branch, which the importer is trying to also add 
     to. Resolving this is still *very* manual, as it involves figuring 
     out what actually happened, then resetting state accordingly.

Some of this could be a button click "ignore local state". But the 
     importer actively processes multiple branches at a time, so it is 
     possible to have the state of the Launchpad branches get in a 
     weird state. (upstream vs debian vs ubuntu branches could all have 
     tags pointing at different revisions, claiming that they were all 
     'upstream-release-1.2')

If we really want to get this machine hidden behind the iron 
     curtain, then as we encounter issues, we can slowly generate more 
     external openings for us to fix things.

However, it is going to be a while before we have enough of them 
     to not avoid pestering a LOSA to do something specific for each of 
     the 500-ish failures we have today.

4) If we are doing major rework of the import system, we might consider 
   trying to rephrase the problem in terms of the vcs-import system. 
   Which already has hardware and process infrastructure to handle some 
   similar issues. (I need to import #N jobs, fire them off, make sure 
   they are running, report when they are failing, etc.)

Revision history for this message

James Westby (james-w) wrote on 2011-01-26: Re: [Bug 589521] Re: nagios monitoring of package imports needed

#2

Download full text (3.9 KiB)

On Wed, 26 Jan 2011 18:09:59 -0000, John A Meinel <email address hidden> wrote:
> For a quick summary:
> I think we can migrate to another machine with minimal fuss.
> We'll still need direct login to the new machine for the foreseeable
> future because most maintenance tasks (restarting a failing import)
> require manual intervention.

It would be great to make that clicky-clicky in the web UI, but I don't
think that's a trivial task.

> 1) package-import is currently monitored manually. Which prior to this
> week basically meant whenever James Westby got around to checking
> on it. (Or someone complained sufficiently about a failure.)
>
> It would be nice to get some level of nagios warning/critical so
> that we don't have to manually poll the service.

There was some discussion at the sprint about not overloading the LOSAs
with this, and perhaps notifying "us" rather than them when something
was wrong, but that would seem to be in conflict with it being a fully
LOSA-managed service.

> 2) Jubany is a powerful server which is meant to be assigned to another task.
> a) We shouldn't need this much hardware. It really depends on the QoS
> we want to provide after major updates. Most of the time there
> aren't huge numbers of packages getting .deb updates. Except when
> we open up a new release series, etc. Also notable here are when
> we fix a major bug and suddenly 600 packages need to be
> re-scanned.

The new series case could be optimised. Currently it does it the dumb
way, and we are just careful to stop the importer while we do the
shuffle of branches on codehosting to get the optimum disk usage from
stacking.

> b) Load on the system can probably be easily tuned by how many
> parallel imports we run. On Jubany it is 8. This relates to how
> many CPUs, how much peak memory, etc.

It's trivial to tune this at run time.

IIRC 8 was the limit becuase any higher was adversely affecting
codehosting at times. That may have been due to the bug that John fixed
where we weren't reusing SSH connections correctly.

It would be good to know what the bottleneck is anyway, disk I/O or
network communication/codehosting.

> d) The system doesn't scale to multiple machines particularly well.
> It currently uses an sqlite database for tracking its state. We
> could probably migrate it to a postgres db, etc, and then have a
> clearer way to scale it horizontally. (Ideally you could run it as
> a cloud-ish service, and then on a new release just fire up 20
> instances to churn through the queue.)

There's another RT for postgres. Aside from that it currently uses file
locks to ensure only one process is active per-package at one time. It
may be possible to avoid that, or use something else for locking, but
there's nothing that it is doing that means it couldn't be scalable in
this manner.

> 3) Restricting access to the machine, so that LOSAs are the only ones
> with direct access.
>
> a) The way the process works today, this is not feasible.
> At least partially because of the sqlite state db. We would need
> to expose a fair amount of functional...

On Wed, 26 Jan 2011 18:09:59 -0000, John A Meinel <john@arbash-meinel.com> wrote:
> For a quick summary:
>  I think we can migrate to another machine with minimal fuss.
>  We'll still need direct login to the new machine for the foreseeable 
>  future because most maintenance tasks (restarting a failing import) 
>  require manual intervention.

It would be great to make that clicky-clicky in the web UI, but I don't
think that's a trivial task.

> 1) package-import is currently monitored manually. Which prior to this  
>    week basically meant whenever James Westby got around to checking
>    on it. (Or someone complained sufficiently about a failure.)
> 
>    It would be nice to get some level of nagios warning/critical so 
>    that we don't have to manually poll the service.

There was some discussion at the sprint about not overloading the LOSAs
with this, and perhaps notifying "us" rather than them when something
was wrong, but that would seem to be in conflict with it being a fully
LOSA-managed service.

> 2) Jubany is a powerful server which is meant to be assigned to another task.
>   a) We shouldn't need this much hardware. It really depends on the QoS 
>      we want to provide after major updates. Most of the time there 
>      aren't huge numbers of packages getting .deb updates. Except when 
>      we open up a new release series, etc. Also notable here are when 
>      we fix a major bug and suddenly 600 packages need to be
>   re-scanned.

The new series case could be optimised. Currently it does it the dumb
way, and we are just careful to stop the importer while we do the
shuffle of branches on codehosting to get the optimum disk usage from
stacking.

>   b) Load on the system can probably be easily tuned by how many 
>      parallel imports we run. On Jubany it is 8. This relates to how 
>      many CPUs, how much peak memory, etc.

It's trivial to tune this at run time.

IIRC 8 was the limit becuase any higher was adversely affecting
codehosting at times. That may have been due to the bug that John fixed
where we weren't reusing SSH connections correctly.

It would be good to know what the bottleneck is anyway, disk I/O or
network communication/codehosting.

>   d) The system doesn't scale to multiple machines particularly well. 
>      It currently uses an sqlite database for tracking its state. We 
>      could probably migrate it to a postgres db, etc, and then have a 
>      clearer way to scale it horizontally. (Ideally you could run it as 
>      a cloud-ish service, and then on a new release just fire up 20 
>      instances to churn through the queue.)

There's another RT for postgres. Aside from that it currently uses file
locks to ensure only one process is active per-package at one time. It
may be possible to avoid that, or use something else for locking, but
there's nothing that it is doing that means it couldn't be scalable in
this manner.

> 3) Restricting access to the machine, so that LOSAs are the only ones
> with direct access.
> 
>   a) The way the process works today, this is not feasible.
>      At least partially because of the sqlite state db. We would need 
>      to expose a fair amount of functionality in order to do regular 
>      maintenance.

Or request the LOSAs to do it on your behalf I guess? I imagine that
would become tiring to everyone very rapidly.

> 4) If we are doing major rework of the import system, we might consider 
>    trying to rephrase the problem in terms of the vcs-import system. 
>    Which already has hardware and process infrastructure to handle some 
>    similar issues. (I need to import #N jobs, fire them off, make sure 
>    they are running, report when they are failing, etc.)

I started drafting some notes for this:

https://dev.launchpad.net/Code/PackageImporter

Perhaps some of this has already been solved with e.g. the bzr-git
caches, and so it would be easier to just add a new vcs-imports job
type.

Thanks,

James

Revision history for this message

Martin Pool (mbp) wrote on 2011-01-26:

#3

One thing from the discussion was that I think we should probably not use IS's nagios etc for monitoring. Instead, let's make its own web ui really clear and/or set up external specific monitoring.

Revision history for this message

Martin Pool (mbp) wrote on 2011-01-26:

#4

I think the correct state of this is:

* we are no longer specifically trying to make this a losa-managed service
* it's fine if ~canonical-bazaar can administer this machine

> 1) package-import is currently monitored manually.

Let's focus on getting good clear accurate reporting out of it, eg fixing the bug that currently-being-retried jobs are not shown in the list. Get it to the point where we can glance at the page and immediately see if it's healthy, declining, or critical. Then we can all bookmark and poll it. This service is not yet at the stage where we need to generate interrupts immediately.

> 2) Jubany is a powerful server which is meant to be assigned to another task.

If IS are keen to move it, they are welcome to do so as long as it doesn't break, and we will help. Moving to a less powerful machine is not a goal for us.

> 3) Restricting access to the machine, so that LOSAs are the only ones with direct access.

Wontfix.

> 4) If we are doing major rework of the import system, we might consider trying to rephrase the problem in terms of the vcs-import system.

Let's think about what would let them use common infrastructure, but actually doing it is not a priority.

Ubuntu Distributed Development

nagios monitoring of package imports needed

Bug Description

Other bug subscribers

Remote bug watches