Develop email-in tracer tool

Bug #770428 reported by Paul Everitt
This bug affects 1 person
Affects: KARL3
Status: Fix Released
Importance: Medium
Assigned to: Christian Zagrodnick

Bug Description

We don't have a reliable way to know when email-in is wedged. Rather than trying to monitor each part of the end-to-end facility, we should instead monitor the failure of a tracer email to be received.

The idea is that something, somewhere (hopefully inside the OSF NYC LAN) sends a tracer email to KARL every N minutes. This won't be a blog entry, as we don't want some dummy blog filling up and the content feeds getting blasted.

Instead, we'll develop a new tool, either as part of a community or something hanging off the top. This tool will record receipt of the email. We'll also make sure that if the tracer wasn't received in the last N minutes, monitoring gets triggered.

Some notes:

- Exercise as much of the machinery as possible. We don't really need to test the ability to create a blog entry or blog comment. If it goes out of the NYC network, through the Internets tubes, into the outer Postfix, through its spam filters, to the inner, through repoze.postoffice, and is picked up by the daemon, that will handle 99% of the problems.

- Make sure admins like Robert/Nat/Paul can go and see something about the tracer. Most likely the admin screen.

- Document, in that tool, the email address that is used for the To: on the tracer. That lets people send the email manually if they want to.

Tags: r3.62
Changed in karl3:
milestone: m55 → m54
Changed in karl3:
milestone: m54 → m55
Changed in karl3:
milestone: m55 → m56
Christian Zagrodnick (zagy) wrote :

A very easy thing to check is a file's age. So if a file like "tracer-last-received" were touched whenever the tracer is received, the nagios check would be very simple.

Paul Everitt (paul-agendaless) wrote : Re: [Bug 770428] Re: Develop email-in tracer tool

(I made Nat and Robert.)

Zagy suggested: "A very easy thing to check is a file's age. So if a file like 'tracer-last-received' were touched whenever the tracer is received, the nagios check would be very simple."

That's a clever suggestion if the goal is to just measure repoze.postoffice and whether it is functioning. Though we wouldn't know what threshold to set it at. What constitutes "too long ago" for KARL?

If we're planning on doing an end-to-end test, meaning writing blog content into a KARL community, we're back to some of the issues in the LP ticket. Robert, did you have in mind something that required no KARL development?

I'll admit, writing a new blog entry every N minutes has some negative consequences.

Is this something that we need to have a roadmap discussion about and move this to the top of the list, and get exclusive immediate use of Rossi's available budget until it is done?

--Paul

On May 10, 2011, at 10:23 AM, Christian Zagrodnick wrote:

> A very easy thing to check is a file's age. So if a file like
> "tracer-last-received" were touched whenever the tracer is received,
> the nagios check would be very simple.

Chris Rossi (chris-archimedeanco) wrote :

On Tue, May 10, 2011 at 10:38 AM, Paul Everitt <email address hidden> wrote:

> (I made Nat and Robert.)
>
> That's a clever suggestion if the goal is to just measure
> repoze.postoffice and whether it is functioning. Though we wouldn't
> know what threshold to set it at. What constitutes "too long ago" for
> KARL?
>
>
I figure it would be 2 or 3 times the period of the tracer email. So, if we're
sending a tracer email every 5 minutes, then 10 or 15 minutes is too long.
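
The rule of thumb above can be written down as a small helper. This is only a sketch: the function name and the default factors of 2 and 3 are illustrative assumptions taken from the "2 or 3 times the period" estimate, not part of KARL.

```python
def staleness_thresholds(period_minutes, warn_factor=2, crit_factor=3):
    """Return (warning, critical) ages in minutes for a given tracer period.

    The factors are assumptions based on the "2 or 3 times the period"
    rule of thumb; tune them to the monitoring setup.
    """
    return period_minutes * warn_factor, period_minutes * crit_factor


# A tracer sent every 5 minutes is "too old" after 10 (warn) or 15 (critical).
print(staleness_thresholds(5))  # (10, 15)
```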

> If we're planning on doing an end-to-end test, meaning writing blog
> content into a KARL community, we're back to some of the issues in the
> LP ticket.

Well, we had been discussing creating a special handler inside of Karl that
would note the receipt of the tracer and then tweak a timestamp in the
database that could be checked, instead of writing a new blog entry for
every tracer email. It would test 90% of the machinery and catch closer to
99% of the problems we typically have. Zagy's suggestion is really exactly
the same except that instead we touch a file on the filesystem. By using a
mailin handler inside of Karl to do it, we still get to test almost the
entirety of the path, but we avoid a bunch of unnecessary database
transactions. So I'm in favor of this.

Robert Marianski (rmarianski) wrote :

I was assuming these tracer emails would just be blog post entries, and then the test would be to verify that unique text, like a uuid or timestamp, was present on the page.

What are the negative consequences of just sending an email to a separate private community? If we want to eliminate the content feeds side effect, we can create a separate dummy user just for this tracer private community.

I guess this makes the test to verify that the mail went through more complicated, but it might be easier to move the complexity there than to build something separate in karl.

Paul Everitt (paul-agendaless) wrote :

On May 10, 2011, at 10:59 AM, Robert Marianski wrote:

> I was assuming these tracer emails would just be blog post entries, and
> then the test would be to verify that unique text, like a uuid or
> timestamp, was present on the page.

We'd have to teach Nagios to have a login. Its requests are currently anonymous. We'd be moving some of the "Is KARL ok?" away from the error monitor page.

> What are the negative consequences of just sending an email to a
> separate private community? If we want to eliminate the content feeds
> side effect, we can create a separate dummy user just for this tracer
> private community.

We'll accumulate around 10,000 new pieces of bogus content a month. I suppose search/catalog time after a while will bog down.

On the plus side, you'll double the amount of "content" in KARL in about half a year I suppose. :)

--Paul

> I guess this makes the test to verify that the mail went through more
> complicated, but it might be easier to move the complexity there than to
> build something separate in karl.

Chris Rossi (chris-archimedeanco) wrote :

On Tue, May 10, 2011 at 10:59 AM, Robert Marianski <
<email address hidden>> wrote:

> I was assuming these tracer emails would just be blog post entries, and
> then the test would be to verify that unique text, like a uuid or
> timestamp, was present on the page.
>
> What are the negative consequences of just sending an email to a
> separate private community? If we want to eliminate the content feeds
> side effect, we can create a separate dummy user just for this tracer
> private community.
>
> I guess this makes the test to verify that the mail went through more
> complicated, but it might be easier to move the complexity there than to
> build something separate in karl.
>
>
Well, actually, now that I remember, what we had actually discussed is a
specialized blog tool in a specialized community. So, actually, all of the
mailin machinery would be exercised but you wouldn't end up with an enormous
number of blog entries stored unnecessarily in your database. If we're
sending a tracer email every 5 minutes, that's 288 blog posts a day. After
just a week, your test community has 2016 blog posts in it, and growing.
That just doesn't seem sustainable.
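
The volume figures are easy to re-derive. The helper below is a hypothetical illustration (its name is not from KARL); it counts how many tracer posts accumulate for a given sending period.

```python
MINUTES_PER_DAY = 24 * 60


def tracer_posts(period_minutes, days):
    """Number of tracer emails sent over `days` at one per `period_minutes`."""
    return (MINUTES_PER_DAY // period_minutes) * days


print(tracer_posts(5, 1))   # 288
print(tracer_posts(5, 7))   # 2016
print(tracer_posts(5, 30))  # 8640
```

The 30-day figure is of the same order as the "around 10,000" pieces of content a month mentioned earlier in the thread.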

So the idea was to just create an object that looks for all the world like a
blog for the mailin tool to write a blog entry to, but instead of storing
each blog entry it would simply not the time it had last received a blog
entry. So you still have a bunch of transactions, but at least not a ton of
data you're trying to store unnecessarily. Zagy's idea also eliminates the
unnecessary transactions, but by using this dummy blog tool, we still
exercise all of the mailin machinery end to end.
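
A minimal sketch of that idea, assuming a hypothetical `MailinTraceTool` class with an `add_entry` method; neither name is KARL's actual API. The object accepts an entry the way a blog would, throws the content away, and only touches a file so that its modification time records the last receipt.

```python
import os
import time


class MailinTraceTool:
    """Hypothetical stand-in for a blog: accepts an entry but stores nothing."""

    def __init__(self, trace_file):
        self.trace_file = trace_file

    def add_entry(self, subject, body):
        # Discard the content; only remember that something arrived.
        self.touch()

    def touch(self, when=None):
        when = time.time() if when is None else when
        # Create the file if it does not exist, leaving any content alone.
        with open(self.trace_file, "a"):
            pass
        # Set both access and modification time to `when`.
        os.utime(self.trace_file, (when, when))
```

Because nothing is written to the database, the number of tracer emails has no effect on stored content or on search.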

Chris

Nat Katin-Borland (nborland) wrote :

Good point about that much content potentially impacting search. About how much time would we need to implement something like this? Then we can run it by Tom. I don't want to wait too long to have something in place because we've already had 2 periods of downtime in the last 6 days and we're going to start to lose users if email isn't reliably delivered.

-Nat

--
Nathaniel Katin-Borland
Support Specialist
Knowledge Management Initiative
KARL Support Team

Open Society Foundations - New York Office
400 West 59th Street
New York, NY 10019
Email: <email address hidden>
Phone: 212-547-6984
http://www.soros.org/
http://www.karlproject.org

Paul Everitt (paul-agendaless) wrote :

Maybe 4 hours to make the special-purpose tool, done in a way that makes sense. (We have some more things to discuss on this.) Probably 1 hour to make the error monitor report it, then an hour to get all parts of it into place. Add padding as appropriate.

--Paul

On May 10, 2011, at 11:59 AM, Nat Katin-Borland wrote:

> Good point about that much content potentially impacting search. About
> how much time would we need to implement something like this? Then we
> can run it by Tom. I don't want to wait too long to have something in
> place because we've already had 2 periods of downtime in the last 6
> days and we're going to start to lose users if email isn't reliably
> delivered.
>
> -Nat

Nat Katin-Borland (nborland) wrote :

Tom says let's move forward on this!


Paul Everitt (paul-agendaless) wrote :

I'm going to get it on our nearer-term radar.

Changed in karl3:
importance: Low → Medium
milestone: m61 → m56
Chris Rossi (chris-archimedeanco) wrote :

I have created a tool like the one described above. It can be tested on staging by sending an email to this community:

https://karlstaging.gocept.com/branch1/osf/communities/mailin-trace/blog

There is now a new karlserve command, 'create_mailin_trace' which is used to configure the mailin trace tool:

osfkarltest@osfkarltest10 ~/staging/branch1 $ bin/karlserve create_mailin_trace --help
usage: karlserve create_mailin_trace [-h] instance community file

positional arguments:
  instance    Instance name.
  community   Community name.
  file        Path to file to touch when a tracer email is received.

optional arguments:
  -h, --help  show this help message and exit
osfkarltest@osfkarltest10 ~/staging/branch1 $ bin/karlserve create_mailin_trace osf mailin-trace ~/staging/branch1/var/mailin.trace
Added mailin trace tool at: /communities/mailin-trace/blog
The mailin trace file is: /srv/osfkarltest/staging/branch1/var/mailin.trace
You must restart the mailin daemon in order for the new settings to take effect.

Note that the tool will disappear and need to be reconfigured on staging after every nightly sync.

It is assumed that generating the tracer email and setting up the monitor in production are tasks for OSF and gocept respectively.

Changed in karl3:
status: New → Fix Committed
Paul Everitt (paul-agendaless) wrote :

Next step is email getting sent in. I will talk to Chris about this after he returns.

Changed in karl3:
milestone: m56 → m58
status: Fix Committed → In Progress
Chris Rossi (chris-archimedeanco) wrote :

Ok, I have created the mailin trace community:

https://karl.soros.org/communities/mailin-trace/blog

I have set up a cronjob to send an email to that community once every five minutes. Every time this community receives an email message this file is touched:

/srv/osfkarl/production/current/var/mailin_trace
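
For reference, a tracer sender along these lines could be as small as the sketch below. It is not the script actually installed; the function names, the subject format, the unique token (borrowed from the uuid idea earlier in the thread) and the `localhost` SMTP host are all assumptions, and the recipient has to be the real mail-in address of the trace community.

```python
import smtplib
import uuid
from datetime import datetime, timezone
from email.message import EmailMessage


def build_tracer(sender, recipient):
    """Build a small tracer message with a unique token in the subject."""
    token = uuid.uuid4().hex
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "mailin tracer %s" % token
    msg.set_content("Tracer sent at %s" % datetime.now(timezone.utc).isoformat())
    return msg


def send_tracer(sender, recipient, smtp_host="localhost"):
    """Hand the tracer to an SMTP server; smtp_host is an assumption."""
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(build_tracer(sender, recipient))
```

A cron entry with a `*/5 * * * *` schedule that calls `send_tracer` would give the five-minute period described above.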

Zagy mentioned earlier that he can wire up a nagios alert based on the timestamp of that file. Zagy, can you wire up a nagios alert that goes off if the timestamp on that file is older than 15 minutes? Once that is done, we can test by stopping mailin and making sure the alert fires. At that point we can consider this ticket closed.

Changed in karl3:
assignee: Chris Rossi (chris-archimedeanco) → Christian Zagrodnick (zagy)
Robert Marianski (rmarianski) wrote :

On deployments, does the script copy over the var/ directory from the old to the new instance? It's probably useful in general so we'll get the latest log files, but I just wanted to make sure for this case so we don't get a false positive.

Paul Everitt (paul-agendaless) wrote :

Well done, Chris.

Zagy, do you need me to file a support@ ticket for this?

Changed in karl3:
milestone: m58 → m60
Christian Zagrodnick (zagy) wrote :

On 01.06.2011, at 17:30, Paul Everitt wrote:

> Well done, Chris.
>
> Zagy, do you need me to file a support@ ticket for this?

Nope. Got it. (will probably set it up tomorrow, as I'm on vacation next week)

--
Christian Zagrodnick · <email address hidden>
gocept gmbh & co. kg · forsterstraße 29 · 06112 halle (saale) · germany
http://gocept.com · tel +49 345 1229889 0 · fax +49 345 1229889 1
Zope and Plone consulting and development

Christian Zagrodnick (zagy) wrote :

The check is set up: https://monitor.rzob.gocept.net/nagios/cgi-bin/extinfo.cgi?type=2&host=osfkarl10&service=karl+mailin+alive

So, let's see if it does the right thing.

Note: The check does not trigger an emergency.

Christian Zagrodnick (zagy) wrote :

Ah, it's warning if the file is older than 650 seconds and getting critical if the file is older than 950 seconds.
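
A file-age check with those thresholds can be sketched as follows. This is an illustration, not the check gocept deployed: the function names are hypothetical, and only the 650 and 950 second thresholds and the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are taken as given.

```python
import os
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_file_age(path, warn_seconds=650, crit_seconds=950, now=None):
    """Return (exit_code, message) based on how long ago `path` was touched."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError as exc:
        # A missing or unreadable file is reported as UNKNOWN, not as OK.
        return UNKNOWN, "UNKNOWN: cannot stat %s (%s)" % (path, exc)
    if age > crit_seconds:
        return CRITICAL, "CRITICAL: %s is %d seconds old" % (path, age)
    if age > warn_seconds:
        return WARNING, "WARNING: %s is %d seconds old" % (path, age)
    return OK, "OK: %s is %d seconds old" % (path, age)


def main(argv):
    """Plugin entry point: print one status line and return the exit code."""
    code, message = check_file_age(argv[1])
    print(message)
    return code
```

To use it as a plugin, a wrapper script would call `sys.exit(main(sys.argv))` with the trace file path as its only argument. With a tracer every five minutes, the warning threshold tolerates one missed tracer and the critical threshold fires after roughly three.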

Paul Everitt (paul-agendaless) wrote :

On Jun 2, 2011, at 2:45 AM, Christian Zagrodnick wrote:

> Note: The check does not trigger an emergency.

Perfect.

I will mark this ticket as completed.

--Paul

Paul Everitt (paul-agendaless) wrote :

We just did a test and Nagios detected it. Cool!

Changed in karl3:
status: In Progress → Fix Released
milestone: m60 → m58
JimPGlenn (jpglenn09)
tags: added: r3.62