Librarian truncates large file

Bug #317482 reported by Jeroen T. Vermeulen
Affects: Launchpad itself
Status: Fix Released
Importance: High
Assigned to: Unassigned
Milestone: (blank)

Bug Description

When importing large translation files, usually somewhere around line 10,000 or half a megabyte, it looks as if sometimes the data just stops coming. This may produce errors like:

 * String not terminated. We used to blame these on hidden carriage returns, but Danilo tried re-approving some of these cases and saw them go through.

 * Truncated message.

 * String is not quoted.

 * Invalid content, being an incomplete gettext directive.

The failures we see happen at arbitrary points within lines, at points that the parser treats very differently. So it's likely that the problem occurs at a very low level, before the data is even fed into the parser.

Jeroen T. Vermeulen (jtv) wrote :

As a side note, POFile.importFromQueue opens a redundant copy of the file, without ever using it:

        import_file = librarian_client.getFileByAlias(
            entry_to_import.content.id)

Jeroen T. Vermeulen (jtv) wrote :

In Python 2.4's httplib, according to the comments for _safe_read, if a socket's recv() returns no data, there's no way of telling whether that was caused by EINTR or by EOF. (Let's hope that confusion can't happen when the result is nonempty!)

But if that is true, then AFAICS the socket implementation has the same problem while reading a buffer's worth of data, and does not warn about the situation. See socket._fileobject.read, for the "size < 0" case (which is what's being used here). Question is, is it true? I don't see any place where EINTR is being treated specially as a non-error, so probably not.
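To make the EINTR versus EOF distinction concrete, here is a minimal sketch of a read loop that treats the two cases differently. It is written for modern Python 3, not the Python 2.4 code discussed here, and the names read_exactly and ShortReadError are hypothetical. It assumes the caller already knows how many bytes to expect (for example from a Content-Length header).

```python
import socket


class ShortReadError(Exception):
    """Raised when the peer closes the connection before all bytes arrive."""


def read_exactly(sock, expected_size, chunk_size=8192):
    """Read exactly expected_size bytes from sock, or raise ShortReadError."""
    chunks = []
    received = 0
    while received < expected_size:
        try:
            chunk = sock.recv(min(chunk_size, expected_size - received))
        except InterruptedError:
            # EINTR: the call was interrupted by a signal, not ended by EOF.
            # Python 3.5+ normally retries this internally (PEP 475); the
            # explicit retry is kept only to make the distinction visible.
            continue
        if not chunk:
            # An empty result from recv() means the peer closed the connection.
            raise ShortReadError(
                "expected %d bytes, got %d" % (expected_size, received))
        chunks.append(chunk)
        received += len(chunk)
    return b"".join(chunks)
```

With a loop like this, a premature EOF surfaces as an explicit error instead of a silently truncated file.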

Francis J. Lacoste (flacoste) wrote :

This looks like a librarian problem.

Changed in rosetta:
importance: Undecided → High
status: New → Triaged
Changed in launchpad-foundations:
milestone: none → 2.2.2
Jeroen T. Vermeulen (jtv) wrote :

The premature EOF doesn't always happen, but when it does, it seems to like happening in the same place as before. My best guess right now is that we hit some kind of "stop here, ask for another batch if you want it" situation not being handled properly between the Librarian server and client.

Henning Eggers (henninge) wrote :

Once this bug is fixed we should re-run the import queue for the failed files because there are still a few thousand files not being imported because of this.

Данило Шеган (danilo) wrote :

Report on https://bugs.edge.launchpad.net/launchpad/+bug/327221 might point in the direction of caching.

Dirk Stöcker (stoecker) wrote :

There must be another aspect to this issue, as a short read cannot result in too many strings being imported. For the JOSM project we have around 2700 strings, but imports show many more:

On 2009-02-09 15:17+0000 (42 minutes ago), you uploaded 4425
Dutch (nl) translations for keys in JOSM trunk in Launchpad.

That is nearly 1800 strings too many.

Reads with too few strings but no errors are also possible (resulting in green bars that are too short in the overview).

Stuart Bishop (stub) wrote :

We should disable the Squid front end, either by turning off Squid for the whole librarian or hacking the config being used to talk directly to the librarian rather than via the Squid server.

There is little point diagnosing further until this is done - Squid introduces too many variables out of our control.

Stuart Bishop (stub) wrote :

At the moment, we still don't know what part of the process is going wrong:
    - the Librarian isn't sending the whole file
    - the client isn't reading the whole file
    - something in between is truncating the file (squid, internal proxy)
    - the whole file isn't being uploaded in the first place

LOSAs need to disable Squid and any internal proxies for this script to reduce the variables.
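One way to narrow down the four possibilities listed above is to fetch the same file through each path (directly, through Squid) and compare every copy against the size and digest recorded at upload time. The sketch below assumes such a recorded size and SHA-1 are available; the function name check_download is hypothetical and not part of the Librarian client.

```python
import hashlib


def check_download(data, expected_size, expected_sha1):
    """Return a list of problems found in data; an empty list means it is intact."""
    problems = []
    if len(data) != expected_size:
        problems.append(
            "size mismatch: expected %d bytes, got %d"
            % (expected_size, len(data)))
    actual_sha1 = hashlib.sha1(data).hexdigest()
    if actual_sha1 != expected_sha1:
        problems.append(
            "sha1 mismatch: expected %s, got %s" % (expected_sha1, actual_sha1))
    return problems
```

If the direct copy passes and the proxied copy fails, the truncation happens in between. If both fail, the stored file or the server is the more likely suspect.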

Francis J. Lacoste (flacoste) wrote :

So in addition to removing the squid server from the equation, the other diagnosis we should have is whether the complete file is in the Librarian.

It's not clear to me whether, when it works, it's the exact same LibraryFileAlias that failed in the previous run (or a file that is uploaded and then downloaded).

Данило Шеган (danilo) wrote :

In at least some cases, I've reduced the chances of another upload coming in between by trying to download right after I got a failure email of this type, and got a complete, correct file. There were at most a few minutes between the failure and me successfully getting a complete file from the librarian (and I've checked that there were no more recent uploads).

Данило Шеган (danilo) wrote :

Btw, I've filed RT #33223 to disable any proxies between loganberry and the librarian.

Barry Warsaw (barry) wrote :

Potentially relevant bugs:

http://bugs.python.org/issue1424152 (are we using a proxy?)
http://bugs.python.org/issue1537445 (urllib2 httplib _read_chunked timeout)
http://bugs.python.org/issue1628205 (socket.readline() interface doesn't handle EINTR properly)

Other than that, I don't have much to add, except that if it's EINTR-related, Oren's comment on 1628205 might be relevant.

Steve McInerney (spm) wrote :

I have done some quick netstat checks while poimport is running; there are connections direct to the librarian from loganberry.

I was too slow to catch and confirm with lsof that this was the poimport job. Either way, it sounds like this would be best fixed in the LP config, going direct that way?

Данило Шеган (danilo) wrote :

With the modified config for loganberry poimport script, everything seems to work correctly (with the uploads we got), except that the failure emails now contain paths like:

  http://mizuho.canonical.com:8000/22741330/ca.po

If we are going to keep the set-up as-is for the time being, until the proxy issues are investigated and resolved, we'd need to fix this to point to the world-accessible librarian again.

Francis J. Lacoste (flacoste) wrote :

Actually, we should change the librarian code to use download_host to retrieve files instead of download_url, and only use download_url for user-visible URLs.

Gary Poster (gary) wrote :

As Francis suggests, I have committed a change (r7831) that addresses the problem internally by bypassing the proxy. That does not mean that this is fixed, AIUI.

Changed in launchpad-foundations:
status: Triaged → In Progress
Данило Шеган (danilo) wrote :

The change doesn't touch any config files, indicating that production-poconfig changes have not actually landed. Stuart, would it make sense to land this as a separate configuration instead of having it hacked on loganberry? Will rollout break this?

Stuart Bishop (stub) wrote : Re: [Bug 317482] Re: Librarian truncates large file

On Mon, Feb 23, 2009 at 6:40 PM, Данило Шеган <email address hidden> wrote:
> The change doesn't touch any config files, indicating that production-
> poconfig changes have not actually landed. Stuart, would it make sense
> to land this as a separate configuration instead of having it hacked on
> loganberry? Will rollout break this?

I don't think Gary's fix requires config file changes - it changes the
way client.py accesses the librarian. Previously it used the
download_url. Now it uses the download port and hostname to build the
URL. download_url is only used for constructing URLs that can be used
externally.

--
Stuart Bishop <email address hidden>
http://www.stuartbishop.net/
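The split described in the message above, where internal fetches use the download host and port while download_url is kept for externally visible links, can be sketched roughly as follows. The function names and example hosts are hypothetical and this is not the actual client.py code; the path layout (alias id, then file name) simply follows the URL quoted earlier in this report.

```python
from urllib.parse import quote


def internal_file_url(download_host, download_port, alias_id, filename):
    """Build a URL that talks to the librarian directly, bypassing any proxy."""
    return "http://%s:%d/%d/%s" % (
        download_host, download_port, alias_id, quote(filename))


def public_file_url(download_url, alias_id, filename):
    """Build the user-visible URL from the configured download_url."""
    return "%s/%d/%s" % (download_url.rstrip("/"), alias_id, quote(filename))
```

Under this sketch, scripts such as poimport would fetch through internal_file_url, while failure emails would still show links built by public_file_url.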

Francis J. Lacoste (flacoste) wrote :

So the work-around has been implemented and deployed. The app servers now fetch the data directly from the librarian.

It's probable that users going through the cache might encounter failures from time to time, but that should be tracked separately.

Changed in launchpad-foundations:
status: In Progress → Fix Released