multiple copies of orig.tar.gz's in the librarian

Bug #38227 reported by James Troup
Affects | Status | Importance | Assigned to | Milestone
Launchpad itself | Fix Released | High | Celso Providelo |
Ubuntu | Invalid | Medium | Unassigned |

Bug Description

The sync-source tool deliberately explodes when it finds more than one orig.tar.gz in the archive for a given source_version. This appears to be the case for, e.g. advi_1.6.0.orig.tar.gz. I downloaded both copies, and confirmed that the md5sums and sizes match. Even so, there should only be one copy in the librarian.

*****
<SourcePackageFilePublishing at 0x2aaab5737f50>
filename: advi_1.6.0.orig.tar.gz
alias: 1268214
distrorelease: hoary, component: universe, source: advi, status: 2
*****
<SourcePackageFilePublishing at 0x2aaab5737fd0>
filename: advi_1.6.0.orig.tar.gz
alias: 1331574
distrorelease: breezy, component: universe, source: advi, status: 2
*****
<SourcePackageFilePublishing at 0x2aaab71b30d0>
filename: advi_1.6.0.orig.tar.gz
alias: 1331574
distrorelease: dapper, component: universe, source: advi, status: 3
*****
<SourcePackageFilePublishing at 0x2aaab71b3150>
filename: advi_1.6.0.orig.tar.gz
alias: 1331574
distrorelease: dapper, component: universe, source: advi, status: 2
E: advi_1.6.0.orig.tar.gz (from advi) returns multiple IDs for orig.tar.gz. Help?
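
A minimal Python sketch of the kind of check that leads to the error above, assuming publishing records reduce to (filename, alias, distrorelease) tuples. `FilePublishing` and `find_conflicting_aliases` are hypothetical names and are not taken from sync-source; the alias numbers are the ones shown in the dump.

```python
from collections import defaultdict
from typing import NamedTuple


class FilePublishing(NamedTuple):
    """Hypothetical stand-in for a SourcePackageFilePublishing row."""
    filename: str
    alias: int
    distrorelease: str


def find_conflicting_aliases(records):
    """Return {filename: sorted alias IDs} for filenames with more than one alias."""
    aliases = defaultdict(set)
    for record in records:
        aliases[record.filename].add(record.alias)
    return {name: sorted(ids) for name, ids in aliases.items() if len(ids) > 1}


records = [
    FilePublishing("advi_1.6.0.orig.tar.gz", 1268214, "hoary"),
    FilePublishing("advi_1.6.0.orig.tar.gz", 1331574, "breezy"),
    FilePublishing("advi_1.6.0.orig.tar.gz", 1331574, "dapper"),
]
conflicts = find_conflicting_aliases(records)
for name, ids in conflicts.items():
    print(f"E: {name} returns multiple IDs: {ids}")
```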

Tags: lp-soyuz
James Troup (elmo)
Changed in launchpad-upload-and-queue:
assignee: nobody → dsilvers
status: Unconfirmed → Confirmed
Revision history for this message
Andrew Bennetts (spiv) wrote :

Not directly helpful, but this background information about how the Librarian handles duplicates may be of interest:

The librarian tries to ensure that identical files are only stored once (and so the LibraryFileContent table will only have one row for that file), but by design allows duplicate aliases to that content. (Additionally, there can be duplicate content if the same new file is uploaded simultaneously in two separate but concurrent transactions, but the Librarian GC process will find and collapse duplicate content rows daily).

Basically, this means this constraint needs to be enforced somewhere other than the librarian. A workaround might be to try relying on the librarian's existing duplicate detection, i.e. check if the duplicate aliases are linked to the same content or not, but I don't think this will be bulletproof.

Judging from a bit of grepping, the place where these files are added is nascentupload.py, in insert_source_into_db, so this would be the obvious place to start fixing.
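
The collapsing of duplicate content rows described above could look roughly like the sketch below. This is not the Librarian GC code: the `Content` and `Alias` classes are hypothetical stand-ins for LibraryFileContent and LibraryFileAlias rows, and treating rows as identical when their SHA-1 and size match is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Content:
    """Hypothetical stand-in for a LibraryFileContent row."""
    id: int
    sha1: str
    filesize: int


@dataclass
class Alias:
    """Hypothetical stand-in for a LibraryFileAlias row."""
    id: int
    filename: str
    content_id: int


def collapse_duplicate_content(contents, aliases):
    """Repoint aliases at the lowest-id content row for each (sha1, filesize).

    Aliases are updated in place; the surviving content rows are returned.
    """
    keeper = {}  # (sha1, filesize) -> surviving Content row
    for content in sorted(contents, key=lambda c: c.id):
        keeper.setdefault((content.sha1, content.filesize), content)
    by_id = {content.id: content for content in contents}
    for alias in aliases:
        current = by_id[alias.content_id]
        alias.content_id = keeper[(current.sha1, current.filesize)].id
    return list(keeper.values())


# Invented rows: two content rows hold the same bytes, a third is distinct.
contents = [Content(1, "abc", 10), Content(2, "abc", 10), Content(3, "def", 20)]
aliases = [
    Alias(10, "a_1.0.orig.tar.gz", 1),
    Alias(11, "a_1.0.orig.tar.gz", 2),
    Alias(12, "b_1.0.dsc", 3),
]
survivors = collapse_duplicate_content(contents, aliases)
```

Note that this only merges content; the two aliases for the same filename remain, which is why the uniqueness constraint has to be enforced elsewhere.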

Revision history for this message
Celso Providelo (cprov) wrote :

RF 3476, but we are still missing some magic SQL to fix the duplicated files in the production DB.

Changed in qprocd:
status: Confirmed → Fix Committed
Revision history for this message
Matt Zimmerman (mdz) wrote :

This is blocking certain pending archive requests, such as bug #41213, so raising severity.

Celso, it looks like you're working on this, so I'm assigning to you; hope that's OK.

Changed in qprocd:
assignee: dsilvers → cprov
Revision history for this message
Daniel Silverstone (dsilvers) wrote :

I have been working on the SQL to fix this up.

I need stuart to finish a librariangc run before I can continue

Revision history for this message
Stuart Bishop (stub) wrote :

librarian-gc.py run has completed.

Revision history for this message
Celso Providelo (cprov) wrote :

Daniel, did you check the results? Should we continue the dapper-autotest process and see what happens?

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

Still seeing this today.

Revision history for this message
Adam Conrad (adconrad) wrote :

And 4 days after the last comment, I just ran across this with a sync I was attempting, and it's blocking some bugfixes we need to grab from Debian. If there's anything I can do to help get this tested and rolled out, let me know.

Revision history for this message
Matt Zimmerman (mdz) wrote :

I've opened an Ubuntu task to track the process of cleaning up after this bug, since the bug itself has been fixed

Revision history for this message
Matt Zimmerman (mdz) wrote :

Here's my analysis of the duplicates that kiko brought to my attention, which were found with a database query.

Orig needs fixing in pool:
 digikam_0.8.1.orig.tar.gz | 2
 kdissert_1.0.5.debian.orig.tar.gz | 2

Duplicate orig, but pool has the correct one:
 gossip_0.10.2.orig.tar.gz | 2
 xfce4-mixer_4.3.0svn+r19775.orig.tar.gz | 2

Needs upload:
 at_3.1.8-11ubuntu5.dsc | 2
 at_3.1.8-11ubuntu5.tar.gz | 2
 ebtables_2.0.6-3ubuntu1.diff.gz | 2
 ebtables_2.0.6-3ubuntu1.dsc | 2
 xfce4-dev-tools_4.3.3svn-r20589-0ubuntu1.diff.gz | 2
 xfce4-dev-tools_4.3.3svn-r20589-0ubuntu1.dsc | 2

Not currently published:
 gazpacho_0.6.2-0ubuntu4.diff.gz | 2
 gazpacho_0.6.2-0ubuntu4.dsc | 2
 gok_1.0.5-1ubuntu5.diff.gz | 2
 gok_1.0.5-1ubuntu5.dsc | 2
 gtksourceview_1.6.1-0ubuntu2.diff.gz | 2
 gtksourceview_1.6.1-0ubuntu2.dsc | 2
 hal_0.5.7-1ubuntu9.diff.gz | 2
 hal_0.5.7-1ubuntu9.dsc | 2
 hotkey-setup_0.1-15build1.dsc | 2
 hotkey-setup_0.1-15build1.tar.gz | 2
 initramfs-tools_0.40ubuntu26.dsc | 2
 initramfs-tools_0.40ubuntu26.tar.gz | 2
 kmplayer_0.9.1.99+0.9.2-pre3-0ubuntu1.diff.gz | 2
 kmplayer_0.9.1.99+0.9.2-pre3-0ubuntu1.dsc | 2
 libnjb_2.2.4-3.diff.gz | 2
 libnjb_2.2.4-3.dsc | 2
 ubuntu-meta_0.110.dsc | 2
 ubuntu-meta_0.110.tar.gz | 2
 udev_079-0ubuntu9.diff.gz | 2
 udev_079-0ubuntu9.dsc | 2

I went ahead and made no-change uploads for the three packages which could be easily fixed that way
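
For reference, a database query of the kind mentioned above might be shaped like the following sketch, run here against an in-memory SQLite database. The `file_alias` table, its columns, and all IDs are invented for illustration and do not reflect Launchpad's real schema; the thread does not say what the count column in the listing measures, so counting distinct content per filename is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_alias ("
    "id INTEGER PRIMARY KEY, filename TEXT, content_id INTEGER)"
)
# Invented rows: one filename with two different contents, one with a single content.
conn.executemany(
    "INSERT INTO file_alias (id, filename, content_id) VALUES (?, ?, ?)",
    [
        (1, "digikam_0.8.1.orig.tar.gz", 100),
        (2, "digikam_0.8.1.orig.tar.gz", 101),
        (3, "example_1.0.orig.tar.gz", 102),
    ],
)

# Filenames that map to more than one distinct content row.
duplicates = conn.execute(
    """
    SELECT filename, COUNT(DISTINCT content_id) AS copies
    FROM file_alias
    GROUP BY filename
    HAVING COUNT(DISTINCT content_id) > 1
    ORDER BY filename
    """
).fetchall()

for filename, copies in duplicates:
    print(f" {filename} | {copies}")
```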

Revision history for this message
Celso Providelo (cprov) wrote : Proposed Sync Tool Fix

The Sync Tool should support multiple LFAs (Library File Aliases) that point to the same LFC (Library File Content), since they are in fact the same file, and should abort the procedure only when the contents diverge.

The multiple LFAs must be consolidated at some point, but as far as I can tell this condition is harmless to the system, since the download result will be the same.

This patch might not apply cleanly to the production version of the sync-tool, since it appears to have local changes.
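
The behaviour described in this comment can be sketched as follows. This is not the attached patch, only an illustration: `resolve_orig` and the (alias ID, content ID) pair shape are hypothetical, and the content ID 500 in the usage example is invented.

```python
def resolve_orig(filename, aliases):
    """Return one alias ID for filename, tolerating aliases that share content.

    `aliases` is a list of (alias_id, content_id) pairs, a hypothetical shape.
    Raises ValueError only when the aliases point at different content.
    """
    if not aliases:
        raise LookupError(f"{filename}: no alias found")
    content_ids = {content_id for _, content_id in aliases}
    if len(content_ids) > 1:
        raise ValueError(
            f"{filename}: aliases point at diverging content {sorted(content_ids)}"
        )
    # All aliases resolve to the same bytes, so any of them will do.
    return min(alias_id for alias_id, _ in aliases)


# Two aliases, one shared (invented) content ID: accepted.
chosen = resolve_orig("advi_1.6.0.orig.tar.gz", [(1268214, 500), (1331574, 500)])
```

Picking the lowest alias ID is an arbitrary choice made here to keep the result deterministic.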

Revision history for this message
James Troup (elmo) wrote : Re: [Bug 38227] Re: multiple copies of orig.tar.gz's in the librarian

Celso Providelo <email address hidden> writes:

> ** Attachment added: "Proposed Sync Tool Fix"
> http://librarian.launchpad.net/2538918/fix_sync_tool_to_support_multiple_LFA_with_same_LFC.diff

Err, the sync tool already does this.

--
James

Revision history for this message
Celso Providelo (cprov) wrote : Same fix, better style

As discussed with kiko, a style fix and a plan to redesign the entire script.

Revision history for this message
Celso Providelo (cprov) wrote :

James, where is this code, and why is it still failing as reported in bug #41487?

Revision history for this message
Andrew Bennetts (spiv) wrote :

Celso, is it possible to construct a test case for this so that you can reproduce it, and be sure that the proposed fix actually works? There appears to be a fair bit of guesswork going on, it would be nice to be able to reproduce the problem in a test environment.

Also, having tests in your patch will help it pass code review...

Revision history for this message
Celso Providelo (cprov) wrote : Re: [Bug 38227] Re: multiple copies of orig.tar.gz's in the librarian

Andrew Bennetts wrote:
> Celso, is it possible to construct a test case for this so that you
> can reproduce it, and be sure that the proposed fix actually works?
> There appears to be a fair bit of guesswork going on, it would be
> nice to be able to reproduce the problem in a test environment.

If by this you mean the problem itself (we should not accept
duplicated filenames with different content), yes, we already have it
integrated in doc/zzz-soyuz-set-of-uploads.txt, so we are sure we
are not making the same mistake again.

However, this last fix is in the sync-tool, to deal with the corrupted
content (even if I think multiple LFAs pointing to a single LFC aren't
real corruption, since the system does support it). It is also a fix
applied experimentally to a portion of code that isn't even kept up to
date in the LP tree :( , no standards, no patterns, no tests; it's
contributed code atm.

I appreciate your concern and agree with it. I plan to redesign
every tool located in scripts/ftpmaster/ (as discussed some time
ago in lp-reviews with bjornt), and that includes support classes for
each script, functional and doc tests, standard command-line options, etc.

> Also, having tests in your patch will help it pass code review...

I'm counting on your understanding in this situation; it can't change
magically. A redesign in this area would require time and more hands so
as not to break the dapper release process.

[]
--
Celso Providelo <email address hidden>
Canonical Ltd - http://www.canonical.com

Revision history for this message
Matt Zimmerman (mdz) wrote :

The situation turned out to be (possibly) somewhat better than I originally thought; it was only digikam where the .orig was mismatched relative to Debian. In the kdissert case, the Ubuntu .dsc was wrong, but the .orig in the pool matched Debian.

So I've uploaded digikam with a fresh renamed .orig, and kdissert with only a new revision and the same orig, which should get the pool into a consistent state.

Can someone confirm what the database says, so we know if kdissert needs the same treatment or not?

Revision history for this message
Celso Providelo (cprov) wrote :

yes, kdissert needs new upload with fresh orig version to overlap the differences between the archive and the DB properly, as you did for digikam.

After some investigation with kiko we figured out that what was in the archive as kdissert_1.0.5.debian.orig.tar.gz didn't match what the system recognizes as such.

What exactly happened is that a new upload was accepted with new orig content but the same name, and in the end the publisher refused to overwrite the file in the archive (boom, inconsistency between archive and model, without even a single warning).

The DB has, let's say for comparison's sake, the contents of kdissert_1.0.5.debian2.orig.tar.gz, but named as the former.

We expect a new full upload would fix the archive snapshot, but the history will be broken.
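
A check that would have surfaced this mismatch loudly, instead of silently leaving the old file in place, might look like the sketch below. It is not the publisher's code: `check_before_publish`, its return values, and the use of SHA-1 as the digest recorded in the database are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha1_of(path):
    """Return the hex SHA-1 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_before_publish(pool_path, expected_sha1):
    """Decide what to do with a file the database says should be in the pool.

    Returns "publish" if the file is absent from the pool, "skip" if the pool
    copy already matches the database digest, and raises RuntimeError if a
    different file with the same name is already there.
    """
    pool_path = Path(pool_path)
    if not pool_path.exists():
        return "publish"
    if sha1_of(pool_path) == expected_sha1:
        return "skip"
    raise RuntimeError(
        f"{pool_path.name}: archive copy differs from database content"
    )
```

Raising rather than skipping is the point of the sketch: the archive and the database model can then no longer drift apart unnoticed.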

Revision history for this message
Matt Zimmerman (mdz) wrote :

On Thu, May 11, 2006 at 08:49:58PM -0000, Celso Providelo wrote:
> yes, kdissert needs new upload with fresh orig version to overlap the
> differences between the archive and the DB properly, as you did for
> digikam.

I uploaded it earlier today; should be fine now.

--
 - mdz

Revision history for this message
Celso Providelo (cprov) wrote : Re: [Bug 38227] Re: [Bug 38227] Re: multiple copies of orig.tar.gz's in the librarian

Matt Zimmerman wrote:
> On Thu, May 11, 2006 at 08:49:58PM -0000, Celso Providelo wrote:
>> yes, kdissert needs new upload with fresh orig version to overlap the
>> differences between the archive and the DB properly, as you did for
>> digikam.
>
> I uploaded it earlier today; should be fine now.
>

Yes, both (digikam & kdissert) seem fine now from a.u.c

thank you

--
Celso Providelo <email address hidden>
Canonical Ltd - http://www.canonical.com

Revision history for this message
Celso Providelo (cprov) wrote :

code is fixed and current versions are ok.

Changed in qprocd:
status: Fix Committed → Fix Released