soyuz upload system pays attention to ftp sessions

Bug #30415 reported by Ian Jackson
4
Affects Status Importance Assigned to Milestone
txpkgupload
Triaged
Low
Unassigned

Bug Description

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

 affects /products/soyuz
 done

I am informed that soyuz, when handling incoming uploads, assumes that
all of the components of an upload (the .changes and the files listed
in it) will appear in a single FTP session. This is a violation of
the implied semantics of FTP, and can cause practical problems.

For example, if you use dupload but your upload fails (eg due to
network problems) after successfully completing some files, then
dupload will record success for those files but soyuz will delete
then. If you then rerun dupload it will upload only the files which
were not successfully transferred the first time, but the
already-uploaded files will have been previously deleted by soyuz.

As another example, you might reasonably upload the different parts of
an upload from different systems to save bandwidth on small links.
(Often the .orig.tar.gz is very large.)

As a third example, you might be behind an application relay (web
proxy) which starts a new FTP connection for each transfer. That's
obviously not ideal and is slow and wasteful but it's not demonstrably
wrong.

The correct solution is for soyuz to keep files hanging around rather
than giving each new upload connection a new blank directory. Clashes
between files of different names can be resolved in favour of the most
recent, if each target distribution or namespace has a separate upload
directory. Races can be avoided by careful programming.

Ian.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: Processed by Mailcrypt 3.5.6 <http://mailcrypt.sourceforge.net/>

iD8DBQFD43hN8jyP9GfyNQARAqM9AJ4jxO7Jqs/Z+XMX2Khl0YBr3MmvPwCeKSiR
DI+R1DR6Cp52eJfLewbWnNI=
=CRKM
-----END PGP SIGNATURE-----

Revision history for this message
Celso Providelo (cprov) wrote :

I agree with some of your points and we do have a plan for rearranging the directory structure on the server to allow 'lazy' uploads and also allow 'Personal Package Archives'uploads to be landed. Something like:

{{{
/distros/<distro_name>/
/people/<person_name>/distros/<distro_name>/
}}}

Each directory beyond those paths will be processed contextually.

Following you suggestion, I think, we need to support a single control file inside the upload directory which controls its processing, of course, it needs support from 'dput' side. It would be something like:

{{{
at /distros/ubuntu/<upload_directory_name>/<control_file> containing:
proccess = False
}}}

I don't know how to define a reliable way to name the upload directory, since it will be required in the' dput' side to continue the upload later.

Also it needs a way to minimize exploits by having a lot of locked upload_directories, maybe removing directories older than a some age (one day would be a lot, IMO).

Anyway, the current prototype hasidentified problem, let's discuss available mid-term solutions.

Changed in soyuz:
assignee: nobody → cprov
status: Unconfirmed → Needs Info
Revision history for this message
Ian Jackson (ijackson) wrote : Re: [Bug 30415] Re: soyuz upload system pays attention to ftp sessions

Celso Providelo writes ("[Bug 30415] Re: soyuz upload system pays attention to ftp sessions"):
> Following you suggestion, I think, we need to support a single
> control file inside the upload directory which controls its
> processing, of course, it needs support from 'dput' side. It would
> be something like:

Why not process each .changes as soon as the files it lists have
arrived ? A .changes is detailed enough to tell whether the upload is
complete. Stale files can be expired, and it is safe to allow
overwrites - or you could even do something fancy like keep both
versions of the file and pick the one that matches the md5sum in the
.changes.

In fact, there is not really any reason to support directory listings
on the incoming directories, nor downloads from them. Once we realise
this, then it becomes obvious that there is no need for the _filename_
to be a unique key for finding an uploaded file.

Non-.changes files which have been uploaded should be indexed by
(filename,size,md5sum), as that's the information which is listed in
the .changes. .changes files can be indexed by the md5sum of the
.changes, and if two .changes with the same name are uploaded then
whichever one gets all of the files is processed - if both get all of
their files then both are processed, possibly causing an error if that
would violate in-archive uniqueness.

If you do this then you don't need all the palaver with directories
either, because the .changes will say where (which distro, which
personal package archive, etc.) the files should be sent.

Ian.

Revision history for this message
Ian Jackson (ijackson) wrote : forwarded message from Ubuntu Installer
Download full text (3.2 KiB)

Here is an example of a failure caused by this bug.

I used dupload to upload this file and my network broke during the
upload of the .orig.tar.gz. I restarted the upload and dupload quite
correctly didn't reupload the already successfuly uploaded files. The
result was the error message you will find attached.

I disagree vehemently with describing this bug as a `wishlist item'.
It is quite simply a bug that soyuz pays attention to which FTP
session a file arrives in.
 * It is completely wrong as it violates the basic assumptions
   underlying the FTP protocol (which are, admittedly, not stated in
   the FTP RFC as the authors presumably didn't think anyone would be
   so daft).
 * There is no excuse for this behaviour. As you can see from the
   design sketch I have provided, there are reasonably sane and
   reliable ways of dealing with filename clashes, partial uploads,
   etc.
 * As you can see from the attached message, this behaviour causes
   actual malfunctions.
So there is no excuse for classifying it as `wishlist' !

I will change the bug status shortly.

Ian.

Rejected:
UploadError made it out to the main loop: File gs-gpl_8.50-1.1ubuntu1.dsc as mentioned in the changes file was not found.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Format: 1.7
Date: Fri, 30 Jun 2006 18:41:54 +0100
Source: gs-gpl
Binary: gs-gpl gs
Architecture: source
Version: 8.50-1.1ubuntu1
Distribution: edgy
Urgency: high
Maintainer: Masayuki Hatta (mhatta) <email address hidden>
Changed-By: Ian Jackson <email address hidden>
Description:
 gs - Transitional package
 gs-gpl - The GPL Ghostscript PostScript interpreter
Closes: 347637 348834 354352 357326
Changes:
 gs-gpl (8.50-1.1ubuntu1) edgy; urgency=low
 .
   * Merge from debian unstable.
   * Remaining differences from Debian:
     - Enable IJS KRGB support.
     - PowerPC crash fix, messing with setjmp.
     - Fix for X11 crash with antialising.
   * Differences from Debian now dropped:
     - `printenv' in debian/rules.
     - Specification of CC=gcc in debian/rules.
   * See changelog entries below for full details.
 .
 gs-gpl (8.50-1.1) unstable; urgency=high
 .
   * Non-maintainer upload.
   * debian/patches/00list: Re-enable patch 10_powerpc_crash_fix; upstream
     delayed to fix after the 8.50 release. Cures segfaults on ppc (again).
     Thanks to Roger Leigh for testing. Closes: #357326
 .
 gs-gpl (8.50-1) unstable; urgency=low
 .
   * Works done at Codefest in Malaysia 2006.
   * New upstream release - closes: #347637, #348834
   * Updated debian/watch - closes: #354352
   * Bumped to Standards-Version: 3.6.2.2 (no physical changes).
 .
 gs-gpl (8.15-4.1) unstable; urgency=low
 .
   * Non-maintainer upload.
   * Use gcc-3.4 on s390.
Files:
 d86d5fce7654ff79c7cc33544021f79e 803 text optional gs-gpl_8.50-1.1ubuntu1.dsc
 661cacc387fb908f434bfbf5eef5c0ce 9981486 text optional gs-gpl_8.50.orig.tar.gz
 ea907b118553321d91f8c2020c447653 66824 text optional gs-gpl_8.50-1.1ubuntu1.diff.gz

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)

iD8DBQFEpWQv05QTisYqw/QRAlbtAJoD4aHIAMIGa1nbf0Yzp/eb+uJ7GgCcCE2L
fB0OzGAeGT08RgbdLSrSWCE=
=cKRH
-----END PGP SIGNATURE-----

===

If you don't...

Read more...

Ian Jackson (ijackson)
Changed in soyuz:
status: Needs Info → Confirmed
Revision history for this message
Christian Reis (kiko) wrote :

The main reason this has never been done is that it would involve rearchitecting of the upload service, and few if any other users have ever reported this to be an inconvenience.

It would involve rearchitecting, by the way, because our incoming poppy server has no concept of a persistent upload directory; every upload is made to its own temporary directory, which is then asynchronously processed by a separate process that validates and creates the upload. To provide a persistent upload directory would probably require changing the way poppy stores the data, and would require identifying the user uploading somehow (perhaps based on the gpg key ID being used?) to be able to tell what his directory is. It's not a bad idea, and it's not crazy, but there are a lot of other pressing issues that need our engineering resources too.

Revision history for this message
Ian Jackson (ijackson) wrote : Re: [Bug 30415] Re: soyuz upload system pays attention to ftp sessions

Christian Reis writes ("[Bug 30415] Re: soyuz upload system pays attention to ftp sessions"):
> The main reason this has never been done is that it would involve
> rearchitecting of the upload service, and few if any other users have
> ever reported this to be an inconvenience.

In practice it's not usually too much of an inconvenience if you know
it's happening.

> It would involve rearchitecting, by the way, because our incoming poppy
> server has no concept of a persistent upload directory; every upload is
> made to its own temporary directory, which is then asynchronously
> processed by a separate process that validates and creates the upload.

That sounds like a perfectly fine beginning approach.

> To provide a persistent upload directory would probably require changing
> the way poppy stores the data, and would require identifying the user
> uploading somehow (perhaps based on the gpg key ID being used?) to be
> able to tell what his directory is. It's not a bad idea, and it's not
> crazy, but there are a lot of other pressing issues that need our
> engineering resources too.

I propose the following approach:

 * Move files out of the incoming directories and rename them into a
   holding directory with a filename which includes their checksum.
   (Several hardlinks if several checksums need to be supported.)
   If two identically named files arrive with the same checksums,
   check that they are identical and if not call for help. (Put
   everything aside and block further processing of any file with this
   checksum.)
 * Periodically look for .changes files all of whose pieces have
   arrived, and then process them. When processed, delete the
   .changes but not the other files.
 * Expire files after some period of time (24h?)

Ian.

Curtis Hovey (sinzui)
Changed in soyuz:
assignee: Celso Providelo (cprov) → nobody
tags: added: poppy
Colin Watson (cjwatson)
affects: launchpad → txpkgupload
tags: removed: poppy
information type: Public → Private
Ian Jackson (ijackson)
description: updated
Colin Watson (cjwatson)
information type: Private → Public
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Related questions

Remote bug watches

Bug watches keep track of this bug in other bug trackers.