importer: having only one import record for every publish means the commit date is no longer identical to the publishing date

Bug #1730734 reported by Nish Aravamudan on 2017-11-07
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
usd-importer
Critical
Nish Aravamudan

Bug Description

launchpad_versions_published_after used the publishing date to identify what the last LP publishing record that was imported was. That no longer works, because there is now one publishing record for a given version, and the various series/pocket branches point to the same commit (which has a commit timestamp of the first publish seen). We could update the commit message metadata to include a Launchpad-URL: to the publishig record, or a Launchpadlib-Link: to the guaranteed to be unique lp spphr object?

Related branches

Nish Aravamudan (nacc) on 2017-11-07
Changed in usd-importer:
importance: Undecided → Critical
milestone: none → lp-beta
status: New → Confirmed
Nish Aravamudan (nacc) on 2017-11-07
tags: added: import-edge-case
Nish Aravamudan (nacc) wrote :

To be clear, here is my analysis:

In 0f3c943054abc82734febac91896b80f401fcaa7, 47d2ec2bef4bc81efa688610113863ebcfddaa22 and 5aa33fa080789986d90cbe27b086fa63b298a164, we modified the import algorithm to simply reset the branch pointers for series & pocket to the corresponding commit of the source package publication. Similarly, for the series & pocket -devel, and ubuntu/devel branches. This reflects the publication event. We also dropped the publishing parent.

In effect, we now have a commit graph of all the publications, tied together by their changelog entries, and the various branch references are just pointers into that graph. They may or may not fast-forward between publications to the corresponding series & pocket.

This subtly breaks our catch-up model in the importer. We store the publication timestamp into the publication commit's commit date. We iterate the latest Launchpad publishes backwards for a given source package until we found a branch that was identical to a given published version *and* publication timestamp and then iterated forward from there. This works well when the commits in the various branches store the publication data specific to that branch (series & pocket).

With the aforementioned change, though, there is now only one commit for a given source package version, which occurs when we first see the version (in either Debian or Ubuntu, Debian first). That commit's metadata, specifically the commit date, is the publication date of only that first publication record. Thus, even if a branch has been correctly updated to point to that commit object, we have no means of recording that event corresponds to a specific publication event.

As far as I can tell, there is no way to do this within the Git metadata. We could add a field to the commit message (a la LP: #1728685), which stores the unique link to the publication record. That would change all hashes, of course, so this needs to be resolved before LP beta.

This might just be an efficiency problem (we would just be importing source packages that are already imported, which should be fine: we iterate far enough back that we re-import something that is an ancestor of where a branch currently is. We'll end up not importing it again (the treeish will match) and then we'll move the branch pointer to an old spot and back to where it is now) but I'm worried it would lead to breaks in the commit graph as that is confusing :)

Robie Basak (racb) on 2017-11-08
tags: added: hash-abi-break
Colin Watson (cjwatson) wrote :

Fundamentally, I think you should be importing a SourcePackageRelease (although that isn't an object exported directly on the webservice), not a SourcePackagePublishingHistory, and that's part of why I requested that the graph be changed in the first place. As such, I would have expected the commit date to correspond to the date in the package's changelog (i.e. the date of authorship), rather than to any date of publication. I don't think you should have a link to the publication record in the commit message, since that really just recapitulates the original problem.

I think the only case where you need more than just version comparison is the case where a version is removed from a suite and then later copied back into the same suite. That's exceedingly rare (usually something else would precipitate a rebuild anyway), but it does happen occasionally, and I can see that you need an efficient way to detect that you've iterated back far enough through the history to complete catching up.

Does this absolutely have to be in-band in the git metadata for the suites in question? You could keep a separate cache indicating how far you've got, and the rare case of needing to restore without that cache could perhaps just be handled by manually setting a threshold. Or if that isn't acceptable you could even have a separate branch in each repository that's just for importer metadata, outside of refs/heads/ so that it isn't normally very visible unless people go looking for it.

On Tue, Nov 7, 2017 at 11:41 PM, Colin Watson <email address hidden> wrote:
> Fundamentally, I think you should be importing a SourcePackageRelease
> (although that isn't an object exported directly on the webservice), not
> a SourcePackagePublishingHistory, and that's part of why I requested
> that the graph be changed in the first place.

I doubt I would have been able to derive we should use an object that
has no documentation and that I didn't know existed until just now :).
Was this discussed at the Rally? I was not informed of it, if so.
Quick Google and it seems to be a column of a table -- how would I get
to that data in Python not running on Launchpad or with any access to
the Launchpad DB?

> As such, I would have
> expected the commit date to correspond to the date in the package's
> changelog (i.e. the date of authorship), rather than to any date of
> publication. I don't think you should have a link to the publication
> record in the commit message, since that really just recapitulates the
> original problem.

To be clear, we do store both bits of metadata.

The commit author name, email and date are the same as in
debian/changelog, if parseable.

The commit commiter name, email are from who is running the importer
and the date is from the Launchpad publishing record.

I'm fine with dropping the latter and just have it be the current date
at commit time.

> I think the only case where you need more than just version comparison
> is the case where a version is removed from a suite and then later
> copied back into the same suite. That's exceedingly rare (usually
> something else would precipitate a rebuild anyway), but it does happen
> occasionally, and I can see that you need an efficient way to detect
> that you've iterated back far enough through the history to complete
> catching up.

Yep, that was the corner case I was worried about. And it would have
to intersect with no other unique versions being published to the same
series in the meanwhile.

> Does this absolutely have to be in-band in the git metadata for the
> suites in question? You could keep a separate cache indicating how far
> you've got, and the rare case of needing to restore without that cache
> could perhaps just be handled by manually setting a threshold. Or if
> that isn't acceptable you could even have a separate branch in each
> repository that's just for importer metadata, outside of refs/heads/ so
> that it isn't normally very visible unless people go looking for it.

Definitely not a hard requirement for it to be in-band. We do already
have metadata (pristine-tar, dsc cache) for the importer in its own
'namespace' of branches, so we could probably reuse that, or as you
said, keep a distinct cache.

Colin Watson (cjwatson) wrote :

I'm not saying you should be able to get at a SourcePackageRelease directly (since it isn't exported). I'm just using that term as a convenient abbreviation: an SPR is a (source package name, version) pair. My comments to Robie at the rally that I thought you should import each version of a package only once were equivalent to this.

Nish Aravamudan (nacc) wrote :

On Wed, Nov 8, 2017 at 8:08 AM, Colin Watson <email address hidden> wrote:
> I'm not saying you should be able to get at a SourcePackageRelease
> directly (since it isn't exported). I'm just using that term as a
> convenient abbreviation: an SPR is a (source package name, version)
> pair. My comments to Robie at the rally that I thought you should
> import each version of a package only once were equivalent to this.

Ah ok, thanks. Sorry for the curt reply, first thing in the morning :)

Nish Aravamudan (nacc) on 2017-11-08
Changed in usd-importer:
status: Confirmed → In Progress
assignee: nobody → Nish Aravamudan (nacc)
Robie Basak (racb) on 2017-11-28
tags: added: import
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers