importer: having only one import record for every publish means the commit date is no longer identical to the publishing date

Bug #1730734 reported by Nish Aravamudan
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
git-ubuntu
Triaged
Critical
Robie Basak

Bug Description

launchpad_versions_published_after used the publishing date to identify what the last LP publishing record that was imported was. That no longer works, because there is now one publishing record for a given version, and the various series/pocket branches point to the same commit (which has a commit timestamp of the first publish seen). We could update the commit message metadata to include a Launchpad-URL: to the publishig record, or a Launchpadlib-Link: to the guaranteed to be unique lp spphr object?

Related branches

Nish Aravamudan (nacc)
Changed in usd-importer:
importance: Undecided → Critical
milestone: none → lp-beta
status: New → Confirmed
Nish Aravamudan (nacc)
tags: added: import-edge-case
Revision history for this message
Nish Aravamudan (nacc) wrote :

To be clear, here is my analysis:

In 0f3c943054abc82734febac91896b80f401fcaa7, 47d2ec2bef4bc81efa688610113863ebcfddaa22 and 5aa33fa080789986d90cbe27b086fa63b298a164, we modified the import algorithm to simply reset the branch pointers for series & pocket to the corresponding commit of the source package publication. Similarly, for the series & pocket -devel, and ubuntu/devel branches. This reflects the publication event. We also dropped the publishing parent.

In effect, we now have a commit graph of all the publications, tied together by their changelog entries, and the various branch references are just pointers into that graph. They may or may not fast-forward between publications to the corresponding series & pocket.

This subtly breaks our catch-up model in the importer. We store the publication timestamp into the publication commit's commit date. We iterate the latest Launchpad publishes backwards for a given source package until we found a branch that was identical to a given published version *and* publication timestamp and then iterated forward from there. This works well when the commits in the various branches store the publication data specific to that branch (series & pocket).

With the aforementioned change, though, there is now only one commit for a given source package version, which occurs when we first see the version (in either Debian or Ubuntu, Debian first). That commit's metadata, specifically the commit date, is the publication date of only that first publication record. Thus, even if a branch has been correctly updated to point to that commit object, we have no means of recording that event corresponds to a specific publication event.

As far as I can tell, there is no way to do this within the Git metadata. We could add a field to the commit message (a la LP: #1728685), which stores the unique link to the publication record. That would change all hashes, of course, so this needs to be resolved before LP beta.

This might just be an efficiency problem (we would just be importing source packages that are already imported, which should be fine: we iterate far enough back that we re-import something that is an ancestor of where a branch currently is. We'll end up not importing it again (the treeish will match) and then we'll move the branch pointer to an old spot and back to where it is now) but I'm worried it would lead to breaks in the commit graph as that is confusing :)

Robie Basak (racb)
tags: added: hash-abi-break
Revision history for this message
Colin Watson (cjwatson) wrote :

Fundamentally, I think you should be importing a SourcePackageRelease (although that isn't an object exported directly on the webservice), not a SourcePackagePublishingHistory, and that's part of why I requested that the graph be changed in the first place. As such, I would have expected the commit date to correspond to the date in the package's changelog (i.e. the date of authorship), rather than to any date of publication. I don't think you should have a link to the publication record in the commit message, since that really just recapitulates the original problem.

I think the only case where you need more than just version comparison is the case where a version is removed from a suite and then later copied back into the same suite. That's exceedingly rare (usually something else would precipitate a rebuild anyway), but it does happen occasionally, and I can see that you need an efficient way to detect that you've iterated back far enough through the history to complete catching up.

Does this absolutely have to be in-band in the git metadata for the suites in question? You could keep a separate cache indicating how far you've got, and the rare case of needing to restore without that cache could perhaps just be handled by manually setting a threshold. Or if that isn't acceptable you could even have a separate branch in each repository that's just for importer metadata, outside of refs/heads/ so that it isn't normally very visible unless people go looking for it.

Revision history for this message
Nish Aravamudan (nacc) wrote : Re: [Bug 1730734] Re: importer: having only one import record for every publish means the commit date is no longer identical to the publishing date

On Tue, Nov 7, 2017 at 11:41 PM, Colin Watson <email address hidden> wrote:
> Fundamentally, I think you should be importing a SourcePackageRelease
> (although that isn't an object exported directly on the webservice), not
> a SourcePackagePublishingHistory, and that's part of why I requested
> that the graph be changed in the first place.

I doubt I would have been able to derive we should use an object that
has no documentation and that I didn't know existed until just now :).
Was this discussed at the Rally? I was not informed of it, if so.
Quick Google and it seems to be a column of a table -- how would I get
to that data in Python not running on Launchpad or with any access to
the Launchpad DB?

> As such, I would have
> expected the commit date to correspond to the date in the package's
> changelog (i.e. the date of authorship), rather than to any date of
> publication. I don't think you should have a link to the publication
> record in the commit message, since that really just recapitulates the
> original problem.

To be clear, we do store both bits of metadata.

The commit author name, email and date are the same as in
debian/changelog, if parseable.

The commit commiter name, email are from who is running the importer
and the date is from the Launchpad publishing record.

I'm fine with dropping the latter and just have it be the current date
at commit time.

> I think the only case where you need more than just version comparison
> is the case where a version is removed from a suite and then later
> copied back into the same suite. That's exceedingly rare (usually
> something else would precipitate a rebuild anyway), but it does happen
> occasionally, and I can see that you need an efficient way to detect
> that you've iterated back far enough through the history to complete
> catching up.

Yep, that was the corner case I was worried about. And it would have
to intersect with no other unique versions being published to the same
series in the meanwhile.

> Does this absolutely have to be in-band in the git metadata for the
> suites in question? You could keep a separate cache indicating how far
> you've got, and the rare case of needing to restore without that cache
> could perhaps just be handled by manually setting a threshold. Or if
> that isn't acceptable you could even have a separate branch in each
> repository that's just for importer metadata, outside of refs/heads/ so
> that it isn't normally very visible unless people go looking for it.

Definitely not a hard requirement for it to be in-band. We do already
have metadata (pristine-tar, dsc cache) for the importer in its own
'namespace' of branches, so we could probably reuse that, or as you
said, keep a distinct cache.

Revision history for this message
Colin Watson (cjwatson) wrote :

I'm not saying you should be able to get at a SourcePackageRelease directly (since it isn't exported). I'm just using that term as a convenient abbreviation: an SPR is a (source package name, version) pair. My comments to Robie at the rally that I thought you should import each version of a package only once were equivalent to this.

Revision history for this message
Nish Aravamudan (nacc) wrote :

On Wed, Nov 8, 2017 at 8:08 AM, Colin Watson <email address hidden> wrote:
> I'm not saying you should be able to get at a SourcePackageRelease
> directly (since it isn't exported). I'm just using that term as a
> convenient abbreviation: an SPR is a (source package name, version)
> pair. My comments to Robie at the rally that I thought you should
> import each version of a package only once were equivalent to this.

Ah ok, thanks. Sorry for the curt reply, first thing in the morning :)

Nish Aravamudan (nacc)
Changed in usd-importer:
status: Confirmed → In Progress
assignee: nobody → Nish Aravamudan (nacc)
Robie Basak (racb)
tags: added: import
Revision history for this message
Nish Aravamudan (nacc) wrote :

Note that this also affects our scripts, I'm realizing.

If we are reimporting the world, e.g., we want to have the linear walker 'clear' the database, so the keepup script stops keeping up for a bit, and as the walker finishes reimporting a source package, it adds it back to the database with the last publication record seen? Then the keepup script only looks at packages in the database, rather than using the whitelist/blacklist/etc.

The database could be local (sqlite) for now, with the intention that it goes away altogether eventually, as we eventually will import all source packages.

Revision history for this message
Nish Aravamudan (nacc) wrote :

In the short term, though, I'm going to disable the keepup script while I kick off a reimport run of the whitelist + phasing at 1%.

Revision history for this message
Nish Aravamudan (nacc) wrote :

Just so I don't forget, we would also want to make the update-repository-alias read from the same database in order to ensure it only affected packages that we have successfully imported.

Robie Basak (racb)
tags: added: spec
Revision history for this message
Robie Basak (racb) wrote :

I've drafted a specification for this as follows:

---
Commit metadata in git’s data model comprises:
A person’s name
A person’s email address
A timestamp expressed as a number of seconds from the Unix epoch
A timezone offset
for each of:
The author
The committer

Commit metadata generated by the importer is synthesized deterministically from the corresponding source package data instance and its underlying source package.

All authorship information fields are derived directly from their corresponding fields from the sign-off line of the first entry from the debian/changelog file of the underlying source package. An absent name in the changelog entry is interpreted as the empty string. Any failure to parse the changelog sign-off line must result in an import failure.

The committer timestamp is converted from the timestamp of the source package data instance.

The commit name and email address are statically set to usd-importer and <email address hidden> respectively.

Rationale: this presents users looking at git history with useful metadata in an appropriate mapping while also maintaining the determinism necessary for commit hash stability.
---

"source package data instance" means SourcePackageRelease, where I'm defining "the timestamp of the source package data instance" as the earliest timestamp of all the SourcePackagePublishingHistory for that SourcePackageRelease available to us. I might need to think about SourcePackagePublishingHistory entries invisible to us though (such as in a pocket copy from a PPA) by defining what "available to us" should mean exactly.

The commit name and email address probably should be changed to be made more Ubuntu-general now.

Feedback appreciated.

Changed in usd-importer:
assignee: Nish Aravamudan (nacc) → Robie Basak (racb)
status: In Progress → Triaged
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.