Spencer/Howe example does not align correctly

Bug #567214 reported by Gregor Middell on 2010-04-20
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
collatex
Undecided
Ronald Haentjens Dekker

Bug Description

The use case in Spencer/Howe’s article about multiple progressive alignment, which shows the dependency of pairwise alignment on the order of witnesses collated, is not aligned correctly.

See: collatex/src/test/java/eu/interedition/collatex2/alignmenttable/SpencerHoweTest.java

Changed in collatex:
assignee: nobody → Ronald Haentjens Dekker (ronald-dekker)

Started work on this bug in lp:collatex/1.0development branch.

[RHD] Work on bug #567214; Refactoring to decouple IndexMatcher from AlignmentTable. Goal is to make the IndexMatcher work with anything that contains Tokens, whether that is a Witness, AlignmentTable or a VariantGraph. The IndexMatcher should be able to compare a Witness with a Witness, an AlignmentTable with a Witness or a VariantGraph with a Witness.

Next step after this refactoring is to make the new IndexMatcher work with the VariantGraph I prototyped earlier.

The more general IndexMatcher as described above works.
Basic VariantGraph implementation is done.

Right now looking into Directed Acyclic Graphs, topological sorting and longest path algorithms.

Status update:
The development version of CollateX builds a VariantGraph internally, containing the variation between the witnesses.

VariantGraph is implemented. Variant Graph contains a start and an end vertex.
Vertices contain the normalized version of a token, plus the matching token for each indivual witness.
Edges contain the witnesses for which the edge represents a valid path through the graph.

To build the alignment table CollateX converts the VariantGraph to a Directed Acyclic Graph (DAG) and uses a longest path algorithm (based on BellmanFort) to calculate the optimal number of columns in the table.

Next step: make transposition detection work on the VariantGraph. Now looking into Cycle detection to make that work.

Also the n-gram indexing on the Variant Graph has to be implemented (current implementation is a short cut that assumes every token is unique; which is the case in the Spencer and Howe testcase; not so in real life).

Status update:

Transpositions that occur between the witnesses cause cycles in the VariantGraph.
The challenge is how to deal with these cycles.

I got cycle detection to work on the VariantGraph implementation. Problem is that one transposition can cause multiple cycles. It is not easy to transform cycles intro transpositions and decycle them. The damage has already occured.

I recently changed tactics. Instead of working on cycle detection, I am now working on cycle prevention.
The focus of development has shifted to cycle prevention while building the VariantGraph.

In my 'old' alignment code from the bootcamp of april 2010 there already is transposition detection code.
I need to combine the new VariantGraph based alignment code with the old transposition detection code.

pseudo code looks as follows:

buildVariantGraph:
for each witness {
  match tokens of new witness with existing variant graph
  determine sequences in matches
  determine transpositions
  for each transposition {
    determine which vertices in the variant graph need to be duplicated to prevent cycles
    add vertices
  }
  add edges for matches, add vertices for variants.
}

The new implementation is in the VariantGraph2 class in the experimental package in the 1.0development tree of the source code. The implementation has just started, the code will be in a state of flux for a while.
When its done it will replace the current VariantGraph and DAVariantGraph classes.

Status update: The old VariantGraph and DAVariantGraph classes are merged into one class: VariantGraph2.

This class contains a variant graph implementation that is based on a DAG.
The responsibilities of the class are still the same as described in a previous status update.

Vertices contain the normalized version of a token, plus the matching token for each indivual witness.
Edges contain the witnesses for which the edge represents a valid path through the graph.

The Variant Graph contains a start and an end vertex.
The alignment table is build using the longest path between the start and the end vertex.

Next step will be to refactor the existing transposition detection to make it work on the variant graph and the witness that gets added during the alignment process.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers