Run parsers and generate hpz files

Bug #254443 reported by andrew
Affects: Hamlet
Status: In Progress
Importance: High
Assigned to: WVU Modeling Intelligence Lab

Bug Description

We now have a fully functioning pipeline but are lacking the data to use with it. Each parser will create a .hpz file that is used as input to the pre-processor. We need to generate these files.

This has already been done for the law dataset (at least half of it). It still needs to be done for the following datasets:

* STEP
* Text/HTML
* Java

A large part of this task will be finding data to run the parsers on.

* STEP - There is a folder on wisp at hamlet/data/STEP containing a collection of STEP files we extracted from the nara data. This would be a good starting point. Talk to Greg if you need assistance with this.

* Text/HTML - I'm not sure about this one. Should we use the data from the nara folks? Talk to Adam and see what he has to say, since he wrote this parser.

* Java - This is your cup of tea. The good thing about this parser is that there is a wide variety of possible datasets to run it on. I recommend you start with Weka. Try to build an hpz file for at least three different large open source projects.

I'm going to ask that Adam and Greg comment on this bug with instructions/recommendations for using their parsers. Be on the lookout for that.

Changed in hamlet:
assignee: nobody → mhull1
Gregory Gay (gregoryg) wrote :

About the only thing I can think to add about the STEP parser is that you might need to change the hard-coded directory in it to match where you store the STEP files. Just give me a shout if you need any help.

Gregory Gay (gregoryg) wrote :

No longer using .hpz files.

Changed in hamlet:
assignee: mhull1 → wvumil
importance: Undecided → High
status: New → In Progress
Gregory Gay (gregoryg) wrote :

STEP corpora added to svn.
