Corpora conversion workflow

Prerequisites

  • a website with a limited list of supported corpus formats and their detailed descriptions
  • packaging guidelines for submitters
  • a corpus metadata format specification (XML), a superset of the KonText configuration, the Manatee vertical file metadata and the Treex configuration (an illustrative sketch follows this list)
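
The metadata format specification itself is not reproduced on this page. The following XML fragment is only an illustrative, hypothetical sketch of what a user-provided metadata file might contain; every element and attribute name here is an assumption, not part of the actual specification.

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- hypothetical corpus metadata; all names below are illustrative only -->
    <corpus id="example_corpus">
      <format>CoNLL</format>
      <language>ces</language>
      <positional-attributes>
        <attribute name="word"/>
        <attribute name="lemma"/>
        <attribute name="tag"/>
      </positional-attributes>
      <structures>
        <structure name="doc">
          <attribute name="id"/>
          <attribute name="title"/>
        </structure>
        <structure name="s"/>
      </structures>
      <kontext include="true"/>
    </corpus>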

Packaging guidelines for submitters

  1. make sure that the data are in a well-defined (supported) format
  2. package the data in a standard way (e.g. use tar.gz or zip and make sure the directory structure contains only plain-text or gzipped text data files; all other information should be packaged separately); a check of this rule is sketched after this list
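
As an illustration of the second rule, the following Python sketch checks that a submitted archive contains only plain-text or gzipped text files. The allowed file suffixes and the function name are assumptions for this sketch, not part of any existing tool.

    import tarfile

    # Suffixes the (hypothetical) packaging rule would allow inside the archive.
    ALLOWED_SUFFIXES = (".txt", ".vert", ".txt.gz", ".vert.gz")

    def check_package(archive_path):
        """Return a list of archive members that violate the packaging guideline."""
        violations = []
        with tarfile.open(archive_path, "r:gz") as archive:
            for member in archive.getmembers():
                if member.isdir():
                    continue  # directories are fine, only files are checked
                if not member.name.endswith(ALLOWED_SUFFIXES):
                    violations.append(member.name)
        return violations

    if __name__ == "__main__":
        bad = check_package("submission.tar.gz")
        if bad:
            print("Disallowed files:", ", ".join(bad))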

Workflow

This is a summary of the steps necessary to convert a user-submitted corpus for use in KonText:

  1. Submission stage (DSpace)
    1. submit data to repository
    2. specify data format (CoNLL, Treex, PML?, ..., other)
    3. describe the data format in more detail using a wizard (analogous to the license selector; can be repository-independent)
    4. optionally tag the data as "Include the corpus in KonText"
    5. trigger validation stage upon submission
  2. Validation stage (DSpace/Manatee server/?)
    1. validate the data against the metadata in a restricted environment (can be very space- and CPU-intensive)
    2. trigger Prepare conversion stage
  3. Prepare conversion stage (Manatee server)
    1. download the user-provided metadata
    2. generate a conversion config file based on the user-provided metadata
    3. copy conversion config file to cluster
    4. trigger conversion stage
  4. Conversion job (cluster)
    1. check for new conversion config files
    2. download data from repository
    3. unpack downloaded data
    4. split the data into parts of reasonable size
    5. perform conversion on cluster
    6. monitor status of the conversion job
    7. collect the data generated by the cluster nodes and assemble the vertical file (sketched below)
    8. delete the data, splits and conversion configuration file from cluster
    9. trigger compilation stage
  5. Compilation stage (Manatee server)
    1. download the vertical file back from cluster
    2. download the user-provided metadata
    3. generate the Manatee metadata file based on the user-provided metadata (sketched below)
    4. compile the corpus based on the Manatee metadata file
    5. trigger delete cluster data files stage
    6. trigger update KonText configuration stage
  6. Delete cluster data files stage (Cluster)
    1. delete the cluster data files and the user-provided metadata
  7. Update KonText configuration stage (Manatee server)
    1. check for new user metadata files
    2. generate a KonText partial config file based on the user-provided metadata
    3. update the KonText config file based on the KonText partial config file (sketched below)
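
Steps 4.4 and 4.7 (splitting the data and assembling the vertical file from the per-node outputs) could look roughly like the following Python sketch. The chunk size, the file naming and the idea of splitting by line count are assumptions for illustration only; the actual job-submission and monitoring machinery is not shown.

    import gzip
    import os

    CHUNK_LINES = 500_000  # assumed "reasonable size" of one split, in lines

    def split_input(input_path, work_dir):
        """Split one large gzipped text file into numbered chunks for cluster nodes."""
        # NOTE: a real split would have to respect document boundaries;
        # this sketch splits on raw line counts only.
        os.makedirs(work_dir, exist_ok=True)
        chunk_paths = []
        chunk, lines_in_chunk = None, 0
        with gzip.open(input_path, "rt", encoding="utf-8") as source:
            for line in source:
                if chunk is None or lines_in_chunk >= CHUNK_LINES:
                    if chunk is not None:
                        chunk.close()
                    path = os.path.join(work_dir, f"part_{len(chunk_paths):04d}.txt")
                    chunk_paths.append(path)
                    chunk = open(path, "w", encoding="utf-8")
                    lines_in_chunk = 0
                chunk.write(line)
                lines_in_chunk += 1
        if chunk is not None:
            chunk.close()
        return chunk_paths

    def assemble_vertical(converted_paths, vertical_path):
        """Concatenate per-chunk conversion outputs into a single vertical file."""
        with open(vertical_path, "w", encoding="utf-8") as vertical:
            for path in sorted(converted_paths):  # keep the original chunk order
                with open(path, encoding="utf-8") as part:
                    vertical.writelines(part)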
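
Step 5.3 (generating the Manatee metadata file, i.e. the corpus registry, from the user-provided metadata) might be sketched as follows. The keys of the meta dictionary are assumptions about what the parsed user metadata could contain, and the registry fields shown are only a common minimum; a real corpus would typically need more. The compilation itself (step 5.4) would be done by Manatee's own indexing tools and is not sketched here.

    # Minimal sketch: turn (already parsed) user metadata into a Manatee registry file.
    # The `meta` keys used here are assumptions about the metadata format, not its spec.

    def write_registry(meta, registry_path):
        lines = [
            f'NAME "{meta["name"]}"',
            f'PATH {meta["data_dir"]}',
            f'VERTICAL {meta["vertical_path"]}',
            'ENCODING utf-8',
        ]
        for attr in meta["positional_attributes"]:        # e.g. ["word", "lemma", "tag"]
            lines.append(f'ATTRIBUTE {attr}')
        for struct, attrs in meta["structures"].items():  # e.g. {"doc": ["id"], "s": []}
            lines.append(f'STRUCTURE {struct} {{')
            for attr in attrs:
                lines.append(f'    ATTRIBUTE {attr}')
            lines.append('}')
        with open(registry_path, "w", encoding="utf-8") as registry:
            registry.write("\n".join(lines) + "\n")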
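
Steps 7.2 and 7.3 (generating a partial KonText config and merging it into the main configuration) could follow the pattern below. The element names (corplist, corpus/ident) and the idea that corpora are listed this way are assumptions made for the sketch; the real KonText configuration layout is not described on this page.

    import xml.etree.ElementTree as ET

    def merge_partial_config(main_config_path, partial_config_path):
        """Append corpus entries from a partial config file into the main config.

        Assumes (for the sketch only) that both files contain a <corplist>
        element holding one <corpus ident="..."/> element per corpus.
        """
        main_tree = ET.parse(main_config_path)
        main_corplist = main_tree.getroot().find(".//corplist")
        partial = ET.parse(partial_config_path).getroot()

        existing = {c.get("ident") for c in main_corplist.findall("corpus")}
        for corpus in partial.findall(".//corpus"):
            if corpus.get("ident") not in existing:  # avoid duplicate entries
                main_corplist.append(corpus)

        main_tree.write(main_config_path, encoding="utf-8", xml_declaration=True)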

Notes

  1. Automatic deployment of corpora to the production environment without testing in a staging environment is unfortunate
  2. User metadata might be entered incorrectly; modifying the metadata means repeating the whole process
  3. Updating a corpus must be done in a separate environment (if we want 99% uptime)
  4. The whole process is VERY error-prone
