Corpora conversion workflow¶
Prerequisites¶
- a website with a limited list of supported corpus formats and their detailed descriptions
- packaging guidelines for submitters
- corpus metadata format specification (XML) - superset of KonText configuration, Manatee Vertical File metadata and Treex configuration
Packaging guidelines for submitters¶
- make sure that the data are in a well-defined (supported) format
- package the data in a standard way (e.g. use tar.gz or zip and make sure that the directory structure contains only text or gzipped text data files; all other information should be packaged separately); a sketch of such a check is shown below
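A minimal sketch of what an automated check of these guidelines could look like, assuming submissions arrive as tar.gz archives; the allowed suffix list and the script itself are purely illustrative, not an existing repository tool:

```python
# Illustrative packaging check (not an existing repository tool): verify that a
# submitted tar.gz archive contains only plain-text or gzipped text data files.
import sys
import tarfile

# Assumed list of acceptable data file suffixes; the real list would come from
# the packaging guidelines.
ALLOWED_SUFFIXES = (".txt", ".vert", ".xml", ".txt.gz", ".vert.gz", ".xml.gz")

def check_package(path: str) -> list[str]:
    """Return a list of problems found in the submitted archive."""
    problems = []
    with tarfile.open(path, mode="r:gz") as archive:
        for member in archive.getmembers():
            if member.isdir():
                continue
            if not member.isfile():
                problems.append(f"{member.name}: not a regular file (links/devices are not allowed)")
            elif not member.name.endswith(ALLOWED_SUFFIXES):
                problems.append(f"{member.name}: unexpected file type, package it separately")
    return problems

if __name__ == "__main__":
    issues = check_package(sys.argv[1])
    for issue in issues:
        print("PROBLEM:", issue)
    sys.exit(1 if issues else 0)
```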
Workflow¶
This is a summary of the necessary steps to convert a user-submitted corpus to KonText:
- Submission stage (DSpace)
- submit data to repository
- specify data format (CoNLL, Treex, PML?, ..., other)
- describe the data format in more detail using a wizard (analogous to the license selector; can be repository-independent)
- optionally tag the data as "Include the corpus in KonText"
- trigger validation stage upon submission
- Validation stage (DSpace/Manatee server/?)
- validate the data against the metadata in a restricted environment (can be very space- and CPU-intensive)
- trigger Prepare conversion stage
- Prepare conversion stage (Manatee server)
- download the user provided metadata
- generate conversion config file based on user provided metadata
- copy conversion config file to cluster
- trigger conversion stage
- Conversion job (cluster)
- check for new conversion config files
- download data from repository
- unpack downloaded data
- split the data into parts of reasonable size (see the split/assemble sketch after this list)
- perform conversion on cluster
- monitor status of the conversion job
- collect the data generated by cluster nodes and assemble vertical file
- delete the data, splits and conversion configuration file from cluster
- trigger compilation stage
- Compilation stage (Manatee server)
- download the vertical file back from cluster
- download the user provided metadata
- generate Manatee metadata file based on user provided metadata
- compile the corpus based on the Manatee metadata file (see the compilation sketch after this list)
- trigger delete cluster data files stage
- trigger update KonText configuration stage
- Delete cluster data files stage (Cluster)
- delete cluster data files and user provided metadata
- Update KonText configuration stage (Manatee server)
- check for new user metadata files
- generate KonText partial config file based on user provided metadata
- update the KonText config file based on the KonText partial config file (see the config-merge sketch after this list)
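As noted in the conversion job above, here is a minimal sketch of the split and assemble steps, assuming each cluster node independently converts one chunk of input files into a partial vertical file; the chunk size limit and all paths are assumptions:

```python
# Illustrative split/assemble helpers for the conversion job (not an existing
# tool): group input files into similarly sized chunks for cluster nodes and
# concatenate the per-chunk vertical files back into one vertical file.
from pathlib import Path

MAX_CHUNK_BYTES = 500 * 1024 * 1024  # assumed "reasonable size" per cluster job

def split_into_chunks(input_dir: Path) -> list[list[Path]]:
    """Group input files into chunks no larger than MAX_CHUNK_BYTES each."""
    chunks, current, current_size = [], [], 0
    for path in sorted(input_dir.rglob("*")):
        if not path.is_file():
            continue
        if current and current_size + path.stat().st_size > MAX_CHUNK_BYTES:
            chunks.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += path.stat().st_size
    if current:
        chunks.append(current)
    return chunks

def assemble_vertical(partial_files: list[Path], output: Path) -> None:
    """Concatenate the partial vertical files produced by the cluster nodes, in order."""
    with output.open("wb") as out:
        for part in sorted(partial_files):
            out.write(part.read_bytes())
```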
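For the compilation stage, a sketch of generating a minimal Manatee registry-style metadata file from the user provided metadata and invoking a corpus compiler; the metadata field names and the compiler command are placeholders, only the registry keywords (NAME, PATH, VERTICAL, ENCODING, ATTRIBUTE) are standard Manatee configuration keys:

```python
# Illustrative compilation step: write a minimal Manatee registry file and run
# a corpus compiler. The metadata fields ("name", "positional_attributes") and
# the compiler command are placeholders, not part of any documented interface.
import subprocess
from pathlib import Path

def write_registry(meta: dict, vertical: Path, data_dir: Path, registry: Path) -> None:
    """Generate a minimal Manatee registry file from user provided metadata."""
    lines = [
        f'NAME "{meta["name"]}"',
        f'PATH "{data_dir}"',
        f'VERTICAL "{vertical}"',
        'ENCODING "utf-8"',
        "",
    ]
    for attr in meta.get("positional_attributes", []):
        lines += [f"ATTRIBUTE {attr}", ""]
    registry.write_text("\n".join(lines))

def compile_corpus(registry: Path) -> None:
    """Invoke the corpus compiler; the command name depends on the local
    Manatee installation and is only a placeholder here."""
    subprocess.run(["compilecorp", str(registry)], check=True)
```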
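And for the update KonText configuration stage, a sketch of merging a generated partial config file into the main config file with ElementTree; the section name and the element layout are assumptions for illustration, not KonText's actual schema:

```python
# Illustrative config-merge step for the KonText update stage: append the
# elements of a generated partial config file into a target section of the
# main config file. Element and section names are assumptions, not the actual
# KonText schema.
import xml.etree.ElementTree as ET
from pathlib import Path

def merge_partial_config(main_config: Path, partial_config: Path, section: str = "corplist") -> None:
    tree = ET.parse(main_config)
    target = tree.getroot().find(section)
    if target is None:
        raise ValueError(f"section '{section}' not found in {main_config}")
    for element in ET.parse(partial_config).getroot():
        target.append(element)
    tree.write(main_config, encoding="utf-8", xml_declaration=True)
```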
Notes¶
- Automatic deployment of corpora to the production environment without testing in a staging environment is unfortunate
- User metadata might be entered incorrectly; modifying the metadata means repeating the whole process
- Updating a corpus must be done in a separate environment (if we want 99% uptime)
- The whole process is VERY error-prone