Corpora conversion workflow

Prerequisites

  • a website with a limited list of supported corpus formats and their detailed descriptions
  • packaging guidelines for submitters
  • a corpus metadata format specification (XML), a superset of the KonText configuration, the Manatee vertical file metadata and the Treex configuration (an illustrative sketch follows this list)
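
The metadata format specification itself is not reproduced on this page. The following XML fragment is only an illustrative, hypothetical sketch of what a user-provided metadata file might contain; every element and attribute name here is an assumption, not part of the actual specification.

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- hypothetical corpus metadata; all names below are illustrative only -->
    <corpus id="example_corpus">
      <format>CoNLL</format>
      <language>ces</language>
      <positional-attributes>
        <attribute name="word"/>
        <attribute name="lemma"/>
        <attribute name="tag"/>
      </positional-attributes>
      <structures>
        <structure name="doc">
          <attribute name="id"/>
          <attribute name="title"/>
        </structure>
        <structure name="s"/>
      </structures>
      <kontext include="true"/>
    </corpus>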

Packaging guidelines for submitters

  1. make sure that the data are in a well-defined (supported) format
  2. package the data in a standard way (e.g. use tar.gz or zip and make sure the directory structure contains only plain-text or gzipped text data files; all other information should be packaged separately); a check of this rule is sketched after this list
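
As an illustration of the second rule, the following Python sketch checks that a submitted archive contains only plain-text or gzipped text files. The allowed file suffixes and the function name are assumptions for this sketch, not part of any existing tool.

    import tarfile

    # Suffixes the (hypothetical) packaging rule would allow inside the archive.
    ALLOWED_SUFFIXES = (".txt", ".vert", ".txt.gz", ".vert.gz")

    def check_package(archive_path):
        """Return a list of archive members that violate the packaging guideline."""
        violations = []
        with tarfile.open(archive_path, "r:gz") as archive:
            for member in archive.getmembers():
                if member.isdir():
                    continue  # directories are fine, only files are checked
                if not member.name.endswith(ALLOWED_SUFFIXES):
                    violations.append(member.name)
        return violations

    if __name__ == "__main__":
        bad = check_package("submission.tar.gz")
        if bad:
            print("Disallowed files:", ", ".join(bad))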

Workflow

This is a summary of the steps necessary to convert a user-submitted corpus for use in KonText:

  1. Submission stage (DSpace)
    1. submit data to repository
    2. specify data format (CoNLL, Treex, PML?, ..., other)
    3. describe the data format in more detail using a wizard (analogous to the license selector; can be repository-independent)
    4. optionally tag the data as "Include the corpus in KonText"
    5. trigger validation stage upon submission
  2. Validation stage (DSpace/Manatee server/?)
    1. validate the data against the metadata in a restricted environment (can be very space- and CPU-intensive)
    2. trigger Prepare conversion stage
  3. Prepare conversion stage (Manatee server)
    1. download the user-provided metadata
    2. generate a conversion config file based on the user-provided metadata
    3. copy conversion config file to cluster
    4. trigger conversion stage
  4. Conversion job (cluster)
    1. check for new conversion config files
    2. download data from repository
    3. unpack downloaded data
    4. split the data into parts of reasonable size
    5. perform conversion on cluster
    6. monitor status of the conversion job
    7. collect the data generated by the cluster nodes and assemble the vertical file (sketched below)
    8. delete the data, splits and conversion configuration file from cluster
    9. trigger compilation stage
  5. Compilation stage (Manatee server)
    1. download the vertical file back from cluster
    2. download the user-provided metadata
    3. generate the Manatee metadata file based on the user-provided metadata (sketched below)
    4. compile the corpus based on the Manatee metadata file
    5. trigger delete cluster data files stage
    6. trigger update KonText configuration stage
  6. Delete cluster data files stage (Cluster)
    1. delete the cluster data files and the user-provided metadata
  7. Update KonText configuration stage (Manatee server)
    1. check for new user metadata files
    2. generate a KonText partial config file based on the user-provided metadata
    3. update the KonText config file based on the KonText partial config file (sketched below)
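
Steps 4.4 and 4.7 (splitting the data and assembling the vertical file from the per-node outputs) could look roughly like the following Python sketch. The chunk size, the file naming and the idea of splitting by line count are assumptions for illustration only; the actual job-submission and monitoring machinery is not shown.

    import gzip
    import os

    CHUNK_LINES = 500_000  # assumed "reasonable size" of one split, in lines

    def split_input(input_path, work_dir):
        """Split one large gzipped text file into numbered chunks for cluster nodes."""
        # NOTE: a real split would have to respect document boundaries;
        # this sketch splits on raw line counts only.
        os.makedirs(work_dir, exist_ok=True)
        chunk_paths = []
        chunk, lines_in_chunk = None, 0
        with gzip.open(input_path, "rt", encoding="utf-8") as source:
            for line in source:
                if chunk is None or lines_in_chunk >= CHUNK_LINES:
                    if chunk is not None:
                        chunk.close()
                    path = os.path.join(work_dir, f"part_{len(chunk_paths):04d}.txt")
                    chunk_paths.append(path)
                    chunk = open(path, "w", encoding="utf-8")
                    lines_in_chunk = 0
                chunk.write(line)
                lines_in_chunk += 1
        if chunk is not None:
            chunk.close()
        return chunk_paths

    def assemble_vertical(converted_paths, vertical_path):
        """Concatenate per-chunk conversion outputs into a single vertical file."""
        with open(vertical_path, "w", encoding="utf-8") as vertical:
            for path in sorted(converted_paths):  # keep the original chunk order
                with open(path, encoding="utf-8") as part:
                    vertical.writelines(part)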
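
Step 5.3 (generating the Manatee metadata file, i.e. the corpus registry, from the user-provided metadata) might be sketched as follows. The keys of the meta dictionary are assumptions about what the parsed user metadata could contain, and the registry fields shown are only a common minimum; a real corpus would typically need more. The compilation itself (step 5.4) would be done by Manatee's own indexing tools and is not sketched here.

    # Minimal sketch: turn (already parsed) user metadata into a Manatee registry file.
    # The `meta` keys used here are assumptions about the metadata format, not its spec.

    def write_registry(meta, registry_path):
        lines = [
            f'NAME "{meta["name"]}"',
            f'PATH {meta["data_dir"]}',
            f'VERTICAL {meta["vertical_path"]}',
            'ENCODING utf-8',
        ]
        for attr in meta["positional_attributes"]:        # e.g. ["word", "lemma", "tag"]
            lines.append(f'ATTRIBUTE {attr}')
        for struct, attrs in meta["structures"].items():  # e.g. {"doc": ["id"], "s": []}
            lines.append(f'STRUCTURE {struct} {{')
            for attr in attrs:
                lines.append(f'    ATTRIBUTE {attr}')
            lines.append('}')
        with open(registry_path, "w", encoding="utf-8") as registry:
            registry.write("\n".join(lines) + "\n")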
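
Steps 7.2 and 7.3 (generating a partial KonText config and merging it into the main configuration) could follow the pattern below. The element names (corplist, corpus/ident) and the idea that corpora are listed this way are assumptions made for the sketch; the real KonText configuration layout is not described on this page.

    import xml.etree.ElementTree as ET

    def merge_partial_config(main_config_path, partial_config_path):
        """Append corpus entries from a partial config file into the main config.

        Assumes (for the sketch only) that both files contain a <corplist>
        element holding one <corpus ident="..."/> element per corpus.
        """
        main_tree = ET.parse(main_config_path)
        main_corplist = main_tree.getroot().find(".//corplist")
        partial = ET.parse(partial_config_path).getroot()

        existing = {c.get("ident") for c in main_corplist.findall("corpus")}
        for corpus in partial.findall(".//corpus"):
            if corpus.get("ident") not in existing:  # avoid duplicate entries
                main_corplist.append(corpus)

        main_tree.write(main_config_path, encoding="utf-8", xml_declaration=True)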

Notes

  1. Automatic deployment of corpora to the production environment without testing in a staging environment is unfortunate
  2. User metadata might be entered incorrectly; modifying the metadata means repeating the whole process
  3. Updating a corpus must be done in a separate environment (if we want 99% uptime)
  4. The whole process is VERY error-prone
