Corpora conversion workflow » History » Version 1

Redmine Admin, 01/04/2017 05:25 PM

h1. Corpora conversion workflow

h2. Prerequisites

* a website listing the limited set of supported corpus formats, with a detailed description of each
* packaging guidelines for submitters
* a corpus metadata format specification (XML): a superset of the KonText configuration, the Manatee vertical file metadata and the Treex configuration

h2. Packaging guidelines for submitters

# make sure that the data are in a well-defined (supported) format
# package the data in a standard way (e.g. use tar.gz or zip, and make sure the directory structure contains only text or gzipped text data files; all other information should be packaged separately)
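
The second guideline could be checked automatically when a package arrives. The following is only a minimal sketch of such a check, not an agreed implementation: the function names are hypothetical, "text" is assumed to mean valid UTF-8 without NUL bytes, and the whole archive is read into memory, which would not scale to large corpora.

```python
import gzip
import io
import tarfile
import zipfile
import zlib


def is_text(data: bytes) -> bool:
    """Heuristic (assumption): text means valid UTF-8 with no NUL bytes."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def is_text_or_gzipped_text(data: bytes) -> bool:
    """Accept plain text, or a gzip stream that decompresses to text."""
    if data[:2] == b"\x1f\x8b":  # gzip magic number
        try:
            data = gzip.decompress(data)
        except (OSError, EOFError, zlib.error):
            return False
    return is_text(data)


def iter_members(archive: bytes):
    """Yield (name, payload) for every regular file in a zip or tar(.gz) archive."""
    buf = io.BytesIO(archive)
    if zipfile.is_zipfile(buf):
        buf.seek(0)
        with zipfile.ZipFile(buf) as zf:
            for info in zf.infolist():
                if not info.is_dir():
                    yield info.filename, zf.read(info)
    else:
        buf.seek(0)
        # "r:*" lets tarfile detect the compression (none, gz, bz2, xz) itself
        with tarfile.open(fileobj=buf, mode="r:*") as tf:
            for member in tf:
                if member.isfile():
                    yield member.name, tf.extractfile(member).read()


def disallowed_members(archive: bytes) -> list:
    """Return the names of members that are neither text nor gzipped text."""
    return [
        name
        for name, payload in iter_members(archive)
        if not is_text_or_gzipped_text(payload)
    ]
```

An empty result from @disallowed_members@ means the package satisfies the guideline under the assumptions above; a non-empty result lists the files the submitter would be asked to package separately.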

h2. Workflow

This is a summary of the steps needed to convert a user-submitted corpus for use in KonText:

# Submission stage (DSpace)
## submit the data to the repository
## specify the data format (CoNLL, Treex, PML?, ..., other)
## describe the data format in more detail using a wizard (analogous to the license selector; can be repository-independent)
## optionally tag the data as "Include the corpus in KonText"
## trigger the validation stage upon submission
# Validation stage (DSpace/Manatee server/?)
## validate the data against the metadata in a restricted environment (can be very space- and CPU-intensive)
## trigger the Prepare conversion stage
# Prepare conversion stage (Manatee server)
## download the user-provided metadata
## generate a conversion config file based on the user-provided metadata
## copy the conversion config file to the cluster
## trigger the conversion stage
# Conversion job (cluster)
## check for new conversion config files
## download the data from the repository
## unpack the downloaded data
## split the data into parts of reasonable size
## perform the conversion on the cluster
## monitor the status of the conversion job
## collect the data generated by the cluster nodes and assemble the vertical file
## delete the data, splits and conversion configuration file from the cluster
## trigger the compilation stage
# Compilation stage (Manatee server)
## download the vertical file back from the cluster
## download the user-provided metadata
## generate the Manatee metadata file based on the user-provided metadata
## compile the corpus based on the Manatee metadata file
## trigger the Delete cluster data files stage
## trigger the Update KonText configuration stage
# Delete cluster data files stage (cluster)
## delete the cluster data files and the user-provided metadata
# Update KonText configuration stage (Manatee server)
## check for new user metadata files
## generate a KonText partial config file based on the user-provided metadata
## update the KonText config file based on the KonText partial config file
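
The trigger chain between the stages above can be written down as a small table, which makes it easier to see what runs where and where a failure would stop the process. This is a rough sketch only: the stage identifiers, the @run@ function and the @handlers@ mapping are hypothetical names introduced for illustration, and the real stages would be separate jobs on DSpace, the Manatee server and the cluster rather than in-process callables.

```python
from collections import deque

# stage -> (where it runs, stages it triggers), transcribed from the list above
STAGES = {
    "submission": ("DSpace", ["validation"]),
    "validation": ("DSpace/Manatee server/?", ["prepare_conversion"]),
    "prepare_conversion": ("Manatee server", ["conversion"]),
    "conversion": ("cluster", ["compilation"]),
    "compilation": ("Manatee server", ["delete_cluster_data", "update_kontext_config"]),
    "delete_cluster_data": ("cluster", []),
    "update_kontext_config": ("Manatee server", []),
}


def run(handlers, start="submission"):
    """Run each stage's handler in trigger order and stop at the first failure.

    `handlers` maps a stage name to a zero-argument callable that returns True
    on success (hypothetical interface; a stage without a handler is a no-op).
    Returns a tuple (completed_stages, failed_stage_or_None).
    """
    completed = []
    queue, seen = deque([start]), {start}
    while queue:
        stage = queue.popleft()
        handler = handlers.get(stage, lambda: True)
        if not handler():
            return completed, stage
        completed.append(stage)
        # enqueue every stage this one triggers, each at most once
        for triggered in STAGES[stage][1]:
            if triggered not in seen:
                seen.add(triggered)
                queue.append(triggered)
    return completed, None


if __name__ == "__main__":
    completed, failed = run({"conversion": lambda: False})
    print("completed:", completed)
    print("failed at:", failed)
```

Keeping the chain in one table rather than in each stage's own trigger logic is a design choice of this sketch; it gives a single place to record how far a given submission got.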

h2. Notes

# Automatic deployment of corpora to the production environment without testing in a staging environment is undesirable
# User metadata might be entered incorrectly; any modification of the metadata means repeating the whole process
# Updating a corpus must be done in a separate environment (if we want 99% uptime)
# The whole process is *VERY* error-prone