Corpora conversion workflow » History » Version 1
Redmine Admin, 01/04/2017 05:25 PM
h1. Corpora conversion workflow

h2. Prerequisites

* website with a limited list of supported corpora formats and their detailed descriptions
* packaging guidelines for submitters
* corpus metadata format specification (XML) - a superset of the KonText configuration, Manatee vertical-file metadata and the Treex configuration

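The metadata specification itself is not given on this page. Purely as an illustration (all element and attribute names below are invented, not the actual schema), a single per-corpus XML file with one section per consumer might be assembled like this:

```python
# Sketch only: the metadata schema is not defined on this page and
# every element name below is invented for illustration.
import xml.etree.ElementTree as ET

root = ET.Element("corpus", name="my_corpus")

# section consumed by the Manatee vertical-file conversion
manatee = ET.SubElement(root, "manatee")
ET.SubElement(manatee, "attribute").text = "word"
ET.SubElement(manatee, "attribute").text = "lemma"

# section consumed by the KonText configuration generator
kontext = ET.SubElement(root, "kontext")
ET.SubElement(kontext, "group").text = "public"

# section consumed by the Treex conversion
treex = ET.SubElement(root, "treex")
ET.SubElement(treex, "scenario").text = "conll_to_vertical"

metadata_xml = ET.tostring(root, encoding="unicode")
print(metadata_xml)
```

Keeping all three configurations in one file is what makes the later stages (conversion config, Manatee metadata, KonText partial config) derivable from a single submission.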
h2. Packaging guidelines for submitters

# make sure the data are in a well-defined (supported) format
# package the data in a standard way (e.g. use tar.gz or zip, and make sure the directory structure contains only text or gzipped text data files; all other information should be packaged separately)

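A minimal sketch of how the second guideline could be checked automatically on the repository side; the allowed file-name suffixes here are an assumption for illustration, not part of the guideline:

```python
# Hypothetical sketch: verify that a submitted tar.gz archive contains
# only plain-text or gzipped text data files. The suffix whitelist is
# an assumption, not a specification.
import tarfile

ALLOWED_SUFFIXES = (".txt", ".txt.gz", ".vert", ".vert.gz")

def check_package(path):
    """Return the names of archive members that violate the guideline."""
    bad = []
    with tarfile.open(path, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isdir():
                continue
            if not member.name.endswith(ALLOWED_SUFFIXES):
                bad.append(member.name)
    return bad
```

Running such a check at submission time would catch packaging mistakes before the (expensive) validation and conversion stages start.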
h2. Workflow

This is a summary of the steps necessary to convert a user-submitted corpus for use in KonText:

# Submission stage (DSpace)
## submit the data to the repository
## specify the data format (CoNLL, Treex, PML?, ..., other)
## describe the data format in more detail using a wizard (analogous to the license selector; can be repository independent)
## optionally tag the data as "Include the corpus in KonText"
## trigger the validation stage upon submission
# Validation stage (DSpace/Manatee server/?)
## validate the data against the metadata in a restricted environment (can be very space and CPU intensive)
## trigger the prepare conversion stage
# Prepare conversion stage (Manatee server)
## download the user-provided metadata
## generate a conversion config file based on the user-provided metadata
## copy the conversion config file to the cluster
## trigger the conversion stage
# Conversion job (cluster)
## check for new conversion config files
## download the data from the repository
## unpack the downloaded data
## split the data into parts of reasonable size
## perform the conversion on the cluster
## monitor the status of the conversion job
## collect the data generated by the cluster nodes and assemble the vertical file
## delete the data, the splits and the conversion config file from the cluster
## trigger the compilation stage
# Compilation stage (Manatee server)
## download the vertical file back from the cluster
## download the user-provided metadata
## generate a Manatee metadata file based on the user-provided metadata
## compile the corpus based on the Manatee metadata file
## trigger the delete cluster data files stage
## trigger the update KonText configuration stage
# Delete cluster data files stage (cluster)
## delete the cluster data files and the user-provided metadata
# Update KonText configuration stage (Manatee server)
## check for new user metadata files
## generate a partial KonText config file based on the user-provided metadata
## update the main KonText config file based on the partial config file
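The "split the data into parts of reasonable size" step in the conversion job has one subtlety: the cuts should fall on document boundaries so that no document is torn across cluster nodes. A minimal sketch, assuming the input marks documents with @<doc>@ / @</doc>@ lines (an assumption about the input format, not a requirement stated above):

```python
# Hedged sketch of boundary-aware splitting. Assumes documents are
# delimited by <doc ...> / </doc> lines; other formats would need a
# different boundary test.
def split_documents(lines, max_lines=1000):
    """Yield lists of lines, cutting only after a closing </doc> line,
    so each part holds whole documents and stays near max_lines."""
    part, doc = [], []
    for line in lines:
        doc.append(line)
        if line.strip() == "</doc>":
            # start a new part if adding this document would overflow
            if part and len(part) + len(doc) > max_lines:
                yield part
                part = []
            part.extend(doc)
            doc = []
    if doc:            # trailing lines without a closing tag
        part.extend(doc)
    if part:
        yield part
```

Because the parts are later reassembled into a single vertical file, preserving document order and completeness here is what makes the "collect and assemble" step a simple concatenation.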
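The "Manatee metadata file" produced in the compilation stage is, in Manatee's convention, a plain-text registry file of key-value lines. A sketch of generating a minimal one from the user-provided metadata; the field set is a small common subset, and the paths and attribute list are placeholders:

```python
# Hedged sketch: emit a minimal Manatee registry file. Only a small
# subset of registry fields is shown; all values here are placeholders
# that would come from the user-provided metadata.
def make_registry(name, data_path, vertical_path, attributes):
    lines = [
        'NAME "%s"' % name,          # human-readable corpus name
        "PATH %s" % data_path,       # directory for compiled indexes
        "VERTICAL %s" % vertical_path,  # assembled vertical file
        "ENCODING utf-8",
    ]
    for attr in attributes:
        lines.append("ATTRIBUTE %s" % attr)  # positional attributes
    return "\n".join(lines) + "\n"

print(make_registry("demo", "/corpora/data/demo",
                    "/corpora/vert/demo.vert", ["word", "lemma"]))
```

The compile step then only needs this registry file and the vertical file it points at, which is why both are pulled back from the cluster first.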

h2. Notes

# Automatic deployment of corpora to the production environment without testing in a staging environment is unfortunate
# User metadata might be entered incorrectly, and any modification of the metadata means repeating the whole process
# Updating a corpus must be done in a separate environment (if we want 99% uptime)
# The whole process is *VERY* error prone