Redmine Admin, 01/04/2017 05:24 PM
h1. Corpora

* [[Corpora conversion workflow]] (proposal)
* [[Corpora conversion]]
* [[Corpora compilation]]
* [[Conversion benchmarks]]
* "List of available corpora":https://docs.google.com/spreadsheets/d/1K0ZpJNVRcd5Yt1Ti1p2zr3lrxRxyKAZwdzx8fOrjkH0/edit?usp=sharing

h2. Introduction

Corpora are converted from vertical text to binary format by the compilecorp tool, which is provided by the manatee package.
The conversion needs two files: the vertical text (i.e. the corpus itself) and a corpus configuration file that describes the contents of the corpus in detail.

The vertical text is documented here: http://www.sketchengine.co.uk/documentation/wiki/SkE/PrepareText
The corpus configuration file is documented here: http://www.sketchengine.co.uk/documentation/wiki/SkE/Config/FullDoc

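As a rough illustration of the first input, the sketch below writes a tiny corpus in vertical format: one token per line, attributes separated by tabs, and structures marked with XML-like tags. The function name @to_vertical@, the @doc@ and @s@ structure names, and the word/lemma/tag attribute set are assumptions made for this example only; a real corpus uses whatever attributes and structures its configuration file declares.

```python
from typing import Iterable, List, Tuple
from xml.sax.saxutils import quoteattr

# Assumed attribute set for this example: (word, lemma, tag).
Token = Tuple[str, str, str]


def to_vertical(doc_id: str, sentences: Iterable[List[Token]]) -> str:
    """Render sentences as vertical text: one token per line, tab-separated attributes."""
    # Hypothetical structure names: <doc> wraps the document, <s> wraps a sentence.
    lines = [f"<doc id={quoteattr(doc_id)}>"]
    for sentence in sentences:
        lines.append("<s>")
        for token in sentence:
            lines.append("\t".join(token))
        lines.append("</s>")
    lines.append("</doc>")
    return "\n".join(lines) + "\n"


sample = [[("Dogs", "dog", "NNS"), ("bark", "bark", "VBP"), (".", ".", ".")]]
vertical = to_vertical("example", sample)
print(vertical, end="")
```

The configuration file would then have to declare the same three positional attributes and the two structures for the compilation to match this layout.
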
h2. Directory structure

The directory structure on kontext-dev (kontext) servers is as follows:

<pre>
/opt/project/lindat-services/$ENVIRONMENT/data/corpora/registry   # configuration files (no subdirectories)
/opt/project/lindat-services/$ENVIRONMENT/data/corpora/data       # compiled corpora
/opt/project/lindat-services/$ENVIRONMENT/data/corpora/speech     # mp3 files
/opt/project/lindat-services/devel/data/corpora/conversion        # conversion of corpora (data and scripts)
/opt/project/lindat-services/devel/data/corpora/vert              # vertical text files (corpora data)
</pre>
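
The layout above can be sketched as a small helper that resolves the directories for a given environment. This is only an illustration: the function name @corpora_dirs@ and the environment name @staging@ are hypothetical, and the helper assumes, as the listing suggests, that the conversion and vert directories exist only under @devel@.

```python
from pathlib import PurePosixPath

BASE = PurePosixPath("/opt/project/lindat-services")


def corpora_dirs(environment: str) -> dict:
    """Return the corpora directories for one environment (hypothetical helper)."""
    per_env = BASE / environment / "data" / "corpora"
    # Assumption taken from the listing: conversion and vert live only under devel.
    devel = BASE / "devel" / "data" / "corpora"
    return {
        "registry": per_env / "registry",    # configuration files (no subdirectories)
        "data": per_env / "data",            # compiled corpora
        "speech": per_env / "speech",        # mp3 files
        "conversion": devel / "conversion",  # conversion data and scripts
        "vert": devel / "vert",              # vertical text files
    }


# "staging" is a made-up environment name used only for this example.
dirs = corpora_dirs("staging")
for name, path in dirs.items():
    print(f"{name:<10} {path}")
```

Keeping the paths in one place like this makes it harder for conversion scripts to mix the per-environment directories with the ones shared from @devel@.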