Redmine Admin, 01/04/2017 05:24 PM

h1. Corpora

* [[Corpora conversion workflow]] (proposal)
* [[Corpora conversion]]
* [[Corpora compilation]]
* [[Conversion benchmarks]]
* "List of available corpora":https://docs.google.com/spreadsheets/d/1K0ZpJNVRcd5Yt1Ti1p2zr3lrxRxyKAZwdzx8fOrjkH0/edit?usp=sharing

h2. Introduction

Conversion of corpora from vertical text to binary format is done by the compilecorp tool provided by the manatee package.
Two files are needed for the conversion: the vertical text itself (i.e. the corpus) and a corpus configuration file that describes the contents of the corpus in detail.

The vertical text is documented here: http://www.sketchengine.co.uk/documentation/wiki/SkE/PrepareText
The corpus configuration file is documented here: http://www.sketchengine.co.uk/documentation/wiki/SkE/Config/FullDoc
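
As a rough illustration of the vertical text input, the sketch below writes a tiny document with one token per line, wrapped in @<doc>@ and @<s>@ structure tags. The function name @to_vertical@, the single word attribute per token, and the whitespace tokenizer are assumptions made for this example only; the PrepareText page linked above remains the authoritative description of the format.

```python
def to_vertical(doc_id, sentences):
    """Return vertical text for one document: one token per line.

    Hypothetical helper for illustration; it emits only the word form,
    with no further positional attributes such as lemma or tag.
    """
    lines = [f'<doc id="{doc_id}">']
    for sentence in sentences:
        lines.append("<s>")
        # Naive whitespace tokenization; a real pipeline would use a tokenizer.
        lines.extend(sentence.split())
        lines.append("</s>")
    lines.append("</doc>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_vertical("sample", ["Hello world !", "Second sentence ."]), end="")
```

Real corpora usually carry more attributes per token, separated by tabs, and those attributes and structures have to match what the corpus configuration file declares.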

h2. Directory structure

The directory structure on kontext-dev (kontext) servers is as follows:

<pre>
/opt/project/lindat-services/$ENVIRONMENT/data/corpora/registry  # configuration files (no subdirectories)
/opt/project/lindat-services/$ENVIRONMENT/data/corpora/data      # compiled corpora
/opt/project/lindat-services/$ENVIRONMENT/data/corpora/speech    # mp3 files
/opt/project/lindat-services/devel/data/corpora/conversion       # conversion of corpora (data and scripts)
/opt/project/lindat-services/devel/data/corpora/vert             # vertical text files (corpora data)
</pre>
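
The layout above can be expressed as a small helper when scripts need to locate these directories. This is a minimal sketch: the names @corpora_paths@ and @registry_file@ are hypothetical, and it assumes that a corpus configuration file is named after the corpus, which this page does not state.

```python
from pathlib import PurePosixPath

BASE = PurePosixPath("/opt/project/lindat-services")


def corpora_paths(environment):
    """Map each purpose to its directory for the given environment."""
    env_root = BASE / environment / "data" / "corpora"
    devel_root = BASE / "devel" / "data" / "corpora"
    return {
        "registry": env_root / "registry",  # configuration files
        "data": env_root / "data",          # compiled corpora
        "speech": env_root / "speech",      # mp3 files
        # The listing above gives these two only under "devel".
        "conversion": devel_root / "conversion",
        "vert": devel_root / "vert",
    }


def registry_file(environment, corpus_name):
    """Path of a corpus configuration file.

    Assumption: the file is named after the corpus. The registry has no
    subdirectories, so the file sits directly inside it.
    """
    return corpora_paths(environment)["registry"] / corpus_name


if __name__ == "__main__":
    for purpose, path in corpora_paths("devel").items():
        print(f"{purpose:<11}{path}")
```

Using @PurePosixPath@ keeps the helper free of filesystem access, so it can be run on any machine to compute server paths.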