Corpora
- Corpora conversion workflow (proposal)
- Corpora conversion
- Corpora compilation
- Conversion benchmarks
- List of available corpora
Introduction
Corpora are converted from vertical text to binary format by the compilecorp tool, which is provided by the manatee package.
Two files are needed for the conversion: the vertical text (i.e. the corpus itself) and a corpus configuration file that describes the contents of the corpus in detail.
The vertical text is documented here: http://www.sketchengine.co.uk/documentation/wiki/SkE/PrepareText
The corpus configuration file is documented here: http://www.sketchengine.co.uk/documentation/wiki/SkE/Config/FullDoc
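To make the two inputs more concrete, here is a minimal Python sketch that produces a tiny vertical text and a matching configuration file. It rests on assumptions that should be checked against the two pages linked above: that vertical text holds one token per line with tab-separated attributes and XML-like structure tags, and that the configuration uses keys such as NAME, PATH, VERTICAL, ENCODING, ATTRIBUTE and STRUCTURE. The function names, the word/lemma attributes and the doc/s structures are hypothetical and chosen only for illustration.

```python
def to_vertical(doc_id, sentences):
    """Render tokenised sentences as vertical text.

    Assumed layout: one token per line, attributes separated by tabs,
    structures (doc, s) written as XML-like tags on their own lines.
    """
    lines = ['<doc id="{}">'.format(doc_id)]
    for sentence in sentences:
        lines.append("<s>")
        for word, lemma in sentence:
            lines.append("{}\t{}".format(word, lemma))
        lines.append("</s>")
    lines.append("</doc>")
    return "\n".join(lines) + "\n"


def minimal_registry(name, data_path, vertical_path):
    """Build a minimal corpus configuration as a string.

    The keys used here are assumptions about the configuration format;
    the attribute and structure names must match the vertical text.
    """
    return "\n".join([
        'NAME "{}"'.format(name),
        "PATH {}".format(data_path),          # where the compiled corpus goes
        "VERTICAL {}".format(vertical_path),  # where the vertical text is read from
        "ENCODING utf-8",
        "ATTRIBUTE word",
        "ATTRIBUTE lemma",
        "STRUCTURE doc {",
        "    ATTRIBUTE id",
        "}",
        "STRUCTURE s",
    ]) + "\n"


if __name__ == "__main__":
    print(to_vertical("d1", [[("Dogs", "dog"), ("bark", "bark")]]))
    print(minimal_registry("Example", "/tmp/data/example", "/tmp/vert/example.vert"))
```

The order of the ATTRIBUTE entries is meant to mirror the column order in the vertical text (word first, lemma second); a mismatch between the two files is a likely source of conversion problems.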
Directory structure
The directory structure on kontext-dev (kontext) servers is as follows:
/opt/project/lindat-services/$ENVIRONMENT/data/corpora/registry  # configuration files (no subdirectories)
/opt/project/lindat-services/$ENVIRONMENT/data/corpora/data      # compiled corpora
/opt/project/lindat-services/$ENVIRONMENT/data/corpora/speech    # mp3 files
/opt/project/lindat-services/devel/data/corpora/conversion       # conversion of corpora (data and scripts)
/opt/project/lindat-services/devel/data/corpora/vert             # vertical text files (corpora data)
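The layout above can be expressed as a small Python helper, shown here as a sketch rather than an existing script. The function name corpora_dirs and the environment name "production" in the usage line are hypothetical. Note that, per the listing, the conversion and vert directories exist only under devel, whatever the value of $ENVIRONMENT.

```python
import posixpath

# Base directory taken from the listing above.
BASE = "/opt/project/lindat-services"


def corpora_dirs(environment):
    """Return the corpora directories for the given $ENVIRONMENT value."""
    env_root = posixpath.join(BASE, environment, "data", "corpora")
    # Conversion scripts and vertical texts are kept under "devel" only.
    devel_root = posixpath.join(BASE, "devel", "data", "corpora")
    return {
        "registry": posixpath.join(env_root, "registry"),        # configuration files
        "data": posixpath.join(env_root, "data"),                # compiled corpora
        "speech": posixpath.join(env_root, "speech"),            # mp3 files
        "conversion": posixpath.join(devel_root, "conversion"),  # data and scripts
        "vert": posixpath.join(devel_root, "vert"),              # vertical text files
    }


if __name__ == "__main__":
    # "production" is only an illustrative environment name.
    for name, path in corpora_dirs("production").items():
        print(name, path)
```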
Updated by Redmine Admin almost 8 years ago · 1 revisions