Corpora

Introduction

Corpora are converted from vertical text to the binary format by the compilecorp tool, which is provided by the manatee package.
The conversion needs two files: the vertical text (i.e. the corpus itself) and a corpus configuration file that describes the contents of the corpus in detail.

The vertical text is documented here: http://www.sketchengine.co.uk/documentation/wiki/SkE/PrepareText
The corpus configuration file is documented here: http://www.sketchengine.co.uk/documentation/wiki/SkE/Config/FullDoc
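
The two inputs described above can be sketched as follows. This is a minimal illustration, not the configuration used on the servers: the corpus name "demo", the sample tokens and the attribute set (word, lemma, tag) are assumptions, and the configuration keys should be checked against the documentation linked above. The helper names render_registry and compilecorp_command are hypothetical.

```python
from __future__ import annotations

from pathlib import PurePosixPath

# Hypothetical sample of vertical text: one token per line, attributes
# separated by tabs (word, lemma, tag), structures written as XML-like tags.
SAMPLE_VERTICAL = (
    '<doc id="demo">\n'
    "<s>\n"
    "Dogs\tdog\tNNS\n"
    "bark\tbark\tVBP\n"
    ".\t.\t.\n"
    "</s>\n"
    "</doc>\n"
)


def render_registry(name: str, data_dir: PurePosixPath, vertical: PurePosixPath) -> str:
    """Return a minimal corpus configuration matching the sample above.

    The attribute and structure list is an assumption made for this sketch.
    """
    lines = [
        f'NAME "{name}"',
        f'PATH "{data_dir}"',          # where the compiled corpus is written
        f'VERTICAL "{vertical}"',      # the vertical text to compile
        'ENCODING "utf-8"',
        "ATTRIBUTE word",
        "ATTRIBUTE lemma",
        "ATTRIBUTE tag",
        "STRUCTURE doc {",
        "    ATTRIBUTE id",
        "}",
        "STRUCTURE s",
    ]
    return "\n".join(lines) + "\n"


def compilecorp_command(registry_file: PurePosixPath) -> list[str]:
    """Build the argument list for compilecorp; this function does not run it."""
    return ["compilecorp", str(registry_file)]
```

Once the configuration text has been written to a file, the returned list can be passed to subprocess.run(..., check=True) on a machine where manatee is installed.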

Directory structure

The directory structure on kontext-dev (kontext) servers is as follows:

/opt/project/lindat-services/$ENVIRONMENT/data/corpora/registry # configuration files (no subdirectories)
/opt/project/lindat-services/$ENVIRONMENT/data/corpora/data # compiled corpora
/opt/project/lindat-services/$ENVIRONMENT/data/corpora/speech # mp3 files
/opt/project/lindat-services/devel/data/corpora/conversion # conversion of corpora (data and scripts)
/opt/project/lindat-services/devel/data/corpora/vert # vertical text files (corpora data)
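
As a rough sketch, the layout above can be resolved in a script like this. The function name corpus_dirs and the environment name "production" are hypothetical; only the path components come from the listing above.

```python
from pathlib import PurePosixPath

BASE = PurePosixPath("/opt/project/lindat-services")


def corpus_dirs(environment: str) -> dict:
    """Map each directory's role to its path for the given $ENVIRONMENT."""
    per_env = BASE / environment / "data" / "corpora"
    # Conversion inputs are listed under devel only, whatever the environment.
    devel = BASE / "devel" / "data" / "corpora"
    return {
        "registry": per_env / "registry",   # configuration files
        "data": per_env / "data",           # compiled corpora
        "speech": per_env / "speech",       # mp3 files
        "conversion": devel / "conversion", # conversion data and scripts
        "vert": devel / "vert",             # vertical text files
    }


if __name__ == "__main__":
    # "production" is an assumed environment name used only for illustration.
    for role, path in corpus_dirs("production").items():
        print(f"{role:<10} {path}")
```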

Updated by Redmine Admin almost 9 years ago · 1 revisions