Corpora compilation
Compilation
Set the MANATEE_REGISTRY environment variable to the directory with registry files:
export MANATEE_REGISTRY=/opt/projects/lindat-services-kontext/devel/data/corpora/registry
Compile the corpus:
cd $MANATEE_REGISTRY
compilecorp --no-sketches --recompile-corpus <corpus config file>
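The two steps above can be wrapped in a small helper. This is only a sketch: the function name compile_corpus is hypothetical, and the default registry path is the one used on this page and will differ on other installations. It checks that the registry file exists before calling compilecorp with the same options as above.

```shell
#!/bin/sh
# Sketch of a wrapper around the compilation steps described above.
# Assumption: the default path below is this page's example; adjust it.
: "${MANATEE_REGISTRY:=/opt/projects/lindat-services-kontext/devel/data/corpora/registry}"
export MANATEE_REGISTRY

# compile_corpus <corpus config file>  (hypothetical helper name)
compile_corpus() {
    corpus="$1"
    # Fail early when the registry file is missing.
    if [ ! -f "$MANATEE_REGISTRY/$corpus" ]; then
        echo "no registry file: $MANATEE_REGISTRY/$corpus" >&2
        return 1
    fi
    # Run in a subshell so the caller's working directory is unchanged.
    (cd "$MANATEE_REGISTRY" && compilecorp --no-sketches --recompile-corpus "$corpus")
}
```

For example, compile_corpus syn2013pub would compile the corpus whose registry file is named syn2013pub.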
Troubleshooting
The name of the corpus config file in MANATEE_REGISTRY must be lowercase
This is probably a bug in KonText.
The name of the corpus config file in MANATEE_REGISTRY must consist only of alphanumeric characters
The name of the config file is used in the within clause of CQL queries, and special characters cause CQL syntax errors.
This always happens when searching in two or more parallel corpora at the same time.
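Both naming rules can be checked before compilation. The following is a minimal sketch, not part of KonText or Manatee; the function names is_valid_corpus_name and check_registry_names are hypothetical. It accepts only names made of lowercase ASCII letters and digits and prints every file name in a registry directory that breaks the rules.

```shell
# Return 0 when the name consists only of lowercase ASCII letters and digits.
is_valid_corpus_name() {
    # LC_ALL=C keeps the a-z range limited to ASCII regardless of locale.
    printf '%s' "$1" | LC_ALL=C grep -Eq '^[a-z0-9]+$'
}

# Print the names of registry files that violate the naming rules.
# Usage: check_registry_names <registry directory>
check_registry_names() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        name=${f##*/}
        is_valid_corpus_name "$name" || echo "$name"
    done
}
```

Running check_registry_names "$MANATEE_REGISTRY" prints nothing when all config file names are acceptable.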
Compilation of large corpora (e.g. syn2013pub) will fail with an error like this:
[20140612-09:47:00] Processed 288000000 lines, 248604749 positions.
[20140612-09:47:04] encodevert error: File too large for FD_FD, use FD_FGD
Writing log to /opt/projects/lindat-services-kontext/devel/data/corpora/data/syn2013pub/log/compilecorp_2014-06-12_0917.log
As the error message suggests, in large corpora the type of the basic attributes (word, lemma, ...) needs to be changed to FD_FGD
(see http://www.sketchengine.co.uk/documentation/wiki/SkE/Config/FullDoc#Attributestypes)
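A minimal sketch of the change in the corpus registry file, assuming the attribute block syntax from the Sketch Engine configuration documentation linked above. Only the TYPE line matters here; any other settings the attributes already have are omitted and should be kept.

```
ATTRIBUTE word {
    TYPE "FD_FGD"
}
ATTRIBUTE lemma {
    TYPE "FD_FGD"
}
```

After changing the type, the corpus has to be compiled again.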
Computing sizes can fail if the doc structure element doesn't contain ATTRIBUTE wordcount.
You will see the following message at the beginning of compilation:
Reading corpus configuration... corpinfo: CorpInfoNotFound (wordcount) ...
To fix this, add ATTRIBUTE wordcount to the doc structure element in the corpus config file:
STRUCTURE doc {
    ATTRIBUTE wordcount
}
Compilation of large corpora (e.g. syn2013pub) can also fail with a "Too many open files" error like this:
[20140612-18:33:28] lexicon (/opt/projects/lindat-services-kontext/devel/data/corpora/data/syn2013pub/s.id) make_lex_srt_file
[20140612-18:33:29] encodevert error: FileAccessError (/opt/projects/lindat-services-kontext/devel/data/corpora/data/syn2013pub/s.id.rev.idx) in ToFile: fopen [Too many open files]
In this case the system limits on open files are too low. See the "adjust limits" section on the Installation page.
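Before starting a long compilation, it can help to check the open files limit of the current shell. The sketch below uses the shell builtin ulimit -n; the function name check_open_files_limit is hypothetical, and the default threshold of 4096 is an illustrative assumption, not a value taken from the KonText or Manatee documentation.

```shell
# Return 0 when the soft limit on open files is at least the given value.
# Usage: check_open_files_limit [required]
# Assumption: 4096 is only an illustrative default threshold.
check_open_files_limit() {
    required="${1:-4096}"
    current=$(ulimit -n)
    # Some systems report "unlimited" instead of a number.
    if [ "$current" = "unlimited" ]; then
        return 0
    fi
    if [ "$current" -lt "$required" ]; then
        echo "open files limit is $current, want at least $required" >&2
        return 1
    fi
    return 0
}
```

This reads only the limit of the shell it runs in. Raising the limit permanently is a system configuration matter covered on the Installation page.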