Corpora compilation

Compilation

Set the MANATEE_REGISTRY environment variable to the directory containing the registry files:

export MANATEE_REGISTRY=/opt/projects/lindat-services-kontext/devel/data/corpora/registry

Compile the corpus:

cd "$MANATEE_REGISTRY"
compilecorp --no-sketches --recompile-corpus <corpus config file>
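The two steps above can be wrapped in a small helper. This is only a sketch: the function name compile_corpus is hypothetical, and it assumes compilecorp is on the PATH and that the argument is the name of a config file inside MANATEE_REGISTRY.

```shell
# Hypothetical helper: check the environment, then run compilecorp
# with the same flags as in the commands above.
compile_corpus() {
    corpus_config=$1

    if [ -z "${MANATEE_REGISTRY:-}" ]; then
        echo "MANATEE_REGISTRY is not set" >&2
        return 1
    fi
    if [ ! -f "$MANATEE_REGISTRY/$corpus_config" ]; then
        echo "No config file: $MANATEE_REGISTRY/$corpus_config" >&2
        return 1
    fi

    # Run in a subshell so the caller's working directory is unchanged.
    (
        cd "$MANATEE_REGISTRY" || exit 1
        compilecorp --no-sketches --recompile-corpus "$corpus_config"
    )
}
```

Usage would then be: compile_corpus syn2013pub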

Troubleshooting

The corpus config file in MANATEE_REGISTRY must have a lowercase name

This is probably a bug in KonText.

The name of the corpus config file in MANATEE_REGISTRY must consist only of alphanumeric characters

The name of the config file is used in the within clause of CQL queries, and special characters cause CQL syntax errors.
This always happens when searching two or more parallel corpora at the same time.
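Both naming rules can be checked before compilation. The following is a minimal sketch, assuming the two rules combine into "lowercase ASCII letters and digits only"; the function name is_valid_corpus_name is hypothetical.

```shell
# Hypothetical check: succeed only if the name is non-empty and made up
# solely of lowercase ASCII letters and digits.
is_valid_corpus_name() (
    # Force the C locale so that a-z cannot match uppercase letters.
    LC_ALL=C
    case $1 in
        '' | *[!a-z0-9]*) exit 1 ;;
    esac
    exit 0
)
```

For example, is_valid_corpus_name syn2013pub succeeds, while Syn2013pub or intercorp_cs would be rejected.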

Compilation of large corpora (e.g. syn2013pub) fails with a "File too large for FD_FD" error like this:

[20140612-09:47:00] Processed 288000000 lines, 248604749 positions.
[20140612-09:47:04] encodevert error: File too large for FD_FD, use FD_FGD
Writing log to /opt/projects/lindat-services-kontext/devel/data/corpora/data/syn2013pub/log/compilecorp_2014-06-12_0917.log

In large corpora, the type of the basic attributes (word, lemma...) needs to be changed to FD_FGD (see http://www.sketchengine.co.uk/documentation/wiki/SkE/Config/FullDoc#Attributestypes).

Computing sizes can fail if the doc structure element doesn't contain ATTRIBUTE wordcount

You will see the following message at the beginning of compilation:

Reading corpus configuration...
corpinfo: CorpInfoNotFound (wordcount)
...

Add ATTRIBUTE wordcount to the doc structure element:
STRUCTURE doc {
    ATTRIBUTE wordcount
}
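A quick way to see whether a config file declares the attribute at all is to grep for it. This is a rough sketch: the function name has_wordcount_attr is hypothetical, and the check only looks for the declaration line, not for whether it sits inside the doc structure.

```shell
# Hypothetical check: succeed if the config file contains a line
# declaring ATTRIBUTE wordcount (optionally quoted).
# Limitation: does not verify the line is inside STRUCTURE doc { ... }.
has_wordcount_attr() {
    grep -Eq '^[[:space:]]*ATTRIBUTE[[:space:]]+"?wordcount"?([[:space:]]|[{]|$)' "$1"
}
```

Usage would be along the lines of: has_wordcount_attr "$MANATEE_REGISTRY/syn2013pub" || echo "wordcount missing"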

Compilation of large corpora (e.g. syn2013pub) can also fail with a "Too many open files" error like this:

[20140612-18:33:28] lexicon (/opt/projects/lindat-services-kontext/devel/data/corpora/data/syn2013pub/s.id) make_lex_srt_file
[20140612-18:33:29] encodevert error: FileAccessError (/opt/projects/lindat-services-kontext/devel/data/corpora/data/syn2013pub/s.id.rev.idx) in ToFile: fopen [Too many open files]

In this case the system limits are too low. See the section on adjusting limits on the Installation page.
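Before starting a long compilation, the open-files limit of the current shell can be inspected with ulimit -n. The sketch below is an assumption-laden helper: the name open_files_limit_ok is hypothetical, and the required value is something you would choose yourself, since the original text does not state a number.

```shell
# Hypothetical check: succeed if the soft limit on open files for this
# shell is at least the required value passed as the first argument.
open_files_limit_ok() {
    required=$1
    current=$(ulimit -n)

    # Some systems report "unlimited" instead of a number.
    if [ "$current" = "unlimited" ]; then
        return 0
    fi
    [ "$current" -ge "$required" ]
}
```

For example, open_files_limit_ok 4096 || echo "raise the open files limit first", where 4096 is only a placeholder value.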

Updated by Redmine Admin almost 8 years ago · 1 revisions