Redmine Admin, 01/04/2017 05:26 PM

h1. Corpora compilation

h2. Compilation

*Set the MANATEE_REGISTRY environment variable to the directory with the registry files:*

<pre>
export MANATEE_REGISTRY=/opt/projects/lindat-services-kontext/devel/data/corpora/registry
</pre>

*Compile the corpus:*

<pre>
cd $MANATEE_REGISTRY
compilecorp --no-sketches --recompile-corpus <corpus config file>
</pre>
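
The two steps above can be sketched as a single shell function, assuming @compilecorp@ is on the @PATH@. The function name @compile_corpus@ and its sanity checks are hypothetical and not part of Manatee or KonText; the function only refuses to run when the registry directory or the corpus config file is missing.

```shell
# Hypothetical wrapper around the compile step shown above.
# $1 is the name of the corpus config file inside $MANATEE_REGISTRY.
compile_corpus() {
    corpus="$1"
    if [ -z "${MANATEE_REGISTRY:-}" ] || [ ! -d "$MANATEE_REGISTRY" ]; then
        echo "MANATEE_REGISTRY is not set or is not a directory" >&2
        return 1
    fi
    if [ ! -f "$MANATEE_REGISTRY/$corpus" ]; then
        echo "no config file named '$corpus' in $MANATEE_REGISTRY" >&2
        return 1
    fi
    # Same invocation as above, run from the registry directory
    # in a subshell so the caller's working directory is unchanged.
    (cd "$MANATEE_REGISTRY" && compilecorp --no-sketches --recompile-corpus "$corpus")
}
```

For example, @compile_corpus syn2013pub@ would compile the corpus whose config file is @$MANATEE_REGISTRY/syn2013pub@.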

h2. Troubleshooting

*Corpus config file in MANATEE_REGISTRY must have a lowercase name*

This is probably a bug in KonText.

*Name of the corpus config file in MANATEE_REGISTRY must consist only of alphanumeric characters*

The name of the config file is used in the @within@ clause of CQL queries, and special characters cause CQL syntax errors.
This always happens when searching in two or more parallel corpora at the same time.
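
Both naming rules above can be checked before compiling. The following is a minimal sketch, assuming a POSIX shell and @grep@; the function names @is_valid_registry_name@ and @list_invalid_registry_names@ are hypothetical and not part of Manatee or KonText.

```shell
# Hypothetical helper: succeed only if the name consists solely of
# lowercase ASCII letters and digits.
is_valid_registry_name() {
    # LC_ALL=C keeps the a-z range from matching uppercase letters.
    printf '%s' "$1" | LC_ALL=C grep -Eq '^[a-z0-9]+$'
}

# Print the names of config files that break the naming rules.
# $1 is the registry directory (defaults to $MANATEE_REGISTRY).
list_invalid_registry_names() {
    dir="${1:-$MANATEE_REGISTRY}"
    for path in "$dir"/*; do
        [ -f "$path" ] || continue
        name=$(basename "$path")
        is_valid_registry_name "$name" || printf '%s\n' "$name"
    done
}
```

Running @list_invalid_registry_names@ with no argument lists the offending files in @$MANATEE_REGISTRY@; empty output means every name passes both checks.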

*Compilation of large corpora (e.g. syn2013pub) can fail with a "File too large" error like this:*

<pre>
[20140612-09:47:00] Processed 288000000 lines, 248604749 positions.
[20140612-09:47:04] encodevert error: File too large for FD_FD, use FD_FGD
Writing log to /opt/projects/lindat-services-kontext/devel/data/corpora/data/syn2013pub/log/compilecorp_2014-06-12_0917.log
</pre>

For large corpora, the type of the basic attributes (word, lemma, ...) must be changed to @FD_FGD@ in the corpus config file (see http://www.sketchengine.co.uk/documentation/wiki/SkE/Config/FullDoc#Attributestypes)
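
As an illustration, the change might look like the following fragment of a corpus config file. This is a sketch that assumes the corpus defines the attributes @word@ and @lemma@; any other settings already present inside the attribute blocks stay as they are.

<pre>
ATTRIBUTE word {
    TYPE "FD_FGD"
}
ATTRIBUTE lemma {
    TYPE "FD_FGD"
}
</pre>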

*Computing sizes can fail if the doc structure element does not contain ATTRIBUTE wordcount:*

You will see the following message at the beginning of the compilation:

<pre>
Reading corpus configuration...
corpinfo: CorpInfoNotFound (wordcount)
...
</pre>
Add ATTRIBUTE wordcount to the doc structure element:

<pre>
STRUCTURE doc {
    ATTRIBUTE wordcount
}
</pre>

*Compilation of large corpora (e.g. syn2013pub) can also fail with a "Too many open files" error like this:*

<pre>
[20140612-18:33:28] lexicon (/opt/projects/lindat-services-kontext/devel/data/corpora/data/syn2013pub/s.id) make_lex_srt_file
[20140612-18:33:29] encodevert error: FileAccessError (/opt/projects/lindat-services-kontext/devel/data/corpora/data/syn2013pub/s.id.rev.idx) in ToFile: fopen [Too many open files]
</pre>

In this case the system limit on open files is too low. See the *adjust limits* section on the [[Installation|Installation page]].
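
Before starting a long compilation it may be worth checking the current limit on open files. The following is a minimal sketch; the function name @has_enough_open_files@ is hypothetical, and the threshold passed to it is a placeholder, since the limit a given corpus actually needs is not stated here.

```shell
# Hypothetical helper: succeed if the soft limit on open files
# for the current shell is at least $1.
has_enough_open_files() {
    required="$1"
    current=$(ulimit -n)
    if [ "$current" = "unlimited" ]; then
        return 0
    fi
    [ "$current" -ge "$required" ]
}
```

A possible use, with 4096 as an arbitrary example value: @has_enough_open_files 4096 || echo "raise the open-file limit first" >&2@.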