Corpora compilation » History » Version 1
Redmine Admin, 01/04/2017 05:26 PM
h1. Corpora compilation

h2. Compilation

*Set the MANATEE_REGISTRY environment variable to the directory containing the registry files:*

<pre>
export MANATEE_REGISTRY=/opt/projects/lindat-services-kontext/devel/data/corpora/registry
</pre>

*Compile the corpus:*

<pre>
cd "$MANATEE_REGISTRY"
compilecorp --no-sketches --recompile-corpus <corpus config file>
</pre>

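The two steps above assume that the variable is set and that the config file really is in the registry directory. A minimal sketch of a pre-flight check follows; the function name @preflight@ and the corpus name @mycorpus@ are hypothetical and not part of Manatee or KonText.

```shell
# Hypothetical helper (not part of Manatee or KonText): check the
# environment before running compilecorp.
# Usage: preflight <corpus config file name>
preflight() {
    config_name=$1
    # The registry variable must be set and non-empty.
    if [ -z "${MANATEE_REGISTRY:-}" ]; then
        echo "MANATEE_REGISTRY is not set" >&2
        return 1
    fi
    # It must point to an existing directory.
    if [ ! -d "$MANATEE_REGISTRY" ]; then
        echo "MANATEE_REGISTRY is not a directory: $MANATEE_REGISTRY" >&2
        return 1
    fi
    # The corpus config file must exist inside that directory.
    if [ ! -f "$MANATEE_REGISTRY/$config_name" ]; then
        echo "No config file '$config_name' in $MANATEE_REGISTRY" >&2
        return 1
    fi
    return 0
}

# Example (mycorpus is a placeholder name):
# preflight mycorpus && cd "$MANATEE_REGISTRY" && \
#     compilecorp --no-sketches --recompile-corpus mycorpus
```

The check only reports problems on stderr; it does not modify the registry or start the compilation itself.
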
h2. Troubleshooting

*The corpus config file in MANATEE_REGISTRY must have a lowercase name*

This is probably a bug in KonText.

*The name of the corpus config file in MANATEE_REGISTRY must consist only of alphanumeric characters*

The name of the config file is used in the @within@ clause of CQL queries, and special characters cause CQL syntax errors.
This always happens when searching two or more parallel corpora at the same time.

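Both naming rules can be checked before a config file is placed in the registry. The following is a minimal sketch, assuming the two rules combine to "lowercase ASCII letters and digits only"; the function name @is_valid_corpus_name@ is hypothetical.

```shell
# Hypothetical helper: succeed only if the name consists solely of
# lowercase ASCII letters and digits (assumed reading of the two rules).
is_valid_corpus_name() {
    # LC_ALL=C keeps the [a-z] range from matching uppercase or
    # accented letters in other locales.
    printf '%s' "$1" | LC_ALL=C grep -Eq '^[a-z0-9]+$'
}

# Example: report every file in the registry whose name breaks the rules.
# for f in "$MANATEE_REGISTRY"/*; do
#     is_valid_corpus_name "$(basename "$f")" || echo "bad name: $f"
# done
```
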
*Compilation of large corpora (e.g. syn2013pub) will fail with an error like this:*

<pre>
[20140612-09:47:00] Processed 288000000 lines, 248604749 positions.
[20140612-09:47:04] encodevert error: File too large for FD_FD, use FD_FGD
Writing log to /opt/projects/lindat-services-kontext/devel/data/corpora/data/syn2013pub/log/compilecorp_2014-06-12_0917.log
</pre>

In large corpora, the type of the basic attributes (word, lemma, ...) needs to be changed to @FD_FGD@ (see http://www.sketchengine.co.uk/documentation/wiki/SkE/Config/FullDoc#Attributestypes).

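As an illustration, the attribute type is set with the @TYPE@ key inside each @ATTRIBUTE@ block of the corpus config file. The fragment below is a sketch only: it assumes the corpus defines the attributes @word@ and @lemma@ and omits every other setting the real file contains.

<pre>
# Sketch: only the TYPE lines are relevant here.
ATTRIBUTE word {
    TYPE "FD_FGD"
}
ATTRIBUTE lemma {
    TYPE "FD_FGD"
}
</pre>

After changing the type, recompile the corpus as described in the Compilation section.
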
*Computing sizes can fail if the doc structure element does not contain ATTRIBUTE wordcount:*

You will see the following message at the beginning of compilation:

<pre>
Reading corpus configuration...
corpinfo: CorpInfoNotFound (wordcount)
...
</pre>

Add ATTRIBUTE wordcount to the doc structure element:

<pre>
STRUCTURE doc {
    ATTRIBUTE wordcount
}
</pre>

*Compilation of large corpora (e.g. syn2013pub) can also fail with a "Too many open files" error like this:*

<pre>
[20140612-18:33:28] lexicon (/opt/projects/lindat-services-kontext/devel/data/corpora/data/syn2013pub/s.id) make_lex_srt_file
[20140612-18:33:29] encodevert error: FileAccessError (/opt/projects/lindat-services-kontext/devel/data/corpora/data/syn2013pub/s.id.rev.idx) in ToFile: fopen [Too many open files]
</pre>

In this case the system limits are too low. See the *adjust limits* section on the [[Installation|Installation page]].
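
To confirm that the open-file limit is the cause, the limits of the current shell can be inspected with the @ulimit@ builtin. This is a minimal sketch for a single session and is not the procedure from the Installation page; the function names are hypothetical.

```shell
# Print the soft and hard limits on open file descriptors
# for the current shell.
show_nofile_limits() {
    echo "soft: $(ulimit -Sn)"
    echo "hard: $(ulimit -Hn)"
}

# Raise the soft limit up to the hard limit for this session only.
# Returns non-zero if the system refuses the change.
raise_nofile_limit() {
    ulimit -Sn "$(ulimit -Hn)"
}

# Example:
# show_nofile_limits
# raise_nofile_limit && show_nofile_limits
```

The change applies only to the shell in which it is made and to processes started from it, so @compilecorp@ has to be run from the same session.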