Redmine Admin, 01/04/2017 05:22 PM

h1. Configuration

h2. Overview

A basic but up-to-date description of the configuration can be found in the INSTALL.md file of the original ÚČNK project (see https://bitbucket.org/ucnk/kontext).

This page covers know-how that is described unclearly in that documentation or is missing from it entirely.

h2. Speech corpora

A speech corpus is a written corpus enriched with audio recordings of the text. It requires some special setup.

h3. Preparing audio

* split the audio recordings into pieces corresponding to the speech segments that should each be played as one unit
* assign each segment a unique identifier (its value is referred to below as $ID)
* convert each audio segment to MP3
* store each segment in a separate file named $ID.mp3 in a flat directory $SPEECH_FILES_PATH/$CORPUS_ID

<pre>
/opt
   /data
       /speech
           /speech_corpus1
               file1.mp3
               file2.mp3
               file3.mp3
               ...
</pre>

In the example above, $SPEECH_FILES_PATH is /opt/data/speech and $CORPUS_ID is speech_corpus1.
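The naming convention above can be sketched in Python. This is only an illustration and not part of KonText: the function names segment_path and ffmpeg_command are hypothetical, and the conversion step assumes that ffmpeg is installed and that each segment has already been cut into its own WAV file.

```python
import os


def segment_path(speech_files_path, corpus_id, segment_id):
    """Return where one segment is expected: $SPEECH_FILES_PATH/$CORPUS_ID/$ID.mp3."""
    return os.path.join(speech_files_path, corpus_id, segment_id + ".mp3")


def ffmpeg_command(source_wav, target_mp3):
    """Build (but do not run) an ffmpeg call that converts one segment to MP3.

    Assumption: ffmpeg is on PATH; it picks the MP3 encoder from the .mp3 suffix.
    """
    return ["ffmpeg", "-i", source_wav, target_mp3]


if __name__ == "__main__":
    target = segment_path("/opt/data/speech", "speech_corpus1", "file1")
    print(target)
    print(" ".join(ffmpeg_command("file1.wav", target)))
```

The command is returned as a list so that it can be passed to subprocess.run without going through a shell.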

h3. Preparing corpus vertical file

* delimit the speech segments like this:

<pre>
<doc id="1">
<s id="1">
<seg soundfile="file1.mp3">
word1
word2
word3
...
</seg>
</s>
<s id="2">
<seg soundfile="file2.mp3">
word1
word2
word3
...
</seg>
</s>
...
</doc>
</pre>

* the names *seg* and *soundfile* can be chosen arbitrarily
* recompile the corpus in the standard way
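Before recompiling, it can help to check that every audio file referenced in the vertical file actually exists. The following is a minimal sketch, not a KonText tool: the helper names are hypothetical, and the pattern assumes the structure and attribute are called seg and soundfile as in the example above, with each tag on its own line.

```python
import os
import re

# Opening tag of the speech-segment structure; captures the audio file name.
# Assumption: the names "seg" and "soundfile" from the example are used.
SEG_TAG = re.compile(r'<seg\b[^>]*\bsoundfile="([^"]*)"')


def referenced_soundfiles(vertical_lines):
    """Collect soundfile values from vertical-file lines, in order of appearance."""
    found = []
    for line in vertical_lines:
        match = SEG_TAG.match(line.strip())
        if match:
            found.append(match.group(1))
    return found


def missing_soundfiles(vertical_lines, audio_dir):
    """Return the referenced audio files that are not present in audio_dir."""
    return [
        name
        for name in referenced_soundfiles(vertical_lines)
        if not os.path.isfile(os.path.join(audio_dir, name))
    ]
```

A typical call would pass an open vertical file and the directory $SPEECH_FILES_PATH/$CORPUS_ID; an empty result means that all referenced files were found.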

h3. Updating config.xml

* to <corpora> add the following elements:
** <speech_files_path>$SPEECH_FILES_PATH</speech_files_path>
** <speech_segment_struct_attr>$SPEECH_SEGMENT_STRUCT_ATTR</speech_segment_struct_attr>
* to <corplist><corpus> add the attribute:
** speech_segment="$SPEECH_SEGMENT"

* for the vertical file shown above:
** $SPEECH_SEGMENT_STRUCT_ATTR should be set to: seg
** $SPEECH_SEGMENT should be set to: seg.soundfile

The whole snippet should look like this:

<pre>
<corpora>
    <speech_files_path>/opt/data/speech</speech_files_path>
    <speech_segment_struct_attr>seg</speech_segment_struct_attr>
    <corplist title="">
        <corplist title="ÚFAL speech corpora">
            <corpus id="speech_corpus1" sentence_struct="sp" speech_segment="seg.soundfile"/>
        </corplist>
    </corplist>
</corpora>
</pre>
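To double-check the settings after editing, the snippet can be read back with Python's standard XML parser. This is a sketch under assumptions, not a description of how KonText itself reads the file: the function name speech_settings is hypothetical, and the audio directory is derived as $SPEECH_FILES_PATH/$CORPUS_ID following the layout described in the audio section.

```python
import os
import xml.etree.ElementTree as ET


def speech_settings(config_xml):
    """Return (corpus id, structure, attribute, audio dir) for each speech corpus.

    config_xml is the XML text of either the <corpora> element or a document
    that contains it.
    """
    root = ET.fromstring(config_xml)
    corpora = root if root.tag == "corpora" else root.find(".//corpora")
    if corpora is None:
        raise ValueError("no <corpora> element found")
    files_path = corpora.findtext("speech_files_path")
    if not files_path:
        raise ValueError("<speech_files_path> is missing or empty")
    result = []
    for corpus in corpora.iter("corpus"):
        segment = corpus.get("speech_segment")
        if not segment:
            continue  # not configured as a speech corpus
        # "seg.soundfile" -> structure "seg", attribute "soundfile"
        structure, _, attribute = segment.partition(".")
        audio_dir = os.path.join(files_path, corpus.get("id"))
        result.append((corpus.get("id"), structure, attribute, audio_dir))
    return result
```

For the snippet above, the only entry returned describes speech_corpus1 with structure seg, attribute soundfile, and the directory /opt/data/speech/speech_corpus1.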