Configuration » History » Version 1
Redmine Admin, 01/04/2017 05:22 PM
1 | 1 | Redmine Admin | h1. Configuration |
---|---|---|---|
2 | |||
3 | h2. Overview |
||
4 | |||
5 | A basic but up-to-date description of the configuration can be found in the INSTALL.md file of the original ÚČNK project (see https://bitbucket.org/ucnk/kontext). |
||
6 | |||
7 | This section covers know-how that is described unclearly in that documentation or is missing from it entirely. |
||
8 | |||
9 | h2. Speech corpora |
||
10 | |||
11 | A speech corpus is a written corpus enriched with audio recordings of the text. It requires some special setup. |
||
12 | |||
13 | h3. Preparing audio |
||
14 | |||
15 | * split the audio recordings into pieces, one for each speech segment that should be played back as a unit |
||
16 | * assign each segment a unique identifier (its value is referred to below as $ID) |
||
17 | * convert each audio segment to MP3 |
||
18 | * store each segment in a separate file named $ID.mp3 in a flat directory $SPEECH_FILES_PATH/$CORPUS_ID |
||
19 | |||
20 | <pre> |
||
21 | /opt |
||
22 | /data |
||
23 | /speech |
||
24 | /speech_corpus1 |
||
25 | file1.mp3 |
||
26 | file2.mp3 |
||
27 | file3.mp3 |
||
28 | ... |
||
29 | </pre> |
||
30 | |||
31 | In the example above, $SPEECH_FILES_PATH is /opt/data/speech and $CORPUS_ID is speech_corpus1. |
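The naming rule above can be sketched in Python. This is only an illustration: the helper names, the source file @recording.wav@ and the segment boundaries are hypothetical, and using ffmpeg for cutting and encoding is an assumption (any tool that produces MP3 files will do). The sketch builds the commands and prints them; it does not run them.

```python
from pathlib import PurePosixPath
from typing import List


def segment_path(speech_files_path: str, corpus_id: str, segment_id: str) -> PurePosixPath:
    """Return $SPEECH_FILES_PATH/$CORPUS_ID/$ID.mp3 for one segment."""
    return PurePosixPath(speech_files_path) / corpus_id / f"{segment_id}.mp3"


def ffmpeg_cut_command(source: str, start: float, end: float,
                       target: PurePosixPath) -> List[str]:
    """Build (but do not run) an ffmpeg call that cuts start..end seconds.

    Assumption: ffmpeg is installed; it picks the MP3 encoder from the
    .mp3 extension of the target file.
    """
    return ["ffmpeg", "-y", "-i", source,
            "-ss", str(start), "-to", str(end), str(target)]


# Hypothetical segment boundaries in seconds: (ID, start, end).
segments = [("file1", 0.0, 2.5), ("file2", 2.5, 6.0)]

for seg_id, start, end in segments:
    target = segment_path("/opt/data/speech", "speech_corpus1", seg_id)
    print(" ".join(ffmpeg_cut_command("recording.wav", start, end, target)))
```

Keeping the path construction in one helper makes it easy to check that every file ends up in the flat $SPEECH_FILES_PATH/$CORPUS_ID directory.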
||
32 | |||
33 | h3. Preparing corpus vertical file |
||
34 | |||
35 | * delimit the speech segments like this: |
||
36 | |||
37 | <pre> |
||
38 | <doc id="1"> |
||
39 | <s id="1"> |
||
40 | <seg soundfile="file1.mp3"> |
||
41 | word1 |
||
42 | word2 |
||
43 | word3 |
||
44 | ... |
||
45 | </seg> |
||
46 | </s> |
||
47 | <s id="2"> |
||
48 | <seg soundfile="file2.mp3"> |
||
49 | word1 |
||
50 | word2 |
||
51 | word3 |
||
52 | ... |
||
53 | </seg> |
||
54 | </s> |
||
55 | ... |
||
56 | </doc> |
||
57 | </pre> |
||
58 | |||
59 | * the names *seg* and *soundfile* can be chosen arbitrarily; once the segments are delimited, recompile the corpus in the standard way. |
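Before recompiling, it can be useful to list the audio files that the vertical file refers to. The following is a minimal sketch, assuming the structure and attribute are named *seg* and *soundfile* as in the example; the function name and the sample text are hypothetical.

```python
import re
from typing import List


def referenced_sound_files(vertical_text: str, struct: str = "seg",
                           attr: str = "soundfile") -> List[str]:
    """Collect the value of `attr` from every opening <struct ...> tag."""
    pattern = re.compile(
        r'<%s\b[^>]*\b%s="([^"]*)"' % (re.escape(struct), re.escape(attr))
    )
    return pattern.findall(vertical_text)


# Shortened sample in the same shape as the vertical file above.
sample = """<doc id="1">
<s id="1">
<seg soundfile="file1.mp3">
word1
</seg>
</s>
<s id="2">
<seg soundfile="file2.mp3">
word2
</seg>
</s>
</doc>
"""

print(referenced_sound_files(sample))  # ['file1.mp3', 'file2.mp3']
```

Comparing the returned names with the contents of $SPEECH_FILES_PATH/$CORPUS_ID shows whether any recording is missing or misnamed.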
||
60 | |||
61 | h3. Updating config.xml |
||
62 | |||
63 | * to <corpora> add the following elements: |
||
64 | ** <speech_files_path>$SPEECH_FILES_PATH</speech_files_path> |
||
65 | ** <speech_segment_struct_attr>$SPEECH_SEGMENT_STRUCT_ATTR</speech_segment_struct_attr> |
||
66 | * to <corplist><corpus> add the attribute: |
||
67 | ** speech_segment="$SPEECH_SEGMENT" |
||
68 | |||
69 | * for the vertical file shown above: |
||
70 | ** $SPEECH_SEGMENT_STRUCT_ATTR should be set to: seg |
||
71 | ** $SPEECH_SEGMENT should be set to: seg.soundfile |
||
72 | |||
73 | The complete snippet should look like this: |
||
74 | |||
75 | <pre> |
||
76 | <corpora> |
||
77 | <speech_files_path>/opt/data/speech</speech_files_path> |
||
78 | <speech_segment_struct_attr>seg</speech_segment_struct_attr> |
||
79 | <corplist title=""> |
||
80 | <corplist title="ÚFAL speech corpora"> |
||
81 | <corpus id="speech_corpus1" sentence_struct="sp" speech_segment="seg.soundfile"/> |
||
82 | </corplist> |
||
83 | </corplist> |
||
84 | </corpora> |
||
85 | </pre> |
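To double-check the result, the settings can be read back with Python's standard @xml.etree.ElementTree@ module. This is only a sketch: it parses the <corpora> fragment on its own rather than a complete config.xml, the function name is hypothetical, and the audio directory is derived as $SPEECH_FILES_PATH/$CORPUS_ID as described in the audio preparation step.

```python
import xml.etree.ElementTree as ET

# The <corpora> fragment from the example, embedded for illustration.
CONFIG = """
<corpora>
  <speech_files_path>/opt/data/speech</speech_files_path>
  <speech_segment_struct_attr>seg</speech_segment_struct_attr>
  <corplist title="">
    <corplist title="ÚFAL speech corpora">
      <corpus id="speech_corpus1" sentence_struct="sp" speech_segment="seg.soundfile"/>
    </corplist>
  </corplist>
</corpora>
"""


def speech_corpora(corpora_xml: str):
    """Yield (corpus id, audio directory, speech_segment) for speech corpora."""
    root = ET.fromstring(corpora_xml.strip())
    base = root.findtext("speech_files_path")
    # iter() walks nested <corplist> elements at any depth.
    for corpus in root.iter("corpus"):
        speech_segment = corpus.get("speech_segment")
        if speech_segment:  # corpora without the attribute are skipped
            corpus_id = corpus.get("id")
            yield corpus_id, f"{base}/{corpus_id}", speech_segment


for entry in speech_corpora(CONFIG):
    print(entry)
```

Each printed tuple names a corpus, the directory where its MP3 files are expected, and the structural attribute that holds the file name.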