Configuration¶
Overview¶
A basic but up-to-date description of the configuration can be found in the INSTALL.md file of the original ÚČNK project (see https://bitbucket.org/ucnk/kontext).
This section covers know-how that is either explained unclearly in that documentation or missing from it entirely.
Speech corpora¶
A speech corpus is a written corpus enriched with audio recordings of the text. It requires some special setup.
Preparing audio¶
- split the audio recordings into pieces corresponding to speech segments that should be played at a time
- assign each segment a unique identifier (its value is referred to below as $ID)
- convert each audio segment to MP3
- store each segment in a separate file named $ID.mp3 in a flat directory $SPEECH_FILES_PATH/$CORPUS_ID
/opt
  /data
    /speech
      /speech_corpus1
        file1.mp3
        file2.mp3
        file3.mp3
        ...
In the example above, $SPEECH_FILES_PATH is /opt/data/speech and $CORPUS_ID is speech_corpus1.
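The naming scheme above can be sketched in Python. This is only an illustration, not part of KonText: the helper names segment_path and ffmpeg_command are hypothetical, and the conversion step assumes the ffmpeg command-line tool with the libmp3lame encoder, which the instructions here do not prescribe. The sketch builds the command but does not run it.

```python
from pathlib import PurePosixPath
from typing import List


def segment_path(speech_files_path: str, corpus_id: str, segment_id: str) -> PurePosixPath:
    """Return $SPEECH_FILES_PATH/$CORPUS_ID/$ID.mp3 for one speech segment."""
    # The directory is flat, so an identifier must not contain a path separator.
    if not segment_id or "/" in segment_id:
        raise ValueError(f"invalid segment identifier: {segment_id!r}")
    return PurePosixPath(speech_files_path) / corpus_id / f"{segment_id}.mp3"


def ffmpeg_command(source: str, target: PurePosixPath) -> List[str]:
    """Build (but do not run) an ffmpeg call converting one segment to MP3."""
    # Assumption: an ffmpeg build with the libmp3lame encoder is installed.
    return ["ffmpeg", "-i", source, "-codec:a", "libmp3lame", str(target)]


target = segment_path("/opt/data/speech", "speech_corpus1", "file1")
print(target)
print(" ".join(ffmpeg_command("segments/file1.wav", target)))
```

The returned list can be passed to subprocess.run once the source recordings and the target directory exist.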
Preparing corpus vertical file¶
- delimit the speech segments like this:
<doc id="1">
<s id="1">
<seg soundfile="file1.mp3">
word1
word2
word3
...
</seg>
</s>
<s id="2">
<seg soundfile="file2.mp3">
word1
word2
word3
...
</seg>
</s>
...
</doc>
- the names seg and soundfile can be chosen arbitrarily
- recompile the corpus in the standard way
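As a rough sketch of how such markup could be generated, the following Python function wraps each segment's tokens in the speech structure. The function name speech_sentences and the sample data are hypothetical, and the sketch assumes the usual vertical layout of one token per line.

```python
from typing import Iterable, List, Tuple
from xml.sax.saxutils import quoteattr


def speech_sentences(
    segments: Iterable[Tuple[str, List[str]]],
    struct: str = "seg",
    attr: str = "soundfile",
) -> List[str]:
    """Return vertical-file lines: one <s> per segment, one token per line."""
    lines: List[str] = []
    for number, (sound_file, tokens) in enumerate(segments, start=1):
        lines.append(f'<s id="{number}">')
        # quoteattr adds the surrounding quotes and escapes special characters.
        lines.append(f"<{struct} {attr}={quoteattr(sound_file)}>")
        lines.extend(tokens)
        lines.append(f"</{struct}>")
        lines.append("</s>")
    return lines


segments = [("file1.mp3", ["word1", "word2"]), ("file2.mp3", ["word3"])]
vertical = ['<doc id="1">'] + speech_sentences(segments) + ["</doc>"]
print("\n".join(vertical))
```

The struct and attr parameters correspond to the freely chosen names; whatever is used here has to match the values entered in config.xml.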
Updating config.xml¶
- to <corpora> add the following elements:
  - <speech_files_path>$SPEECH_FILES_PATH</speech_files_path>
  - <speech_segment_struct_attr>$SPEECH_SEGMENT_STRUCT_ATTR</speech_segment_struct_attr>
- to <corplist><corpus> add the attribute:
  - speech_segment="$SPEECH_SEGMENT"
- for the vertical file shown above:
  - set $SPEECH_SEGMENT_STRUCT_ATTR to: seg
  - set $SPEECH_SEGMENT to: seg.soundfile
The whole snippet should look like:
<corpora>
  <speech_files_path>/opt/data/speech</speech_files_path>
  <speech_segment_struct_attr>seg</speech_segment_struct_attr>
  <corplist title="">
    <corplist title="ÚFAL speech corpora">
      <corpus id="speech_corpus1" sentence_struct="sp" speech_segment="seg.soundfile"/>
    </corplist>
  </corplist>
</corpora>
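A configuration like this can be sanity-checked with a short script before KonText is restarted. The Python sketch below is an illustration only: the function name speech_dirs is hypothetical, it treats <corpora> as the root element (in a real config.xml it is nested inside a larger document), and the consistency rule it enforces, namely that speech_segment starts with the structure named in speech_segment_struct_attr, is inferred from the example rather than taken from KonText itself.

```python
import xml.etree.ElementTree as ET
from typing import Dict

CONFIG = """
<corpora>
  <speech_files_path>/opt/data/speech</speech_files_path>
  <speech_segment_struct_attr>seg</speech_segment_struct_attr>
  <corplist title="">
    <corplist title="ÚFAL speech corpora">
      <corpus id="speech_corpus1" sentence_struct="sp" speech_segment="seg.soundfile"/>
    </corplist>
  </corplist>
</corpora>
"""


def speech_dirs(config_xml: str) -> Dict[str, str]:
    """Map each speech corpus id to the directory expected to hold its MP3 files."""
    root = ET.fromstring(config_xml)
    base = root.findtext("speech_files_path")
    struct = root.findtext("speech_segment_struct_attr")
    if base is None or struct is None:
        raise ValueError("missing <speech_files_path> or <speech_segment_struct_attr>")
    result: Dict[str, str] = {}
    for corpus in root.iter("corpus"):
        segment = corpus.get("speech_segment")
        if segment is None:
            continue  # not configured as a speech corpus
        corpus_id = corpus.get("id")
        # speech_segment has the form "structure.attribute", e.g. seg.soundfile
        if segment.split(".")[0] != struct:
            raise ValueError(f"{corpus_id}: {segment!r} does not use <{struct}>")
        result[corpus_id] = f"{base.rstrip('/')}/{corpus_id}"
    return result


print(speech_dirs(CONFIG))
```

Each directory in the result should exist and contain the files named in the soundfile attributes of the vertical file.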
Updated by Redmine Admin almost 9 years ago · 1 revisions