Configuration¶
Overview¶
A basic but up-to-date description of the configuration can be found in the INSTALL.md file of the original ÚČNK project (see https://bitbucket.org/ucnk/kontext).
This section covers know-how that is unclear in, or missing from, that documentation.
Speech corpora¶
A speech corpus is a written corpus enriched with audio recordings of the text. It requires some special setup.
Preparing audio¶
- split the audio recordings into pieces, one for each speech segment that should be played at a time
- assign each segment a unique identifier (its value is referred to below as $ID)
- convert each audio segment to mp3
- store each segment in a separate file named $ID.mp3 in a flat directory $SPEECH_FILES_PATH/$CORPUS_ID
/opt
  /data
    /speech
      /speech_corpus1
        file1.mp3
        file2.mp3
        file3.mp3
        ...
In the example above, $SPEECH_FILES_PATH is /opt/data/speech and $CORPUS_ID is speech_corpus1.
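The layout described above can be sketched in Python. This is only an illustration and not part of KonText: the helper names `segment_path` and `ffmpeg_cut_command` are hypothetical, the input file `recording.wav` and the time offsets are made up, and using ffmpeg (with its `-ss`, `-to` and `-codec:a libmp3lame` options) to cut and encode the segments is an assumption. Any tool that produces mp3 files would do.

```python
from pathlib import PurePosixPath


def segment_path(speech_files_path, corpus_id, segment_id):
    """Return the expected file location: $SPEECH_FILES_PATH/$CORPUS_ID/$ID.mp3."""
    return PurePosixPath(speech_files_path) / corpus_id / f"{segment_id}.mp3"


def ffmpeg_cut_command(source, start, end, target):
    """Build (but do not run) an ffmpeg command that cuts the span
    from `start` to `end` seconds out of `source` and encodes it as mp3."""
    return [
        "ffmpeg",
        "-i", str(source),         # input recording (hypothetical file)
        "-ss", str(start),         # segment start in seconds
        "-to", str(end),           # segment end in seconds
        "-codec:a", "libmp3lame",  # encode audio as mp3
        str(target),
    ]


target = segment_path("/opt/data/speech", "speech_corpus1", "file1")
command = ffmpeg_cut_command("recording.wav", 0.0, 4.2, target)
print(target)
print(" ".join(command))
```

The command is returned as a list so that it can be passed to `subprocess.run` unchanged once the paths have been checked.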
Preparing corpus vertical file¶
- delimit the speech segments like this:
<doc id="1">
<s id="1">
<seg soundfile="file1.mp3">
word1
word2
word3
...
</seg>
</s>
<s id="2">
<seg soundfile="file2.mp3">
word1
word2
word3
...
</seg>
</s>
...
</doc>
- the names seg and soundfile can be chosen arbitrarily
- recompile the corpus in the standard way
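The markup described above can also be generated by a script. The following is a minimal sketch, not a KonText tool: `render_vertical` is a hypothetical helper, it assumes one token per line (the usual vertical layout) and one speech segment per sentence as in the example, and it does not escape special characters in attribute values.

```python
def render_vertical(doc_id, segments, struct="seg", attr="soundfile"):
    """Render one document in vertical format.

    `segments` is a list of (soundfile, tokens) pairs; `struct` and `attr`
    are the freely chosen structure and attribute names.
    """
    lines = [f'<doc id="{doc_id}">']
    for number, (soundfile, tokens) in enumerate(segments, start=1):
        lines.append(f'<s id="{number}">')
        lines.append(f'<{struct} {attr}="{soundfile}">')
        lines.extend(tokens)  # one token per line
        lines.append(f"</{struct}>")
        lines.append("</s>")
    lines.append("</doc>")
    return "\n".join(lines)


text = render_vertical(
    1,
    [
        ("file1.mp3", ["word1", "word2", "word3"]),
        ("file2.mp3", ["word1", "word2"]),
    ],
)
print(text)
```

Whatever names are passed as `struct` and `attr` must then be used consistently in config.xml.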
Updating config.xml¶
- to <corpora> add the following elements:
  - <speech_files_path>$SPEECH_FILES_PATH</speech_files_path>
  - <speech_segment_struct_attr>$SPEECH_SEGMENT_STRUCT_ATTR</speech_segment_struct_attr>
- to <corplist><corpus> add the attribute:
  - speech_segment="$SPEECH_SEGMENT"
- for the vertical file shown above:
  - $SPEECH_SEGMENT_STRUCT_ATTR should be set to: seg
  - $SPEECH_SEGMENT should be set to: seg.soundfile
The whole snippet should look like:
<corpora>
  <speech_files_path>/opt/data/speech</speech_files_path>
  <speech_segment_struct_attr>seg</speech_segment_struct_attr>
  <corplist title="">
    <corplist title="ÚFAL speech corpora">
      <corpus id="speech_corpus1" sentence_struct="sp" speech_segment="seg.soundfile"/>
    </corplist>
  </corplist>
</corpora>
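A quick sanity check of such a snippet can be scripted with Python's standard `xml.etree.ElementTree` module. This is a hedged sketch rather than anything KonText itself does: `speech_corpora` is a hypothetical helper, and the rule that the structure part of speech_segment must equal speech_segment_struct_attr is inferred from the example values above, not from the KonText sources.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

CONFIG = """\
<corpora>
  <speech_files_path>/opt/data/speech</speech_files_path>
  <speech_segment_struct_attr>seg</speech_segment_struct_attr>
  <corplist title="">
    <corplist title="ÚFAL speech corpora">
      <corpus id="speech_corpus1" sentence_struct="sp" speech_segment="seg.soundfile"/>
    </corplist>
  </corplist>
</corpora>
"""


def speech_corpora(config_xml):
    """Yield (corpus id, audio directory, structure, attribute) per speech corpus."""
    root = ET.fromstring(config_xml)
    base = PurePosixPath(root.findtext("speech_files_path").strip())
    struct_name = root.findtext("speech_segment_struct_attr").strip()
    for corpus in root.iter("corpus"):
        segment = corpus.get("speech_segment")
        if segment is None:
            continue  # corpus without audio
        struct, _, attr = segment.partition(".")
        if struct != struct_name:  # assumed consistency rule
            raise ValueError(f"{corpus.get('id')}: {struct!r} != {struct_name!r}")
        yield corpus.get("id"), base / corpus.get("id"), struct, attr


for corpus_id, directory, struct, attr in speech_corpora(CONFIG):
    print(corpus_id, directory, struct, attr)
```

The directory reported for each corpus is where the $ID.mp3 files are expected, so the same loop could be extended to verify that the files exist.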
Updated by Redmine Admin almost 8 years ago · 1 revisions