Configuration

Overview

A basic but up-to-date description of the configuration can be found in the INSTALL.md file of the original ÚČNK project (see https://bitbucket.org/ucnk/kontext).

This section covers know-how that is unclear in, or entirely missing from, that documentation.

Speech corpora

A speech corpus is a written corpus enriched with audio recordings of the text. It requires some special setup.

Preparing audio

  • split the audio recordings into pieces corresponding to the speech segments that should be played one at a time
  • assign each segment a unique identifier (its value is referred to below as $ID)
  • convert each audio segment to MP3
  • store each segment in a separate file named $ID.mp3 in a flat directory $SPEECH_FILES_PATH/$CORPUS_ID
/opt
   /data
       /speech
           /speech_corpus1
               file1.mp3
               file2.mp3
               file3.mp3
               ...

In the example above, $SPEECH_FILES_PATH is /opt/data/speech and $CORPUS_ID is speech_corpus1.
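
The directory layout can be scripted. The following is a minimal sketch, not part of KonText: it assumes the ffmpeg command-line tool is used for cutting and MP3 encoding, and the helper names (segment_path, build_convert_command), the source file recording.wav and the time offsets are hypothetical.

```python
from pathlib import PurePosixPath


def segment_path(speech_files_path, corpus_id, segment_id):
    """Return the flat-directory location $SPEECH_FILES_PATH/$CORPUS_ID/$ID.mp3."""
    return PurePosixPath(speech_files_path) / corpus_id / f"{segment_id}.mp3"


def build_convert_command(source, start, end, target):
    """Build an ffmpeg call that cuts the span start..end (in seconds) out of
    source and writes it as MP3. ffmpeg is an assumption; any encoder works."""
    return [
        "ffmpeg", "-y",
        "-i", str(source),
        "-ss", str(start),
        "-to", str(end),
        str(target),
    ]


target = segment_path("/opt/data/speech", "speech_corpus1", "file1")
command = build_convert_command("recording.wav", 0.0, 4.2, target)
print(target)
print(" ".join(command))
# To actually convert: subprocess.run(command, check=True)
```

The command is only built here, not executed, so the sketch can be checked without ffmpeg installed.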

Preparing corpus vertical file

  • delimit the speech segments like this:
<doc id="1">
<s id="1">
<seg soundfile="file1.mp3">
word1
word2
word3
...
</seg>
</s>
<s id="2">
<seg soundfile="file2.mp3">
word1
word2
word3
...
</seg>
</s>
...
</doc>
  • the names seg and soundfile can be chosen arbitrarily
  • recompile the corpus in the standard way
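
The vertical fragment above can be produced by a short script. This is a sketch under assumptions: the function vertical_document and its parameters are hypothetical, each sentence is assumed to contain exactly one speech segment (as in the example), and tokens are written verbatim without escaping.

```python
def vertical_document(doc_id, segments, struct="seg", attr="soundfile"):
    """Render one <doc> whose sentences each wrap a single speech segment.

    segments is a list of (segment_id, tokens) pairs; the audio file name
    is derived as "<segment_id>.mp3".
    """
    lines = [f'<doc id="{doc_id}">']
    for number, (segment_id, tokens) in enumerate(segments, start=1):
        lines.append(f'<s id="{number}">')
        lines.append(f'<{struct} {attr}="{segment_id}.mp3">')
        lines.extend(tokens)  # one token per line
        lines.append(f"</{struct}>")
        lines.append("</s>")
    lines.append("</doc>")
    return "\n".join(lines)


vertical = vertical_document("1", [
    ("file1", ["word1", "word2", "word3"]),
    ("file2", ["word1", "word2", "word3"]),
])
print(vertical)
```

Passing different struct and attr values changes the structure and attribute names, which then have to match the values used in config.xml.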

Updating config.xml

  • to <corpora> add the following elements:
    • <speech_files_path>$SPEECH_FILES_PATH</speech_files_path>
    • <speech_segment_struct_attr>$SPEECH_SEGMENT_STRUCT_ATTR</speech_segment_struct_attr>
  • to <corplist><corpus> add the attribute:
    • speech_segment="$SPEECH_SEGMENT"
  • for the vertical file shown above:
    • $SPEECH_SEGMENT_STRUCT_ATTR should be set to: seg
    • $SPEECH_SEGMENT should be set to: seg.soundfile

The complete snippet should look like this:

<corpora>
    <speech_files_path>/opt/data/speech</speech_files_path>
    <speech_segment_struct_attr>seg</speech_segment_struct_attr>
    <corplist title="">         
        <corplist title="ÚFAL speech corpora">
            <corpus id="speech_corpus1" sentence_struct="sp" speech_segment="seg.soundfile"/>
        </corplist>
    </corplist>
</corpora>
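
To check that the settings fit together, the snippet can be read back and the expected audio directory derived for each speech-enabled corpus. This is only an illustrative sketch, not KonText's own loading code: the function speech_settings is hypothetical, the fragment is parsed as a standalone document, and the directory rule $SPEECH_FILES_PATH/$CORPUS_ID is taken from the "Preparing audio" section.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

CONFIG = """
<corpora>
    <speech_files_path>/opt/data/speech</speech_files_path>
    <speech_segment_struct_attr>seg</speech_segment_struct_attr>
    <corplist title="">
        <corplist title="ÚFAL speech corpora">
            <corpus id="speech_corpus1" sentence_struct="sp" speech_segment="seg.soundfile"/>
        </corplist>
    </corplist>
</corpora>
"""


def speech_settings(config_xml):
    """Map each speech-enabled corpus id to (audio directory, segment attribute)."""
    root = ET.fromstring(config_xml)
    base = PurePosixPath(root.findtext("speech_files_path").strip())
    settings = {}
    for corpus in root.iter("corpus"):
        segment = corpus.get("speech_segment")
        if segment:  # corpora without the attribute have no audio
            settings[corpus.get("id")] = (base / corpus.get("id"), segment)
    return settings


settings = speech_settings(CONFIG)
audio_dir, segment_attr = settings["speech_corpus1"]
print(audio_dir / "file1.mp3", segment_attr)
```

The printed path is where the audio for a segment with soundfile="file1.mp3" is expected to be stored.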

Updated by Redmine Admin over 7 years ago · 1 revisions