Configuration

Overview

A basic but up-to-date description of the configuration can be found in the INSTALL.md file of the original ÚČNK project (see https://bitbucket.org/ucnk/kontext).

This section covers know-how that is unclear in, or completely missing from, that documentation.

Speech corpora

A speech corpus is a written corpus enriched with audio recordings of the text. It requires some special setup.

Preparing audio

- Split the audio recordings into pieces, one per speech segment that should be played back as a unit.
- Assign each segment a unique identifier (its value is referred to below as $ID).
- Convert each audio segment to MP3.
- Store each segment in a separate file named $ID.mp3 in a flat directory $SPEECH_FILES_PATH/$CORPUS_ID, for example:

/opt
   /data
       /speech
           /speech_corpus1
               file1.mp3
               file2.mp3
               file3.mp3
               ...

In the example above, $SPEECH_FILES_PATH is /opt/data/speech and $CORPUS_ID is speech_corpus1.
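The steps above can be scripted. The following is a minimal sketch, assuming that ffmpeg with the libmp3lame encoder is installed and that the segment boundaries (in seconds) are already known. The helpers `segment_path` and `ffmpeg_command`, the input name recording.wav and the time stamps are hypothetical and not part of KonText.

```python
import shlex
from pathlib import Path


def segment_path(speech_files_path, corpus_id, segment_id):
    """Return $SPEECH_FILES_PATH/$CORPUS_ID/$ID.mp3 for one segment."""
    return Path(speech_files_path) / corpus_id / f"{segment_id}.mp3"


def ffmpeg_command(source, start, end, target):
    """Build an ffmpeg call that cuts start..end (seconds) and encodes it as MP3.

    Assumption: ffmpeg is on PATH and was built with libmp3lame.
    """
    return [
        "ffmpeg", "-y",
        "-i", str(source),
        "-ss", str(start),
        "-to", str(end),
        "-codec:a", "libmp3lame",
        str(target),
    ]


# Hypothetical segment list: (identifier, start, end) for one recording.
segments = [("file1", 0.0, 4.2), ("file2", 4.2, 9.8)]
for seg_id, start, end in segments:
    target = segment_path("/opt/data/speech", "speech_corpus1", seg_id)
    # Only print the command here; pass the list to subprocess.run() to execute it.
    print(shlex.join(ffmpeg_command("recording.wav", start, end, target)))
```

Printing the commands first makes it easy to review the target file names before any audio is written.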

Preparing corpus vertical file

Delimit the speech segments like this:

<doc id="1">
<s id="1">
<seg soundfile="file1.mp3">
word1
word2
word3
...
</seg>
</s>
<s id="2">
<seg soundfile="file2.mp3">
word1
word2
word3
...
</seg>
</s>
...
</doc>

The names seg and soundfile can be chosen arbitrarily. Afterwards, recompile the corpus in the standard way.
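If the segments are already available as lists of tokens, the markup above can be generated rather than written by hand. This is only a sketch: the function `vertical_document` is a hypothetical helper, it writes one token per line with no further positional attributes, and it does not escape tokens that themselves start with "<".

```python
from xml.sax.saxutils import quoteattr


def vertical_document(doc_id, segments, struct="seg", attr="soundfile"):
    """Render one <doc> in vertical format, one token per line.

    `segments` is a list of (sound_file_name, tokens) pairs. Each segment is
    wrapped in its own <s> element, mirroring the example above.
    """
    lines = [f"<doc id={quoteattr(str(doc_id))}>"]
    for number, (sound_file, tokens) in enumerate(segments, start=1):
        lines.append(f'<s id="{number}">')
        lines.append(f"<{struct} {attr}={quoteattr(sound_file)}>")
        lines.extend(tokens)
        lines.append(f"</{struct}>")
        lines.append("</s>")
    lines.append("</doc>")
    return "\n".join(lines) + "\n"


print(vertical_document(1, [("file1.mp3", ["word1", "word2"]),
                            ("file2.mp3", ["word3"])]), end="")
```

The `struct` and `attr` parameters default to seg and soundfile; if other names are chosen, the same names have to be used in config.xml.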

Updating config.xml

To <corpora> add the following elements:
- <speech_files_path>$SPEECH_FILES_PATH</speech_files_path>
- <speech_segment_struct_attr>$SPEECH_SEGMENT_STRUCT_ATTR</speech_segment_struct_attr>
To <corplist><corpus> add the attribute:
- speech_segment="$SPEECH_SEGMENT"

For the vertical file shown above:
- $SPEECH_SEGMENT_STRUCT_ATTR should be set to: seg
- $SPEECH_SEGMENT should be set to: seg.soundfile

The whole snippet should look like:

<corpora>
    <speech_files_path>/opt/data/speech</speech_files_path>
    <speech_segment_struct_attr>seg</speech_segment_struct_attr>
    <corplist title="">         
        <corplist title="ÚFAL speech corpora">
            <corpus id="speech_corpus1" sentence_struct="sp" speech_segment="seg.soundfile"/>
        </corplist>
    </corplist>
</corpora>
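Before restarting the application it can help to check that the configured values point to the intended audio directory. The sketch below parses the snippet with Python's standard library and derives $SPEECH_FILES_PATH/$CORPUS_ID for every corpus that carries a speech_segment attribute. The function `speech_corpora` is a hypothetical helper for such a check and does not reflect how KonText itself reads config.xml.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

CONFIG_SNIPPET = """
<corpora>
    <speech_files_path>/opt/data/speech</speech_files_path>
    <speech_segment_struct_attr>seg</speech_segment_struct_attr>
    <corplist title="">
        <corplist title="ÚFAL speech corpora">
            <corpus id="speech_corpus1" sentence_struct="sp" speech_segment="seg.soundfile"/>
        </corplist>
    </corplist>
</corpora>
"""


def speech_corpora(corpora_xml):
    """Map each corpus id with a speech_segment attribute to its audio directory."""
    root = ET.fromstring(corpora_xml)
    base = root.findtext("speech_files_path")
    if base is None:
        raise ValueError("<speech_files_path> is missing from <corpora>")
    return {
        corpus.get("id"): PurePosixPath(base.strip()) / corpus.get("id")
        for corpus in root.iter("corpus")  # also finds nested <corplist> entries
        if corpus.get("speech_segment")
    }


for corpus_id, directory in speech_corpora(CONFIG_SNIPPET).items():
    print(corpus_id, directory)
```

Each directory reported this way should contain the MP3 files named in the soundfile attributes of the vertical file.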

Updated by Redmine Admin over 8 years ago · 1 revisions