Corpora conversion¶

Overview¶

Corpora are converted in the development environment only, using the following tools:

Makefile
bash tools (awk, sed)
treex
python

Cluster setup¶

A detailed description of the cluster is available at https://wiki.ufal.ms.mff.cuni.cz/grid

SGE (Sun Grid Engine) setup¶

Set up the SGE environment as described at https://wiki.ufal.ms.mff.cuni.cz/zakladni-nastaveni-sge

Disk space¶

For each conversion, create a separate directory in /net/cluster/TMP (automounted) or /net/cluster/SSD (automounted), as there is not enough space elsewhere for large corpora.
Be sure to delete all files from that directory once the conversion is finished.
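
Before copying a large corpus it can help to confirm that the target file system has room for it. The following is a minimal Python sketch, not part of the original setup: the helper name has_free_space is hypothetical, and the required size has to be estimated by hand for the corpus in question.

```python
import shutil


def has_free_space(path, required_bytes):
    """Return True if the file system holding `path` has at least
    `required_bytes` bytes free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free >= required_bytes
```

For example, has_free_space("/net/cluster/TMP", 50 * 1024**3) would check for roughly 50 GiB; that figure is only an illustration.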

Treex¶

Install perlbrew on your local computer (your $HOME will be available on the cluster as well) as described at https://wiki.ufal.ms.mff.cuni.cz/perlbrew
Install Treex from SVN as described at http://ufal.mff.cuni.cz/treex/install.html. Note that most of the prerequisites are already provided by perlbrew.

Conversion process¶

# SSH to the cluster:
ssh lrc1
# Use screen
screen
# Create directory in /net/cluster/TMP e.g.:
mkdir -p /net/cluster/TMP/$USER/czeng_1.0
# Change the directory: 
cd /net/cluster/TMP/$USER/czeng_1.0
# Create output directory:
mkdir output
# Copy the input data files to the cluster:
scp -r remote_server:/some/remote/dir/czeng_1.0/input ./
# Copy the scripts to the cluster:
scp -r remote_server:/some/remote/dir/czeng_1.0/scripts ./
# Run the conversion:
cd scripts
make
# Wait until the job finishes
# ...
# Copy the output to the remote server
scp -r /net/cluster/TMP/$USER/czeng_1.0/output remote_server:/some/remote/dir/czeng_1.0/
# Clean up
rm -r /net/cluster/TMP/$USER/czeng_1.0
# Leave screen
exit
# Logout
exit

PDT to Manatee¶

The PDT conversion is implemented in Perl as a Treex block (Treex::Block::Write::Manatee). The block converts each PDT document into a <doc> structure in a vertical file for Manatee.

The following attributes are included in the output:

word
lemma
POS positional tags (15 characters)
afun

The structure of the output document is as follows:

<doc id="xyz">
<s id="1">
token    lemma    POS-tag    afun
token    lemma    POS-tag    afun
token    lemma    POS-tag    afun
...
</s>
<s id="2">
token    lemma    POS-tag    afun
token    lemma    POS-tag    afun
token    lemma    POS-tag    afun
...
</s>
...
</doc>
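
To make the layout concrete, here is a minimal Python sketch that renders one document in this structure. It is not the Treex block itself, only an illustration: the function name to_vertical is hypothetical, and the sketch assumes that the four columns are separated by tabs and that sentences are numbered from 1 within each document.

```python
from xml.sax.saxutils import quoteattr


def to_vertical(doc_id, sentences):
    """Render one document as vertical text.

    `sentences` is a list of sentences; each sentence is a list of
    (word, lemma, tag, afun) tuples of strings.
    """
    lines = ["<doc id=%s>" % quoteattr(doc_id)]  # quoteattr adds the quotes
    for number, tokens in enumerate(sentences, start=1):
        lines.append('<s id="%d">' % number)
        for token in tokens:
            # One token per line; columns joined by tabs (an assumption).
            lines.append("\t".join(token))
        lines.append("</s>")
    lines.append("</doc>")
    return "\n".join(lines) + "\n"
```

Calling to_vertical("xyz", sentences) returns the whole <doc> element as a single string that can be appended to a corpus file.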

A Python script was developed to extract metadata from document IDs. The structure of the IDs is described at http://ufal.mff.cuni.cz/pdt2.0/doc/pdt-guide/en/html/ch03.html,
but that description is incorrect and incomplete. The corrected structure is as follows:

Prefixes:

Code structure	Name	Comments
lnYYNNN	Lidové noviny	YY - last two digits of the year, NNN - issue number (day of the year excluding Sundays and public holidays)
lndYYNNN	Lidové noviny	YY - last two digits of the year, NNN - issue number (day of the year excluding Sundays and public holidays)
mfYYMMDD	Mladá fronta Dnes	YY - last two digits of the year, MM - month, DD - day of the month
cmprYYNN	Českomoravský Profit	YY - last two digits of the year, NN - issue number (week)
vesmYYNN	Vesmír	YY - last two digits of the year, NN - issue number (month)
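
The table can be turned into a small parser. The sketch below is not the original script; it only illustrates how the prefixes might be matched. The function name parse_doc_id, the returned dictionary keys, and the assumption that anything following the numeric part of the ID can be ignored are all hypothetical.

```python
import re

# One entry per prefix from the table: (prefix, publication, pattern for the
# digits that follow the prefix). Years are kept as two-digit numbers.
PATTERNS = [
    ("ln", "Lidové noviny", r"(?P<year>\d{2})(?P<issue>\d{3})"),
    ("lnd", "Lidové noviny", r"(?P<year>\d{2})(?P<issue>\d{3})"),
    ("mf", "Mladá fronta Dnes", r"(?P<year>\d{2})(?P<month>\d{2})(?P<day>\d{2})"),
    ("cmpr", "Českomoravský Profit", r"(?P<year>\d{2})(?P<issue>\d{2})"),
    ("vesm", "Vesmír", r"(?P<year>\d{2})(?P<issue>\d{2})"),
]


def parse_doc_id(doc_id):
    """Return a dict of metadata for a document ID, or None if no prefix matches."""
    for prefix, source, digits in PATTERNS:
        # The prefix must be followed directly by digits, so "ln" cannot
        # accidentally match an "lnd" identifier.
        match = re.match(prefix + digits, doc_id)
        if match:
            fields = {"prefix": prefix, "source": source}
            fields.update((key, int(value)) for key, value in match.groupdict().items())
            return fields
    return None
```

With a made-up identifier such as "mf920922_001", the function yields the source name together with year 92, month 9 and day 22.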

A shell script was created to convert the whole set of PDT documents into a single corpus.
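
The shell script itself is not reproduced here. As a rough illustration of the final step only, the following Python sketch concatenates per-document vertical files into one corpus file. It assumes that each document has already been written to its own file, that these files share the extension .vert, and that the output file lives outside the input directory; the function name merge_vertical_files is hypothetical.

```python
import shutil
from pathlib import Path


def merge_vertical_files(input_dir, output_path, pattern="*.vert"):
    """Concatenate all files matching `pattern` in `input_dir` into
    `output_path`, in sorted order. Return the number of files merged."""
    files = sorted(Path(input_dir).glob(pattern))
    with open(output_path, "wb") as corpus:
        for path in files:
            with open(path, "rb") as part:
                # Copy raw bytes so the encoding is left untouched.
                shutil.copyfileobj(part, corpus)
    return len(files)
```

Sorting the file names keeps the document order in the corpus reproducible between runs.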

Treex to Manatee¶

The conversion of Treex files is implemented in Perl as a Treex block (Treex::Block::Write::Manatee). The block converts each Treex document into a <doc> structure in a vertical file for Manatee.

Updated by Redmine Admin over 8 years ago · 1 revisions