Corpora conversion¶
Overview¶
Conversion of corpora is performed on development environment only and realized by the following tools:
- Makefile
- bash tools (awk, sed)
- treex
- python
Cluster setup¶
See detailed description of the cluster here: https://wiki.ufal.ms.mff.cuni.cz/grid
SGE (Sun Cluster Engine) setup¶
Set up the SGE environment as written here https://wiki.ufal.ms.mff.cuni.cz/zakladni-nastaveni-sge
Disk space¶
For any conversion create separate directory in /net/cluster/TMP (automounted) or /net/cluster/SSD (automounted) as there is not enough space elsewhere for large corpora.
Be sure to clean up any files when the conversion is finished.
Treex¶
Install perlbrew on your local computer (your $HOME will be available on the cluster as well) as described here: https://wiki.ufal.ms.mff.cuni.cz/perlbrew
Install Treex from SVN as described here: http://ufal.mff.cuni.cz/treex/install.html, please note that most of the prerequisites is already provided by the perlbrew.
Conversion process¶
# SSH to the cluster: ssh lrc1 # Use screen screen # Create directory in /net/cluster/TMP e.g.: mkdir -p /net/cluster/TMP/$USER/czeng_1.0 # Change the directory: cd /net/cluster/TMP/$USER/czeng_1.0 # Create output directory: mkdir output # Copy the input data files to the cluster: scp -r remote_server:/some/remote/dir/czeng_1.0/input ./ # Copy the scripts to the cluster: scp -r remote_server:/some/remote/dir/czeng_1.0/scripts ./ # Run the conversion: cd scripts make # Wait until the job finishes # ... # Copy the output to the remote server scp -r /net/cluster/TMP/$USER/czeng_1.0/output remote_server:/some/remote/dir/czeng_1.0/ # Clean up rm -r /net/cluster/TMP/$USER/czeng_1.0 # Leave screen exit # Logout exit
PDT to Manatee¶
Conversion of PDT was implemenented in perl as a Block for Treex (Treex::Block::Write::Manatee). This block converts documents in PDT to <doc> structures in vertical files for Manatee.
the following attributes are included in the output:
- word
- lemma
- POS positional tags (16 characters)
- afun
The structure of the output document is as follows:
<doc id="xyz"> <s id="1"> token lemma POS-tag afun token lemma POS-tag afun token lemma POS-tag afun ... </s> <s id="2"> token lemma POS-tag afun token lemma POS-tag afun token lemma POS-tag afun ... </s> ... </doc>
Python script was developed to extract metadata from document ID's. Description of ID's structure is provided here: http://ufal.mff.cuni.cz/pdt2.0/doc/pdt-guide/en/html/ch03.html
but is incorrect and incomplete. This is corrected description of ther structure:
Prefixes:
Code structure | Name | Comments |
---|---|---|
lnYYNNN | Lidové noviny | YY - last two digits of the year, NNN - issue number (day of the year excluding Sundays and public holidays) |
lndYYNNN | Lidové noviny | YY - last two digits of the year, NNN - issue number (day of the year excluding Sundays and public holidays) |
mfYYMMDD | Mladá fronta Dnes | YY - last two digits of the year, MM - month, DD - day of the month |
cmprYYNN | Českomoravský Profit | YY - last two digits of the year, NN - issue number (week) |
vesmYYNN | Vesmír | YY - last two digits of the year, NN - issue number (month) |
Shell script was created to easily convert the whole set of PDT documents to a single corpus.
Treex to Manatee¶
Conversion of treex files was implemenented in perl as a Block for Treex (Treex::Block::Write::Manatee). This block converts documents in treex to <doc> structures in vertical files for Manatee.
Updated by Redmine Admin almost 8 years ago · 1 revisions