Corpora conversion » History » Version 1
Redmine Admin, 01/04/2017 05:26 PM
h1. Corpora conversion

h2. Overview

Conversion of corpora is performed in the development environment only, using the following tools:

* Makefile
* bash tools (awk, sed)
* Treex
* Python

h2. Cluster setup

See the detailed description of the cluster here: https://wiki.ufal.ms.mff.cuni.cz/grid

h3. SGE (Sun Grid Engine) setup

Set up the SGE environment as described here: https://wiki.ufal.ms.mff.cuni.cz/zakladni-nastaveni-sge

h3. Disk space

For each conversion, create a separate directory in /net/cluster/TMP (automounted) or /net/cluster/SSD (automounted), as there is not enough space elsewhere for large corpora.
Be sure to clean up all files when the conversion is finished.

h3. Treex

Install perlbrew on your local computer (your $HOME will be available on the cluster as well) as described here: https://wiki.ufal.ms.mff.cuni.cz/perlbrew
Install Treex from SVN as described here: http://ufal.mff.cuni.cz/treex/install.html (most of the prerequisites are already provided by perlbrew).

h2. Conversion process

<pre>
# SSH to the cluster:
ssh lrc1
# Use screen
screen
# Create directory in /net/cluster/TMP e.g.:
mkdir -p /net/cluster/TMP/$USER/czeng_1.0
# Change the directory:
cd /net/cluster/TMP/$USER/czeng_1.0
# Create output directory:
mkdir output
# Copy the input data files to the cluster:
scp -r remote_server:/some/remote/dir/czeng_1.0/input ./
# Copy the scripts to the cluster:
scp -r remote_server:/some/remote/dir/czeng_1.0/scripts ./
# Run the conversion:
cd scripts
make
# Wait until the job finishes
# ...
# Copy the output to the remote server
scp -r /net/cluster/TMP/$USER/czeng_1.0/output remote_server:/some/remote/dir/czeng_1.0/
# Clean up
rm -r /net/cluster/TMP/$USER/czeng_1.0
# Leave screen
exit
# Logout
exit
</pre>

h2. PDT to Manatee

Conversion of PDT was implemented in Perl as a Treex block (Treex::Block::Write::Manatee). This block converts PDT documents to <doc> structures in vertical files for Manatee.

The following attributes are included in the output:

* word
* lemma
* POS positional tags (16 characters)
* afun

The structure of the output document is as follows:

<pre>
<doc id="xyz">
<s id="1">
token lemma POS-tag afun
token lemma POS-tag afun
token lemma POS-tag afun
...
</s>
<s id="2">
token lemma POS-tag afun
token lemma POS-tag afun
token lemma POS-tag afun
...
</s>
...
</doc>
</pre>

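To sanity-check a converted file, the vertical structure can be read back with a few lines of Python. This is only a minimal sketch and not part of the conversion tools: the function name @read_vertical@ and the field names are hypothetical, and it assumes that the four columns are separated by tabs and that the only structure tags are <doc> and <s> with an @id@ attribute.

```python
import re

# Opening structure tags as they appear in the output: <doc id="..."> and <s id="...">
OPEN_TAG = re.compile(r'<(doc|s)\s+id="([^"]*)"\s*>')

# Column order of the output; the field names themselves are illustrative.
FIELDS = ("word", "lemma", "tag", "afun")


def read_vertical(lines):
    """Yield (doc_id, sentence_id, tokens) for every <s> structure.

    Assumes tab-separated columns; rows with fewer columns than FIELDS
    simply produce shorter dictionaries.
    """
    doc_id = sent_id = None
    tokens = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = OPEN_TAG.fullmatch(line)
        if match:
            if match.group(1) == "doc":
                doc_id = match.group(2)
            else:
                # A new sentence starts: remember its id and reset the token list.
                sent_id, tokens = match.group(2), []
        elif line == "</s>":
            yield doc_id, sent_id, tokens
        elif line == "</doc>":
            doc_id = None
        else:
            tokens.append(dict(zip(FIELDS, line.split("\t"))))


if __name__ == "__main__":
    # Made-up sample data, only to show the call.
    sample = '<doc id="xyz">\n<s id="1">\nPes\tpes\tTAG\tSb\n</s>\n</doc>\n'
    for doc, sent, toks in read_vertical(sample.splitlines()):
        print(doc, sent, len(toks))
```

Counting the yielded sentences and tokens per document gives a quick comparison against the source data before the corpus is compiled.
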
A Python script was developed to extract metadata from document IDs. The description of the ID structure at http://ufal.mff.cuni.cz/pdt2.0/doc/pdt-guide/en/html/ch03.html is incorrect and incomplete. The corrected structure is as follows:

Prefixes:

|_.Code structure |_.Name |_.Comments |
| lnYYNNN | Lidové noviny | YY - last two digits of the year, NNN - issue number (day of the year excluding Sundays and public holidays) |
| lndYYNNN | Lidové noviny | YY - last two digits of the year, NNN - issue number (day of the year excluding Sundays and public holidays) |
| mfYYMMDD | Mladá fronta Dnes | YY - last two digits of the year, MM - month, DD - day of the month |
| cmprYYNN | Českomoravský Profit | YY - last two digits of the year, NN - issue number (week) |
| vesmYYNN | Vesmír | YY - last two digits of the year, NN - issue number (month) |

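The prefix table can be turned into a small parser. The following is a hedged sketch of how such an extraction might look; it is not the script mentioned above. The function name @parse_doc_id@, the returned keys, and the sample IDs are assumptions, and anything following the prefix in an ID is ignored.

```python
import re

# One entry per row of the prefix table: (code, source name, pattern).
# "lnd" is listed before "ln" only for readability; the patterns cannot
# both match the same ID because "ln" must be followed by a digit.
ID_PATTERNS = [
    ("lnd", "Lidové noviny", re.compile(r"lnd(?P<yy>\d{2})(?P<issue>\d{3})")),
    ("ln", "Lidové noviny", re.compile(r"ln(?P<yy>\d{2})(?P<issue>\d{3})")),
    ("mf", "Mladá fronta Dnes",
     re.compile(r"mf(?P<yy>\d{2})(?P<month>\d{2})(?P<day>\d{2})")),
    ("cmpr", "Českomoravský Profit", re.compile(r"cmpr(?P<yy>\d{2})(?P<issue>\d{2})")),
    ("vesm", "Vesmír", re.compile(r"vesm(?P<yy>\d{2})(?P<issue>\d{2})")),
]


def parse_doc_id(doc_id):
    """Return a dict of metadata for a document ID, or None if no prefix matches.

    All values are kept as strings exactly as they appear in the ID;
    the two-digit year is not expanded.
    """
    for code, source, pattern in ID_PATTERNS:
        match = pattern.match(doc_id)  # match() anchors at the start of the ID
        if match:
            meta = {"code": code, "source": source}
            meta.update(match.groupdict())
            return meta
    return None


if __name__ == "__main__":
    # Hypothetical IDs built from the table, for illustration only.
    for example in ("ln94200_100", "mf920901_001", "vesm9211_001"):
        print(example, parse_doc_id(example))
```

Keeping the values as strings avoids guessing the century or the exact meaning of an issue number; such interpretation can be added once it has been checked against the data.
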
A shell script was created to easily convert the whole set of PDT documents into a single corpus.

h2. Treex to Manatee

Conversion of Treex files was implemented in Perl as a Treex block (Treex::Block::Write::Manatee). This block converts Treex documents to <doc> structures in vertical files for Manatee.