Corpora conversion » History » Version 1
Redmine Admin, 01/04/2017 05:26 PM
1 | 1 | Redmine Admin | h1. Corpora conversion |
---|---|---|---|
2 | |||
3 | h2. Overview |
||
4 | |||
5 | Corpora are converted in the development environment only, using the following tools: |
||
6 | |||
7 | * Makefile |
||
8 | * bash tools (awk, sed) |
||
9 | * treex |
||
10 | * python |
||
11 | |||
12 | h2. Cluster setup |
||
13 | |||
14 | See the detailed description of the cluster here: https://wiki.ufal.ms.mff.cuni.cz/grid |
||
15 | |||
16 | h3. SGE (Sun Grid Engine) setup |
||
17 | |||
18 | Set up the SGE environment as described here: https://wiki.ufal.ms.mff.cuni.cz/zakladni-nastaveni-sge |
||
19 | |||
20 | h3. Disk space |
||
21 | |||
22 | For each conversion, create a separate directory in /net/cluster/TMP or /net/cluster/SSD (both automounted), as there is not enough space elsewhere for large corpora. |
||
23 | Be sure to clean up any files when the conversion is finished. |
||
24 | |||
25 | h3. Treex |
||
26 | |||
27 | Install perlbrew on your local computer (your $HOME will be available on the cluster as well) as described here: https://wiki.ufal.ms.mff.cuni.cz/perlbrew |
||
28 | Install Treex from SVN as described here: http://ufal.mff.cuni.cz/treex/install.html. Note that most of the prerequisites are already provided by perlbrew. |
||
29 | |||
30 | h2. Conversion process |
||
31 | |||
32 | <pre> |
||
33 | # SSH to the cluster: |
||
34 | ssh lrc1 |
||
35 | # Use screen |
||
36 | screen |
||
37 | # Create directory in /net/cluster/TMP e.g.: |
||
38 | mkdir -p /net/cluster/TMP/$USER/czeng_1.0 |
||
39 | # Change the directory: |
||
40 | cd /net/cluster/TMP/$USER/czeng_1.0 |
||
41 | # Create output directory: |
||
42 | mkdir output |
||
43 | # Copy the input data files to the cluster: |
||
44 | scp -r remote_server:/some/remote/dir/czeng_1.0/input ./ |
||
45 | # Copy the scripts to the cluster: |
||
46 | scp -r remote_server:/some/remote/dir/czeng_1.0/scripts ./ |
||
47 | # Run the conversion: |
||
48 | cd scripts |
||
49 | make |
||
50 | # Wait until the job finishes |
||
51 | # ... |
||
52 | # Copy the output to the remote server |
||
53 | scp -r /net/cluster/TMP/$USER/czeng_1.0/output remote_server:/some/remote/dir/czeng_1.0/ |
||
54 | # Clean up |
||
55 | rm -r /net/cluster/TMP/$USER/czeng_1.0 |
||
56 | # Leave screen |
||
57 | exit |
||
58 | # Logout |
||
59 | exit |
||
60 | </pre> |
||
61 | |||
62 | h2. PDT to Manatee |
||
63 | |||
64 | Conversion of PDT was implemented in Perl as a Treex block (Treex::Block::Write::Manatee). This block converts PDT documents to <doc> structures in vertical files for Manatee. |
||
65 | |||
66 | The following attributes are included in the output: |
||
67 | |||
68 | * word |
||
69 | * lemma |
||
70 | * POS positional tags (16 characters) |
||
71 | * afun |
||
72 | |||
73 | The structure of the output document is as follows: |
||
74 | |||
75 | <pre> |
||
76 | <doc id="xyz"> |
||
77 | <s id="1"> |
||
78 | token lemma POS-tag afun |
||
79 | token lemma POS-tag afun |
||
80 | token lemma POS-tag afun |
||
81 | ... |
||
82 | </s> |
||
83 | <s id="2"> |
||
84 | token lemma POS-tag afun |
||
85 | token lemma POS-tag afun |
||
86 | token lemma POS-tag afun |
||
87 | ... |
||
88 | </s> |
||
89 | ... |
||
90 | </doc> |
||
91 | </pre> |
||
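As an illustration of how this structure can be consumed, here is a minimal Python sketch that reads such a vertical file and yields each token together with its document and sentence ID. It is not part of the conversion scripts. It assumes that token attributes are tab-separated (the usual convention for Manatee vertical files) and appear in the order listed above. The function name @read_vertical@, the attribute names in @ATTRS@ and the sample values are hypothetical.

```python
import re

# Assumed attribute order, following the list above (names are illustrative).
ATTRS = ("word", "lemma", "tag", "afun")

# Matches the opening <doc ...> and <s ...> tags and captures the tag name.
OPEN_TAG = re.compile(r"<(doc|s)(\s|>)")
ID_ATTR = re.compile(r'id="([^"]*)"')


def read_vertical(lines):
    """Yield (doc_id, sentence_id, token) for every token line.

    `lines` is any iterable of strings, e.g. an open file.
    `token` is a dict keyed by the names in ATTRS.
    """
    doc_id = None
    sent_id = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines
        opening = OPEN_TAG.match(line)
        if opening:
            found = ID_ATTR.search(line)
            value = found.group(1) if found else None
            if opening.group(1) == "doc":
                doc_id, sent_id = value, None
            else:
                sent_id = value
        elif line.startswith("</"):
            continue  # closing </s> or </doc>
        else:
            # Assumption: attributes are separated by tabs.
            yield doc_id, sent_id, dict(zip(ATTRS, line.split("\t")))


if __name__ == "__main__":
    # Made-up sample data, only to show the call.
    sample = '<doc id="xyz">\n<s id="1">\nA\ta\tT1\tSb\n</s>\n</doc>\n'
    for doc_id, sent_id, token in read_vertical(sample.splitlines()):
        print(doc_id, sent_id, token["word"], token["afun"])
```

A generator keeps memory use flat on large corpora, since only one line is held at a time.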
92 | |||
93 | A Python script was developed to extract metadata from document IDs. The ID structure is described here: http://ufal.mff.cuni.cz/pdt2.0/doc/pdt-guide/en/html/ch03.html |
||
94 | That description is incorrect and incomplete, however. The corrected structure is as follows: |
||
95 | |||
96 | Prefixes: |
||
97 | |||
98 | |_.Code structure |_.Name |_.Comments | |
||
99 | | lnYYNNN | Lidové noviny | YY - last two digits of the year, NNN - issue number (day of the year excluding Sundays and public holidays) | |
||
100 | | lndYYNNN | Lidové noviny | YY - last two digits of the year, NNN - issue number (day of the year excluding Sundays and public holidays) | |
||
101 | | mfYYMMDD | Mladá fronta Dnes | YY - last two digits of the year, MM - month, DD - day of the month | |
||
102 | | cmprYYNN | Českomoravský Profit | YY - last two digits of the year, NN - issue number (week) | |
||
103 | | vesmYYNN | Vesmír | YY - last two digits of the year, NN - issue number (month) | |
||
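The prefix table above can be turned into a small lookup. The following Python sketch shows one possible way to do it and is not the original script. It assumes that a document ID starts with one of the prefixes and may be followed by an arbitrary suffix. The function name @parse_doc_id@, the returned keys and the example IDs are hypothetical. Years are returned as two-digit numbers because the table does not state the century.

```python
import re

# One entry per row of the prefix table: (pattern, source name, field names).
# "lnd?" covers both the ln and lnd prefixes, which share the same layout.
PREFIXES = [
    (re.compile(r"lnd?(\d{2})(\d{3})"), "Lidové noviny", ("year", "issue")),
    (re.compile(r"mf(\d{2})(\d{2})(\d{2})"), "Mladá fronta Dnes",
     ("year", "month", "day")),
    (re.compile(r"cmpr(\d{2})(\d{2})"), "Českomoravský Profit",
     ("year", "issue")),
    (re.compile(r"vesm(\d{2})(\d{2})"), "Vesmír", ("year", "issue")),
]


def parse_doc_id(doc_id):
    """Return a metadata dict for a document ID, or None if no prefix matches."""
    for pattern, source, fields in PREFIXES:
        found = pattern.match(doc_id)  # anchored at the start of the ID
        if found:
            meta = {"source": source}
            meta.update(zip(fields, (int(g) for g in found.groups())))
            return meta
    return None


if __name__ == "__main__":
    # Illustrative IDs only; the suffix after the prefix is an assumption.
    for example in ("mf920922_001", "lnd94103_052", "unknown"):
        print(example, parse_doc_id(example))
```

Returning @None@ for unknown prefixes lets the caller decide whether to skip the document or stop the conversion.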
104 | |||
105 | A shell script was created to convert the whole set of PDT documents into a single corpus. |
||
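The merging step itself amounts to concatenating the per-document vertical files. The Python sketch below illustrates that idea only and is not the shell script mentioned above. It assumes that each converted document is stored as a separate file in one directory. The @.vert@ extension and the function name @concatenate_vertical@ are hypothetical.

```python
import shutil
from pathlib import Path


def concatenate_vertical(input_dir, output_file, pattern="*.vert"):
    """Append all files matching `pattern` in `input_dir` to `output_file`.

    Files are processed in sorted name order. Returns the number of files
    that were merged.
    """
    target = Path(output_file).resolve()
    # Collect inputs first and leave out the output file itself,
    # in case it lives in the same directory and matches the pattern.
    paths = [p for p in sorted(Path(input_dir).glob(pattern))
             if p.resolve() != target]
    with open(target, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)  # copy in chunks
    return len(paths)
```

Copying in binary mode avoids decoding the data, so the character encoding of the vertical files is preserved as is.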
106 | |||
107 | h2. Treex to Manatee |
||
108 | |||
109 | Conversion of Treex files was implemented in Perl as a Treex block (Treex::Block::Write::Manatee). This block converts Treex documents to <doc> structures in vertical files for Manatee. |