Redmine Admin, 01/04/2017 05:26 PM

h1. Corpora conversion

h2. Overview

Corpora are converted in the development environment only, using the following tools:

* Makefile
* bash tools (awk, sed)
* Treex
* Python

h2. Cluster setup

See the detailed description of the cluster here: https://wiki.ufal.ms.mff.cuni.cz/grid

h3. SGE (Sun Grid Engine) setup

Set up the SGE environment as described here: https://wiki.ufal.ms.mff.cuni.cz/zakladni-nastaveni-sge

h3. Disk space

For each conversion, create a separate directory in /net/cluster/TMP or /net/cluster/SSD (both automounted), as there is not enough space elsewhere for large corpora.
Be sure to clean up all files when the conversion is finished.

h3. Treex

Install perlbrew on your local computer (your $HOME will be available on the cluster as well) as described here: https://wiki.ufal.ms.mff.cuni.cz/perlbrew
Install Treex from SVN as described here: http://ufal.mff.cuni.cz/treex/install.html. Note that most of the prerequisites are already provided by perlbrew.

h2. Conversion process

<pre>
# SSH to the cluster:
ssh lrc1
# Use screen
screen
# Create directory in /net/cluster/TMP e.g.:
mkdir -p /net/cluster/TMP/$USER/czeng_1.0
# Change the directory:
cd /net/cluster/TMP/$USER/czeng_1.0
# Create output directory:
mkdir output
# Copy the input data files to the cluster:
scp -r remote_server:/some/remote/dir/czeng_1.0/input ./
# Copy the scripts to the cluster:
scp -r remote_server:/some/remote/dir/czeng_1.0/scripts ./
# Run the conversion:
cd scripts
make
# Wait until the job finishes
# ...
# Copy the output to the remote server
scp -r /net/cluster/TMP/$USER/czeng_1.0/output remote_server:/some/remote/dir/czeng_1.0/
# Clean up
rm -r /net/cluster/TMP/$USER/czeng_1.0
# Leave screen
exit
# Logout
exit
</pre>

h2. PDT to Manatee

Conversion of PDT was implemented in Perl as a Treex block (Treex::Block::Write::Manatee). This block converts PDT documents to <doc> structures in vertical files for Manatee.

The following attributes are included in the output:

* word
* lemma
* POS positional tags (16 characters)
* afun

The structure of the output document is as follows:

<pre>
<doc id="xyz">
<s id="1">
token    lemma    POS-tag    afun
token    lemma    POS-tag    afun
token    lemma    POS-tag    afun
...
</s>
<s id="2">
token    lemma    POS-tag    afun
token    lemma    POS-tag    afun
token    lemma    POS-tag    afun
...
</s>
...
</doc>
</pre>
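
As an illustration of this layout, here is a minimal Python sketch that renders one document in the vertical format. It is not the Treex block itself: the function name to_vertical, the tab separator between attributes, and the sequential sentence numbering are assumptions made for the example.

```python
from xml.sax.saxutils import quoteattr


def to_vertical(doc_id, sentences):
    """Render one document as vertical text.

    sentences is a list of sentences; each sentence is a list of
    (word, lemma, tag, afun) tuples of strings.
    """
    # quoteattr adds the surrounding quotes and escapes special characters.
    lines = ["<doc id=%s>" % quoteattr(doc_id)]
    # Assumption: sentences are numbered sequentially from 1.
    for number, tokens in enumerate(sentences, start=1):
        lines.append('<s id="%d">' % number)
        for token in tokens:
            # Assumption: attributes are separated by a single tab.
            lines.append("\t".join(token))
        lines.append("</s>")
    lines.append("</doc>")
    return "\n".join(lines) + "\n"
```

Calling to_vertical("xyz", [[("token", "lemma", "POS-tag", "afun")]]) yields a <doc id="xyz"> element containing a single <s id="1"> sentence with one token line.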

A Python script was developed to extract metadata from document IDs. The structure of the IDs is described at http://ufal.mff.cuni.cz/pdt2.0/doc/pdt-guide/en/html/ch03.html, but that description is incorrect and incomplete. The corrected structure is as follows:

Prefixes:

|_.Code structure |_.Name |_.Comments |
| lnYYNNN | Lidové noviny | YY - last two digits of the year, NNN - issue number (day of the year excluding Sundays and public holidays) |
| lndYYNNN | Lidové noviny | YY - last two digits of the year, NNN - issue number (day of the year excluding Sundays and public holidays) |
| mfYYMMDD | Mladá fronta Dnes | YY - last two digits of the year, MM - month, DD - day of the month |
| cmprYYNN | Českomoravský Profit | YY - last two digits of the year, NN - issue number (week) |
| vesmYYNN | Vesmír | YY - last two digits of the year, NN - issue number (month) |
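
The prefix table can be turned into a small lookup. The following Python sketch is not the original script: the function name parse_doc_id, the returned field names, and the choice to ignore any characters after the code are assumptions. Two-digit years are returned as given, with no century guessed.

```python
import re

# One entry per row of the prefix table: (pattern, source name, field names).
# Assumption: the code sits at the start of the ID and is not directly
# followed by another digit; whatever comes after it is ignored.
ID_PATTERNS = [
    (re.compile(r"^lnd(\d{2})(\d{3})(?!\d)"), "Lidové noviny", ("year", "issue")),
    (re.compile(r"^ln(\d{2})(\d{3})(?!\d)"), "Lidové noviny", ("year", "issue")),
    (re.compile(r"^mf(\d{2})(\d{2})(\d{2})(?!\d)"), "Mladá fronta Dnes",
     ("year", "month", "day")),
    (re.compile(r"^cmpr(\d{2})(\d{2})(?!\d)"), "Českomoravský Profit",
     ("year", "issue")),
    (re.compile(r"^vesm(\d{2})(\d{2})(?!\d)"), "Vesmír", ("year", "issue")),
]


def parse_doc_id(doc_id):
    """Return a metadata dict for a document ID, or None if no prefix matches."""
    for pattern, name, fields in ID_PATTERNS:
        match = pattern.match(doc_id)
        if match:
            meta = {"source": name}
            # Numeric parts are converted to integers, so "09" becomes 9.
            meta.update(zip(fields, (int(group) for group in match.groups())))
            return meta
    return None
```

For a hypothetical ID such as mf920922_001, the sketch returns the source name Mladá fronta Dnes together with year 92, month 9, and day 22. An ID with an unknown prefix returns None.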

A shell script was created to convert the whole set of PDT documents into a single corpus.

h2. Treex to Manatee

Conversion of Treex files was implemented in Perl as a Treex block (Treex::Block::Write::Manatee). This block converts Treex documents to <doc> structures in vertical files for Manatee.