Redmine Admin, 01/04/2017 05:21 PM

h1. Overview

h2. Introduction

The LINDAT/CLARIN repository contains many corpora in different formats. Some of these corpora are maintained by other institutions and may be accessible online, but most are maintained by ÚFAL, and for a long time there was no interface for searching these resources, either by humans or by remote services. This is why the KonText corpus manager was set up.

h2. Terminology

h3. Corpus

A corpus is, by the Wikipedia definition, a "large and structured set of texts ... They are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory". Whether the set must be large is arguable, and there are many more applications, such as machine translation, but these are the basic use cases.

A typical corpus looks like a list of words annotated with lemmas and part-of-speech tags.
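
As an illustration, here is a minimal Python sketch of such a token list. The tab-separated, one-token-per-line layout, the sample sentence, the tag names and the *parse_tokens* helper are assumptions made for this example. They do not describe any specific LINDAT/CLARIN corpus.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    lemma: str
    tag: str


# Hypothetical sample: one token per line, columns separated by tabs.
SAMPLE = "The\tthe\tDET\ncats\tcat\tNOUN\nsleep\tsleep\tVERB\n"


def parse_tokens(text: str) -> list[Token]:
    """Turn tab-separated lines into (word, lemma, tag) tokens, skipping blank lines."""
    tokens = []
    for line in text.splitlines():
        if not line.strip():
            continue
        word, lemma, tag = line.split("\t")
        tokens.append(Token(word, lemma, tag))
    return tokens


tokens = parse_tokens(SAMPLE)
# Collect the lemma column, e.g. as input for a frequency count.
lemmas = [t.lemma for t in tokens]
```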

h3. Treebank

A treebank is a corpus that additionally records the syntactic tree structure of each sentence.

h3. Corpus manager

A corpus manager is a tool that stores corpora in a format suitable for efficient querying and provides an API for querying them.

h3. Manatee

Manatee is a corpus management tool that covers corpus building and indexing, fast querying, and basic statistical measures (see http://nlp.fi.muni.cz/trac/noske). It uses a fast indexing library called Finlib.

h3. Bonito

Bonito is a graphical user interface to corpora maintained by Manatee (see http://nlp.fi.muni.cz/trac/noske). It is available as a standalone graphical application in Tcl/Tk (Bonito1, no longer developed or supported) and as a web interface in Python (Bonito2, under constant development).

h3. KonText

KonText started as a fork of the Bonito 2.68 Python web interface to the corpus management tool Manatee.
It is maintained by the Institute of the Czech National Corpus (ÚČNK). Version 0.5.4 contains all the key features of Bonito 2.98.3 (primarily support for parallel corpora).
See https://bitbucket.org/ucnk/kontext for the source code and http://www.korpus.cz for the installation at ÚČNK.

h3. LINDAT/CLARIN KonText

LINDAT/CLARIN KonText is a fork of the ÚČNK KonText that contains some modifications and additional features. These include:

* support for Shibboleth authentication (using LINDAT AAI)
* Federated Content Search support (REST-like API)
* LINDAT/CLARIN branding
* front page with a list of available corpora
* small JavaScript enhancements in the GUI
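
To make the REST-like Federated Content Search API more concrete, here is a hedged Python sketch that only builds a search URL. The endpoint path */services/fcs-kontext* is the one that appears in the LINDAT proxy configuration. The parameter names (operation, version, query, maximumRecords) are assumed from the SRU protocol that CLARIN FCS builds on, and whether this deployment accepts exactly these values is an assumption. The *build_fcs_url* helper is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Public FCS endpoint path as it appears in the LINDAT proxy configuration.
FCS_ENDPOINT = "http://lindat.mff.cuni.cz/services/fcs-kontext"


def build_fcs_url(query: str, max_records: int = 10) -> str:
    """Build a searchRetrieve URL; the parameter names are assumed from SRU 1.2."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": query,
        "maximumRecords": str(max_records),
    }
    return FCS_ENDPOINT + "?" + urlencode(params)


url = build_fcs_url("corpus manager", max_records=5)
# Round-trip check: decode the query string back into a dictionary of lists.
decoded = parse_qs(urlsplit(url).query)
```

The sketch does not send any request, so it can be run without network access.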

h2. Infrastructure architecture

h3. Overview

There are currently two environments:

* production (http://lindat.mff.cuni.cz/services/kontext)
* development (http://ufal-point-dev.ms.mff.cuni.cz/services/kontext-dev)

Each environment consists of:

* LINDAT proxy, which forwards HTTP(S) requests to the Quest virtual infrastructure proxy
* Quest proxy, which forwards HTTP(S) requests to the application server
* Application server, which hosts both the KonText application and the corpora maintained by Manatee (accessible only from Quest)
* Data storage, which holds the actual corpus files and indexes

The overall picture looks like this:

!LINDAT-CLARIN_KonText.png!

h3. LINDAT proxy

Note that this section has been outdated since the move to nginx.

* server: lindat.mff.cuni.cz
** Apache configuration in /etc/apache2/ufal-proxies.conf:
<pre>
RewriteRule ^/services/fcs-kontext/$ /services/fcs-kontext [R,L,NE]
<Location /services/fcs-kontext>
    ProxyPass http://quest.ms.mff.cuni.cz/kontext/run.cgi/fcs
    ProxyPassReverse http://quest.ms.mff.cuni.cz/kontext/run.cgi/fcs
</Location>

<Location /services/kontext>
    ProxyPass http://quest.ms.mff.cuni.cz/kontext
    ProxyPassReverse http://quest.ms.mff.cuni.cz/kontext
    RequestHeader set Host lindat.mff.cuni.cz
</Location>

<Location /services/kontext/run.cgi/loginx>
    AuthType shibboleth
    ShibRequireSession On
    ShibUseHeaders On
    require valid-user
</Location>
</pre>
* server: ufal-point-dev.ms.mff.cuni.cz
** Apache configuration in /etc/apache2/ufal-proxies.conf:
<pre>
RewriteRule ^/services/fcs-kontext/$ /services/fcs-kontext [R,L,NE]
<Location /services/fcs-kontext>
    ProxyPass http://quest.ms.mff.cuni.cz/kontext-dev/run.cgi/fcs
    ProxyPassReverse http://quest.ms.mff.cuni.cz/kontext-dev/run.cgi/fcs
</Location>

<Location /services/kontext-dev>
    ProxyPass http://quest.ms.mff.cuni.cz/kontext-dev
    ProxyPassReverse http://quest.ms.mff.cuni.cz/kontext-dev
    RequestHeader set Host ufal-point-dev.ms.mff.cuni.cz
</Location>

<Location /services/kontext-dev/run.cgi/loginx>
    AuthType shibboleth
    ShibRequireSession On
    ShibUseHeaders On
    require valid-user
</Location>
</pre>

h3. Quest proxy

* server: quest.ms.mff.cuni.cz
** Apache configuration:
<pre>
ProxyPreserveHost on

<Location /service/kontext>
    ProxyPass http://kontext/kontext
    ProxyPassReverse http://kontext/kontext
</Location>

<Location /kontext-dev>
    ProxyPass http://kontext-dev/kontext-dev
    ProxyPassReverse http://kontext-dev/kontext-dev
</Location>
</pre>
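
The two proxy hops can be traced with a small Python sketch. The prefix rules are copied from the ProxyPass directives for the development environment. The *forward* and *trace* helpers and the sample request path are hypothetical. The matching is a simplification that ignores query strings, headers and Shibboleth protection, so it is a model of the routing and not a replacement for Apache.

```python
from urllib.parse import urlsplit

# Prefix rewrites copied from the ProxyPass directives (development environment).
LINDAT_PROXY = {
    "/services/fcs-kontext": "http://quest.ms.mff.cuni.cz/kontext-dev/run.cgi/fcs",
    "/services/kontext-dev": "http://quest.ms.mff.cuni.cz/kontext-dev",
}
QUEST_PROXY = {
    "/kontext-dev": "http://kontext-dev/kontext-dev",
}


def forward(path: str, rules: dict[str, str]) -> str:
    """Rewrite a path using the longest matching prefix (simplified ProxyPass)."""
    for prefix in sorted(rules, key=len, reverse=True):
        if path == prefix or path.startswith(prefix + "/"):
            return rules[prefix] + path[len(prefix):]
    raise LookupError(f"no proxy rule matches {path!r}")


def trace(path: str) -> list[str]:
    """Return the backend URL produced by the LINDAT proxy, then by the Quest proxy."""
    first = forward(path, LINDAT_PROXY)
    second = forward(urlsplit(first).path, QUEST_PROXY)
    return [first, second]


# Hypothetical request path under the development environment.
hops = trace("/services/kontext-dev/run.cgi/first_form")
fcs_hops = trace("/services/fcs-kontext")
```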

h3. Application servers

Two servers are used for this project (both accessible from Quest):

* kontext - production server (http://lindat.mff.cuni.cz/services/kontext)
* kontext-dev - development server (http://ufal-point-dev.ms.mff.cuni.cz/services/kontext-dev)

Installation of Finlib, Manatee and KonText on the application server is described in more detail in [[Installation]].

A Dockerfile will be prepared to make setting up new environments easier and less error-prone.

h4. Directory structure

The directory structure on the application server is as follows (the ENVIRONMENT variable stands for *production* or *development*):

* /opt/projects/lindat-services-kontext/${ENVIRONMENT}/lindat-kontext - all KonText source code
* /opt/projects/lindat-services-kontext/${ENVIRONMENT}/data - all corpora-related data
* /opt/projects/lindat-services-kontext/${ENVIRONMENT}/log - log files
* /opt/projects/lindat-services-kontext/${ENVIRONMENT}/pythonenv - Python virtual environment created by virtualenv
* /opt/projects/lindat-services-kontext/${ENVIRONMENT}/scripts - scripts related to the environment (for example, scripts for updating the environment)
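
The layout above can be expressed as a small Python helper, for example for use in maintenance scripts. The base path and subdirectory names come from the list above. The *environment_dirs* function itself is a hypothetical illustration and is not part of the KonText code base.

```python
from pathlib import PurePosixPath

BASE = PurePosixPath("/opt/projects/lindat-services-kontext")
SUBDIRS = ("lindat-kontext", "data", "log", "pythonenv", "scripts")


def environment_dirs(environment: str) -> dict[str, PurePosixPath]:
    """Map each subdirectory name to its full path for the given environment."""
    if environment not in ("production", "development"):
        raise ValueError(f"unknown environment: {environment!r}")
    root = BASE / environment
    return {name: root / name for name in SUBDIRS}


dirs = environment_dirs("production")
```

PurePosixPath only manipulates path strings, so the sketch runs on any machine without touching the file system.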

h4. Processing HTTP request

HTTP requests are handled by Apache and processed using the CGI interface.
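
Under CGI, the web server starts a new process for each request, passes the request metadata in environment variables and reads the response from standard output. The following Python sketch shows that general mechanism only. It is not KonText's actual *run.cgi* script. The *handle* function and the response body are hypothetical, while REQUEST_METHOD, PATH_INFO and QUERY_STRING are standard CGI variables.

```python
import os
import sys
from urllib.parse import parse_qs


def handle(environ) -> str:
    """Build a plain-text CGI response from the request's environment variables."""
    method = environ.get("REQUEST_METHOD", "GET")
    path = environ.get("PATH_INFO", "/")
    params = parse_qs(environ.get("QUERY_STRING", ""))
    body = f"{method} {path} {sorted(params)}\n"
    # A CGI response is a header block, an empty line, then the body.
    return "Content-Type: text/plain; charset=utf-8\r\n\r\n" + body


if __name__ == "__main__":
    # When run by the web server, the request arrives via os.environ.
    sys.stdout.write(handle(os.environ))
```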

h3. Data storage

The data storage is a mounted NFS disk, because there is not enough space on the local disks of the Quest virtual machines.

The data storage is mounted as follows:

* on kontext: /a/QUESTDATA/data/kontext/opt/projects/lindat-services-kontext/production/data/corpora and symlinked to /opt/projects/lindat-kontext/production/data/corpora
* on kontext-dev: /a/QUESTDATA/data/kontext/opt/projects/lindat-services-kontext/development/data/corpora and symlinked to /opt/projects/lindat-kontext/development/data/corpora

h2. Development

h3. Source code

KonText is written in Python and relies on the Python extension provided by Manatee.

The source code of LINDAT/CLARIN KonText is maintained in Git in Redmine (gitolite@redmine.ms.mff.cuni.cz:lindat/lindat-services/lindat-services-kontext.git).

The original ÚČNK KonText is maintained in Mercurial on BitBucket (https://bitbucket.org/ucnk/kontext/).
Therefore, there is also a fork of LINDAT/CLARIN KonText on BitBucket (https://bitbucket.org/ufal/lindat-kontext) that can be used for synchronizing the LINDAT/CLARIN and ÚČNK versions.
This synchronization is, however, not straightforward and is described in more detail in [[Development]].

h3. Deploying new versions

Deploying a new version to the production environment consists of the following steps:

<pre>
<code class="bash">
ssh quest
ssh kontext
cd /opt/projects/lindat-services-kontext/production/scripts
./update_all.sh
</code>
</pre>

The *update_all.sh* script is a wrapper that calls two more scripts in the same directory:

* update_kontext.sh - performs a git update and runs grunt
* update_corpora.sh - copies all compiled corpora from the development environment
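
The wrapper logic can be sketched in Python as follows. The contents of *update_all.sh* are not shown in this document, so the order of the two steps and the stop-on-first-failure behaviour are assumptions. The *update_all* function is a hypothetical stand-in and not a transcription of the real script.

```python
import subprocess
from pathlib import PurePosixPath

SCRIPTS_DIR = PurePosixPath("/opt/projects/lindat-services-kontext/production/scripts")
# Assumed order: update the application first, then the corpora.
STEPS = ("update_kontext.sh", "update_corpora.sh")


def update_all(run=subprocess.run) -> list[str]:
    """Run each step in order; with check=True a failing step raises and stops the loop."""
    done = []
    for name in STEPS:
        run([str(SCRIPTS_DIR / name)], check=True, cwd=str(SCRIPTS_DIR))
        done.append(name)
    return done
```

The *run* parameter exists so that the sequence can be exercised with a fake runner on a machine where the scripts are not installed.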

Additionally, it might be necessary to update the *config.xml* file in the */opt/projects/lindat-services-kontext/production/lindat-kontext* directory.

h2. Corpora

The creation of corpora is described in more detail in [[Corpora]].