Overview¶
Introduction¶
The LINDAT/CLARIN repository contains many corpora in different formats. Some of these corpora are maintained by other institutions and may be accessible online, but most are maintained by ÚFAL. For a long time there was no interface for searching these resources, either by humans or by remote services. This is why the KonText corpus manager was set up.
Terminology¶
Corpus¶
According to the Wikipedia definition, a corpus is a "large and structured set of texts ... They are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory". Whether the set must be large is arguable, and there are many more applications, such as machine translation, but these are the basic use cases.
A typical corpus looks like a list of words with lemmas and part-of-speech tags.
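To make the "list of words with lemmas and part-of-speech tags" concrete, here is a minimal sketch that parses such a list. It assumes a one-token-per-line layout with tab-separated word, lemma and tag columns and `<s>`...`</s>` sentence markers; the sample sentence, the tag set and the names `Token` and `parse_vertical` are illustrative only and do not come from any LINDAT corpus.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    word: str
    lemma: str
    tag: str


# Hypothetical sample: one token per line, columns separated by tabs.
SAMPLE = "\n".join([
    "<s>",
    "The\tthe\tDT",
    "cats\tcat\tNNS",
    "sleep\tsleep\tVBP",
    ".\t.\t.",
    "</s>",
])


def parse_vertical(text: str) -> List[List[Token]]:
    """Split one-token-per-line text into sentences of Token tuples."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Lines such as <s> or </s> carry structure, not tokens.
        if line.startswith("<") and line.endswith(">") and "\t" not in line:
            if line == "</s>" and current:
                sentences.append(current)
                current = []
            continue
        word, lemma, tag = line.split("\t")
        current.append(Token(word, lemma, tag))
    if current:  # text that does not end with </s>
        sentences.append(current)
    return sentences
```

Calling `parse_vertical(SAMPLE)` yields one sentence of four tokens. Real corpora may carry more columns or other structural markers, so the three-column split is a simplification.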
Treebank¶
A treebank is a corpus that additionally contains information about the syntactic tree structure of each sentence.
Corpus manager¶
A corpus manager is a tool that stores a corpus in a format suitable for efficient querying and provides an API for querying it.
Manatee¶
Manatee is a corpus management tool that covers corpus building and indexing, fast querying, and basic statistical measures (see http://nlp.fi.muni.cz/trac/noske). It uses a fast indexing library called Finlib.
Bonito¶
Bonito is a graphical user interface to corpora maintained by Manatee (see http://nlp.fi.muni.cz/trac/noske). It is available as a standalone graphical application in Tcl/Tk (Bonito1, no longer developed or supported) and as a web interface in Python (Bonito2, under constant development).
KonText¶
KonText started as a fork of the Bonito 2.68 Python web interface to the corpus management tool Manatee.
It is maintained by the Institute of the Czech National Corpus (ÚČNK). Version 0.5.4 contains all the key features of Bonito 2.98.3 (primarily support for parallel corpora).
See https://bitbucket.org/ucnk/kontext for the source code and http://www.korpus.cz for the installation at ÚČNK.
LINDAT/CLARIN KonText¶
LINDAT/CLARIN KonText is a fork of (ÚČNK) KonText that contains some modifications and additional features. These include:
- support for Shibboleth authentication (using LINDAT AAI)
- Federated Content Search support (REST-like API)
- LINDAT/CLARIN branding
- front page with a list of available corpora
- small JavaScript enhancements in the GUI
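As an illustration of the REST-like Federated Content Search API, the sketch below only builds a search URL and does not send it. It assumes that the endpoint follows the SRU `searchRetrieve` convention on which CLARIN FCS is based, and that it is reachable under the `/services/fcs-kontext` path exposed by the LINDAT proxy configuration later on this page. The SRU version, the default record limit and the name `build_fcs_search_url` are assumptions, not something this page confirms.

```python
from urllib.parse import urlencode

# Assumed public endpoint, taken from the proxy configuration on this page.
FCS_ENDPOINT = "http://lindat.mff.cuni.cz/services/fcs-kontext"


def build_fcs_search_url(query: str, maximum_records: int = 10,
                         endpoint: str = FCS_ENDPOINT) -> str:
    """Build an SRU searchRetrieve URL for the given query (no request is sent)."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",              # assumed SRU version
        "query": query,
        "maximumRecords": maximum_records,
    }
    return endpoint + "?" + urlencode(params)
```

For example, `build_fcs_search_url("cat", 5)` produces a URL whose query string carries `operation=searchRetrieve`, `query=cat` and `maximumRecords=5`. Which parameters the installation actually honours would need to be checked against the running service.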
Infrastructure architecture¶
Overview¶
There are two environments so far:
- production (http://lindat.mff.cuni.cz/services/kontext)
- development (http://ufal-point-dev.ms.mff.cuni.cz/services/kontext-dev)
Each environment consists of:
- LINDAT proxy that forwards HTTP requests to the Quest virtual infrastructure proxy
- Quest proxy that forwards HTTP requests to the application server
- Application server that hosts both the KonText application and the corpora maintained by Manatee (accessible only from Quest)
- Data storage that holds the actual corpora files and indexes
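The overall picture (original diagram not reproduced here): LINDAT proxy, then Quest proxy, then application server, which reads the corpora from the data storage.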
LINDAT proxy¶
Note that this section is outdated after moving to nginx.
- server: lindat.mff.cuni.cz
- apache configuration in /etc/apache2/ufal-proxies.conf:
    RewriteRule ^/services/fcs-kontext/$ /services/fcs-kontext [R,L,NE]

    <Location /services/fcs-kontext>
        ProxyPass http://quest.ms.mff.cuni.cz/kontext/run.cgi/fcs
        ProxyPassReverse http://quest.ms.mff.cuni.cz/kontext/run.cgi/fcs
    </Location>

    <Location /services/kontext>
        ProxyPass http://quest.ms.mff.cuni.cz/kontext
        ProxyPassReverse http://quest.ms.mff.cuni.cz/kontext
        RequestHeader set Host lindat.mff.cuni.cz
    </Location>

    <Location /services/kontext/run.cgi/loginx>
        AuthType shibboleth
        ShibRequireSession On
        ShibUseHeaders On
        require valid-user
    </Location>
- server: ufal-point-dev.ms.mff.cuni.cz
- apache configuration in /etc/apache2/ufal-proxies.conf:
    RewriteRule ^/services/fcs-kontext/$ /services/fcs-kontext [R,L,NE]

    <Location /services/fcs-kontext>
        ProxyPass http://quest.ms.mff.cuni.cz/kontext-dev/run.cgi/fcs
        ProxyPassReverse http://quest.ms.mff.cuni.cz/kontext-dev/run.cgi/fcs
    </Location>

    <Location /services/kontext-dev>
        ProxyPass http://quest.ms.mff.cuni.cz/kontext-dev
        ProxyPassReverse http://quest.ms.mff.cuni.cz/kontext-dev
        RequestHeader set Host ufal-point-dev.ms.mff.cuni.cz
    </Location>

    <Location /services/kontext-dev/run.cgi/loginx>
        AuthType shibboleth
        ShibRequireSession On
        ShibUseHeaders On
        require valid-user
    </Location>
Quest proxy¶
- server: quest.ms.mff.cuni.cz
- apache configuration:
    ProxyPreserveHost on

    <Location /kontext>
        ProxyPass http://kontext/kontext
        ProxyPassReverse http://kontext/kontext
    </Location>

    <Location /kontext-dev>
        ProxyPass http://kontext-dev/kontext-dev
        ProxyPassReverse http://kontext-dev/kontext-dev
    </Location>
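The two proxy layers can be traced with a small sketch that maps a public production path to the URL finally requested on the application server. It models only the `ProxyPass` prefix mappings for production and ignores rewrites, headers and Shibboleth. It also assumes that the Quest production location is `/kontext`, which is the path the LINDAT proxy forwards to. The function names and the `first_form` action in the usage note are illustrative.

```python
from typing import List, Optional, Tuple
from urllib.parse import urlsplit

Rules = List[Tuple[str, str]]  # (Location path, ProxyPass target)

# Production mappings copied from the LINDAT proxy configuration.
LINDAT_RULES: Rules = [
    ("/services/fcs-kontext", "http://quest.ms.mff.cuni.cz/kontext/run.cgi/fcs"),
    ("/services/kontext", "http://quest.ms.mff.cuni.cz/kontext"),
]

# Quest mapping; the /kontext location is an assumption (see above).
QUEST_RULES: Rules = [
    ("/kontext", "http://kontext/kontext"),
]


def apply_proxy_pass(path: str, rules: Rules) -> Optional[str]:
    """Return the backend URL for the first location matching the path."""
    for location, target in rules:
        # Match on whole path segments, so /kontext does not match /kontext-dev.
        if path == location or path.startswith(location + "/"):
            return target + path[len(location):]
    return None


def route(public_path: str) -> Optional[str]:
    """Follow a public path through the LINDAT proxy and then the Quest proxy."""
    quest_url = apply_proxy_pass(public_path, LINDAT_RULES)
    if quest_url is None:
        return None
    return apply_proxy_pass(urlsplit(quest_url).path, QUEST_RULES)
```

Under these assumptions, `route("/services/kontext/run.cgi/first_form")` resolves to `http://kontext/kontext/run.cgi/first_form`, and `route("/services/fcs-kontext")` resolves to `http://kontext/kontext/run.cgi/fcs`. Apache's real section merging is more involved than a first-match lookup, so this is a reading aid rather than a model of the server.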
Application servers¶
Two servers, both accessible from Quest, are used for this project:
- kontext - production server (http://lindat.mff.cuni.cz/services/kontext)
- kontext-dev - development server (http://ufal-point-dev.ms.mff.cuni.cz/services/kontext-dev)
Installation of Finlib, Manatee and KonText on the application server is described in more detail here: Installation.
A Dockerfile will be prepared to make setting up new environments easier and less error-prone.
Directory structure¶
The directory structure on the application server is as follows (the ENVIRONMENT variable stands for production or development):
- /opt/projects/lindat-services-kontext/${ENVIRONMENT}/lindat-kontext - all KonText source code
- /opt/projects/lindat-services-kontext/${ENVIRONMENT}/data - all corpora related data
- /opt/projects/lindat-services-kontext/${ENVIRONMENT}/log - log files
- /opt/projects/lindat-services-kontext/${ENVIRONMENT}/pythonenv - Python virtual environment created by virtualenv
- /opt/projects/lindat-services-kontext/${ENVIRONMENT}/scripts - scripts related to the environment (for example, scripts for updating the environment)
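The layout above can be expressed as a small helper that resolves the directories for a given environment. This is only a sketch: the dictionary keys and the function name `environment_paths` are made up for illustration, while the directory names come from the list above.

```python
from pathlib import PurePosixPath
from typing import Dict

BASE = PurePosixPath("/opt/projects/lindat-services-kontext")

# Illustrative key -> directory name, as listed on this page.
SUBDIRS = {
    "source": "lindat-kontext",
    "data": "data",
    "log": "log",
    "pythonenv": "pythonenv",
    "scripts": "scripts",
}


def environment_paths(environment: str) -> Dict[str, PurePosixPath]:
    """Return the directories of one environment (production or development)."""
    if environment not in ("production", "development"):
        raise ValueError("unknown environment: " + environment)
    root = BASE / environment
    return {key: root / name for key, name in SUBDIRS.items()}
```

For instance, `environment_paths("production")["scripts"]` is the directory that holds the update scripts used for deployment.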
Processing HTTP request¶
HTTP requests are handled by Apache and processed using the CGI interface.
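For readers unfamiliar with CGI, the following sketch shows the general mechanism; it is not KonText's actual `run.cgi`. Apache starts the script for each request and passes request details in environment variables: for a URL such as `.../run.cgi/fcs?query=cat`, `PATH_INFO` holds `/fcs` and `QUERY_STRING` holds `query=cat`. The script answers by writing headers and a body to standard output. The `index` default and the plain-text response are assumptions made for the example.

```python
import os
import sys
from typing import Mapping
from urllib.parse import parse_qs


def handle_cgi_request(environ: Mapping[str, str]) -> str:
    """Build a plain-text CGI response that echoes the action and parameters."""
    # The part of the URL after the script name selects the action.
    action = environ.get("PATH_INFO", "").strip("/") or "index"
    params = parse_qs(environ.get("QUERY_STRING", ""))
    lines = ["action: " + action]
    for name in sorted(params):
        lines.append(name + ": " + ", ".join(params[name]))
    body = "\n".join(lines) + "\n"
    # A CGI response is a header block, an empty line, then the body.
    return "Content-Type: text/plain; charset=utf-8\r\n\r\n" + body


if __name__ == "__main__":
    sys.stdout.write(handle_cgi_request(os.environ))
```

Because a new process is started for every request, CGI is simple to deploy but pays the interpreter start-up cost each time.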
Data storage¶
The data storage is a mounted NFS disk, because there is not enough space on the local disks of the Quest virtual machines.
Data storage is mounted as:
- on kontext: /a/QUESTDATA/data/kontext/opt/projects/lindat-services-kontext/production/data/corpora and symlinked to /opt/projects/lindat-kontext/production/data/corpora
- on kontext-dev: /a/QUESTDATA/data/kontext/opt/projects/lindat-services-kontext/development/data/corpora and symlinked to /opt/projects/lindat-kontext/development/data/corpora
Development¶
Source code¶
KonText is written in Python and relies on the Python extension provided by Manatee.
Source code of LINDAT/CLARIN KonText is maintained via Git VCS in Redmine (gitolite@redmine.ms.mff.cuni.cz:lindat/lindat-services/lindat-services-kontext.git).
The original ÚČNK KonText is maintained via Mercurial on BitBucket (https://bitbucket.org/ucnk/kontext/).
Therefore there is also a fork of LINDAT/CLARIN KonText on BitBucket (https://bitbucket.org/ufal/lindat-kontext) that can be used for synchronizing the LINDAT/CLARIN and ÚČNK versions.
This synchronization is, however, not straightforward and is described in more detail here: Development.
Deploying new versions¶
Deploying new versions to production environment consists of the following steps:
    ssh quest
    ssh kontext
    cd /opt/projects/lindat-services-kontext/production/scripts
    ./update_all.sh
This is a wrapper that calls two more scripts in the same directory:
- update_kontext.sh - performs a git update and runs grunt
- update_corpora.sh - copies all compiled corpora from the development environment
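The contents of `update_all.sh` are not shown on this page. The sketch below is a Python equivalent of what such a wrapper might do, assuming it runs the two scripts in the order listed and stops at the first failure. The function name `update_all` and the use of `sh` to start each script are choices made for the example.

```python
import subprocess
from pathlib import Path
from typing import List, Sequence

# Order assumed from the list above.
STEPS = ("update_kontext.sh", "update_corpora.sh")


def update_all(scripts_dir: str, steps: Sequence[str] = STEPS) -> List[str]:
    """Run each update script in order and return the names that completed."""
    completed: List[str] = []
    for step in steps:
        script = Path(scripts_dir) / step
        # check=True raises CalledProcessError on a non-zero exit status,
        # so later steps are skipped when an earlier one fails.
        subprocess.run(["sh", str(script)], cwd=scripts_dir, check=True)
        completed.append(step)
    return completed
```

Stopping on the first failure keeps a broken code update from being followed by a corpora copy, which makes a half-finished deployment easier to diagnose.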
Additionally, it might be necessary to update the config.xml file in the /opt/projects/lindat-services-kontext/production/lindat-kontext directory.
Corpora¶
Creating corpora is described in more detail here: Corpora.
Updated by Redmine Admin almost 8 years ago · 1 revisions