Introduction

The LINDAT/CLARIN repository contains many corpora in different formats. Some of these corpora are maintained by other institutions and may be accessible online, but most are maintained by ÚFAL, and for a long time there was no interface for searching these resources, either by humans or by remote services. This is why the KonText corpus manager was set up.

Terminology

Corpus

A corpus is, by the Wikipedia definition, a "large and structured set of texts ... They are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory". Whether the set has to be large is arguable, and there are many more applications, such as machine translation, but these are the basic use cases.

A typical corpus looks like a list of words, each annotated with its lemma and part-of-speech tag.
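
The word/lemma/tag layout can be illustrated with a minimal sketch. It assumes a tab-separated, one-token-per-line text with sentence markup in angle brackets; the sample sentence, the tag names, and the `Token` and `parse_vertical` names are illustrative assumptions, not taken from any LINDAT corpus.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    lemma: str
    tag: str


def parse_vertical(text: str) -> list[Token]:
    """Parse tab-separated word/lemma/tag lines, skipping blanks and <markup> lines."""
    tokens = []
    for line in text.splitlines():
        line = line.strip()
        # Structural markup such as <s> or </s> is not a token.
        if not line or line.startswith("<"):
            continue
        word, lemma, tag = line.split("\t")
        tokens.append(Token(word, lemma, tag))
    return tokens


# Illustrative sample only: one sentence, three annotated tokens.
SAMPLE = "<s>\nThe\tthe\tDT\ncats\tcat\tNNS\nsleep\tsleep\tVBP\n</s>\n"

tokens = parse_vertical(SAMPLE)
print([t.lemma for t in tokens])
```

Keeping each token as a named tuple makes it easy to search by any attribute, for example all tokens whose lemma is "cat".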

Treebank

A treebank is a corpus that contains additional information about the syntactic tree structure of each sentence.

Corpus manager

A corpus manager is a tool that stores corpora in a format suitable for efficient querying and provides an API for querying them.
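
To make "a format suitable for efficient querying" concrete, here is a toy sketch of a positional index with a concordance lookup. It is only an illustration of the general idea and says nothing about how Manatee or Finlib are implemented; the function names and the sample sentence are assumptions.

```python
from collections import defaultdict


def build_index(tokens: list[str]) -> dict[str, list[int]]:
    """Map each word form to the ascending list of positions where it occurs."""
    index = defaultdict(list)
    for position, word in enumerate(tokens):
        index[word].append(position)
    return dict(index)


def concordance(tokens, index, word, context=2):
    """Return (left, keyword, right) context windows for every hit of `word`."""
    hits = []
    for pos in index.get(word, []):
        left = tokens[max(0, pos - context):pos]
        right = tokens[pos + 1:pos + 1 + context]
        hits.append((left, tokens[pos], right))
    return hits


corpus = "the cat sat on the mat".split()
index = build_index(corpus)
print(concordance(corpus, index, "the", context=1))
```

The index is built once, so a query only touches the positions of the requested word instead of scanning the whole text.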

Manatee

Manatee is a corpus management tool that covers corpus building and indexing, fast querying, and basic statistical measures (see http://nlp.fi.muni.cz/trac/noske). It uses a fast indexing library called Finlib.

Bonito

Bonito is a graphical user interface to corpora maintained by Manatee (see http://nlp.fi.muni.cz/trac/noske). It is available as a standalone graphical application in Tcl/Tk (Bonito1, no longer developed or supported) and as a web interface in Python (Bonito2, under active development).

KonText

KonText started as a fork of the Bonito 2.68 Python web interface to the corpus management tool Manatee.
It is maintained by the Institute of the Czech National Corpus (ÚČNK). Version 0.5.4 contains all the key features of Bonito 2.98.3 (primarily support for parallel corpora).
See https://bitbucket.org/ucnk/kontext for the source code and http://www.korpus.cz for the installation at ÚČNK.

LINDAT/CLARIN KonText

LINDAT/CLARIN KonText is a fork of the ÚČNK KonText that contains some modifications and additional features. These include:

  • support for Shibboleth authentication (using LINDAT AAI)
  • Federated Content Search support (REST-like API)
  • LINDAT/CLARIN branding
  • front page with a list of available corpora
  • small JavaScript enhancements in the GUI

Infrastructure architecture

Overview

There are two environments so far: production and development.

Each environment consists of:

  • LINDAT proxy, which forwards HTTP requests to the Quest virtual infrastructure proxy
  • Quest proxy, which forwards HTTP requests to the application server
  • Application server, which hosts both the KonText application and the corpora maintained by Manatee (accessible only from Quest)
  • Data storage, which holds the actual corpus files and indexes

The overall picture looks like this:

LINDAT proxy

Note that this section has been outdated since the move to nginx.

  • server: lindat.mff.cuni.cz
    • Apache configuration in /etc/apache2/ufal-proxies.conf:
      RewriteRule ^/services/fcs-kontext/$ /services/fcs-kontext [R,L,NE]
      <Location /services/fcs-kontext>
          ProxyPass http://quest.ms.mff.cuni.cz/kontext/run.cgi/fcs
          ProxyPassReverse http://quest.ms.mff.cuni.cz/kontext/run.cgi/fcs
      </Location>
      
      <Location /services/kontext>
          ProxyPass http://quest.ms.mff.cuni.cz/kontext
          ProxyPassReverse http://quest.ms.mff.cuni.cz/kontext
          RequestHeader set Host lindat.mff.cuni.cz
      </Location>
      
      <Location /services/kontext/run.cgi/loginx>
          AuthType shibboleth
          ShibRequireSession On
          ShibUseHeaders On
          require valid-user
      </Location>
      
  • server: ufal-point-ms.mff.cuni.cz
    • Apache configuration in /etc/apache2/ufal-proxies.conf:
      RewriteRule ^/services/fcs-kontext/$ /services/fcs-kontext [R,L,NE]
      <Location /services/fcs-kontext>
          ProxyPass http://quest.ms.mff.cuni.cz/kontext-dev/run.cgi/fcs
          ProxyPassReverse http://quest.ms.mff.cuni.cz/kontext-dev/run.cgi/fcs
      </Location>
      
      <Location /services/kontext-dev>
          ProxyPass http://quest.ms.mff.cuni.cz/kontext-dev
          ProxyPassReverse http://quest.ms.mff.cuni.cz/kontext-dev
          RequestHeader set Host ufal-point-dev.ms.mff.cuni.cz
      </Location>
      
      <Location /services/kontext-dev/run.cgi/loginx>
          AuthType shibboleth
          ShibRequireSession On
          ShibUseHeaders On
          require valid-user
      </Location>
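
The effect of the ProxyPass lines can be sketched as a prefix rewrite. The sketch below copies the two rules listed for lindat.mff.cuni.cz and models them as plain prefix replacement; this is a simplification and not Apache's actual matching logic, and the `PROXY_RULES` and `backend_url` names are hypothetical.

```python
from typing import Optional

# Public path prefix -> backend URL, copied from the ProxyPass lines listed
# for lindat.mff.cuni.cz. Plain prefix replacement is a simplification.
PROXY_RULES = {
    "/services/fcs-kontext": "http://quest.ms.mff.cuni.cz/kontext/run.cgi/fcs",
    "/services/kontext": "http://quest.ms.mff.cuni.cz/kontext",
}


def backend_url(path: str) -> Optional[str]:
    """Return the backend URL for a public request path, or None if unmatched."""
    # Try longer prefixes first so the most specific rule wins.
    for prefix in sorted(PROXY_RULES, key=len, reverse=True):
        # Match the prefix itself or anything below it, but not
        # look-alike paths such as /services/kontext-dev.
        if path == prefix or path.startswith(prefix + "/"):
            return PROXY_RULES[prefix] + path[len(prefix):]
    return None


print(backend_url("/services/kontext/run.cgi/loginx"))
print(backend_url("/services/fcs-kontext"))
```

Under these assumptions a request for /services/kontext/run.cgi/loginx is forwarded to the same path below /kontext on quest.ms.mff.cuni.cz, while the FCS endpoint is mapped onto run.cgi/fcs.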
      

Quest proxy

  • server: quest.ms.mff.cuni.cz
    • Apache configuration:
      ProxyPreserveHost on
      
      <Location /kontext>
          ProxyPass http://kontext/kontext
          ProxyPassReverse http://kontext/kontext
      </Location>
      
      <Location /kontext-dev>
          ProxyPass http://kontext-dev/kontext-dev
          ProxyPassReverse http://kontext-dev/kontext-dev
      </Location>
      

Application servers

Two servers, kontext and kontext-dev, are used for this project (both accessible from Quest).

Installation of Finlib, Manatee and KonText on the application server is described in more detail here: Installation.

A Dockerfile will be prepared to make setting up new environments easier and less error-prone.

Directory structure

The directory structure on the application server is as follows (the ENVIRONMENT variable stands for production or development):

  • /opt/projects/lindat-services-kontext/${ENVIRONMENT}/lindat-kontext - all KonText source code
  • /opt/projects/lindat-services-kontext/${ENVIRONMENT}/data - all corpus-related data
  • /opt/projects/lindat-services-kontext/${ENVIRONMENT}/log - log files
  • /opt/projects/lindat-services-kontext/${ENVIRONMENT}/pythonenv - Python virtual environment created by virtualenv
  • /opt/projects/lindat-services-kontext/${ENVIRONMENT}/scripts - scripts related to the environment (for example, scripts for updating the environment)
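
The layout above can be expressed as a small helper that expands the ENVIRONMENT placeholder. This is only a sketch for illustration; the `environment_paths` function and its constants are hypothetical and not part of the KonText code base.

```python
from pathlib import PurePosixPath

BASE = PurePosixPath("/opt/projects/lindat-services-kontext")
ENVIRONMENTS = ("production", "development")
SUBDIRS = ("lindat-kontext", "data", "log", "pythonenv", "scripts")


def environment_paths(environment: str) -> dict[str, PurePosixPath]:
    """Return the directory for each subtree of the given environment."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    return {name: BASE / environment / name for name in SUBDIRS}


for name, path in environment_paths("development").items():
    print(f"{name}: {path}")
```

Using `PurePosixPath` keeps the sketch free of file system access, so it can be run on any machine.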

Processing HTTP request

HTTP requests are handled by Apache and processed using the CGI interface.

Data storage

The data storage is a mounted NFS disk, because there is not enough space on the local disks of the Quest virtual machines.

Data storage is mounted as:

  • on kontext: /a/QUESTDATA/data/kontext/opt/projects/lindat-services-kontext/production/data/corpora and symlinked to /opt/projects/lindat-kontext/production/data/corpora
  • on kontext-dev: /a/QUESTDATA/data/kontext/opt/projects/lindat-services-kontext/development/data/corpora and symlinked to /opt/projects/lindat-kontext/development/data/corpora

Development

Source code

KonText is written in Python and relies on the Python extension provided by Manatee.

Source code of LINDAT/CLARIN KonText is maintained via Git VCS in Redmine (:lindat/lindat-services/lindat-services-kontext.git).

The original ÚČNK KonText is maintained via Mercurial on BitBucket (https://bitbucket.org/ucnk/kontext/).
Therefore there is also a fork of LINDAT/CLARIN KonText on BitBucket (https://bitbucket.org/ufal/lindat-kontext) that can be used to synchronize the LINDAT/CLARIN and ÚČNK versions.
This synchronization is, however, not straightforward and is described in more detail here: Development.

Deploying new versions

Deploying a new version to the production environment consists of the following steps:

ssh quest
ssh kontext 
cd /opt/projects/lindat-services-kontext/production/scripts
./update_all.sh

This is a wrapper that calls two more scripts in the same directory:

  • update_kontext.sh - updates the code from Git and runs grunt
  • update_corpora.sh - copies all compiled corpora from the development environment
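
The wrapper pattern can be sketched as follows. The contents of update_all.sh are not shown here, so the order of the two scripts and the stop-on-first-failure behavior are assumptions, and the `update_all` function is a hypothetical Python equivalent rather than the actual script.

```python
import subprocess
from pathlib import Path

# Script names are taken from the list above; their order is an assumption.
SCRIPTS = ("update_kontext.sh", "update_corpora.sh")


def update_all(scripts_dir, run=subprocess.run):
    """Run each update script in order and stop at the first failure."""
    completed = []
    for name in SCRIPTS:
        script = Path(scripts_dir) / name
        # check=True makes subprocess.run raise CalledProcessError
        # when the script exits with a non-zero status.
        run([str(script)], cwd=str(scripts_dir), check=True)
        completed.append(name)
    return completed
```

On the application server this would be called as `update_all("/opt/projects/lindat-services-kontext/production/scripts")`. The `run` parameter exists only so that the sketch can be exercised without executing real scripts.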

Additionally, it might be necessary to update the config.xml file in the /opt/projects/lindat-services-kontext/production/lindat-kontext directory.

Corpora

Creating corpora is described in more detail here: Corpora

Updated by Redmine Admin over 7 years ago · 1 revisions