

UNITEX HOME PAGE



What is Unitex?

Unitex is a corpus processing system based on automata-oriented technology. The concept of this software was born at LADL (Laboratoire d'Automatique Documentaire et Linguistique), under its director, Maurice Gross. With this tool, you can handle electronic resources such as electronic dictionaries and grammars and apply them to texts. You can work at the levels of morphology, the lexicon and syntax. The main functions are:

  • building, checking and applying electronic dictionaries

  • pattern matching with regular expressions and recursive transition networks

  • applying lexicon-grammar tables

  • handling ambiguity via the text automaton
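
To give an idea of what pattern matching with a recursive transition network involves, here is a minimal sketch in Python. It is not the Unitex implementation: the grammar layout, the `GRAMMAR` table, the `:` prefix for sub-graph calls and the `match` function are all assumptions made for this illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical grammar layout: each graph maps a state number to its outgoing
# transitions (label, target state) and has a set of final states.
# A label starting with ":" is a call to another graph; any other label
# must equal the current token. State 0 is always the initial state.
Graph = Dict[int, List[Tuple[str, int]]]

GRAMMAR: Dict[str, Tuple[Graph, Set[int]]] = {
    "NP": ({0: [("the", 1)], 1: [("cat", 2), ("dog", 2)]}, {2}),
    "S": ({0: [(":NP", 1)], 1: [("sees", 2)], 2: [(":NP", 3)]}, {3}),
}


def match(graph_name: str, tokens: List[str], start: int,
          depth: int = 0, max_depth: int = 50) -> Set[int]:
    """Return every token index at which graph_name can end when started at start."""
    if depth > max_depth:  # crude guard against unbounded recursive calls
        return set()
    graph, finals = GRAMMAR[graph_name]
    ends: Set[int] = set()
    seen: Set[Tuple[int, int]] = set()
    stack = [(0, start)]
    while stack:
        state, pos = stack.pop()
        if (state, pos) in seen:
            continue
        seen.add((state, pos))
        if state in finals:
            ends.add(pos)
        for label, target in graph.get(state, []):
            if label.startswith(":"):
                # Sub-graph call: resume in this graph wherever the callee ends.
                for end in match(label[1:], tokens, pos, depth + 1, max_depth):
                    stack.append((target, end))
            elif pos < len(tokens) and tokens[pos] == label:
                stack.append((target, pos + 1))
    return ends


tokens = "the cat sees the dog".split()
print(match("S", tokens, 0))
```

The function returns a set of end positions rather than a single one, because a graph may recognize several sequences of different lengths from the same starting point.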

From this site, you can download Unitex and find information about linguistic data.

Unitex is referenced in the PLUME project.


WARNING: versions 3.3.x of gcc contain a serious bug affecting the -O2 compilation option, which is used in the Unitex Makefile. To avoid this bug, remove the -O2 option from the Makefile. Version 3.4 seems to be unaffected.
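
If you prefer not to edit the Makefile by hand, the flag can be removed with a few lines of Python. This is only a sketch: the `strip_o2` helper is hypothetical, and the `CFLAGS` line in the example is an assumption, since the way the flag is written in the actual Unitex Makefile may differ.

```python
import re


def strip_o2(makefile_text: str) -> str:
    """Remove every standalone -O2 flag and leave all other text untouched."""
    # The lookarounds make sure -O2 is a whole word, so -O20 would not match.
    return re.sub(r"(?<!\S)-O2(?!\S)[ \t]?", "", makefile_text)


# Hypothetical Makefile line, used only to illustrate the substitution.
example = "CFLAGS = -Wall -O2 -g\n"
print(strip_o2(example), end="")
```

To apply it to a real file, read the Makefile into a string, pass it to `strip_o2`, and write the result back, keeping a copy of the original first.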


Last updates:

  • release of the stable 1.2 version.

  • release of the ENGLISH manual by June 2006.

  • update of the French manual by June 2006.

  • integration of Ancient Greek and Polish

  • integration of contextual matching

  • integration of Korean with dedicated programs that can handle the morphology system of Korean

  • golden rules for Unitex contributors

  • The electronic dictionaries included in Unitex are now LGPLLR-licensed



A multilingual platform

Unitex conforms to the Unicode 3.0 standard, which allows users to handle virtually all the characters of all languages, including Asian languages. The Unitex programs have been designed to work with all writing conventions. There is no difficulty in working with Asian languages, in spite of their particular spacing conventions.
 

Multi-system software

The Unitex interface is written in Java and all other programs are written in C/C++. This allows Unitex to work on every system that supports Java 1.6 and that can compile C/C++ programs.

Unitex has been tested successfully on Windows (95, 98, NT, 2000, XP, ME and Vista), Linux and Mac OS X, and now also runs on Solaris 8 SPARC.
 

Free software

Versions 1.0 and 1.1 of Unitex are distributed free under the terms of the General Public License (GPL). This means that everyone can redistribute Unitex freely within the terms of the GPL license. It also means that you have access to the source code of all the Unitex programs, which is included in the zip file you download. You can modify it freely and include it in any GPL-licensed program.

Since June 2004, version 1.2beta has been distributed free under the terms of the Lesser General Public License (LGPL), with the exception of the TRE library, which is GPL-licensed. This license is more permissive than the GPL, because it allows you to reuse Unitex's own code in non-free software.
 

Acknowledgments

Unitex is developed by Sébastien Paumier at the Institut Gaspard-Monge (IGM), University of Marne-la-Vallée (France).

Unitex also benefited from years of research, experiments and publications by Maurice Gross (1989, 1997), Dominique Revuz (1992), Emmanuel Roche (1992, 1997), Max Silberztein (1989, 1991, 1992, 1993, 1994, 1997) and other authors. Unitex would have been useless without the linguistic data (dictionaries and grammars) constructed by the laboratories of the RELEX network. 

The locate pattern function was re-used from previous software known as AGLAE.

The adaptation for Ancient Greek was made by Claude Devis (CENTAL), who also included new code pages (Windows & ISO) in the Asc2Uni and Uni2Asc conversion filters. Claude Devis also introduced morphological filters into Unitex, using an open-source regular expression library written by Ville Laurikari.

The MergeTextAutomaton program was written by Olivier Blanc (IGM).

The manual was translated into Portuguese by Alexis Neme and Oto Araújo Vale (Projeto Relex - Brasil).

The ELAG program was integrated by Olivier Blanc.

The text editor was written by Julien Decreton, who also developed the UNDO and REDO functions in the graph editor.

The PolyLex program was adapted for Russian by Sebastian Nagel, who also developed a set of Perl programs that can be used to manipulate and visualize automatically generated graphs, along with other Unitex-related tools.

The Tokenize and Dico programs were substantially optimized by Alexis Neme.

A new graph of French sentences was created by Anne Dister (CENTAL), Nathalie Friburger and Denis Maurel (Université François-Rabelais de Tours, Laboratoire d'Informatique). The French proper noun dictionaries come from the Prolex project of the Université François-Rabelais. Today, the package contains two dictionaries: Toponyms and Countries&Capitals. More details are available on the Unitex page of the TLN website.

Korean was integrated by Hyun-Gue Huh.

The Dico program was modified by Alexis Neme in order to allow the use of graph dictionaries (simple and compound words).


Related works

This software is used by many people with different goals. Here are a few projects that use Unitex:



References:

A.W. Appel, G.J. Jacobson. 1988. The world's fastest Scrabble program, Comm. ACM 31(5), pp. 572-578 & 585.

Dister, Anne. 1998. Problématique des fins de phrase en traitement automatique du français. In À qui appartient la ponctuation ? Actes du colloque international et interdisciplinaire de Liège (13-15 mars 1997), pp. 437-447, Bruxelles : Duculot, Champs linguistiques.

Friburger, Nathalie; Dister, Anne; Maurel, Denis. 2000. Améliorer la reconnaissance automatique des fins de phrases. In Actes des troisièmes journées Intex, Liège, 13-14 juin 2000 (Anne Dister, ed.), Revue Informatique et Statistique dans les Sciences Humaines 36, n° 1-4, pp. 181-200.

Maurice Gross. 1989. The Use of Finite Automata in the Lexical Representation of Natural Language. In Electronic Dictionaries and Automata in Computational Linguistics, Lecture Notes in Computer Science 377, pp. 34-50, Berlin/New York: Springer. 

Maurice Gross. 1997. The Construction of Local Grammars. In E. Roche and Y. Schabes (eds.), Finite-State Language Processing, Cambridge, Mass./London, The MIT Press, pp. 329-352.

Cláudio L. Lucchesi, Tomasz Kowaltowski. 1993. Applications of finite automata representing large vocabularies. Software - Practice and Experience 23(1), pp. 15-30, Wiley & Sons.

Sébastien Paumier. 2000. Nouvelles méthodes pour la recherche d'expressions dans de grands corpus. In A. Dister (ed.), Actes des 3èmes Journées INTEX. Revue Informatique et Statistique dans les Sciences Humaines, 36ème année, n° 1 à 4.

Dominique Revuz. 1992. Minimization of acyclic deterministic automata in linear time. Theoretical Computer Science 92(1), pp. 181-189.

Emmanuel Roche. 1992. Text disambiguation by finite-state automata: an algorithm and experiments on corpora. In COLING-92. Proceedings of the Conference, Nantes. 

Emmanuel Roche. 1997. Parsing with finite state transducers. In E. Roche and Y. Schabes (eds.), Finite-State Language Processing, Cambridge, Mass./London, The MIT Press, pp. 241-281.

Max Silberztein. 1989. The lexical analysis of French. In Electronic Dictionaries and Automata in Computational Linguistics, Lecture Notes in Computer Science 377, Berlin/New York: Springer.

Max Silberztein. 1991. A new approach to tagging: the use of a large-coverage electronic dictionary, Applied Computer Translation 1(4).

Max D. Silberztein. 1992. Finite state descriptions of various levels of linguistic phenomena, Language Research 28(4), Seoul National University, pp. 731-748.

Max D. Silberztein. 1993. Dictionnaires électroniques et analyse automatique de textes. Le système INTEX, Paris, Masson, 234 p. 

Max D. Silberztein. 1994. INTEX: a corpus processing system, in COLING 94 Proceedings, Kyoto, Japan.

Max D. Silberztein. 1997. The Lexical Analysis of Natural Languages, in Finite-State Language Processing, E. Roche and Y. Schabes (eds.), Cambridge, Mass./London, MIT Press, pp. 175-203.
 
 

University of Marne-la-Vallée|IGM | LADL


Last update April 7, 2007
Contact: unitex@univ-mlv.fr