UNITEX HOME PAGE
Unitex is a corpus processing system based on automata-oriented technology. The concept of this software was born at LADL (Laboratoire d'Automatique Documentaire et Linguistique), under the direction of Maurice Gross. With this tool, you can handle electronic resources such as electronic dictionaries and grammars and apply them to texts. You can work at the levels of morphology, the lexicon and syntax. The main functions are:
From this site, you can download Unitex and find information about linguistic data. Unitex is referenced in the PLUME project.
WARNING: versions 3.3.x of gcc contain a serious bug affecting the -O2 compilation option, which is used in the Unitex Makefile. To avoid this bug, just remove the -O2 option from the Makefile. Version 3.4 seems to be OK.
Last updates:
A multilingual platform
Unitex conforms to the Unicode 3.0 standard, which allows users to handle virtually all the characters of all languages, including Asian languages. The Unitex programs have been designed to work with all writing rules. There is no difficulty in working with Asian languages, in spite of their particular spacing conventions.
A multi-system software
The Unitex interface is written in Java and all other programs are written in C/C++. This allows Unitex to work on every system that supports Java 1.6 and that can compile C/C++ programs. Unitex has been tested successfully on Windows (95, 98, NT, 2000, XP, ME and Vista), Linux and Mac OS X, and now runs on Solaris 8 Sparc.
A free software
Versions 1.0 and 1.1 of Unitex are distributed free under the terms of the General Public License (GPL). This means that everyone can redistribute Unitex freely within the terms of the GPL. It also means that you have access to the source code of all the Unitex programs, which is included in the zip file you download. You can modify it freely and include it in any GPL-licensed program. Since
June 2004, version 1.2beta has been distributed free under the terms of the Lesser General Public License (LGPL), with the exception of the TRE library, which is GPL-licensed. This license is more permissive than the GPL, because it allows you to reuse Unitex's own code in non-free software.
Acknowledgments
Unitex is developed by Sébastien Paumier at the Institut Gaspard-Monge (IGM), University of Marne-la-Vallée (France). Unitex also benefited from years of research, experiments and publications by Maurice Gross (1989, 1997), Dominique Revuz (1992), Emmanuel Roche (1992, 1997), Max Silberztein (1989, 1991, 1992, 1993, 1994, 1997) and other authors. Unitex would have been useless without the linguistic data (dictionaries and grammars) constructed by the laboratories of the RELEX network. The locate pattern function was reused from earlier software known as AGLAE.
The adaptation for ancient Greek was made by Claude Devis (CENTAL), who also included new code pages (Windows & ISO) in the Asc2Uni and Uni2Asc conversion filters. Claude Devis also introduced morphological filters into Unitex, using an open-source regular expression library written by Ville Laurikari. The MergeTextAutomaton program was written by Olivier Blanc (IGM). The Portuguese version of the manual was translated by Alexis Neme and Oto Araújo Vale (Projeto Relex - Brasil). The integration of the ELAG program was done by Olivier Blanc. The text editor was written by Julien Decreton, who also developed the UNDO and REDO functions in the graph editor. The adaptation of the PolyLex program for Russian was made by Sebastian Nagel, who also developed a set of Perl programs that can be used to manipulate and visualize automatically generated graphs, and other material around Unitex. The Tokenize and Dico programs were substantially optimized by Alexis Neme. A new graph of French sentences was created by Anne Dister (CENTAL), Nathalie Friburger and Denis Maurel (Université François-Rabelais de Tours, Laboratoire d'Informatique). The French proper noun dictionaries come from the Prolex project of the Université François-Rabelais. Today, the package contains two dictionaries: Toponyms and Countries&Capitals. More details are available on the Unitex page of the TLN website.
Korean was integrated by Hyun-Gue Huh. The Dico program was modified by Alexis Neme in order to allow the use of graph dictionaries (simple and compound words).
Related works
This software is used by many people with different goals. Here are a few projects that use Unitex:
References:
A.W. Appel, G.J. Jacobson. 1988. The world's fastest Scrabble program. Comm. ACM 31(5), pp. 572-578 & 585.
Anne Dister. 1998. Problématique des fins de phrase en traitement automatique du français. In À qui appartient la ponctuation ? Actes du colloque international et interdisciplinaire de Liège (13-15 mars 1997), pp. 437-447, Bruxelles: Duculot, Champs linguistiques.
Nathalie Friburger, Anne Dister, Denis Maurel. 2000. Améliorer la reconnaissance automatique des fins de phrases. In Actes des troisièmes journées Intex, Liège, 13-14 juin 2000 (Anne Dister Ed.), Revue Informatique et Statistique dans les Sciences Humaines 36, n° 1-4, pp. 181-200.
Maurice Gross. 1989. The Use of Finite Automata in the Lexical Representation of Natural Language. In Electronic Dictionaries and Automata in Computational Linguistics, Lecture Notes in Computer Science 377, pp. 34-50, Berlin/New York: Springer.
Maurice Gross. 1997. The Construction of Local Grammars. In E. Roche and Y. Schabes (eds.), Finite-State Language Processing, Cambridge, Mass./London: The MIT Press, pp. 329-352.
Cláudio L. Lucchesi, Tomasz Kowaltowski. 1993. Applications of finite automata representing large vocabularies. Software - Practice and Experience 23(1), pp. 15-30, Wiley & Sons.
Sébastien Paumier. 2000. Nouvelles méthodes pour la recherche d'expressions dans de grands corpus. In A. Dister (ed.), Actes des 3èmes Journées INTEX. Revue Informatique et Statistique dans les Sciences Humaines, 36ème année, n° 1 à 4.
Dominique Revuz. 1992. Minimization of acyclic deterministic automata in linear time. Theoretical Comput. Sci., vol. 92, n° 1, pp. 181-189.
Emmanuel Roche. 1992. Text disambiguation by finite-state automata: an algorithm and experiments on corpora. In COLING-92. Proceedings of the Conference, Nantes.
Emmanuel Roche. 1997. Parsing with finite state transducers. In E. Roche and Y. Schabes (eds.), Finite-State Language Processing, Cambridge, Mass./London: The MIT Press, pp. 241-281.
Max Silberztein. 1989. The lexical analysis of French. In Electronic Dictionaries and Automata in Computational Linguistics, Lecture Notes in Computer Science 377, Berlin/New York: Springer.
Max Silberztein. 1991. A new approach to tagging: the use of a large-coverage electronic dictionary. Applied Computer Translation 1(4).
Max D. Silberztein. 1992. Finite state descriptions of various levels of linguistic phenomena. Language Research 28(4), Seoul National University, pp. 731-748.
Max D. Silberztein. 1993. Dictionnaires électroniques et analyse automatique de textes. Le système INTEX. Paris: Masson, 234 p.
Max D. Silberztein. 1994. INTEX: a corpus processing system. In COLING 94 Proceedings, Kyoto, Japan.
Max D. Silberztein. 1997. The Lexical Analysis of Natural Languages. In E. Roche and Y. Schabes (eds.), Finite-State Language Processing, Cambridge, Mass./London: The MIT Press, pp. 175-203.
University of Marne-la-Vallée | IGM | LADL