CasEN-Istex
Transducer Cascade CasEN for Named Entity Recognition
CasEN is made available on the Unitex platform as part of the projects ANR Variling, FEDER Région Centre Entités nommées et nommables, Ortolang and Istex.
The CasEN cascade recognises named entities using lexical resources and local pattern descriptions: transducers that act on the text through insertions, replacements or deletions. These actions may be iterative. They can also be applied "on the fly" to a particular text, based on the results of previous transducers. The Unitex platform makes these transducers easy to create and maintain by presenting them to the user as graphs. The aim of a cascade is to make use of patterns that have already been identified or, on the contrary, to avoid tagging a pattern that has already been recognised. The order in which the transducers are applied is therefore an important parameter.
The graphs in turn call subgraphs, which are one of the following:
- More specific graphs for which a separate pass in the cascade is not useful. For example, the graph amount.grf calls subgraphs recognising different measures (temperature, length, volume, etc.).
- Graphs that carry out annotations or transformations to be used later.
- Graphs containing:
  - lists of words employed in a particular context. These lists can subsequently be tagged.
  - regular expressions or Unitex masks that select words starting with a capital letter, apply morphological constraints, etc.
Graphs can also be constructed automatically from generic graphs for the text under consideration. These graphs make it possible to retrieve an entity that appears without a local context, provided this entity has been identified elsewhere in the text by one of the previous graphs.
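The idea of retrieving an entity without local context can be illustrated with a small Python sketch. It is not how Unitex builds its graphs: the `<pers>` tag and the helper names `collect_names` and `tag_known_entities` are hypothetical, and regular expressions stand in for the generic graphs. The sketch collects the names that an earlier pass has already tagged, then tags their remaining bare occurrences.

```python
import re

def collect_names(text):
    """Names already tagged by an earlier (hypothetical) pass."""
    return set(re.findall(r"<pers>(.*?)</pers>", text))

def tag_known_entities(text, known_names):
    """Tag bare occurrences of names that were identified elsewhere in the text."""
    if not known_names:
        return text
    # Longest names first, so a long name is not split by a shorter one.
    ordered = sorted(known_names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    # Group 1 matches spans that are already tagged; they are copied unchanged.
    pattern = re.compile(r"(<pers>.*?</pers>)|\b(" + alternatives + r")\b")

    def replace(match):
        if match.group(1):
            return match.group(1)
        return "<pers>" + match.group(2) + "</pers>"

    return pattern.sub(replace, text)

text = "Dr <pers>Smith</pers> arrived early. Later, Smith left."
result = tag_known_entities(text, collect_names(text))
print(result)
```

Here the second "Smith" has no title or other local clue, yet it is tagged because the first occurrence was recognised in context.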
An Example of Tagging with CasEN
The sentence:
- the graph persProfession which identifies professions;
- the graph tagLastName which identifies last names;
to give (file ivanhoe_snt.raw):
<csc>
<form>
<csc>
John
<code>name</code>
<code>last</code>
<code>grftagLastName</code>
</csc></form>
<code>entity</code>
<code>pers</code>
<code>ind</code>
<code>grfpersProfession</code>
</csc>
, in the meanwhile, occupied his castle, and disposed of his domains without scruple;
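The nesting of the annotation above can be read programmatically. The following Python sketch uses the standard `xml.etree.ElementTree` module on the fragment shown here; it assumes the fragment has been isolated as a single well-formed element, which may not hold for a complete output file. The helper names `surface` and `annotations` are hypothetical.

```python
import xml.etree.ElementTree as ET

# The annotated fragment from the example, with whitespace removed.
fragment = (
    "<csc><form><csc>John<code>name</code><code>last</code>"
    "<code>grftagLastName</code></csc></form>"
    "<code>entity</code><code>pers</code><code>ind</code>"
    "<code>grfpersProfession</code></csc>"
)

def surface(elem):
    """Concatenate the text of an element, skipping <code> content."""
    if elem.tag == "code":
        return ""
    pieces = [elem.text or ""]
    for child in elem:
        pieces.append(surface(child))
        pieces.append(child.tail or "")
    return "".join(pieces)

def annotations(root):
    """Return (surface text, codes) for every <csc>, outermost first."""
    found = []
    for csc in root.iter("csc"):
        codes = [code.text for code in csc.findall("code")]
        found.append((surface(csc).strip(), codes))
    return found

for text, codes in annotations(ET.fromstring(fragment)):
    print(text, codes)
```

Each `<csc>` element carries its own list of `<code>` values, the last of which names the graph that produced the annotation, so the nesting records which graph built on which.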
If the XML-CasSys output does not correspond to the desired annotation (which is generally the case), the file _csc.txt can be opened in Unitex and processed with a second cascade. CasEN is therefore composed of two cascades, one for analysis and one for synthesis. For our example, with the Istex synthesis version, the result of the second cascade is:
The Order of Graphs
The cascade itself contains blocks of affirmations which are possible to be retrieved... For example, the sentence:
- timeAbsoluteCalendarDateYear recognises the whole sequence 29 February 2008;
- timeCalendarDate recognises the start of the sequence 29 February.
One must apply the graph timeAbsoluteCalendarDateYear before the other.
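The effect of the order can be reproduced with a minimal Python sketch. It only imitates the behaviour described above: the two graphs are approximated by regular expressions, the bracket notation is a simplification, and `run_cascade` is a hypothetical helper, not part of Unitex or CasEN. Text that is already annotated is hidden from later graphs.

```python
import re

MONTHS = (r"(?:January|February|March|April|May|June|July|August"
          r"|September|October|November|December)")

# Each "graph" is approximated by a (name, pattern) pair.
DATE_YEAR = ("timeAbsoluteCalendarDateYear",
             re.compile(r"\b\d{1,2} " + MONTHS + r" \d{4}\b"))
DATE = ("timeCalendarDate",
        re.compile(r"\b\d{1,2} " + MONTHS + r"\b"))

def run_cascade(text, graphs):
    """Apply the graphs in order; spans that are already tagged are left alone."""
    for name, pattern in graphs:
        # With a capturing group, re.split puts existing annotations at odd
        # indices and untagged text at even indices.
        parts = re.split(r"(\{[^{}]*\})", text)
        for i in range(0, len(parts), 2):
            parts[i] = pattern.sub(
                lambda m: "{" + m.group(0) + ",." + name + "}", parts[i])
        text = "".join(parts)
    return text

sentence = "The meeting took place on 29 February 2008."
good = run_cascade(sentence, [DATE_YEAR, DATE])
bad = run_cascade(sentence, [DATE, DATE_YEAR])
print(good)
print(bad)
```

With the longer pattern first, the whole sequence 29 February 2008 is tagged. In the reverse order, timeCalendarDate consumes 29 February and the year is left outside the annotation.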
Sometimes the issue is not overlapping occurrences but dependency: one graph relies on what another has already tagged. The simplest example is undoubtedly the graph of postal addresses, which contains person patterns (to identify Franklin D. Roosevelt Street): the person graphs are therefore placed before the address graph. Various organisation names also contain tags of type person, such as Rockefeller Center or Lincoln Hospital. These organisations are therefore recognised after the person graphs have been applied. Hence, the order of graphs is important.
CasEN_Istex
Under the Istex project, CasEN is supplemented for French by named entity recognition in scientific texts (as explained below). However, this project also (and mainly) deals with texts written in English. This led to the creation of a new cascade for this corpus, which can consequently be used for other English corpora.
The annotations of the analysis cascade are borrowed from the TEI, and the synthesis cascade follows the Istex annotation guide. These two cascades are available below. Their evaluation, carried out in parallel with that of the French version, will be available soon.
If you have any remarks or find any bugs, please write to casen At univ-tours Dot fr.
Download the scripts CasEN_Istex
Ensure that Unitex is up to date: Unitex 3.2 or a later version is required.
To download CasEN_Istex, you must accept the terms of the LGPL-LR license.
The download below contains:
- Four scripts to parse files from the folder InputFolder to the folder AnnotationOutputFolder or the folder StandoffOutputFolder.
- The PackageCassys.lingpkg.zip file for the Istex project.
Click here: Download script_casen_Istex_2021_02_25.
Annotation guide of Named Entities Istex Project
To download the PDF file, you must accept the Creative Commons CC-BY license.
Click here: Download the Annotation guide of Named Entities Istex Project PDF file (version of September 17, 2019).
How to Cite Us
Friburger N., Maurel D. (2004), Finite-state transducer cascade to extract named entities in texts, Theoretical Computer Science, vol. 313, 94-104.
Maurel D., Friburger N., Antoine J.-Y., Eshkol-Taravella I., Nouvel D. (2011), Cascades autour de la reconnaissance des entités nommées, TAL 52-1.