
CasEN-Istex


Transducer Cascade CasEN for Named Entity Recognition

CasEN is made available on the Unitex platform as part of the projects ANR Variling, FEDER Région Centre Entités nommées et nommables, Ortolang and Istex.

The CasEN cascade recognises named entities by using lexical resources and local descriptions of patterns: transducers that act on the text by insertions, replacements or deletions. These actions can be iterative. They can be applied "on the fly" to a particular text, based on the results of previous transducers. The Unitex platform allows easy creation and maintenance of these transducers by presenting them to the user in the form of graphs. The aim of a cascade is to make use of the patterns already identified or, on the contrary, to avoid tagging a pattern already recognised. Thus, the order in which the transducers are applied is an important parameter.

The graphs in turn call subgraphs, which are either:

  • more specific graphs for which a separate pass in the cascade is not useful. For example, the graph amount.grf calls subgraphs recognising different measures (temperature, length, volume, etc.);
  • graphs that carry out annotations or transformations to be used thereafter; or
  • graphs containing:
    • lists of words employed in a particular context. These lists can subsequently be tagged;
    • regular expressions or Unitex masks for selecting words starting with a capital letter, morphological constraints, etc.

Graphs can also be constructed automatically, for the text under consideration, from generic graphs. These graphs make it possible to retrieve an entity without a local context, provided this entity has been identified elsewhere in the text by one of the previous graphs.
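The idea of retrieving an entity without local context can be illustrated outside Unitex. The following Python sketch is only an analogy, not the CasEN implementation: the function name propagate_entities and the simplified {form,.tags} notation are assumptions made for the illustration. It tags every untagged occurrence of a name that an earlier pass has already recognised.

```python
import re

def propagate_entities(text: str, known_names: set[str]) -> str:
    """Tag untagged occurrences of names already recognised elsewhere.

    Hypothetical helper: the {form,.tags} notation is a simplification
    of the tagged output, not the exact Unitex format.
    """
    if not known_names:
        return text
    # Longest names first, so a full name wins over one of its parts.
    ordered = sorted(known_names, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b")
    # Splitting on a capturing group alternates plain text (even indices)
    # and existing {...} annotations (odd indices), which stay untouched.
    parts = re.split(r"(\{[^{}]*\})", text)
    for i in range(0, len(parts), 2):
        parts[i] = pattern.sub(
            lambda m: "{" + m.group(0) + ",.entity+pers}", parts[i]
        )
    return "".join(parts)

tagged = "Prince {John,.entity+pers} left. Later, John returned."
print(propagate_entities(tagged, {"John"}))
```

Here the second occurrence of John has no title or other local clue, yet it is tagged because the name was identified earlier in the same text.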


An Example of Tagging with CasEN

The sentence:

Prince John, in the meanwhile, occupied his castle, and disposed of his domains without scruple;
extracted from the corpus distributed by Unitex (Ivanhoe, by Sir Walter Scott) is transformed by:

  • the graph persProfession, which identifies professions;
  • the graph tagLastName, which identifies last names;

to give (file ivanhoe_snt.raw):

Prince {\{John\,\.name\+last\+grftagLastName\},.entity+pers+ind+grfpersProfession}, in the meanwhile, occupied his castle, and disposed of his domains without scruple;
This format enables the display of concordances, but is hard to read. Therefore, another output file is available in XML-CasSys format (file ivanhoe_snt.txt). For this example, it contains:
Prince
<csc>
   <form>
     <csc>
        John
        <code>name</code>
        <code>last</code>
        <code>grftagLastName</code>
     </csc></form>
   <code>entity</code>
   <code>pers</code>
   <code>ind</code>
   <code>grfpersProfession</code>
</csc>
, in the meanwhile, occupied his castle, and disposed of his domains without scruple;
A recognised sequence is, on the one hand, tagged and, on the other hand, frozen as a polylexical expression. This annotation can later be searched in Unitex with more or less specific masks. For example, from the graph above, , or . To ease debugging, the name of the graph that inserted the annotation is added, prefixed by grf (here, grfpersProfession).
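The nested XML-CasSys fragment shown above can also be inspected programmatically. The following Python sketch uses only the standard library; the function name describe and the returned dictionary layout are assumptions made for the illustration. It treats any code starting with grf as the name of the inserting graph, as described above. A complete output file mixes plain text with csc elements, so it would first need to be wrapped in a root element before parsing.

```python
import xml.etree.ElementTree as ET

# The annotation from the example above, without the surrounding sentence.
FRAGMENT = """<csc>
   <form>
     <csc>
        John
        <code>name</code>
        <code>last</code>
        <code>grftagLastName</code>
     </csc></form>
   <code>entity</code>
   <code>pers</code>
   <code>ind</code>
   <code>grfpersProfession</code>
</csc>"""

def describe(csc: ET.Element) -> dict:
    """Summarise one <csc> element: surface form, tags, inserting graph."""
    codes = [(c.text or "") for c in csc.findall("code")]  # direct children
    form = csc.find("form")
    # The surface form is either inside <form> or directly inside <csc>.
    source = form if form is not None else csc
    nested = [describe(child) for child in source.findall("csc")]
    text = (source.text or "").strip()
    return {
        "text": text or " ".join(n["text"] for n in nested),
        "tags": [c for c in codes if not c.startswith("grf")],
        "graph": next((c[3:] for c in codes if c.startswith("grf")), None),
        "nested": nested,
    }

info = describe(ET.fromstring(FRAGMENT))
print(info["text"], info["tags"], info["graph"])
print(info["nested"][0]["tags"], info["nested"][0]["graph"])
```

Separating the grf code from the other codes makes it easy to see which graph produced each level of the annotation.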

If the XML-CasSys output does not correspond to the desired annotation (which is generally the case), the file _csc.txt can be opened in Unitex and processed with a second cascade. CasEN is therefore composed of two cascades, one for analysis and one for synthesis. For our example, with the Istex synthesis cascade, the result of the second cascade is:

Prince John, in the meanwhile, occupied his castle, and disposed of his domains without scruple;
 


The Order of Graphs

Within the cascade, some graphs recognise sequences that overlap with those recognised by other graphs. For example, the sentence:

He arrived on 29 February 2008.
can be analysed by two graphs of CasEN:

  • timeAbsoluteCalendarDateYear recognises the whole sequence 29 February 2008;
  • timeCalendarDate recognises the start of the sequence 29 February.

The graph timeAbsoluteCalendarDateYear must be applied before timeCalendarDate, so that the longer sequence is tagged first.

Sometimes the issue is not overlap but dependency: one graph relies on the tags inserted by another. The simplest example is undoubtedly the graph of postal addresses, which contains person patterns (to identify Franklin D. Roosevelt Street): the graphs of persons are thus placed before the graph of addresses. Various organisation names also include tags of type person, such as Rockefeller Center or Lincoln Hospital. These organisations are therefore recognised after the graphs of persons. Hence, the order of graphs is important.
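The effect of ordering on the date example can be reproduced with a toy cascade. The following Python sketch is an illustration only: regular expressions stand in for the two CasEN graphs, whose real definitions are Unitex transducers, and the function run_cascade, the {form,.graph} notation and the rule that tagged spans are never re-tagged are assumptions of the sketch.

```python
import re

MONTH = (r"(?:January|February|March|April|May|June|July|"
         r"August|September|October|November|December)")

# Toy stand-ins for the two graphs named above (not the real transducers).
GRAPHS = {
    "timeAbsoluteCalendarDateYear": re.compile(rf"\b\d{{1,2}} {MONTH} \d{{4}}\b"),
    "timeCalendarDate": re.compile(rf"\b\d{{1,2}} {MONTH}\b"),
}

def run_cascade(text: str, order: list[str]) -> str:
    """Apply each graph in turn, leaving already tagged {...} spans frozen."""
    for name in order:
        # Even indices hold untagged text, odd indices existing annotations.
        parts = re.split(r"(\{[^{}]*\})", text)
        for i in range(0, len(parts), 2):
            parts[i] = GRAPHS[name].sub(
                lambda m: "{" + m.group(0) + ",." + name + "}", parts[i]
            )
        text = "".join(parts)
    return text

sentence = "He arrived on 29 February 2008."
print(run_cascade(sentence, ["timeAbsoluteCalendarDateYear", "timeCalendarDate"]))
print(run_cascade(sentence, ["timeCalendarDate", "timeAbsoluteCalendarDateYear"]))
```

With the first order the whole date 29 February 2008 is tagged; with the reversed order only 29 February is tagged and the year is left outside the annotation.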


CasEN_Istex

Under the Istex project, CasEN is supplemented for French by named entity recognition in scientific texts (as explained below). However, this project also (and mainly) deals with texts written in English. This led to the creation of a new cascade for this corpus, which can consequently be used for other English corpora.

The annotations of the analysis cascade are borrowed from the TEI, and the synthesis cascade follows the Istex annotation guide. These two cascades are available below. Their evaluation, carried out in parallel with that of the French version, will be available soon.

For any remarks or bug reports, please write to casen At univ-tours Dot fr.


Download the scripts CasEN_Istex

Ensure that Unitex is up to date: Unitex 3.2 or a later version is required.

To download CasEN_Istex, you must accept the terms of the LGPL-LR license.

The download below contains:

  • Four scripts to parse files from the folder InputFolder to the folder AnnotationOutputFolder or the folder StandoffOutputFolder.
  • The PackageCassys.lingpkg.zip file for the Istex project.


Annotation guide of Named Entities Istex Project

To download the PDF file, you must accept the Creative Commons CC-BY license.

Download the Annotation guide of Named Entities Istex Project PDF file (version of September 17, 2019).


How to Cite Us

Friburger N., Maurel D. (2004), Finite-state transducer cascade to extract named entities in texts, Theoretical Computer Science, vol. 313, 94-104.

Maurel D., Friburger N., Antoine J.-Y., Eshkol-Taravella I., Nouvel D. (2011), Cascades autour de la reconnaissance des entités nommées, TAL 52-1.