Lexical Resources
Wim Peters
NLP group
Department of Computer Science
University of Sheffield
w.peters@dcs.shef.ac.uk
Standardization of lexical description
Anyone who wants to describe language resources, create new ones, or use them efficiently needs an appropriate methodology, suitable software tools, and standards for methodology, mark-up, interchange, exploitation and evaluation.
Much work has already been carried out on standardizing the description and creation of lexicons, especially to facilitate language engineering applications. While TEI (1) does not make detailed proposals for lexical tag sets, it does describe the structure of a dictionary entry in detail. Standardization efforts such as EAGLES (2) and ISLE (3) worked out concrete proposals for standard lexical structures. GENELEX (4) can be seen as an early attempt to define a generic lexicon model, with a complicated but exhaustive descriptive structure. The PAROLE and SIMPLE (5) projects attempted to encode multilingual lexicons in a uniform way, resulting in 12 fairly small example lexicons. MULTILEX (6) was another project, focusing on the implementation of 15 concrete lexicons using a structure derived from the GENELEX model. The MILE (Multilingual Computational Lexicon) project [8], recently started within ISLE, has the task of standardizing multilingual lexicons.
Partly within the area of terminology, other relevant work was undertaken by the OLIF2 consortium (Open Lexicon Interchange Format) (7), resulting in the OLIF2 proposal. OLIF2 defines a large number of lexical features, but does not make statements about their structural embedding. Each OLIF2 entry is a monolingual entry containing various feature/value pairs, cross-references between entries in the same language lexicon, and transfers defining bilingual transfer relations. The OLIF2 proposal groups its features into four main categories: administrative, morphological, syntactic and semantic. The features are similar to those found in other, more generic lexicon proposals. Below are a few examples with their descriptions:
The ptOfSpeechDCS element (DCS is short for data category specification) holds data about a user-extended scheme for describing the part-of-speech of OLIF entries. Users can for example describe their additional part-of-speech tags by means of a URL or by means of CDATA sections.
The subjField element classifies the knowledge domain to which the lexical/terminological entry is assigned. Example values: agriculture, aviation.
The subjFieldDCS element holds data about a user-extended scheme for describing the subject field information of OLIF entries (see the comment for the ptOfSpeechDCS element for more information).
The syllabification element holds data about the syllable boundaries within the entry string. Example use: do-cu-men-ta-ry, li-be-ra-li-ty.
The syllabificationMarkInfo element holds data about editorial practice adopted with respect to syllabification in the original. Example use: we use '*' as marker.
The synFrame element classifies the syntactic frame for the entry string (subcategorisation). Example values: subj-imps-opt, dobj-opt.
The synFrameDCS element holds data about a user-extended scheme for describing the syntactic frames of OLIF entries (see the comment for the ptOfSpeechDCS element for more information).
The synPosition element classifies the unmarked positioning of the entry string syntactically. Example values: prenoun, cl-init.
The synStruct element holds data about the constituent structure of a multiword entry string (note the possibilities provided for single words by means of the morphStruct element). Example use: [[adj][noun]] (General Ledger).
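To make the shape of such an entry more concrete, the following is a minimal sketch of how a monolingual entry with feature/value pairs, cross-references and transfers might be held in memory. It is not the OLIF2 schema: the class OlifLikeEntry, its field names, the helper functions and the German transfer value are assumptions made for illustration. Only the feature names (synStruct, syllabification) and the example strings come from the descriptions above.

```python
import re
from dataclasses import dataclass, field


@dataclass
class OlifLikeEntry:
    """Hypothetical in-memory view of one monolingual entry (not the OLIF2 schema)."""
    entry_string: str
    language: str
    features: dict = field(default_factory=dict)    # feature/value pairs
    cross_refs: list = field(default_factory=list)  # entries in the same language lexicon
    transfers: list = field(default_factory=list)   # (target language, target string) pairs


def syllables(syllabification: str, mark: str = "-") -> list:
    """Split a syllabification value at the boundary marker in use."""
    return syllabification.split(mark)


def constituents(syn_struct: str) -> list:
    """Return the innermost bracketed labels of a synStruct-style value, in order."""
    return re.findall(r"\[([^\[\]]+)\]", syn_struct)


entry = OlifLikeEntry(
    entry_string="General Ledger",
    language="en",
    features={"synStruct": "[[adj][noun]]"},
    transfers=[("de", "Hauptbuch")],  # illustrative transfer, not taken from OLIF2
)

print(constituents(entry.features["synStruct"]))
print(syllables("do-cu-men-ta-ry"))
```

The mark parameter reflects the role of syllabificationMarkInfo: if the original uses '*' rather than '-' as its marker, the same function can be called with mark="*".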
Much work has been done in the area of terminology databases. The MARTIF (Machine-Readable Terminology Interchange Format) (8) work describes a format to facilitate the interchange of terminological data among terminology management systems. This work resulted in the ISO 12200 specifications (9). Complementary to that, ISO 12620 specifies how "Data Categories" (the basic elements for describing lexical content) have to be defined. Term-related information specifies the linguistic type of the terms. This is done by assigning linguistic attributes, such as part of speech, to the entries (cf. OLIF2 above). Descriptive information links the terms to domains and points to positions in concept hierarchies. Administrative and proprietary information, such as creator name and creation date, can also be added to each term.
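The three kinds of information attached to a term can be sketched as a small registry of declared data categories. This is only an illustration of the idea that every category used in an entry has to be defined beforehand. The category names, the registry DATA_CATEGORIES and the function group_by_kind are hypothetical and are not taken from ISO 12620 or MARTIF.

```python
from collections import defaultdict

# Hypothetical registry: data category name -> kind of information it carries.
DATA_CATEGORIES = {
    "partOfSpeech": "term-related",
    "subjectField": "descriptive",
    "broaderConcept": "descriptive",
    "creator": "administrative",
    "creationDate": "administrative",
}


def group_by_kind(term_info: dict) -> dict:
    """Group the data categories of one term by kind; reject undeclared ones."""
    grouped = defaultdict(dict)
    for category, value in term_info.items():
        if category not in DATA_CATEGORIES:
            raise KeyError(f"undeclared data category: {category}")
        grouped[DATA_CATEGORIES[category]][category] = value
    return dict(grouped)


# Illustrative record for a single term; all values are invented.
record = {
    "partOfSpeech": "noun",
    "subjectField": "aviation",
    "broaderConcept": "aircraft",
    "creator": "editor-1",
    "creationDate": "2001-05-14",
}

for kind, values in group_by_kind(record).items():
    print(kind, values)
```

Keeping the declarations separate from the entries means that an unknown category is reported when the data is read, rather than being silently carried along in an interchange file.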
The SALT project (Standards-based Access to Lexicon and Terminologies) (10) was recently initiated, driven mainly by the needs of language engineering. SALT proposes the XLT (XML representations of Lexicons and Terminologies) (11) family of formats for representing, manipulating, and sharing terminological data. The core structure of SALT is based on the MARTIF proposal.