Lexical Resources
Wim Peters
NLP group
Department of Computer Science
University of Sheffield
w.peters@dcs.shef.ac.uk

Introduction

Representation format of linguistic resources

Standardization of lexical description

Metadata for lexical description

Comparison of resources using metadata

Metadata for lexical description

 

The information contained in linguistic categories, and the structural relations between those categories, can best be described by means of metadata, i.e. information about the types of linguistic information available; this gives the description generality and conceptual clarity. In fact, all information contained in lexicons and related resources is metadata, but the resources differ in terminology, level of granularity of linguistic description and data format (see section 2.3). This is exemplified by the standardization efforts described above (e.g. the term ptOfSpeechDCS is OLIF-specific and may take different values from those used in e.g. EAGLES) and by the comparison of resources in section 5 below. Metadata sets are being proposed by initiatives such as OLIF and ISLE. Their function is to make it possible to describe and access resources in a standard fashion. The ISLE consortium (1) has issued a draft proposal (2) that divides lexical metadata into two main groups: external (information about the lexicon as a whole) and internal (information about the lexical entry).

4.1 External metadata

External units of information describe the lexicon as an object and can be the following (see the draft proposal for a full list):

Name                    A short name which identifies the lexicon
Title                   A more elaborate title of the lexicon
Date                    Date of creation and of major modifications
Version                 Version indication
Creator                 The responsible persons who created the resource
  Name                  Name of creators
  Contact               block of features related to contact person or organization (see below)
  Description           A suitable description associated with the set of creators
Project                 A block to describe the project
  Name                  Short name of the project
  ID                    Unique project identifier
  Contact               contact address sub-schema
  Description           some space for descriptions to be associated with the project
LexiconType             Type following some taxonomy (see e.g. list in Section 1)
Object Languages        A block to describe the languages included in the lexicon
  Description           some space for a prose description
  MultilingualityType   languages can occur in different flavors in lexica: they can occur as multilingual entries in ML lexica, but they can also occur as translations of, for example, sense descriptions; this difference can be indicated with the help of a controlled vocabulary
  Language              a list of languages included, each language being described in a substructure
Format                  a rough indication of the format the lexicon is in, such as relational table, structured plain text, some XML format, HTML format, ...
AccessTool              many lexicons are only interpretable via concrete access tools such as Shoebox, ORACLE, FoxPro, Access, Web-Browser, ...
Media                   this entry tells whether the lexicon includes audio or video samples or graphics
Character Encoding      this list should give an impression of the type of fonts needed to render all data included, such as UTF-8, ISO-latin
Size                    the size of the lexicon in bytes
No Lexical Entries      the number of lexical entries the lexicon includes
Access                  sub-schema where access info is given (see below)
Keys                    a possibility to add feature/value pairs to define new keywords
Source                  this entry describes which sources were used to build the lexicon
References              block to cover references to publications etc.
     
Access
  ResourceLink          URL pointing to the resource if it is directly accessible
  Availability          codification of terms of access (has to be worked out)
  Description           prose description associated with access
  Date                  date of statements about access
  Owner                 defines the owner of the lexicon
  Publisher             defines the publisher of the lexicon
  Contact               specifies a sub-schema describing whom to contact

Contact
  Name                  name of the contact person
  Address               address info
  Email                 email address
  Organization          name of an institution

Language
  Language ID           formal language specifier from ISO or SIL lists
  Name                  general name of the language
  Description           a description of the language can be associated
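
To make the nesting of the external elements concrete, the following is a minimal sketch of how a subset of them could be held as Python dataclasses. The field names follow the element names listed above; the class names, the Python types and all example values (including the lexicon "DemoLex") are assumptions made for illustration and are not part of the ISLE draft proposal.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Contact:
    """Contact sub-schema: whom to contact."""
    name: str = ""
    address: str = ""
    email: str = ""
    organization: str = ""


@dataclass
class Language:
    """Language sub-schema: one object language of the lexicon."""
    language_id: str          # formal specifier from ISO or SIL lists
    name: str
    description: str = ""


@dataclass
class Access:
    """Access sub-schema."""
    resource_link: str = ""
    availability: str = ""
    description: str = ""
    date: str = ""
    owner: str = ""
    publisher: str = ""
    contact: Optional[Contact] = None


@dataclass
class ExternalMetadata:
    """A subset of the external elements describing a lexicon as an object."""
    name: str
    title: str = ""
    version: str = ""
    lexicon_type: str = ""
    languages: List[Language] = field(default_factory=list)
    format: str = ""
    character_encoding: str = ""
    no_lexical_entries: int = 0
    access: Access = field(default_factory=Access)


# Placeholder values for illustration only; this is not a real resource.
example = ExternalMetadata(
    name="DemoLex",
    title="A hypothetical demonstration lexicon",
    version="0.1",
    languages=[Language(language_id="nld", name="Dutch")],
    format="some XML format",
    character_encoding="UTF-8",
    no_lexical_entries=3,
)

# asdict() turns the nested record into plain dictionaries and lists,
# which can then be serialized in whatever exchange format is preferred.
record = asdict(example)
```

Keeping Contact, Access and Language as separate types mirrors the sub-schemas above, so the same Contact structure can be reused wherever the element list refers to it.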

4.2 Internal metadata

This type of data gives us information about the linguistic content of the lexicons. The following linguistic units of description have been distinguished within the ISLE lexical metadata initiative. This list is not meant to be exhaustive.

Modality                      indicates which mode of communication is captured in the lexicon. Possible values are:
  Spoken
  Written
  Sign
Headword type                 indication of the linguistic nature of the entry in the lexicon. Possible values are:
  Sentence
  Phrase
  Wordform
  Lemma                       entry conforming to the unmarked wordform (e.g. infinitive for verbs)
  Abstract Lemma              entry not conforming to any wordform of the group subsumed by the lemma
  Stem
  Affix
Orthography                   possible values are:
  Hyphenated Spelling
  Syllabified Spelling
  Spelling Variants           orthographic variations with or without preferred spelling information
  Citations
Morphology                    possible values are:
  Stem                        deep or surface stem
  Stem Allomorphy             variations at stem level
  Segmentation                analysis into morphological constituents such as affixes
  Production rules            governing the production of surface forms on the basis of stems
  Typology                    any classification of entries or morphological entities
Morphosyntax                  possible values are:
  Part of Speech              syntactic class of the entry
  Inflection                  any inflectional or conjugational information
  Countability                pluralization properties
  Gradability                 e.g. adjectival comparative/superlative constructions
  Gender                      e.g. neuter
  Typology                    any classification of entries
Syntax                        possible values are:
  Complementation             syntactic complementation
  Alternation                 alternative complementation patterns
  Modification                e.g. adjectival modification patterns
  Shallow Parsing             segmentation into chunks
  Deep Parsing                finer-grained analysis below chunk level
  Functional Parsing          syntactic functions such as subject
  Collocations                significant juxtaposed entries/wordforms
  Typology                    any classification, e.g. prepositional/phrasal verb
Phonology                     possible values are:
  Transcription               any type of phonetic/phonological transcription
  IPA Transcription           transcription in International Phonetic Alphabet
  CV pattern                  transcription in terms of consonant-vowel combinations
  Constituent Structure       segmentation into phonetic constituents
  Intonation                  stress marking, constituent length etc.
Semantics                     possible values are:
  Sense distinction           polysemy and/or homonymy
  Ontological classification  related concepts and conceptual relations
  Gloss                       informal description of the sense in natural language
  Definition                  formal description of the sense, e.g. as a first-order logic formula
  Connotation                 non-denotational information such as pejorative
  Idiom                       idiosyncratic use
  Componential Features       formula or list containing a finite set of meaning attributes
  Cross-references            links to other entries/wordforms
  Semantic relations          relations between entries or associated concepts
  Preference                  characterization of the arguments in the semantic predicate
Etymology                     information about the historical context (morphological, phonological, syntactic, semantic) of a lexical entry or wordform
Usage                         pragmatic/sociolinguistic information; possible values are:
  Region                      e.g. dialect
  Style                       e.g. slang
Frequency                     corpus-derived frequency of occurrence
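
Because the internal elements are a set of categories with possible values, they can serve as a shared vocabulary for describing and comparing the content of lexicons. The sketch below illustrates this for a subset of the categories listed above. Treating the value lists as closed is an assumption made only for the example (the list is explicitly not exhaustive), and the two lexicon descriptions, the function names and the deliberately non-standard value "POS" are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Subset of the categories and values listed above, treated here as a
# closed vocabulary purely for illustration.
VOCABULARY: Dict[str, Set[str]] = {
    "Modality": {"Spoken", "Written", "Sign"},
    "Headword type": {"Sentence", "Phrase", "Wordform", "Lemma",
                      "Abstract Lemma", "Stem", "Affix"},
    "Morphosyntax": {"Part of Speech", "Inflection", "Countability",
                     "Gradability", "Gender", "Typology"},
    "Semantics": {"Sense distinction", "Ontological classification", "Gloss",
                  "Definition", "Connotation", "Idiom",
                  "Componential Features", "Cross-references",
                  "Semantic relations", "Preference"},
}

Description = Dict[str, Set[str]]


def unknown_terms(description: Description) -> List[Tuple[str, str]]:
    """Return the (category, value) pairs that are not in VOCABULARY."""
    problems = []
    for category, values in description.items():
        allowed = VOCABULARY.get(category, set())
        for value in sorted(values):
            if value not in allowed:
                problems.append((category, value))
    return problems


def shared_coverage(a: Description, b: Description) -> Description:
    """Per category, the values that both descriptions declare."""
    return {
        category: a[category] & b[category]
        for category in a.keys() & b.keys()
        if a[category] & b[category]
    }


# Two hypothetical lexicon descriptions.
lexicon_a: Description = {
    "Modality": {"Written"},
    "Headword type": {"Lemma"},
    "Morphosyntax": {"Part of Speech", "Gender"},
    "Semantics": {"Gloss"},
}
lexicon_b: Description = {
    "Modality": {"Written", "Spoken"},
    "Headword type": {"Wordform"},
    "Morphosyntax": {"Part of Speech", "POS"},  # "POS" is resource-specific
}

issues = unknown_terms(lexicon_b)
overlap = shared_coverage(lexicon_a, lexicon_b)
```

Here `issues` flags the resource-specific label "POS", which would first have to be mapped onto the shared term "Part of Speech", and `overlap` lists only the information types that both lexicons declare.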

A parallel initiative within ISLE, the EAGLES/ISLE Working Group for Multilingual Lexicons, aims at the standardization of multilingual lexical entries. For this purpose it has created a checklist that overlaps considerably with the metadata set listed above, but is in many cases more fine-grained in its coverage. The abbreviations used are: SL = source language; TL = target language; IPA = International Phonetic Alphabet (3).

 
Entry component | Information content | Mode | Function
1 headword | lexical form(s) of the headword: how the headword is spelt | SL | helps both SL and TL users find the information they are looking for
2 Phonetic transcription | how the headword (or variant form etc.) is pronounced (in International Phonetic Alphabet) | IPA | helps user pronounce the word correctly
3 variant form | alternative spelling of headword or slight variation in the form of this word | SL | helps both types of user find the information they are looking for
4 inflected form | other grammatical forms of the lemma (headword) | SL | helps decoding user find the information they are looking for; helps encoding user use the word correctly
5 Cross-reference | indication of another headword whose entry holds relevant information, or some other part of the dictionary where this may be found | code | helps both types of user find the information they are looking for, or other useful information
6 Morphosyntactic information
  a Part-of-speech marker | part of speech of the headword (or the secondary headword) | code | helps both types of user find the information they are looking for, by focussing the search
  b Inflectional class | Inflectional paradigm of the entry | code | helps SL user use TL item correctly; helps TL user disambiguate TL word; helps TL user use SL item correctly; helps SL user disambiguate SL word
  c Derivation | Cross-part-of-speech information, morphologically derived forms | SL | helps SL user identify the sense of the headword or other SL item; helps TL user identify the sense of a TL equivalent
  d Gender | Information about the gender of the entry in SL and TL | code | helps SL user identify the sense of the headword or other SL item; helps TL user identify the sense of a TL equivalent
  e Number | Information about the grammatical number of the entry in SL and TL | code | helps SL user identify the sense of the headword or other SL item; helps TL user identify the sense of a TL equivalent
  f Mass vs. Count | Information on whether a noun is mass or count, in SL and TL | code | helps SL user identify the sense of the headword or other SL item; helps TL user identify the sense of a TL equivalent
  g Gradation | For adverbs and adjectives | code | helps SL user use TL item correctly; helps TL user disambiguate TL word
7 Subdivision counter | indicates the start of new section or subsection ('sense') | number / letter | 'signpost' helping user to find their way about the entry more efficiently
8 Entry subdivision | separate section or subsection in entry (often called dictionary sense) | Dictionary text | breaks up entry, making it easier to read and find what is being sought
9 Sense indicator | synonym or paraphrase of headword in this sense, or other brief sense clue indicating specific sense of SL or TL item | SL | helps SL user identify the sense of the headword or other SL item; helps TL user identify the sense of a TL equivalent
10 linguistic label | the style, register, regional variety, etc. of the SL or TL item | code | helps SL user identify the sense of the headword; helps both users translate; helps TL user understand
11 Syntactic information
  a Subcategorization frame | (i.) Number and types of complements (ii.) syntactic introducer of a complement (e.g. preposition, case, etc.) (iii.) type of syntactic representation (e.g. constituents, functional, etc.) etc. | code | helps SL user identify the sense of the headword or other SL item; helps TL user identify the sense of a TL equivalent
  b Obligatority of complements | Information on whether a certain complement is obligatory or not | code | helps SL user identify the sense of the headword or other SL item; helps TL user identify the sense of a TL equivalent
  c Auxiliary | Which type of auxiliary is selected by a given predicate (in certain languages auxiliary selection is related to issues like unaccusativity, which in turn lie at the interface between lexicon and syntax) | code | acts as a sense indicator; helps SL user select appropriate TL equivalent
  d Light or support construction | Constructions with light verbs | SL or TL | helps SL user identify the sense of the headword or other SL item; helps TL user identify the sense of a TL equivalent
  e Periphrastic constructions | Constructions containing periphrasis, usage, semantic value, etc. | SL or TL | helps SL user identify the sense of the headword or other SL item; helps TL user identify the sense of a TL equivalent
  f Phrasal verbs | Particular representation of phrasal constructions | SL or TL | helps SL user identify the sense of the headword or other SL item; helps TL user identify the sense of a TL equivalent
  g Collocator | (i.) typical subject/object of verb, noun modified by adjective etc. (ii.) type of collocation relation represented etc. | SL or TL | acts as a sense indicator; helps SL user select appropriate TL equivalent; helps TL user translate or understand the SL item
  h Alternations | Syntactic alternations an entry can enter into | code | acts as a sense indicator
12 Semantic information
  a Semantic type | Reference to an ontology of types which are used to classify word senses | code | helps SL user identify the sense of the headword or other SL item; helps TL user identify the sense of a TL equivalent
  b Argument structure | Argument frames, plus semantic information identifying the type of the arguments, selectional constraints, etc. | code | helps SL user identify the sense of the headword or other SL item; helps TL user identify the sense of a TL equivalent
  c Semantic relations | Different types of relations (e.g. synonymy, antonymy, meronymy, hyperonymy, Qualia Roles, etc.) between word senses, etc. | code | acts as SL sense indicator for SL user; acts as TL sense indicator for TL user
  d Regular polysemy | Representation of regular polysemous alternations | code | helps SL user identify the sense of the headword or other SL item; helps TL user identify the sense of a TL equivalent
  e Domain | Information concerning the terminological domain to which a given sense belongs | code | helps SL user identify the sense of the headword or other SL item; helps TL user identify the sense of a TL equivalent
  f Decomposition | Representation of relevant meaning component, e.g. causativity, agentivity, motion, etc. | code | acts as SL sense indicator for SL user; acts as TL sense indicator for TL user
13 Translation | TL equivalent of SL item | TL | helps TL user understand; helps both users translate
14 Gloss | TL explanation of meaning of an SL item which has no direct equivalent in the TL | TL | helps TL user understand; helps both users translate
15 near-equivalent | TL item corresponding to an SL item which has no direct equivalent in the TL | TL | helps TL user understand; helps both users translate
16 example phrase (straightforward) | a phrase or sentence illustrating the non-idiomatic use of the headword, in a context where the TL equivalent is virtually a word-to-word translation | SL | acts as SL sense indicator for SL user; acts as TL sense indicator for TL user; helps TL & SL users to use the foreign-language item correctly
17 example phrase (problematic) | a phrase or sentence illustrating a non-idiomatic use of headword in a context where a specific TL equivalent is required (i.e. an SL example which is easily understandable for the TL speaker, but presents translation problems for the SL speaker) | SL | helps SL user avoid a translating error; acts as a sense indicator for SL user; helps TL user subsequently to use the SL item correctly
18 multiword unit (idiomatic) | multiword expression (MWE) containing the headword (the term MWE covers idioms, fixed & semi-fixed collocations, compounds etc.) | SL | helps both users translate
19 Subheadword (also secondary headword) | lemma morphologically related to the headword, figuring as head of a sub-entry (subheadwords can be compounds, phrasal verbs, etc.) | SL | saves space; helps both types of user find the information they are looking for
20 usage note | how the headword is used; 'macro' information which cannot appear at every appropriate entry; warning of cultural differences between the two languages; etc. | SL or TL | helps both types of user to avoid misunderstandings about the foreign language item, based on own-language knowledge
21 Frequency | Information about the frequency of the entry | code | helps both users translate
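
Each row of the checklist pairs an entry component with its information content, mode and function, so the checklist can be queried like a small table. The following sketch holds a handful of rows, copied from the checklist, as records and filters them by mode and by function. The record type, the function names and the choice of rows are assumptions made for illustration; the information content column is left out for brevity.

```python
from typing import List, NamedTuple


class ChecklistRow(NamedTuple):
    number: str            # row label in the checklist, e.g. "6d"
    component: str         # entry component
    mode: str              # SL, TL, IPA, code, ...
    functions: List[str]   # one string per listed function


# A few rows taken from the checklist above.
ROWS = [
    ChecklistRow("1", "headword", "SL",
                 ["helps both SL and TL users find the information "
                  "they are looking for"]),
    ChecklistRow("2", "Phonetic transcription", "IPA",
                 ["helps user pronounce the word correctly"]),
    ChecklistRow("6d", "Gender", "code",
                 ["helps SL user identify the sense of the headword "
                  "or other SL item",
                  "helps TL user identify the sense of a TL equivalent"]),
    ChecklistRow("13", "Translation", "TL",
                 ["helps TL user understand",
                  "helps both users translate"]),
    ChecklistRow("21", "Frequency", "code",
                 ["helps both users translate"]),
]


def components_by_mode(rows: List[ChecklistRow], mode: str) -> List[str]:
    """Entry components whose information is expressed in the given mode."""
    return [row.component for row in rows if row.mode == mode]


def components_serving(rows: List[ChecklistRow], phrase: str) -> List[str]:
    """Entry components with at least one function mentioning the phrase."""
    return [row.component for row in rows
            if any(phrase in function for function in row.functions)]


coded = components_by_mode(ROWS, "code")
for_translation = components_serving(ROWS, "translate")
```

With the five sample rows, `coded` contains Gender and Frequency, and `for_translation` contains Translation and Frequency.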



  1. http://www.mpi.nl/world/ISLE/
  2. Gibbon, D., Peters, W., Wittenburg, P. (December 2001). Metadata Elements for Lexicon Descriptions, Version 1.0. MPI Nijmegen.
  3. http://www2.arts.gla.ac.uk/IPA/ipa.html