Lexical Resources
Wim Peters
NLP group
Department of Computer Science
University of Sheffield
w.peters@dcs.shef.ac.uk
4 Metadata for lexical description
For reasons of generality and conceptual clarity, the information that linguistic categories contain, and the structural relations between them, are best described by means of metadata, i.e. information about the types of linguistic information available. In fact, all information contained in lexicons and related resources is metadata, but the resources differ in terminology, in the granularity of linguistic description and in data format (see section 2.3). This is exemplified by the standardization efforts described above (e.g. the term ptOfSpeechDCS is OLIF-specific and may take different values from its counterpart in e.g. EAGLES) and by the comparison of resources in section 5 below. Metadata sets are being proposed by initiatives such as OLIF and ISLE; their function is to describe and access resources in a standard fashion. The ISLE consortium (1) has issued a draft proposal (2) that divides lexical metadata into two main groups: external (information about the lexicon as a whole) and internal (information about the lexical entry).
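The terminology mismatch mentioned above can be illustrated with a small lookup that maps resource-specific field names onto a shared metadata category. This is only a sketch: `ptOfSpeechDCS` is the OLIF term cited in the text, while `ExampleLexicon`, its field name `pos` and the helper `to_common_category` are hypothetical names introduced for illustration.

```python
from typing import Optional

# (resource, field name) -> shared metadata category.
# Only "ptOfSpeechDCS" comes from the text; the second entry is hypothetical.
FIELD_TO_CATEGORY = {
    ("OLIF", "ptOfSpeechDCS"): "Part of Speech",
    ("ExampleLexicon", "pos"): "Part of Speech",
}


def to_common_category(resource: str, field_name: str) -> Optional[str]:
    """Return the shared category for a resource-specific field, or None if unmapped."""
    return FIELD_TO_CATEGORY.get((resource, field_name))
```

Two differently named fields then resolve to the same category, which is what allows resources to be described and compared in a standard fashion.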
4.1 External metadata
External units of information describe the lexicon as an object and include the following (see the draft proposal for a full list):
Name | a short name which identifies the lexicon
Title | a more elaborate title of the lexicon
Date | date of creation and of major modifications
Version | version indication
Creator | the persons responsible for creating the resource
  Name | name of creators
  Contact | block of features related to the contact person or organization (see below)
  Description | a suitable description associated with the set of creators
Project | a block to describe the project
  Name | short name of the project
  ID | unique project identifier
  Contact | contact address sub-schema
  Description | some space for descriptions to be associated with the project
LexiconType | type following some taxonomy (see e.g. the list in Section 1)
Object Languages | a block to describe the languages included in the lexicon
  Description | some space for a prose description
  MultilingualityType | languages can occur in a lexicon in different ways: as multilingual entries in multilingual (ML) lexica, or as translations of, for example, sense descriptions; the difference can be indicated with a controlled vocabulary
  Language | a list of the languages included, each language being described in a substructure
Format | a rough indication of the format the lexicon is in, such as relational table, structured plain text, some XML format, HTML format, ...
AccessTool | many lexicons are only interpretable via concrete access tools such as Shoebox, ORACLE, FoxPro, Access, Web-Browser, ...
Media | this entry tells whether the lexicon includes audio or video samples or graphics
Character Encoding | this list should give an impression of the encodings and fonts needed to render all the data included, e.g. UTF-8, ISO Latin
Size | the size of the lexicon in bytes
No Lexical Entries | the number of lexical entries the lexicon includes
Access | sub-schema where access information is given (see below)
Keys | a possibility to add feature/value pairs to define new keywords
Source | this entry describes which sources were used to build the lexicon
References | block to cover references to publications etc.
Access:
ResourceLink | URL pointing to the resource if it is directly accessible
Availability | codification of terms of access (has to be worked out)
Description | prose description associated with access
Date | date of statements about access
Owner | defines the owner of the lexicon
Publisher | defines the publisher of the lexicon
Contact | specifies a sub-schema describing whom to contact
Contact:
Name | name of the contact person
Address | address information
Email | email address
Organization | name of an institution
Language:
Language ID | formal language specifier from ISO or SIL lists
Name | general name of the language
Description | a description of the language can be associated
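The nested blocks above (Contact, Language, Access) map naturally onto a record type with sub-records. The following is a minimal sketch covering only a subset of the listed fields; the class and attribute names are illustrative choices and not part of the ISLE draft, and the example values are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Contact:
    name: str
    address: str = ""
    email: str = ""
    organization: str = ""


@dataclass
class Language:
    language_id: str  # formal specifier from ISO or SIL lists
    name: str
    description: str = ""


@dataclass
class Access:
    resource_link: Optional[str] = None  # URL, if directly accessible
    availability: str = ""
    owner: str = ""
    publisher: str = ""
    contact: Optional[Contact] = None


@dataclass
class ExternalMetadata:
    name: str
    title: str
    version: str = ""
    lexicon_type: str = ""
    languages: List[Language] = field(default_factory=list)
    format: str = ""
    character_encoding: str = ""
    no_lexical_entries: Optional[int] = None
    access: Optional[Access] = None

    def language_ids(self) -> List[str]:
        """Return the identifiers of all object languages, in order."""
        return [lang.language_id for lang in self.languages]


# Invented example record, for illustration only.
example = ExternalMetadata(
    name="demo-lex",
    title="Demonstration bilingual lexicon",
    languages=[Language("eng", "English"), Language("nld", "Dutch")],
    format="XML",
    character_encoding="UTF-8",
    access=Access(contact=Contact(name="A. Person")),
)
```

Keeping Contact as its own type reflects the fact that the same sub-schema is referenced from Creator, Project and Access.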
4.2 Internal metadata
This type of data gives us information about the linguistic content of the lexicons. The following linguistic units of description have been distinguished within the ISLE lexical metadata initiative. This list is not meant to be exhaustive.
Modality | indicates which mode of communication is captured in the lexicon. Possible values are: Spoken; Written; Sign
Headword type | indication of the linguistic nature of the entry in the lexicon. Possible values are: Sentence; Phrase; Wordform; Lemma (entry conforming to the unmarked wordform, e.g. infinitive for verbs); Abstract Lemma (entry not conforming to any wordform of the group subsumed by the lemma); Stem; Affix
Orthography | possible values are: Hyphenated Spelling; Syllabified Spelling; Spelling Variants (orthographic variations with or without preferred spelling information); Citations
Morphology | possible values are: Stem (deep or surface stem); Stem Allomorphy (variations at stem level); Segmentation (analysis into morphological constituents such as affixes); Production (rules governing the production of surface forms on the basis of stems); Typology (any classification of entries or morphological entities)
Morphosyntax | possible values are: Part of Speech (syntactic class of the entry); Inflection (any inflectional or conjugational information); Countability (pluralization properties); Gradability (e.g. adjectival comparative/superlative constructions); Gender (e.g. neuter); Typology (any classification of entries)
Syntax | possible values are: Complementation (syntactic complementation); Alternation (alternative complementation patterns); Modification (e.g. adjectival modification patterns); Shallow Parsing (segmentation into chunks); Deep Parsing (finer-grained analysis below chunk level); Functional Parsing (syntactic functions such as subject); Collocations (significant juxtaposed entries/wordforms); Typology (any classification, e.g. prepositional/phrasal verb)
Phonology | possible values are: Transcription (any type of phonetic/phonological transcription); IPA Transcription (transcription in the International Phonetic Alphabet); CV pattern (transcription in terms of consonant-vowel combinations); Constituent Structure (segmentation into phonetic constituents); Intonation (stress marking, constituent length etc.)
Semantics | possible values are: Sense distinction (polysemy and/or homonymy); Ontological classification (related concepts and conceptual relations); Gloss (informal description of the sense in natural language); Definition (formal description of the sense, e.g. as a first-order logic formula); Connotation (non-denotational information such as pejorative); Idiom (idiosyncratic use); Componential Features (formula or list containing a finite set of meaning attributes); Cross-references (links to other entries/wordforms); Semantic relations (relations between entries or associated concepts); Preference (characterization of the arguments in the semantic predicate)
Etymology | information about the historical context (morphological, phonological, syntactic, semantic) of a lexical entry or wordform
Usage | pragmatic/sociolinguistic information; possible values are: Region (e.g. dialect); Style (e.g. slang)
Frequency | corpus-derived frequency of occurrence
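Because several of the categories above come with a closed list of possible values, a lexicon's internal metadata can be checked against them as a controlled vocabulary. The sketch below encodes only three of the categories listed above; the dictionary name `INTERNAL_VOCABULARY` and the function `invalid_values` are hypothetical, and a real checker would need the full value lists from the ISLE draft.

```python
from typing import Dict, Iterable, List

# Subset of the internal metadata categories and their possible values.
INTERNAL_VOCABULARY = {
    "Modality": {"Spoken", "Written", "Sign"},
    "Headword type": {
        "Sentence", "Phrase", "Wordform", "Lemma",
        "Abstract Lemma", "Stem", "Affix",
    },
    "Usage": {"Region", "Style"},
}


def invalid_values(description: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Return, per category, the values that the vocabulary does not allow.

    Categories that are not in the vocabulary at all are reported with
    all of their values.
    """
    problems: Dict[str, List[str]] = {}
    for category, values in description.items():
        allowed = INTERNAL_VOCABULARY.get(category)
        if allowed is None:
            problems[category] = sorted(values)
            continue
        unknown = sorted(set(values) - allowed)
        if unknown:
            problems[category] = unknown
    return problems
```

An empty result means that every declared value belongs to the vocabulary; anything else points to a terminology mismatch that needs mapping or correction.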
A parallel initiative within ISLE, the EAGLES/ISLE Working Group for Multilingual Lexicons, aims to standardize multilingual lexical entries. For this purpose a checklist has been created that overlaps considerably with the metadata set listed above but is in many cases more fine-grained in its coverage. Abbreviations used: SL = source language; TL = target language; IPA = International Phonetic Alphabet (3).
No. | Entry component | Information content | Mode | Function
1 | headword | lexical form(s) of the headword: how the headword is spelt | SL | helps both SL and TL users find the information they are looking for
2 | Phonetic transcription | how the headword (or variant form etc.) is pronounced (in International Phonetic Alphabet) | IPA | helps user pronounce the word correctly
3 | variant form | alternative spelling of headword or slight variation in the form of this word | SL | helps both types of user find the information they are looking for
4 | inflected form | other grammatical forms of the lemma (headword) | SL | helps dec (decoding) user find the information they are looking for; helps enc (encoding) user use the word correctly
5 | Cross-reference | indication of another headword whose entry holds relevant information, or some other part of the dictionary where this may be found | code | helps both types of user find the information they are looking for, or other useful information
6 | Morphosyntactic information
a | Part-of-speech marker | part of speech of the headword (or the secondary headword) | code | helps both types of user find the information they are looking for, by focussing the search
b | Inflectional class | Inflectional paradigm of the entry | code | helps SL user use TL item correctly; helps TL user disambiguate TL word; helps TL user use SL item correctly; helps SL user disambiguate SL word
c | Derivation | Cross-part-of-speech information, morphologically derived forms | SL | helps SL user identify the sense of the headword or other SL item; helps TL user identify the sense of a TL equivalent
d | Gender | Information about the gender of the entry in SL and TL | code | helps SL user identify the sense of the headword or other SL item; helps TL user identify the sense of a TL equivalent
e | Number | Information about the grammatical number of the entry in SL and TL | code | helps SL user identify the sense of the headword or other SL item; helps TL user identify the sense of a TL equivalent
f | Mass vs. Count | Information on whether a noun is mass or count, in SL and TL | code | helps SL user identify the sense of the headword or other SL item; helps TL user identify the sense of a TL equivalent
g | Gradation | For adverbs and adjectives | code | helps SL user use TL item correctly; helps TL user disambiguate TL word
7 | Subdivision counter | indicates the start of a new section or subsection ('sense') | number / letter | 'signpost' helping user to find their way about the entry more efficiently
8 | Entry subdivision | separate section or subsection in entry (often called dictionary sense) | Dictionary text | breaks up entry, making it easier to read and find what is being sought
9 | Sense indicator | synonym or paraphrase of headword in this sense, or other brief sense clue indicating specific sense of SL or TL item | SL | helps SL user identify the sense of the headword or other SL item; helps TL user identify the sense of a TL equivalent
10 | linguistic label | the style, register, regional variety, etc. of the SL or TL item | code | helps SL user identify the sense of the headword; helps both users translate; helps TL user understand
11 | Syntactic information
a | Subcategorization frame | (i) number and types of complements; (ii) syntactic introducer of a complement (e.g. preposition, case, etc.); (iii) type of syntactic representation (e.g. constituents, functional, etc.); etc. | code | helps SL user identify the sense of the headword or other SL item; helps TL user identify the sense of a TL equivalent
b | Obligatority of complements | Information on whether a certain complement is obligatory or not | code | helps SL user identify the sense of the headword or other SL item; helps TL user identify the sense of a TL equivalent
c | Auxiliary | Which type of auxiliary is selected by a given predicate (in certain languages auxiliary selection is related to issues like unaccusativity, which in turn lie at the interface between lexicon and syntax) | code | acts as a sense indicator; helps SL user select appropriate TL equivalent
d | Light or support construction | Constructions with light verbs | SL or TL | helps SL user identify the sense of the headword or other SL item; helps TL user identify the sense of a TL equivalent
e | Periphrastic constructions | Constructions containing periphrasis, usage, semantic value, etc. | SL or TL | helps SL user identify the sense of the headword or other SL item; helps TL user identify the sense of a TL equivalent
f | Phrasal verbs | Particular representation of phrasal constructions | SL or TL | helps SL user identify the sense of the headword or other SL item; helps TL user identify the sense of a TL equivalent
g | Collocator | (i) typical subject/object of verb, noun modified by adjective etc.; (ii) type of collocation relation represented; etc. | SL or TL | acts as a sense indicator; helps SL user select appropriate TL equivalent; helps TL user translate or understand the SL item
h | Alternations | Syntactic alternations an entry can enter into | Code | acts as a sense indicator
12 | Semantic information
a | Semantic type | Reference to an ontology of types which are used to classify word senses | Code | helps SL user identify the sense of the headword or other SL item; helps TL user identify the sense of a TL equivalent
b | Argument structure | Argument frames, plus semantic information identifying the type of the arguments, selectional constraints, etc. | Code | helps SL user identify the sense of the headword or other SL item; helps TL user identify the sense of a TL equivalent
c | Semantic relations | Different types of relations (e.g. synonymy, antonymy, meronymy, hyperonymy, Qualia Roles, etc.) between word senses, etc. | Code | acts as SL sense indicator for SL user; acts as TL sense indicator for TL user
d | Regular polysemy | Representation of regular polysemous alternations | Code | helps SL user identify the sense of the headword or other SL item; helps TL user identify the sense of a TL equivalent
e | Domain | Information concerning the terminological domain to which a given sense belongs | Code | helps SL user identify the sense of the headword or other SL item; helps TL user identify the sense of a TL equivalent
f | Decomposition | Representation of relevant meaning component, e.g. causativity, agentivity, motion, etc. | Code | acts as SL sense indicator for SL user; acts as TL sense indicator for TL user
13 | Translation | TL equivalent of SL item | TL | helps TL user understand; helps both users translate
14 | Gloss | TL explanation of meaning of an SL item which has no direct equivalent in the TL | TL | helps TL user understand; helps both users translate
15 | near-equivalent | TL item corresponding to an SL item which has no direct equivalent in the TL | TL | helps TL user understand; helps both users translate
16 | example phrase (straightforward) | a phrase or sentence illustrating the non-idiomatic use of the headword, in a context where the TL equivalent is virtually a word-to-word translation | SL | acts as SL sense indicator for SL user; acts as TL sense indicator for TL user; helps TL & SL users to use the foreign-language item correctly
17 | example phrase (problematic) | a phrase or sentence illustrating a non-idiomatic use of the headword in a context where a specific TL equivalent is required (i.e. an SL example which is easily understandable for the TL speaker, but presents translation problems for the SL speaker) | SL | helps SL user avoid a translating error; acts as a sense indicator for SL user; helps TL user subsequently to use the SL item correctly
18 | multiword unit | (idiomatic) multiword expression (MWE) containing the headword (the term MWE covers idioms, fixed & semi-fixed collocations, compounds etc.) | SL | helps both users translate
19 | Subheadword (also secondary headword) | lemma morphologically related to the headword, figuring as head of a sub-entry (subheadwords can be compounds, phrasal verbs, etc.) | SL | saves space; helps both types of user find the information they are looking for
20 | usage note | how the headword is used; 'macro' information which cannot appear at every appropriate entry; warning of cultural differences between the two languages; etc. | SL or TL | helps both types of user to avoid misunderstandings about the foreign language item, based on own-language knowledge
21 | Frequency | Information about the frequency of the entry | code | helps both users translate
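To make the checklist more concrete, the sketch below models a small subset of its components (headword, part-of-speech marker, phonetic transcription, variant form, sense indicator, translation, gloss, example phrase) as a bilingual entry. The class names, attribute names and the English-Dutch sample entry are illustrative assumptions, not part of the EAGLES/ISLE checklist itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sense:
    sense_indicator: str = ""                               # component 9 (SL)
    translations: List[str] = field(default_factory=list)   # component 13 (TL)
    gloss: str = ""                                         # component 14 (TL)
    examples: List[str] = field(default_factory=list)       # components 16-17 (SL)


@dataclass
class BilingualEntry:
    headword: str                                           # component 1 (SL)
    part_of_speech: str                                     # component 6a (code)
    phonetic_transcription: str = ""                        # component 2 (IPA)
    variant_forms: List[str] = field(default_factory=list)  # component 3 (SL)
    senses: List[Sense] = field(default_factory=list)       # components 7-8

    def all_translations(self) -> List[str]:
        """Collect TL equivalents across senses, without duplicates, in order."""
        seen: List[str] = []
        for sense in self.senses:
            for translation in sense.translations:
                if translation not in seen:
                    seen.append(translation)
        return seen


# Invented English (SL) to Dutch (TL) sample entry.
entry = BilingualEntry(
    headword="bank",
    part_of_speech="noun",
    senses=[
        Sense(sense_indicator="financial institution", translations=["bank"]),
        Sense(sense_indicator="edge of a river", translations=["oever"]),
    ],
)
```

Grouping translation, gloss and examples under a sense mirrors the role of the subdivision counter and sense indicator in the checklist: they are what lets a user pick the right TL equivalent.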