Lexical Resources
Wim Peters
NLP group
Department of Computer Science
University of Sheffield
w.peters@dcs.shef.ac.uk

Introduction

Representation format of linguistic resources

Standardization of lexical description

Metadata for lexical description

Comparison of resources using metadata

Comparison of resources using metadata

 

To give an impression of how useful metadata are for comparing and evaluating lexical resources, the table below lists the linguistic content of four resources according to a metadata set that is a general-level subset of the ISLE sets discussed above; 1 marks an information type as present in a resource and 0 as absent. The resources under examination are the Longman Dictionary of Contemporary English (LDOCE) (1), the CELEX database, WordNet and the Cambridge International Dictionary of English (CIDE) (2).

                                         LDOCE  CELEX  WORDNET  CIDE

ORTHOGRAPHY
Spelling                                 1      1      1        1
Spelling variants                        1      1      1        1
Syllabification                          1      1      0        0
Capitalisation                           1      1      1        1

PHONOLOGY
Phonetic transcription                   0      1      0        1
Stress marking                           0      1      0        1

MORPHO-SYNTAX
Part of speech                           1      1      1        1
Inflection                               1      1      1        1
Conjugation                              1      1      1        1
Countability                             1      1      0        1
Gradation (e.g. busy, busier)            1      1      1        1
Type (e.g. common noun, auxiliary verb)  1      1      0        1
Gender                                   1      1      0        1

MORPHOLOGY
Derivation/composition                   0      1      0        0
Segmentation                             0      1      0        0

SYNTAX
Alternation                              1      1      1        1
Complementation                          1      1      1        1
Positional (attributive, predicative)    1      1      0        0
Analysis of multi-word-units             0      0      0        1
Collocational restrictions               0      0      0        1

SEMANTICS
Senses                                   1      0      1        1
Ontological classification               1      0      1        1
Semantic relation                        1      0      1        1
Definition                               1      0      1        1
Preference                               1      0      1        1
Regular polysemy                         0      0      1        0
Domain                                   1      0      0        1
Idiom                                    1      0      0        1

OTHER
Usage notes                              1      0      0        1
Examples                                 1      0      1        1
Translation                              0      0      0        0
Frequency                                0      1      0        0
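
A comparison of this kind is easy to automate once the table is stored as data. The following is a minimal sketch, assuming the presence flags are held in a Python dictionary; it covers only an excerpt of the rows above, and the function names (features_of, shared, unique_to, uncovered) are illustrative choices rather than part of any ISLE specification.

```python
# Column order of the presence flags below.
RESOURCES = ("LDOCE", "CELEX", "WordNet", "CIDE")

# Excerpt of the table: information type -> presence flags in RESOURCES order.
TABLE = {
    "Syllabification":        (1, 1, 0, 0),
    "Phonetic transcription": (0, 1, 0, 1),
    "Derivation/composition": (0, 1, 0, 0),
    "Senses":                 (1, 0, 1, 1),
    "Regular polysemy":       (0, 0, 1, 0),
    "Domain":                 (1, 0, 0, 1),
    "Translation":            (0, 0, 0, 0),
    "Frequency":              (0, 1, 0, 0),
}


def features_of(resource):
    """Return the set of information types marked as present in a resource."""
    idx = RESOURCES.index(resource)
    return {feature for feature, flags in TABLE.items() if flags[idx]}


def shared(a, b):
    """Information types present in both resources."""
    return features_of(a) & features_of(b)


def unique_to(resource):
    """Information types present in this resource and in no other."""
    others = set().union(
        *(features_of(r) for r in RESOURCES if r != resource)
    )
    return features_of(resource) - others


def uncovered():
    """Information types that none of the resources provides."""
    return {feature for feature, flags in TABLE.items() if not any(flags)}


if __name__ == "__main__":
    print(sorted(shared("LDOCE", "CIDE")))
    print(sorted(unique_to("CELEX")))
    print(sorted(uncovered()))
```

On this excerpt, LDOCE and CIDE share senses and domain information, CELEX is the only resource with derivation/composition and frequency, and translation is covered by none of the four.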

The high-level comparison provided by this classification system can be refined by choosing increasingly fine-grained levels of linguistic description, e.g. by incorporating the complete ISLE checklists and more. For example, multi-word units can be subclassified on the basis of their constituent parts (fixed phrases, compounds, idioms, support verb constructions, phrasal verbs). Verb complementation can be further subdivided into intransitive, transitive, ditransitive, copula, phrasal verb, prepositional verb and support verb. Maximum refinement is obtained when the linguistic information has been decomposed into its most basic information units. The result is a very complex structure of highly interrelated blocks of minimal linguistic information, as exemplified by the GENELEX architecture.
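
The idea of choosing a level of granularity can be sketched as a tree of descriptors that is cut off at a requested depth. The code below is an illustration only: the nested dictionary SCHEME and the function descriptors are hypothetical names, and the tree contains just the two subclassifications mentioned in the text, not the full ISLE checklists or the GENELEX model.

```python
# Hypothetical fragment of a metadata scheme: each key is a descriptor,
# each value holds its finer-grained subdescriptors (empty dict = leaf).
SCHEME = {
    "Syntax": {
        "Analysis of multi-word-units": {
            "fixed phrase": {},
            "compound": {},
            "idiom": {},
            "support verb construction": {},
            "phrasal verb": {},
        },
        "Complementation": {
            "intransitive": {},
            "transitive": {},
            "ditransitive": {},
            "copula": {},
            "phrasal verb": {},
            "prepositional verb": {},
            "support verb": {},
        },
    },
}


def descriptors(tree, depth, path=()):
    """Return descriptor paths, descending at most `depth` levels.

    Branches that end before the requested depth are returned at
    their own leaf level.
    """
    result = []
    for name, children in tree.items():
        here = path + (name,)
        if depth == 1 or not children:
            result.append(here)
        else:
            result.extend(descriptors(children, depth - 1, here))
    return result


if __name__ == "__main__":
    for level in (1, 2, 3):
        print(level, len(descriptors(SCHEME, level)))
```

At depth 2 the comparison is made on two descriptors; at depth 3 the same fragment yields twelve, which shows how quickly the structure grows as the description is refined.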

The encoding of verbal complementation and preference information illustrates how the four resources mentioned above differ and how they fit into the metadata classification scheme. For this particular type of linguistic content, the following pieces of information are found:

CELEX

LDOCE

WordNet

CIDE

The figure below illustrates the case of the verb "fall".

Complementation and preference for the verb 'fall'

Linking resource-specific units of description to a fine-grained metamodel will have several advantages:

'I*' = 'I' = 'intransitive' = 'somebody Vs'

 word form <derived from> lemma <synonym> headword
<morphologically_composed_of> stem
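
The equivalence 'I*' = 'I' = 'intransitive' = 'somebody Vs' can be expressed as a lookup from resource-specific notations to a single metamodel unit. This is a minimal sketch under stated assumptions: the unit identifier "complementation:intransitive" and the function names are hypothetical, and the code does not claim which resource uses which notation.

```python
# Hypothetical identifier for the metamodel unit; the name is an assumption.
INTRANSITIVE = "complementation:intransitive"

# The four notations equated in the text, all linked to the same unit.
NOTATION_TO_UNIT = {
    "I*": INTRANSITIVE,
    "I": INTRANSITIVE,
    "intransitive": INTRANSITIVE,
    "somebody Vs": INTRANSITIVE,
}


def to_metamodel(notation):
    """Return the metamodel unit for a notation, or None if it is unknown."""
    return NOTATION_TO_UNIT.get(notation.strip())


def equivalent(a, b):
    """True if both notations are known and link to the same metamodel unit."""
    unit_a, unit_b = to_metamodel(a), to_metamodel(b)
    return unit_a is not None and unit_a == unit_b


if __name__ == "__main__":
    print(equivalent("I*", "somebody Vs"))
    print(to_metamodel("T"))
```

Once every notation is linked in this way, two resources can be compared on the metamodel units they share rather than on their surface codes.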



  1. Procter, P. (1979), The Longman Dictionary of Contemporary English, Longman, London.
  2. Cambridge International Dictionary of English (2001), Cambridge University Press, Cambridge, U.K.