Lexical Resources - Comparison of resources using metadata

Lexical Resources
Wim Peters
NLP group
Department of Computer Science
University of Sheffield
w.peters@dcs.shef.ac.uk


Comparison of resources using metadata

To give an impression of how useful metadata are for comparing and evaluating lexical resources, the table below lists the linguistic content of four resources against a metadata set that is a general-level subset of the ISLE sets discussed above. In the table, 1 marks a type of information that a resource contains and 0 a type that it lacks. The resources examined are the Longman Dictionary of Contemporary English (LDOCE) (1), the CELEX database, WordNet and the Cambridge International Dictionary of English (CIDE) (2).

	LDOCE	CELEX	WORDNET	CIDE
ORTHOGRAPHY
Spelling	1	1	1	1
Spelling variants	1	1	1	1
Syllabification	1	1	0	0
Capitalisation	1	1	1	1

PHONOLOGY
Phonetic transcription	0	1	0	1
Stress marking	0	1	0	1

MORPHO-SYNTAX
Part of speech	1	1	1	1
Inflection	1	1	1	1
Conjugation	1	1	1	1
Countability	1	1	0	1
Gradation (e.g. busy, busier)	1	1	1	1
Type (e.g. common noun, auxiliary verb)	1	1	0	1
Gender	1	1	0	1

MORPHOLOGY
Derivation/composition	0	1	0	0
Segmentation	0	1	0	0

SYNTAX
Alternation	1	1	1	1
Complementation	1	1	1	1
Positional (attributive, predicative)	1	1	0	0
Analysis of multi-word-units	0	0	0	1
Collocational restrictions	0	0	0	1

SEMANTICS
Senses	1	0	1	1
Ontological classification	1	0	1	1
Semantic relation	1	0	1	1
Definition	1	0	1	1
Preference	1	0	1	1
Regular polysemy	0	0	1	0
Domain	1	0	0	1
Idiom	1	0	0	1

OTHER
Usage notes	1	0	0	1
Examples	1	0	1	1
Translation	0	0	0	0
Frequency	0	1	0	0
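
Because every cell is 0 or 1, the table can be read as a binary feature matrix, which makes simple comparisons mechanical. The following is a minimal Python sketch under that reading. It copies only a handful of rows from the table above, and the names COVERAGE, features_of, unique_to and jaccard are illustrative choices, not part of any ISLE tooling.

```python
# Binary coverage matrix: a few rows copied from the table above.
RESOURCES = ("LDOCE", "CELEX", "WORDNET", "CIDE")

COVERAGE = {
    "Syllabification":            (1, 1, 0, 0),
    "Phonetic transcription":     (0, 1, 0, 1),
    "Derivation/composition":     (0, 1, 0, 0),
    "Collocational restrictions": (0, 0, 0, 1),
    "Senses":                     (1, 0, 1, 1),
    "Regular polysemy":           (0, 0, 1, 0),
    "Domain":                     (1, 0, 0, 1),
    "Frequency":                  (0, 1, 0, 0),
}


def features_of(resource):
    """Return the set of listed features marked 1 for the given resource."""
    col = RESOURCES.index(resource)
    return {name for name, row in COVERAGE.items() if row[col] == 1}


def unique_to(resource):
    """Features that only this resource covers (within the rows listed)."""
    others = set().union(*(features_of(r) for r in RESOURCES if r != resource))
    return features_of(resource) - others


def jaccard(a, b):
    """Overlap of two resources: shared features divided by all features of either."""
    fa, fb = features_of(a), features_of(b)
    return len(fa & fb) / len(fa | fb)


if __name__ == "__main__":
    for r in RESOURCES:
        print(r, sorted(unique_to(r)))
    print("LDOCE vs CIDE:", jaccard("LDOCE", "CIDE"))
```

Any figure computed this way only describes the rows included; a full comparison would need every row of the table, and ideally the finer-grained ISLE checklists.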

The high-level comparison provided by this classification system can be refined by choosing increasingly fine-grained levels of linguistic description, e.g. by incorporating the complete ISLE checklists and more. For example, multi-word-units can be subclassified on the basis of their constituent parts (fixed phrases, compounds, idioms, support verb constructions, phrasal verbs). Verb complementation can be further subdivided into (in-/di-)transitive, copula, phrasal verb, prepositional verb and support verb. Maximum refinement is obtained when the linguistic information has been decomposed into its most basic information units. The result is a very complex structure of highly interrelated blocks of minimal linguistic information, as exemplified by the GENELEX architecture.
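
One way to picture this refinement is as a tree in which each top-level category can be expanded into subclasses. The sketch below is an assumption about how such a hierarchy might be held in memory. It uses only the subclasses named in the paragraph above, and TAXONOMY, paths and at_depth are hypothetical names.

```python
# Nested refinement of two top-level categories, using only the
# subclasses named in the paragraph above. An empty dict marks a leaf.
TAXONOMY = {
    "Analysis of multi-word-units": {
        "fixed phrase": {},
        "compound": {},
        "idiom": {},
        "support verb construction": {},
        "phrasal verb": {},
    },
    "Complementation": {
        "intransitive": {},
        "transitive": {},
        "ditransitive": {},
        "copula": {},
        "phrasal verb": {},
        "prepositional verb": {},
        "support verb": {},
    },
}


def paths(tree, prefix=()):
    """Yield every root-to-leaf path as a tuple of category names."""
    for name, children in tree.items():
        here = prefix + (name,)
        if children:
            yield from paths(children, here)
        else:
            yield here


def at_depth(tree, depth):
    """List the categories visible at a given level of granularity (1 = top)."""
    if depth == 1:
        return list(tree)
    out = []
    for name, children in tree.items():
        # A category without further subdivision stays as it is.
        out.extend(at_depth(children, depth - 1) if children else [name])
    return out
```

Calling at_depth(TAXONOMY, 1) gives the coarse categories used in the table, while at_depth(TAXONOMY, 2) gives the finer subclasses, so the same comparison can be repeated at either level.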

An example of how resources differ, and of how they fit into the metadata classification scheme, is the encoding of verbal complementation and preference information in the four resources mentioned above. For this type of linguistic content, the resources provide the following information:

CELEX

complementation: a set of labels, each expressed as a boolean value in its own column, e.g. transitive: Yes; ditransitive: No
no semantic preference

LDOCE

complementation: codes representing verb classes, e.g. D1: ditransitive
preference: 34 semantic classes for subject/object/indirect object slots, e.g. liquid/movable

WordNet

complementation and preference: surface patterns, e.g. 'Somebody Vs somebody something'

CIDE

complementation: codes representing verb classes, e.g. 'T': transitive
preference: 40 semantic classes for subject/object/indirect object slots, e.g. 'human'/'clothing'

The figure below illustrates the case of the verb "fall".

[Figure: Complementation and preference for the verb 'fall']

Linking resource-specific units of description to a fine-grained metamodel will have several advantages:

it enables direct access to the information contained in the resource in a uniform way
it provides an explicit link between representation formats. This enables the integration of different resource-specific attributes/values:

'I*' = 'I' = 'intransitive' = 'somebody Vs'

word form <derived from> lemma <synonym> headword
<morphologically_composed_of> stem
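
The equivalence above can be sketched as a small link table that maps each resource-specific code to one shared metamodel value. This is only an illustration. The four notations come from the equivalence itself, but pairing each notation with a particular resource is an assumption, 'I*' is treated here as a literal string, and LINKS, to_metamodel and equivalents are hypothetical names.

```python
# Hypothetical link table: (resource, resource-specific code) -> metamodel value.
# The four notations are taken from the equivalence above; attributing each
# one to a specific resource is an assumption made for illustration.
LINKS = {
    ("LDOCE", "I*"): "intransitive",
    ("CIDE", "I"): "intransitive",
    ("CELEX", "intransitive"): "intransitive",
    ("WordNet", "somebody Vs"): "intransitive",
}


def to_metamodel(resource, code):
    """Translate a resource-specific code into the shared metamodel value."""
    return LINKS.get((resource, code))


def equivalents(resource, code):
    """Return the codes that other resources use for the same metamodel value."""
    value = to_metamodel(resource, code)
    if value is None:
        return {}
    return {r: c for (r, c), v in LINKS.items() if v == value and r != resource}


if __name__ == "__main__":
    print(to_metamodel("CIDE", "I"))
    print(equivalents("CIDE", "I"))
```

Routing every lookup through the metamodel value, rather than translating pairwise between resources, means a new resource needs only one set of links to become comparable with all the others.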

(1) Procter, P. (1979), The Longman Dictionary of Contemporary English, Longman, London.
(2) Cambridge International Dictionary of English (2001), Cambridge University Press, Cambridge, U.K.