Lexical Resources
Wim Peters
NLP group
Department of Computer Science
University of Sheffield
w.peters@dcs.shef.ac.uk
Comparison of resources using metadata
To give an impression of how useful metadata is for comparing and evaluating lexical resources, the table below lists the linguistic content of four resources according to a metadata set that is a general-level subset of the ISLE sets discussed above. The resources under examination are the Longman Dictionary of Contemporary English (LDOCE) (1), the CELEX database, WordNet and the Cambridge International Dictionary of English (CIDE) (2).
Linguistic content | LDOCE | CELEX | WORDNET | CIDE |
ORTHOGRAPHY | ||||
Spelling | 1 | 1 | 1 | 1 |
Spelling variants | 1 | 1 | 1 | 1 |
Syllabification | 1 | 1 | 0 | 0 |
Capitalisation | 1 | 1 | 1 | 1 |
PHONOLOGY | ||||
Phonetic transcription | 0 | 1 | 0 | 1 |
Stress marking | 0 | 1 | 0 | 1 |
MORPHO-SYNTAX | ||||
Part of speech | 1 | 1 | 1 | 1 |
Inflection | 1 | 1 | 1 | 1 |
Conjugation | 1 | 1 | 1 | 1 |
Countability | 1 | 1 | 0 | 1 |
Gradation (e.g. busy, busier) | 1 | 1 | 1 | 1 |
Type (e.g. common noun, auxiliary verb) | 1 | 1 | 0 | 1 |
Gender | 1 | 1 | 0 | 1 |
MORPHOLOGY | ||||
Derivation/composition | 0 | 1 | 0 | 0 |
Segmentation | 0 | 1 | 0 | 0 |
SYNTAX | ||||
Alternation | 1 | 1 | 1 | 1 |
Complementation | 1 | 1 | 1 | 1 |
Positional (attributive, predicative) | 1 | 1 | 0 | 0 |
Analysis of multi-word-units | 0 | 0 | 0 | 1 |
Collocational restrictions | 0 | 0 | 0 | 1 |
SEMANTICS | ||||
Senses | 1 | 0 | 1 | 1 |
Ontological classification | 1 | 0 | 1 | 1 |
Semantic relation | 1 | 0 | 1 | 1 |
Definition | 1 | 0 | 1 | 1 |
Preference | 1 | 0 | 1 | 1 |
Regular polysemy | 0 | 0 | 1 | 0 |
Domain | 1 | 0 | 0 | 1 |
Idiom | 1 | 0 | 0 | 1 |
OTHER | ||||
Usage notes | 1 | 0 | 0 | 1 |
Examples | 1 | 0 | 1 | 1 |
Translation | 0 | 0 | 0 | 0 |
Frequency | 0 | 1 | 0 | 0 |
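A table of this kind lends itself to simple automatic comparison. The following is a minimal sketch, not part of the original study: it copies a handful of rows from the table into a Python dictionary and derives which information types two resources share and which are found in only one resource. The names `FEATURES`, `coverage`, `shared` and `unique_to` are illustrative assumptions, and only nine of the rows are included.

```python
# Column order of each tuple: LDOCE, CELEX, WordNet, CIDE.
RESOURCES = ("LDOCE", "CELEX", "WordNet", "CIDE")

# A subset of the rows above; 1 and 0 are copied from the table as given.
FEATURES = {
    "Syllabification":        (1, 1, 0, 0),
    "Phonetic transcription": (0, 1, 0, 1),
    "Part of speech":         (1, 1, 1, 1),
    "Countability":           (1, 1, 0, 1),
    "Derivation/composition": (0, 1, 0, 0),
    "Senses":                 (1, 0, 1, 1),
    "Definition":             (1, 0, 1, 1),
    "Regular polysemy":       (0, 0, 1, 0),
    "Frequency":              (0, 1, 0, 0),
}


def coverage(resource):
    """Return the set of information types marked 1 for a resource."""
    col = RESOURCES.index(resource)
    return {name for name, row in FEATURES.items() if row[col] == 1}


def shared(a, b):
    """Information types marked 1 for both resources."""
    return coverage(a) & coverage(b)


def unique_to(resource):
    """Information types marked 1 for this resource and for no other."""
    others = set()
    for other in RESOURCES:
        if other != resource:
            others |= coverage(other)
    return coverage(resource) - others


if __name__ == "__main__":
    print(sorted(shared("LDOCE", "CIDE")))
    print(sorted(unique_to("CELEX")))
```

Within this subset, for instance, `unique_to("WordNet")` yields only "Regular polysemy", which matches the single row in which WordNet alone is marked 1.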
To refine the comparison, the high-level information provided by this classification system can be extended with increasingly fine-grained levels of linguistic description, e.g. by incorporating the complete ISLE checklists and more. For example, multi-word-units can be subclassified on the basis of their constituent parts (fixed phrases, compounds, idioms, support verb constructions, phrasal verbs). Verb complementation can be further subdivided into (in-/di-)transitive, copula, phrasal verb, prepositional verb and support verb. Maximum refinement is obtained when the linguistic information has been decomposed into its most basic information units. The result is a very complex structure of highly interrelated blocks of minimal linguistic information, as exemplified by the GENELEX architecture.
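One way to picture this refinement is as a tree in which each general category is expanded into its subtypes. The sketch below is only an illustration under that assumption: it encodes the two subdivisions named in this paragraph as nested dictionaries and enumerates the most specific categories. The names `TAXONOMY` and `leaves` are hypothetical, and the structure is far simpler than a full model such as GENELEX.

```python
# Each key is a category; its value holds the subcategories (empty = leaf).
TAXONOMY = {
    "Analysis of multi-word-units": {
        "fixed phrase": {},
        "compound": {},
        "idiom": {},
        "support verb construction": {},
        "phrasal verb": {},
    },
    "Complementation": {
        "intransitive": {},
        "transitive": {},
        "ditransitive": {},
        "copula": {},
        "phrasal verb": {},
        "prepositional verb": {},
        "support verb": {},
    },
}


def leaves(tree, prefix=()):
    """Yield the path to every most specific category in the tree."""
    for name, children in tree.items():
        path = prefix + (name,)
        if children:
            yield from leaves(children, path)
        else:
            yield path


if __name__ == "__main__":
    for path in leaves(TAXONOMY):
        print(" > ".join(path))
```

Adding a further level of description then amounts to replacing an empty dictionary with a new set of subcategories, without changing the categories above it.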
An example of differences between resources and how they fit into the metadata classification scheme is the encoding of verbal complementation and preference information in the four resources mentioned above. For this particular type of linguistic content the following pieces of information are found:
CELEX
complementation: set of labels expressed by boolean values in columns: e.g. transitive: Yes; ditransitive: No
no semantic preference
LDOCE
complementation: codes representing verb classes. e.g. D1: ditransitive
preference: 34 semantic classes for subject/object/indirect object slots e.g. liquid/movable
WordNet
complementation and preference: surface patterns 'Somebody Vs somebody something'
CIDE
complementation: codes representing verb classes. e.g. 'T': transitive
preference: 40 semantic classes for subject/object/indirect object slots e.g. 'human'/'clothing'
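The differences in encoding listed above can be bridged by mapping each resource-specific notation onto a shared label. The following is a minimal sketch of that idea, not a description of any existing tool. The LDOCE and CIDE codes and the CELEX boolean columns are the examples given above. Reading the WordNet pattern as ditransitive is an assumption made here for illustration, and the names `CODE_TO_LABEL`, `from_celex` and `normalise` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (resource, resource-specific code) -> shared complementation label.
# 'D1' and 'T' come from the examples in the text; treating the WordNet
# surface pattern as ditransitive is an assumption of this sketch.
CODE_TO_LABEL: Dict[Tuple[str, str], str] = {
    ("LDOCE", "D1"): "ditransitive",
    ("CIDE", "T"): "transitive",
    ("WordNet", "Somebody Vs somebody something"): "ditransitive",
}


def from_celex(flags: Dict[str, bool]) -> List[str]:
    """CELEX uses one boolean column per label; keep the labels set to True."""
    return sorted(label for label, value in flags.items() if value)


def normalise(resource: str, code: str) -> Optional[str]:
    """Return the shared label for a code, or None if no mapping is known."""
    return CODE_TO_LABEL.get((resource, code))


if __name__ == "__main__":
    print(normalise("LDOCE", "D1"))
    print(from_celex({"transitive": True, "ditransitive": False}))
```

Returning `None` for unmapped codes keeps gaps in the mapping visible instead of silently guessing a category.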
The figure below illustrates the case of the verb "fall".
Complementation and preference for the verb 'fall'
Linking resource-specific units of description to a fine-grained metamodel will have several advantages:
'I*' = 'I' = 'intransitive' = 'somebody Vs'
word form <derived from> lemma <synonym> headword
<morphologically_composed_of> stem