Lexical Resources
Wim Peters
NLP group
Department of Computer Science
University of Sheffield
w.peters@dcs.shef.ac.uk
2 Representation format of linguistic resources
There are various ways in which textual and lexical data can be annotated and structured, depending on theoretical convictions and associated tools. The main types of data structure encountered are enumerated in section 2.3. The most widely used standards for resource representation are SGML, XML and RDF, which are briefly described in the following sections.
2.1 SGML and XML
These are widely used standards for annotating text structure. XML has superseded SGML, but many resources are still available in SGML format. For several SGML and XML tutorials and information pages see footnote (1). In 1994 the Text Encoding Initiative (TEI) (2) published a set of detailed recommendations for the encoding and transcription of many types of written and spoken materials, using an extensible SGML framework. This format has also been influential in lexicon creation projects such as PAROLE and SIMPLE, as well as in defining EAGLES and related standards (see section 3).
The following TEI example, taken from its guidelines for encoding print dictionaries (3), shows a dictionary entry that provides information on several aspects of orthography, phonology, syntax and semantics.
<entry>
  <form>
    <orth>competitor</orth>          <!-- orthography -->
    <hyph>com|peti|tor</hyph>        <!-- syllabification -->
    <pron>k@m"petit@(r)</pron>       <!-- pronunciation -->
  </form>
  <gramGrp>
    <pos>n</pos>                     <!-- part of speech -->
  </gramGrp>
  <def>person who competes.</def>    <!-- definition -->
</entry>
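As an illustration of how such an entry can be processed, the following minimal Python sketch reads the entry above with the standard library XML parser and collects its fields in a dictionary. The function name read_entry and the layout of the resulting dictionary are assumptions made for this sketch; they are not part of the TEI guidelines.

```python
import xml.etree.ElementTree as ET

# The TEI dictionary entry from the example above, without the explanatory labels.
TEI_ENTRY = """
<entry>
  <form>
    <orth>competitor</orth>
    <hyph>com|peti|tor</hyph>
    <pron>k@m"petit@(r)</pron>
  </form>
  <gramGrp>
    <pos>n</pos>
  </gramGrp>
  <def>person who competes.</def>
</entry>
"""


def read_entry(xml_text):
    """Return the fields of one <entry> element as a flat dictionary (hypothetical helper)."""
    entry = ET.fromstring(xml_text.strip())
    return {
        "orth": entry.findtext("form/orth"),
        # The syllabification uses "|" as the syllable separator.
        "hyph": entry.findtext("form/hyph").split("|"),
        "pron": entry.findtext("form/pron"),
        "pos": entry.findtext("gramGrp/pos"),
        "def": entry.findtext("def"),
    }


record = read_entry(TEI_ENTRY)
print(record["orth"], record["pos"], record["hyph"])
```

Because the markup names each piece of information explicitly, the program can address fields by element path instead of relying on their position or typography in the printed entry.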
This TEI specification for dictionaries has been adopted and extended within the CONCEDE project (Consortium for Central European Dictionary Encoding) (4).
2.2 RDF
The Resource Description Framework (RDF) (5) is, as its name implies, a framework for describing and interchanging metadata. It provides a model and a syntax for metadata so that independent parties can exchange and use it.
At its core, RDF data consists of nodes and attached attribute/value pairs. Nodes can be any web resources (pages, servers, basically anything for which you can give a Uniform Resource Identifier (URI)), even other instances of metadata. Attributes are named properties of the nodes, and their values are either atomic (text strings, numbers, etc.) or other resources or metadata instances. In short, this mechanism allows us to build labeled directed graphs, which can be serialized as XML. For a tutorial see footnote (6).
An example is shown below, where the attribute creator attached to the resource uniquely identified by the URI has the value John Smith.
<RDF:RDF>
  <RDF:Description RDF:HREF = "http://URI-of-Document">
    <DC:Creator>John Smith</DC:Creator>
  </RDF:Description>
</RDF:RDF>
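The node and attribute/value model described above can be sketched in a few lines of Python, without any RDF library. Each statement is held as a (resource, property, value) triple, and the triples are grouped per resource to form a labeled directed graph. The helper names build_graph and values_of, and the DC:Title statement, are hypothetical and serve only to illustrate the model.

```python
from collections import defaultdict

# Each statement is a (resource, property, value) triple.
# The URI and the Creator value come from the example above;
# the second statement is hypothetical and added only for illustration.
triples = [
    ("http://URI-of-Document", "DC:Creator", "John Smith"),
    ("http://URI-of-Document", "DC:Title", "An example title"),  # hypothetical
]


def build_graph(statements):
    """Group triples by resource: node -> list of (property, value) edges."""
    graph = defaultdict(list)
    for resource, prop, value in statements:
        graph[resource].append((prop, value))
    return graph


def values_of(graph, resource, prop):
    """Return all values attached to a resource under the given property."""
    return [value for name, value in graph.get(resource, []) if name == prop]


graph = build_graph(triples)
print(values_of(graph, "http://URI-of-Document", "DC:Creator"))
```

A value may itself be the URI of another resource, in which case it appears as a key of the same graph and the edge connects two nodes rather than a node and an atomic value.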
Different linguistic classification systems will provide different packages of resource/property/value combinations. These packages are called vocabularies. RDF in itself does not contain any predefined vocabularies for authoring metadata (see section 3).
2.3 Main types of data structure
Typed feature structures:
A feature structure is a set of pairs of attributes (called features) and
their values; it can also be seen as a partial function from features to
values. Each lexical entry is organized as a list of categorized features.
Each list consists of a type symbol followed by zero or more keyword-value
pairs. Each value may in turn be an atom, a string, a list of strings,
a feature-value list, or a list of feature-value lists. For a more detailed
introduction we refer the reader to Shieber (1986) (7). An example is the
Comlex Syntax database (8):
(noun :orth "assertion"                          # orthography
      :subc ((noun-that-s) (noun-be-that-s)))    # syntactic complementation
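The structure of such an entry (a type symbol followed by keyword-value pairs, with values that may themselves be lists) can be made explicit with a small parser. The following Python sketch reads the entry above into a nested dictionary. It is a simplified reader written for this illustration, not the access software distributed with Comlex, and the names parse and to_feature_structure are hypothetical.

```python
import re

# Tokens: opening bracket, closing bracket, quoted string, or bare symbol.
TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')


def parse(tokens, pos=0):
    """Parse one expression starting at tokens[pos]; return (value, next_pos)."""
    tok = tokens[pos]
    if tok == "(":
        items = []
        pos += 1
        while tokens[pos] != ")":
            item, pos = parse(tokens, pos)
            items.append(item)
        return items, pos + 1  # skip the closing bracket
    if tok.startswith('"'):
        return tok.strip('"'), pos + 1  # string value without its quotes
    return tok, pos + 1  # atom or keyword


def to_feature_structure(text):
    """Turn '(type :key value ...)' into {'type': ..., 'features': {...}}."""
    tree, _ = parse(TOKEN.findall(text))
    type_symbol, rest = tree[0], tree[1:]
    # The remaining items alternate between a keyword and its value.
    features = {rest[i].lstrip(":"): rest[i + 1] for i in range(0, len(rest), 2)}
    return {"type": type_symbol, "features": features}


entry = to_feature_structure(
    '(noun :orth "assertion" :subc ((noun-that-s) (noun-be-that-s)))'
)
print(entry)
```

In the result, the value of orth is a string, while the value of subc is a list of lists, which shows that feature values need not be atomic.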
Relational format:
A relational database consists of a set of relations between entities.
Each role in a relation is called an attribute. Conceptually, a relation
is a table whose columns correspond to attributes, and each row, or tuple,
specifies all the attribute values of a given entry. Attributes have
only atomic values, that is, values which cannot be decomposed. In other
words, each row-to-column intersection contains one, and only one, value.
The following example from the Celex Lexical Database (9) shows the
morphological structure of the word 'abbreviation'. The unique identifier
expressed by the lemma number (lemmano) provides the key into orthographic,
syntactic and phonetic information contained in different tables.
lemmano | lemma        | morphstatus | Imm1           | formation
26      | abbreviation | C           | abbreviate+ion | -e#
DEFINITION (Key: HW, PS, HN, SN)
HW      | PS | HN | SN | DF
abandon | V  | 1  | 1  | to leave completely and for ever
abandon | V  | 1  | 1  | desert
abandon | V  | 1  | 2  | to leave (a relation or friend) in a thoughtless or cruel way
abandon | V  | 1  | 3  | to give up, esp. without finishing
abandon | V  | 1  | 4  | to give (oneself) up completely to a feeling, desire, etc.
abandon | N  | 2  | 0  | The people were so excited that they jumped and shouted with abandon/in gay abandon
'0' is used for entries that have only one sense and no explicit sense numbering in the paper-based entry (see above).
PRONUNCIATION (Key: HW, PS, HN)
HW      | PS | HN | PR
abandon | V  | 1  | /ə'bændən/
abandon | N  | 2  | /ə'bændən/
EXAMPLE (Key: HW, PS, HN, SN)
HW      | PS | HN | SN | EX
abandon | V  | 1  | 1  | The sailors abandoned the burning ship
abandon | V  | 1  | 2  | He abandoned his wife and went away with all their money
abandon | V  | 1  | 3  | The search was abandoned when night came, even though the child had not been found
abandon | V  | 1  | 4  | He abandoned himself to grief
abandon | N  | 2  | 0  | the state when one's feelings and actions are uncontrolled
abandon | N  | 2  | 0  | freedom from control
CODE (Key: HW, PS, HN, SN)
HW      | PS | HN | SN | GC
abandon | V  | 1  | 1  | T1
abandon | V  | 1  | 2  | T1
abandon | V  | 1  | 3  | T1
abandon | V  | 1  | 4  | T1
abandon | N  | 2  | 0  | U
In the last table the value of the grammar code (GC) depends on the
first four columns, which together form the key. This means that the grammar
code may change from one word sense to another, which is quite often the
case in the dictionary.
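The role of the composite key can be illustrated with a small Python sketch that uses the standard sqlite3 module. It loads a few rows from the DEFINITION and CODE tables above into an in-memory database and joins them on the four key columns. The lowercase table and column names and the choice of SQLite are assumptions made for this sketch; they do not describe how the dictionary database is actually stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE definition (hw TEXT, ps TEXT, hn INTEGER, sn INTEGER, df TEXT);
CREATE TABLE code (hw TEXT, ps TEXT, hn INTEGER, sn INTEGER, gc TEXT,
                   PRIMARY KEY (hw, ps, hn, sn));
""")

# A subset of the rows shown in the DEFINITION and CODE tables above.
conn.executemany("INSERT INTO definition VALUES (?, ?, ?, ?, ?)", [
    ("abandon", "V", 1, 2,
     "to leave (a relation or friend) in a thoughtless or cruel way"),
    ("abandon", "V", 1, 3, "to give up, esp. without finishing"),
])
conn.executemany("INSERT INTO code VALUES (?, ?, ?, ?, ?)", [
    ("abandon", "V", 1, 2, "T1"),
    ("abandon", "V", 1, 3, "T1"),
    ("abandon", "N", 2, 0, "U"),
])

# Join the two tables on the full composite key (HW, PS, HN, SN).
rows = conn.execute("""
    SELECT d.sn, d.df, c.gc
    FROM definition AS d
    JOIN code AS c
      ON d.hw = c.hw AND d.ps = c.ps AND d.hn = c.hn AND d.sn = c.sn
    WHERE d.hw = ? AND d.ps = ?
    ORDER BY d.sn
""", ("abandon", "V")).fetchall()

for sense_number, definition, grammar_code in rows:
    print(sense_number, grammar_code, definition)
```

Since every cell holds a single atomic value, information about one word sense is spread over several tables and has to be reassembled through the shared key.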
Resource-specific format:
This class accounts for resource- or company-specific data structures that
mostly come with access routines or interfaces. Examples are WordNet
(10), which uses data files indexed on byte offsets, and EuroWordNet (11), which
has its own specific import and export format, of which the following is an
example:
The actual names of the features or columns (e.g. "orth" in Comlex)
and the nature of the associated values (e.g. "-e#" in Celex)
constitute the resource-specific vocabulary of the linguistic metadescription.
On top of that, different resources describe the same type of linguistic
information by means of different terms (e.g. "orth" vs
"lemma") or divide up the conceptual space into chunks of
different granularity (compare the LDOCE and WordNet syntactic
subcategorization information; see below).
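The terminological mismatch between resources can be made concrete with a small Python sketch that renames resource-specific field names to a shared set of labels. The shared labels (written_form, entry_id, subcategorization), the mapping table FIELD_MAP and the function normalize are all hypothetical and are not taken from any existing standard. Only the field names "orth", "subc", "lemma" and "lemmano" come from the examples above.

```python
# Hypothetical mapping from resource-specific field names to shared labels.
FIELD_MAP = {
    "comlex": {"orth": "written_form", "subc": "subcategorization"},
    "celex": {"lemma": "written_form", "lemmano": "entry_id"},
}


def normalize(resource, record):
    """Rename the fields of one record; unmapped field names are kept unchanged."""
    mapping = FIELD_MAP[resource]
    return {mapping.get(name, name): value for name, value in record.items()}


comlex_entry = {"orth": "assertion", "subc": [["noun-that-s"], ["noun-be-that-s"]]}
celex_entry = {"lemmano": 26, "lemma": "abbreviation"}

print(normalize("comlex", comlex_entry))
print(normalize("celex", celex_entry))
```

Renaming fields in this way only addresses differences in terminology. Differences in granularity, where resources divide the same information into different categories, require a mapping between the values themselves and cannot be solved by a simple lookup table of this kind.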