Lexical Resources - Representation format of linguistic resources

Lexical Resources
Wim Peters
NLP group
Department of Computer Science
University of Sheffield
w.peters@dcs.shef.ac.uk


Representation format of linguistic resources

There are various ways in which textual and lexical data can be annotated and structured, depending on theoretical convictions and associated tools. The main types of data structure encountered are enumerated in section 2.3. The most widely used standards for resource representation are SGML, XML and RDF, which are briefly described in sections 2.1 and 2.2.

2.1 SGML and XML

These are widely used standards for annotating text structure. XML has superseded SGML, but many resources are still available in SGML format. For several SGML and XML tutorials and information pages see footnote (1). In 1994 the Text Encoding Initiative (TEI) (2) published a set of detailed recommendations for the encoding and transcription of many types of written and spoken materials, using an extensible SGML framework. This format has also been influential in lexicon creation projects such as PAROLE and SIMPLE, as well as in defining EAGLES and related standards (see section 3).

The following TEI example, taken from its guidelines for encoding print dictionaries (3), shows a dictionary entry that provides information on several aspects of orthography, phonology, syntax and semantics.

<entry>
   <form>
      <orth>competitor</orth>             <!-- orthography -->
      <hyph>com|peti|tor</hyph>           <!-- syllabification -->
      <pron>k@m"petit@(r)</pron>          <!-- pronunciation -->
   </form>
   <gramGrp>
      <pos>n</pos>                        <!-- part of speech -->
   </gramGrp>
   <def>person who competes.</def>        <!-- definition -->
</entry>
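
As a minimal sketch of how such an entry can be processed, the Python snippet below parses the markup with the standard library module xml.etree.ElementTree and collects the leaf elements into a dictionary. The explanatory labels are left out because they are not part of the markup, and the function name entry_to_dict is an assumption made for this illustration, not part of TEI.

```python
import xml.etree.ElementTree as ET

# The TEI dictionary entry from the example above, without the side labels.
TEI_ENTRY = """
<entry>
   <form>
      <orth>competitor</orth>
      <hyph>com|peti|tor</hyph>
      <pron>k@m"petit@(r)</pron>
   </form>
   <gramGrp>
      <pos>n</pos>
   </gramGrp>
   <def>person who competes.</def>
</entry>
"""


def entry_to_dict(xml_text):
    """Flatten an entry into {tag: text}, keeping only leaf elements."""
    root = ET.fromstring(xml_text)
    return {
        el.tag: el.text.strip()
        for el in root.iter()
        if len(el) == 0 and el.text  # leaf element that carries text
    }


entry = entry_to_dict(TEI_ENTRY)
print(entry["orth"], entry["pos"], entry["def"])
```

Flattening discards the grouping into form and gramGrp, which is acceptable for a quick lookup but would lose information in entries with several forms or senses.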

This TEI specification for dictionaries has been adopted and extended within the CONCEDE project (Consortium for Central European Dictionary Encoding) (4).

2.2 RDF

The Resource Description Framework (RDF) (5) is, as its name implies, a framework for describing and interchanging metadata. It provides a model and a syntax for metadata so that independent parties can exchange and use it.

At the core, RDF data consists of nodes and attached attribute/value pairs. Nodes can be any web resources (pages, servers, basically anything for which you can give a Uniform Resource Identifier (URI)), even other instances of metadata. Attributes are named properties of the nodes, and their values are either atomic (text strings, numbers, etc.) or other resources or metadata instances. In short, this mechanism allows us to build labeled directed graphs which can be converted into XML. For a tutorial see footnote (6).

An example is shown below, where the attribute creator attached to the resource uniquely identified by the URI has the value John Smith.

<RDF:RDF>
    <RDF:Description RDF:HREF = "http://URI-of-Document">
    <DC:Creator>John Smith</DC:Creator>
    </RDF:Description>
</RDF:RDF>
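
The node/attribute/value model behind this example can be sketched in a few lines of Python. The snippet below is only an illustration under simplifying assumptions: it stores each statement as a plain (resource, property, value) triple and does not use any RDF library. The names Triple and values_of are invented for this sketch.

```python
from collections import namedtuple

# One edge of the labeled directed graph: resource --property--> value.
Triple = namedtuple("Triple", ["resource", "prop", "value"])

# The single statement from the RDF example above.
graph = [
    Triple("http://URI-of-Document", "DC:Creator", "John Smith"),
]


def values_of(graph, resource, prop):
    """Return all values attached to a resource under a given property."""
    return [t.value for t in graph if t.resource == resource and t.prop == prop]


creators = values_of(graph, "http://URI-of-Document", "DC:Creator")
print(creators)
```

Because a value may itself be a resource identifier, chaining such lookups walks the graph from node to node.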

Different linguistic classification systems will provide different packages of resource/properties/values combinations. These packages are called vocabularies. RDF in itself does not contain any predefined vocabularies for authoring metadata (see section 3).

2.3 Main types of data structure

Typed feature structures:

A feature structure is composed of pairs of attributes (called features) and their values, which can also be seen as partial functions from features to values. Each lexical entry is organized as a list of categorized features. Each list consists of a type symbol followed by zero or more keyword-value pairs. Each value may in turn be an atom, a string, a list of strings, a feature-value list, or a list of feature-value lists. For a more detailed introduction we refer the reader to Shieber (1986) (7). An example is the Comlex Syntax database (8):

(noun :orth "assertion" # orthography
:subc ((noun-that-s) (noun-be-that-s))) # syntactic complementation
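
To make the structure concrete, the following Python sketch reads the entry above into a type symbol plus a dictionary of features. It is a minimal reader written for this one example, not the Comlex access software: the trailing "#" annotations are treated as explanatory labels and omitted, and the names parse and read_entry are assumptions of this sketch.

```python
import re

# The Comlex entry from the example above, without the "#" annotations.
COMLEX_ENTRY = '(noun :orth "assertion" :subc ((noun-that-s) (noun-be-that-s)))'

# Tokens: parentheses, double-quoted strings, or bare symbols/keywords.
TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')


def parse(tokens):
    """Parse the items of one list; the opening '(' is already consumed."""
    items = []
    for tok in tokens:
        if tok == "(":
            items.append(parse(tokens))  # nested list
        elif tok == ")":
            return items
        else:
            items.append(tok.strip('"'))  # atom or string
    return items


def read_entry(text):
    """Return (type symbol, {keyword: value}) for one entry."""
    tokens = iter(TOKEN.findall(text))
    next(tokens)  # consume the opening '('
    type_symbol, *rest = parse(tokens)
    # Keywords and values alternate after the type symbol.
    return type_symbol, dict(zip(rest[0::2], rest[1::2]))


pos, feats = read_entry(COMLEX_ENTRY)
print(pos, feats[":orth"], feats[":subc"])
```

The resulting dictionary behaves like the partial function described above: a lookup succeeds only for the features the entry actually specifies.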

Relational format:

A relational database consists of a set of relations between entities. Each role in that relation is called an attribute. Conceptually, a relation is a table whose columns correspond to attributes, and each row, or tuple, specifies all the values of the attributes of a given entry. Attributes have only atomic values, that is, values which cannot be decomposed. In other words, each row-to-column intersection contains one, and only one, value.

The following example of the Celex Lexical Database (9) shows the morphological structure of the word 'abbreviation'. The unique identifier expressed by the lemma number (lemmano) provides the key into orthographic, syntactic and phonetic information contained in different tables.

lemmano	lemma	morphstatus	Imm1	formation
26	abbreviation	C	abbreviate+ion	-e#

"morphstatus: C" means that the lemma is morphologically complex. "Imm1" is one of the morphological analyses available in Celex, whereas "formation" expresses the rule on the basis of which this deverbal nominalization has been formed (in this case, deletion of the final -e of the verbal root).

The following example taken from LDOCE shows one possible conversion of a printed dictionary entry (abandon) into relational format.

a-ban-don 1 /ə'bændən/ v [T1] 1 to leave completely and for ever; desert: The sailors abandoned the burning ship. 2 to leave (a relation or friend) in a thoughtless or cruel way: He abandoned his wife and went away with all their money. 3 to give up, esp. without finishing: The search was abandoned when night came, even though the child had not been found. 4 (to) to give (oneself) up completely to a feeling, desire, etc.: He abandoned himself to grief | abandoned behaviour. -- ~ment n [U].
abandon 2 n [U] the state when one's feelings and actions are uncontrolled; freedom from control: The people were so excited that they jumped and shouted with abandon / in gay abandon.

The derived relational database has four tables. Each table expresses dependency of the value(s) of one or more columns on a set of key columns. The names of the columns are the following:

HW = headword
PS = part of speech
HN = homograph number
SN = sense number
DF = definition text
EX = example
GC = grammar code
PR = pronunciation

DEFINITION (Key: HW, PS, HN, SN)

HW	PS	HN	SN	DF
abandon	V	1	1	to leave completely and for ever
abandon	V	1	1	desert
abandon	V	1	2	to leave (a relation or friend) in a thoughtless or cruel way
abandon	V	1	3	to give up, esp. without finishing
abandon	V	1	4	to give (oneself) up completely to a feeling, desire, etc.
abandon	N	2	0	the state when one's feelings and actions are uncontrolled
abandon	N	2	0	freedom from control

'0' is used for entries that have only one sense and no explicit numbering in the paper-based entry (see above).

PRONUNCIATION (Key: HW, PS, HN)

HW	PS	HN	PR
abandon	V	1	/ə'bændən/
abandon	N	2	/ə'bændən/

EXAMPLE (Key: HW, PS, HN, SN)

HW	PS	HN	SN	EX
abandon	V	1	1	The sailors abandoned the burning ship
abandon	V	1	2	He abandoned his wife and went away with all their money
abandon	V	1	3	The search was abandoned when night came, even though the child had not been found
abandon	V	1	4	He abandoned himself to grief
abandon	N	2	0	The people were so excited that they jumped and shouted with abandon/in gay abandon

CODE (Key: HW, PS, HN, SN)

HW	PS	HN	SN	GC
abandon	V	1	1	T1
abandon	V	1	2	T1
abandon	V	1	3	T1
abandon	V	1	4	T1
abandon	N	2	0	U

In the last table the value of the grammar code depends on the four key columns. This means that the grammar code may change from one word sense to another, which is quite often the case in the dictionary.
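
The way such tables are linked through their key columns can be illustrated with a small in-memory database. The sketch below uses Python's standard sqlite3 module and loads only a few of the verb rows shown above. The table layout follows the column names listed earlier, but the choice of SQLite and the exact schema are assumptions of this example, not part of the LDOCE conversion.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- A sense may have several definition rows, so no uniqueness is declared here.
CREATE TABLE definition (hw TEXT, ps TEXT, hn INTEGER, sn INTEGER, df TEXT);
-- One grammar code per sense: the four key columns identify a row.
CREATE TABLE code (hw TEXT, ps TEXT, hn INTEGER, sn INTEGER, gc TEXT,
                   PRIMARY KEY (hw, ps, hn, sn));
""")

conn.executemany("INSERT INTO definition VALUES (?, ?, ?, ?, ?)", [
    ("abandon", "V", 1, 1, "to leave completely and for ever"),
    ("abandon", "V", 1, 1, "desert"),
    ("abandon", "V", 1, 3, "to give up, esp. without finishing"),
])
conn.executemany("INSERT INTO code VALUES (?, ?, ?, ?, ?)", [
    ("abandon", "V", 1, 1, "T1"),
    ("abandon", "V", 1, 3, "T1"),
    ("abandon", "N", 2, 0, "U"),
])

# Join the two tables on the shared key to pair definitions with grammar codes.
rows = conn.execute("""
    SELECT d.sn, d.df, c.gc
    FROM definition AS d
    JOIN code AS c
      ON d.hw = c.hw AND d.ps = c.ps AND d.hn = c.hn AND d.sn = c.sn
    WHERE d.hw = 'abandon' AND d.ps = 'V'
    ORDER BY d.sn, d.rowid
""").fetchall()

for sense, definition, grammar_code in rows:
    print(sense, grammar_code, definition)
```

Keeping the grammar code in its own table means it is stored once per sense, however many definitions or examples that sense has.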

Resource-specific format:

This class accounts for resource-specific or company-specific data structures that mostly come with access routines or interfaces. Examples of these are WordNet (10), which uses data files indexed on byte offsets, and EuroWordNet (11), which has its own specific import and export format.

The actual names of the features or columns (e.g. "orth" in Comlex) and the nature of the associated values (e.g. "-e#" in Celex) constitute the resource-specific vocabulary of the linguistic metadescription. On top of that, different resources describe the same type of linguistic information by means of different terms (e.g. "orth" vs "lemma") or divide up the conceptual space into chunks of different granularity (compare the LDOCE and WordNet syntactic subcategorization information; see below).
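
One simple way to picture the terminology problem is a lookup table that maps each resource-specific feature name to a shared label. The Python sketch below is purely illustrative: the shared label "orthography" and the names SHARED_LABELS and to_shared are assumptions made here, and a real mapping would also have to deal with differences in granularity, which a one-to-one renaming cannot express.

```python
# Hypothetical mapping from (resource, feature name) to a shared label.
SHARED_LABELS = {
    ("Comlex", "orth"): "orthography",
    ("Celex", "lemma"): "orthography",
}


def to_shared(resource, record):
    """Rename the features of one record; unmapped features keep their name."""
    return {
        SHARED_LABELS.get((resource, name), name): value
        for name, value in record.items()
    }


comlex = to_shared("Comlex", {"orth": "assertion"})
celex = to_shared("Celex", {"lemma": "abbreviation", "morphstatus": "C"})
print(comlex)
print(celex)
```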

(1) http://msdn.microsoft.com/library/default.asp?url=/library/en-us/xmlsdk30/htm/xmtutxmltutorial.asp
    http://www.projectcool.com/developer/xmlz/xmldtd/
    http://www.oasis-open.org/cover/xml.html
    http://www.oasis-open.org/cover/general.html
    http://www.w3.org/MarkUp/SGML/
(2) http://www.tei-c.org/
(3) http://www.tei-c.org/Guidelines/DI.htm
(4) http://www.itri.bton.ac.uk/projects/concede/
(5) http://www.w3.org/RDF/
(6) http://www710.univ-lyon1.fr/~champin/rdf-tutorial/rdf-tutorial.html
(7) Shieber, S.M. (1986), An Introduction to Unification-Based Approaches to Grammar, CSLI Lecture Notes Series, Chicago: University of Chicago Press.
(8) http://cs.nyu.edu/cs/faculty/grishman/comlex.html
(9) http://www.kun.nl/celex/
(10) http://www.cogsci.princeton.edu/~wn/
(11) http://www.hum.uva.nl/~ewn/
