Data-Driven Linguistic Ontology Development

Funded by the NSF (#0411348)

 

William Lewis, CSU Fresno, PI

Scott Farrar, Universität Bremen, Contractor

 

An Overview

 

The World Wide Web has become a primary source for disseminating data on the world’s languages, with a variety of language data regularly posted to the Web, including large numbers of scholarly papers on language.  Often embedded in these documents are enriched language data encoded in the form of Interlinear Glossed Text (IGT).  IGT is a standard method for presenting linguistic data, and consists of a line of language data, usually broken down by morpheme, a line of grammatical and gloss information aligned with the text in the first line, and a line representing the translation.  An example is shown in (1).  Variations on this form abound, but the three-line format is the most frequent.

(1)
        Afisi   a-na-ph-a        nsomba
        hyenas  SP-PST-kill-ASP  fish
        'The hyenas killed the fish.'                     (Baker 1988:254)

The Data-Driven Linguistic Ontology Development (DDLOD) project, funded by the NSF (NSF #0411348), seeks to discover the semantic space of Interlinear Glossed Text (IGT) and to use this knowledge to update and maintain the General Ontology of Linguistic Description, or GOLD (see Farrar and Langendoen 2003 for an overview of the ontology).  Although IGT tends to be biased towards morphosyntax, it is still a representative "snapshot" of that domain, and it can also serve as a source of data on other domains that readily use it (e.g., syntax, pragmatics, and discourse).  IGT is being mined off the Web using a specialized tool, an IGT Recognizer, designed to search for text that structurally resembles IGT.  The recognizer uses templates, encoded in a simple regular expression language, to "match" instances of IGT as they exist on the Web.  The templates are easily maintained using a text editor, and thus can be modified without changing code.
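The template idea can be illustrated with a minimal Python sketch.  It is not the project's recognizer: the template name, the three regular expressions, and the function find_igt are all assumptions made for illustration, and a real template file would need to cover many more variants of IGT.

```python
import re

# Hypothetical template: one regular expression per expected line of IGT.
# These patterns are illustrative assumptions, not the project's templates.
TEMPLATES = {
    "three_line": [
        r"^\s*[^\s'\"‘“]",        # language line: text that does not open with a quote
        r"\b[A-Z][A-Z0-9]+\b",    # gloss line: contains an upper-case abbreviation
        r"^\s*['\"‘“].+['\"’”]",  # translation line: a quoted sentence
    ],
}

def find_igt(text, templates=TEMPLATES):
    """Return (template_name, lines) for every window of lines matching a template."""
    lines = [ln for ln in text.splitlines() if ln.strip()]  # ignore blank lines
    hits = []
    for name, patterns in templates.items():
        compiled = [re.compile(p) for p in patterns]
        width = len(compiled)
        for start in range(len(lines) - width + 1):
            window = lines[start:start + width]
            if all(rx.search(ln) for rx, ln in zip(compiled, window)):
                hits.append((name, window))
    return hits

sample = """Some surrounding prose.
Afisi    a-na-ph-a             nsomba
hyenas SP-PST-kill-ASP fish
'The hyenas killed the fish.'   (Baker 1988:254)
More surrounding prose."""

hits = find_igt(sample)
for name, window in hits:
    print(name, window)
```

Because the templates live in a plain dictionary of pattern strings, they could be loaded from a text file and changed without touching the matching code, which is the property the paragraph above describes.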

 

The IGT recognizer is embedded within a crawler, called the IGT Trawler (it casts its nets wide), which crawls URLs for documents that could contain IGT.  Because of the sheer volume of data on the Web, it is neither convenient nor feasible to crawl the entire Web to locate sources of IGT.  The IGT Trawler targets its crawls using results from Google and other search engines.  It also harvests from lists of URLs that are likely to contain enriched language data, such as those provided by Baden Hughes using his language data aggregator built around Google, and those provided by LinguistList.  URLs for online linguistic publications are also routinely crawled.

 

Once instances of IGT are discovered, they are analyzed for linguistic terminology, which can usually be found in the second line of IGT, or the gloss line.  Typically, the gloss line consists of two kinds of glosses:  those that refer to linguistic concepts, such as SP, PST, and ASP, and those that refer to "real-world" concepts, such as hyenas and fish.  The linguistic glosses, which we label grams (following Bybee 1994), are of the most interest to this project, since the concepts they represent are relevant to a linguistic ontology.  The problem is identifying what they mean.
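As a rough illustration of separating the two kinds of glosses, the sketch below splits a gloss line at morpheme boundaries and treats upper-case pieces as grams.  The upper-case test is an assumption based on a common glossing convention, not the project's actual method, and the function name split_gloss_line is hypothetical.

```python
import re

def split_gloss_line(gloss_line):
    """Split a gloss line into (grams, lexical_glosses).

    Assumption: grams are written in upper case (e.g. SP, PST, 1SG),
    while "real-world" glosses are ordinary lower-case words.
    """
    grams, lexical = [], []
    for word in gloss_line.split():
        # Morpheme boundaries are assumed to be marked with '-', '=' or '.'
        for piece in re.split(r"[-=.]", word):
            if not piece:
                continue
            if piece.isupper():
                grams.append(piece)
            else:
                lexical.append(piece)
    return grams, lexical

grams, lexical = split_gloss_line("hyenas SP-PST-kill-ASP fish")
print(grams)
print(lexical)
```

A surface test like this only finds candidate grams; it says nothing about what each gram means, which is the harder problem.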

 

Several techniques are used to decipher the meanings of grams.  Any novel gram in a document (hopefully across multiple instances of IGT within that document) is analyzed for its distributional qualities.  Gross distributional qualities, such as whether a gram attaches to a noun or a verb, can help categorize it.  Take the gram NOM, for instance, which can mean either Nominative Case or Nominalizer.  The fact that Nominative Case is almost always associated with nouns (or constituents typically associated with nouns, such as determiners or adjectives) means that a gram labeled NOM that aligns with a morpheme attached to a noun most likely represents Nominative Case.  Finer-grained distributional qualities can help determine the nature of a gram and increase the likelihood that it will be correctly identified.  This is done by constructing vectors representing the neighborhood of a novel gram (hosts to which it attaches, the presence of other grams and glosses within the same word or IGT instance, and other grams that occupy the immediate periphery), which are then compared against the feature vectors for known grams.  The closer the vector of a known gram lies to that of the novel gram in gram feature space, the more likely the two are either the same gram or close relatives (distances are calculated using cosine measures).  Further information is provided by shallow parses of the translation line (the third line), which can help disambiguate some glosses and grams.  Since the overall process is semi-supervised, the assignment of meanings to novel grams is reviewed and compared to author intent (where discernible).  Any new concepts that are discovered are used to update GOLD.
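The vector comparison can be sketched as follows.  This is a minimal illustration under stated assumptions: the feature names (such as "host:noun"), the counts, and the functions cosine and rank_candidates are all invented for the example and do not come from the project's data or code.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_candidates(novel, known):
    """Rank known grams by similarity to a novel gram, most similar first."""
    scored = [(cosine(novel, vec), label) for label, vec in known.items()]
    return sorted(scored, reverse=True)

# Invented toy counts: how often each known gram was seen in each context.
known = {
    "Nominative Case": Counter({"host:noun": 8, "host:determiner": 2, "cooccurs:ACC": 3}),
    "Nominalizer": Counter({"host:verb": 7, "cooccurs:PST": 2}),
}

# Invented neighborhood of a gram labeled NOM in a newly found document.
novel = Counter({"host:noun": 5, "host:adjective": 1, "cooccurs:ACC": 1})

ranked = rank_candidates(novel, known)
for score, label in ranked:
    print(f"{label}: {score:.3f}")
```

In this toy case the novel NOM shares its noun host and a co-occurring gram with Nominative Case and shares nothing with Nominalizer, so Nominative Case ranks first.  In a semi-supervised setting such a ranking would be a suggestion for a human reviewer, not a final assignment.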

 

GOLD is intended to represent the semantic space of the discipline of linguistics, although it is currently biased towards morphosyntax (which, fortunately, is the same bias reflected by IGT).  Efforts are now underway to expand the inventory of knowledge to other areas, such as phonetics/phonology, syntax, and pragmatics.  GOLD, and the DDLOD project itself, grew out of the larger Electronic Metastructure for Endangered Languages Data (EMELD) project, a five-year NSF-funded project intended to aid in the preservation of endangered languages data and documentation.  You can view a version of GOLD at http://emeld.org/gold-ns.  You can download a current version (in OWL) from http://emeld.org/gold.

 

All tools developed in the process of mining the Web for IGT and updating the ontology will be made publicly available.  An essential part of the DDLOD grant is to develop tools that enable the usage of GOLD, and the mining and updating tools will be part of this suite.