The human language chorus corpus (HULCC)

John Elliott, Debbie Elliott

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Many aspects of linguistic research, whatever their aims and objectives, rely on cross-language analysis for their results. In particular, any research into generic attributes, universals, or inter-language comparisons requires samples of languages in a readily accessible format, which are ‘clean’ and of adequate size for statistical analysis. As computer-based corpus linguistics is still a relatively recent discipline, currently available corpora suffer from a lack of breadth and conformity. Probably due in part to restrictions dictated by funding, many of the machine-readable resources publicly available are for English or one of the major Indo-European languages and, although this is often frustrating for researchers, it is understandable. An equally problematic aspect of inter-corpus analysis is the lack of agreement between annotation schemes: their format, constituent parts of speech, granularity and classification, even within a single language such as English. The aim of HuLCC is to provide a corpus of sufficient size to expedite such inter-language analysis by incorporating languages from all the major language families and, in so doing, all types of morphology and word order. Part-of-speech classification and granularity will be consistent across all languages within the corpus and will conform more closely to the main parts of speech originally conceived by Dionysius Thrax than to the fine-grained systems used by the BNC and LOB corpora. This will enable cross-language analysis without the need for cross-mappings between differing annotation systems, or for writing or adapting software each time a different language or corpus is analysed. It is also our intention to encode all text in Unicode to accommodate all script types with a single format, whether they traditionally use standard ASCII, extended ASCII or 16-bit encodings. An added feature will be the inclusion of a common text element, translated across all languages, to provide both useful translation data and a precise comparable thread for detailed linguistic analysis. Initially, it is planned to provide at least 20,000 words for each chosen language, as this amount of text exceeds the point where randomly generated text attains 100% bigram and trigram coverage (Elliott, 2002). This contrasts sharply with the much lower coverage attained by natural languages and provides a statistical rationale for what is often a hotly debated point. Finally, as all constituent language samples within HuLCC conform to the same format and mark-up, a single set of tools will accompany what will be a freely available corpus for the academic community, to facilitate basic analytical needs. This paper outlines the rationales and design criteria that underlie the development and implementation of this corpus.
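
As a rough illustration of the coverage statistic behind the 20,000-word threshold, the sketch below computes character-level bigram and trigram coverage for a randomly generated sample. The character-level reading of the n-grams, the 26-letter symbol inventory, and the way the random sample is built are assumptions made purely for illustration here; they are not the procedure reported in Elliott (2002).

    """Minimal sketch (illustrative only): character-level bigram and trigram
    coverage of a randomly generated text sample. Alphabet, sample size and
    the character-level reading of 'coverage' are assumptions, not the
    authors' actual method."""

    import itertools
    import random
    import string


    def ngram_coverage(text: str, n: int, alphabet: str) -> float:
        """Fraction of all possible n-grams over `alphabet` occurring in `text`."""
        possible = {"".join(p) for p in itertools.product(alphabet, repeat=n)}
        observed = {text[i:i + n] for i in range(len(text) - n + 1)}
        return len(observed & possible) / len(possible)


    if __name__ == "__main__":
        alphabet = string.ascii_lowercase  # assumed 26-letter inventory
        # Roughly 20,000 "words" of five random letters each.
        random_text = "".join(random.choice(alphabet) for _ in range(20_000 * 5))
        for n in (2, 3):
            print(f"{n}-gram coverage (random text): "
                  f"{ngram_coverage(random_text, n, alphabet):.2%}")
        # A natural-language sample of comparable length would be expected to
        # show markedly lower coverage, which is the contrast the abstract draws on.

Running the same function over a genuine 20,000-word language sample in place of the random string gives the natural-language figure the abstract contrasts with.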
Original language: English
Title of host publication: Proceedings of the Corpus Linguistics 2003 conference
Editors: Dawn Archer, Paul Rayson, Andrew Wilson, Tony McEnery
Place of publication: Lancaster
Publisher: University Centre for Computer Corpus Research on Language
Pages: 201-210
Number of pages: 10
Publication status: Published - 28 Mar 2003
Event: Corpus Linguistics 2003 - Lancaster University, Lancaster, United Kingdom
Duration: 28 Mar 2003 to 31 Mar 2003
https://ucrel.lancs.ac.uk/cl2003/#desc

Publication series

Name: University Centre for Computer Corpus Research on Language technical papers
Volume: 16

Conference

Conference: Corpus Linguistics 2003
Country/Territory: United Kingdom
City: Lancaster
Period: 28/03/03 to 31/03/03
Internet address: https://ucrel.lancs.ac.uk/cl2003/#desc
