Abstract
Many aspects of linguistic research, whatever their aims and objectives, rely on cross-language analysis for their results. In particular, any research into generic attributes, universals, or inter-language comparisons requires samples of languages in a readily accessible format that are ‘clean’ and of adequate size for statistical analysis. As computer-based corpus linguistics is still a relatively recent discipline, currently available corpora suffer from a lack of breadth and conformity. Probably due in part to funding restrictions, many of the publicly available machine-readable resources are for English or one of the major Indo-European languages and, although this is often frustrating for researchers, it is understandable. An equally problematic aspect of inter-corpus analysis is the lack of agreement between annotation schemes: their format, constituent parts of speech, granularity and classification differ even within a single language such as English. The aim of HuLCC is to provide a corpus of sufficient size to expedite such inter-language analysis by incorporating languages from all the major language families and, in so doing, all types of morphology and word order. Part-of-speech classification and granularity will be consistent across all languages within the corpus and will conform more closely to the main parts of speech originally conceived by Dionysius Thrax than to the fine-grained systems used by the BNC and LOB corpora. This will enable cross-language analysis without the need for cross-mappings between differing annotation systems, or for writing or adapting software each time a different language or corpus is analysed. It is also our intention to encode all text in Unicode, accommodating all script types with a single format, whether they traditionally use standard ASCII, extended ASCII or 16-bit encodings. 
An added feature will be the inclusion of a common text element, translated across all languages to provide both useful translation data and a precisely comparable thread for detailed linguistic analysis. Initially, it is planned to provide at least 20,000 words for each chosen language, as this amount of text exceeds the point at which randomly generated text attains 100% bigram and trigram coverage (Elliott 2002). This contrasts sharply with the much lower coverage percentages attained by natural languages, and provides a statistical rationale for what is often a hotly debated point. Finally, as all constituent language samples within HuLCC conform to the same format and mark-up, a single set of tools will accompany what will be a freely available corpus for the academic world, to facilitate basic analytical needs. This paper outlines the rationale and design criteria that underlie the development and implementation of this corpus.
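The contrast between random and natural-language n-gram coverage can be illustrated with a small sketch. This is not the paper's methodology: the alphabet (26 lowercase letters), the five-letter random "words", and the sample natural-language text are all assumptions chosen for illustration; coverage here means the fraction of all possible character n-grams over the alphabet that actually occur in the text.

```python
import random
import string

def ngram_coverage(text, n, alphabet=string.ascii_lowercase):
    """Fraction of all possible character n-grams over `alphabet`
    that occur in `text`. N-grams containing any character outside
    the alphabet (e.g. spaces) are ignored."""
    possible = len(alphabet) ** n
    seen = set()
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        if all(c in alphabet for c in gram):
            seen.add(gram)
    return len(seen) / possible

random.seed(0)
# ~20,000 five-letter "words" of uniformly random letters
rand_text = " ".join(
    "".join(random.choices(string.ascii_lowercase, k=5))
    for _ in range(20000)
)
# A stand-in for natural language: highly repetitive, so only a
# small, fixed set of bigrams ever appears.
nat_text = "the quick brown fox jumps over the lazy dog " * 2000

rand_bi = ngram_coverage(rand_text, 2)
nat_bi = ngram_coverage(nat_text, 2)
print(f"random bigram coverage:  {rand_bi:.3f}")
print(f"natural bigram coverage: {nat_bi:.3f}")
```

Under these assumptions the random sample saturates the 676 possible letter bigrams almost immediately, while the repetitive natural-language sample stays at a small fraction of them, mirroring the statistical contrast described above.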
Original language | English |
---|---|
Title of host publication | Proceedings of the Corpus Linguistics 2003 conference |
Editors | Dawn Archer, Paul Rayson, Andrew Wilson, Tony McEnery |
Place of Publication | Lancaster |
Publisher | University Centre for Computer Corpus Research on Language |
Pages | 201-210 |
Number of pages | 10 |
Publication status | Published - 28 Mar 2003 |
Event | Corpus Linguistics 2003, Lancaster University, Lancaster, United Kingdom, 28 Mar 2003 → 31 Mar 2003 (https://ucrel.lancs.ac.uk/cl2003/#desc) |
Publication series
Name | University Centre for Computer Corpus Research on Language technical papers |
---|---|
Volume | 16 |
Conference
Conference | Corpus Linguistics 2003 |
---|---|
Country/Territory | United Kingdom |
City | Lancaster |
Period | 28/03/03 → 31/03/03 |