TY - JOUR
T1 - A human language corpus for interstellar message construction
AU - Elliott, John
PY - 2011/2
Y1 - 2011/2
N2 - The aim of HuLCC (the human language chorus corpus), is to provide a resource of sufficient size to facilitate inter-language analysis by incorporating languages from all the major language families: for the first time all aspects of typology will be incorporated within a single corpus, adhering to a consistent grammatical classification and granularity, which historically adopt a plethora of disparate schemes. An added feature will be the inclusion of a common text element, which will be translated across all languages, to provide a precise comparable thread for detailed linguistic analysis for translation strategies and a mechanism by which these mappings can be explicitly achieved. Methods developed to solve unambiguous mappings across these languages can then be adopted for any subsequent message authored by the SETI community. Initially, it is planned to provide at least 20,000 words for each chosen language, as this amount of text exceeds the point where randomly generated text can be disambiguated from natural language and is of sufficient size useful for message transmission [1] (Elliot, 2002). This paper details the design of this resource, which ultimately will be made available to SETI upon its completion, and discusses issues 'core' to any message construction.
AB - The aim of HuLCC (the human language chorus corpus), is to provide a resource of sufficient size to facilitate inter-language analysis by incorporating languages from all the major language families: for the first time all aspects of typology will be incorporated within a single corpus, adhering to a consistent grammatical classification and granularity, which historically adopt a plethora of disparate schemes. An added feature will be the inclusion of a common text element, which will be translated across all languages, to provide a precise comparable thread for detailed linguistic analysis for translation strategies and a mechanism by which these mappings can be explicitly achieved. Methods developed to solve unambiguous mappings across these languages can then be adopted for any subsequent message authored by the SETI community. Initially, it is planned to provide at least 20,000 words for each chosen language, as this amount of text exceeds the point where randomly generated text can be disambiguated from natural language and is of sufficient size useful for message transmission [1] (Elliot, 2002). This paper details the design of this resource, which ultimately will be made available to SETI upon its completion, and discusses issues 'core' to any message construction.
KW - Corpus
KW - Language
KW - Message construction
UR - http://www.scopus.com/inward/record.url?scp=78349305148&partnerID=8YFLogxK
U2 - 10.1016/j.actaastro.2010.01.004
DO - 10.1016/j.actaastro.2010.01.004
M3 - Article
AN - SCOPUS:78349305148
SN - 0094-5765
VL - 68
SP - 418
EP - 424
JO - Acta Astronautica
JF - Acta Astronautica
IS - 3-4
ER -