TY - GEN
T1 - Unsupervised discovery of language structure in audio signals
AU - Elliott, John
PY - 2004
Y1 - 2004
AB - Unlike traditional speech processing, the aim of this research, once a signal has been received, is not to identify where individual words begin and end, or to use supervised techniques to detect the pattern set that comprises the signal's lexicon. The rationale underpinning this approach is therefore not to decipher the content of the audio signal, which is a secondary task that assumes language content exists, but to identify what constitutes the physical structure of spoken language, in contrast to other structured phenomena. In essence, the aim is to develop an automated (artificially intelligent) intuitive 'ear' that can detect the rhythm and structure of language with the same accuracy as the human ear, or better. To achieve this, unsupervised learning techniques, which do not rely on prior knowledge of a specific system, underpin generic methods devised to classify unknown phenomena if they are encountered. Results show that amplitude frequency histograms, derived from vertical, horizontal and thresholded analysis, clearly distinguish speech, 'noise' and music, which show distinctive leptokurtic, platykurtic, and either 'tooth-comb' or bimodal profiles, respectively. Birds and apes produce similar but coarser-grained versions of a leptokurtic distribution; however, dolphins and orcas produce profiles almost identical to those of humans, which indicates a similar complexity of sound pattern construction. Individually, the two visualisation methods (Significant Activity Segments (SAS) time series and amplitude frequency histograms) are reasonably robust in their ability to differentiate language from other signals. In particular, time series analysis of SAS can identify language-like communication within a transmission that also includes other structured phenomena, whether natural or artificial. Combining the two methods, however, produces a significantly more robust system, which is believed to be an extremely useful automated first-pass filter for identifying and distinguishing intelligent language-like audio communication without the intervention of supervised techniques.
KW - Audio, Language
KW - Detection
KW - Significant Activity Segments (SAS)
KW - Unsupervised, Learning
UR - http://www.scopus.com/inward/record.url?scp=11144257937&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:11144257937
SN - 0889864551
SN - 9780889864559
T3 - Proceedings of the IASTED International Conference on Circuits, Signals, and Systems
SP - 237
EP - 242
BT - Proceedings of the IASTED International Conference on Circuits, Signals, and Systems
A2 - Rashid, M.H.
T2 - Proceedings of the IASTED International Conference on Circuits, Signals, and Systems
Y2 - 28 November 2004 through 1 December 2004
ER -