Unsupervised discovery of language structure in audio signals

John Elliott*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


Unlike traditional speech processing, the aim of this research, once a signal has been received, is not to identify where individual words begin and end, or to use supervised techniques to detect the set of patterns that comprises the signal's lexicon. The rationale that underpins this approach is therefore not to decipher the content of the audio signal, which is a secondary task and assumes that language content exists, but to identify what constitutes the physical structure of spoken language, in contrast to other structured phenomena. In essence, the goal is to develop an automated (artificially intelligent) intuitive 'ear' that can detect the rhythm and structure of language with the same accuracy as the human ear, or better. To achieve this, unsupervised learning techniques, which do not rely on prior knowledge of a specific system, underpin generic methods devised to classify unknown phenomena, if encountered. Results show that amplitude frequency histograms, derived from vertical, horizontal and thresholded analysis, clearly distinguish speech, 'noise' and music, which show distinctive leptokurtic, platykurtic and either 'tooth-comb' or bimodal profiles respectively. Birds and apes demonstrate similar but coarser-grained versions of a leptokurtic distribution; however, dolphins and orcas produce profiles almost identical to those of humans, which indicates a similar complexity of sound pattern construction. Individually, the two visualisation methods (SAS time series and amplitude frequency histogram) are reasonably robust in their ability to differentiate language from other signals. In particular, time series analysis of Significant Activity Segments (SAS) is able to identify language-like communication within a transmission that includes other structured phenomena, whether natural or artificial.
However, combining these two methods produces a significantly more robust system, which is believed to be an extremely useful automated first-pass filter for identifying and distinguishing intelligent language-like audio communication, without the intervention of supervised techniques.
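
The distinction between leptokurtic (sharply peaked, heavy-tailed) and platykurtic (flat-topped) amplitude profiles can be illustrated with a minimal sketch. This is not the paper's procedure: the vertical, horizontal and thresholded analyses are not described in this abstract, so the sketch simply builds a plain amplitude histogram and computes excess kurtosis. The function names, the bin count, the tolerance `tol` and the synthetic test signals are all assumptions made for illustration, and the synthetic signals are stand-ins, not real audio.

```python
import numpy as np


def amplitude_histogram(signal, bins=64):
    """Return normalised bin counts and bin centres of the amplitude values.

    The bin count is an illustrative assumption, not a value from the paper.
    """
    counts, edges = np.histogram(signal, bins=bins)
    centres = 0.5 * (edges[:-1] + edges[1:])
    return counts / counts.sum(), centres


def excess_kurtosis(signal):
    """Excess kurtosis: fourth central moment over squared variance, minus 3."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    m2 = np.mean(x ** 2)
    m4 = np.mean(x ** 4)
    return m4 / m2 ** 2 - 3.0


def profile_label(signal, tol=0.5):
    """Label a distribution by the sign of its excess kurtosis.

    `tol` is a hypothetical dead band around zero (the Gaussian case).
    """
    k = excess_kurtosis(signal)
    if k > tol:
        return "leptokurtic"
    if k < -tol:
        return "platykurtic"
    return "mesokurtic"


# Synthetic stand-ins (assumptions, not real recordings):
rng = np.random.default_rng(0)
peaked = rng.laplace(size=50_000)        # peaked, heavy-tailed amplitudes
flat = rng.uniform(-1.0, 1.0, 50_000)    # flat-topped amplitudes

hist, centres = amplitude_histogram(peaked)
print(profile_label(peaked), profile_label(flat))
```

A single kurtosis value would not capture the 'tooth-comb' or bimodal profiles reported for music; those would need inspection of the histogram shape itself, for example by counting its peaks.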

Original language: English
Title of host publication: Proceedings of the IASTED International Conference on Circuits, Signals, and Systems
Editors: M.H. Rashid
Number of pages: 6
Publication status: Published - 2004
Event: Proceedings of the IASTED International Conference on Circuits, Signals, and Systems - Clearwater Beach, FL, United States
Duration: 28 Nov 2004 → 1 Dec 2004

Publication series

Name: Proceedings of the IASTED International Conference on Circuits, Signals, and Systems


Conference: Proceedings of the IASTED International Conference on Circuits, Signals, and Systems
Country/Territory: United States
City: Clearwater Beach, FL


  • Audio, Language
  • Detection
  • Significant Activity Segments (SAS)
  • Unsupervised, Learning

