Lesson 7: Audition and Speech

How do we use the auditory signals generated by the ear to recognize events in the world, such as running water, a crackling fire, music, or conversation between speakers? In this unit, you will learn about the structure and function of the auditory system, the nature of sound, and how it can be used to recognize sound textures and speech. Physiologically inspired models of these processes capture the neural mechanisms by which the brain processes this information.

Part 1 of Josh McDermott’s lecture provides an overview of the auditory system and neural encoding of sound, and explores the problem of recognizing sound textures. A model of this process exploits statistical properties of the frequency content of an incoming sound.
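
To make the flavor of this model concrete, here is a minimal sketch of texture statistics computed from subband envelopes, loosely in the spirit of McDermott and Simoncelli (2011). It is an illustration, not the published implementation: the Butterworth filterbank, the log-spaced band edges, and the particular statistics (envelope means, variances, and cross-band correlations) are all simplifying assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def texture_statistics(sound, fs, n_bands=8, fmin=100.0, fmax=8000.0):
    """Summary statistics of subband envelopes (illustrative sketch only)."""
    edges = np.geomspace(fmin, fmax, n_bands + 1)  # log-spaced edges, cochlea-like
    envelopes = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, sound)             # isolate one frequency band
        envelopes.append(np.abs(hilbert(band)))    # its amplitude envelope
    env = np.array(envelopes)
    return {
        "mean": env.mean(axis=1),   # average level per band (spectral shape)
        "var": env.var(axis=1),     # envelope variability (crackle vs. steady hum)
        "corr": np.corrcoef(env),   # cross-band correlations (broadband events)
    }
```

Because these statistics are time averages, two different stretches of the same texture (two recordings of rain, say) yield similar values even though their waveforms differ sample by sample.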

Part 2 of Josh McDermott’s lecture delves more deeply into the ability of human listeners to recognize sound textures, using the synthesis of artificial sounds as a powerful tool to test a model of this process. The lecture also examines other cues used to analyze the content of a scene from the sounds that it produces.
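
The synthesis logic can be sketched the same way: start from noise and impose the measured statistics on it. The toy version below (reusing the hypothetical `texture_statistics` sketch above) matches only the per-band envelope means in a single pass; the actual method iteratively imposes a much richer statistic set, so treat this purely as an illustration of the idea.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def spectrum_matched_noise(band_means, edges, fs, n_samples, seed=0):
    """Scale noise subbands so each envelope mean matches a target (toy version)."""
    noise = np.random.default_rng(seed).standard_normal(n_samples)
    out = np.zeros(n_samples)
    for (lo, hi), target in zip(zip(edges[:-1], edges[1:]), band_means):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, noise)
        current = np.abs(hilbert(band)).mean()        # envelope mean of noise band
        out += band * (target / (current + 1e-12))    # rescale to the target mean
    return out
```

The experimental logic follows directly: if a synthetic sound matched on a given statistic set is heard as the same texture as the original, those statistics are plausible candidates for what listeners use; if not, the set is incomplete.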

Nancy Kanwisher presents fMRI studies that led to the discovery of regions of auditory cortex that are specialized for the analysis of particular classes of sound such as music and speech.

Hynek Hermansky addresses the problem of speech processing, beginning with the structure of speech signals, how they are generated by a speaker, how speech is initially processed in the brain, and key aspects of auditory perception. The lecture then reviews the history of speech recognition in machines.

How do we recognize the source of a sound texture, such as the sound of insects or waves? Josh McDermott and colleagues propose a model of this process that uses statistics of the frequency content of the sounds, and the modulation of this content over time, depicted in these spectrograms (bottom) and plots of modulation power (top).

(Courtesy of Elsevier, Inc., http://www.sciencedirect.com. Used with permission. Source: McDermott, Josh H., and Eero P. Simoncelli. “Sound texture perception via statistics of the auditory periphery: evidence from sound synthesis.” Neuron 71, no. 5 (2011): 926-940.)

Part 2 of Hynek Hermansky’s lecture examines the key challenge of recognizing speech in a way that is invariant to large speaker variations and unwanted noise in the auditory signal, and how insights from human audition can inform models of speech processing.

A panel of experts in vision and hearing reflects on the similarities and differences between these two modalities, and on how exploiting synergies between them may accelerate research on vision and audition in brains and machines.

Unit Activities

Useful Background

  • Introductions to neuroscience and statistics
  • The lectures by Hynek Hermansky require background in signal processing

Lesson 7.1: Introduction to Audition Part 1


Description: The study of sound textures and the high-level goals of auditory processing. Anatomy and structure of the auditory system, frequency tuning of cochlear transduction, and perception of real-world textures studied using synthesized textures with matching statistics.

Instructor: Josh McDermott


Click here for the lesson transcript

Click here for the lesson slides

Lesson 7.2: Introduction to Audition Part 2


Description: Testing a model of sound texture perception by synthesizing artificial textures from the statistics of real-world sounds; additional cues listeners use to analyze the content of an auditory scene.

Instructor: Josh McDermott


Click here for the lesson transcript

Lesson 7.3: Human Auditory Cortex


Description: A data-driven approach to understanding the structure of human auditory cortex; applying matrix decomposition and Independent Component Analysis to fMRI data reveals regions specialized for sound categories such as music and speech. A simplified illustration of this kind of decomposition appears after this lesson’s links.

Instructor: Nancy Kanwisher


Click here for the lesson transcript

Click here for the lesson slides
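
As a rough illustration of this kind of decomposition, the sketch below factors a voxel-by-stimulus response matrix with FastICA from scikit-learn. This is a stand-in: the study used its own hypothesis-free decomposition algorithm rather than off-the-shelf FastICA, and the matrix here is random with placeholder dimensions.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Placeholder voxel-by-sound matrix: each row is one voxel's response to each
# sound stimulus (random stand-in for real fMRI data).
rng = np.random.default_rng(0)
responses = rng.standard_normal((5000, 165))

# Factor the matrix into a few components, each pairing a response profile
# across sounds with a weight map across voxels; FastICA seeks components
# whose voxel weight distributions are maximally non-Gaussian.
ica = FastICA(n_components=6, random_state=0)
voxel_weights = ica.fit_transform(responses)   # shape (n_voxels, n_components)
sound_profiles = ica.components_               # shape (n_components, n_sounds)

# A component whose profile is high for music stimuli and whose large weights
# cluster anatomically would suggest a music-selective region.
```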

Lesson 7.4: Auditory Perception in Speech Technology Part 1


Description: Integrating insights from human auditory perception and speech generation into the development of speech production and recognition technology. Dealing with the effects of sound distortion from predictable and unpredictable sources of noise; a simplified sketch of one noise-robustness idea appears after this lesson’s links.

Instructor: Hynek Hermansky


Click here for the lesson transcript

Click here for the lesson slides
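
One classic idea associated with Hermansky’s work on robustness is RASTA-style processing: band-pass filtering the temporal trajectory of each log-spectral channel, so that slowly varying channel effects (such as a fixed microphone coloration) and very fast fluctuations are both suppressed. The sketch below is a simplified rendition with illustrative filter coefficients, not the exact published RASTA filter and not necessarily the approach covered in this lecture.

```python
import numpy as np
from scipy.signal import stft, lfilter

def rasta_like_features(sound, fs):
    """Band-pass filter log-spectral trajectories over time (RASTA-style sketch).

    A steady channel coloration adds a constant to each log-spectral channel,
    so removing near-DC components along the time axis removes it.
    """
    nperseg = int(0.025 * fs)                      # 25 ms analysis frames
    hop = int(0.010 * fs)                          # 10 ms hop
    _, _, spec = stft(sound, fs=fs, nperseg=nperseg, noverlap=nperseg - hop)
    log_spec = np.log(np.abs(spec) ** 2 + 1e-10)   # log power, channels x frames
    # Illustrative band-pass along time: an FIR differentiator (removes DC and
    # slow drift) followed by a leaky integrator (smooths frame-to-frame noise).
    b = np.array([2.0, 1.0, 0.0, -1.0, -2.0]) / 10.0
    a = np.array([1.0, -0.94])
    return lfilter(b, a, log_spec, axis=1)
```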

Lesson 7.5: Auditory Perception in Speech Technology Part 2


Description: Recognizing speech in a way that is invariant to large speaker variations and unwanted noise in the auditory signal; how insights from human audition can inform models of speech processing.

Instructor: Hynek Hermansky


Click here for the lesson transcript

Lesson 7 Panel: Vision and Audition


Description: Panelists discuss developments in the fields of vision and audition, comparing and contrasting the nature of the two modalities and the computational and experimental methods used to study them, and considering how similarities between the fields can be exploited to advance this research.

Instructors: Josh Tenenbaum, Hynek Hermansky, Josh McDermott, Gabriel Kreiman, and Dan Yamins

Click here for the lesson transcript

Further Study

Additional information about the speakers’ research can be found in the following publications:

Hermansky, H., J. R. Cohen, et al. “Perceptual Properties of Current Speech Recognition Technology.” Proceedings of the IEEE 101, no. 9 (2013): 1968–85.

McDermott, J. H. “The Cocktail Party Problem.” Current Biology 19, no. 22 (2009): R1024–27.

McDermott, J. H. “Audition.” In The Oxford Handbook of Cognitive Neuroscience, edited by K. N. Ochsner and S. Kosslyn. Oxford University Press, 2013. ISBN: 9780195381597.

McDermott, J. H., M. Schemitsch, et al. “Summary Statistics in Auditory Perception.” Nature Neuroscience 16, no. 4 (2013): 493–98.

McDermott, J. H., and E. P. Simoncelli. “Sound Texture Perception via Statistics of the Auditory Periphery: Evidence from Sound Synthesis.” Neuron 71, no. 5 (2011): 926–40.

Norman-Haignere, S., N. Kanwisher, et al. “Distinct Cortical Pathways for Music and Speech Revealed by Hypothesis-Free Voxel Decomposition.” Neuron 88, no. 6 (2015): 1281–96.

Yamins, D., and J. J. DiCarlo. “Using Goal-Driven Deep Learning Models to Understand Sensory Cortex.” Nature Neuroscience 19, no. 3 (2016): 356–65.