Lesson 5: Vision and Language

The ability to obtain and communicate complex knowledge about a visual scene, in order to answer simple questions about the objects, agents, and actions portrayed, requires the integration of vision with language understanding. In this unit, you will learn about the state of the art in automated question answering systems; models that combine visual recognition and tracking with language understanding to describe the content of a video in linguistic terms; and a system that can understand stories. Turning to biology, you will learn about the representations of semantic information in the brain as revealed by fMRI studies.

(Image © Journal of Artificial Intelligence Research. All rights reserved. This content is excluded from our Creative Commons license. Source: Yu, H., N. Siddharth, A. Barbu, and J. M. Siskind. “A Compositional Framework for Grounding Language Inference, Generation, and Acquisition in Video.” J. Artif. Intell. Res. (JAIR) 52 (2015): 601-713.)

Boris Katz describes key elements of the START system, an online question answering system that has been operating for over two decades, and compares its capabilities to those of IBM’s Watson system, which can beat human players at Jeopardy!

Andrei Barbu shows how the simple ability to compare an English sentence and a video clip can form the basis for many tasks such as recognition, image and video retrieval, generation of video captions, question answering, and language acquisition.

Patrick Winston addresses a cognitive ability that distinguishes human intelligence from that of other primates: the ability to tell, understand, and recombine stories. The Genesis story understanding system is a powerful and flexible platform for exploring this capability.

Guest speaker Tom Mitchell shows how the neural representations of language meaning can be understood using machine learning methods that can decode fMRI signals to reveal the semantics of words experienced by a viewer.

Unit Activities

Useful Background

  • Introductions to machine learning and neuroscience

Lesson 5.1: Vision and Language


Description: Combining language and vision processing to solve problems in computer scene recognition and scene understanding, language understanding and knowledge representation in the START question answering system, comparison to IBM’s Watson.

Instructor: Boris Katz


Click here for the lesson transcript

Click here for the lesson slides

Lesson 5.2: From Language to Vision and Back Again


Description: Using higher level knowledge to improve object detection, language-vision model that simultaneously processes sentences and recognizes image objects and events, performing tasks like image/video retrieval, generating descriptions, and question answering.

Instructor: Andrei Barbu


Click here for the lesson transcript

Click here for the lesson slides
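
The idea in Lesson 5.2 that a single sentence-video comparison can underlie retrieval, description, and question answering can be illustrated with a minimal Python sketch. Everything here is a hypothetical toy: the `score` function simply counts word overlap, and each "video" is reduced to a hand-written set of words, whereas the models discussed in the lesson score real video. The names `VIDEOS`, `score`, `retrieve`, `caption`, and `answer` are assumptions made for illustration, not part of any system described in the lesson.

```python
from typing import Dict, List, Set

# Hypothetical toy data: each "video" is reduced to the set of words that
# imaginary detectors and trackers would support. Real systems score raw video.
VIDEOS: Dict[str, Set[str]] = {
    "clip_a": {"person", "picked", "up", "ball"},
    "clip_b": {"person", "put", "down", "chair"},
}

STOP_WORDS = {"the", "a"}


def score(sentence: str, video_words: Set[str]) -> float:
    """Toy stand-in for a sentence-video score: the fraction of the
    sentence's non-stop words that the video supports."""
    words = [w for w in sentence.lower().split() if w not in STOP_WORDS]
    if not words:
        return 0.0
    return sum(w in video_words for w in words) / len(words)


def retrieve(sentence: str, videos: Dict[str, Set[str]]) -> str:
    """Retrieval: pick the video that best matches a given sentence."""
    return max(videos, key=lambda name: score(sentence, videos[name]))


def caption(video_words: Set[str], candidates: List[str]) -> str:
    """Description: pick the candidate sentence that best matches a video."""
    return max(candidates, key=lambda s: score(s, video_words))


def answer(template: str, fillers: List[str], video_words: Set[str]) -> str:
    """Question answering: fill the blank in the template with each
    candidate and keep the filler whose sentence scores highest."""
    return max(fillers, key=lambda f: score(template.format(f), video_words))
```

All three tasks call the same `score` function and differ only in what is held fixed (the sentence or the video) and what is searched over.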

Lesson 5.3: Story Understanding


Description: The strong story hypothesis that the ability to tell, understand, and recombine stories distinguishes human intelligence from that of other primates, historical perspective on AI and thinking machines, modelling story understanding in the Genesis system.

Instructor: Patrick Winston


Click here for the lesson transcript

Click here for the lesson slides

Seminar 5: Neural Representations of Language


Description: Modelling the neural representations of language using machine learning to classify words from fMRI data, predictive models for word feature combinations, probing the timing of semantic processing with MEG, neural interpretation of adjective-noun phrases.

Instructor: Tom Mitchell


Click here for the lesson transcript
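
As a rough illustration of what "classifying words from fMRI data" means, the sketch below decodes a word from an activation pattern with a nearest-centroid classifier. This is a simple stand-in technique chosen for brevity, not the method presented in the seminar, and the four-number "activation patterns" and the words in `TRAINING` are invented for illustration rather than taken from any real recording.

```python
import math
from typing import Dict, List

Pattern = List[float]


def cosine(u: Pattern, v: Pattern) -> float:
    """Cosine similarity between two activation patterns."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroids(training: Dict[str, List[Pattern]]) -> Dict[str, Pattern]:
    """Average the training patterns recorded for each word."""
    return {
        word: [sum(column) / len(patterns) for column in zip(*patterns)]
        for word, patterns in training.items()
    }


def decode(pattern: Pattern, cents: Dict[str, Pattern]) -> str:
    """Return the word whose average pattern is most similar to the input."""
    return max(cents, key=lambda word: cosine(pattern, cents[word]))


# Hypothetical, hand-made data: two repetitions per word, four "voxels" each.
TRAINING: Dict[str, List[Pattern]] = {
    "hammer": [[0.9, 0.1, 0.2, 0.0], [0.8, 0.2, 0.1, 0.1]],
    "celery": [[0.1, 0.9, 0.0, 0.3], [0.2, 0.8, 0.1, 0.2]],
}
```

For example, `decode([0.7, 0.3, 0.2, 0.1], centroids(TRAINING))` selects the word whose average training pattern points in the most similar direction to the new pattern.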

Further Study

Additional information about the speakers’ research and publications can be found at their websites.

Berzak, Y., A. Barbu, et al. “Do You See What I Mean? Visual Resolution of Linguistic Ambiguities.” (PDF - 2.4MB) Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (2015): 1477–87.

Huth, A. G., S. Nishimoto, et al. “A Continuous Semantic Space Describes the Representation of Thousands of Object and Action Categories Across the Human Brain.” Neuron 76, no. 6 (2012): 1210–24.

Katz, B. “START Natural Language Question Answering System.” (online resource)

Mitchell, T., S. V. Shinkareva, et al. “Predicting Human Brain Activity Associated with the Meanings of Nouns.” (PDF) Science 320 (2008): 1191–95.

Siddharth, N., A. Barbu, et al. “Seeing What You’re Told: Sentence-Guided Activity Recognition in Video.” (PDF) IEEE Conference on Computer Vision and Pattern Recognition (2014).

Sudre, G., D. Pomerleau, et al. “Tracking Neural Coding of Perceptual and Semantic Features of Concrete Nouns.” (PDF - 1.3MB) NeuroImage 62 (2012): 451–63.

Wehbe, L., B. Murphy, et al. “Simultaneously Uncovering the Patterns of Brain Regions Involved in Different Story Reading Subprocesses.” (PDF - 1.1MB) PLOS One (2014): 1–19.

Winston, P. H. “The Genesis Story Understanding and Story Telling System: A 21st Century Step toward Artificial Intelligence.” (PDF) Center for Brains, Minds & Machines, Memo no. 019 (2014).

Winston, P. H. “The Right Way.” Advances in Cognitive Systems 1 (2012): 23–36.

Yu, H., N. Siddharth, et al. “A Compositional Framework for Grounding Language Inference, Generation, and Acquisition in Video.” (PDF - 6.0MB) Journal of Artificial Intelligence Research 52 (2015): 601–713.