Answering even simple questions about the objects, agents, and actions portrayed in a visual scene requires obtaining and communicating complex knowledge about that scene, and thus the integration of vision with language understanding. In this unit, you will learn about state-of-the-art automated question answering systems; models that combine visual recognition and tracking with language understanding to describe the content of a video in linguistic terms; and a system that can understand stories. Turning to biology, you will learn about the representations of semantic information in the brain, as revealed by fMRI studies.
Boris Katz describes key elements of the START system, an online question answering system that has been in operation for over two decades, and compares its capabilities to those of IBM’s Watson, the system that beat human champions at Jeopardy!
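To give a concrete flavor of the approach: START is known for representing sentences as ternary (subject–relation–object) expressions and matching questions against stored annotations. The Python sketch below is purely illustrative, not START's implementation; the `Ternary` class, the toy knowledge base, and the `answer` function are hypothetical stand-ins for a much richer matcher that handles nested expressions, synonymy, and syntactic alternations.

```python
# Illustrative sketch of ternary-expression matching in the spirit of
# START. Sentences like "A bird can fly" are reduced to ternary
# expressions such as (bird, can, fly), and questions are answered by
# matching their own ternary form against the stored knowledge.

from dataclasses import dataclass

@dataclass(frozen=True)
class Ternary:
    subject: str
    relation: str
    obj: str

# Tiny hypothetical knowledge base mapping ternary expressions to answers.
KB = {
    Ternary("bird", "can", "fly"): "Yes, a bird can fly.",
    Ternary("START", "answers", "questions"): "START answers natural-language questions.",
}

def answer(question: Ternary) -> str:
    """Look up a question that has itself been reduced to a ternary expression."""
    return KB.get(question, "I don't know.")

if __name__ == "__main__":
    print(answer(Ternary("bird", "can", "fly")))   # Yes, a bird can fly.
    print(answer(Ternary("fish", "can", "fly")))   # I don't know.
```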
Andrei Barbu shows how the simple ability to compare an English sentence with a video clip can form the basis for many tasks, such as recognition, image and video retrieval, video caption generation, question answering, and language acquisition.
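As a rough illustration of why one comparison operation supports so many tasks: if a hypothetical function `score(sentence, video)` measures how well a sentence describes a clip, then retrieval, captioning, and question answering each become a search over a different argument of that score. Everything in the sketch below is an assumed stand-in; the real model pairs sentence structure with object detections and tracks rather than the toy word-overlap score used in the demo.

```python
# Conceptual sketch (not the actual model): several language-vision tasks
# reduce to maximizing a single sentence-video compatibility score.

from typing import Any, Callable, Sequence

Video = Any  # stand-in for a decoded clip with object detections/tracks
Score = Callable[[str, Video], float]

def retrieve(query: str, videos: Sequence[Video], score: Score) -> Video:
    """Video retrieval: return the clip best described by the query."""
    return max(videos, key=lambda v: score(query, v))

def caption(video: Video, candidates: Sequence[str], score: Score) -> str:
    """Captioning: return the candidate sentence that best fits the clip."""
    return max(candidates, key=lambda s: score(s, video))

def answer(video: Video, template: str, fillers: Sequence[str],
           score: Score) -> str:
    """Question answering: pick the filler (e.g. an object name) that
    makes the completed sentence most compatible with the clip."""
    return max(fillers, key=lambda f: score(template.format(f), video))

if __name__ == "__main__":
    # Toy demo: a "video" is just a set of detected labels, and the score
    # counts sentence words that match those labels.
    def toy_score(sentence: str, video: Video) -> float:
        return float(len(set(sentence.lower().split()) & video))

    clips = [{"person", "horse"}, {"person", "ball"}]
    print(retrieve("the person rode the horse", clips, toy_score))
```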
Patrick Winston addresses a cognitive ability that distinguishes human intelligence from that of other primates: the ability to tell, understand, and recombine stories. The Genesis story understanding system is a powerful and flexible platform for exploring this capability.
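One way to see what story understanding asks of a system: given only the events a story states explicitly, common-sense if/then rules can infer the unstated connections between them. The sketch below is a toy in that spirit only; Genesis itself operates on parsed English with far richer rules and representations, and the `Event` class and `RULES` table here are hypothetical.

```python
# Minimal sketch of rule-based story elaboration, loosely in the spirit
# of Genesis. Rules of the form "if X harms Y, then Y is angry at X"
# are applied to explicit story events to infer unstated consequences.

from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    actor: str
    action: str
    target: str

# Hypothetical common-sense rules: observed action -> inferred reaction.
RULES = {
    "harms": "is angry at",
    "insults": "dislikes",
}

def elaborate(story: list[Event]) -> list[Event]:
    """Return the story plus the events the rules allow us to infer."""
    inferred = [
        Event(e.target, RULES[e.action], e.actor)
        for e in story if e.action in RULES
    ]
    return story + inferred

if __name__ == "__main__":
    story = [Event("Macbeth", "harms", "Duncan")]
    for e in elaborate(story):
        print(f"{e.actor} {e.action} {e.target}")
```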
Guest speaker Tom Mitchell shows how the neural representations of language meaning can be understood using machine learning methods that decode fMRI signals to reveal the semantics of the words a person is viewing.
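To give a feel for the modelling style: in work of this kind, each voxel's activity is predicted as a weighted sum of semantic features of the stimulus word, and the fitted model is tested by matching predicted activation images to held-out words. The sketch below uses synthetic data and scikit-learn's `Ridge`; the dimensions, the feature vectors, and the leave-two-out test are illustrative assumptions, not the published experimental design.

```python
# Hedged sketch of a linear predictive model of word-evoked fMRI activity:
# voxel responses are modeled as weighted sums of semantic features.
# All data here are synthetic stand-ins generated from a hidden linear map.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_words, n_features, n_voxels = 60, 25, 500

# Synthetic semantic feature vectors and voxel images for each word.
features = rng.standard_normal((n_words, n_features))
true_map = rng.standard_normal((n_features, n_voxels))
images = features @ true_map + 0.5 * rng.standard_normal((n_words, n_voxels))

# Fit one linear model per voxel (Ridge handles all voxels jointly),
# holding out the last two words.
model = Ridge(alpha=1.0).fit(features[:-2], images[:-2])

# Leave-two-out style test: predict the images of the two held-out words
# and check that each prediction is closer to its own image than to the
# other word's image.
pred = model.predict(features[-2:])
d_same = np.linalg.norm(pred - images[-2:], axis=1)
d_swap = np.linalg.norm(pred - images[-2:][::-1], axis=1)
print("correctly matched:", bool(np.all(d_same < d_swap)))
```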
Lesson 5.1: Vision and Language
Description: Combining language and vision processing to solve problems in computer scene recognition and scene understanding, language understanding and knowledge representation in the START question answering system, comparison to IBM’s Watson.
Lesson 5.2: From Language to Vision and Back Again
Description: Using higher-level knowledge to improve object detection, a language-vision model that simultaneously processes sentences and recognizes objects and events in images and video, performing tasks such as image and video retrieval, description generation, and question answering.
Lesson 5.3: Story Understanding
Description: The strong story hypothesis that the ability to tell, understand, and recombine stories distinguishes human intelligence from that of other primates, historical perspective on AI and thinking machines, modelling story understanding in the Genesis system.
Lesson 5.4: Neural Representations of Language Meaning
Description: Modelling the neural representations of language using machine learning to classify words from fMRI data, predictive models for word feature combinations, probing the timing of semantic processing with MEG, neural interpretation of adjective-noun phrases.