Answering even simple questions about the objects, agents, and actions portrayed in a visual scene requires obtaining and communicating complex knowledge about that scene, and thus the integration of vision with language understanding. In this unit, you will learn about state-of-the-art automated question answering systems; models that combine visual recognition and tracking with language understanding to describe the content of a video in linguistic terms; and a system that can understand stories. Turning to biology, you will learn about the representations of semantic information in the brain, as revealed by fMRI studies.
Boris Katz describes key elements of the START system, an online question answering system that has been in operation for over two decades, and compares its capabilities to those of IBM’s Watson, the system that beat human champions at Jeopardy!
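To give a concrete flavor of the approach: START is known for representing sentences as ternary (subject–relation–object) expressions and matching questions against stored annotations. The Python sketch below is purely illustrative, not START's implementation; the `Ternary` class, the toy knowledge base, and the `answer` function are hypothetical stand-ins for a much richer matcher that handles nested expressions, synonymy, and syntactic alternations.

```python
# Illustrative sketch of ternary-expression matching in the spirit of
# START. Sentences like "A bird can fly" are reduced to ternary
# expressions such as (bird, can, fly), and questions are answered by
# matching their own ternary form against the stored knowledge.

from dataclasses import dataclass

@dataclass(frozen=True)
class Ternary:
    subject: str
    relation: str
    obj: str

# Tiny hypothetical knowledge base mapping ternary expressions to answers.
KB = {
    Ternary("bird", "can", "fly"): "Yes, a bird can fly.",
    Ternary("START", "answers", "questions"): "START answers natural-language questions.",
}

def answer(question: Ternary) -> str:
    """Look up a question that has itself been reduced to a ternary expression."""
    return KB.get(question, "I don't know.")

if __name__ == "__main__":
    print(answer(Ternary("bird", "can", "fly")))   # Yes, a bird can fly.
    print(answer(Ternary("fish", "can", "fly")))   # I don't know.
```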
Andrei Barbu shows how the simple ability to compare an English sentence with a video clip can form the basis for many tasks, such as recognition, image and video retrieval, video caption generation, question answering, and language acquisition.
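As a rough illustration of why one comparison operation supports so many tasks: if a hypothetical function `score(sentence, video)` measures how well a sentence describes a clip, then retrieval, captioning, and question answering each become a search over a different argument of that score. Everything in the sketch below is an assumed stand-in; the real model pairs sentence structure with object detections and tracks rather than the toy word-overlap score used in the demo.

```python
# Conceptual sketch (not the actual model): several language-vision tasks
# reduce to maximizing a single sentence-video compatibility score.

from typing import Any, Callable, Sequence

Video = Any  # stand-in for a decoded clip with object detections/tracks
Score = Callable[[str, Video], float]

def retrieve(query: str, videos: Sequence[Video], score: Score) -> Video:
    """Video retrieval: return the clip best described by the query."""
    return max(videos, key=lambda v: score(query, v))

def caption(video: Video, candidates: Sequence[str], score: Score) -> str:
    """Captioning: return the candidate sentence that best fits the clip."""
    return max(candidates, key=lambda s: score(s, video))

def answer(video: Video, template: str, fillers: Sequence[str],
           score: Score) -> str:
    """Question answering: pick the filler (e.g. an object name) that
    makes the completed sentence most compatible with the clip."""
    return max(fillers, key=lambda f: score(template.format(f), video))

if __name__ == "__main__":
    # Toy demo: a "video" is just a set of detected labels, and the score
    # counts sentence words that match those labels.
    def toy_score(sentence: str, video: Video) -> float:
        return float(len(set(sentence.lower().split()) & video))

    clips = [{"person", "horse"}, {"person", "ball"}]
    print(retrieve("the person rode the horse", clips, toy_score))
```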
Patrick Winston addresses a cognitive ability that distinguishes human intelligence from that of other primates: the ability to tell, understand, and recombine stories. The Genesis story understanding system is a powerful and flexible platform for exploring this capability.
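One way to see what story understanding asks of a system: given only the events a story states explicitly, common-sense if/then rules can infer the unstated connections between them. The sketch below is a toy in that spirit only; Genesis itself operates on parsed English with far richer rules and representations, and the `Event` class and `RULES` table here are hypothetical.

```python
# Minimal sketch of rule-based story elaboration, loosely in the spirit
# of Genesis. Rules of the form "if X harms Y, then Y is angry at X"
# are applied to explicit story events to infer unstated consequences.

from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    actor: str
    action: str
    target: str

# Hypothetical common-sense rules: observed action -> inferred reaction.
RULES = {
    "harms": "is angry at",
    "insults": "dislikes",
}

def elaborate(story: list[Event]) -> list[Event]:
    """Return the story plus the events the rules allow us to infer."""
    inferred = [
        Event(e.target, RULES[e.action], e.actor)
        for e in story if e.action in RULES
    ]
    return story + inferred

if __name__ == "__main__":
    story = [Event("Macbeth", "harms", "Duncan")]
    for e in elaborate(story):
        print(f"{e.actor} {e.action} {e.target}")
```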
Guest speaker Tom Mitchell shows how the neural representations of language meaning can be understood using machine learning methods that decode fMRI signals to reveal the semantics of the words a person is viewing.
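To give a feel for the modelling style: in work of this kind, each voxel's activity is predicted as a weighted sum of semantic features of the stimulus word, and the fitted model is tested by matching predicted activation images to held-out words. The sketch below uses synthetic data and scikit-learn's `Ridge`; the dimensions, the feature vectors, and the leave-two-out test are illustrative assumptions, not the published experimental design.

```python
# Hedged sketch of a linear predictive model of word-evoked fMRI activity:
# voxel responses are modeled as weighted sums of semantic features.
# All data here are synthetic stand-ins generated from a hidden linear map.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_words, n_features, n_voxels = 60, 25, 500

# Synthetic semantic feature vectors and voxel images for each word.
features = rng.standard_normal((n_words, n_features))
true_map = rng.standard_normal((n_features, n_voxels))
images = features @ true_map + 0.5 * rng.standard_normal((n_words, n_voxels))

# Fit one linear model per voxel (Ridge handles all voxels jointly),
# holding out the last two words.
model = Ridge(alpha=1.0).fit(features[:-2], images[:-2])

# Leave-two-out style test: predict the images of the two held-out words
# and check that each prediction is closer to its own image than to the
# other word's image.
pred = model.predict(features[-2:])
d_same = np.linalg.norm(pred - images[-2:], axis=1)
d_swap = np.linalg.norm(pred - images[-2:][::-1], axis=1)
print("correctly matched:", bool(np.all(d_same < d_swap)))
```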
Lesson 5.1: Vision and Language
Description: Combining language and vision processing to solve problems in computer scene recognition and scene understanding, language understanding and knowledge representation in the START question answering system, comparison to IBM’s Watson.
Lesson 5.2: From Language to Vision and Back Again
Description: Using higher-level knowledge to improve object detection, a language-vision model that simultaneously processes sentences and recognizes objects and events in images and video, performing tasks such as image and video retrieval, description generation, and question answering.
Lesson 5.3: Story Understanding
Description: The strong story hypothesis that the ability to tell, understand, and recombine stories distinguishes human intelligence from that of other primates, historical perspective on AI and thinking machines, modelling story understanding in the Genesis system.
Lesson 5.4: Neural Representations of Language Meaning
Description: Modelling the neural representations of language using machine learning to classify words from fMRI data, predictive models for word feature combinations, probing the timing of semantic processing with MEG, neural interpretation of adjective-noun phrases.