Technology: object, scene, and event recognition
Recognizing the limits of existing feature-based object representations, we have developed richer models for object recognition in the following four directions.
- Richer appearance models to account for view-based variations by a mixture of templates, for example, the poselet formalism for describing human parts and aspects (Malik’s group), and the hybrid image templates that integrate shape (sketches), texture (HoG type), and flatness and color patches (Zhu’s group).
- Compositional models that mix different part templates to account for the wider shape variations (Ramanan’s group) and the And-Or Graph grammar models (Yuille’s group, Zhu’s group). Currently, Zhu’s group is comparing performance with Ramanan’s method on several datasets.
- Fine-grained object recognition by modeling subordinate category variations at the part level (Fei-Fei’s group), and the prediction of attributes using parts (Malik’s group).
- Exploring more context information by estimating the relative importance of object categories in a scene in human perception, and then estimating the role of different image characteristics (size, position, low-level saliency of an object’s image) in determining the perceived importance of the object (Perona’s group).
At the scene level, we have studied two major aspects.
- Scene memory. Torralba’s group, in collaboration with Oliva, has been exploring which aspects of computational scene understanding are relevant to understanding human visual memory of images. They studied how various scene components and features contribute to human memory of scenes.
- Grammar for 3D scene layout. Zhu’s group has been developing a scene grammar for 3D background layout and the alignment of foreground objects. The grammar provides strong top-down information to improve object localization. This work is also compatible with the 3D parsing work by Koller’s group.
It becomes evident that we will be able to predict more attributes/properties in human perception through deeper and finer-grained modeling of objects and scenes.
Action and event level
We have studied both human and animal action recognition and high level cognitive models for motion planning and goal reasoning.
- Tracking and analyzing animal (flies, mice) behavior (Perona’s group). Simple patterns of behavior are then studied to reveal underlying genetic mutations.
- Developing intuitive psychology models (Tenenbaum’s group) which treat action understanding as a kind of inverse planning of rational agents. To explain a rational agent’s context-dependent behavior, observers must reason about the preferences and beliefs that caused the behavior, neither of which can be directly observed. To address this problem, we have developed a Bayesian model of human “Theory of Mind” (BToM) that parses actions observed in some environmental context into sequences of beliefs and preferences.
- Goal and intent inference with temporal And-Or graphs (Zhu’s group). Inspired by the cognitive models of Tenenbaum’s group, we have developed a temporal grammar model to represent the sequence of actions in an event and their possible variations. This representation is then used to infer the agent’s intent and goals through bottom-up and top-down computation. This work is an example of collaboration between the MURI team members.
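The inverse-planning idea in the bullets above can be illustrated with a minimal sketch. It is not the BToM model itself: it infers only a goal (not beliefs), and the one-dimensional corridor, the goal names, the `beta` rationality parameter, and the function names are all assumptions made for illustration. The observer assumes the agent picks actions that noisily reduce its distance to the goal and inverts that assumption with Bayes' rule.

```python
import math


def action_likelihood(state, action, goal, beta=2.0, actions=(-1, 1)):
    """P(action | state, goal) for a noisily rational agent (assumed model).

    The agent prefers actions that bring it closer to the goal; beta
    controls how deterministic that preference is.
    """
    def utility(a):
        return -abs((state + a) - goal)

    weights = {a: math.exp(beta * utility(a)) for a in actions}
    return weights[action] / sum(weights.values())


def infer_goal(trajectory, goals, prior=None, beta=2.0):
    """Posterior over candidate goals given a list of visited states."""
    if prior is None:
        prior = {name: 1.0 / len(goals) for name in goals}
    log_post = {name: math.log(prior[name]) for name in goals}
    # Each observed step is treated as one action of size +1 or -1.
    for s, s_next in zip(trajectory, trajectory[1:]):
        action = s_next - s
        for name, position in goals.items():
            log_post[name] += math.log(
                action_likelihood(s, action, position, beta))
    # Normalize in log space for numerical stability.
    top = max(log_post.values())
    unnorm = {k: math.exp(v - top) for k, v in log_post.items()}
    total = sum(unnorm.values())
    return {k: v / total for k, v in unnorm.items()}


# Hypothetical scene: two possible goals at opposite ends of a corridor.
goals = {"door": 0, "window": 10}
posterior = infer_goal([5, 6, 7, 8], goals)
print(posterior)
```

After three steps toward position 10, nearly all posterior mass moves to the hypothetical "window" goal. A richer model would replace the distance-based utility with a planner over the actual environment and would add a latent belief state.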
Synergy between objects, scenes, and events
We have started a series of studies (in Fei-Fei’s group and Zhu’s group) that explore the interactions between humans and objects in scenes. Fei-Fei’s work has focused on sports video, where human actions (poses) and equipment (ball, racket) provide mutual context information to improve recognition. In Zhu’s work, human actions are defined by a set of spatial and temporal relations between humans, their parts, and objects in the scenes. Therefore, action recognition relies on object recognition, and event understanding, in turn, provides top-down information for action recognition, which further improves object recognition, especially for small objects such as the tea cups and phones involved in drinking tea or making phone calls. Human actions, such as trajectories on the floor, also help scene segmentation.
Theory: representation, learning and inference
Knowledge representation and cognitive modeling
We have developed the following representations and cognitive models.
- Modeling subordinate categorization, a.k.a. fine-grained categorization (Perona’s group) based on parts and attributes.
- Developing probabilistic programs for dynamic causal processes (Tenenbaum’s group), which govern how physical objects and intentional agents interact with each other.
- Developing intuitive physics models to capture the core aspects of human common sense (Tenenbaum’s group), testing them by comparison with people’s judgments in precise quantitative experiments, and using them in support of cognitively rich frameworks for visual scene understanding. Examples include an understanding of how physical objects move and interact with each other, how and why people act as they do, and how people interact with objects, their environment, and other people to achieve their goals.
- Developing causal And-Or graphs (Zhu’s group) to represent the causal effects between actions in a scene and the change of fluents (such as door open/close, light on/off). This work is inspired by Pearl’s study of causality. We can now learn causality from video in an unsupervised way. This potentially provides a means for collecting commonsense knowledge from video.
Learning and generalization
Learning and generalizability are essential topics in the MURI project. In year 1, we made progress in both theory and experiments.
- Large scale object recognition. Koller’s group has been studying computationally efficient object recognition with hundreds of categories. The insights gained include: (1) the ability and flexibility to ignore a subset of confusing classes and defer them to lower levels is critical to achieving good accuracy in hierarchical methods; (2) jointly learning the coloring and the associated binary decision boundary can directly maximize the margin of the resulting binary problem, and thus achieves much better accuracy than previous two-stage methods (in which classes are first partitioned using some metric, and then a binary classifier is trained); (3) a good trade-off between the number of active classes in a node and the resulting margin of the binary problem needs to be established to obtain a low generalization error bound.
- Mathematical formalization of “external validity”, transportability, and “multi-source learning”. Pearl’s group has proven theorems about the various licensing conditions for transportability of representations or probabilistic models from one domain to another. This leads to formal representations and computational tools for pooling data effectively and systematically.
- Capacity of And-Or graphs. Wu and Zhu have started studying the capacity of the And-Or graph hypothesis space, which is crucial for generalizability. We identify key factors, such as part localization and part sharing, that lead to a large reduction of the capacity; the models can therefore be learned from a small number of examples in the PAC-learning sense and can generalize better.
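As a point of reference for the capacity argument above, the classical PAC bound for a finite hypothesis space in the realizable case shows how capacity enters the sample complexity. This is a textbook bound, not the specific result of Wu and Zhu, and treating the And-Or graph hypothesis space as a finite set $\mathcal{H}$ is an assumption made only for illustration. With error tolerance $\epsilon$ and confidence $1-\delta$, any learner that outputs a hypothesis consistent with $m$ training examples is probably approximately correct once

```latex
m \;\ge\; \frac{1}{\epsilon}\left( \ln\lvert\mathcal{H}\rvert + \ln\frac{1}{\delta} \right)
```

Under this bound, any structural constraint that shrinks $\lvert\mathcal{H}\rvert$ (for example, sharing parts across categories) lowers the number of examples required, in proportion to the reduction in $\ln\lvert\mathcal{H}\rvert$.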
Algorithm and inference
We are studying inference algorithms for both discriminative and generative models.
- Discriminative max-margin algorithm. Koller’s group proposed a new output-coding based multiclass boosting algorithm using a multiclass hinge loss. The algorithm, called HingeBoost.OC, simultaneously optimizes the coding matrix and the associated binary classifiers under a global hinge loss for classification with many classes.
- Novel statistical sampling algorithms based on exact inference for Gaussian MRFs. Yuille’s group has been studying new ways to rapidly sample from probability distributions, including Gaussian MRFs and discrete MRFs. We demonstrated good results on a range of problems.
- Effective inference on And-Or graphs. Geman’s group defined a distribution jointly on And-Or graphs of latent variables and pixel intensities that is computationally manageable (i.e., the posterior can be explored for inference and parameter estimation). Hence the computation engine, and indeed the representation, is homogeneous. The distribution is context sensitive, so in principle we should be able to feasibly perform context-sensitive inference.
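To make the output-coding idea in the first bullet above concrete, the sketch below shows only the decoding step: given a fixed coding matrix and the real-valued outputs of the binary classifiers, it picks the class whose codeword incurs the smallest total hinge loss. This is standard loss-based decoding, not HingeBoost.OC itself, which also learns the matrix and the classifiers jointly. The matrix, the scores, and the function names are hypothetical.

```python
def hinge(margin):
    """Hinge loss of a single binary margin."""
    return max(0.0, 1.0 - margin)


def decode(coding_matrix, scores):
    """Pick the class whose codeword best agrees with the binary scores.

    coding_matrix[k][l] is +1 or -1 if class k is a positive or negative
    example for binary problem l, and 0 if class k is not used by it.
    scores[l] is the real-valued output of binary classifier l.
    """
    losses = []
    for codeword in coding_matrix:
        loss = sum(hinge(bit * score)
                   for bit, score in zip(codeword, scores)
                   if bit != 0)  # unused entries contribute no loss
        losses.append(loss)
    best = min(range(len(losses)), key=losses.__getitem__)
    return best, losses


# Hypothetical code for 4 classes using 2 binary problems.
coding_matrix = [
    [+1, +1],
    [+1, -1],
    [-1, +1],
    [-1, -1],
]
scores = [0.8, -1.5]  # assumed classifier outputs for one test image
best_class, losses = decode(coding_matrix, scores)
print(best_class, losses)
```

Here the first classifier votes positive and the second votes strongly negative, so the second codeword (index 1) has the lowest loss. With K classes, a code of this kind needs on the order of log2(K) binary problems at minimum, which is one reason output coding is attractive when the number of classes is large.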
Dataset and Benchmark
Dataset, annotation, and benchmarking
In dataset collection, we have made the following progress.
- Datasets. Fei-Fei is continuing her work on ImageNet, and Torralba is continuing his collection of the MIT LabelMe dataset.
- Evaluating the datasets. Torralba, in collaboration with Alyosha Efros, studied the problems and limitations of training and testing on different datasets.
- Visipedia development. Perona’s group developed a model of the process of a human annotating binary properties of an image (e.g. presence/absence of a category). The model allows them to harvest image annotations from hundreds of individuals with different degrees of motivation and ability, and to estimate the most reliable consolidated label for each image, alongside the difficulty of each annotation task, the competence and risk-aversion of each annotator, and the criteria used by each annotator. http://www.vision.caltech.edu/visipedia/
- Interactive annotation. Perona’s group is also developing an architecture to allow machines and humans to collaborate in solving categorization tasks. Their focus is on subordinate categorization tasks (fine-grained categorization), which humans are notoriously poor at. An example is a dataset consisting of 200 species of birds, some of which are visually similar. Humans are best at taking visual measurements related to object parts (e.g. where is the eye of the bird? Is the bird’s belly red?), while machines are best at categorization once all the measurements are done, and at taking metric measurements (e.g. what kind of red?). We find that cooperating humans and machines are far better than either humans or machines alone.
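The label-consolidation idea in the Visipedia bullet above can be sketched with a much simpler model than the one described there. The sketch assumes that each annotator has a single unknown accuracy and that both labels are equally likely a priori; it does not model task difficulty, risk-aversion, or annotator-specific criteria. It alternates between estimating annotator accuracies and image labels (an EM-style procedure). All names and data are hypothetical.

```python
import math


def consolidate(labels, n_iter=20, eps=1e-3):
    """Estimate image labels and annotator accuracies from noisy votes.

    labels maps (image, annotator) -> 0 or 1.
    Returns (probability that each image is positive,
             estimated accuracy of each annotator).
    """
    images = sorted({i for i, _ in labels})
    annotators = sorted({j for _, j in labels})

    # Start from the fraction of positive votes per image.
    p = {}
    for i in images:
        votes = [l for (ii, _), l in labels.items() if ii == i]
        p[i] = sum(votes) / len(votes)

    acc = {}
    for _ in range(n_iter):
        # Accuracy = expected agreement with the current label estimates,
        # clipped away from 0 and 1 so the logarithms below stay finite.
        for j in annotators:
            agree = [p[i] if l == 1 else 1.0 - p[i]
                     for (i, jj), l in labels.items() if jj == j]
            acc[j] = min(1.0 - eps, max(eps, sum(agree) / len(agree)))
        # Re-estimate each label, weighting votes by annotator accuracy.
        for i in images:
            log_pos = log_neg = 0.0
            for (ii, j), l in labels.items():
                if ii != i:
                    continue
                log_pos += math.log(acc[j] if l == 1 else 1.0 - acc[j])
                log_neg += math.log(1.0 - acc[j] if l == 1 else acc[j])
            p[i] = 1.0 / (1.0 + math.exp(log_neg - log_pos))
    return p, acc


# Hypothetical votes: ann_a and ann_b agree with each other; ann_c is erratic.
labels = {
    ("img0", "ann_a"): 1, ("img0", "ann_b"): 1, ("img0", "ann_c"): 0,
    ("img1", "ann_a"): 0, ("img1", "ann_b"): 0, ("img1", "ann_c"): 1,
    ("img2", "ann_a"): 1, ("img2", "ann_b"): 1, ("img2", "ann_c"): 1,
    ("img3", "ann_a"): 0, ("img3", "ann_b"): 0, ("img3", "ann_c"): 0,
}
p_positive, accuracy = consolidate(labels)
print(p_positive)
print(accuracy)
```

On this toy data the procedure learns to trust the two consistent annotators over the erratic one, so the consolidated labels follow their votes. Unlike a plain majority vote, the weighting would also let one reliable annotator outvote several unreliable ones.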