It is customary in fluid mechanics to distinguish between the Lagrangian and Eulerian flow field formulations. According to Wikipedia, “The Lagrangian flow field specification is an approach to studying fluid motion in which the observer follows a discrete parcel of fluid as it flows through space and time. The trajectory line of a parcel can be determined by plotting its location over time. This could be represented as floating along a river while sitting in a boat. The Eulerian flow field specification is a method of analyzing fluid motion that places special emphasis on the locations in space through which the fluid flows as time passes. Sitting on the bank of a river and watching the water pass a fixed point will help you visualize this.”
These perspectives shape how researchers examine recordings of human action. In the Eulerian view, one remains fixed at a spatial location, such as (x, y) or (x, y, z), and studies how the feature vector at that location evolves over time. In the Lagrangian view, one instead follows an entity, say a human, through space-time along with its associated feature vector, as the sketch below illustrates. Earlier research on activity recognition frequently adopted the Lagrangian viewpoint. However, with the rise of neural networks based on 3D space-time convolution, the Eulerian view has become the norm in state-of-the-art methods such as SlowFast Networks, and it has persisted even after the shift to transformer architectures.
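To make the distinction concrete, here is a minimal sketch on a toy video tensor: the Eulerian reading samples the signal at one fixed pixel over time, while the Lagrangian reading samples it along a trajectory. The array shapes and the made-up trajectory are assumptions for illustration, not something taken from the paper.

```python
import numpy as np

# Toy video: T frames of H x W RGB pixels.
T, H, W, C = 16, 64, 64, 3
video = np.random.rand(T, H, W, C)

# Eulerian view: watch a fixed spatial location and record how the
# signal at that location evolves over time.
fixed_y, fixed_x = 32, 32
eulerian_features = video[:, fixed_y, fixed_x, :]          # shape (T, C)

# Lagrangian view: follow a moving entity (here, a hypothetical track of
# (y, x) positions, one per frame) and record the signal along its path.
track = [(32 + t, 32 + t // 2) for t in range(T)]           # made-up trajectory
lagrangian_features = np.stack(
    [video[t, y, x, :] for t, (y, x) in enumerate(track)]
)                                                           # shape (T, C)
```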
This is significant because the transformer tokenization step gives us an opportunity to revisit the question: “What should be the counterparts of words in video analysis?” Dosovitskiy et al. proposed image patches as a good choice for images, and extending that idea to video suggests that spacetime cuboids might be suitable tokens. Instead, the researchers adopt the Lagrangian perspective for analyzing human behavior: they explicitly model the trajectory of an entity through time. The entity can be high level, such as a human, or low level, such as a pixel or a patch. Because they are interested in understanding human behavior, they choose to operate at the level of “humans as entities.”
To do this, they use the trajectory of each person in a video as the basis for recognizing that person's actions. These trajectories are recovered with the recently released PHALP and HMR 2.0 3D tracking techniques. Figure 1 illustrates how PHALP recovers person tracks from video by lifting people to 3D, which allows the method to link people across frames and access their 3D representations. These 3D representations of people, their 3D poses and locations, form the fundamental elements of each token. This yields a flexible system in which the model, in this case a transformer, takes as input tokens belonging to multiple people, each carrying identity, 3D pose, and 3D location. The 3D locations of the people in the scene let the model reason about interpersonal interactions. A rough sketch of this tokenization idea is shown below.
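The following sketch is not the authors' actual implementation; it simply illustrates the idea of building one token per person per frame from an assumed 3D pose vector, 3D location, and track identity, and feeding the resulting sequence to a standard transformer encoder. All dimensions, field names, and the pooling step are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical per-frame track entries: for each detected person we assume a
# 3D pose vector (e.g., flattened joint rotations) and a 3D location, as
# recovered by a tracker such as PHALP. Shapes here are illustrative only.
POSE_DIM, LOC_DIM, EMBED_DIM = 226, 3, 256

class PersonTokenizer(nn.Module):
    """Turns one person's state in one frame into a single token."""
    def __init__(self):
        super().__init__()
        self.pose_proj = nn.Linear(POSE_DIM, EMBED_DIM)
        self.loc_proj = nn.Linear(LOC_DIM, EMBED_DIM)
        self.id_embed = nn.Embedding(64, EMBED_DIM)   # up to 64 track identities

    def forward(self, pose, loc, track_id):
        return self.pose_proj(pose) + self.loc_proj(loc) + self.id_embed(track_id)

tokenizer = PersonTokenizer()
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=EMBED_DIM, nhead=8, batch_first=True),
    num_layers=4,
)

# Two people tracked over 16 frames -> 32 person-frame tokens for one clip.
T, N = 16, 2
pose = torch.randn(T * N, POSE_DIM)
loc = torch.randn(T * N, LOC_DIM)
ids = torch.arange(N).repeat(T)
tokens = tokenizer(pose, loc, ids).unsqueeze(0)        # (1, T*N, EMBED_DIM)
clip_repr = encoder(tokens).mean(dim=1)                # pooled clip representation
```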
This tokenization-based model surpasses previous baselines that only had access to pose data, thanks to its use of 3D tracking. Although the evolution of a person's pose over time is a powerful signal, some activities require additional knowledge about the environment and the person's appearance. It is therefore crucial to combine pose with appearance and scene information derived directly from the pixels. To do so, the researchers additionally employ state-of-the-art action recognition models to supply complementary information based on the contextualized appearance of the people and the environment, still within a Lagrangian framework: they run those models densely along the path of each track and record the contextualized appearance features around it, as sketched below.
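A minimal sketch of this fusion step, assuming the pose tokens from the previous snippet and appearance features already pooled from a video backbone (e.g., a SlowFast- or MViT-style model) around each person's track, might look as follows. The concatenate-and-project scheme and all dimensions are assumptions for illustration, not the exact design from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical fusion of the two token streams: pose/location tokens and
# contextualized appearance features pooled along each person's track.
POSE_TOKEN_DIM, APPEARANCE_DIM, FUSED_DIM = 256, 1024, 512

fuse = nn.Sequential(
    nn.Linear(POSE_TOKEN_DIM + APPEARANCE_DIM, FUSED_DIM),
    nn.GELU(),
)

T, N = 16, 2
pose_tokens = torch.randn(T * N, POSE_TOKEN_DIM)        # from the tokenizer above
appearance = torch.randn(T * N, APPEARANCE_DIM)         # pooled along each track
fused_tokens = fuse(torch.cat([pose_tokens, appearance], dim=-1))  # (T*N, FUSED_DIM)
```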
The resulting tokens, processed by action recognition backbones, contain explicit information about people's 3D pose as well as densely sampled pixel appearance. On the challenging AVA v2.2 dataset, the full system outperforms the prior state of the art by a significant margin of 2.8 mAP. Overall, the key contribution is a methodology that highlights the benefits of 3D pose and tracking for understanding human movement. Researchers from UC Berkeley and Meta AI propose this Lagrangian Action Recognition with Tracking (LART) method, which uses people's tracks to predict their actions. Its baseline version, relying only on the trajectories and 3D poses of the people in the video, outperforms previous baselines that used pose information. Furthermore, they show that standard baselines that only consider video appearance and context can be easily integrated with the proposed Lagrangian view of action recognition, yielding notable improvements over the prevailing paradigm.
Check out the paper, GitHub repository, and project page for more details.