Imagine looking at a busy street for a few moments and then trying to draw the scene you saw from memory. Most people could sketch the approximate positions of major objects, such as cars, people, and crosswalks, but almost no one can draw every detail with pixel-perfect precision. The same goes for most modern computer vision algorithms: they're great at capturing the high-level gist of a scene, but they lose fine-grained details as they process information.
Now, MIT researchers have created a system called “FeatUp” that lets algorithms capture all of the high- and low-level details of a scene at the same time, almost like LASIK eye surgery for computer vision.
When computers learn to “see” by looking at images and videos, they build up “ideas” of what is in a scene through something called “features.” To create these features, deep networks and vision foundation models divide images into a grid of small squares and process these squares as a group to determine what is happening in a photo. Each small square is usually made up of somewhere between 16 and 32 pixels, so the resolution of these algorithms is dramatically lower than that of the images they work with. In trying to summarize and understand photos, algorithms lose a huge amount of pixel clarity.
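To make that loss concrete, here is a minimal sketch (an illustration, not code from the paper) of how a patch-based vision model carves an image into a coarse grid, assuming a typical 16-pixel patch size:

```python
import torch

# A 224x224 RGB image fed to a transformer-style vision model that uses
# 16x16 patches (the patch size is an assumed, but common, choice).
image = torch.randn(1, 3, 224, 224)
patch = 16

# Carve the image into non-overlapping squares.
squares = image.unfold(2, patch, patch).unfold(3, patch, patch)
print(squares.shape)  # torch.Size([1, 3, 14, 14, 16, 16])

# The model emits one feature vector per square, so the 224x224 image
# becomes a 14x14 feature grid: a 16x drop in resolution along each axis.
```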
The FeatUp algorithm can stop this information loss and boost the resolution of any deep network without compromising speed or quality, letting researchers quickly and easily improve the resolution of any new or existing algorithm. For example, imagine trying to interpret the predictions of a lung cancer detection algorithm with the goal of locating the tumor. Applying FeatUp before interpreting the algorithm with a method such as class activation maps (CAM) can yield a dramatically more detailed view (16 to 32 times) of where the tumor might be located according to the model.
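As a rough illustration, the sketch below upsamples a backbone's features before computing a CAM-style heatmap. It mirrors the usage example published in the FeatUp repository, but the exact torch.hub entry point, and the hypothetical `class_weights` vector, should be treated as assumptions rather than details from this article:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
# Load a FeatUp upsampler wrapped around a DINO backbone (per the repo's example).
upsampler = torch.hub.load("mhamilton723/FeatUp", "dino16").to(device)

image = torch.randn(1, 3, 224, 224, device=device)  # placeholder for a real image
lr_feats = upsampler.model(image)  # backbone features on a coarse grid
hr_feats = upsampler(image)        # FeatUp features near image resolution

# A CAM-style heatmap is a weighted sum over feature channels; `class_weights`
# is a hypothetical vector taken from a trained classifier head.
class_weights = torch.randn(hr_feats.shape[1], device=device)
heatmap = torch.einsum("c,bchw->bhw", class_weights, hr_feats)
```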
FeatUp not only helps practitioners understand their models, but it can also improve a variety of different tasks such as object detection, semantic segmentation (assigning an object label to each pixel in an image), and depth estimation. It achieves this by providing more precise, high-resolution features, which are crucial for building vision applications ranging from autonomous driving to medical imaging.
“The essence of all computer vision lies in these deep, intelligent features that emerge from the depths of deep learning architectures. The big challenge with modern algorithms is that they reduce large images to very small grids of 'smart' features, gaining intelligent insights but losing the finer details,” says Mark Hamilton, a doctoral student in electrical engineering and computer science at MIT, an affiliate of the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), and co-lead author of a paper about the project. “FeatUp helps achieve the best of both worlds: highly intelligent representations at the resolution of the original image. These high-resolution features significantly boost performance across a spectrum of computer vision tasks, from improving object detection and depth prediction to providing a deeper understanding of your network's decision-making process through high-resolution analysis.”
Rebirth of resolution
As these large AI models become more and more prevalent, there is an increasing need to explain what they're doing, what they're looking at, and what they're thinking.
But how exactly can FeatUp uncover these fine details? Curiously, the secret lies in wiggling and jiggling the images.
In particular, FeatUp applies minor adjustments (such as moving the image a few pixels left or right) and watches how an algorithm responds to these slight movements. This results in hundreds of slightly different deep feature maps, which can be combined into a single, sharp, high-resolution set of deep features. “We imagine that some high-resolution features exist, and that when we wiggle and blur them, they will match all of the original, lower-resolution features from the wiggled images. Our goal is to learn how to refine low-resolution features into high-resolution features using this 'game' that lets us know how well we're doing,” says Hamilton. This methodology is analogous to how algorithms can create a 3D model from multiple 2D images by ensuring that the predicted 3D object matches all of the 2D photos used to create it. In FeatUp's case, they predict a high-resolution feature map that is consistent with all of the low-resolution feature maps formed by jittering the original image.
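A toy version of that 'game' might look like the sketch below: a candidate high-resolution feature map is optimized so that, after the same jitter and a simple blur-and-downsample step, it agrees with the backbone's low-resolution output. The `backbone` here is a stand-in for any frozen vision model, and the fixed average-pool downsampler is a simplification (FeatUp learns its downsampler):

```python
import torch
import torch.nn.functional as F

def jitter(x, dx, dy):
    # Shift the image a few pixels; torch.roll wraps at the borders,
    # a simplification that keeps this sketch short.
    return torch.roll(x, shifts=(dy, dx), dims=(-2, -1))

def fit_hr_features(image, backbone, channels, steps=200):
    # `channels` must match the backbone's feature dimension.
    height, width = image.shape[-2:]
    hr = torch.zeros(1, channels, height, width, requires_grad=True)
    opt = torch.optim.Adam([hr], lr=1e-2)
    for _ in range(steps):
        dx, dy = torch.randint(-4, 5, (2,)).tolist()
        with torch.no_grad():
            target = backbone(jitter(image, dx, dy))  # observed low-res features
        # What the candidate high-res map implies for the same jittered view.
        pred = F.adaptive_avg_pool2d(jitter(hr, dx, dy), target.shape[-2:])
        loss = F.mse_loss(pred, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return hr.detach()
```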
The team notes that the standard tools available in PyTorch were insufficient for their needs, so in their search for a fast and efficient solution they introduced a new type of deep network layer. Their custom layer, a special joint bilateral upsampling operation, was more than 100 times more efficient than a naive implementation in PyTorch. The team also showed that this new layer could improve a wide variety of different algorithms, including semantic segmentation and depth prediction. This layer improved the network's ability to process and understand high-resolution details, giving any algorithm that used it a substantial performance boost.
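For intuition, a naive joint bilateral upsampling step can be written directly in PyTorch, as below: each high-resolution pixel averages nearby feature values with weights that combine spatial distance and color similarity in the full-resolution guidance image, so feature edges snap to image edges. This slow, loop-based version is only an illustration of the operation the team accelerated, not their optimized layer:

```python
import math
import torch
import torch.nn.functional as F

def naive_jbu(lr_feats, guidance, radius=2, sigma_spatial=2.0, sigma_range=0.15):
    # lr_feats: (B, C, h, w) low-res features; guidance: (B, 3, H, W) image in [0, 1].
    B, _, H, W = guidance.shape
    # Start from plain bilinear upsampling, then re-weight using the guidance image.
    up = F.interpolate(lr_feats, size=(H, W), mode="bilinear", align_corners=False)
    out = torch.zeros_like(up)
    norm = torch.zeros(B, 1, H, W, device=guidance.device)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # torch.roll wraps at the borders; another brevity-driven simplification.
            feat_n = torch.roll(up, shifts=(dy, dx), dims=(-2, -1))
            guide_n = torch.roll(guidance, shifts=(dy, dx), dims=(-2, -1))
            spatial_w = math.exp(-(dx * dx + dy * dy) / (2 * sigma_spatial ** 2))
            range_w = torch.exp(
                -((guidance - guide_n) ** 2).sum(dim=1, keepdim=True)
                / (2 * sigma_range ** 2)
            )
            w = spatial_w * range_w  # (B, 1, H, W)
            out += w * feat_n
            norm += w
    return out / norm.clamp_min(1e-8)
```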
“Another application is something called small object retrieval, where our algorithm allows for precise localization of objects. For example, even in cluttered road scenes, FeatUp-enriched algorithms can see tiny objects like traffic cones, reflectors, lights, and potholes where their lower-resolution cousins fail. This demonstrates its ability to enhance coarse features into finely detailed signals,” says Stephanie Fu '22, MNG '23, a doctoral student at the University of California at Berkeley and another co-lead author of the new FeatUp paper. “This is especially critical for time-sensitive tasks, such as identifying a traffic sign on a congested highway in a self-driving car. Not only can this improve the accuracy of such tasks by turning broad guesses into exact locations, but it could also make these systems more reliable, interpretable, and trustworthy.”
What's next?
As for future aspirations, the team emphasizes the potential widespread adoption of FeatUp within the research community and beyond, similar to data augmentation practices. “The goal is to make this method a fundamental tool in deep learning, enriching models to perceive the world in greater detail without the computational inefficiency of traditional high-resolution processing,” says Fu.
“FeatUp represents a wonderful advance in making visual representations truly useful by producing them at full image resolutions,” says Noah Snavely, a computer science professor at Cornell University, who was not involved in the research. “Learned visual representations have gotten really good in recent years, but they are almost always produced at very low resolution; you can put in a beautiful, full-resolution photo and get back a tiny, postage-stamp-sized grid of features. That's a problem if you want to use those features in applications that produce full-resolution results. FeatUp solves this problem in a creative way by combining classic super-resolution ideas with modern learning approaches, generating beautiful high-resolution feature maps.”
“We hope that this simple idea can have wide application. It provides high-resolution versions of image analysis that we previously thought could only be low-resolution,” says senior author William T. Freeman, professor of electrical engineering and computer science at MIT and a CSAIL member.
Co-lead authors Fu and Hamilton are joined by MIT doctoral students Laura Brandt SM '21 and Axel Feldmann SM '21, as well as Zhoutong Zhang SM '21, PhD '22, all current or former MIT CSAIL affiliates. Their research is supported, in part, by a National Science Foundation Graduate Research Fellowship, by the National Science Foundation and the Office of the Director of National Intelligence, by the US Air Force Research Laboratory, and by the US Air Force Artificial Intelligence Accelerator. The group will present their work in May at the International Conference on Learning Representations.