Advancements in display technology have made our viewing experience more intense and pleasurable. Watching something in 4K at 60 FPS is far more satisfying than 1080p at 30 FPS: the former immerses you in the content as if you were witnessing it in person. However, not everyone can enjoy such content, because it is not easy to deliver. One minute of 4K 60 FPS video requires roughly six times more data than 1080p 30 FPS, which puts it out of reach for many users.
However, it is possible to address this issue by increasing the resolution and/or the frame rate of the delivered video. Super-resolution methods focus on increasing the resolution of the video, while video frame interpolation methods focus on increasing the number of frames within the video.
Video frame interpolation adds new frames to a video sequence by estimating the motion between existing frames. The technique is widely used in applications such as slow-motion video, frame-rate conversion, and video compression, and the resulting video generally looks smoother.
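To make the idea concrete, here is a deliberately naive sketch (not any published method): the simplest way to create an in-between frame is to blend the two neighboring frames, which also shows why motion estimation is needed, since plain blending produces ghosting on anything that moves.

```python
# Naive baseline for illustration only: average two neighbouring frames.
# Real interpolation networks estimate motion (e.g., optical flow) so that
# moving objects stay sharp instead of ghosting.
import numpy as np

def blend_midframe(frame_a: np.ndarray, frame_b: np.ndarray) -> np.ndarray:
    """Naive in-between frame: pixel-wise average of the two neighbours."""
    mid = (frame_a.astype(np.float32) + frame_b.astype(np.float32)) / 2.0
    return mid.astype(np.uint8)

# Example with two dummy 1080p RGB frames
f0 = np.zeros((1080, 1920, 3), dtype=np.uint8)        # all black
f1 = np.full((1080, 1920, 3), 255, dtype=np.uint8)    # all white
mid = blend_midframe(f0, f1)                           # uniform gray "ghosted" frame
```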
In recent years, research on video frame interpolation has progressed significantly. Modern methods can render in-between frames quite accurately and provide a pleasant viewing experience.
However, measuring the quality of interpolation results has been a challenging task for years. Existing work mainly relies on off-the-shelf quality metrics, but because video frame interpolation often produces its own characteristic artifacts, these metrics are not always consistent with human perception of the interpolated results.
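For context, the off-the-shelf metrics alluded to here are typically PSNR and SSIM computed against a ground-truth frame. The snippet below is a standard usage sketch with scikit-image and placeholder data, not the authors' evaluation code.

```python
# Standard reference-based quality metrics (PSNR, SSIM) with scikit-image.
# The frames here are random placeholders purely for illustration.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

gt = np.random.rand(256, 256, 3)                       # ground-truth frame
interp = np.clip(gt + 0.05 * np.random.randn(256, 256, 3), 0.0, 1.0)  # "interpolated" frame

psnr = peak_signal_noise_ratio(gt, interp, data_range=1.0)
ssim = structural_similarity(gt, interp, channel_axis=-1, data_range=1.0)
print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.4f}")
```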
Some works have conducted subjective tests or user studies to obtain more reliable measurements, but doing so is time-consuming and expensive, so most methods skip it. So how can we accurately measure the quality of a video interpolation method? It is time to answer that question.
A group of researchers has presented a dedicated perceptual quality metric for measuring the results of video frame interpolation. They designed a novel neural network architecture for video perceptual quality assessment based on Swin Transformers.
The network takes as input a pair of frames, one from the original video sequence and one interpolated frame, and produces a score that represents the perceptual similarity between the two. The first step toward such a network was preparing a dataset, and that is where they started. They built a large dataset of perceptual similarity judgments for video frame interpolation: it contains frame pairs from various videos, along with human judgments of their perceptual similarity. This dataset is used to train the network with a combination of L1 and SSIM objectives.
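Before getting to the losses, here is a rough, hypothetical sketch of what such a Swin-based scoring network could look like, using PyTorch and timm. All names and sizes below are illustrative assumptions, not the authors' released architecture: both frames are encoded with a shared Swin Transformer backbone, and a small head regresses a single score from the concatenated features.

```python
# Hypothetical sketch of a Swin-based perceptual scoring network (not the paper's code).
import torch
import torch.nn as nn
import timm

class SwinQualityNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared Swin backbone used as a feature extractor (classifier removed).
        self.backbone = timm.create_model(
            "swin_tiny_patch4_window7_224", pretrained=False, num_classes=0
        )
        feat_dim = self.backbone.num_features  # 768 for swin_tiny
        # Regression head on the concatenated frame features -> one quality score.
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, frame_a: torch.Tensor, frame_b: torch.Tensor) -> torch.Tensor:
        fa = self.backbone(frame_a)                      # (B, feat_dim)
        fb = self.backbone(frame_b)                      # (B, feat_dim)
        return self.head(torch.cat([fa, fb], dim=1)).squeeze(1)  # (B,)

model = SwinQualityNet()
a = torch.randn(2, 3, 224, 224)   # frame from the original sequence
b = torch.randn(2, 3, 224, 224)   # interpolated frame
score = model(a, b)               # predicted perceptual similarity, shape (2,)
```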
The L1 loss measures the absolute difference between the predicted score and the ground-truth score, while the SSIM loss measures the structural similarity between two images. By combining these two losses, the network is trained to predict scores that are both accurate and consistent with human perception. A great advantage of the proposed method is that it does not rely on reference frames; thus, it can run on client devices, where that information is normally not available.
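As a hedged illustration of such a combined objective (the exact weighting and inputs used in the paper may differ), a weighted sum of an L1 score-regression term and an SSIM-based term could look like the following sketch using PyTorch and torchmetrics.

```python
# Illustrative combined L1 + SSIM objective; the weighting (alpha) and the exact
# inputs are assumptions for this sketch, not taken from the paper.
import torch
import torch.nn.functional as F
from torchmetrics.image import StructuralSimilarityIndexMeasure

ssim = StructuralSimilarityIndexMeasure(data_range=1.0)

def combined_loss(pred_score, gt_score, img_a, img_b, alpha=0.5):
    l1_term = F.l1_loss(pred_score, gt_score)   # keep predicted scores close to human judgments
    ssim_term = 1.0 - ssim(img_a, img_b)        # structural dissimilarity between the two frames
    return alpha * l1_term + (1.0 - alpha) * ssim_term

# Dummy tensors for illustration
pred = torch.rand(4)
gt = torch.rand(4)
img_a = torch.rand(4, 3, 64, 64)
img_b = torch.rand(4, 3, 64, 64)
loss = combined_loss(pred, gt, img_a, img_b)
```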
Ekrem Çetinkaya received his B.Sc. in 2018 and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis on image denoising using deep convolutional networks. He is currently pursuing a Ph.D. at the University of Klagenfurt, Austria, and working as a researcher on the ATHENA project. His research interests include deep learning, computer vision, and multimedia networking.