This paper was accepted at the Image Matching: Local Features & Beyond workshop at CVPR 2024.
Identifying strong and accurate correspondences between images is a fundamental problem in computer vision that enables various downstream tasks. Recent semi-dense matching methods emphasize the effectiveness of fusing relevant cross-view information through Transformer. In this article, we propose several improvements to this paradigm. First, we introduce affine-based local attention to model transverse deformations. Second, we introduce selective fusion to fuse local and global messages from cross-attention. In addition to network structure, we also identify the importance of enforcing spatial smoothness in loss design, which has been omitted in previous work. Based on these increases, our network demonstrates strong matching capabilities in different environments. The full version of our network achieves state-of-the-art performance among semi-dense matching methods at a cost similar to that of LoFTR, while the thin version achieves the baseline performance of LoFTR with only 15% computational cost and 18%. of parameters.