App developers advertise their apps by creating product pages with app images and bidding on search terms. It is therefore essential that an app's images are highly relevant to the search terms it bids on. Solutions to this problem require an image-text matching model that predicts the quality of the match between a chosen image and a search term. In this work, we present a novel approach to matching an app image with search terms based on fine-tuning a pre-trained LXMERT model. We show that our approach significantly improves matching accuracy compared to the CLIP model and to a baseline that uses a Transformer model for search terms and a ResNet model for images. We evaluate our approach using two sets of labels: advertiser-associated (image, search term) pairs for a given app, and human ratings of the relevance between (image, search term) pairs. Our approach achieves an AUC score of 0.96 on the advertiser-associated ground truth, outperforming the Transformer + ResNet baseline and the fine-tuned CLIP model by 8% and 14%, respectively. On the human-labeled ground truth, our approach achieves an AUC score of 0.95, outperforming the Transformer + ResNet baseline and the fine-tuned CLIP model by 16% and 17%, respectively.
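To make the fine-tuning setup concrete, the following is a minimal sketch (not the authors' released code) of adapting a pre-trained LXMERT to score (image, search term) matches with HuggingFace Transformers. The ImageTermMatcher class, its head, and the stubbed visual features are illustrative assumptions; LXMERT expects pre-extracted region features (e.g., from a Faster R-CNN detector), which are replaced with random tensors here.

```python
# Sketch only: fine-tuning LXMERT for binary image-text matching.
# Checkpoint name and feature shapes are standard for LXMERT, but the
# classification head and training details are assumptions for illustration.
import torch
import torch.nn as nn
from transformers import LxmertModel, LxmertTokenizer

class ImageTermMatcher(nn.Module):
    def __init__(self, checkpoint="unc-nlp/lxmert-base-uncased"):
        super().__init__()
        self.lxmert = LxmertModel.from_pretrained(checkpoint)
        # Binary match/no-match head on top of the cross-modal pooled output.
        self.head = nn.Linear(self.lxmert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask, visual_feats, visual_pos):
        out = self.lxmert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            visual_feats=visual_feats,  # (batch, num_regions, 2048) region features
            visual_pos=visual_pos,      # (batch, num_regions, 4) normalized boxes
        )
        return self.head(out.pooled_output).squeeze(-1)  # match logit per pair

tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = ImageTermMatcher()

# Toy batch: one search term with placeholder region features for one app image.
enc = tokenizer(["casual puzzle game"], return_tensors="pt")
visual_feats = torch.randn(1, 36, 2048)  # placeholder Faster R-CNN features
visual_pos = torch.rand(1, 36, 4)        # placeholder bounding boxes

logit = model(enc["input_ids"], enc["attention_mask"], visual_feats, visual_pos)
loss = nn.functional.binary_cross_entropy_with_logits(logit, torch.ones(1))
loss.backward()  # a fine-tuning step would follow with an optimizer update
```

In this formulation, matching is treated as binary classification over (image, search term) pairs, so the predicted logit can be ranked and evaluated directly with AUC against either set of labels.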