Cross attention is a fundamental building block of AI models that can understand multiple forms of data simultaneously. Think of language models that can understand images, such as those used in ChatGPT, or text-to-video models such as Sora.
This summary walks through the key mathematical operations inside cross attention, so that you can understand how it works at a fundamental level.
Cross attention is used when modeling a variety of data types, each of which may represent its input differently. For natural language data, you would typically use a word-to-vector embedding, combined with a positional encoding, to compute a vector that represents each word.
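As a minimal sketch of that text path (in PyTorch, with an illustrative vocabulary size, model dimension, and token IDs), the snippet below adds a sinusoidal positional encoding to a learned token embedding:

```python
import torch
import torch.nn as nn

d_model, vocab_size, max_len = 64, 1000, 128  # illustrative sizes

token_embedding = nn.Embedding(vocab_size, d_model)

# Sinusoidal positional encoding, as in the original Transformer paper.
pos = torch.arange(max_len).unsqueeze(1).float()
div = torch.exp(torch.arange(0, d_model, 2).float()
                * (-torch.log(torch.tensor(10000.0)) / d_model))
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(pos * div)
pe[:, 1::2] = torch.cos(pos * div)

token_ids = torch.tensor([[5, 42, 7]])  # hypothetical token IDs for one sentence
text_vectors = token_embedding(token_ids) + pe[: token_ids.size(1)]
print(text_vectors.shape)  # torch.Size([1, 3, 64]): one vector per word
```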
For visual data, one could pass the image through an encoder designed specifically to summarize the image as a vector representation.
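Here is a similarly minimal sketch of the image path. A real system would typically use a pretrained vision encoder (for example a ViT), but a single patchifying convolution is enough to show the idea: the image is summarized as a sequence of vectors in the same model dimension as the text vectors.

```python
import torch
import torch.nn as nn

d_model = 64  # same illustrative model dimension as the text path

image_encoder = nn.Sequential(
    nn.Conv2d(3, d_model, kernel_size=16, stride=16),  # 16x16 patches -> d_model channels
    nn.Flatten(2),                                     # (B, d_model, num_patches)
)

image = torch.randn(1, 3, 224, 224)                    # dummy RGB image
image_vectors = image_encoder(image).transpose(1, 2)   # (1, 196, 64): one vector per patch
print(image_vectors.shape)
```

In a cross attention layer, these image vectors would typically supply the keys and values, while the text vectors supply the queries.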