The recent fusion of convolutional and transformer designs has led to steady improvements in model accuracy and efficiency. In this work, we present FastViT, a hybrid vision transformer architecture that achieves a state-of-the-art latency-accuracy trade-off. To this end, we introduce a novel token mixing operator, RepMixer, a core building block of FastViT, which uses structural reparameterization to lower memory access cost by removing skip connections in the network. We further apply train-time overparameterization and large kernel convolutions to boost accuracy, and show empirically that these choices have minimal effect on latency. We show that our model is 3.5 times faster than CMT, a recent state-of-the-art hybrid transformer architecture, 4.9 times faster than EfficientNet, and 1.9 times faster than ConvNeXt on a mobile device at the same accuracy on the ImageNet dataset. At similar latency, our model achieves 4.2% better Top-1 accuracy on ImageNet than MobileOne. Our model consistently outperforms competing architectures across several tasks: image classification, detection, segmentation, and 3D mesh regression, with significant improvements in latency on both a mobile device and a desktop GPU. Furthermore, our model is highly robust to out-of-distribution samples and corruptions, improving over competing robust models.
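As a rough illustration of the reparameterization idea (a minimal sketch, not the paper's implementation), the PyTorch snippet below shows a hypothetical RepMixer-style block: at training time it computes x + DWConv(BN(x)) with a skip connection, and at inference the BatchNorm and the skip connection are folded into a single depthwise convolution, so the block runs as one branch-free operator. The class name RepMixerSketch, the 3x3 depthwise mixer, and the reparameterize() helper are illustrative assumptions.

```python
import torch
import torch.nn as nn


class RepMixerSketch(nn.Module):
    """Hypothetical RepMixer-style token mixer (illustrative, not the paper's code).

    Train time:  y = x + dwconv(bn(x))   (skip connection present)
    Inference:   y = reparam_dwconv(x)   (skip connection folded away)
    """

    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.dim, self.k = dim, kernel_size
        self.norm = nn.BatchNorm2d(dim)
        # Depthwise conv acts as the spatial token mixer.
        self.mixer = nn.Conv2d(dim, dim, kernel_size,
                               padding=kernel_size // 2,
                               groups=dim, bias=True)
        self.reparam = None  # filled in by reparameterize()

    def forward(self, x):
        if self.reparam is not None:
            # Single branch-free op: fewer memory accesses at inference.
            return self.reparam(x)
        return x + self.mixer(self.norm(x))

    @torch.no_grad()
    def reparameterize(self):
        bn, conv = self.norm, self.mixer
        # Per-channel affine form of BN (eval mode): bn(x) = a * x + b.
        a = bn.weight / (bn.running_var + bn.eps).sqrt()
        b = bn.bias - bn.running_mean * a
        # Fold BN (applied *before* the depthwise conv) into the conv:
        # conv(a*x + b) = (a * w) * x + (b * sum(w) + bias) per channel.
        w = conv.weight * a.reshape(-1, 1, 1, 1)
        bias = conv.bias + b * conv.weight.sum(dim=(1, 2, 3))
        # Fold the skip connection: x + conv'(x) = (conv' + identity)(x),
        # i.e. add 1 at the kernel center of each depthwise filter.
        c = self.k // 2
        w[:, 0, c, c] += 1.0
        self.reparam = nn.Conv2d(self.dim, self.dim, self.k,
                                 padding=c, groups=self.dim, bias=True)
        self.reparam.weight.copy_(w)
        self.reparam.bias.copy_(bias)
```

In this sketch, calling reparameterize() after training (with the module in eval mode, so BatchNorm uses its running statistics) yields the same outputs as the two-branch training graph up to numerical precision, while executing a single fused depthwise convolution with no skip connection.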