Transformers have demonstrated impressive performance on class-conditional ImageNet benchmarks, achieving state-of-the-art FID scores. However, their computational complexity grows with the depth and width of the transformer and with the number of input tokens, and they require an irregular approximation to operate even on latent input sequences. In this paper, we address these issues by presenting a novel approach to improving the efficiency and scalability of image generation models, incorporating state space models (SSMs) as a core component and departing from the widely adopted U-Net and transformer-based architectures. We introduce a class of SSM-based models that significantly reduce forward-pass complexity while maintaining comparable performance and operating on exact input sequences without irregular approximations. Through extensive experiments and rigorous evaluation, we demonstrate that our proposed approach reduces the Gflops used by the model without sacrificing the quality of the generated images. Our findings suggest that state space models can be an effective alternative to the attention mechanisms in transformer-based architectures, offering a more efficient solution for large-scale image generation tasks.
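To make the complexity argument concrete, the sketch below (a minimal illustration, not the paper's architecture; all names, shapes, and parameter values are hypothetical) contrasts a discretized diagonal SSM layer, whose cost is linear in the number of tokens because it runs a single recurrent scan, with a single-head self-attention layer, whose cost is quadratic because it materializes an L-by-L score matrix.

```python
# Minimal sketch: linear-time SSM scan vs. quadratic-time self-attention.
# Illustrative only; not the implementation described in the paper.
import numpy as np

def ssm_layer(x, A, B, C):
    """Run a discretized linear SSM over a token sequence.

    x: (L, d_in) tokens, A: (d_state,) diagonal transition,
    B: (d_state, d_in), C: (d_out, d_state).
    Cost is O(L * d_state * d_in), i.e. linear in sequence length L.
    """
    L = x.shape[0]
    h = np.zeros(A.shape[0])            # hidden state
    ys = []
    for t in range(L):                  # single scan over the tokens
        h = A * h + B @ x[t]            # state update (diagonal A)
        ys.append(C @ h)                # readout
    return np.stack(ys)

def attention_layer(x, Wq, Wk, Wv):
    """Single-head self-attention for comparison.

    Building the (L, L) score matrix makes the cost O(L^2 * d),
    which is the term the SSM layer avoids.
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (L, L): quadratic in L
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    L, d, n = 256, 32, 16               # hypothetical: 256 latent tokens, width 32, state size 16
    x = rng.standard_normal((L, d))
    A = rng.uniform(0.0, 0.9, size=n)   # stable diagonal transition
    B = rng.standard_normal((n, d)) * 0.1
    C = rng.standard_normal((d, n)) * 0.1
    print(ssm_layer(x, A, B, C).shape)  # (256, 32)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
    print(attention_layer(x, Wq, Wk, Wv).shape)  # (256, 32)
```

Under these assumptions, the SSM layer processes the exact token sequence in one pass without forming any pairwise interaction matrix, which is the mechanism behind the reduced forward-pass cost claimed above.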