Generating high-quality 3D content requires models capable of learning robust distributions of complex scenes and the real-world objects within them. Recent Gaussian-based 3D reconstruction techniques have achieved impressive results in recovering high-fidelity 3D assets from sparse input images by predicting 3D Gaussians in a feed-forward manner. However, these techniques often lack the extensive priors and expressiveness offered by diffusion models. On the other hand, 2D diffusion models, which have been successfully applied to denoise multi-view images, show potential for generating a wide range of photorealistic 3D outputs, but still fall short on explicit 3D priors and consistency. In this work, we aim to bridge these two approaches by introducing DSplats, a novel method that directly denoises multi-view images using Gaussian Splat-based reconstructors to produce a diverse array of realistic 3D assets. To harness the extensive prior knowledge of 2D diffusion models, we incorporate a pre-trained latent diffusion model into the reconstructor backbone to predict a set of 3D Gaussians. Furthermore, the explicit 3D representation and rendering embedded in the denoising network provide a strong inductive bias, ensuring geometrically consistent novel-view generation. Our qualitative and quantitative experiments demonstrate that DSplats not only produces high-quality, spatially consistent outputs, but also sets a new standard in single-image-to-3D reconstruction. When evaluated on the Google Scanned Objects dataset, DSplats achieves a PSNR of 20.38, an SSIM of 0.842, and an LPIPS of 0.109.
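
To make the core idea concrete, the sketch below illustrates one denoising step of the kind the abstract describes: instead of predicting noise directly, the network regresses a set of 3D Gaussians from the noisy multi-view images, and the clean-image estimate is obtained by rendering those Gaussians, which is what enforces geometric consistency. This is a minimal, hypothetical toy in PyTorch; the module names, shapes, the MLP stand-in for the pretrained latent diffusion backbone, and the placeholder rasterizer are all illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of one DSplats-style denoising step. The denoiser does not
# output noise directly; it regresses 3D Gaussians, and the denoised multi-view
# images are obtained by rendering them. All names/shapes here are illustrative.
import torch
import torch.nn as nn

N_VIEWS, H, W = 4, 64, 64          # toy multi-view resolution
N_GAUSSIANS, GAUSS_DIM = 1024, 14  # e.g. position(3)+scale(3)+rotation(4)+opacity(1)+color(3)

class ToyReconstructor(nn.Module):
    """Stand-in for the diffusion-backbone reconstructor (a pretrained latent
    diffusion U-Net in the paper); here just an MLP over the flattened views."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_VIEWS * 3 * H * W, 512),
            nn.SiLU(),
            nn.Linear(512, N_GAUSSIANS * GAUSS_DIM),
        )

    def forward(self, noisy_views: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Timestep conditioning omitted for brevity; real models embed t.
        return self.net(noisy_views.flatten(1)).view(-1, N_GAUSSIANS, GAUSS_DIM)

def render_views(gaussians: torch.Tensor) -> torch.Tensor:
    """Placeholder for a differentiable Gaussian-splat rasterizer mapping the
    Gaussians to N_VIEWS images; a real one projects and alpha-blends splats."""
    b = gaussians.shape[0]
    proj = torch.randn(GAUSS_DIM, 3 * H * W) * 0.01  # fake fixed "projection"
    img = torch.tanh(gaussians.mean(dim=1) @ proj)
    return img.view(b, 1, 3, H, W).expand(b, N_VIEWS, 3, H, W)

def denoise_step(model, x_t, t, alpha_bar_t):
    """One x0-prediction step: regress Gaussians, render them as the clean-image
    estimate, then recover the implied noise via the standard diffusion identity
    x_t = sqrt(a_bar)*x0 + sqrt(1-a_bar)*eps (scheduler details omitted)."""
    gaussians = model(x_t, t)
    x0_hat = render_views(gaussians)  # geometrically consistent by construction
    eps_hat = (x_t - alpha_bar_t.sqrt() * x0_hat) / (1 - alpha_bar_t).sqrt()
    return x0_hat, eps_hat

model = ToyReconstructor()
x_t = torch.randn(2, N_VIEWS, 3, H, W)
x0_hat, eps_hat = denoise_step(model, x_t, torch.tensor([500]), torch.tensor(0.5))
print(x0_hat.shape, eps_hat.shape)  # both: torch.Size([2, 4, 3, 64, 64])
```

The key design point the sketch captures is that all denoised views are renderings of a single shared 3D Gaussian set, so multi-view inconsistency cannot arise at any denoising step, regardless of how the backbone or rasterizer is actually implemented.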