Novel view synthesis from a single image requires inferring occluded regions of objects and scenes while simultaneously maintaining semantic and physical consistency with the input. Existing approaches condition neural radiance fields (NeRFs) on local image features, projecting points onto the input image plane and aggregating the corresponding 2D features for volume rendering. However, under severe occlusion, this projection fails to resolve the uncertainty, resulting in blurry renderings that lack detail. In this paper, we propose NerfDiff, which addresses this issue by distilling the knowledge of a 3D-aware conditional diffusion model (CDM) into NeRF through synthesizing and refining a set of virtual views at test time. We further propose a novel NeRF-guided distillation algorithm that simultaneously generates 3D-consistent virtual views from the CDM samples and finetunes the NeRF based on the improved virtual views. Our approach significantly outperforms existing NeRF-based and geometry-free approaches on challenging datasets, including ShapeNet, ABO, and Clevr3D.
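For concreteness, the sketch below illustrates the local-feature conditioning referred to above: 3D sample points along the target rays are projected onto the input image plane, local 2D features are bilinearly sampled from an encoder feature map, and an MLP predicts density and color for volume rendering. The encoder, MLP, shapes, and all names here are illustrative assumptions, not the architecture used in the paper.

```python
# Minimal sketch of image-conditioned NeRF prediction via pixel-aligned features.
# All module choices and tensor shapes are hypothetical placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageConditionedNeRF(nn.Module):
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        # Toy 2D encoder producing a feature map from the single input view (assumed).
        self.encoder = nn.Conv2d(3, feat_dim, kernel_size=3, padding=1)
        # MLP maps (3D point, projected 2D feature) -> (density, RGB).
        self.mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),
        )

    def forward(self, image, points, K, w2c):
        """
        image:  (1, 3, H, W) input view
        points: (N, S, 3) sample points along N target rays, in world coordinates
        K:      (3, 3) intrinsics of the input view
        w2c:    (4, 4) world-to-camera transform of the input view
        """
        feat_map = self.encoder(image)                                  # (1, C, H, W)
        N, S, _ = points.shape
        pts = points.reshape(-1, 3)

        # Project 3D points onto the input image plane.
        pts_h = torch.cat([pts, torch.ones_like(pts[:, :1])], dim=-1)   # (N*S, 4)
        cam = (w2c @ pts_h.T).T[:, :3]                                  # camera coords
        uv = (K @ cam.T).T
        uv = uv[:, :2] / uv[:, 2:].clamp(min=1e-6)                      # pixel coords

        # Normalize to [-1, 1] and bilinearly sample local 2D features.
        H, W = image.shape[-2:]
        grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                            2 * uv[:, 1] / (H - 1) - 1], dim=-1)
        grid = grid.view(1, 1, -1, 2)
        feats = F.grid_sample(feat_map, grid, align_corners=True)       # (1, C, 1, N*S)
        feats = feats[0, :, 0].T                                        # (N*S, C)

        # Predict density and color conditioned on the sampled features;
        # these would then be composited by standard volume rendering.
        out = self.mlp(torch.cat([pts, feats], dim=-1))
        sigma = F.relu(out[:, :1]).view(N, S, 1)
        rgb = torch.sigmoid(out[:, 1:]).view(N, S, 3)
        return sigma, rgb
```

As the abstract notes, when the query point lies in a region occluded in the input view, the projected 2D feature carries little information, which is the ambiguity the CDM-based distillation is designed to resolve.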