We introduce ImmerseDiffusion, an end-to-end generative audio model that produces immersive 3D soundscapes conditioned on the spatial, temporal, and environmental conditions of sound objects. ImmerseDiffusion is trained to generate first-order ambisonics (FOA) audio, a conventional spatial audio format comprising four channels that can be rendered to multichannel spatial output. The proposed generative system is composed of a spatial audio codec that maps FOA audio to latent components, a latent diffusion model trained on various user input types, namely text prompts and spatial, temporal, and environmental acoustic parameters, and optionally a spatial audio and text encoder trained in a Contrastive Language-Audio Pretraining (CLAP) style. We propose metrics to evaluate the generation quality and spatial adherence of the generated spatial audio. Finally, we evaluate the model's performance in terms of generation quality and spatial conformance, comparing the two proposed modes: "descriptive", which uses spatial text prompts, and "parametric", which uses non-spatial text prompts and spatial parameters. Our evaluations demonstrate promising results that are consistent with the user conditions and reflect reliable spatial fidelity.
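For readers unfamiliar with the FOA format mentioned above, the sketch below illustrates how a mono source at a given azimuth and elevation is encoded into the four FOA channels. This is the standard first-order AmbiX encoding (ACN channel order with SN3D normalization), shown purely for illustration; it is not code from the proposed system, and the function name `encode_foa` is our own.

```python
import numpy as np

def encode_foa(mono: np.ndarray, azimuth: float, elevation: float) -> np.ndarray:
    """Encode a mono signal into first-order ambisonics (FOA).

    Uses the AmbiX convention: ACN channel order (W, Y, Z, X) with SN3D
    normalization. `azimuth` and `elevation` are in radians; azimuth is
    measured counter-clockwise from the front.
    """
    w = mono                                          # omnidirectional component
    y = mono * np.sin(azimuth) * np.cos(elevation)    # left-right component
    z = mono * np.sin(elevation)                      # up-down component
    x = mono * np.cos(azimuth) * np.cos(elevation)    # front-back component
    return np.stack([w, y, z, x], axis=0)             # shape: (4, num_samples)

# Example: a 1 s, 440 Hz tone placed 45 degrees to the left, slightly elevated.
sr = 48_000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 440.0 * t)
foa = encode_foa(tone, azimuth=np.pi / 4, elevation=np.pi / 12)
print(foa.shape)  # (4, 48000): four FOA channels, one second of audio
```

Such a four-channel signal can then be decoded to an arbitrary loudspeaker layout or binaurally rendered, which is what makes FOA a convenient intermediate target for spatial audio generation.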