Multiview Diffusion Models for High-Resolution Image Synthesis
DOI: https://doi.org/10.63282/3050-922X.IJERET-V5I3P112

Keywords: Multiview Image Synthesis, Diffusion Models, High-Resolution Generation, Latent Diffusion, Geometric Consistency, View-Consistent Generation, 3D-Aware Generative Models, Novel View Synthesis

Abstract
Multiview image synthesis aims to generate multiple coherent images of a scene from different viewpoints, a capability that is essential for applications such as 3D reconstruction, virtual reality, medical imaging, and autonomous systems. While recent advances in diffusion-based generative models have significantly improved image fidelity and training stability, ensuring geometric, photometric, and semantic consistency across multiple high-resolution views remains a fundamental challenge. This paper presents a comprehensive review of multiview diffusion models for high-resolution image synthesis, focusing on methods that explicitly or implicitly enforce cross-view consistency. We first introduce the theoretical foundations of diffusion models, including denoising diffusion probabilistic models, score-based generative frameworks, and latent diffusion. We then systematically analyze the core challenges of multiview high-resolution synthesis and propose a structured taxonomy of existing approaches based on conditioning strategies, architectural designs, and geometry-aware modeling mechanisms. Furthermore, we review resolution-scaling and computational optimization techniques that enable diffusion models to operate effectively at high resolutions. Widely used datasets and evaluation metrics are discussed, highlighting current limitations in benchmarking multiview consistency. Finally, we survey key application domains and identify open research challenges and future directions. This review provides a unified perspective on the intersection of multiview learning and diffusion-based generation, serving as a valuable reference for researchers and practitioners in generative modeling and 3D vision.
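As a point of reference for the denoising diffusion probabilistic models discussed above, the following is a minimal sketch of the closed-form forward (noising) process; the linear variance schedule and its parameters are illustrative assumptions, not drawn from any specific method surveyed here.

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Illustrative linear variance schedule beta_1..beta_T (assumed values)."""
    return np.linspace(beta_start, beta_end, T)

def forward_diffuse(x0, t, alpha_bars, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bars[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

T = 1000
betas = linear_beta_schedule(T)
alpha_bars = np.cumprod(1.0 - betas)   # cumulative product of (1 - beta_t)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))       # toy stand-in for an image
xt, eps = forward_diffuse(x0, t=T - 1, alpha_bars=alpha_bars, rng=rng)
# As t approaches T, alpha_bar_t approaches 0 and x_t approaches pure noise;
# a denoising network is trained to predict eps from (x_t, t) and invert this process.
```

A reverse (denoising) model would then be trained to predict `eps` from `(xt, t)`; multiview variants additionally condition this network on camera pose or reference views to enforce cross-view consistency.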