3D Reconstruction from Monocular Videos Using Neural Radiance Fields (NeRF)

Authors

  • Sajud Hamza Elinjulliparambil, Pace University

DOI:

https://doi.org/10.63282/3050-922X.IJERET-V3I4P113

Keywords:

Neural Radiance Fields, 3D Reconstruction, Monocular Video, Implicit Neural Representations, Pose Optimization

Abstract

Monocular video-based 3D reconstruction is a fundamental yet challenging problem in computer vision owing to depth ambiguity, scale uncertainty, and limited viewpoint coverage. Traditional geometry-based approaches, including Structure-from-Motion (SfM), Multi-View Stereo (MVS), and SLAM, offer only partial solutions and often yield incomplete or noisy reconstructions. Neural Radiance Fields (NeRF) broke this paradigm by modelling a scene as a continuous volumetric function that maps 3D coordinates and viewing directions to colour and volume density, from which photorealistic novel-view images are synthesized via differentiable volume rendering. This review traces the development of NeRF and its extensions to monocular video, including sparse-view adaptations (PixelNeRF, DietNeRF, RegNeRF), dynamic and deformable scene modelling (D-NeRF, NSFF, NeRF-T), and optimization strategies such as pose estimation, regularization, and efficiency improvements. We discuss evaluation protocols, datasets, and applications in AR/VR, robotics, cultural heritage, and digital content creation. Finally, we critically reflect on the limitations of NeRF and identify future directions, including stronger priors for monocular input, faster inference, generalizable architectures, and lightweight models. The paper provides a detailed overview of the methods that underpin neural-radiance-field-based monocular-video reconstruction and of the preconditions for further progress in this direction.
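For readers who want the formulation behind this summary, the core model of Mildenhall et al. [11] represents a scene as a function F_Θ(x, d) → (c, σ) that maps a 3D point x and viewing direction d to an emitted colour c and volume density σ. The expected colour of a camera ray r(t) = o + t·d is then given by volume rendering:

    C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t),\mathbf{d})\,dt,
    \qquad T(t) = \exp\!\left(-\int_{t_n}^{t}\sigma(\mathbf{r}(s))\,ds\right).

In practice the integral is approximated by sampling points along each ray and alpha-compositing their contributions. The following is a minimal NumPy sketch of that discrete quadrature; the function name and array shapes are our own illustrative choices, not code from any of the cited papers:

    import numpy as np

    def composite(sigma, rgb, t_vals):
        """Discrete NeRF volume rendering along a single ray.

        sigma  : (N,) volume densities at the N sampled points
        rgb    : (N, 3) colours predicted at the sampled points
        t_vals : (N,) increasing sample depths along the ray
        """
        # Distances between adjacent samples; the last interval is open-ended.
        deltas = np.diff(t_vals, append=1e10)
        # Per-segment opacity: alpha_i = 1 - exp(-sigma_i * delta_i).
        alpha = 1.0 - np.exp(-sigma * deltas)
        # Accumulated transmittance: T_i = prod_{j<i} (1 - alpha_j).
        trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
        # Weight each sample and sum to get the expected ray colour C(r).
        weights = alpha * trans
        return (weights[:, None] * rgb).sum(axis=0)

Training reduces to minimizing the photometric error between colours rendered this way and the observed video frames, which is why accurate camera poses, or joint pose optimization, are critical in the monocular setting.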

References

[1] T. Georgiou, et al., "A survey of traditional and deep learning-based feature descriptors for high dimensional data in computer vision," Int. J. Multimedia Inf. Retr., vol. 9, no. 3, pp. 135–170, 2020.

[2] A. Bapat, Towards High-Frequency Tracking and Fast Edge-Aware Optimization, Ph.D. dissertation, Univ. North Carolina at Chapel Hill, 2019.

[3] M. B. Alatise and G. P. Hancke, "A review on challenges of autonomous mobile robot and sensor fusion methods," IEEE Access, vol. 8, pp. 39830–39846, 2020.

[4] Y. Lu, et al., "A survey of motion-parallax-based 3-D reconstruction algorithms," IEEE Trans. Syst., Man, Cybern., Part C (Appl. Rev.), vol. 34, no. 4, pp. 532–548, 2004.

[5] M. Zollhöfer, et al., "State of the art on monocular 3D face reconstruction, tracking, and applications," Comput. Graph. Forum, vol. 37, no. 2, 2018.

[6] M. Kholil, I. Ismanto, and M. N. Fu'ad, "3D reconstruction using structure from motion (SFM) algorithm and multi view stereo (MVS) based on computer vision," IOP Conf. Ser.: Mater. Sci. Eng., vol. 1073, no. 1, 2021.

[7] D. Maier, A. Hornung, and M. Bennewitz, "Real-time navigation in 3D environments based on depth camera data," in Proc. 12th IEEE-RAS Int. Conf. Humanoid Robots (Humanoids), 2012.

[8] H. Hofer, Real-time visualization pipeline for dynamic point cloud data, Ph.D. dissertation, TU Wien, Vienna, 2018.

[9] G. Pintore, et al., "State-of-the-art in automatic 3D reconstruction of structured indoor environments," Comput. Graph. Forum, vol. 39, no. 2, 2020.

[10] A. R. Kosiorek, et al., "NeRF-VAE: A geometry aware 3D scene generative model," in Proc. Int. Conf. Mach. Learn. (PMLR), 2021.

[11] B. Mildenhall, et al., "NeRF: Representing scenes as neural radiance fields for view synthesis," Commun. ACM, vol. 65, no. 1, pp. 99–106, 2021.

[12] A. Flint, D. Murray, and I. Reid, "Manhattan scene understanding using monocular, stereo, and 3D features," in Proc. Int. Conf. Comput. Vis., 2011.

[13] C. Russell, R. Yu, and L. Agapito, "Video pop-up: Monocular 3D reconstruction of dynamic scenes," in Eur. Conf. Comput. Vis., Cham: Springer, 2014.

[14] O. Özyeşil, et al., "A survey of structure from motion," Acta Numerica, vol. 26, pp. 305–364, 2017.

[15] G. Vogiatzis and C. Hernández, "Video-based, real-time multi-view stereo," Image Vis. Comput., vol. 29, no. 7, pp. 434–441, 2011.

[16] S. Sumikura, M. Shibuya, and K. Sakurada, "OpenVSLAM: A versatile visual SLAM framework," in Proc. 27th ACM Int. Conf. Multimedia, 2019.

[17] L. Yariv, et al., "Volume rendering of neural implicit surfaces," in Adv. Neural Inf. Process. Syst., vol. 34, pp. 4805–4815, 2021.

[18] V. Sitzmann, M. Zollhöfer, and G. Wetzstein, "Scene representation networks: Continuous 3D-structure-aware neural scene representations," in Adv. Neural Inf. Process. Syst., vol. 32, 2019.

[19] V. Sitzmann, et al., "DeepVoxels: Learning persistent 3D feature embeddings," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019.

[20] H. Kato, et al., "Differentiable rendering: A survey," arXiv preprint arXiv:2006.12057, 2020.

[21] M. Kettunen, et al., "An unbiased ray-marching transmittance estimator," arXiv preprint arXiv:2102.10294, 2021.

[22] M. E. Mirici, et al., "Land use/cover change modelling in a Mediterranean rural landscape using multi-layer perceptron and Markov chain (MLP-MC)," Appl. Ecol. Environ. Res., vol. 16, no. 1, 2018.

[23] T. Bardak and S. Bardak, "Prediction of wood density by using red-green-blue (RGB) color and fuzzy logic techniques," Politeknik Dergisi, vol. 20, no. 4, pp. 979–984, 2017.

[24] A. Yu, et al., "pixelNeRF: Neural radiance fields from one or few images," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021.

[25] A. Jain, M. Tancik, and P. Abbeel, "Putting NeRF on a diet: Semantically consistent few-shot view synthesis," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021.

[26] C. Gao, et al., "Dynamic view synthesis from dynamic monocular video," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021.

[27] J. I. Agulleiro and J.-J. Fernandez, "Fast tomographic reconstruction on multicore computers," Bioinformatics, vol. 27, no. 4, pp. 582–583, 2011.

[28] E. Tretschk, et al., "Non-rigid neural radiance fields: Reconstruction and novel view synthesis of a dynamic scene from monocular video," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021.

[29] J. Chen, et al., "Animatable neural radiance fields from monocular RGB videos," arXiv preprint arXiv:2106.13629, 2021.

[30] M. Jaritz, et al., "Sparse and dense data with CNNs: Depth completion and semantic segmentation," in 2018 Int. Conf. 3D Vision (3DV), IEEE, 2018.

[31] T. Wu, et al., "Density-aware Chamfer distance as a comprehensive metric for point cloud completion," arXiv preprint arXiv:2111.12702, 2021.

[32] D. Applegate, et al., "Unsupervised clustering of multidimensional distributions using earth mover distance," in Proc. 17th ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., 2011.

[33] J. Korhonen and J. You, "Peak signal-to-noise ratio revisited: Is simple beautiful?," in 4th Int. Workshop Quality Multimedia Exp., 2012.

[34] S. Basak, et al., "Methodology for building synthetic datasets with virtual humans," in 2020 31st Irish Signals and Systems Conf. (ISSC), IEEE, 2020.



Published

2022-12-30

Issue

Vol. 3 No. 4 (2022)

Section

Articles

How to Cite

Elinjulliparambil SH. 3D Reconstruction from Monocular Videos Using Neural Radiance Fields (NeRF). IJERET [Internet]. 2022 Dec. 30 [cited 2026 Jan. 21];3(4):115-27. Available from: https://ijeret.org/index.php/ijeret/article/view/398