Virtual reality (VR) creates an exceptional experience in which users can explore virtual environments. Wearing a head‐mounted display (HMD), users observe a virtual world that is rendered based on their physical movements and actions. A common solution for capturing the visual and geometric information needed to construct virtual environments is the use of RGB‐D sensors. These sensors not only capture RGB color data like conventional cameras do, but additionally record a depth value for each pixel. RGB‐D sensors are thus able to capture both the visual and geometric properties of a space, including any objects or people in it. This makes immersive social VR experiences possible, in which people in different physical locations can be placed in the same virtual environment. However, an HMD obstructs the RGB‐D sensor from capturing the wearer’s upper face, which severely impacts the social aspects of VR applications.

To address this, we proposed a framework for the virtual removal of head‐mounted displays in RGB‐D images, referred to as the task of HMD removal. Due to its novelty, we took an exploratory approach to this task. We formulated the problem as a joint RGB‐D face image inpainting task and proposed a GAN‐based coarse‐to‐fine architecture that simultaneously fills in the missing color and depth information of face images occluded by an HMD. To preserve the identity of the inpainted faces, we proposed an RGB‐based identity loss function. Leveraging the knowledge of a pretrained identity embedding model, this perceptual loss encourages the preservation of identity‐specific facial features.

Furthermore, we proposed several architectural structures to explore multimodal feature fusion of the color and depth information contained in RGB‐D images. To this end, we introduced data‐level fusion, which naively combines the color and depth information at the network input.
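As a rough illustration of the two ideas above, the following NumPy sketch shows data-level fusion (stacking depth onto the color channels at network input) and an embedding-based identity loss. The `embed` argument stands in for the pretrained identity embedding model; the toy embedding used in the example is a hypothetical placeholder, not part of the proposed method.

```python
import numpy as np

def fuse_data_level(rgb, depth):
    """Data-level fusion: stack the depth map onto the color channels,
    yielding a 4-channel RGB-D input for the inpainting network."""
    # rgb: (H, W, 3) in [0, 1]; depth: (H, W) normalized to [0, 1]
    return np.concatenate([rgb, depth[..., None]], axis=-1)  # (H, W, 4)

def identity_loss(embed, rgb_inpainted, rgb_ground_truth):
    """Perceptual identity loss: penalize the cosine distance between the
    identity embeddings of the inpainted face and the ground-truth face.
    `embed` is a stand-in for a pretrained identity embedding model."""
    a = embed(rgb_inpainted)
    b = embed(rgb_ground_truth)
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    return 1.0 - cos  # approaches 0 when the identities match

# Example usage with a toy "embedding" (hypothetical placeholder):
rgb = np.random.rand(8, 8, 3)
depth = np.random.rand(8, 8)
rgbd = fuse_data_level(rgb, depth)
toy_embed = lambda img: img.mean(axis=(0, 1))  # NOT a real face embedding
loss = identity_loss(toy_embed, rgb, rgb)  # identical inputs -> near-zero loss
```

In the actual architecture the loss would be computed on network output rather than raw images, and the embedding model's weights stay frozen so the gradient only shapes the inpainting network.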
In addition, we introduced hybrid fusion, which applies feature‐level fusion in the coarse stage of the architecture and data‐level fusion in the refinement stage. Within the concept of hybrid fusion, we investigated several fusion strategies, including residual fusion. Our findings suggest that data‐level fusion achieves performance similar to that of hybrid fusion.

Moreover, to improve surface reproduction in the depth channel, we introduced a surface normal loss function and a contextual surface attention module, both of which rely on surface normals estimated from the depth channel of the RGB‐D image. We also considered adding surface normal information to the discriminator input, which we found to have an adverse effect on the visual quality of the results.

In the absence of a large‐scale RGB‐D face dataset, we devised a pipeline for the creation of a synthetic RGB‐D face dataset for the evaluation of our network. Despite its exploratory nature, our research provides unique insights into the design and behavior of a multimodal image inpainting architecture that can be of interest to future research.
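The surface normals that drive the surface normal loss and the contextual surface attention module can be estimated from the depth channel alone. A minimal finite-difference sketch is shown below; it treats the depth map as an orthographic height field and ignores camera intrinsics, which a full estimator would account for when unprojecting pixels to 3D.

```python
import numpy as np

def normals_from_depth(depth):
    """Estimate per-pixel surface normals from a depth map via finite
    differences. Treats (x, y, depth) as an orthographic height field;
    a full estimator would first unproject using the camera intrinsics.
    Returns unit normals of shape (H, W, 3)."""
    dz_dy, dz_dx = np.gradient(depth.astype(np.float64))
    # A height field z(x, y) has normal proportional to (-dz/dx, -dz/dy, 1).
    n = np.stack(
        [-dz_dx, -dz_dy, np.ones_like(dz_dx)], axis=-1
    )
    n /= np.linalg.norm(n, axis=-1, keepdims=True)
    return n

# A flat depth plane yields normals pointing straight at the camera:
flat = np.full((4, 4), 2.0)
n = normals_from_depth(flat)
# every normal is (0, 0, 1)
```

Because this estimate is differentiable in the depth values, a loss on the angular difference between predicted and ground-truth normals can propagate gradients back into the depth branch of the inpainting network.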