As network infrastructure and Internet of Things (IoT) technologies continue to evolve, immersive systems such as virtual reality (VR) are becoming increasingly integrated into interconnected environments. These advances enable real-time processing of multi-modal data, enriching user experiences with rich visual and three-dimensional interactions. However, ensuring continuous user authentication in VR environments remains a significant challenge. Addressing it requires an effective monitoring system that tracks VR users in real time and triggers re-authentication when necessary. Based on this premise, we propose a multi-modal authentication framework, named MobileNetV3pro, that authenticates users from eye-tracking data. The framework applies a transfer learning approach, adapting the MobileNetV3Large architecture (pretrained on ImageNet) as a feature extractor: its pre-trained convolutional layers produce high-level image representations, and a custom fully connected classification head performs binary classification. Authentication performance is evaluated using Equal Error Rate (EER), accuracy, F1-score, model size, and inference time. Experimental results show that eye-based authentication with MobileNetV3pro achieves a lower EER (3.00%) than baseline models, demonstrating its effectiveness in VR environments.
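A minimal sketch of the transfer-learning setup described above is given below, written in Keras for concreteness. The abstract specifies the backbone (MobileNetV3Large pretrained on ImageNet, used as a frozen feature extractor) and a custom fully connected head for binary classification, but not the implementation framework or layer sizes; the 256-unit dense layer, dropout rate, 224x224 input resolution, and optimizer settings here are illustrative assumptions rather than the authors' configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Load MobileNetV3Large pretrained on ImageNet, dropping its original
# 1000-class classifier head (include_top=False).
base = tf.keras.applications.MobileNetV3Large(
    input_shape=(224, 224, 3),  # assumed input resolution
    include_top=False,
    weights="imagenet",
    pooling="avg",  # global average pooling yields a flat feature vector
)
base.trainable = False  # freeze pretrained conv layers: pure feature extraction

# Custom fully connected classification head for the binary
# (genuine user vs. impostor) decision; widths are assumptions.
model = models.Sequential([
    base,
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
```

Freezing the backbone keeps training cost low and limits overfitting on comparatively small eye-tracking datasets; only the head's weights are updated, while the ImageNet-learned convolutional filters supply the high-level image representations.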