Lightweight Dual-View Feature Fusion for Hand-Object Interaction Recognition

Houda Skhoun

Wrist-worn hand-object interaction (HOI) recognition is a critical capability for wearable rehabilitation systems, assistive technologies, augmented reality, and human-computer interaction applications. Compared with fixed external cameras, wrist-worn devices provide user-centered observations that are better suited to continuous monitoring of real-world interactions. However, many existing wrist-worn HOI recognition systems still face important challenges, including incomplete interaction representation caused by single-view observation, viewpoint ambiguity, self-occlusion, and the high computational complexity of recent deep learning approaches. To address these limitations, this paper proposes a lightweight dual-view framework for hand-object interaction recognition using synchronized palm-view and back-view RGB images acquired from a wrist-worn dual-camera device. The framework employs a shared MobileNetV2 backbone combined with multi-level feature extraction to jointly capture fine-grained spatial details and high-level semantic representations. To integrate complementary information from different network depths, a view-specific adaptive fusion mechanism dynamically balances intermediate and deep feature representations for each visual stream. The fused dual-view representation is then used for interaction classification. Experimental evaluation under the Leave-One-Participant-Out (LOPO) cross-subject protocol shows that the framework achieves a mean accuracy of 82.36% and a mean F1-score of 81.65% while maintaining low computational complexity suitable for real-time wearable applications. Ablation studies further confirm the effectiveness of the multi-level feature extraction and the adaptive fusion strategy. The proposed approach provides an effective balance between recognition performance and computational efficiency for lightweight wearable HOI recognition systems.
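The LOPO protocol named in the abstract can be illustrated with a minimal sketch: each participant is held out once for testing, the model is fitted on all remaining participants, and the per-fold scores are averaged. The function names, the toy participant IDs and labels, and the majority-class stand-in for a classifier are all illustrative assumptions, not the paper's code or data.

```python
from collections import defaultdict
from statistics import mean


def lopo_splits(participant_ids):
    """Yield (held_out, train_idx, test_idx), holding out one participant per fold."""
    by_participant = defaultdict(list)
    for idx, pid in enumerate(participant_ids):
        by_participant[pid].append(idx)
    for held_out in sorted(by_participant):
        test_idx = by_participant[held_out]
        train_idx = sorted(
            i
            for pid, idxs in by_participant.items()
            if pid != held_out
            for i in idxs
        )
        yield held_out, train_idx, test_idx


def lopo_mean_accuracy(participant_ids, labels, fit_predict):
    """Average per-fold accuracy.

    fit_predict(train_idx, test_idx) is a hypothetical callback that trains on
    train_idx and returns one predicted label per entry of test_idx.
    """
    fold_accuracies = []
    for _, train_idx, test_idx in lopo_splits(participant_ids):
        preds = fit_predict(train_idx, test_idx)
        correct = sum(p == labels[i] for p, i in zip(preds, test_idx))
        fold_accuracies.append(correct / len(test_idx))
    return mean(fold_accuracies)


# Toy data (assumed for illustration only): three participants, two samples each.
participants = ["P1", "P1", "P2", "P2", "P3", "P3"]
labels = ["grasp", "pour", "grasp", "pour", "grasp", "grasp"]


def majority_baseline(train_idx, test_idx):
    """Stand-in classifier: always predict the most frequent training label."""
    train_labels = [labels[i] for i in train_idx]
    most_common = max(sorted(set(train_labels)), key=train_labels.count)
    return [most_common] * len(test_idx)


folds = list(lopo_splits(participants))
score = lopo_mean_accuracy(participants, labels, majority_baseline)
print(f"{len(folds)} folds, mean accuracy = {score:.4f}")
```

Because no sample from the held-out participant is seen during training, the averaged score reflects cross-subject generalization rather than memorization of an individual's hand appearance or motion style.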

Lightweight Dual-View Feature Fusion for Hand-Object Interaction Recognition. (2026). International Journal of Latest Technology in Engineering Management & Applied Science, 15(5), 2478-2495. https://doi.org/10.51583/
