Lightweight Dual-View Feature Fusion for Hand-Object Interaction Recognition
Wrist-worn hand-object interaction (HOI) recognition is a critical capability for wearable rehabilitation systems, assistive technologies, augmented reality and human-computer interaction applications. Compared with fixed external cameras, wrist-worn wearable devices provide user-centered observations that are more suitable for continuous real-world interaction monitoring. However, many existing wrist-worn HOI recognition systems still face important challenges, including incomplete interaction representation caused by single-view observations, viewpoint ambiguity, self-occlusion and the high computational complexity of recent deep learning approaches. To address these limitations, this paper proposes a lightweight dual-view framework for hand-object interaction recognition using synchronized palm-view and back-view RGB images acquired from a wrist-worn dual-camera device. The proposed framework employs a shared MobileNetV2 backbone combined with multi-level feature extraction to jointly capture fine-grained spatial details and high-level semantic representations. To effectively integrate complementary information from different network depths, a view-specific adaptive fusion mechanism is introduced to dynamically balance intermediate and deep feature representations for each visual stream. The fused dual-view representation is subsequently used for interaction classification. Experimental evaluation under the Leave-One-Participant-Out (LOPO) cross-subject protocol demonstrates that the proposed framework achieves a mean accuracy of 82.36% and a mean F1-score of 81.65% while maintaining low computational complexity suitable for real-time wearable applications. Ablation studies further confirm the effectiveness of the proposed multi-level feature extraction and adaptive fusion strategy. The proposed approach provides an effective balance between recognition performance and computational efficiency for lightweight wearable HOI recognition systems.
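The abstract does not spell out the exact form of the view-specific adaptive fusion, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each view's intermediate and deep features have already been pooled into vectors of equal length, that each view owns two gate logits turned into weights by a softmax, and that the two fused views are concatenated before classification. The names `adaptive_fuse`, `fuse_dual_view`, and `gate_logits` are hypothetical, and the feature values are toy numbers rather than MobileNetV2 outputs.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_fuse(mid: Vector, deep: Vector, gate_logits: Sequence[float]) -> Vector:
    """Blend intermediate and deep features with two weights that sum to 1.

    Assumption: both vectors were pooled and projected to the same length.
    """
    if len(mid) != len(deep):
        raise ValueError("mid and deep features must share a dimension")
    w_mid, w_deep = softmax(gate_logits)
    return [w_mid * m + w_deep * d for m, d in zip(mid, deep)]


def fuse_dual_view(
    features: Dict[str, Dict[str, Vector]],
    gates: Dict[str, Sequence[float]],
) -> Vector:
    """Fuse each view with its own gate, then concatenate palm and back views."""
    fused: Vector = []
    for view in ("palm", "back"):
        fused.extend(
            adaptive_fuse(features[view]["mid"], features[view]["deep"], gates[view])
        )
    return fused


# Toy features standing in for pooled backbone activations (hypothetical values).
features = {
    "palm": {"mid": [1.0, 0.0], "deep": [0.0, 1.0]},
    "back": {"mid": [2.0, 2.0], "deep": [4.0, 0.0]},
}
# Equal logits give a 50/50 blend; in training these would be learned per view.
gates = {"palm": [0.0, 0.0], "back": [0.0, 0.0]}

print(fuse_dual_view(features, gates))  # prints [0.5, 0.5, 3.0, 1.0]
```

Keeping a separate gate per view lets the palm and back streams weight fine-grained and semantic features differently, which matches the "view-specific" wording of the abstract. A learned gate could equally be conditioned on the input features rather than being a fixed pair of parameters.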

This work is licensed under a Creative Commons Attribution 4.0 International License.
All articles published in our journal are licensed under CC-BY 4.0, which permits authors to retain copyright of their work. This license allows for unrestricted use, sharing, and reproduction of the articles, provided that proper credit is given to the original authors and the source.