Deep Residual Convolutional Neural Networks for Robust Environmental Sound Classification Using Optimised Mel-Spectrogram Representations

Umar Mala Garba; Ankita Srivastava; Mohammad Suaib

doi:10.51583/IJLTEMAS.2026.150500125

Umar Mala Garba, Computer Science and Engineering, Integral University, Lucknow

Ankita Srivastava, Computer Science and Engineering, Integral University, Lucknow

Mohammad Suaib, Computer Science and Engineering, Integral University, Lucknow


Published: Jun 8, 2026

Environmental sound classification (ESC) is a fundamental machine-audition problem in smart-city sensing, industrial monitoring, healthcare, and consumer devices. This study evaluates a ResNet-34 convolutional classifier on single-channel Mel-spectrogram representations of the ESC-50 benchmark in a controlled empirical setting. The contribution is not a new architectural family but an optimisation and reproducibility study that examines the influence of residual shortcuts, batch normalisation, dropout, Mel-filter resolution, SpecAugment-style masking, mixup, and learning-rate scheduling within a single, consistent training pipeline. On ESC-50 the model reached 83.0% accuracy under five-fold cross-validation and 84.0% on the best single validation fold, at epoch 88. The revised analysis adds computational cost estimates, per-class performance metrics, confusion-matrix analysis, positioning against modern benchmarks, and confidence intervals. The results show that residual CNNs remain a relevant and interpretable baseline for small-data ESC, even though the current state of the art is held by large pre-trained transformer and attention-based networks.
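Two of the regularisers named in the abstract, SpecAugment-style masking and mixup, can be illustrated with a minimal NumPy sketch. It is not the authors' pipeline: the function names `mask_spectrogram` and `mixup`, the mask widths, the mixup `alpha`, and the 128 x 431 spectrogram shape are all assumptions chosen for illustration. Only the 50-class label space comes from ESC-50 itself.

```python
import numpy as np


def mask_spectrogram(spec, rng, max_freq_width=8, max_time_width=16):
    """Zero one random frequency band and one random time band.

    SpecAugment-style masking; the widths are illustrative assumptions.
    The fill value 0.0 assumes a mean-normalised log-Mel input.
    """
    out = spec.copy()
    n_mels, n_frames = out.shape

    # Frequency mask: hide up to max_freq_width consecutive Mel bins.
    f = int(rng.integers(0, max_freq_width + 1))
    f0 = int(rng.integers(0, n_mels - f + 1))
    out[f0:f0 + f, :] = 0.0

    # Time mask: hide up to max_time_width consecutive frames.
    t = int(rng.integers(0, max_time_width + 1))
    t0 = int(rng.integers(0, n_frames - t + 1))
    out[:, t0:t0 + t] = 0.0
    return out


def mixup(x_a, y_a, x_b, y_b, rng, alpha=0.2):
    """Blend two examples and their one-hot labels with lam ~ Beta(alpha, alpha)."""
    lam = float(rng.beta(alpha, alpha))
    x = lam * x_a + (1.0 - lam) * x_b
    y = lam * y_a + (1.0 - lam) * y_b
    return x, y, lam


rng = np.random.default_rng(0)
n_classes = 50                 # ESC-50 has 50 classes
n_mels, n_frames = 128, 431    # hypothetical spectrogram shape

# Random stand-ins for two normalised log-Mel spectrograms.
spec_a = rng.standard_normal((n_mels, n_frames))
spec_b = rng.standard_normal((n_mels, n_frames))
label_a = np.eye(n_classes)[3]
label_b = np.eye(n_classes)[17]

masked = mask_spectrogram(spec_a, rng)
mixed_x, mixed_y, lam = mixup(masked, label_a, spec_b, label_b, rng)
```

Masking is applied here before mixing, which is one possible ordering; the abstract does not state which order the study uses. The mixed label `mixed_y` is a soft target, so it would be paired with a loss that accepts class probabilities rather than integer class indices.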

Deep Residual Convolutional Neural Networks for Robust Environmental Sound Classification Using Optimised Mel-Spectrogram Representations. (2026). International Journal of Latest Technology in Engineering Management & Applied Science, 15(5), 1576-1589. https://doi.org/10.51583/IJLTEMAS.2026.150500125



This work is licensed under a Creative Commons Attribution 4.0 International License.

All articles published in our journal are licensed under CC-BY 4.0, which permits authors to retain copyright of their work. This license allows for unrestricted use, sharing, and reproduction of the articles, provided that proper credit is given to the original authors and the source.
