
www.rsisinternational.org
INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Volume XV, Issue V, May 2026
greatest potential for achieving high accuracy. Finally, interpretability methods (e.g. Grad-CAM over
Mel-spectrogram regions) were not employed, so it is not possible to explain which time-frequency cues
drive each prediction.
CONCLUSIONS
This study presented a systematic empirical investigation of deep residual CNNs for environmental sound
classification (ESC) using optimised Mel-spectrogram representations. Five major conclusions are drawn
from the experimental evidence. First, residual shortcut connections are the single most critical
architectural component for this task, accounting for 8.7 percentage points (pp) of final accuracy by
allowing effective gradient flow through all 34 convolutional layers. Second, batch normalisation was
crucial for stable training on small audio datasets, adding 6.9 pp and significantly mitigating training
instability in the early epochs. Third, the optimum Mel-spectrogram configuration for ESC-50 is
empirically confirmed to be n_mels = 128 filter banks, which matches the librosa default; increasing to
n_mels = 256 showed diminishing returns. Fourth, data augmentation by time shifting, frequency masking,
time masking, and additive noise adds 5.2 pp, making it an essential part of ESC training pipelines given
the limited amount of labelled data. Lastly, the proposed model achieves a five-fold cross-validated
classification accuracy of 83.0% on ESC-50, which is 41.5 times the random-chance rate (1/50 = 2.0%),
1.7 pp above the human performance reference, and 18.5 pp above the foundational Piczak CNN baseline,
showing the practical effectiveness of the proposed framework for automated environmental sound
recognition.
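The four augmentation operations named above can be illustrated with a minimal NumPy sketch. It is not the pipeline used in this study: it assumes all four operations act on a log-Mel spectrogram array of shape (n_mels, n_frames), whereas time shifting and additive noise may equally be applied to the raw waveform. The function names, mask widths, shift fraction and noise level are hypothetical values chosen for illustration only.

```python
import numpy as np

def time_shift(spec, max_frac, rng):
    """Circularly shift the spectrogram along the time axis (axis 1)."""
    max_shift = int(max_frac * spec.shape[1])
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(spec, shift, axis=1)

def mask_axis(spec, axis, max_width, rng, fill):
    """Overwrite one random contiguous band along `axis` with `fill`.

    axis=0 masks Mel bins (frequency masking);
    axis=1 masks frames (time masking).
    """
    out = spec.copy()
    size = spec.shape[axis]
    width = int(rng.integers(0, min(max_width, size) + 1))
    start = int(rng.integers(0, size - width + 1))
    index = [slice(None)] * spec.ndim
    index[axis] = slice(start, start + width)
    out[tuple(index)] = fill
    return out

def add_noise(spec, std, rng):
    """Add zero-mean Gaussian noise to every time-frequency bin."""
    return spec + rng.normal(0.0, std, size=spec.shape)

def augment(spec, rng, max_shift_frac=0.1, max_freq_mask=16,
            max_time_mask=24, noise_std=0.01):
    """Apply the four operations in sequence (all defaults are assumptions)."""
    fill = spec.min()  # assumed fill value: the quietest bin in the input
    out = time_shift(spec, max_shift_frac, rng)
    out = mask_axis(out, 0, max_freq_mask, rng, fill)
    out = mask_axis(out, 1, max_time_mask, rng, fill)
    return add_noise(out, noise_std, rng)

# Usage on a synthetic array standing in for a 128-bin log-Mel spectrogram.
rng = np.random.default_rng(0)
example = np.random.default_rng(1).normal(size=(128, 216))
augmented = augment(example, rng)
```

Each call draws new random parameters, so the same clip yields a different augmented view in every epoch. The input array is never modified in place.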
Future directions include transfer learning from large-scale audio models such as PANNs, multi-representation
fusion of Mel spectrograms with continuous wavelet transforms or gammatone filterbanks, integration of
attention mechanisms for adaptive spectro-temporal weighting, self-supervised pre-training on unlabelled
audio, and cross-dataset evaluation to test generalisation beyond ESC-50.
REFERENCES
1. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document
recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
2. A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural
networks," Adv. Neural Inf. Process. Syst., pp. 1097–1105, 2012.
3. K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," Proc. IEEE CVPR,
pp. 770–778, 2016.
4. K. J. Piczak, "ESC: Dataset for environmental sound classification," Proc. ACM Int. Conf. Multimedia,
pp. 1015–1018, 2015.
5. K. J. Piczak, "Environmental sound classification with convolutional neural networks," Proc. IEEE MLSP,
pp. 1–6, 2015.
6. J. Salamon and J. P. Bello, "Deep convolutional neural networks and data augmentation for environmental
sound classification," IEEE Signal Process. Lett., vol. 24, no. 3, pp. 279–283, 2017.
7. I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA: MIT Press, 2016.
8. S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal
covariate shift," Proc. ICML, pp. 448–456, 2015.
9. S. B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word
recognition in continuously spoken sentences," IEEE Trans. Acoust., Speech, Signal Process., vol. 28, no.
4, pp. 357–366, 1980.
10. L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition,"
Proc. IEEE, vol. 77, no. 2, pp. 257–286, 1989.
11. B. C. J. Moore, An Introduction to the Psychology of Hearing, 6th ed. Leiden: Brill, 2012.
12. B. McFee et al., "librosa: Audio and music signal analysis in Python," Proc. 14th Python Sci. Conf. (SciPy),
pp. 18–25, 2015.
13. Y. Tokozume and T. Harada, "Learning environmental sounds with end-to-end convolutional neural
network," Proc. IEEE ICASSP, pp. 2721–2725, 2017.