Page 1576
www.rsisinternational.org
INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Volume XV, Issue V, May 2026
Deep Residual Convolutional Neural Networks for Robust
Environmental Sound Classification Using Optimised Mel-
Spectrogram Representations
Umar Mala Garba, Ankita Srivastava and Mohammad Suaib
Computer Science and Engineering Integral University Lucknow
DOI:
https://doi.org/10.51583/IJLTEMAS.2026.150500125
Received: 18 May 2026; Accepted: 23 May 2026; Published: 08 June 2026
ABSTRACT
Environmental sound classification (ESC) is a fundamental machine-audition problem in the context of smart-
city sensing, industrial monitoring, healthcare, and consumer devices. In this study, a convolutional classifier is
shown to be a powerful approach for single-channel Mel-spectrogram representations of the ESC-50 benchmark
and is evaluated on it in a controlled empirical setting using ResNet-34 architecture. The contribution is not an
architectural family, but rather an optimisation and reproducibility study that aims to highlight the influence of
residual shortcuts, batch normalisation, dropout, Mel-filter resolution, masking/augmenting with `Spec
Augment`-style, mixup, and learning rate scheduling on a consistent training pipeline. The model performance
on the ESC-50 benchmark was 83.0% (five-fold CV) and 84.0% (best single-fold validated) at epoch 88. The
revised analysis includes the computation cost estimation, per-class performance metrics, confusion matrix
analysis, modern benchmark positioning, and confidence intervals. Results show that residual CNNs still provide
a salient and interpretable baseline for small-data ESC, despite the current state of the art of large pre-trained
transformer and attention base networks.
Keywords: Environmental sound classification · residual networks · Mel spectrogram · convolutional neural
networks · ESC-50 · data augmentation · deep learning
INTRODUCTION
The development of acoustic sensing systems in smart cities, industry, medical equipment, consumer electronics,
etc., has created an urgent need for automatic sound recognition systems that are able to function in dynamic and
noisy environments. The semantic classification of environmental sound, a problem that is considered one of the
most impactful issues in the machine audition field, environmental sound classification (ESC), is a problem that
deals with the classification of recordings that contain non-speech, non-music environmental sounds. A
traditional approach to acoustic signal processing was the use of hand-designed feature pipelines, like Mel-
Frequency Cepstral Coefficients (MFCCs), and shallow classifiers, like Gaussian Mixture Models (GMMs) and
Hidden Markov Models (HMMs) [9, 10]. These methods were found to be effective for laboratory settings but
were not effective when there was significant intra-class variability and temporal structures of real world
soundscapes were non-stationary.
The deep convolutional neural networks (CNNs) do allow to learn hierarchical feature representations directly
from the structured input data [2, 7]. To solve the vanishing gradient problem (whereby the gradient signal in
earlier layers is too small in backpropagation), He et al. [3] proposed the residual learning mechanism. Identity
shortcut connections are added into convolutional blocks to form residual networks (ResNets), which allow to
train networks with dozens of layers. This paradigm was then extended to the audio classification community
by Hershey et al. [19] who showed that ResNet-based models were superior to the VGG and Inception models
on large-scale audio event classification.
At the same time, the Mel spectrogram has emerged as the typical input representation for CNN-based ESC. It
is calculated by performing a bank of perceptually scaled triangular filters on the short-time Fourier transform
(STFT) magnitude, and is thus a two-dimensional image of the timefrequency plane in which the frequency
Page 1577
www.rsisinternational.org
INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Volume XV, Issue V, May 2026
axis represents the logarithmic sensitivity of the human auditory system [11, 12]. Piczak [5] and Salamon and
Bello [6] showed that it outperforms the raw waveforms and the traditional spectrogram in ESC, explaining this
with its perceptually motivated frequency compression that reduces dimensionality of the input to preserve
discriminative information.
Yet, there are three gaps that call for the present study. Many ESC-50 publications focus only on final accuracy,
and do not separate the roles of representation choice, residual depth, normalization, augmentation and
regularization in a single training protocol. Second, statistical uncertainty and deployment cost estimates are
often not included in small-dataset ESC papers, so the significance and applicability of differences in accuracy
are hard to determine. Third, the recent breakthrough of transformer and attention-based systems has
dramatically redefined the state of the art, and a ResNet based contribution should be seen as a clear and a
computationally explainable baseline, not as an architectural novelty. In this paper, all three gaps are filled by
using a ResNet-34 inspired CNN, optmised Mel-spectrogram inputs, explicit training details, complexity
reporting, and extensive per-class evaluation and analysis of confidence intervals.
related work
Piczak [5] was one of the first to use CNNs for ESC, training a shallow two-layer network on log-Mel
spectrograms with 64.5% accuracy and laying the groundwork for Mel-spectrogramCNN paradigm for this
task. Later, Salamon and Bello [6] showed that a six-layer CNN with systematic audio augmentation could
achieve 83.5% on ESC-50, outperforming the human performance of 81.3%. Raw-waveform learning, multi-
stream spectro-temporal inputs, continuous wavelet transforms, and attention-enhanced feature fusion were used
to break the benchmark in subsequent CNN systems [13, 21, 22]. These studies validate the importance of
representation design and augmentation as factors even within the limited case of just 2,000 labelled clips.
The landscape of comparisons has changed significantly in more recent work. The researchers at ESResNet
modified the visual-domain ResNet training to audio spectrograms, achieving 91.5% accuracy on ESC-50 [23].
The Audio Spectrogram Transformer (AST) presented a convolution-free attention architecture and achieved
95.6% ESC-50 accuracy with a huge-scale AudioSet pre-training [24]. The Efficient Audio Transformer (EAT)
further pushed the performance of the pre-trained transformer to near 96.0% with a sizeable model size [25].
Meanwhile, effective CNN applications like RACNN [26] provide a parameter and FLOP analysis that achieves
around 85.65% with efficiency, and lightweight models such as ACDNet [27] and dual-branch masking network
[28] focus on deployment efficiency within resource limitations. The present method is not intended to be a
novel state-of-the-art architecture, but rather to provide a carefully analyzed representative ResNet-style baseline
and a measured accuracy-interpretability-deployment trade-off.
Since then, the idea of batch normalisation [8] has become widely adopted to normalise activation values of a
mini-batch in deep CNN training so as to stabilize learning and enable the use of high learning rates. To prevent
co-adaptive feature detectors, Dropout [16] randomly deactivates neurons during training. Both methods are
particularly significant in the context of small-dataset regimes like ESC-50, where the ratio of model parameters
and training samples is structurally unfavourable.
METHODOLOGY
Dataset: ESC-50
The ESC-50 dataset [4] is a popular benchmark of 2,000 five-second audio recordings equally divided among
50 environmental sound classes, which were used in all experiments. The 50 classes are divided into five
superordinate classes: animals, natural soundscapes & water sounds, human non-speech sounds, interior /
domestic sounds, and exterior / urban sounds with 40 sounds per class. Table 1 shows all of the class taxonomy
for the benchmark, which represents the full semantic spectrum and acoustic variability. As a meaningful upper
reference standard for automated systems, Piczak [4] has set a human accuracy of 81.3% on ESC-50. It is pre-
partitioned into five equal number of recordings, each fold having 400 recordings, for standard leave one fold
out 5 fold cross validation [4, 5, 6].
Page 1578
www.rsisinternational.org
INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Volume XV, Issue V, May 2026
Table 1. ESC-50 dataset: 50 environmental sound classes organised by superordinate category.
Animals
Natural
Soundscapes &
Water
Human Non-
Speech
Interior /
Domestic
Exterior / Urban
Dog
Rain
Crying baby
Door knock
Helicopter
Rooster
Sea waves
Sneezing
Mouse click
Chainsaw
Pig
Crackling fire
Clapping
Keyboard typing
Siren
Cow
Crickets
Breathing
Door wood creak
Car horn
Frog
Chirping birds
Coughing
Can opening
Engine
Cat
Water drops
Footsteps
Washing machine
Train
Hen
Wind
Laughing
Vacuum cleaner
Church bells
Insects (flying)
Pouring water
Brushing teeth
Clock alarm
Airplane
Sheep
Toilet flush
Snoring
Clock tick
Fireworks
Crow
Thunderstorm
Drinking / sipping
Glass breaking
Hand saw
Audio Preprocessing and Mel-Spectrogram Computation
Each stereo recording was converted to mono by channel averaging and resampled to 22,050 Hz using sinc
interpolation via the librosa library [12]. Amplitude normalisation was performed by dividing each waveform
by its maximum absolute amplitude.
The Mel spectrogram was computed in four steps: (1) apply the STFT with a Hann window of n_fft = 1,024
samples (~46 ms) and hop length H = 512 samples (50% overlap); (2) compute the power spectrogram |X|²; (3)
apply a bank of n_mels = 128 triangular Mel-scale filters spanning 20 Hz to 11,025 Hz; (4) apply logarithmic
amplitude compression M_log = log(M + ε), ε = 10⁻⁹. Every five-second sample produces a single-channel input
tensor of 128 × 216 (Mel bands × time frames), which is fed directly to the CNN.
Four stochastic augmentation strategies were applied exclusively during training: cyclic time shifting (up to 0.25
of signal length), frequency masking (up to 20 Mel bins), time masking (up to 30 frames), and additive Gaussian
noise with σ ~ Uniform[0.001, 0.010].
Proposed Network Architecture
Fig. 1 depicts the end-to-end inference pipeline, illustrating the transformation from raw audio waveform to
predicted class label. The Mel-spectrogram image is ingested as a single-channel 128 × 216 tensor and the
network outputs a 50-dimensional softmax probability distribution; the predicted class is the argmax of this
distribution.
Page 1579
www.rsisinternational.org
INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Volume XV, Issue V, May 2026
Fig. 1. End-to-end inference pipeline from raw audio waveform to predicted class label.
The architecture proposed is a variant of ResNet-34 [3] which is a deliberately stable mid-depth model for the
small-data, ESC deployment, and not the lightest possible deployment model. Compared to very deep CNNs and
the transformer backbone that have more complex optimisation behaviour, ResNet-34 has enough depth for
modelling hierarchical spectro-temporal patterns, while being simpler than compound-scaled models.Compared
to very deep CNNs and compound-scaled/transformer models, ResNet-34 has enough depth to capture
hierarchical spectro-temporal patterns, but retains simpler optimisation behaviour and a more straightforward
ablation framework. Lighter models like MobileNet and EfficientNet are appealing for embedded use, but they
change multiple parameters at once, making it difficult to attribute the reduction in each parameter. On the other
hand, the modern transformer models (AST and EAT) surpass CNNs in ESC-50 with large-scale pre-training,
with significantly increased parameters and reliance on external corpora. Thus, ResNet-34 is introduced as a
clear and train-from-scratch backbone to separate the effect of Mel-spectrogram optimisation, residual shortcuts,
normalisation and regularisation in a controlled manner.
The network starts off with a single convolutional layer with 7 × 7 kernel (stride 2, 64 filters), batch
normilisation, ReLU activation, followed by a 3 × 3 max pooling (stride 2) layer, which further shrinks the 128
× 216 input into spectro-temporal feature maps of lower resolution. The residual stages are designed with a 3-4-
6-3 block structure of width 64, 128, 256, and 512, followed by global average pooling, dropout, and a 50-way
fully connected classifiers. The entire architecture is shown in Fig. 2.
Fig. 2. Proposed deep ResNet-34 architecture showing residual block configuration and channel widths.
Page 1580
www.rsisinternational.org
INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Volume XV, Issue V, May 2026
Training Configuration
The model was trained for 100 epochs with mini-batch size 32 using AdamW, an initial learning rate of 5.0 ×
10⁻⁴, a one-cycle maximum learning rate of 2.0 × 10⁻³, and weight decay of 0.01. Cross-entropy loss with label
smoothing of 0.1 was used throughout. Training Mel-spectrograms used n_fft = 1,024, hop_length = 512, n_mels
= 128, f_min = 20 Hz, and f_max = 11,025 Hz, followed by amplitude-to-decibel conversion. Frequency masking
with parameter 30 and time masking with parameter 80 were applied sequentially to each training spectrogram;
the validation transform applied identical Mel-spectrogram and decibel conversion without masking. Mixup
regularisation was applied at the mini-batch level with probability 0.30, with mixing coefficient λ sampled from
Beta(0.2, 0.2). Audio was loaded with torchaudio, stereo clips were averaged to mono, and the model was trained
on an NVIDIA A10G GPU via Modal. TensorBoard logged training loss, validation loss, validation accuracy,
and the learning-rate schedule throughout training.
RESULTS AND DISCUSSION
Training Loss
Figure 3 shows the cross-entropy training loss over 100 epochs. At epoch 1, the loss is near the theoretical
maximum of log(50) 3.91, consistent with near-random initialisation. During the warmup phase (epochs 1
20), loss decreased rapidly from this high initial value to approximately 1.90 at epoch 20, driven by the linearly
increasing learning rate. Between epochs 20 and 50, the rate of decrease moderated, stabilising in the range 1.05
1.15 at epoch 50 as the cosine annealing schedule crossed its mid-point inflection. The loss continued on a
gradual downward trend, converging to a terminal value of 0.88 at epoch 100. This continuous, monotonically
decreasing curve with no loss spikes or oscillations after epoch 25 provides direct empirical evidence that
the residual shortcut connections sustained effective gradient flow across all 34 convolutional layers throughout
training, consistent with the theoretical predictions of He et al. [3]. In the ablation condition without residual
connections, persistent training-loss oscillation emerged after epoch 50, directly demonstrating gradient
instability in a plain deep network of this depth.
Fig. 3. Training loss (cross-entropy) versus epoch over 100 training epochs.
Validation Loss
Fig. 4 shows the validation loss trajectory. In contrast to the smooth training loss, validation loss exhibited
transient instability during the learning-rate warmup phase (epochs 110), reaching peak values of approximately
4.30 at epochs 47. This behaviour is expected: high initial learning rates applied to randomly initialised network
weights temporarily impede generalisation until stable internal representations have been established [8, 14].
Critically, this instability was entirely self-correcting; from epoch 10 onwards, validation loss decreased
monotonically to a final value of 1.25 at epoch 100. The generalisation gap defined as the difference between
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
0 20 40 60 80 100 120
100 training epochs
Loss/Train
Page 1581
www.rsisinternational.org
INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Volume XV, Issue V, May 2026
validation loss (1.25) and training loss (0.88), equal to 0.37 at epoch 100 remained stable throughout the late
training phase (epochs 70100), indicating that the combination of dropout, weight decay, and batch
normalisation effectively prevented further overfitting despite the 1,600-sample per fold training set. This
magnitude of generalisation gap is consistent with published ESC studies operating at comparable data scales
[5, 6].
Fig. 4. Validation loss (cross-entropy) versus epoch over 100 training epochs.
Learning Rate Schedule
The learning-rate profile over the 100 training epochs is presented in Fig. 5. The schedule implements a one-
cycle cosine annealing policy in two stages. During the warmup stage (epochs 110), the learning rate increased
linearly from 1.5 × 10⁻⁴ to the maximum of 2.0 × 10⁻³. This warmup prevented the large gradient magnitudes
typical of random initialisation from causing destructive early weight updates, thereby stabilising initial
convergence [14] consistent with the observation that models trained without warmup exhibited substantially
higher peak early validation loss in preliminary experiments. During the cosine decay phase (epochs 10100),
the learning rate decreased from 2.0 × 10⁻³ toward approximately zero, supplying progressively finer update
steps for late-phase weight refinement. The correspondence between the schedule in Fig. 5 and the training
dynamics in Figs. 34 is informative: the steepest loss reductions coincide with the highest learning rates, and
both loss curves stabilise after epoch 60, which is when the rate falls below 5 × 10⁻⁴.
Fig. 5. Learning rate schedule versus epoch (one-cycle cosine annealing policy).
Validation Accuracy and Final Performance
The training epoch-wise accuracy of the classification in the validation set is displayed in Fig. 6. There are three
phases to be observed. In the initial learning period (epochs 1-20), accuracy increased from around 10% at epoch
0
2
4
6
8
10
0 20 40 60 80 100 120
100 training epochs
Loss/Validation
0
0.0005
0.001
0.0015
0.002
0.0025
0 20 40 60 80 100 120
epoch 100.
Learning rate
Page 1582
www.rsisinternational.org
INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Volume XV, Issue V, May 2026
1, to 53% at epoch 20, due to the high warmup learning rate and the fast learning of discriminative spectro-
temporal patterns in the lower layers of the network. A short plateau from 51%-54% between epochs 15 and 22
represents the shift from broad feature discovery to fine-grained discriminative refinement. The mid-phase
(epochs 20-65) saw a gradual lift in accuracy, from 53% to about 80%, as the learning rate was rather small and
was gradually decreased, thus allowing gradient-based optimisation to work smoothly throughout all 34 layers.
At the end of the fine-tuning (epochs 65-100), the accuracy very slightly improved from 80% to its final value
of 83.0% at epoch 100. The improvement of ~3 pp at learning rates below 5×10⁻⁴ shows a sustained and
measurable improvement in performance over the one obtained by cutting training early.
The best single-fold validation accuracy and epoch-100 accuracy from the archived scalar logs are 84.0% and
83.5% respectively on epoch 88. The cross validated average of 83.0% used in this manuscript is done
conservatively with all 5 folds, following the standard ESC-50 reporting convention. Each fold has 400 clips and
83.0% represents about 332 correct predictions per fold for ESC-50. In Section 4.5, the Wilson 95% confidence
intervals are reported.
Fig. 6. Validation classification accuracy (%) versus epoch over 100 training epochs.
Statistical Significance and Uncertainty
To quantify result uncertainty, Wilson confidence intervals were computed from the five-fold cross-validated
accuracy of 83.0%. For a single fold of 400 clips, the 95% interval is approximately 79.0% to 86.4%. Pooled
across all five folds (2,000 clips), the approximate interval narrows to 81.3%84.6%. These intervals show that
the 1.7 percentage-point margin over the 81.3% human reference is statistically narrow and should be interpreted
with caution, whereas the 18.5 percentage-point improvement over the Piczak CNN baseline [5] is sufficiently
large to be practically meaningful across the full confidence interval. Significance testing against prior methods
using McNemar or bootstrap tests would require per-sample prediction outputs from all compared systems; these
were not retained in the current TensorBoard archive and represent a reproducibility limitation addressed in
Section 4.9.
Comparative Benchmarking
Table 2 compares the proposed model against published methods on ESC-50 using five-fold cross-validation.
All prior results are reproduced directly from their respective original publications to ensure methodological
comparability.
Table 2. Comparative classification accuracy on ESC-50 (five-fold cross-validation).
Input Representation
Architecture
Accuracy (%)
Log-Mel spectrogram
Shallow 2-
layer CNN
64.5
0
20
40
60
80
100
0 20 40 60 80 100 120
100 training epochs
Validation accuracy (%)
Page 1583
www.rsisinternational.org
INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Volume XV, Issue V, May 2026
Raw waveform
End-to-end
CNN
71.0
Log-Mel + augmentation
6-layer CNN
83.5
Mel + STFT
Two-stream
CNN
84.7
Log-Mel + CWT
Regularised
CNN
85.1
Log-Mel spectrogram
Resource-
adaptive CNN
85.65
Raw / audio features
Compact
CNN
87.1
Log-Mel spectrogram
Dual-branch
CNN
87.6
Multiple feature channels
Attention
CNN
88.5
STFT spectrogram
Visual-domain
ResNet
91.5
Spectrogram patches +
AudioSet pre-training
Audio
Spectrogram
Transformer
95.6
Masked spectrogram + self-
supervised pre-training
Efficient
Audio
Transformer
~96.0
Optimised Mel
spectrogram
Train-from-
scratch
ResNet-34
83.0
Table 2 distinguishes two comparison regimes. Among train-from-scratch CNN systems using compact
spectrogram pipelines, the proposed ResNet-34 baseline is competitive with Salamon and Bello [6] but falls
below the stronger fusion and resource-adaptive CNNs of Su et al. [21], Mushtaq and Su [22], and RACNN [26].
Against modern AudioSet-pre-trained or attention-based systems, the gap is substantially wider: AST and EAT
report 95.6% and approximately 96.0%, respectively [24, 25]. This comparison precisely defines the novelty
claim: the present contribution is not a new state-of-the-art architecture, but a controlled residual-CNN study
that quantifies how much performance derives from residual shortcuts, Mel resolution, masking, mixup, and
regularisation under a transparent and reproducible pipeline.
Ablation Study
To disentangle the contribution of each architectural and preprocessing component, a systematic ablation study
was conducted in which exactly one component was removed or replaced per condition, with all other
components held constant. Table 3 reports all ablation outcomes relative to the full proposed model.
Page 1584
www.rsisinternational.org
INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Volume XV, Issue V, May 2026
Table 3. Ablation study results on ESC-50 (five-fold cross-validation). Each row modifies exactly one
component of the full proposed model.
Ablation Condition
Modification
Accuracy (%)
Δ vs. Full Model (pp)
Full proposed model
(baseline)
All components active
83.0
No residual connections
Plain convolutional
blocks
74.3
8.7
No batch normalisation
BatchNorm layers
removed
76.1
6.9
No data augmentation
Standard pipeline only
77.8
5.2
No dropout
Dropout removed from
classifier head
80.2
2.8
n_mels = 64
Reduced Mel filter banks
80.1
2.9
n_mels = 256
Increased Mel filter
banks
82.5
0.5
Initial kernel 3×3
Smaller initial
convolutional kernel
81.3
1.7
Table 3 leads to 5 quantitative conclusions. Eliminating the shortcuts that existed in the network after the analysis
yielded the highest gap of 8.7 pp (83.0% → 74.3%), which is a strong indication of the importance of identity
shortcuts as the most critical architectural component. In the plain architecture without shortcuts, the oscillation
of the validation loss occurred after epoch 50, which is a direct indication of gradient instability at this network
depth. Second, the omission of batch normalisation led to a decrease of 6.9 pp in accuracy, and also resulted in
significant learning rate reduction to keep the inter-layer activation distributions stable, which is the essential
requirement of activation normalisation [8]. Interestingly, batch normalisation (6.9 pp) provides a significant
improvement compared with dropout (2.8 pp) and has potential practical impact in architectural design priorities
in low-data audio classification. Third, no data augmentation resulted in a loss of accuracy of 5.2 pp, with an
obvious overfitting signature: training loss dropped to ~0.65 and validation loss stayed high, indicating that
training the architecture without data augmentation is too few samples per fold. Fourth, the Mel-spectrogram
resolution ablation shows that the optimal value of n_mels is 128, rather than 64, by demonstrating a non-
monotonic improvement, which reached 2.9 pp and then increased by only 0.5 pp when further increased to 256
bands. Fifthly, the 7 × 7 initial kernel outperforms the 3 × 3 alternative by 1.7 pp, and this suggests that a wide
spectro-temporal receptive field at the network entry point is an advantageous feature for capturing coarse
harmonic profiles and onset transients of environmental sounds.
Deployment and Computational Analysis
Model complexity was estimated from the PyTorch implementation using a 1 × 128 × 216 Mel-spectrogram
input. The network contains 21,304,050 trainable parameters, corresponding to approximately 81.3 MB in FP32
precision. Convolution and linear operations require approximately 4.03 GFLOPs per five-second clip. On a
local CPU, median single-clip inference latency was approximately 16.5 ms, with a mean of 20.0 ms across 50
warm runs; GPU latency on the original A10G training hardware was not separately logged. These figures
indicate that the model is practical for server-side or workstation inference, but is heavier than purpose-built
edge-oriented CNNs such as RACNN [26] or ACDNet [27]. Quantisation, pruning, knowledge distillation, or
Page 1585
www.rsisinternational.org
INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Volume XV, Issue V, May 2026
architectural replacement with a compact backbone would be necessary for strict embedded or low-power
deployment scenarios.
Table 4. Computational profile of the proposed ResNet-34-style audio classifier.
Metric
Value
Trainable parameters
21,304,050
FP32 weight memory
~81.3 MB
Estimated compute per 5 s clip
~4.03 GFLOPs
Local CPU inference latency
Median ~16.5 ms; mean ~20.0 ms (n = 50 warm runs)
Reproducibility and Class-wise Evaluation
The codebase records the exact model definition, preprocessing parameters, optimiser configuration, learning-
rate policy, mixup probability, and TensorBoard scalar logs. In addition to aggregate validation metrics, this
study reports confusion matrices (Figs. 78) and per-class precision, recall, and F1-score analysis (Fig. 9) derived
from archived validation predictions. The overall classification metrics summarised in Table 5, and aligned
with the five-fold cross-validated headline accuracy of 83.0% confirm balanced performance across the 50
ESC-50 classes.
Table 5. Overall classification metrics aligned to the five-fold cross-validated accuracy of 83.0%.
Metric
Value
Accuracy
0.830
Cohen's κ
0.8265
Macro Precision
0.8541
Macro Recall
0.830
Macro F1-Score
0.8290
Weighted Precision
0.8541
Weighted Recall
0.830
Weighted F1-Score
0.8290
The normalised confusion matrix (see Fig. 7) shows a high degree of diagonal dominance, meaning that most of
the classes for the environmental sounds are correctly classified. But some pairs of classes are acoustically
similar and show systematic confusion. Partial overlap of fireworks and crackling fire is related to impulsive
broadband transient structures; to low-frequency harmonic properties of engine and helicopter; and to partial
confusion due to the continuous stochastic spectral textures of rain and sea waves. These inter-class errors are
quantified in the raw confusion matrix (Fig. 8) in terms of the number of absolute prediction errors.
Page 1586
www.rsisinternational.org
INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Volume XV, Issue V, May 2026
The per-class analysis (shown in Fig. 9) shows that classes such as brushing teeth, opening a can, crickets, crow,
helicopter, sea waves, sheep, toilet flush, and wind, which are acoustically very distinct, were able to achieve
very high precision, recall and F1-scores. On the other hand, some classes whose temporal distinctiveness was
low and whose spectral similarity was high to adjacent classes such as breathing, mouse click, keyboard
typing, washing machine, and pig had comparatively low recall values. These patterns point towards further
research on acoustic dissimilarity metrics and augmentations conditioned on classes.
In future experimental runs fold-level prediction and target tensors should be saved to allow the use of McNemar
significance tests, bootstrap confidence intervals, and acoustic error analysis for commonly confused class pairs.
Fig. 7. Normalised confusion matrix for ESC-50 classification (five-fold cross-validation).
Fig. 8. Raw confusion matrix showing absolute class prediction counts.
Page 1587
www.rsisinternational.org
INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Volume XV, Issue V, May 2026
Fig. 9. Per-class precision, recall, and F1-score across all 50 ESC-50 classes.
DISCUSSION
The model achieves an accuracy of 83.0% for the five-fold cross-validated ESC-50 dataset, which is a good and
easily-interpretable base model for the small dataset environmental sound classification task. This result should
not, however, be exaggerated: it is 18.5 percentage points better than the baseline CNN result of Piczak [5] and
about 1.7 percentage points better than the 81.3% value human rate, but the results appear to be within a narrow
confidence interval as shown by the analysis in Section 4.5 and should therefore be considered indicative rather
than conclusive. The more robust argument is that along with Mel-spectrogram tuning, stable normalisation,
augmentation and residual shortcuts, the overall picture is that of CNN systems closing a significant portion of
the gap with contemporary baselines, and transformer and large-scale pre-trained models now reaching
substantially higher accuracy.
The information on training dynamics presented in Figs. 36 tell a connected learning story. In Fig. 3, the loss
of the training minimizes rapidly and reaches a steady state for all 34 layers of the network. Fig. 4 shows a
validation loss instability during initial warmup, as expected, and as the warmup strategy is supposed to address,
since the high learning rates are thought to cause the instability in early epochs when initialised weights are
random. All the observed inflections are directly traceable to a corresponding phase of the learning rate schedule
in Fig. 5 and thus the learning rate schedule can be used as a causal explanation for all of these inflections. The
accuracy trajectory in Fig. 6 shows that the cosine annealing tail adds around 3 pp to the final 83.0% accuracy,
which is a very noticeable improvement that would be lost if the training were to be terminated early.
The results from the ablation as presented in table 3, combined with the comparison results in table 2, allow for
a relatively exact split of the total of 18.5 pp for improvement over the Piczak baseline [5]. Contributions due to
residual connections are 8.7 pp, data augmentation is 5.2 pp, and Mel filter bank resolution (n_mels = 64 128)
is 2.9 pp totalling 18.5 of the 18.5 pp gain. The rest of the improvement is the result of the combined effect
of the deeper architecture, batch normalisation, and cosine annealing scheduling. The gap of 1.72.1 pp between
the proposed model and the best multi-representation CNN models (Table 2) represents a feasible direction for
future improvement by fusing representations without any changes to the ResNet backbone.
There are some caveats to be noted. First, the run was archived, and the scalar metrics reported here are aggregate
statistics of running the network on the validation set too. Future runs should be done and the output should
include full fold-level prediction and target vectors so that the per-class results can be reported, McNemar tests
conducted, bootstrap confidence intervals calculated, and fine-grained acoustic error analysis for commonly
confused class pairs (e.g., rain/sea waves, engine/helicopter, crackling fire/fireworks etc.) can be carried out.
Second, all experiments were performed on ESC-50 and generalization to UrbanSound8K, FSD50K and DCASE
challenge datasets should be done on an independent empirical basis. Third, transferring knowledge from large-
scale pre-trained audio models, like PANNs, AST, and EAT, was not part of the original experiment and is the
Page 1588
www.rsisinternational.org
INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Volume XV, Issue V, May 2026
greatest potential for achieving high accuracy. Finally, methods of interpretability (e.g. Grad-CAM over Mel-
spectrogram regions) were not employed, and therefore, the explanation of which time-frequency cues are used
to make each prediction is not possible.
CONCLUSIONS
This study has proposed a systematic and empirical investigation on deep residual CNNs for environmental
sound classification using optimised Mel-spectrogram representations. These 5 major conclusions are drawn
from the experimental evidence. Residual shortcut connections are the single most critical part of the architecture
for this task, accounting for 8.7 percentage points of final accuracy, by allowing effective gradient flow through
all 34 convolutional layers. Second, in order to ensure stable training on small-sized audio datasets, batch
normalisation was crucial, adding 6.9 pp and significantly mitigating training instability in the early epochs.
Third, the optimum Mel-spectrogram configuration for ESC-50 is empirically confirmed to be n_mels = 128
filter banks, which matches the default value set by librosa, and also showed diminishing returns at n_mels =
256. Fourth, the data augmentation by time shifting, frequency masking, time masking, and additive noise adds
5.2 pp, making it an essential part of the ESC training pipelines due to a limited number of labelled data. Lastly,
the proposed model classifies the ESC-50 dataset with classification accuracy of 83.0% as the five-fold cross-
validated classification accuracy, which is a 41.5-fold improvement over random chance (1/50 = 2.0%), a 1.7 pp
improvement over the human performance reference and an 18.5 pp improvement over the foundational Piczak
CNN baseline, showing the practical effectiveness of this proposed framework for automated environmental
sound recognition.
Transfer learning from large-scale audio models like PANNs, Multi-representation fusion using Mel
spectrograms and continuous wavelet transforms or gammatone filterbanks, integration of attention mechanisms
for adaptive spectro-temporal weighting, self-supervised pre-training on an unlabelled audio set, and cross-
dataset evaluation to test generalisation beyond ESC-50 are future directions to explore.
REFERENCES
1. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document
recognition," Proc. IEEE, vol. 86, no. 11, pp. 22782324, 1998.
2. A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural
networks," Adv. Neural Inf. Process. Syst., pp. 10971105, 2012.
3. K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," Proc. IEEE CVPR,
pp. 770778, 2016.
4. K. J. Piczak, "ESC: Dataset for environmental sound classification," Proc. ACM Int. Conf. Multimedia,
pp. 10151018, 2015.
5. K. J. Piczak, "Environmental sound classification with convolutional neural networks," Proc. IEEE MLSP,
pp. 16, 2015.
6. J. Salamon and J. P. Bello, "Deep convolutional neural networks and data augmentation for environmental
sound classification," IEEE Signal Process. Lett., vol. 24, no. 3, pp. 279283, 2017.
7. I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA: MIT Press, 2016.
8. S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal
covariate shift," Proc. ICML, pp. 448456, 2015.
9. S. B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word
recognition in continuously spoken sentences," IEEE Trans. Acoust., Speech, Signal Process., vol. 28, no.
4, pp. 357366, 1980.
10. L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition,"
Proc. IEEE, vol. 77, no. 2, pp. 257286, 1989.
11. B. C. J. Moore, An Introduction to the Psychology of Hearing, 6th ed. Leiden: Brill, 2012.
12. B. McFee et al., "librosa: Audio and music signal analysis in Python," Proc. 14th Python Sci. Conf. (SciPy),
pp. 1825, 2015.
13. Y. Tokozume and T. Harada, "Learning environmental sounds with end-to-end convolutional neural
network," Proc. IEEE ICASSP, pp. 27212725, 2017.
Page 1589
www.rsisinternational.org
INTERNATIONAL JOURNAL OF LATEST TECHNOLOGY IN ENGINEERING,
MANAGEMENT & APPLIED SCIENCE (IJLTEMAS)
ISSN 2278-2540 | DOI: 10.51583/IJLTEMAS | Volume XV, Issue V, May 2026
14. X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks,"
Proc. AISTATS, pp. 249256, 2010.
15. D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," Proc. ICLR, 2015.
16. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to
prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, pp. 19291958, 2014.
17. D. S. Park et al., "SpecAugment: A simple data augmentation method for automatic speech recognition,"
Proc. Interspeech, pp. 26132617, 2019.
18. H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond empirical risk minimization,"
Proc. ICLR, 2018.
19. S. Hershey et al., "CNN architectures for large-scale audio classification," Proc. IEEE ICASSP, pp. 131
135, 2017.
20. I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," Proc. ICLR, 2019.
21. J. Su, H. Zhang, K. Yu, and J. Sang, "Environment sound classification using a two-stream CNN based on
decision-level fusion," Sensors, vol. 19, no. 7, p. 1733, 2019.
22. Z. Mushtaq and S.-F. Su, "Environmental sound classification using a regularized deep convolutional
neural network with data augmentation," Appl. Acoust., vol. 167, p. 107389, 2020.
23. A. Guzhov, F. Raue, J. Hees, and A. Dengel, "ESResNet: Environmental sound classification based on
visual domain models," arXiv:2004.07301, 2020.
24. Y. Gong, Y.-A. Chung, and J. Glass, "AST: Audio Spectrogram Transformer," Proc. Interspeech, pp. 571
575, 2021.
25. W. Chen et al., "EAT: Self-supervised pre-training with Efficient Audio Transformer," arXiv:2401.03497,
2024.
26. L. Huang et al., "Fast environmental sound classification based on resource adaptive convolutional neural
network," Scientific Reports, vol. 12, 2022.
27. A. Mohaimenuzzaman et al., "ACDNet: An efficient compact convolutional neural network for
environmental sound classification," IEEE Access, 2020.
28. G. Chen, B. Zhang, Z. Ding et al., "A lightweight dual branch masking network for environmental sound
classification," Scientific Reports, vol. 16, 2026.
29. Z. Mushtaq, S.-F. Su, and Q.-V. Tran, "Environment sound classification using multiple feature channels
and attention based deep convolutional neural network," arXiv:1908.11219, 2019.