1 Introduction

Gunshot analysis has received significant attention from both the military and scientific communities. Acoustic analysis of gunshots can provide useful information, such as the position of the shooter, the projectile trajectory, the caliber of the gun, and the gun model. Although acoustical evidence may significantly contribute to audio forensic reconstruction and analysis, the forensic analysis of gunshots is characterized by many challenges due to the broadcast and noisy nature of the acoustic channel.

Consider a scenario where a microphone is deployed in the close vicinity of the shooter. The recorded audio sample can be significantly affected by the environmental surroundings, such as trees, foliage, and buildings, which attenuate and reflect the main component of the shock wave. The resulting audio sample may feature different echoes of the gunshot, each characterized by a different attenuation factor as a function of its path. Such a naive single-microphone approach is impractical, which has motivated the development of more complex ad-hoc acoustic data acquisition strategies over the last decade.

To mitigate echoes and overcome the intrinsic lack of information that the aforementioned scenario suffers from, additional microphones can be deployed. The comparison of multiple replicas of the same gunshot enables shooter localization and weapon identification. The physical characteristics of acoustic propagation can be exploited to infer the position of the shooter and the category of the gun. Multiple spatially diverse acoustic observations enable the estimation of Time of Arrival (ToA) and Time Difference of Arrival (TDoA). The obtained recordings can be modeled by geometrical acoustics, which enables the localization of the shooter. Furthermore, multiple replicas of the same acoustic source allow filtering out echoes and background noise affecting a subset of the deployed microphones, thus enabling a deep characterization in both the time and frequency domains.

Acoustic acquisition via a Wireless Sensor Network (WSN) requires a specialized infrastructure overlay to enable sensor communication, data processing, and computation distribution. Solutions that rely on the spatial diversity provided by the WSN introduce several types of burdens. Firstly, each soldier has to carry a wearable device equipped with a microphone and other sensors, such as a compass, to collect meaningful information about Angle of Arrival (AoA), Time of Arrival (ToA), and Time Difference of Arrival (TDoA). Secondly, in a military scenario, the WSN should feature a jamming-resistant communication protocol and non-interfering radio channels. Both assumptions are difficult to achieve given the resource constraints of WSNs in terms of CPU, battery, and memory. In most cases, WSNs cannot afford the computational burden of multimedia processing. Therefore, the captured data should first be off-loaded to a remote server, then downloaded and distributed again. This represents a challenge from the connectivity perspective since, in many cases, military WSNs are unattended or provided with only an intermittent link to the control center.

In this work, we do not rely on ad-hoc acquisition setups, but we exploit publicly available audio recordings of gunshots, considering their temporal and spectral representations. Spectral analysis of sound has been adopted in many contexts to detect and identify recurrent patterns. In particular, the combination of time-frequency decomposition of audio samples with Convolutional Neural Network (CNN) provides promising performance in detecting recurrent patterns [21]. The CNN is trained over several “images” constituted by a three-dimensional representation of time, frequency, and amplitude. The result is a robust solution that can “recognize” the same sound by cross-matching similar images.

Contribution

We propose an inexpensive solution that is able to detect and identify gunshots without resorting to any ad-hoc infrastructure. Contrary to other studies, our solution requires only an audio sample of a gunshot that can be easily obtained with any commercially available microphone. Our approach is agnostic to the microphone position with respect to the shooter, and it does not require multiple spatially different replicas of the gunshot; we consider recordings from mono-channel setups with different sample rates. We prove the effectiveness of our solution by considering 3655 samples of gunshots constituted by 30 pistols, 18 rifles, and 11 shotguns, for a total of 7 different calibers. The proposed approach guarantees an accuracy higher than 90% for all of the considered cases, namely, the category, model, and caliber of the gun.

Paper organization

The remainder of this paper is organized as follows. Section 2 summarizes recent contributions in the field of weapon classification. Section 3 introduces the background concepts related to frequency domain analysis, CNNs, and acoustic characteristics of gunshots. Section 4 describes our dataset and Section 5 discusses the dataset generation process. The neural network architecture is presented in Section 6. Section 7 shows the performance of our solution. Finally, Section 8 draws some concluding remarks.

2 Related work

Firearm classification based on the acoustic evidence generated by its discharge has long been investigated, but not extensively studied in the literature. Proposed solutions vary in many aspects, including the source of acoustic data, the type of analysis applied, the type of features extracted, and the application area. Table 1 summarizes prior studies that provide gunshot classification and firearm identification according to these aspects.

Table 1 Prior gunshot classification approaches

The source of the data is characterized by the type, the quality, and the environmental conditions of the deployed audio recording setup, which defines the amount of information that can be leveraged for classification. Most of the gunshot recordings used in the literature are either obtained under carefully controlled conditions, where a distributed set of microphone sensors are deployed [15, 22, 25], or extracted from a conventional recording device in less controlled environments [3, 8,9,10, 16].

In the former case, where a Wireless Acoustic Sensor Network (WASN) is deployed, spatial information can be obtained by performing array processing and triangulation techniques. Direction of Arrival (DoA) and ToA estimation methods are applied to the obtained audio signals to determine the projectile speed and trajectory, as well as to infer the position of the shooter. Such information may also provide discriminant features, such as the bullet speed [22], that can be used to identify the firearm category. Furthermore, the distributed nature of the recording setup provides spatial diversity, where multiple acoustic observations of the same gunshot are obtained from different locations, which can be leveraged to increase the classification accuracy. Sánchez-Hevia et al. [25] exploited this feature and proposed a multi-observation weapon classification system that leverages various classifier ensembles to enhance classic decision fusion techniques. Each node in the sensor network produces a classification decision using Least Squares Linear Discriminant Analysis (LS-LDA). The decisions are later fused using a Maximum Likelihood-based fusion rule that weights the decision of each node based on its location.

The main constraint induced by this type of analysis is the requirement of spatial information, which can only be obtained by deploying a distributed sensor network; this limits the applicability of gunshot detection and firearm classification to carefully controlled recording setups. Consequently, various pattern recognition approaches have been proposed that identify the firearm category in the absence of spatial information. The most widely used classifiers for firearm identification are the Gaussian Mixture Model (GMM) [3, 8, 9] and the Hidden Markov Model (HMM) [10, 16].

Most of these approaches can be described as frame-based feature classification approaches [3, 8,9,10], where the time-domain acoustic signal is subdivided into a sequence of short-time windowed frames. From each frame, a set of predetermined features is extracted and used for gunshot classification. The most common extracted features are statistical measures of the spectrum and intensity of the signal, in addition to perceptual features such as Mel-Frequency Cepstral Coefficients (MFCC) or Perceptual Linear Prediction (PLP) coefficients. Temporal features, such as energy and Zero Crossing Rate (ZCR), are also used, but only in conjunction with spectral or perceptual features.

Morton et al. [16] proposed an alternative classification approach that does not rely on frame-based features, aiming to eliminate the dependency on performance-driven parameters, which are often optimized over a finite training set. They proposed modeling each firearm category as an HMM with AutoRegressive (AR) source densities, using non-parametric Bayesian priors to allow automated model order selection. The AR densities define a set of energy and spectral states of the captured gunshot, while the HMM models the transitions between these states.

The aforementioned techniques may perform adequately in matched experimental conditions; however, their effectiveness could degrade significantly when capture conditions vary in challenging unstructured environments, where noise and distortion are present. Although Khan et al. [9] addressed this problem by using an exemplar embedding approach to bridge between varying recording conditions, the achieved classification accuracy is relatively low (i.e., 60-72%). The authors used a dataset of 100 gunshot samples obtained from 20 different firearm models, where each model is represented by 5 to 15 gunshot samples. The different conditions included in their experiments were simulated, namely, “Room Reverb”, “Concert Reverb”, and “Doppler Effect”, which may not match real-life environmental conditions and do not include directional variations. Furthermore, their approach assumes prior knowledge of the recording conditions, which is not always possible, especially in audio forensic reconstruction analysis.

Our solution, being the only one that considers varying environmental conditions without requiring an ad-hoc setup, outperforms state-of-the-art studies in terms of dataset richness, including the number of gunshot samples and the range of weapon models, while reaching 90% accuracy.

3 Background

3.1 Spectrogram

A spectrogram is one of the most widely adopted visual representations of the frequency spectrum of a signal over time. Being defined as an intensity plot of the Short-Time Fourier Transform (STFT) magnitude, a spectrogram is usually portrayed as a bi-dimensional graph, where one axis (usually the x-axis) represents time and the other axis (usually the y-axis) represents frequency. An example of a spectrogram is depicted in Fig. 1. Each intersection between time and frequency is assigned a color that refers to the Power Spectral Density (PSD) of that specific frequency at that particular time, which is considered a third dimension of the graph. To compute the spectrogram of a signal y, the signal is divided into shorter fixed-length segments \(y_{1}, \dots , y_{n}\), and the Fourier transform is applied separately to each segment. The spectrogram describes the changes of the signal frequency spectrum as a function of time. This implies that, if the time is discrete, the data to be transformed may be partitioned into overlapping frames. The STFT is applied to each of the frames and the result, consisting of both phase and magnitude for each intersection between time and frequency, is stored in a matrix, as shown in (2).

$$ STFT\{y_{n}\}(m,\omega) = \sum\limits_{n=-\infty}^{\infty} y_{n} w[n-m]e^{-j\omega n} $$
(1)
$$ spectrogram\{y_{n}\}(m,\omega) = |STFT\{y_{n}\}(m,\omega)|^{2} $$
(2)

where \(y_{n}\) represents the signal, \(w[\cdot]\) is the window function, while m and ω represent the time and the frequency in the discrete domain, respectively.
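For illustration, the following minimal Python sketch implements (1) and (2) directly (our processing relies on MATLAB, as detailed in Section 4.1); the window length, hop size, and FFT length used here are placeholder values, and a Hann window is assumed.

```python
import numpy as np

def stft_spectrogram(y, win_len=44, hop=22, nfft=65):
    """Discrete STFT (Eq. 1) and spectrogram (Eq. 2) of a real signal y.

    Parameter values are illustrative placeholders; a Hann window w[n] is assumed.
    Returns a (nfft//2 + 1) x n_frames matrix of squared STFT magnitudes.
    """
    w = np.hanning(win_len)                       # window function w[.]
    starts = range(0, len(y) - win_len + 1, hop)  # frame positions m
    frames = np.stack([y[m:m + win_len] * w for m in starts])
    stft = np.fft.rfft(frames, n=nfft, axis=1)    # one-sided DFT of each frame
    return (np.abs(stft) ** 2).T                  # |STFT|^2, frequency x time

# Toy usage: a 10 ms random burst sampled at 48 kHz
fs = 48000
y = np.random.randn(int(0.01 * fs))
S = stft_spectrogram(y)
print(S.shape)  # (33, n_frames)
```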

Fig. 1
figure 1

Example of a gunshot spectrogram. The x-axis represents the time expressed in seconds, while the y-axis represents the frequencies expressed in kHz. The color represents the PSD at the given time-frequency

The result, i.e., the squared magnitude of the STFT, consists of a bi-dimensional matrix that maps the audio frequencies to time-localized points [20], i.e., the spectrogram representation of the power spectral density of the signal.

The visual representation of audio traces through spectrograms has been extensively leveraged in the literature in the context of audio classification [6], sound event classification [2], emotion recognition [26], human activity recognition [20], cross-modality feature learning [19], and gunshot classification [18].

3.2 Convolutional neural network

A CNN belongs to the class of deep neural networks that have one or more convolutional layers (i.e., layers that perform convolution operations) [13]. A convolution is a linear operation that consists of sliding a filter of parametric size over the input representation (usually a visual image). The application of the same filter to different overlapping filter-sized portions of the input generates a feature map. There are several types of filters, also known as operators. Each filter tries to identify a specific feature within the input representation. For example, the Sobel, the Prewitt, and the Canny operators highlight edges, while the Harris and the Shi and Tomasi operators highlight corners. One of the most powerful features of CNNs, which is also the reason behind their wide adoption, is the ability to automatically apply an extensive number of filters to the input representation in parallel, thus highlighting specific features in every part of the input image simultaneously.
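As a toy illustration of how sliding a single filter produces a feature map, the following Python snippet applies the Sobel operator mentioned above to a small synthetic image; the image and its size are purely illustrative.

```python
import numpy as np
from scipy.signal import convolve2d

# Sobel operator for vertical edges (one of the hand-crafted filters named above)
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

# Toy 6x6 "image" with a sharp vertical edge in the middle
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Sliding the filter over every 3x3 patch produces the feature map;
# 'valid' keeps only the fully overlapping positions (no padding)
feature_map = convolve2d(image, sobel_x, mode='valid')
print(feature_map)  # non-zero responses along the edge (sign follows the kernel flip), zeros elsewhere
```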

CNNs can be seen as regularized versions of multilayer perceptrons, i.e., versions that discourage learning overly complex models. While multilayer perceptrons rely on several fully connected layers (a layer is fully connected if all the neurons it is composed of are connected to all the neurons of the next layer), CNNs exploit a hierarchical structure that allows building complex patterns from small and simple ones.

Figure 2 depicts a typical architecture of a CNN. The dimension of the input image (in this case representing a handwritten digit) keeps decreasing while going deeper into the neural network, while the number of filters, and thus the number of features the architecture highlights, increases. A CNN usually has three types of layers: (i) convolutional layers, to perform the convolution operations on the input; (ii) pooling layers, to discretize the input and reduce the number of learnable parameters; and (iii) fully connected layers, which are essentially feed-forward neural networks, usually placed at the end of the architecture. The goal of the fully connected layers is to hold the high-level features found during the convolutions and to learn non-linear combinations of these features before assigning the input image a label. Details about these layers contextualized in our model are provided in Section 6.1.

Fig. 2
figure 2

Example of a CNN. LeNet-5 [14] is able to identify handwritten digits for zip code recognition in the postal service

One of the fundamental decisions to be taken when designing a CNN, or generically a neural network, concerns the representation of the input data. Several input representations are available in the literature, each bringing its advantages and drawbacks. Although for visual images the choice is straightforward, for audio samples numerous alternatives are possible, including MFCC, the raw digitized sample stream, machine-discovered features, and hand-crafted features. Even if the best input representation to adopt is strongly dependent on the problem to solve, several studies in the literature show that feeding CNNs with spectrograms is effective in many fields, including musical onset detection [27], human detection and activity classification [11], music classification [1], and other applications [32].

3.3 Guns and gunshots

Gunshots are the result of multiple acoustic events, namely, the muzzle blast created by the explosion inside the barrel and the ballistic shockwave that is generated by the supersonic projectile. These phenomena are the result of many characteristics and variables that eventually sum up and generate the acoustic blast, including the firearm type, model, barrel length, ammunition type, powder quantity, weight and shape of the projectile, and possibly others. The aim of this work is to estimate to what extent it is possible to use a gunshot as a unique fingerprint that uniquely identifies one or more of the aforementioned variables. Figure 3 summarizes the most important characteristics affecting the acoustic blast generated by a gun.

Fig. 3
figure 3

Variables taken into account in our analysis: category of firearm, caliber, and gun model

Our observation is that different configurations of the aforementioned parameters may lead to unique gunshot patterns that can be detected by analyzing the frequency-time decomposition of the gunshot blast. In the next sections, we demonstrate how Convolutional Neural Networks (CNNs) can be effectively used to detect these patterns, thus uniquely identifying the category of gun, the caliber, and finally, the model of the gun.

4 Dataset description

Table 2 shows the dataset considered in this work. We collected the samples from several YouTube channels, such as C4Defense, hickok45, EmanuelRJSniper, mixup98, OneGear, and ReloaderJoe. Our choice of guns takes into account two main aspects: the Category of Guns and the Caliber.

Table 2 Dataset: Gun Model, Caliber, and number of extracted samples

Category of guns

We considered 30 different pistols, 18 rifles, and 11 shotguns. As for pistols, we considered 22 revolvers and 10 semiautomatic pistols.

Caliber

We took into account the most popular calibers in the U.S. and worldwide [29, 31], such as 9mm and .45acp for semiautomatic pistols, .44M and .357M for revolvers, 7.62x39 and 5.56NATO for rifles, and the 12 gauge for shotguns.

4.1 Muzzle blast: preliminary considerations

When a gun is fired, there are two distinct acoustic phenomena: the muzzle blast and the ballistic shockwave [23]. The latter is generated by the bullet, which compresses the air in front of itself, creating a sonic boom that propagates as a cone whose vertex is the bullet itself. Conversely, the muzzle blast is a high-energy acoustic signal originating at the gun's muzzle, with a spherical wavefront centered at the muzzle and propagating at the speed of sound. The ballistic shockwave is a very important source of information to locate a sniper in an open field [24, 30]. However, to achieve that, the ballistic shockwave has to be sampled from different locations, requiring an array of microphones. Moreover, the ballistic shockwave cannot be observed for subsonic projectiles such as those used in shotguns and pistols.

Given the aforementioned considerations, we focus on the muzzle blast and the echoes associated with it. In the following, we discuss and highlight three critical parameters that have to be carefully set in order to maximize the detection performance of a neural network: (i) muzzle blast duration, (ii) number of frequency bins, and (iii) the number of time slots.

Figure 4 shows the acoustic signal amplitude recorded from a Beretta PX4 Storm, 9mm. The muzzle blast lasts for a few milliseconds (up to 5ms in the figure), depending on the model of gun and caliber. We also observe some echo effects (Echo 1, Echo 2, and Echo 3) at 10ms, 22ms, and 63ms due to reflections of the sound from obstacles around the shooter. We highlight that this is consistent with previous findings from other studies [23], while the muzzle blast duration will be a critical parameter for the analysis carried out in this work.

Fig. 4
figure 4

Acoustic signal amplitude of a muzzle blast for a Beretta PX4 Storm, 9mm

Figure 5 shows the PSD as a function of time and frequency (spectrogram) associated with the muzzle blast in Fig. 4. We consider both the bi-dimensional and the three-dimensional representation of the spectrogram. We observe that the muzzle blast (time less than 5ms) spans all the frequency components between 0 and 24kHz, with a significant power ranging between -30dB (lower frequencies) and -80dB (higher frequencies). As soon as the blast finishes, the echoes occupy the frequencies below 18kHz with a decreasing power between -40dB and -60dB. The aforementioned spectrogram components constitute the input for the training process of our neural network.

Fig. 5
figure 5

Spectrogram of a muzzle blast: bi-dimensional and three-dimensional PSD of a muzzle blast (Beretta PX4 Storm, 9mm) as a function of time and frequency

We identify two more critical parameters affecting our algorithm performance: the number of frequency bins and the number of time slots. For our analysis, we adopted the spectrogram function of MATLAB-R2019b, considering as input the acoustic sample (0.1 seconds from the beginning of the blast), a window of size w = 44 to divide the signal into segments (windowed according to the Hann function), no = ⌊w/2⌋ as the number of overlapping samples between adjacent segments, fl = 65 as the FFT length, and fs = 48000 as the number of samples per second acquired by the microphone. Assuming the previous parameters, the spectrogram function returns the PSD of (fl + 1)/2 frequencies and \(\lfloor {\frac {length(x) - no}{w - no}}\rfloor \) time bins, where x is the vector of the acoustic samples, equal to 0.21 ⋅ fs = 10080 samples (we considered the first 0.21 seconds after the first abrupt change, as per Fig. 4). For instance, in the previous example, the frequency range (0 to 24kHz) has been divided into 33 bins, while the time has been divided into 46 slots.
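As a rough illustration of how the number of frequency and time bins follows from w, no, fl, and fs, the following Python snippet mimics the above configuration with scipy (our actual processing uses the MATLAB spectrogram function, and minor differences in the windowing conventions are possible).

```python
import numpy as np
from scipy.signal import spectrogram

fs = 48000            # samples per second
w  = 44               # window (segment) length
no = w // 2           # overlapping samples between adjacent segments
fl = 65               # FFT length

x = np.random.randn(int(0.21 * fs))   # stand-in for the acoustic samples

# Rough scipy counterpart of MATLAB's spectrogram with a Hann window
f, t, Sxx = spectrogram(x, fs=fs, window='hann', nperseg=w,
                        noverlap=no, nfft=fl, mode='psd')

print(len(f))   # (fl + 1) / 2 = 33 frequency bins
print(len(t))   # floor((len(x) - no) / (w - no)) time bins
```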

4.2 Quality of the audio samples

In the following, we provide a quantitative analysis of the quality of the collected audio samples. As a quality metric, we consider the Signal-to-Noise Ratio (SNR) computed on each muzzle blast, from the actual start of the blast for a period of 400ms. For each audio sample, we consider a pre-defined reference noise pattern constituted by random samples of amplitude 0.1, i.e., one-tenth of the maximum signal amplitude taken by the microphone. This sound pressure is equivalent to the typical background noise that can be sampled in an outdoor environment characterized by a gentle wind. Figure 6 shows the probability distribution function associated with the SNR computed as described above. The overall audio quality is very high, since the muzzle blast is more than 20dB above the reference noise pattern. We observe that even the echoes can be easily distinguished from the noise reference.
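The SNR metric can be sketched as follows (Python, for illustration). The uniform distribution of the reference noise and the use of the mean squared amplitude as the power estimate are assumptions, since the text above only fixes the noise amplitude (0.1) and the 400ms analysis window.

```python
import numpy as np

def blast_snr(x, blast_start, fs=48000, win_s=0.4, noise_amp=0.1, seed=0):
    """SNR (dB) of a muzzle blast against a synthetic reference noise pattern.

    Assumptions: the reference noise is uniform in [-noise_amp, +noise_amp],
    and power is estimated as the mean squared amplitude over the window.
    """
    blast = x[blast_start:blast_start + int(win_s * fs)]
    rng = np.random.default_rng(seed)
    noise = rng.uniform(-noise_amp, noise_amp, size=blast.size)
    p_signal = np.mean(blast ** 2)
    p_noise = np.mean(noise ** 2)
    return 10 * np.log10(p_signal / p_noise)
```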

Fig. 6
figure 6

Sound quality analysis: SNR of a main blast and its associated echoes, assuming a reference random noise of amplitude one tenth of the microphone saturation threshold

5 Dataset generation

We generated a dataset of 3655 samples extracted from videos found on YouTube. Each of the collected audio samples has a sample rate of either 48000 or 44100 samples per second. Generating a dataset of gunshots extracted from YouTube videos involves the following steps:

  • Audio extraction. We performed the audio extraction (MP3 format) from the selected videos using the youtube-dl [33] and ffmpeg [4] tools.

  • Abrupt change detection. A preliminary filtering is performed by identifying abrupt changes in the audio signal.

  • Gunshot detection. Gunshots are detected among blasts by relying on a Support Vector Machine (SVM) learning algorithm.

In the following, we describe the procedure for automatically extracting gunshots from an audio trace, focusing on abrupt change detection and gunshot detection.

5.1 Identification of abrupt changes in an audio trace

To detect abrupt changes in an audio trace, we computed the variance over a sliding window of 5ms, equivalent to either 220 or 240 samples depending on the quality of the audio trace, i.e., 44100 or 48000 samples per second, respectively. Subsequently, we searched for the peaks adopting windows of size 0.3 seconds and a minimum peak prominence of 0.3. Figure 7 shows the three computation stages, from the sound pressure to the detected blast sequences, passing through the moving variance computation. This figure refers to two sound chunks extracted from an audio trace, where the first part (i.e., 0 ≤ t ≤ 5.5 seconds) is a sequence of gunshots, while the second part (i.e., t > 5 seconds) is mainly constituted by voice. We stress that the main aim of this phase is to detect abrupt changes in the sound pressure; in the following, we show how gunshots are identified among them.
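A possible implementation of this abrupt-change detector is sketched below (Python, for illustration; our processing relies on MATLAB). The amplitude scale is assumed to be normalized to [-1, 1], and the 0.3-second peak-search window is mapped to a minimum inter-peak distance.

```python
import numpy as np
from scipy.signal import find_peaks

def abrupt_changes(x, fs):
    """Return sample indices of abrupt changes in the audio trace x."""
    win = int(0.005 * fs)                       # 5 ms window (220 or 240 samples)
    # Moving variance over a sliding window (cumulative-sum formulation)
    c1 = np.cumsum(np.insert(x, 0, 0.0))
    c2 = np.cumsum(np.insert(x ** 2, 0, 0.0))
    mean = (c1[win:] - c1[:-win]) / win
    var = (c2[win:] - c2[:-win]) / win - mean ** 2
    # Peaks at least 0.3 s apart with a minimum prominence of 0.3
    peaks, _ = find_peaks(var, distance=int(0.3 * fs), prominence=0.3)
    return peaks
```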

Fig. 7
figure 7

Detection of abrupt changes in audio traces: from sound pressure to abrupt change detection by computing moving variance and peak detection

5.2 Gunshot detection

Gunshot detection is performed via a human-assisted supervised learning approach. The intention is to have a growing training set of actual gunshots that is supervised by the user. The user checks for both false positives and false negatives by listening to the newly generated samples in the training set. Figure 8 shows the training, validation, and testing procedure. We assume that the training set is populated with an initial dataset of actual gunshots that have been manually selected; in our case, we started from an initial dataset of only 10 gunshot samples. At each cycle, a new model is trained with the current training set (Step 1 in Fig. 8). Subsequently (Step 2 in Fig. 8), new samples are selected from the list generated by the procedure presented in Section 5.1 and classified with the current model. The output is assessed by the supervisor (Step 3 in Fig. 8), and the verified samples are added to the training set (Step 4 in Fig. 8).
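The loop of Fig. 8 can be sketched as follows (Python with scikit-learn, for illustration; the feature extraction, the SVM configuration, and the supervisor interface are not specified here and are therefore placeholders). The verified set is assumed to contain both Shot and No-Shot examples, so that the binary SVM can be trained.

```python
import numpy as np
from sklearn.svm import SVC

def human_assisted_loop(X, y, candidate_batches, featurize, ask_user):
    """Grow a verified gunshot training set iteratively (sketch of Fig. 8).

    X, y: initial verified feature vectors and labels (both classes required).
    featurize(sample) and ask_user(sample, suggested_label) are placeholders
    for the feature extraction and for the supervisor who listens and corrects.
    """
    X, y = list(X), list(y)
    clf = SVC(kernel='linear')                        # SVM classifier
    for batch in candidate_batches:                   # abrupt changes from Sec. 5.1
        clf.fit(np.array(X), np.array(y))             # Step 1: (re)train the model
        feats = np.array([featurize(s) for s in batch])   # Step 2: new candidates
        suggestions = clf.predict(feats)              # tentative Shot / No-Shot labels
        for sample, feat, suggested in zip(batch, feats, suggestions):
            label = ask_user(sample, suggested)       # Step 3: supervisor verifies
            X.append(feat)                            # Step 4: extend training set
            y.append(label)
    return clf
```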

Fig. 8
figure 8

Gunshot detection via human-assisted supervised learning. The SVM classifier is trained by verified samples of gunshots. When new gunshot samples are tested, the output of the classifier is verified (by the user) and added to the Training Set

Classification performance

To assess the quality of the classification procedure, we considered 6 additional videos (V1, …, V6) downloaded from YouTube, which are not included in the training set. For each video, we detected the abrupt changes according to the procedure presented in Section 5.1, and we executed the gunshot detection procedure presented in Fig. 8. As for the Training Set, we considered the one we generated from the samples listed in Table 2. Figure 9 shows the frequency of the similarity indexes provided by the SVM classifier for the Shot and No-Shot audio samples, with red crosses and green circles, respectively. The similarity indexes were categorized into bins of width 10, where each cross/circle aggregates adjacent similarity indexes. Figure 9 represents the decision after one iteration of the procedure presented in Fig. 8. The Shot vs No-Shot decision is taken as a function of the threshold Thr, which has been empirically set to zero. We observed that 96% of the No-Shot samples feature a similarity index of -189.5, while the remaining 4% are spread between -178.9 and -0.55. There are no samples from the No-Shot class with a similarity index greater than 0. As for the Shot class, the samples are distributed between 0.41 and 275, with frequencies between 1% and 11%. Even in this case, we highlight that there are no samples from the Shot class with a similarity index less than 0.

Fig. 9
figure 9

Gunshot detection performance: frequency distribution of Shot and No-Shot samples as a function of their similarity index

To precisely assess the effectiveness of our solution, we manually checked all of the classified samples, namely, Shot vs No-Shot. Table 3 shows the result of our analysis. For each video, we report the number of detected abrupt changes (N), the threshold used by the SVM classifier (Thr = 0), True Positive (TP), False Positive (FP), True Negative (TN), False Negative (FN), the actual number of gunshots (Actual), and the overall accuracy of the detection algorithm. As previously stated, during our evaluation, we considered only one iteration as depicted in Fig. 8. We would like to highlight that the proposed algorithm achieves the main purpose of generating a dataset of gunshot samples (i) in a fast and efficient way and (ii) with the minimum amount of false positives. The output of this phase will be the training set to be used by the CNN.

Table 3 Shot detection performance considering 6 videos

At this stage, we aim at minimizing the number of FP, which might bias the subsequent training process. We also aim at maximizing the efficiency of the process of creating a large dataset of gunshot samples. Therefore, the task of the supervisor mainly reduces to listening to very few samples (TP + FP) out of the whole set of N candidates, in order to remove the FP, which are overall very few: only 2 out of 4931 samples. Conversely, we observe that our approach might lose some good samples (FN = 16 + 4). However, these samples do not affect the performance of our solution; hence, we do not consider them critical.

The above procedure has been applied to each audio sample found in Table 2 to generate a dataset of actual gunshot samples, that is, one dataset for each gun model.

6 Overall architecture

Figure 10 depicts the overall architecture of our CNN, consisting of five layers with weights: the first four are convolutional layers, while the last one is a fully connected layer. The output of the fully connected layer is fed to a 7-way softmax, which outputs the probability distribution over the 7 class labels. The details of our architecture, including information about the layers and their learnable parameters, are reported in Appendix A.

Fig. 10
figure 10

Structure and details of our Convolutional Neural Network

Considering the dimension of the starting image and the need to also give importance to peripheral pixels, every convolution in our architecture makes use of padding to avoid losing information. By adding additional pixels to the border, every convolutional layer outputs an image with the same number of pixels as the one fed into that layer. Furthermore, in our CNN architecture, we use a stride of 1 during convolutions and a stride of 2 during the Max Pooling application. The stride is a critical hyperparameter in the context of CNNs, as it specifies the number of cells by which filters (e.g., convolution filters, pooling filters) slide over the image. If the stride is equal to 2, the filter starts from the top left corner and moves over the image with jumps of 2 units at a time. By considering square filters (i.e., f x f) and square initial images (i.e., n x n), after having specified the dimension of the filters f (whether they are convolutional filters or pooling filters), the stride parameter s, the dimension of the initial image n, and the padding p, it is possible to calculate the dimension of the square output image of a layer as:

$$ \left\lfloor\frac{n + 2p - f}{s} + 1\right\rfloor $$
(3)

Our choice to keep a unit stride during the convolutions and a stride equal to 2 during the pooling is guided by the intention of not losing information during the convolution phases, while exploiting the pooling technique to summarize the features, thus reducing the input dimensionality.
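As a quick numeric check of (3): with 3 x 3 filters, a 'same' padding of p = 1, and unit stride, a 36-pixel side is preserved, while a 3 x 3 pooling with stride 2 and no padding roughly halves it.

```python
def out_dim(n, f, p, s):
    """Output side length of a square layer, Eq. (3)."""
    return (n + 2 * p - f) // s + 1

# 3x3 convolution with 'same' padding (p=1) and stride 1 preserves the size
print(out_dim(36, f=3, p=1, s=1))   # 36
# 3x3 max pooling with stride 2 and no padding roughly halves it
print(out_dim(36, f=3, p=0, s=2))   # 17
```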

The first convolutional layer filters the 36 x 99 x 1 spectrogram image with 40 kernels of size 3 x 3 x 1, with unit stride and ‘same’ padding. Setting the padding to “same” allows the classifier to calculate and add the padding so that the output has the same size as the input. The second convolutional layer takes as input the normalized (40 channels) and pooled (3x3 max pooling, stride = 2) output of the first convolutional layer and filters it with 80 kernels of size 3 x 3 x 40. The third convolutional layer takes as input the normalized (80 channels) and pooled (3x3 max pooling, stride = 2) output of the second convolutional layer and filters it with 160 kernels of size 3 x 3 x 80. The fourth convolutional layer takes as input the normalized (160 channels) and pooled (3x3 max pooling, stride = 2) output of the third convolutional layer and filters it with a further 160 kernels of size 3 x 3 x 160. The normalized (160 channels) and pooled (1x13 max pooling, stride = 2) output of the fourth convolutional layer is fed to a 7-neuron fully connected layer that, in turn, outputs the result to a 7-way softmax, which produces a distribution over the 7 class labels.
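A minimal sketch of this layer stack is reported below in PyTorch for illustration (our implementation relies on MATLAB). The ordering of batch normalization, ReLU, and pooling within each block, the padding of the final 1x13 pooling, and the use of a lazily-sized fully connected layer are assumptions made to keep the sketch self-contained and runnable.

```python
import torch
import torch.nn as nn

def block(in_ch, out_ch, pool):
    # conv ('same' padding, unit stride) -> batch norm -> ReLU -> max pooling
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
        pool,
    )

model = nn.Sequential(
    block(1,   40,  nn.MaxPool2d(3, stride=2)),          # conv1: 40 kernels 3x3x1
    block(40,  80,  nn.MaxPool2d(3, stride=2)),           # conv2: 80 kernels 3x3x40
    block(80,  160, nn.MaxPool2d(3, stride=2)),           # conv3: 160 kernels 3x3x80
    block(160, 160, nn.MaxPool2d((1, 13), stride=2,       # conv4: 160 kernels 3x3x160
                                 padding=(0, 6))),        # pooling padding is an assumption
    nn.Flatten(),
    nn.LazyLinear(7),                                     # 7-neuron fully connected layer
    nn.Softmax(dim=1),                                    # 7-way softmax
    # For training, one would usually omit the Softmax and use nn.CrossEntropyLoss.
)

x = torch.randn(8, 1, 36, 99)      # mini-batch of 36x99x1 spectrogram images
print(model(x).shape)              # torch.Size([8, 7])
```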

6.1 CNN details

Activation Function

Our neural network relies on the ReLU activation function [17] after each convolution. The ReLU activation function, whose operation is summarized in (4), outputs the maximum between zero and the input value.

$$ f(x) = \begin{cases} x & \text{if $x \geq 0$}, \\ 0 & \text{otherwise} \end{cases} $$
(4)

Although the literature uses other variants (e.g., Tanh, SoftSign, Sigmoid), several studies show that ReLU outperforms the competitors in terms of performance [5, 13].

Regularization

Our neural network relies on Dropout [28] regularization to reduce the likelihood of overfitting. The Dropout regularization technique randomly drops units (together with their connections) from the neural network during the training phase with a given probability. This discourages neurons from relying on the presence of particular other neurons and forces them to learn more robust features in conjunction with different ones [13], thus reducing the probability of memorizing the training set.

Normalization

Training a neural network without normalization leads to the internal covariate shift phenomenon, where the distribution of each layer's inputs changes during training, thus requiring a more sophisticated tuning of the parameters. To mitigate this issue, we add Batch Normalization layers after each convolution. The Batch Normalization technique [7] performs the normalization for each training mini-batch, allowing the usage of higher learning rates and reducing the need for careful tuning of the parameters. As summarized in (5)-(7), Batch Normalization normalizes the output of an activation layer by subtracting the mean and dividing by the standard deviation of the batch.

Given a mini-batch \(\beta = \{x_{1}, \dots , x_{m}\}\):

$$ \mu_{\beta} \leftarrow \frac{1}{m} \sum\limits_{i=1}^{m} x_{i} $$
(5)
$$ \sigma^{2}_{\beta} \leftarrow \frac{1}{m} \sum\limits_{i=1}^{m} (x_{i} - \mu_{\beta})^{2} $$
(6)
$$ \hat{x}_{i} \leftarrow \frac{x_{i} - \mu_{\beta}}{\sqrt{\sigma^{2}_{\beta} + \epsilon}} $$
(7)

where 𝜖 is a constant added to the mini-batch variance for numerical stability, specified as a scalar of at least \(10^{-5}\).

Although the Batch Normalization technique brings a slight regularization effect to the neural network, in some cases eliminating the need for Dropout [7], we find that the combined use of the Batch Normalization and Dropout aids generalization [13].

Discretization

The application of discretization techniques to an input representation consists of reducing its dimensionality in order to evaluate the features within the resulting, summarized sub-regions. This process allows us to mitigate overfitting of the training set and to reduce the number of parameters to be learned during training, thus reducing the overall computational cost. To attain these benefits, in our architecture we place a Max Pooling sample-based discretization layer after each activation layer. Max Pooling applies a max filter to non-overlapping sub-regions of the input feature map, whose dimension is dictated by the dimension of the filter. When Max Pooling is applied, the passage of the moving filter over a sub-region outputs a single value, namely the maximum value of that sub-region.
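For instance, a 2 x 2 max pooling over non-overlapping sub-regions reduces a 4 x 4 feature map to 2 x 2, as in the following toy example (note that our architecture actually uses 3 x 3 pooling with stride 2, i.e., slightly overlapping regions):

```python
import numpy as np

fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 1],
                 [7, 2, 9, 5],
                 [0, 8, 3, 4]])

# Max pooling with 2x2 non-overlapping sub-regions: keep only the maximum of each block
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[6 2] [8 9]]
```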

Output

As for the output layer, our neural network architecture relies on the commonly used softmax function. The softmax function, taking as input a vector of real numbers, produces a probability distribution proportional to the exponentials of the input numbers. In detail, the input real numbers are mapped into the (0,1) interval and sum up to one, thus allowing the output provided by the softmax function to be treated as probabilities.

In general, given a vector of real numbers \(v = (v_{1}, \dots , v_{K}) \in \mathbb{R}^{K}\), the standard unit softmax function \(\sigma : \mathbb{R}^{K} \rightarrow \mathbb{R}^{K}\) is defined by:

$$ \sigma(v)_{i} = \frac{e^{v_{i}}}{{\sum}_{j=1}^{K} e^{v_{j}}} $$
(8)

6.2 Learning details

Table 4 summarizes the training options of our network, which are detailed in the following.

Table 4 Training options of our network

Optimizer

An optimizer is defined as an algorithm (or a method) used to tune the parameters of a neural network with the goal of reducing the loss function. In our architecture, we rely on the Adam optimizer [12], an extensively adopted optimizer that inherits the advantages of both RMSProp and SGD with momentum (i.e., SGD where each gradient update is a linear combination of the previous gradient updates). From RMSProp it inherits the squared gradients to scale the learning rate, while from SGD with momentum it inherits the concept of the moving average of the gradients. An empirical analysis conducted in [12] shows that Adam outperforms the other optimizers, thus working better in practice. As recommended in the original paper (whose algorithm is reported below with our parameters), in our implementation we set the gradient decay factor β1 to 0.9, the squared gradient decay factor β2 to 0.999, and the denominator offset (to avoid divisions by zero) to \(10^{-8}\). However, although the original paper recommends using an initial learning rate of \(10^{-3}\), we empirically found (relying on the grid search hyperparameter tuning technique) that setting this value to \(x \cdot 10^{-4}, x \in [1,3]\), provides better results.

figure a
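For reference, a single Adam parameter update with the hyperparameters listed above can be sketched as follows (Python, for illustration; the learning rate shown is within the empirically chosen range), following the formulation in [12].

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=2e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update of parameters theta given gradient grad (step counter t >= 1)."""
    m = beta1 * m + (1 - beta1) * grad            # moving average of the gradients
    v = beta2 * v + (1 - beta2) * grad ** 2       # moving average of the squared gradients
    m_hat = m / (1 - beta1 ** t)                  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                  # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```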

Number of Epochs

An epoch is defined as a single pass through the training set, i.e., 1 forward pass and 1 backward pass over all the training samples. A forward pass is defined as the calculation process that obtains the output values from the input data, from the first layer to the last layer, while a backward pass is defined as the process of changing the weights (i.e., learning) by relying on an optimization algorithm (e.g., the gradient descent algorithm), from the last layer backward to the first layer. We empirically set the maximum number of epochs to 50, since subsequent epochs do not bring any benefit to the learning of our model.

Mini-Batch Size

Using mini-batches consists of processing small subsets of training samples in every iteration, instead of processing them all together. The choice of the mini-batch size (i.e., the number of training samples to process per iteration) does not affect the performance of the model in terms of accuracy, but it affects the resources required during the training process. A larger mini-batch size requires more memory and takes more time per iteration, but allows the classifier to better exploit vectorization (i.e., the linear transformation of a matrix into a column vector), while a smaller mini-batch size requires less memory but loses the speed-up given by vectorization. In our model, we set the mini-batch size to 8, to better optimize the resources of our server.

Shuffle

The “shuffle” option shuffles the order in which training samples are fed to the model, with the goal of reducing variance and, thus, overfitting. Shuffling the training samples becomes crucial when mini-batches are used, due to the need to avoid batches containing highly correlated samples, which would slow down (or, in many cases, compromise) the performance of the model. In our model, we shuffle the training data before each training epoch, as well as the validation data before each validation.

Plot

The “plot” option in MATLAB provides several pieces of information to be taken into account during the training process. This information includes, but is not limited to, the mini-batch training loss and accuracy, the smoothed training loss and accuracy (i.e., the result of the application of a smoothing algorithm to the training accuracy), the validation loss and accuracy, and hardware resource usage.

Validation Data

The validation data, also known as the validation set, refers to a subset of samples held out from the training set, which the model relies on to evaluate the effectiveness of its training. In our case, following the 80/20 rule, the validation set is represented by 20% of the whole dataset.

Validation Frequency

The validation frequency represents the number of iterations between evaluations of validation metrics. We empirically set this value to \(\lfloor \frac {|training\_set|}{miniBatchSize}\rfloor \).

7 Performance

7.1 Category of gun identification

In this section, we consider the neural network previously introduced to infer the Category of gun. We reconsider Table 2 and we divide the dataset into three classes, namely, Pistols, Rifles, and Shotguns, according to the gun models in the dataset. Figure 11 shows the confusion matrix computed as the average of 50 training and validation runs.

Fig. 11
figure 11

Confusion matrix associated with the classification of the Category of Guns

The accuracy acc can be computed according to (9).

$$ acc = \frac{1}{N} \sum\limits_{i=1}^{N_{C}} x_{ii} $$
(9)

where N = 720 is the total number of samples, NC = 3 is the number of classes, and xii is the i-th diagonal element of the confusion matrix, yielding acc ≈ 0.92. The confusion matrix in Fig. 11 also reports column and row summaries, i.e., predicted and true classes, respectively.
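In code form, (9) is simply the trace of the confusion matrix divided by the total number of samples; the matrix below is a purely hypothetical example.

```python
import numpy as np

# Hypothetical 3x3 confusion matrix (rows: true classes, columns: predicted classes)
cm = np.array([[50,  2,  1],
               [ 3, 40,  5],
               [ 0,  4, 45]])

acc = np.trace(cm) / cm.sum()   # Eq. (9): sum of diagonal elements over N
print(acc)                      # 0.9
```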

We observe that the classification error spans between 4.4% and 12.9% for the Pistol and Rifle classes, respectively. The class Rifle (an actual gunshot from a rifle) is incorrectly classified as either Pistol (4 times) or Shotgun (21 times) in 12.9% of the cases. The same type of analysis can be performed column-wise, where the prediction error spans between 1.2% and 24.1%. As an example, we observe that a prediction of class Shotgun is wrong in 24.1% of the cases (6 times for Pistol and 21 times for Rifle).

Finally, we observe that while the Pistol class is correctly classified most of the time, the vast majority of errors occur between the Rifle and Shotgun classes.

7.2 Caliber identification

In this section, we report the performance of our classification algorithm when considering 7 different calibers from Table 2. We group the video chunks based on gun caliber, obtaining 7 different classes, namely, 12, 357M, 44M, 45acp, 556NATO, 762x39, and 9mm. Figure 12 shows the confusion matrix computed as the average of 50 training and validation runs. The overall accuracy computed according to (9) sums up to acc ≈ 0.9. The best and worst performance are achieved by 9mm and 762x39, respectively. In particular, class 762x39 is wrongly predicted as class 556NATO 8 times. Classes 556NATO and 762x39 are intrinsically similar, since both belong to the Rifle class; therefore, they are prone to be confused. Nevertheless, we observe that this phenomenon is very limited, since we have 3 cases of 556NATO classified as 762x39, and 8 cases for the opposite configuration. We also observe that the 556NATO and 762x39 classes experience a significant amount of misclassifications with classes 12 and 357M. Conversely, classes 44M and 9mm are the most likely to be correctly classified, with 94.8% and 93.6%, respectively.

Fig. 12
figure 12

Confusion matrix associated with the classification of the Calibers of Guns

Highlights

By combining Figs. 11 and 12, we can draw some interesting observations. The Rifle class is misclassified as class Shotgun 21 times (the opposite happens 9 times) in Fig. 11, while 7 + 4 = 11 times (4 + 5 = 9 times) in Fig. 12. We think that the error is not due to a specific caliber, either the 556NATO or the 762x39, but to the feature similarities between the two classes, Shotgun and Rifle.

The Pistol class is also misclassified as the Rifle class 15 times. By looking into the details of Fig. 12, we observe that the major source of misclassifications comes from the 357M class, classified 4 times as 556NATO and 1 time as 762x39. We observe that the 357M is the most powerful among the pistol calibers; hence, it is the closest to the Rifle class in terms of bullet size, pressure, and barrel diameter.

Finally, we observe that our solution is particularly robust in detecting pistols. In particular, one of the most widely adopted calibers worldwide (9mm) is characterized by a very limited number of misclassifications (9 out of 167 in total). The same considerations apply to classes 44M and 45acp.

7.3 Model identification

In this section, we consider all of the gun models previously introduced in Table 2 with the aim of classifying each of them. The total number of classes sums up to 59, which is the number of gun models considered throughout this paper. We report the confusion matrix associated with the aforementioned classification in Appendix B. The accuracy sums up to acc ≈ 0.90 and the maximum number of misclassifications (per model) never exceeds 2. We observe that class 38 (Ruger GP100 Match Champion) is never correctly classified. Finally, we highlight that the number of samples for the validation process is small (20% of each gun model in Table 2). Nevertheless, the diagonal of the matrix in Appendix B collects the vast majority of the samples confirming the effectiveness of our model. We are confident that a larger data sample can increase the accuracy performance and effectiveness of gun model detection from gunshot sounds.

7.4 Testing

To validate our methodology, we tested the model against a new set of audio samples taken from videos different from the ones considered before, with varying conditions, including the background noise and the relative positions between the microphone and the shooter. We consider a total of 115 audio samples constituted by 13 Pistols (Beretta 92 FS), 59 Rifles (Ruger AR, Daniel Defense M4 A1 SOCOM, Maadi AK-47), and 44 Shotguns (Maverick 88, Winchester Model 300 Defender). We observe that Pistol and Rifle classification is characterized by high performance, where only 4 Rifle samples are misclassified as Pistol. As for the Shotgun class, we highlight that the two shotguns considered are not in the training set (Table 2) because we did not find any valid samples from additional videos. Although the audio samples come from different shotgun models, our algorithm can still detect the caliber with high probability (only 8 audio samples are misclassified), which verifies the effectiveness and correctness of our algorithm. Finally, we observe that the overall accuracy is consistent with the validation process and sums up to about 0.9.

8 Conclusion

Although scenarios requiring in-depth digital forensics of gunshots are countless, including military operations, mass shootings, and possibly others, current solutions are far from reaching an adequate accuracy under real conditions.

In this paper, we have proposed an effective and efficient methodology to uniquely fingerprint gunshots, enabling the identification of the category, caliber, and model of the gun with an accuracy higher than 90%, regardless of the capture conditions. Unlike existing solutions, our technique requires neither ad-hoc deployment of microphone networks nor a specific sample quality, and it is agnostic to the microphone position with respect to the shooter. We have demonstrated that forensic analysis in the time-frequency domain of a single gunshot audio sample recorded by a commercial microphone (44100 samples per second) can be effectively used to infer the gun model (and other related characteristics). The proposed solution may lead to new insights and further developments in the area of weapon classification, considering more samples, different noise levels, and a much larger weapon database.