Unsupervised Meta-Embedding for Bird Songs Clustering in Soundscape Recordings

Code available on GitHub

Abstract

Amazonian forests are threatened by numerous anthropogenic pressures that are not visible in satellite imagery, such as over-hunting or undercover forest degradation. Knowledge of the effects of these degradations is essential for an effective local conservation policy. However, these effects can only be assessed using quantitative methods for monitoring biodiversity in the field. In recent years, ecoacoustics has offered an alternative to traditional techniques with the development of Passive Acoustic Monitoring (PAM) systems, which make it possible, among other things, to automatically monitor species that are difficult for observers to identify, such as crepuscular and nocturnal tropical birds. Although such systems make it possible to acquire large datasets in the field, these data are often difficult to process because they generally represent several thousand hours of recordings that need to be annotated and validated manually by an expert with in-depth knowledge of the phenology and behavior of the species studied. The objective of this thesis is to develop a new method to facilitate the work of ecoacousticians in managing large unlabeled acoustic datasets and to improve the identification of potential new taxa. Building on advances in Meta-Learning methods and unsupervised learning techniques integrated into the Deep Learning (DL) framework, the Meta Embedded Clustering (MEC) method is proposed to progressively discover and improve the inherent structure of unlabeled data.

Background

Ecoacoustics

This thesis mainly relies on ecoacoustics, a recent discipline that studies sound across a broad range of spatial and temporal scales in order to tackle biodiversity and other ecological questions. The discipline was introduced in 2015 [1].

In this thesis, soundscape recordings are used to analyze sounds produced by animals (i.e. biophony). A soundscape is generally described by three components:

Geophony (i.e. sounds generated by physical events such as waves, earthquakes, or rain)
Biophony (i.e. sounds produced by animals)
Anthropophony (i.e. sounds associated with human activity)

A soundscape recording with Regions Of Interest (ROIs) (source: https://scikit-maad.github.io)

Objectives

This thesis was carried out at the Muséum National d’Histoire Naturelle (MNHN) in Paris with the EcoAcoustics Research (EAR) team. The objective was to develop a framework for better understanding and visualizing highly dynamic and complex sound scenes in tropical environments, in order to tackle the following problems:

National biodiversity inventories not carried out yet (e.g. in developing countries)
Rare species are targeted (e.g. nocturnal and crepuscular bird species)
Absence of experts
Large amount of unlabeled data (i.e. many hours of recordings)

Therefore, the main question is: how can the lack of large labeled datasets be overcome in challenging acoustic environments?

Meta-Embedded Clustering (MEC)

For this purpose, the Meta-Embedded Clustering (MEC) method was proposed to help discover and gradually improve the inherent structure of unlabeled data. This method is mainly based on Meta-Learning algorithms, and more specifically on Unsupervised Meta-Learning (UML) techniques [2], which improve few-shot image classification by learning features into clustering space.

Unsupervised Meta-Learning (UML). From “Unsupervised few-shot image classification by learning features into clustering space”, by S. Li et al., Conference, Tel Aviv, Israel, October 23, 2022, Springer.

The MEC method proceeds in five successive steps: (i) the data are passed through the initialized model, (ii) initial estimates of the non-linear mappings are computed to avoid the curse of dimensionality, (iii) a clustering algorithm (HDBSCAN) is run on the latent space, (iv) a pseudo-labeled dataset is built from the clustering algorithm’s predictions, and (v) the model is fine-tuned on the pseudo-labeled dataset for n episodic tasks.

Meta Embedded Clustering (MEC)
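
Steps (iv) and (v) can be illustrated with a minimal sketch, which is not the thesis code. It assumes that the clustering step returns one integer label per sample and marks noise points with -1 (the convention used by HDBSCAN), and that each episodic task is an N-way K-shot task with a support and a query set. The function names, the `roi_XX` identifiers and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

NOISE_LABEL = -1  # HDBSCAN assigns -1 to points it leaves unclustered


def build_pseudo_labeled_dataset(sample_ids, cluster_labels):
    """Step (iv): group sample ids by cluster label, dropping noise points."""
    by_cluster = defaultdict(list)
    for sample_id, label in zip(sample_ids, cluster_labels):
        if label != NOISE_LABEL:
            by_cluster[label].append(sample_id)
    return dict(by_cluster)


def sample_episode(by_cluster, n_way, k_shot, n_query, rng):
    """Step (v): draw one N-way K-shot task (support and query sets)."""
    needed = k_shot + n_query
    # Only clusters with enough members can provide both support and query samples.
    eligible = sorted(c for c, ids in by_cluster.items() if len(ids) >= needed)
    if len(eligible) < n_way:
        raise ValueError("not enough sufficiently large clusters for this episode")
    support, query = [], []
    # Pseudo-classes are re-indexed 0..n_way-1 inside each episode.
    for episode_label, cluster in enumerate(rng.sample(eligible, n_way)):
        picked = rng.sample(by_cluster[cluster], needed)
        support += [(s, episode_label) for s in picked[:k_shot]]
        query += [(s, episode_label) for s in picked[k_shot:]]
    return support, query


# Toy usage: 12 hypothetical ROI ids with made-up clustering output.
roi_ids = [f"roi_{i:02d}" for i in range(12)]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, -1]
pseudo = build_pseudo_labeled_dataset(roi_ids, labels)
support_set, query_set = sample_episode(
    pseudo, n_way=2, k_shot=2, n_query=1, rng=random.Random(0)
)
```

Dropping the noise points before sampling is a design choice of this sketch: it keeps samples that the clustering could not assign out of the pseudo-classes, at the cost of fine-tuning on fewer recordings.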

Results & Discussion

Three research questions were asked in this thesis:

Q1: How well does episodic training improve the performance of a Meta-Learning algorithm compared to classical training?
Q2: To what extent can Meta-Learning algorithms fine-tuned on pseudo-labeled data classify classes that were not used during training?
Q3: To what extent can Meta-embeddings improve the clustering quality of unlabeled data?

In this blog post, particular attention is paid to research question Q3, because it is at the core of this thesis and is closely related to the MEC method. When clustering is performed on the latent space, the results in the following table show that Meta-embeddings improve clustering quality compared to baseline embeddings (ResNet18): accuracy, NMI and ARI all increase, although DBCV decreases.

Embedding	Number of clusters	Accuracy	NMI	ARI	DBCV
Baseline	4	30.58%	0.1201	0.0212	-0.0949
Meta	13	67.48%	0.8142	0.5813	-0.2029
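
For readers unfamiliar with the NMI and ARI columns, here is a small pure-Python sketch of both metrics. It is an illustration, not the evaluation code of the thesis: the post does not say which implementation was used, and normalizing the mutual information by the arithmetic mean of the two entropies is an assumption (it is the default in scikit-learn's `normalized_mutual_info_score`). The toy label lists are made up.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index: pair-counting agreement corrected for chance."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_info(labels_true, labels_pred):
    """Mutual information divided by the mean of the two entropies."""
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    mi = 0.0
    for (t, p), c in Counter(zip(labels_true, labels_pred)).items():
        mi += (c / n) * log(c * n / (count_true[t] * count_pred[p]))
    mean_entropy = (entropy(labels_true) + entropy(labels_pred)) / 2
    return mi / mean_entropy if mean_entropy > 0 else 1.0


# Toy example: three true classes, but the clustering merges two of them.
truth = [0, 0, 1, 2]
clusters = [0, 0, 1, 1]
ari = adjusted_rand_index(truth, clusters)      # 4/7, about 0.571
nmi = normalized_mutual_info(truth, clusters)   # 0.8
```

Both scores depend only on how samples are grouped, not on the cluster ids themselves, which is why they can be compared across clusterings with different numbers of clusters such as the 4 and 13 clusters in the table above.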

The goal in this thesis was to recover the ground-truth number of 21 clusters. The results in the following table show that performing iterative clustering on the Meta-embeddings can help refine the clusters and further improve the clustering quality (from 17 to 19 clusters, and from 69.10% to 76.60% accuracy).

Iteration	Number of clusters	Accuracy	NMI	ARI	DBCV
0	17	69.10%	0.8460	0.5650	-0.3547
16	19	76.60%	0.8681	0.6842	-0.0920
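
The control flow of such an iterative procedure can be sketched as follows. This is only a skeleton under stated assumptions, not the thesis implementation: the embedding, clustering and fine-tuning steps are passed in as callables, and the three `toy_*` functions are hypothetical stand-ins that exist only to make the example runnable. In the real method they would correspond to the model's forward pass, HDBSCAN and episodic fine-tuning.

```python
NOISE_LABEL = -1  # label used for unclustered points, as in HDBSCAN


def iterative_meta_clustering(samples, embed, cluster, fine_tune, n_iterations):
    """Repeat embed -> cluster -> pseudo-label -> fine-tune for n_iterations."""
    history = []
    labels = []
    for iteration in range(n_iterations):
        embeddings = embed(samples)        # project samples into the latent space
        labels = cluster(embeddings)       # one cluster label per sample
        n_clusters = len(set(labels) - {NOISE_LABEL})
        history.append((iteration, n_clusters))
        # Pseudo-labeled dataset: keep only samples assigned to a cluster.
        pseudo = [(s, l) for s, l in zip(samples, labels) if l != NOISE_LABEL]
        embed = fine_tune(embed, pseudo)   # returns the updated embedding function
    return labels, history


# Hypothetical stand-ins, for illustration only.
def toy_embed(samples):
    return [float(x) for x in samples]


def toy_cluster(embeddings):
    # Made-up rule: values above 100 are noise, others are grouped by integer part.
    return [NOISE_LABEL if e > 100 else int(e) for e in embeddings]


def toy_fine_tune(embed_fn, pseudo_labeled):
    return embed_fn  # a real implementation would update the model's weights


final_labels, cluster_history = iterative_meta_clustering(
    [0.1, 0.4, 1.2, 1.9, 250.0], toy_embed, toy_cluster, toy_fine_tune, n_iterations=2
)
```

Recording the number of clusters at each iteration makes it possible to track how the clustering evolves between iterations, as in the table above.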

Conclusion

The overall objective of this thesis was to facilitate the work of ecoacousticians in managing acoustic data and identifying potential new taxa, by discovering and gradually improving the inherent structure of unlabeled data. Building on unsupervised clustering-based methods, the Meta Embedded Clustering (MEC) method progressively improved the inherent structure of unlabeled data. It further improved the accuracy of the data clustering (from 69.10% to 76.60%) and thereby helped to determine a number of clusters closer to the expected number. In conclusion, unsupervised Meta-embedding proved to be an effective solution for improving the clustering of bird songs in soundscape recordings. Such methods can therefore support novel research in developing countries by facilitating the identification of species as well as the detection of potential new rare bird species.

References

[1] Sueur et al. (2015). Ecoacoustics: the ecological investigation and interpretation of environmental sound. Biosemiotics, Springer. https://link.springer.com/article/10.1007/s12304-015-9248-x

[2] S. Li et al. (2022). Unsupervised few-shot image classification by learning features into clustering space. Conference, Tel Aviv, Israel, October 23, Springer. https://link.springer.com/chapter/10.1007/978-3-031-19821-2_24