CosyAudio: Improving Audio Generation with Confidence Scores and Synthetic Captions

0. Contents

Abstract
Automated Audio Captioning
Audio Generation in AudioCaps
Audio Generation in MACS
Seamless Audio Generation from Any Text Input

1. Abstract

Text-to-Audio (TTA) generation is an emerging area within AI-generated content (AIGC), where audio is created from natural language descriptions. Despite growing interest, developing robust TTA models remains challenging because well-labeled datasets are scarce and large-scale, weakly labeled corpora contain many noisy or inaccurate captions. To address these challenges, we propose CosyAudio, a novel framework that uses confidence scores and synthetic captions to improve the quality of audio generation. CosyAudio consists of two core components: AudioCapTeller and an audio generator. AudioCapTeller generates synthetic captions for audio and provides confidence scores to evaluate their accuracy. The audio generator uses these synthetic captions and confidence scores to enable quality-aware audio generation. Additionally, we introduce a self-evolving training strategy that iteratively optimizes CosyAudio across both well-labeled and weakly labeled datasets. AudioCapTeller is first trained on well-labeled data; it then applies its assessment capabilities to weakly labeled datasets for high-quality filtering and reinforcement learning, which further improves its performance. The well-trained AudioCapTeller then refines the corpora by generating new captions and confidence scores, which are used to train the audio generator. Extensive experiments on open-source datasets demonstrate that CosyAudio outperforms existing models in automated audio captioning, generates more faithful audio, and exhibits strong generalization across diverse scenarios.

Figure 1: Overview of the proposed CosyAudio
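
The corpus refinement step described in the abstract can be illustrated with a minimal sketch. This is not the CosyAudio implementation: the class names, the `refine_corpus` function, the assumption that confidence scores lie in [0, 1], and the threshold value are all hypothetical and chosen only for illustration. The sketch keeps weakly labeled clips whose synthetic caption scores at or above a threshold and carries the score along so that a downstream generator could condition on it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredClip:
    """A weakly labeled clip after captioning (hypothetical structure)."""
    clip_id: str
    synthetic_caption: str
    confidence: float  # assumed to lie in [0, 1]; higher means more accurate


@dataclass
class GeneratorExample:
    """One training example for an audio generator (hypothetical structure)."""
    clip_id: str
    caption: str
    confidence: float  # kept so the generator can be quality-aware


def refine_corpus(clips: List[ScoredClip], min_confidence: float = 0.5) -> List[GeneratorExample]:
    """Keep clips whose synthetic caption meets the confidence threshold.

    The threshold value is an illustrative assumption, not a value from the paper.
    """
    refined = []
    for clip in clips:
        if clip.confidence >= min_confidence:
            refined.append(
                GeneratorExample(clip.clip_id, clip.synthetic_caption, clip.confidence)
            )
    return refined


if __name__ == "__main__":
    corpus = [
        ScoredClip("a", "an aircraft is taking off.", 0.9),
        ScoredClip("b", "some unclear background noise.", 0.2),
    ]
    for example in refine_corpus(corpus):
        print(example.clip_id, example.caption, example.confidence)
```

Passing the confidence through, rather than discarding it after filtering, mirrors the abstract's statement that the audio generator uses both synthetic captions and confidence scores; how the score is actually injected into the generator is not specified here.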

2. Automated Audio Captioning

We showcase the effectiveness of caption refinement by comparing original and synthetic captions.

Input Audio	Original Caption	Synthetic Caption
	Vehicles pass by on an urban street.	sirens blare as footsteps are heard and a car honks.
	Waste gases are burning	an aircraft is taking off. there is an aircraft engine in the background.
	Sounds are being heard across a band.	some short audio clips that are recorded with a cassette recorder in the office.
	Recorded Stereo within a cheap Moto G cellphone.	ambience of the dawn chorus of the great tit.
	this sound is a remix of freesounds created by andrew duke, freed, freed.	an eerie bell like sound which can be used for a horror game or for an alien game.
	Paisaje sonoro del Campus de la Universidad Europea de Madrid.	me making a little whistling noise on a cheap mic.

3. Audio Generation in AudioCaps

We generate audio from captions in AudioCaps, a homologous test set, and compare CosyAudio with the ground truth and four baseline models.

Ground Truth	AudioLDM 2	Make-An-Audio 2	TANGO	TANGO 2	CosyAudio
Input Caption: A large explosion and a heartbeat, a person speaks.

Input Caption: Firecrackers popping as a crowd of people cheer and whistle.

Input Caption: An engine rumbles loudly, then an air horn honk three times.

Input Caption: Thunder and a gentle rain.

Input Caption: Ocean waves crashing and water splashing as wind blows into a microphone followed by a man talking.

Input Caption: A car is passing by with leaves rustling.

4. Audio Generation in MACS

We generate audio from captions in MACS, a non-homologous test set, and compare CosyAudio with the ground truth and four baseline models.

Ground Truth	AudioLDM 2	Make-An-Audio 2	TANGO	TANGO 2	CosyAudio
Input Caption: the sound of approaching footsteps adults talking.

Input Caption: adults talking and baby crying.

Input Caption: adults speaking in english their footsteps are heard and a police siren in the background.

Input Caption: water falling into a puddle.

Input Caption: a moped passes by pretty near noises in the background.

Input Caption: adults talking while a church bell is rung far away.

5. Seamless Audio Generation from Any Text Input

We perform zero-shot audio generation from captions produced by ChatGPT.

Input Caption	CosyAudio
A gentle stream flowing through a forest, with birds chirping in the background.
Crowded subway station with trains arriving and people hurrying by.
A cozy fireplace crackling in a quiet room, with the sound of a gentle breeze outside.
A soccer match with the crowd cheering and players shouting instructions.
A peaceful night in the countryside with crickets chirping and an owl hooting.
A carnival with laughter, music, and the sounds of rides in the background.