BALSa: Bootstrapping Audio-Language alignment via Synthetic datA generation from backbone LLMs

Chun-Yi Kuan, Hung-yi Lee

National Taiwan University

📜 Read the Paper | 🌐 Visit the Website | 🐙 Explore GitHub |

📣 📈 News

  • Our preprint "From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data" is now available on Arxiv.
    This paper details the model development process and provides complete benchmark results from our evaluation.
    📜 Read the Paper
  • Our paper, "Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples", has been accepted to Interspeech 2025 🇳🇱.
    This paper proposes LISTEN (Learning to Identify Sounds Through Extended Negative Samples), a contrastive-like training method that helps ALLMs distinguish between present and absent sounds, improving their reliability and reducing hallucinations. 📜 Read the Paper

🌟 🔥 Essential Features

  • BALSa: Bootstrapping Audio-Language Alignment.
    A data construction pipeline that uses a backbone LLM to generate synthetic data for improving audio-language alignment.
  • LISTEN: Audio-Aware Recognition.
    A training method that helps models identify both present and absent sounds, reducing audio hallucinations and improving reliability.
  • BALSa-MA: Multi-Audio Processing.
    Extends BALSA to handle multiple audio samples, enabling comparative analysis and joint captioning for better audio understanding.
  • Efficient Training & Data Usage.
    Reaches strong performance with only 12% of the training data used by Qwen2-Audio, by training a lightweight adapter with frozen LLM and audio encoder. All experiments ran on only two NVIDIA RTX A6000 GPUs.
  • Comprehensive Evaluation.
    Assesses models on audio QA, reasoning, and safety using existing, reproduced, and newly introduced benchmarks.

Audio Reasoning Ability 🎧 🤔

Our model develops audio reasoning capabilities without requiring explicit audio reasoning datasets. We do not train it using question-answering pairs or explicitly teach it to reason about audio. Instead, this ability naturally emerges from the underlying text-based backbone LLM, LLaMA-3.1-8B-Instruct.

This section tests the model’s ability to reason about audio. Some questions require chain-of-thought reasoning, comparing different answer choices, and following specific answer formats. This also evaluates the model’s instruction-following skills.

Back to Table of Contents

Example 1

Sound Source Identification amd Attribution (Audio Source: Link)


Instruction Audio Input
Who is known for playing the type of instrument heard in the background?
A) Charlie Parker
B) Ludwig van Beethoven
C) Jimi Hendrix
D) Freddie Mercury
Think step by step and provide a detailed analysis of each choice. Finally, write the most probable answer.
Our Model Output Baseline Models' Output
BALSa
To determine the famous individual associated with the instrument heard in the background, let's analyze each option step by step:
A) Charlie Parker:
- Charlie Parker was known for playing the saxophone.
- He was a famous jazz musician.
- He was one of the most influential saxophonists in jazz history.

B) Ludwig van Beethoven:
- Ludwig van Beethoven was a classical composer and pianist, not a saxophonist.
- He was known for his works in the classical music genre.
- There is no association between Beethoven and the saxophone.

C) Jimi Hendrix:
- Jimi Hendrix was a famous guitarist.
- While he was known for experimenting with various instruments, there is no significant association with the saxophone.

D) Freddie Mercury:
- Freddie Mercury was a vocalist and keyboardist for the rock band Queen.
- There is no notable connection between Freddie Mercury and the saxophone.

Given the analysis, the most probable answer is A) Charlie Parker, as he is the one associated with the saxophone among the options.
Qwen2-Audio-Instruct
The instrument heard in the background is a saxophone. When considering the most famous musicians known for playing the saxophone, the most likely answer would be A) Charlie Parker. Parker is widely regarded as one of the most influential musicians in jazz history and is known for his saxophone playing技巧. Therefore, it is most probable that the musician in the background is Charlie Parker.

SALMONN
The most probable answer is C) Jimi Hendrix, as he was a prominent player of the electric guitar and used it to create unique sounds and techniques that were not common in classical music. Charlie Parker's work was primarily in the context of jazz music, Beethoven wrote for orchestra and piano, while Freddie Mercury's style could be seen as more influenced by rock and pop musicians.

Therefore, Jimi Hendrix would likely have played the type of instrument heard in the background with his innovative use of the electric guitar. This can be supported by his reputation as one of the greatest guitarists of all time, as well as the way in which his music incorporated elements from multiple genres such as blues, funk, and psychedelic music.

LTU
D) Freddie Mercury - because the type of instrument heard is known for his exceptional singing and piano playing.
Comment: BALSa and Qwen2-Audio-Chat give correct answers, but BALSa provides a more detailed response with background information and a step-by-step analysis, making the reasoning clearer. However, Qwen2-Audio-Instruct mixes Chinese in its output, which may affect readability and consistency. The responses from SALMONN and LTU are incorrect.


Example 2

Acoustic Scene Interpretation through Poetic Matching (Audio Source: MMAU)


Instruction Audio Input
你認為哪一首詩句比較符合這個音訊中的所有發生的聲音事件所構成的場景?請先列點分析,並給出你認為最有可能的答案。
(English Translation: Which poem best matches the scene composed of all the sound events in this audio? Please first analyze the elements in bullet points and then provide your best possible answer.)

-《竹枝詞》 劉禹錫
楊柳青青江水平,
聞郎江上唱歌聲。
東邊日出西邊雨,
道是無晴卻有晴。

-《春江花月夜》張若虛
江流宛轉繞芳甸,
月照花林皆似霰。
空裡流霜不覺飛,
汀上白沙看不見。

- 《宿建德江》孟浩然
移舟泊煙渚,
日暮客愁新。
野曠天低樹,
江清月近人。
Our Model Output Baseline Models' Output
BALSa
對於這個音訊,我們可以聽到以下聲音事件:

1. 雨聲
2. 水聲(可能是河流或湖泊的水聲)
3. 人類聲音(女人或孩子在唱歌)

根據這些聲音事件,我們可以對照給出的詩句進行分析:

* 雨聲和水聲與「東邊日出西邊雨」、「江流宛轉繞芳甸」、「江清月近人」等詩句相符,這些詩句都描述了自然界的雨和水景象。
* 人類聲音與「聞郎江上唱歌聲」、「月照花林皆似霰」這兩首詩中並無相關描述,但「江上唱歌聲」比較符合。

綜合分析後,我們可以認為最有可能的詩句是:《竹枝詞》劉禹錫的「東邊日出西邊雨,道是無晴卻有晴。」
Qwen2-Audio-Instruct
音訊中有雨声、浪花拍打岩石聲、音乐及女性低語的聲音。其中,女性的低語和雨声創造了一種静謐而略帶忧郁的氛圍,與孟浩然的《宿建德江》中「野旷天低树,江清月近人。」的景象相吻合,因此這句詩最符合此場景。

SALMONN
我認為最有可能符合這個音訊中所有的聲音事件所構成的場景,是《春江花月夜》張若虛所作。

其他的歌曲與此比較,都缺乏了具體地表達出水情和江景,如《竹枝詞》劉禹錫只提到「江上」等字眼,未明示任何寓意;《宿建德江》孟浩然在歌曲中更加用大量的形容詞去略而不及地表達出江景之美,完全不符合音訊中所呈現的特殊場景。而《春江花月夜》张若虚则直接用“江流”、“月照”等細致的语言描写出了江景和月光景色,这与音频场景相符合。

LTU
These pieces of poetry and music are both inspired by the sounds of nature, including the river, flowers, and the wind. The music can be heard as the man sings, while the poetry provides a deeper and more meaningful reflection on nature and human emotions.
Comment: This question tests the ability to understand audio information and match the scene created by the sound events to the most suitable poem. It also evaluates the model's multilingual ability, specifically in Chinese. This is an advanced audio reasoning and cross-lingual task.

The first poem covers almost all the sound events, especially the "river sound," "singing," and "rain."
The second poem mentions water sounds but lacks descriptions of human voices, and the rain sound doesn't match well.
The third poem has a quiet atmosphere and no mention of human voices, making it less suitable for the audio scene.

Therefore, the first poem, Bamboo Branch Song (《竹枝詞》), is the best choice. Only BALSa provided the correct reasoning and answer, demonstrating its strong audio reasoning and multilingual abilities.

BALSa: ✅ Best response
- Clearly identifies the key sound events in the audio.
- Analyzes each poem based on the presence of rain, water sounds, and human singing.
- Logically selects 《竹枝詞》 as the most fitting due to its reference to singing and rain.

Qwen2-Audio-Instruct: ✅ Reasonable response
- Identifies key sound elements but emphasizes a quiet and melancholic atmosphere.
- Selects 《宿建德江》, which matches the tranquil setting but lacks a direct reference to singing or rain.

Although our model wasn’t trained on non-English data, it can still understand multiple languages. This ability comes from the backbone text LLM.


Example 3

Multi-Hop Reasoning Questions (Audio Source: SAKURA)


Instruction Audio Input
Single-Hop Question
Out of the animals listed below, can you identify the one that most closely matches the sound in the provided audio recording based on its auditory characteristics? (a) hen (b) frog (c) dog (d) rooster

Multi-Hop Question
Considering the characteristics of the animal in the audio, which of the listed methods best aligns with how it reproduces? (a) Oviparous (b) Marsupial reproduction (c) Asexual reproduction (d) Viviparous
Our Model Output Baseline Models' Output
BALSa
Single-Hop Question Response
Based on the provided audio, the sound that most closely matches the sound in the recording is (c) dog.

Multi-Hop Question Response
Based on the provided audio, I detect the sounds of a "bark" and a "howl".

The animal that makes these sounds is a dog.

Dogs are viviparous, meaning they give birth to live young.

The best answer is: (d) Viviparous.
Qwen2-Audio-Instruct
Single-Hop Question Response
The sound in the provided audio recording most closely matches that of a dog.

Multi-Hop Question Response
The animal in the audio reproduces through a process that involves eggs. Therefore, the best answer is (a) Oviparous.

SALMONN
Single-Hop Question Response
c) dog

Multi-Hop Question Response
d) Viviparous


LTU
Single-Hop Question Response
It is not possible to identify the animal that most closely matches the sound in the provided audio recording based on its auditory characteristics.

Multi-Hop Question Response
The animal in the audio is a dog, which is a viviparous animal, meaning it reproduces by giving birth to live young. Therefore, the method of reproduction that best aligns with the characteristics of the animal in the audio is (d) viviparous.
Comment: BALSa and Qwen2-Audio-Chat can correctly answer single-hop questions, showing a proper understanding of sound identification. However, for multi-hop questions about an animal's reproduction method (e.g., oviparous, viviparous, marsupial, asexual reproduction), Qwen2-Audio-Chat incorrectly infers that a dog is viviparous. SALMONN lacks some reasoning steps and explanations.


Example 4

Bioacoustic Sound Identification and Taxonomic Classification (Audio Source: Link)


Instruction Audio Input
Listen to the given audio and identify the type of animal sound present.

Once identified, classify the animal using the following taxonomy levels:
- **Kingdom**
├── **Phylum**
│    ├── **Class**
│    │    ├── **Order**
│    │    │    ├── **Family**
│    │    │    │    ├── **Genus**

Return the result as a **Tree Diagram** like the above example.
Our Model Output Baseline Models' Output
BALSa
Based on the provided audio, I detect the following animal sound: Oinking.

Here is the classification of the animal based on the taxonomy levels:

```
Kingdom: Animalia
├── Phylum: Chordata
│    ├── Class: Mammalia
│    │    ├── Order: Artiodactyla
│    │    │    ├── Family: Suidae
│    │    │    │    ├── Genus: Sus
```
Qwen2-Audio-Instruct
After analyzing the sound, I identified the animal to be a pig. In the taxonomic classification, the pig belongs to the following levels:

Kingdom: Animalia
Phylum: Chordata
Class: Mammalia
Order: Artiodactyla
Family: Suidae
Genus: Sus

SALMONN
It is difficult to accurately identify the type of animal sound without more context. However, if it were an animal with a distinctive vocalization such as a cow or pig, then its vocalizations may be classified in the animal kingdom at the most general level (e.g., Kingdom). Once classified, further classification can occur at higher levels, including phylum, class, order, family, genus and species, based on observable characteristics and behavior patterns.

LTU


Kingdom: Animalia
Phylum: Chordata
Class: Mammalia
Order: Swine
Family: Sus scrofa
Comment: Qwen2-Audio-Instruct and LTU can classify the animal based on taxonomy levels, but they fail to present the results in a "Tree Diagram" format as required by the question.


Example 5

Intermittent Sound Classification and Categorization (Audio Source: MMAU)


Instruction Audio Input
Listen to the given audio and determine what type of sounds can be heard intermittently.

The choices are:
1. Musical instruments 2. Animal noises 3. Natural phenomena 4. Sound effects

Your response **must follow** this structure:

1️⃣ **Detected Sounds:**
      - Describe the key sounds that are heard intermittently in the audio.

2️⃣ **Analysis:**
      - Compare the detected sounds with each choice.
      - Explain why some options are more likely than others.

3️⃣ **Final Answer:**
      - Select the most probable option from the given choices.
      - Output only the selected choice without additional explanations.
Our Model Output Baseline Models' Output
BALSa
1️⃣ Detected Sounds:
      - A sequence of clinks and knocks, followed by a soft whooshing sound, and a low rumble.

2️⃣ Analysis:
      - Comparing the detected sounds with the options:
            - Option 1 (Musical instruments): The sequence of clinks and knocks could resemble percussion, but there's no clear melody or harmonies to support this. The whooshing sound and low rumble don't fit typical musical instruments.
            - Option 2 (Animal noises): There's no clear indication of animal noises, like growls, chirps, or whistles, in the detected sounds.
            - Option 3 (Natural phenomena): The low rumble could be associated with natural thunder, but the sequence of clinks and knocks and whooshing sound don't directly match typical natural phenomena sounds.
            - Option 4 (Sound effects): The combination of clinks, knocks, a whooshing sound, and a low rumble is more indicative of sound effects often used in movies or video games to create a specific atmosphere or environment.

3️⃣ Final Answer:
      4
Qwen2-Audio-Instruct
1️⃣ Detected Sounds:
      - Intermittent noise, static
      - Tapping
      - Beeping
      - Ringing, telephone bell ringing

2️⃣ Analysis:
      - The first type of sound, intermittent noise and static, is common in environments with poor signal reception or equipment malfunctions.
      - The second sound, tapping, could be due to someone using a tool or device, possibly repairing or assembling something.
      - The third sound, beeping, can be associated with various electronic devices or systems, such as a telephone or a computer.
      - The fourth sound, ringing, specifically telephone bell ringing, indicates a communication device being used.

3️⃣ Final Answer: 4️⃣ Telephony

SALMONN
Detected Sounds: Musical instruments, animal noises, natural phenomena, sound effects

LTU
1. Musical instruments: Intermittent sound effects in a jingling, percussive pattern are heard.
Comment: BALSa's response is better because it strictly follows the required format, systematically analyzes all options, and provides a clear final answer.


Example 6

Auditory Scene Classification and Interpretation (Audio Source: MMAU)


Instruction Audio Input
Listen to the given audio and determine what auditory experience it might suggest.
The choices are:
1. Listening to a podcast
2. Attending a public speech
3. Hearing an artificial song
4. Listening to a radio talk show

Your response **must** follow the format below:

<think>

- Provide a detailed step-by-step analysis of each choice.
- Compare the detected sounds with the typical characteristics of each option.
- Clearly explain why each option is likely or unlikely based on the audio.
- Do not include any text outside this section.

</think>


<answer>

- Write only the most probable answer here.
- Do not include explanations, just the final answer.
- Do not add any text outside this <answer> section.

</answer>

Our Model Output Baseline Models' Output
BALSa

<think>

- The audio features a human voice singing.
  - The human voice is singing and does not have the typical conversational tone of a talk show.
  - This rules out option 2 (attending a public speech) and option 4 (listening to a radio talk show).
  - The singing is not repetitive, but it does not follow the typical musical structure of a song, so option 3 (hearing an artificial song) is also unlikely.
- The human voice is singing in a language that sounds like a local dialect or a folk song, and it is likely in a traditional or cultural setting.
  - Given the human voice singing and its setting, option 1 (listening to a podcast) is also unlikely as podcasts typically do not feature singing in this context.

</think>


<answer>

3. Hearing an artificial song

</answer>

Qwen2-Audio-Instruct

<think>

1. Listening to a podcast: In this scenario, the sound could be a podcast episode, but we do not have enough context to confirm that.
2. Attending a public speech: This could be a possible option if the speaker is using a loud, monotone voice and clapping is a common reaction to their speech. However, we do not have information about a specific speech or the location.
3. Hearing an artificial song: The combination of vocal singing and synthetic sounds resembles a song that might be produced by a digital device or a music creation software. However, we do not have any additional context to confirm this.
4. Listening to a radio talk show: The audio features a female voice speaking, a slow pace, and a storytelling style. This suggests that the audio is from a radio talk show where hosts often share stories or engage in casual discussions.

SALMONN
4. Listening to a radio talk show

LTU
The most likely scenario is that the audio is a podcast, as the speech has a consistent tone and voice, and is likely to be the result of a well-prepared and deliberate speech, as in a podcast. The other options are less likely, as the sound does not suggest a particular activity or event.
Comment: BALSa is the better response because it strictly follows the required structure by reasoning within the think tag and providing the final answer within the answer tag. In contrast, Qwen2-Audio-Instruct does not adhere to this format, as it presents an analysis without explicitly structuring reasoning and answering within the designated tags.


Example 7

Temporal Reasoning (Audio Source: MMAU)


Instruction Audio Input
Listen to the given audio and identify the sound with the longest duration.

Break down your analysis step by step using bullet points.

The choices are: 1. Siren 2. Clicking 3. Mechanisms 4. Bird Chirping

Finally, select the most probable answer.
Our Model Output Baseline Models' Output
BALSa
To identify the sound with the longest duration, I will analyze the given audio step by step:

1. **Identify the distinct sounds present in the audio:**
      - The audio starts with the sound of a siren.
      - It is then followed by the clicking sound.
      - There's an instance of bird chirping, but it's very brief.
      - Mechanisms sounds are heard towards the end of the audio.

2. **Assess the duration of each sound:**
      - The siren sound lasts throughout the audio, covering the longest portion of it.
      - The clicking sound is shorter compared to the siren but longer than the bird chirping.
      - The bird chirping is the shortest sound in the audio.
      - Mechanisms sounds are longer than bird chirping but shorter than the siren.

3. **Select the sound with the longest duration:**
      Based on the analysis, the sound with the longest duration is the siren.

Therefore, the most probable answer is: **1. Siren**
Qwen2-Audio-Instruct
- The sound with the longest duration is 'Siren'
- This is evident from the time frame of 4.20 seconds to 10.00 seconds, where there are continuous sounds of a siren, unlike the other options which have shorter durations.

SALMONN
The most probable answer is siren.

LTU
The answer is: Siren
Comment:
BALSa: ✅ Best response
- Clearly follows step-by-step analysis as required.
- Identifies all sounds, compares durations, and selects the correct answer.
- Provides a well-structured and logical explanation.

Qwen2-Audio-Instruct: ❌ Incomplete analysis
- Gives the correct answer but lacks a full breakdown of all sound options.
- Mentions timestamps, but the reasoning is not as detailed as BALSa.

SALMONN: ❌ Too brief
- Only provides the answer without any reasoning.

LTU: ❌ Too brief
- Just states the answer without any explanation or breakdown.


Example 8

Temporal Reasoning (Audio Source: MMAU + TTS)


Instruction Audio Input
Listen to the given audio and identify the sound with the shortest duration.

Break down your analysis step by step using bullet points.

The choices are: 1. Train 2. Human voice 3. Wind 4. Dog barking

Finally, select the most probable answer.
Our Model Output Baseline Models' Output
BALSa
To identify the sound with the shortest duration, I will break down the analysis step by step using bullet points:

* I detect the sound of a train approaching and passing in the background.
* The train sound is loud and continuous for about 2-3 seconds.
* A human voice is heard talking briefly, but it's not loud and seems to be speaking for about 1-2 seconds.
* I do not detect any wind sounds.
* I also do not detect any dog barking sounds.

Based on the analysis, the human voice is the sound with the shortest duration.
Qwen2-Audio-Instruct
- The sound with the shortest duration is 'Human voice', lasting 0.37 seconds from 9.63 to 10.00 seconds.
- The other choices have longer durations:
      * Train: 10.00 seconds from 0.00 to 10.00 seconds.
      * Wind: 10.00 seconds from 0.00 to 10.00 seconds.
      * Dog barking: 0.63 seconds from 8.02 to 8.65 seconds.

SALMONN
The most probable answer is 1. Train.

LTU
Track: Train: Short and powerful
Comment:
BALSa: ✅ Best response
- Follows the step-by-step breakdown as required.
- Clearly identifies and compares the durations of all detected sounds.
- Explains why the selected answer has the shortest duration.

Qwen2-Audio-Instruct: ❌ Partially correct but lacks structure
- Provides precise timestamps but does not break down the analysis step by step as requested.
- Mentions durations but does not analyze why each sound is longer or shorter.

SALMONN: ❌ Incorrect answer
- Selects "Train," which is unlikely to be the shortest duration sound.
- No reasoning or breakdown is provided.

LTU: ❌ Unclear and incomplete
- Mentions "Train" but does not clearly answer the question.
- Lacks proper analysis and structure.


Example 9

Contextual Source Identification (Audio Source: MMAU)


Instruction Audio Input
Listen to the given audio and determine the most likely source of the music. The possible options are: radio, fire truck, construction site, and airplane. Think step by step before selecting the best answer. Present your response in a Markdown table format.
Our Model Output Baseline Models' Output
BALSa
Based on the provided audio, here's the determination of the most likely source of the music:


Based on the analysis, the music is most likely from a radio. The presence of a variety of music genres and a loud volume, along with static or talking sounds in the background, supports this conclusion.
Qwen2-Audio-Instruct

SALMONN
Here is a table with the possible sources and steps to reach each source, along with the most likely answer:

LTU
Labels: Fire truck siren; Music; Reverberant and spacious
Comment:
BALSa provides the best response due to its clear reasoning and structured presentation. Key strengths include:
- Step-by-step analysis: It systematically evaluates multiple criteria (volume, instrumentation, background noise, etc.).
- Well-justified conclusion: The explanation clearly supports "radio" as the most likely source.
- Readable Markdown table: The structured format makes it easy to compare different sources.

In contrast:
- Qwen2-Audio-Instruct lists sources but lacks clear justification.
- SALMONN provides a reasoning process but is too general.
- LTU only gives keywords without explanation.


Example 10

Bioacoustic Sound Source Identification (Audio Source: MMAU)


Instruction Audio Input
Based on the given audio, identify the source of the roars. The choices are: lion, elephant, bear, and wolf. Analyze each option step by step, comparing the detected sounds with known characteristics of each animal. Explain your reasoning in detail after 'Reason:' and write your final answer after 'Answer:'.
Our Model Output Baseline Models' Output
BALSa
The detected sound events in the given audio are: a roars.

Reason:
- The sound is a low, loud roar.
- It does not sound like a sharp, high-pitched howl associated with a wolf.
- It is louder and deeper than a typical bear roar, which usually sounds more growling and snarling.
- It does not sound like the deep, low rumble of an elephant, but rather a more sharp, intense roar.

Answer: Lion.
Qwen2-Audio-Instruct
Reason: The sound in the audio is roaring, which is a common trait among large carnivores.狮子 (lion) and老虎 (tiger) are known for their powerful roars. Among these, the sound seems to be more aggressive and similar to that of a tiger or a big cat.
Answer: Tiger

SALMONN
Roars in the audio clip could potentially come from a lion, an elephant, or a bear. While wolves are known for their howling, it's possible that they were roaring as well. Let's analyze each option step by step and compare the detected sounds with known characteristics of each animal.

Lion: Lions are known to have powerful roars, which can be heard across long distances due to their massive chest size. They also tend to roar frequently, particularly during territorial disputes or when establishing dominance over other males. In this case, the lion would likely produce a distinctive "king of the jungle" sound. (Reason: Lions have large bodies and vocal cords which amplify the power of their roars.)

LTU
Step 1: Lion
The audio consists of a deep, loud and rumbling roar. The frequency of the roar is low to mid-range, with a dominant range between 100-500 Hz. The duration is short, lasting only a few seconds.

Step 2: Elephant
The audio consists of a deep, low-frequency roar. The frequency of the roar is low, with a dominant range between 100-500 Hz. The duration is long, lasting for several seconds.

Step 3: Bear
The audio consists of a low and rumbling roar. The frequency of the roar is low, with a dominant range between 100-500 Hz. The duration is long, lasting for several seconds.

Step 4: Wolf
The audio consists of a high-pitched howl. The frequency of the howl is high, with a dominant range between 1000-3000 Hz. The duration is short, lasting only a few seconds.

Step 5: The audio is not distinct enough to determine which of these animals the roars belong to. The different roars might be coming from different animals in the same vicinity.
Comment: BALSa provides the best response—concise, structured, and correctly follows the instructions.


Example 11

Contextual Sound Source Attribution (Audio Source: MMAU)


Instruction Audio Input
Identify a famous location associated with the instrument heard in the background. The choices are: Central Park, New York; Notre-Dame Cathedral; Sydney Opera House; Grand Canyon. Analyze each option step by step, considering whether the sound of a chainsaw is commonly associated with the location. Explain your reasoning in detail.
Our Model Output Baseline Models' Output
BALSa
Based on the audio, I detect the sound of a chainsaw.

Considering the locations given, I will analyze each option:

1. **Central Park, New York**: Chainsaws are not typically used in Central Park for its maintenance. The park is known for its gardens, walking paths, and recreational activities. Chainsaw sounds are not commonly associated with this location.

2. **Notre-Dame Cathedral**: Chainsaws are occasionally used in historic locations like Notre-Dame Cathedral for restoration and maintenance. However, they are not typically used on a daily basis or during normal operations, which makes it less likely that chainsaw sounds would be a common feature.

3. **Sydney Opera House**: Chainsaws can be associated with the maintenance of outdoor spaces, gardens, and structural elements around the Sydney Opera House. Given the vast open spaces and large structures surrounding the opera house, it is plausible that chainsaws could be used in the area for various tasks.

4. **Grand Canyon**: Chainsaws might be used by hikers, park rangers, or maintenance personnel in the Grand Canyon for camping, clearing trails, or other activities. This location has many wooded areas, making chainsaw sounds a possible feature.

Considering these factors, based on the given audio of a chainsaw, I would associate the sound with **Grand Canyon**.
Qwen2-Audio-Instruct
The famous location associated with the sound of a chainsaw in the background is the Grand Canyon. The Grand Canyon is a natural wonder known worldwide and is frequently visited by tourists. It is known for its massive size and unique geological formations, which make it a popular destination for adventure activities such as hiking, rafting, and off-road vehicle tours. The sound of a chainsaw in the background could suggest activities such as woodcutting, construction, or maintenance around the Grand Canyon, which are common in the area.

SALMONN
The famous location associated with the chainsaw sound is likely a construction site or farm. This is because chainsaws are commonly used for cutting wood in these settings, making them a noisy and recognizable presence. While Central Park, Notre-Dame Cathedral, Sydney Opera House, and Grand Canyon all offer unique sights and experiences, it would be unlikely to hear the sound of a chainsaw in any of these locations.

LTU
Step 1: Sound identification - Instrument (chainsaw): The loud and sharp chainsaw sound suggests it could be associated with the construction of buildings, roads or other infrastructure projects in urban or suburban environments.
Step 2: Location sound association: If it is true, the sound of a chainsaw in this audio clip could be associated with construction activities happening in an environment like Central Park in New York, Sydney Opera House, or even the Grand Canyon.
Step 3: Feedback: No additional information is needed in this audio clip. The association between the sound of a chainsaw and various places in different contexts should be considered.

### Explanation:
Step 1: The sound of a chainsaw is a common feature in urban or suburban environments where construction activities are taking place. This could be associated with many different locations such as Central Park in New York, Sydney Opera House, or even the Grand Canyon.
Step 2: The sound of a chainsaw can be associated with various locations, such as urban or suburban environments, where construction activities are taking place. However, the specific location cannot be determined from the audio clip.
Step 3: Feedback: No additional information is needed in this audio clip. The association between the sound of a chainsaw and various locations in different contexts should be considered. No further information is needed.
Comment: BALSa: ✅ Best response
- Clearly follows the required step-by-step analysis.
- Evaluates each location logically with respect to chainsaw usage.
- Correctly concludes that the Grand Canyon is the most probable answer.

Qwen2-Audio-Instruct: ❌ Partially correct but lacks structured analysis
- Provides a reasonable answer (Grand Canyon).
- Does not analyze all options step by step as required.

SALMONN: ❌ Off-topic and incorrect
- Discusses construction sites and farms, which are not among the given choices.
- Fails to make a clear selection from the provided options.

LTU: ❌ Unclear and inconclusive
- Overcomplicates the analysis with repetitive explanations.
- Does not explicitly select the best answer from the given choices.


Example 12

Historical Sound Attribution and Contextual Reasoning (Audio Source: MMAU)


Instruction Audio Input
Listen to the given audio and determine which famous individual is most associated with the device heard in the background, given that an infant is crying.

The choices are: 1. Thomas Edison 2. Nikola Tesla 3. Alexander Graham Bell 4. Guglielmo Marconi

First, analyze the detected sounds and describe their characteristics. Then, compare the possible device producing the sound with each option, considering their historical contributions and associations. Finally, select the most probable answer.
Our Model Output Baseline Models' Output
BALSa
Based on the provided audio, I detect the following sounds:

1. A low, persistent humming noise.
2. An infant crying.

The humming noise is characteristic of a device that is continuously operating and producing a steady tone.

Comparing the possible device producing the sound with each option:

1. Thomas Edison: Not associated with a device that hums.
2. Nikola Tesla: Known for experimenting with alternating current (AC), but I don't have information that associates him directly with a humming device.
3. Alexander Graham Bell: Associated with the first telephone, which was an early version that had a motor or an electromagnet that could produce a humming sound.
4. Guglielmo Marconi: Not associated with a device that hums.

Considering these associations, the most probable answer is:

3. Alexander Graham Bell
Qwen2-Audio-Instruct
The sounds in the background consist of ringing bells, sounds reminiscent of mechanisms, and a crying infant. Analyzing these sounds, it is likely that the device producing such noises is a telephone, as the ringing bells and mechanisms sounds resemble traditional telephone components.

Now let's examine each historical figure in relation to the telephone:

1. Thomas Edison: While Edison is known for inventing the practical incandescent light bulb and developing an early form of the electric refrigerator, he is not typically associated with the telephone.

2. Nikola Tesla: Tesla is known for his work on alternating current (AC) electrical systems and his contributions to wireless communication. However, he did not invent the telephone.

3. Alexander Graham Bell: Bell is credited with inventing the practical telephone in 1876, which uses an acoustic diaphragm to convert sound waves into electrical signals and vice versa.

4. Guglielmo Marconi: Marconi is known for his contributions to wireless communication, particularly in the development of radio systems. While he did not invent the telephone, he did help to improve its technology.

Based on this analysis, the most likely answer is Alexander Graham Bell, as he is the only historical figure whose work is directly associated with the invention and development of the telephone.

SALMONN
The infant is crying, but the sound of a device can still be heard in the background. While there are many potential devices that could produce such a sound, including telephones and alarm clocks, one technology comes to mind - radio. Thomas Edison invented the first practical electrical generator, Nikola Tesla conducted groundbreaking research on alternating current and wireless communication, Alexander Graham Bell developed the telephone, and Guglielmo Marconi patented wireless telegraphy. So which invention do you think this crying baby might be making? The answer is clear: a telephone or other electromagnetic device that uses electromagnetism to transmit signals through the airwaves. In fact, it's interesting to note that while each of these technologies revolutionized communication and society as we know it, the first practical telephone was actually invented by Alexander Graham Bell and his mentor, Thomas Watson, not Thomas Edison himself.

LTU
Sound description: A device producing a bubbling and gurgling noise. The sound is characterized by being soft, high-pitched, and continuous, resembling a beeping sound.
Comment: BALSa: ✅ Best response
- Follows the required structure.
- Analyzes the detected sounds and compares each historical figure logically.
- Correctly concludes Alexander Graham Bell as the most relevant individual.

Qwen2-Audio-Instruct: ✅ Good response
- Provides a structured comparison.
- Identifies the telephone correctly.
- Slightly longer than necessary but reaches the correct conclusion.

SALMONN: ❌ Unfocused and vague
- Mentions multiple unrelated devices (radio, alarm clock).
- Overcomplicates the reasoning without clearly selecting an answer.

LTU: ❌ Completely off-topic
- Describes sounds incorrectly (bubbling, gurgling, beeping).
- Does not follow the question's structure or provide a clear conclusion.

Back to Table of Contents

Let's spice 🌶️ things up a bit. 🔊📝

By adding ASR text, the model can seamlessly handle textual conversations—on top of its original audio understanding capabilities.
We show some examples of spoken dialogues mixed with audio situations.

Back to Table of Contents

Question Our Model's Response Corresponding Audio
What scene is most likely happening, based on the sounds in the audio? Based on the provided audio, I detect the following sounds:

1. Footsteps
2. A man's voice saying "Move move move we're surrounded"
3. A man's voice saying "fall back"
4. A man's voice saying "I'm out of ammo"
5. Gunfire
6. A man's voice saying "Aim for the gun"

Given these sounds, it is likely that the scene is a battlefield or a combat situation where multiple individuals are fighting and trying to communicate with each other.

Source:
Sound Effects: Pixabay
Spoken Dialogues: Nari Dia

Based on the audio, what happened that made them feel so terrified? Provide a brief reasoning. Based on the audio, I detected the following sound events:

1. A loud crash.
2. A man speaking.

The loud crash may have caused them to feel terrified, as it suggests a sudden and potentially violent or intense event.

Source:
Sound Effects: Pixabay
Spoken Dialogues: Nari Dia

What might have triggered the first speaker's emotional outburst? Analyze the audio step by step and provide a detailed explanation. Based on the provided audio, I can identify the following sound events:

1. A conversation between two individuals.
2. One of the individuals says something that causes the other person to respond with an emotional outburst ("What do you mean by that? I didn't say anything wrong!").
3. The second speaker responds with an emotional tone, but it's unclear what exactly triggered the outburst.
4. A glass shattering sound ("shatter") is heard.

Considering the sequence of events:

- The conversation is calm until the second speaker says something that causes the first speaker to become defensive ("What do you mean by that? I didn't say anything wrong!").
- The first speaker's response indicates that they feel misunderstood or misinterpreted.
- The emotional outburst might have been triggered by a miscommunication or an unintended implication from the second speaker's statement.
- The shattering sound ("shatter") might be a physical response to the emotional outburst or a separate incident.

Source:
Sound Effects: Pixabay
Spoken Dialogues: Nari Dia

What might have triggered the first speaker's emotional outburst? Analyze the audio step by step and provide a detailed explanation. Based on the provided audio, I detected the following sound events:

1. A shatter sound
2. A woman's voice
3. A man's voice
4. A woman's voice (again)
From these sound events, I can only provide a neutral analysis based on the given information.

Reasonable explanation: The argument could have started due to the shatter sound, which might have been caused by an accident or an intentional action. The woman's voice could be expressing shock, frustration, or concern about the incident, while the man's voice could be defensive or dismissive. However, without more information, it's impossible to determine the exact cause or nature of the argument.

To solve this problem, more context or additional information would be necessary to understand the situation and the underlying reasons for the disagreement.

Source:
Sound Effects: Pixabay
Spoken Dialogues: Nari Dia

Back to Table of Contents

Instruction-following Ability 🤖 🎯

Our model can understand and reason about audio while also handling complex instructions, like following specific formats and step-by-step reasoning.
This section showcases real examples of these abilities.

Back to Table of Contents

Instruction Model Output
Listen to the given audio and tell me what you hear.
Your entire response should be in English, capital letters only.
I HEAR:
WOMAN TALKING
MUSIC PLAYING
WATER RAINING
Listen to the given audio and tell me what you hear.
Write your entire response in English, using only lowercase letters.
i hear a woman speaking in a normal tone.
i hear music in the background.
i hear rain falling.
i hear the sound of water splashing, possibly from waves or a wave machine.
Listen to the given audio and tell me what you hear.
Please provide your answer using JSON format.
{
   "audio_events": [
      {
            "event": "adult female voice",
            "description": "person speaking"
      },
      {
            "event": "raindrops falling",
            "description": "water droplets hitting a surface"
      },
      {
            "event": "ocean waves crashing",
            "description": "large body of water moving"
      },
      {
            "event": "music playing",
            "description": "sound of musical composition"
      }
   ]
}
請描述這段聲音,這段聲音中發生了什麼事情? 以中文回答。

Comment: Despite not being trained on any non-English data, our model exhibits multilingual capabilities. This ability is entirely inherited from the backbone text LLM.
這段聲音中發生了以下事情:

* 聲音開始時聽到雨聲,可能是大雨或輕微雨。
* 雨聲後聽到女人在唱歌,有音樂伴奏。
* 在歌曲中,可能聽到一些水流的聲音,可能是海浪或水流。
* 聲音結束時,音樂仍然在播放,可能是歌曲的結尾或續奏部分。
この音声を説明してください。この音声では何が起こっていますか? 日本語で回答してください。 この音声では、女性の声と、雨の音、そして音楽が聞こえています。
你認為哪一首詩句比較關聯這個音訊中的內容?例如發生的聲音事件、氛圍等等。
推理的過程寫在 [推理] 和 [/推理] 之間,推理的過程你可以加入一些思考過程中的自言自語,在你想出答案的時候,以「啊哈」開頭,最終答案寫在 [最終答案] 部分。

-《竹枝詞》 劉禹錫
楊柳青青江水平,
聞郎江上唱歌聲。
東邊日出西邊雨,
道是無晴卻有晴。

-《春江花月夜》張若虛
江流宛轉繞芳甸,
月照花林皆似霰。
空裡流霜不覺飛,
汀上白沙看不見。

- 《宿建德江》孟浩然
移舟泊煙渚,
日暮客愁新。
野曠天低樹,
江清月近人。
[推理] 問題是要找出與音訊相關的詩句。音訊中有許多聲音和場景描述,可以嘗試與每首詩進行比對。首先,我注意到詩句中提到「江上唱歌聲」,這與音訊中有女性唱歌的聲音相關聯。「楊柳青青江水平」也描述了一個與水相關的環境,與音訊中的人聲和水流聲有關聯。啊哈!這可能是答案! [/推理]

[最終答案] 最終答案是《竹枝詞》劉禹錫的詩句。
First turn
Describe the sounds you hear in the audio in detail.

Second turn
In what kind of scenario do you think this audio might appear? Provide some analysis.

Comment: Our model is not trained on multi-turn dialogue data. Its ability to engage in multi-turn interactions arises naturally from the pretrained text LLM.
First turn
Based on the detected sounds, I think this audio might appear in a scene that combines natural ambiance with a narrative element. The presence of a woman's voice and the smooth, gentle quality of the sound suggest a peaceful and possibly calming atmosphere.

The sound of rain might imply a location outside, possibly near a natural setting such as a forest, a park, or a mountainous area. The flowing sound could be indicative of a nearby stream, river, or waterfall.

The combination of these sounds could be present in:

1. A meditation or nature documentary, where the sounds are used to create a soothing background atmosphere.
2. A podcast or an audiobook set in a natural environment, possibly with the narrator guiding the listener through a hike or describing a scene.
3. A film or television show scene, where the setting is a serene natural location and the sound is used to establish the atmosphere.

The specific scenario is difficult to determine without further information, but these possibilities are based on the detected sounds.

Second turn
Based on the detected sounds, the scenario where this audio might appear could be:

A scene of a calm, peaceful atmosphere, possibly outdoors, where a person is walking or standing near a body of water, such as a river or a stream. The gentle rain falling around them creates a soothing ambiance.

The presence of the woman's voice could suggest that she might be narrating or commenting on the scene, providing a calming and serene environment for the listener.

It's also possible that the audio could be part of a nature documentary, a guided meditation, or a calming soundtrack for a visual media piece, using the natural sounds to create a sense of tranquility and relaxation.
Back to Table of Contents

Method 🔬 ⚙️

Stage 1: Data Construction

  • BALSa is a data construction pipeline that uses a backbone LLM to generate synthetic audio-text pairs, improving audio-language alignment. The LLM analyzes metadata (e.g., sound event tags) and generates flexible audio descriptions.
  • LISTEN is a training method under BALSa that helps models recognize both present and absent sounds, reducing audio hallucinations and improving reliability.
  • BALSa-MA extends BALSa to multi-audio scenarios, enabling models to compare audio samples and generate structured descriptions, enhancing audio understanding.

  • Back to Table of Contents




    Prompt Backbone LLM's Response (LLaMA-3.1-8B-Instruct) Corresponding Audio
    Seed Prompt
    A woman talks nearby as water pours

    Generation Prompt
    Repeat the audio.
    *water pouring sounds*
    *woman talking in the background*

    Source: AudioCaps

    Seed Prompt
    A woman talks nearby as water pours

    Generation Prompt
    List some specific sound events that are not actually present in the audio as contrastive examples.
    Based on the provided audio, here are some specific sound events that are not present in the audio:

    1. A car driving by
    2. Birds chirping
    3. A dog barking
    4. A train whistle blowing
    5. A piano playing a melody
    6. A person speaking in a different language
    7. A loud crash or explosion
    8. A baby crying
    9. A lawnmower cutting grass
    10. A plane flying overhead

    Source: AudioCaps

    Seed Prompt
    A woman talks nearby as water pours

    Generation Prompt
    Replay the audio and identify sounds that are absent as contrasting examples.
    Specific sound events detected in the provided audio:
    1. A woman’s voice 2. Water pouring
    Contrastive examples of specific sound events not present
    in the provided audio:
    1. A car driving by 2. A dog barking 3. The door bang

    Source: AudioCaps

    Seed Prompt
    Audio1: Different types of birds chirping.

    Audio2: A man speaking briefly as food and oil sizzle.

    Generation Prompt
    Explain the contrast between the two audio pieces in one long sentence.
    The contrast between the two audio pieces is that audio1 features a continuous, natural soundscape of various bird species chirping, whereas audio2 contains a brief, human speech and a dominant, sharp sounds of food and oil sizzling.

    Audio 1

    Audio 2

    Source: AudioCaps

    Seed Prompt
    Audio1: Baby crying and a woman giggling.
    Audio2: Loud claps of thunder followed by music playing.

    Generation Prompt
    Play back both audio files.
    For Audio 1:
    - A baby is crying.
    - A woman is giggling.

    For Audio 2:
    - Loud claps of thunder are heard.
    - Music starts playing after the thunder sound.

    Audio 1

    Audio 2

    Source: AudioCaps


    comparison of duration comparison of samples

    The left image compares the equivalent training duration of BALSa with other ALLMs, while the right image compares the number of training samples used by BALSa and other ALLMs. BALSa achieves strong performance using only 12% of the training data duration required by Qwen2-Audio.

    Stage 2: Training

  • BALSA trains a lightweight audio modality adapter while keeping the backbone LLM frozen, ensuring efficient learning without modifying large models.
  • Single-Audio Training: The adapter aligns a single audio input with text descriptions (orange arrows).
  • Multi-Audio Training: The adapter processes multiple audio inputs simultaneously, aligning them through comparative descriptions (orange + blue arrows).
  • This approach effectively reduces computational costs while maintaining strong audio-language alignment.

  • Back to Table of Contents