💦 AQUA-Bench: Beyond Finding Answers to Knowing When There Are None in Audio Question Answering


Chun-Yi Kuan & Hung-yi Lee · National Taiwan University

Figure 1. AQUA-Bench overview.
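  • We introduce AQUA-Bench, a benchmark for assessing how models handle unanswerable audio questions across three settings: Absent Answer Detection (AAD), Incompatible Answer Set Detection (IASD), and Incompatible Audio Question Detection (IAQD).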
  • We construct test sets spanning diverse audio types including animal, vocal, and instrumental sounds.
  • Experiments reveal that while current audio-aware large language models (ALLMs) handle solvable cases well, they often fail to detect unanswerable ones, highlighting a key reliability gap.

Experimental Results

Overall Results
Figure 2. Overall accuracy on solvable vs. unanswerable questions.
Category Breakdown
Figure 3. To further probe these behaviors, we also explored the effect of including explicit guidance in the prompt. For instance, we appended instructions such as "Select None of the above if you believe none of the listed answers are right" for AAD and IASD tasks, or "Pick Unanswerable when the audio lacks the details needed to decide" for IAQD tasks. Performance improved for every model we evaluated.
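
The explicit-guidance variant described above can be sketched as follows. This is a minimal illustration, not the evaluation code behind the results: the helper name `add_explicit_guidance` and the use of plain task-name strings as keys are assumptions, and only the two instruction sentences come from the text.

```python
# Instruction sentences quoted from the text above; everything else is illustrative.
NONE_OF_THE_ABOVE = (
    "Select None of the above if you believe none of the listed answers are right."
)
UNANSWERABLE = "Pick Unanswerable when the audio lacks the details needed to decide."

GUIDANCE = {
    "AAD": NONE_OF_THE_ABOVE,
    "IASD": NONE_OF_THE_ABOVE,
    "IAQD": UNANSWERABLE,
}


def add_explicit_guidance(prompt: str, task: str) -> str:
    """Append the task-specific guidance sentence to a prompt (hypothetical helper)."""
    if task not in GUIDANCE:
        raise ValueError(f"unknown task: {task!r}")
    return f"{prompt.rstrip()} {GUIDANCE[task]}"


print(add_explicit_guidance("What animal is barking in the audio?", "AAD"))
```

Keeping the guidance as a separate suffix makes it easy to run the same question with and without the extra instruction and compare the two conditions.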

Data Construction

AQUA-Bench includes solvable and unanswerable tasks generated under three systematic settings: AAD, IASD, and IAQD. Below we describe the design process and semantic distractors.

1. Animal Sounds

Base Classes: Dog, Rooster, Pig, Cow, Frog, Cat, Hen, Sheep, Crow.

Verb to Animal Mapping: barking → Dog, meowing → Cat, oinking → Pig, mooing → Cow, croaking → Frog, clucking → Hen, chirping → Bird, crowing → Rooster, bleating → Sheep, cawing → Crow.

Standard Question Templates:


(1) What animal is {verb} in the audio?
(2) Which animal is responsible for the {verb} sound in the clip?
(3) What animal can be heard {verb} in the recording?
(4) Identify the animal that is {verb} in the audio.
(5) Based on the audio, which animal is {verb}?
(6) Which of the following animals is {verb} in the clip?
(7) Which animal makes the {verb} sound in the audio?
(8) From the audio, which animal is making the {verb} sound?
(9) Which animal do you hear {verb} in the audio?
(10) The sound of {verb} in the clip is made by which animal?
(11) Which animal produces the {verb} sound in the audio?
(12) Can you tell which animal is {verb} in the recording?
(13) Which creature is responsible for the {verb} sound?
(14) Listen to the clip — what animal is {verb}?
(15) What creature is making the {verb} noise in the audio?
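
As a rough sketch of how the verb mapping and the templates above could be combined, the snippet below fills a template with the verb that matches a clip's label. The function name `make_question` and the choice to sample templates uniformly at random are assumptions made for illustration; only three of the fifteen templates are reproduced.

```python
import random

# Mapping copied from the "Verb to Animal Mapping" line above.
VERB_TO_ANIMAL = {
    "barking": "Dog", "meowing": "Cat", "oinking": "Pig", "mooing": "Cow",
    "croaking": "Frog", "clucking": "Hen", "chirping": "Bird",
    "crowing": "Rooster", "bleating": "Sheep", "cawing": "Crow",
}
# Inverted so that a clip label can be turned into its verb.
ANIMAL_TO_VERB = {animal: verb for verb, animal in VERB_TO_ANIMAL.items()}

# Three of the standard question templates listed above.
TEMPLATES = [
    "What animal is {verb} in the audio?",
    "Which animal is responsible for the {verb} sound in the clip?",
    "Identify the animal that is {verb} in the audio.",
]


def make_question(label: str, rng: random.Random) -> str:
    """Pick a template at random and fill in the verb for the clip's label."""
    verb = ANIMAL_TO_VERB[label]
    return rng.choice(TEMPLATES).format(verb=verb)


rng = random.Random(0)  # seeded so repeated runs give the same question
print(make_question("Dog", rng))
```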

IAQD Templates:


(1) What color is the animal that is {verb}?
(2) Where is the animal that is {verb} located?
(3) What object is near the {verb} animal?
(4) What emotion does the {verb} animal express?
(5) What does the animal that is {verb} look like?

Semantic Distractors:

animal = Dog, Rooster, Pig, Cow, Frog, Cat, Hen, Sheep, Crow
profession = Doctor, Teacher, Nurse, Pilot, Police, Fireman, Chef, Farmer, Builder
color = Red, Blue, Green, Yellow, Black, White, Orange, Purple, Pink, Brown, Gray
object = Table, Chair, Phone, Bottle, Book, Cup, Computer, Refrigerator, Flower, Fence
emotion = Happy, Sad, Angry, Excited, Disgusted, Confused
fictional = Alien, Robot, Cartoon character, Stuffed toy, Fairy, Wizard, Superhero, Vampire
place = Park, Forest, Desert, Farm, House, Jungle, Zoo, Hospital, Library, Restaurant, Cafe, Bar, Beach, Mountain, River, Lake, Sea, Sky
appearance = Spotted, Striped, Furry, Slimy, Feathery, Shiny
size = Small, Medium, Large, Tiny, Giant, Huge, Enormous, Colossal
food = Ramen, Sushi, Udon, Soba, Sukiyaki, Shabu-shabu, Pasta, Pizza, Risotto, Lasagna, Gelato, Escargot, Beef Bourguignon
drink = Matcha, Sencha, Mugicha, Umeshu, Espresso, Cappuccino, Latte, Vin Rouge, Vin Blanc, Champagne
clothing = Kimono, Yukata, Beret, Tabi, Loafer, Happi, Shirt, Jumper, T-shirt, Blouse, Trousers, Jeans, Skirt, Leggings
city = Tokyo, Kyoto, Osaka, Hiroshima, Sapporo, Rome, Paris, Venice, Milan, Barcelona, Lyon, Istanbul, Bangkok, Ho Chi Minh City, Hanoi, Phnom Penh, Vientiane
vehicle = Car, Truck, Bike, Bus, Boat, Plane, Train, Motorcycle, Tractor, Horse-drawn cart
game = Soccer, Basketball, Tennis, Baseball, Tag, Hide and seek, Racing, Swimming, Water polo
book = Manga, Comic, Novel, Magazine, Textbook, Dictionary, Encyclopedia, Cookbook, Travel guide, Art book, Science book, History book, Biography, Self-help book, Horror book, Romance book, Science fiction book, Fantasy book, Mystery book, Thriller book
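
One plausible way to turn these pools into answer options is sketched below. It assumes that an AAD item is a standard question whose correct answer has been left out of the options, and that an IASD item is a standard question whose options all come from a category unrelated to the question. The function names, the four-option format, and the random sampling are illustrative assumptions, and only a subset of the pools is reproduced.

```python
import random

# A subset of the distractor pools listed above.
POOLS = {
    "animal": ["Dog", "Rooster", "Pig", "Cow", "Frog", "Cat", "Hen", "Sheep", "Crow"],
    "color": ["Red", "Blue", "Green", "Yellow", "Black", "White"],
    "vehicle": ["Car", "Truck", "Bike", "Bus", "Boat", "Plane"],
}


def solvable_options(answer, rng, k=4):
    """Correct answer plus k-1 same-category distractors, shuffled."""
    others = [a for a in POOLS["animal"] if a != answer]
    options = rng.sample(others, k - 1) + [answer]
    rng.shuffle(options)
    return options


def aad_options(answer, rng, k=4):
    """Assumed AAD: same-category options with the correct answer left out."""
    others = [a for a in POOLS["animal"] if a != answer]
    return rng.sample(others, k)


def iasd_options(rng, k=4, target="animal"):
    """Assumed IASD: every option comes from one category unrelated to the question."""
    category = rng.choice([c for c in POOLS if c != target])
    return rng.sample(POOLS[category], k)


rng = random.Random(0)
print("solvable:", solvable_options("Dog", rng))
print("AAD:     ", aad_options("Dog", rng))
print("IASD:    ", iasd_options(rng))
```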
          

2. Vocal (Non-verbal Human) Sounds

Base Classes: Laughter, Sighs, Coughs, Throat Clearings, Sneezes, Sniffs.

Standard Question Templates:


(1) What non-verbal human sound is heard in the audio?
(2) Which non-speech vocal sound can be heard in the recording?
(3) Identify the non-verbal sound present in the clip.
(4) Based on the audio, which human sound is being made?
(5) Based on the audio, which human sound is being made?
(6) Which of the following non-verbal sounds is heard in the clip?
(7) From the sound in the clip, what non-speech sound do you recognize?
(8) Can you tell which non-verbal expression is present in the recording?
(9) Which non-verbal human sound is responsible for the noise in the audio?
(10) Listen to the clip — what type of non-speech sound is being made?
(11) The sound in the audio corresponds to which kind of human vocalization?
(12) Which non-verbal vocalization is heard in the recording?
(13) Which type of non-speech human sound appears in the audio?
(14) What non-verbal sound is featured in the clip?
(15) What kind of human expression is being produced in this audio segment?

IAQD Templates:


(1) What color is the person in the audio wearing?
(2) Where is the person located in the recording?
(3) Which city is the speaker in?
(4) What object is near the person in the audio?
(5) What emotion does the speaker express?
(6) How big is the person in the audio?
(7) What is the person eating?
(8) What is the person drinking?
(9) What is the person wearing?
(10) What kind of vehicle is the person riding?
(11) What game is the speaker playing?
(12) What is the person reading?
(13) What is the person's job?

Semantic Distractors:

vocal_sound = Laughter, Sighs, Coughs, Throat Clearings, Sneezes, Sniffs
animal = Dog, Cat, Frog, Cow, Pig, Horse, Sheep, Hen, Bird, Lion
profession = Doctor, Teacher, Nurse, Pilot, Police, Fireman, Chef, Farmer, Builder
color = Red, Blue, Green, Yellow, Black, White, Orange, Purple, Pink, Brown, Gray
object = Table, Chair, Phone, Bottle, Book, Cup, Computer, Refrigerator, Flower, Fence
emotion = Happy, Sad, Angry, Excited, Disgusted, Confused
fictional = Alien, Robot, Cartoon character, Stuffed toy, Fairy, Wizard, Superhero, Vampire
place = Park, Forest, Desert, Farm, House, Jungle, Zoo, Hospital, Library, Restaurant, Cafe, Bar, Beach, Mountain, River, Lake, Sea, Sky
appearance = Spotted, Striped, Furry, Slimy, Feathery, Shiny
size = Small, Medium, Large, Tiny, Giant, Huge, Enormous, Colossal
food = Ramen, Sushi, Udon, Soba, Sukiyaki, Shabu-shabu, Pasta, Pizza, Risotto, Lasagna, Gelato, Escargot, Beef Bourguignon
drink = Matcha, Sencha, Mugicha, Umeshu, Espresso, Cappuccino, Latte, Vin Rouge, Vin Blanc, Champagne
clothing = Kimono, Yukata, Beret, Tabi, Loafer, Happi, Shirt, Jumper, T-shirt, Blouse, Trousers, Jeans, Skirt, Leggings
city = Tokyo, Kyoto, Osaka, Hiroshima, Sapporo, Rome, Paris, Venice, Milan, Barcelona, Lyon, Istanbul, Bangkok, Ho Chi Minh City, Hanoi, Phnom Penh, Vientiane
vehicle = Car, Truck, Bike, Bus, Boat, Plane, Train, Motorcycle, Tractor, Horse-drawn cart
game = Soccer, Basketball, Tennis, Baseball, Tag, Hide and seek, Racing, Swimming, Water polo
book = Manga, Comic, Novel, Magazine, Textbook, Dictionary, Encyclopedia, Cookbook, Travel guide, Art book, Science book, History book, Biography, Self-help book, Horror book, Romance book, Science fiction book, Fantasy book, Mystery book, Thriller book
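
The IAQD templates above line up naturally with the distractor categories (eating with food, drinking with drink, and so on). The sketch below builds an IAQD item on that basis. The template-to-category pairing is inferred from the wording of the questions and is not stated in the source; `make_iaqd_item`, the four-option format, and the `label` field are likewise hypothetical.

```python
import random

# Hypothetical pairing of IAQD templates with distractor categories, inferred
# from the wording of each question.
IAQD_TEMPLATES = {
    "What is the person eating?": "food",
    "What is the person drinking?": "drink",
    "What kind of vehicle is the person riding?": "vehicle",
}

# Subsets of the distractor lists above.
POOLS = {
    "food": ["Ramen", "Sushi", "Udon", "Soba", "Pasta", "Pizza"],
    "drink": ["Matcha", "Sencha", "Espresso", "Cappuccino", "Latte", "Champagne"],
    "vehicle": ["Car", "Truck", "Bike", "Bus", "Boat", "Plane"],
}


def make_iaqd_item(rng, k=4):
    """Build one IAQD item: a question the audio cannot answer, with plausible options."""
    question = rng.choice(list(IAQD_TEMPLATES))
    options = rng.sample(POOLS[IAQD_TEMPLATES[question]], k)
    return {"question": question, "options": options, "label": "unanswerable"}


item = make_iaqd_item(random.Random(0))
print(item["question"], item["options"])
```

Because every option is a sensible answer to the question, a model can only get the item right by noticing that a cough or a laugh carries no information about food, drinks, or vehicles.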
          

3. Musical Instruments

Base Classes: Piano, Acoustic Guitar, Drum Set, Violin, Flute, Saxophone, Clarinet, Trumpet, Keyboard, Harmonica.

Standard Question Templates:


(1) What musical instrument sound is heard in the audio?
(2) Which instrumental sound can be heard in the recording?
(3) Identify the musical instrument present in the clip.
(4) Based on the audio, which instrument is being played?
(5) Which of the following instrument sounds is heard in the clip?
(6) From the sound in the clip, what musical instrument do you recognize?
(7) Can you tell which instrument is present in the recording?
(8) Which musical instrument is responsible for the sound in the audio?
(9) Listen to the clip — what type of instrument sound is being played?
(10) The sound in the audio corresponds to which kind of musical instrument?
(11) Which instrument sound is heard in the recording?
(12) Which type of instrumental sound appears in the audio?
(13) What musical instrument sound is featured in the clip?
(14) What kind of instrument is being played in this audio segment?

IAQD Templates:


(1) What color is the musical instrument in the audio?
(2) Where is the musical instrument located in the recording?
(3) Which city is the musical instrument in?
(4) What object is near the musical instrument in the audio?
(5) How big is the musical instrument in the audio?
(6) Who is playing the musical instrument?
(7) Can you tell who the instrument belongs to?
(8) What kind of vehicle is the instrument being transported with?
(9) What is the profession of the performer?
(10) What does the performer look like or wear?

Semantic Distractors:

music_instrument = Piano, Acoustic Guitar, Drum set, Violin, Flute, Saxophone, Clarinet, Trumpet, Keyboard, Harmonica
vocal_sound = Laughter, Sighs, Coughs, Throat Clearings, Sneezes, Sniffs
animal = Dog, Cat, Frog, Cow, Pig, Horse, Sheep, Hen, Bird, Lion
profession = Doctor, Teacher, Nurse, Pilot, Police, Fireman, Chef, Farmer, Builder
color = Red, Blue, Green, Yellow, Black, White, Orange, Purple, Pink, Brown, Gray
object = Table, Chair, Phone, Bottle, Book, Cup, Computer, Refrigerator, Flower, Fence
emotion = Happy, Sad, Angry, Excited, Disgusted, Confused
fictional = Alien, Robot, Cartoon character, Stuffed toy, Fairy, Wizard, Superhero, Vampire
place = Park, Forest, Desert, Farm, House, Jungle, Zoo, Hospital, Library, Restaurant, Cafe, Bar, Beach, Mountain, River, Lake, Sea, Sky
appearance = Spotted, Striped, Furry, Slimy, Feathery, Shiny
size = Small, Medium, Large, Tiny, Giant, Huge, Enormous, Colossal
food = Ramen, Sushi, Udon, Soba, Sukiyaki, Shabu-shabu, Pasta, Pizza, Risotto, Lasagna, Gelato, Escargot, Beef Bourguignon
drink = Matcha, Sencha, Mugicha, Umeshu, Espresso, Cappuccino, Latte, Vin Rouge, Vin Blanc, Champagne
clothing = Kimono, Yukata, Beret, Tabi, Loafer, Happi, Shirt, Jumper, T-shirt, Blouse, Trousers, Jeans, Skirt, Leggings, Dress, Suit, Tuxedo, Coat, Jacket, Sweater, Hoodie, Cap, Beanie, Scarf, Gloves, Socks, Shoes, Boots, Sandals, Flip-flops, Sneakers, High heels, Wedges, Ankle boots, Mules, Slippers
city = Tokyo, Kyoto, Osaka, Hiroshima, Sapporo, Rome, Paris, Venice, Milan, Barcelona, Lyon, Istanbul, Bangkok, Ho Chi Minh City, Hanoi, Phnom Penh, Vientiane
vehicle = Car, Truck, Bike, Bus, Boat, Plane, Train, Motorcycle, Tractor, Horse-drawn cart
game = Soccer, Basketball, Tennis, Baseball, Tag, Hide and seek, Racing, Swimming, Water polo
book = Manga, Comic, Novel, Magazine, Textbook, Dictionary, Encyclopedia, Cookbook, Travel guide, Art book, Science book, History book, Biography, Self-help book, Horror book, Romance book, Science fiction book, Fantasy book, Mystery book, Thriller book
player_identity = (placeholder identities)
owner_identity = (placeholder identities)
player_appearance = (placeholder attributes)
          

4. Guidance Prompts

All questions are paired with short instructions to standardize model behavior:


(1) Please select one from the options provided.
(2) Choose one of the options listed above.
(3) Pick one option to answer the question.
(4) Select your answer from the options given.
(5) Choose one of the choices shown.
(6) You may pick one from the provided options.
(7) Make your choice from the options above.
(8) Pick your answer from the choices provided.
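
A minimal sketch of how a question, its options, and one of the guidance prompts above might be assembled into a single text prompt. The lettered option layout, the placement of the guidance sentence after the options, and the random choice among prompts are assumptions made for illustration; `build_prompt` is a hypothetical helper.

```python
import random
import string

# Three of the eight guidance prompts listed above.
GUIDANCE_PROMPTS = [
    "Please select one from the options provided.",
    "Choose one of the options listed above.",
    "Pick one option to answer the question.",
]


def build_prompt(question, options, rng):
    """Format a question, lettered options, and a randomly chosen guidance prompt."""
    lines = [question]
    for letter, option in zip(string.ascii_uppercase, options):
        lines.append(f"({letter}) {option}")
    lines.append(rng.choice(GUIDANCE_PROMPTS))
    return "\n".join(lines)


print(build_prompt("What animal is barking in the audio?",
                   ["Cat", "Dog", "Cow", "Frog"], random.Random(0)))
```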
