Vox-Infinity

Benchmarking the Limits of Long-Context Spoken Language Models

🤗 Dialogues of Vox-Infinity | 🤗 QAs of Vox-Infinity

Abstract

Long-context reasoning remains a fundamental challenge for large language models: when inputs grow very long, models often forget salient information. The problem is even more pronounced in the speech domain, where audio, as a low-compression modality, requires significantly more embeddings than text to preserve both semantic content and acoustic cues. To address this, we introduce Vox-Infinity, the first benchmark specifically designed to evaluate long-context understanding in spoken language models. Vox-Infinity systematically extends audio history along two dimensions: number of turns and duration. It covers a diverse range of representative scenarios, including dialogues with varying structural depth and semantic complexity. Crucially, it provides explicit answer provenance annotations and groups samples by the context length required to resolve each query, enabling precise, length-aware evaluation of model performance. We also present the first comprehensive study of history modeling strategies in this setting, analyzing how models trade off preserving long-range semantics against retaining recent acoustic signals.
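As a rough illustration of what length-aware evaluation could look like, the sketch below groups per-question results by the amount of history needed to answer them and reports accuracy per group. The bucket edges, the function names, and the `(required_turns, is_correct)` result format are all assumptions made for this example; they are not the benchmark's actual grouping or scoring code.

```python
from collections import defaultdict

# Hypothetical bucket edges, measured in turns of required history.
# The benchmark's real grouping may differ.
BUCKET_EDGES = [10, 20, 40]


def bucket_for(required_turns):
    """Return a label for the smallest bucket that covers required_turns."""
    for edge in BUCKET_EDGES:
        if required_turns <= edge:
            return f"<={edge}"
    return f">{BUCKET_EDGES[-1]}"


def accuracy_by_bucket(results):
    """Compute accuracy per bucket.

    results: iterable of (required_turns, is_correct) pairs, one per question.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for required_turns, is_correct in results:
        label = bucket_for(required_turns)
        totals[label] += 1
        hits[label] += int(is_correct)
    return {label: hits[label] / totals[label] for label in totals}


# Made-up results, purely to show the call shape.
toy_results = [(5, True), (8, False), (15, True), (35, False), (60, False)]
print(accuracy_by_bucket(toy_results))
```

Reporting one number per bucket, rather than a single overall score, makes it visible whether accuracy drops as the required context grows.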

Dialogue illustration
User
Hi there, my name is Jane. I'm traveling to Cambridge and I'd like to try some local food.
Bot
Good day and thank you for calling. How can I help you?
User
I'm looking for a train from Cambridge to Peterborough on Sunday.
Bot
Sure, not a problem. Can you hold on for me while I quickly search through my database?
User
Shofeng, can you please find one that leaves after 15 minutes past 4 p.m.?
Bot
Okay, I have a train at 1606.
User
Okay, great. Can I please have the travel time?
Bot
Sure, it's 15 minutes, 15 minutes, sorry.
User
Okay, and what is the arrival time?
Bot
1656
User
Okay, great. I'm also looking for an expensive place to dine in the center.
Bot
Okay, can you hold on while I quickly search through my database?
User
Sure thing. Please look for one that serves traditional Chinese food.
Bot
Okay. All right. I have picked up the Ugly Duckling. It's actually serving Chinese food at an expensive range and it's situated in the center.
User
Oh, that's excellent. Can I please have their postcode?
Bot
All right, just a second. Yes, CB21TW.
User
Thank you. And what is their address?
Bot
It's 12 St. John Street, City Center.
User
Can I please have their phone number?
Bot
Okay, unfortunately, we do not have a contact number for them.
User
Oh, that's okay. Thank you very much.
Bot
Not a problem. Would you like me to send you an email?
User
Yes, sure. That would be great.
Bot
All right. And what is your email address?
User
My email address is jforjuliette.
Bot
Okay.
User
And then Edward, S and E for Echo, D for Delta, W for Whiskey, and A for Alpha.
Bot
Yes.
User
R for Romeo, D for Delta, S and T for Tango.
Bot
R-D-S-T-S?
User
And then six, the number six.
Bot
Yes.
User
And then E for Echo, N for November, and Yahoo.
Bot
Yes.
User
Yes, so it's jedwards, T6EN at yahoo.com.
Bot
Okay, just to quickly confirm that, it's jedwards6en at yahoo.com. Is that correct? Yes.
User
Yes, that's correct.
Bot
All right. Thank you so much. Is there anything else that I can assist you with?
User
No, thank you. Thank you for helping me out.
Bot
All right then. Take care. Bye-bye.
User
(Question) What is the arrival time of the train from Cambridge to Peterborough?
Bot
(Ref Answer, Source from Turn 10) The arrival time is 1656.
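
The provenance annotation in this sample ("Source from Turn 10") can be read as a pointer into the dialogue history. The sketch below is one possible way to represent such a sample and to measure how far back the evidence lies. The `QASample` class, its field names, and the 1-based turn numbering are assumptions made for this illustration, not the dataset's actual schema. Counting each utterance above as one turn, the question is turn 41, so the evidence sits 31 turns back.

```python
from dataclasses import dataclass


@dataclass
class QASample:
    """Hypothetical container for one question with answer provenance."""
    question: str
    reference: str
    source_turn: int    # 1-based index of the turn that holds the evidence
    question_turn: int  # 1-based index of the turn that asks the question

    def lookback(self) -> int:
        """Number of turns between the evidence and the question."""
        return self.question_turn - self.source_turn


def evidence(history, sample):
    """Return the utterance that the provenance annotation points to."""
    return history[sample.source_turn - 1]


# First ten utterances of the illustrated dialogue (text only, truncated
# here because the evidence is at turn 10).
history = [
    "Hi there, my name is Jane. I'm traveling to Cambridge and I'd like to try some local food.",
    "Good day and thank you for calling. How can I help you?",
    "I'm looking for a train from Cambridge to Peterborough on Sunday.",
    "Sure, not a problem. Can you hold on for me while I quickly search through my database?",
    "Shofeng, can you please find one that leaves after 15 minutes past 4 p.m.?",
    "Okay, I have a train at 1606.",
    "Okay, great. Can I please have the travel time?",
    "Sure, it's 15 minutes, 15 minutes, sorry.",
    "Okay, and what is the arrival time?",
    "1656",
]

sample = QASample(
    question="What is the arrival time of the train from Cambridge to Peterborough?",
    reference="The arrival time is 1656.",
    source_turn=10,
    question_turn=41,
)

print(evidence(history, sample))  # the utterance at turn 10
print(sample.lookback())          # turns between evidence and question
```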