Unified-Understanding-and-Generalization-Demo

Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models

Abstract

Extending the speech understanding or generation abilities of pre-trained Large Language Models (LLMs) by introducing various effective speech tokens has attracted great attention in the speech community. However, building a unified speech understanding and generation model still faces the following challenges: (1) due to the huge modality gap between speech tokens and text tokens, extending text LLMs to unified speech LLMs relies on large-scale paired data for fine-tuning, and (2) generation and understanding tasks prefer information at different levels, e.g., generation benefits from detailed acoustic features, while understanding favors high-level semantics. This divergence makes it difficult to optimize performance within one unified model. To address these challenges, we present two key insights in speech tokenization and speech language modeling. First, we propose an Understanding-driven Speech Tokenizer (USTokenizer), which extracts the high-level semantic information essential for accomplishing understanding tasks with text LLMs. The resulting tokens (USToken) therefore share more modality commonality with text, which reduces the difficulty of modality alignment when adapting text LLMs to speech LLMs. Second, we present DualSpeechLM, a dual-token modeling framework that concurrently models USToken as input and acoustic tokens as output within a unified, end-to-end framework, seamlessly integrating speech understanding and generation capabilities. Furthermore, we propose a novel semantic supervision loss and a Chain-of-Condition (CoC) strategy to stabilize model training and enhance speech generation performance. Experimental results demonstrate that our approach effectively fosters a complementary relationship between understanding and generation tasks, highlighting the promise of mutually enhancing both tasks in one unified model.

Model Overview

DualSpeechLM’s dual-token modeling paradigm. The left side illustrates the baseline pipeline, which uses the same tokens for LLM input and output. In contrast, DualSpeechLM incorporates an Acoustic GPT module into the text LLM for joint training and processes USToken inputs and acoustic token outputs through distinct modeling paths, capturing the different levels of information required for generation and understanding tasks.
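The data flow described in this caption can be illustrated with a toy sketch. It is only a schematic under stated assumptions, not the authors' implementation: the function names, vocabulary sizes, and the integer arithmetic standing in for neural networks are all hypothetical. In the real model the Acoustic GPT would generate acoustic tokens autoregressively, which this sketch simplifies to one output token per input token.

```python
from typing import List, Tuple

# Every name and number below is a hypothetical placeholder chosen for illustration.
US_VOCAB = 1024        # assumed size of the USToken (semantic) vocabulary
ACOUSTIC_VOCAB = 4096  # assumed size of the acoustic-token vocabulary


def us_tokenize(frames: List[float]) -> List[int]:
    """Stand-in for USTokenizer: map each speech frame to a semantic token id."""
    return [int(abs(x) * 1000) % US_VOCAB for x in frames]


def text_llm(us_tokens: List[int]) -> List[int]:
    """Stand-in for the text LLM: one 'hidden state' (an int here) per input token."""
    state, hidden = 0, []
    for tok in us_tokens:
        # Causal update: each state depends only on the current and earlier tokens.
        state = (state * 31 + tok) % 65521
        hidden.append(state)
    return hidden


def acoustic_gpt(hidden: List[int]) -> List[int]:
    """Stand-in for Acoustic GPT: turn LLM hidden states into acoustic token ids."""
    return [h % ACOUSTIC_VOCAB for h in hidden]


def dual_token_forward(frames: List[float]) -> Tuple[List[int], List[int]]:
    """Input path uses USToken ids; output path uses a separate acoustic vocabulary."""
    us_tokens = us_tokenize(frames)            # understanding-oriented input tokens
    hidden = text_llm(us_tokens)               # shared LLM backbone
    acoustic_tokens = acoustic_gpt(hidden)     # generation-oriented output tokens
    return us_tokens, acoustic_tokens


if __name__ == "__main__":
    us, ac = dual_token_forward([0.12, -0.5, 0.33, 0.9])
    print("USToken ids (input side):  ", us)
    print("Acoustic ids (output side):", ac)
```

The point of the sketch is the interface: the input and output sides draw from two different token vocabularies, whereas the baseline pipeline on the left of the figure would use a single token type for both.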

Text-to-Speech (TTS)

Target Text
Prompt
Baseline-Semantic
Baseline-Acoustic
SpeechGPT
Qwen-TTS
DualSpeechLM-Hubert
DualSpeechLM-USToken(Ours)
Ground Truth
Every doctor should provide himself with an antidote case.
Remain, I implore you: the evening is most lovely.
One room is papered, carpeted, over furnished; the next is almost bare.
The free State men shrank from forcible resistance to even bogus laws.
It’s my idee as he’s braver than the whole Blue Army put together.
Only under those conditions could the social organization be justified.
When Gordon and Jenkins came back, Murdoch tossed the money to them.

Voice Conversion (VC)

Target Text
Source
Prompt
Baseline-Semantic
Baseline-Acoustic
DualSpeechLM-Hubert
DualSpeechLM-USToken(Ours)
Ground Truth
Charles Gordon, leader of Glasgow City Council declined to comment.
The actual primary rainbow observed is said to be the effect of super-imposition of a number of bows.
Throughout the centuries people have explained the rainbow in various ways.
Ask her to bring these things with her from the store.
Others have tried to explain the phenomenon physically.

Text-to-Speech Translation (T2ST)

French-to-English (Fr–En):

Source Text
DualSpeechLM-USToken(Ours)
Reference
L’Allemand remporte deux autres courses, au Canada et en Allemagne.
the german won two other races one in canada and the other in germany.
Il est reconnu pour sa contribution à la mise en valeur de Trois-Rivières.
he is known for his contribution in developing the trois rivieres.
Le groupe des prosimiens est considéré comme un grade par les systématiciens évolutionnistes.
the prosimian group is considered a rank by the evolutionist systematists.

Spanish-to-English (Es–En):

Source Text
DualSpeechLM-USToken(Ours)
Reference
El idioma oficial del país toma como referencia los dialectos de Funafuti-Vaitupu.
the official language of the country is influenced by the funafuti vaitupu dialects.
Su verdadero nombre era Hans Emil Hansen.
his real name was hans emil hansen.
Aquel mismo día cambió su formato tradicional.
that very same day the traditional format changed.

Speech Conversation (SC)

In the SC task, DualSpeechLM is required to generate a speech response to a speech prompt that contains the speech content and the corresponding speech question.

Speech Content
Question
DualSpeechLM-USToken(Ours)
Reference
conseil replied but when a little glass of vinegar is worth one million five hundred thousand francs its taste is a small price to pay.
How much is a little glass of vinegar worth?
One million five hundred thousand francs.
my gallant ned i replied laughing those were artificial pearls ordinary glass beads whose insides were coated with essence of orient wow.
What were the pearls made of?
Ordinary glass beads whose insides were coated with essence of Orient.
i believe the average time underwater that these fishermen can tolerate is thirty seconds during which they hastily stuff their little nets with all the pearl oysters they can tear loose.
What is the average time underwater that fishermen can tolerate?
The average time underwater that fishermen can tolerate is thirty seconds.
and how i resented the offhand manner in which the captain had extended his deplorable invitation.
What did the speaker resent?
The speaker resented the captain’s offhand manner in which he extended his deplorable invitation.
that such primitive methods are still all that they use all.
What methods are still used?
Primitive methods are still used.