Azure.AI.VoiceLive (.NET)

Real-time voice AI SDK for building bidirectional voice assistants with Azure AI.

Installation

dotnet add package Azure.AI.VoiceLive
dotnet add package Azure.Identity
dotnet add package NAudio                    # For audio capture/playback

Current versions: Stable v1.0.0, Preview v1.1.0-beta.1

Environment variables

AZURE_VOICELIVE_ENDPOINT=https://<resource>.services.ai.azure.com/
AZURE_VOICELIVE_MODEL=gpt-4o-realtime-preview
AZURE_VOICELIVE_VOICE=en-US-AvaNeural
# Optional: API key if you are not using Entra ID
AZURE_VOICELIVE_API_KEY=<your-api-key>
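
The variables above can be read once at startup and turned into plain settings. The following is a minimal sketch using only the .NET base library. `GetSetting` is a hypothetical helper, not part of the SDK, and the fallback values simply mirror the sample values listed above.

```csharp
using System;

// Hypothetical helper: returns the trimmed value of an environment variable,
// the fallback when the variable is unset or blank, or throws when there is
// neither a value nor a fallback.
static string GetSetting(string name, string? fallback = null)
{
    string? value = Environment.GetEnvironmentVariable(name);
    if (!string.IsNullOrWhiteSpace(value))
    {
        return value.Trim();
    }
    return fallback ?? throw new InvalidOperationException($"{name} is not set.");
}

// Optional settings fall back to the sample values from this document.
string model = GetSetting("AZURE_VOICELIVE_MODEL", "gpt-4o-realtime-preview");
string voice = GetSetting("AZURE_VOICELIVE_VOICE", "en-US-AvaNeural");

// The API key is optional: when it is absent, Entra ID authentication is used.
bool useApiKey = !string.IsNullOrWhiteSpace(
    Environment.GetEnvironmentVariable("AZURE_VOICELIVE_API_KEY"));

Console.WriteLine($"model={model}, voice={voice}, auth={(useApiKey ? "API key" : "Entra ID")}");

// The endpoint has no sensible default, so read it without a fallback at the
// point where the client is created:
// var endpoint = new Uri(GetSetting("AZURE_VOICELIVE_ENDPOINT"));
```

Failing fast on a missing endpoint gives a clearer message than the `ArgumentNullException` that `new Uri(null)` would raise.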

Authentication

Microsoft Entra ID (Recommended)

using Azure.Identity;
using Azure.AI.VoiceLive;

Uri endpoint = new Uri("https://your-resource.cognitiveservices.azure.com");
DefaultAzureCredential credential = new DefaultAzureCredential();
VoiceLiveClient client = new VoiceLiveClient(endpoint, credential);

Required role: Cognitive Services User (assign it in the Azure portal → Access control (IAM))

API key

using Azure;
using Azure.AI.VoiceLive;

Uri endpoint = new Uri("https://your-resource.cognitiveservices.azure.com");
// Placeholder value: load the key from configuration instead of hard-coding it
AzureKeyCredential credential = new AzureKeyCredential("your-api-key");
VoiceLiveClient client = new VoiceLiveClient(endpoint, credential);

Client hierarchy

VoiceLiveClient
└── VoiceLiveSession (WebSocket connection)
    ├── ConfigureSessionAsync()
    ├── GetUpdatesAsync() → SessionUpdate events
    ├── AddItemAsync() → UserMessageItem, FunctionCallOutputItem
    ├── SendAudioAsync()
    └── StartResponseAsync()

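Main workflow

1. Start and configure the session

using System;
using Azure.Identity;
using Azure.AI.VoiceLive;

// Fail fast with a clear message if the endpoint is not configured
string endpointValue = Environment.GetEnvironmentVariable("AZURE_VOICELIVE_ENDPOINT")
    ?? throw new InvalidOperationException("AZURE_VOICELIVE_ENDPOINT is not set.");
var endpoint = new Uri(endpointValue);
var client = new VoiceLiveClient(endpoint, new DefaultAzureCredential());

// Use the configured model, falling back to the lightweight one
var model = Environment.GetEnvironmentVariable("AZURE_VOICELIVE_MODEL") ?? "gpt-4o-mini-realtime-preview";

// Start the session
using VoiceLiveSession session = await client.StartSessionAsync(model);

// Configure the session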
VoiceLiveSessionOptions sessionOptions = new()
{
    Model = model,
    Instructions = "You are a helpful AI assistant. Respond naturally.",
    Voice = new AzureStandardVoice("en-US-AvaNeural"),
    TurnDetection = new AzureSemanticVadTurnDetection()
    {
        Threshold = 0.5f,
        PrefixPadding = TimeSpan.FromMilliseconds(300),
        SilenceDuration = TimeSpan.FromMilliseconds(500)
    },
    InputAudioFormat = InputAudioFormat.Pcm16,
    OutputAudioFormat = OutputAudioFormat.Pcm16
};

// Set the modalities (text and audio for voice assistants)
sessionOptions.Modalities.Clear();
sessionOptions.Modalities.Add(InteractionModality.Text);
sessionOptions.Modalities.Add(InteractionModality.Audio);

await session.ConfigureSessionAsync(sessionOptions);

2. Process events

await foreach (SessionUpdate serverEvent in session.GetUpdatesAsync())
{
    switch (serverEvent)
    {
        case SessionUpdateResponseAudioDelta audioDelta:
            byte[] audioData = audioDelta.Delta.ToArray();
            // Play the audio with NAudio or another audio library
            break;

        case SessionUpdateResponseTextDelta textDelta:
            Console.Write(textDelta.Delta);
            break;

        case SessionUpdateResponseFunctionCallArgumentsDone functionCall:
            // Handle the function call (see the Function Calling section)
            break;

        case SessionUpdateError error:
            Console.WriteLine($"Error: {error.Error.Message}");
            break;

        case SessionUpdateResponseDone:
            Console.WriteLine("\n--- Response complete ---");
            break;
    }
}

3. Send a user message

await session.AddItemAsync(new UserMessageItem("Hello, can you help me?"));
await session.StartResponseAsync();

4. Function Calling

// Define the function
var weatherFunction = new VoiceLiveFunctionDefinition("get_current_weather")
{
    Description = "Get the current weather for a given location",
    Parameters = BinaryData.FromString("""
        {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state or country"
                }
            },
            "required": ["location"]
        }
        """)
};

// Add it to the session options before calling ConfigureSessionAsync
sessionOptions.Tools.Add(weatherFunction);

// Handle the function call in the event loop (requires: using System.Text.Json;)
if (serverEvent is SessionUpdateResponseFunctionCallArgumentsDone functionCall)
{
    if (functionCall.Name == "get_current_weather")
    {
        var parameters = JsonSerializer.Deserialize<Dictionary<string, string>>(functionCall.Arguments);
        // TryGetValue avoids a KeyNotFoundException when the argument is missing
        string location = parameters is not null && parameters.TryGetValue("location", out var value) ? value : "";

        // Call an external service (stubbed here with a fixed reply)
        string weatherInfo = $"The weather in {location} is sunny, 75°F.";

        // Send the function output back to the session
        await session.AddItemAsync(new FunctionCallOutputItem(functionCall.CallId, weatherInfo));
        await session.StartResponseAsync();
    }
}
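
Deserializing the arguments into `Dictionary<string, string>` throws a `JsonException` as soon as the model sends a non-string value (for example a number) or malformed JSON. One possible alternative, sketched below with `System.Text.Json` only, reads a single string argument defensively. `TryGetStringArgument` is a hypothetical helper, and the sketch assumes the arguments arrive as a JSON string, as the snippet above implies.

```csharp
using System;
using System.Text.Json;

// Hypothetical helper: returns the named string argument, or null when the
// payload is not valid JSON, is not an object, lacks the property, or the
// property is not a JSON string.
static string? TryGetStringArgument(string argumentsJson, string name)
{
    try
    {
        using JsonDocument document = JsonDocument.Parse(argumentsJson);
        JsonElement root = document.RootElement;
        if (root.ValueKind == JsonValueKind.Object
            && root.TryGetProperty(name, out JsonElement value)
            && value.ValueKind == JsonValueKind.String)
        {
            return value.GetString();
        }
        return null;
    }
    catch (JsonException)
    {
        // Malformed JSON: treat it as a missing argument
        return null;
    }
}

// Mixed value types are fine because only the requested property is read.
Console.WriteLine(TryGetStringArgument("{\"location\": \"Seattle, WA\", \"days\": 3}", "location"));
// prints "Seattle, WA"
Console.WriteLine(TryGetStringArgument("{\"location\": 42}", "location") ?? "(missing)");
// prints "(missing)"
Console.WriteLine(TryGetStringArgument("not json", "location") ?? "(missing)");
// prints "(missing)"
```

When the helper returns null, the handler can still send a function output that describes the problem, so the model is able to ask the user for the missing detail.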

Voice options

Voice type	Class	Example
Azure Standard	`AzureStandardVoice`	`"en-US-AvaNeural"`
Azure HD	`AzureStandardVoice`	`"en-US-Ava:DragonHDLatestNeural"`
Azure Custom	`AzureCustomVoice`	Custom voice with an endpoint ID

Supported models

Model	Description
`gpt-4o-realtime-preview`	GPT-4o with real-time audio
`gpt-4o-mini-realtime-preview`	Lightweight, fast interactions
`phi4-mm-realtime`	Cost-effective multimodal

Key types reference

Type	Purpose
`VoiceLiveClient`	Main client for creating sessions
`VoiceLiveSession`	Active WebSocket session
`VoiceLiveSessionOptions`	Session configuration
`AzureStandardVoice`	Azure standard voice provider
`AzureSemanticVadTurnDetection`	Semantic voice activity detection for turn-taking
`VoiceLiveFunctionDefinition`	Function tool definition
`UserMessageItem`	User text message
`FunctionCallOutputItem`	Function call response
`SessionUpdateResponseAudioDelta`	Audio chunk event
`SessionUpdateResponseTextDelta`	Text chunk event

Best practices

Always set both modalities: include Text and Audio for voice assistants
Use AzureSemanticVadTurnDetection: it provides a natural conversation flow
Configure an appropriate silence duration: 500 ms is typical and avoids cutting the user off prematurely
Use the using statement: it ensures the session is disposed correctly
Handle all event types: check for errors, audio, text, and function calls
Use DefaultAzureCredential: never hard-code API keys

Error handling

if (serverEvent is SessionUpdateError error)
{
    if (error.Error.Message.Contains("Cancellation failed: no active response"))
    {
        // Benign error, can be ignored
    }
    else
    {
        Console.WriteLine($"Error: {error.Error.Message}");
    }
}
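
If more benign messages turn up over time, the string check above can be moved into a small predicate so the event loop stays short. This is a sketch under assumptions: `IsBenignError` and the `benignFragments` list are hypothetical names, and the only entry is the message quoted above.

```csharp
using System;

// Message fragments treated as harmless. Only the first entry comes from this
// document; extend the list as other benign messages are identified.
string[] benignFragments =
{
    "Cancellation failed: no active response",
};

// Hypothetical helper: true when the message contains a known benign fragment.
bool IsBenignError(string? message)
{
    if (string.IsNullOrEmpty(message))
    {
        return false;
    }
    foreach (string fragment in benignFragments)
    {
        if (message.Contains(fragment, StringComparison.OrdinalIgnoreCase))
        {
            return true;
        }
    }
    return false;
}

Console.WriteLine(IsBenignError("Cancellation failed: no active response")); // True
Console.WriteLine(IsBenignError("Rate limit exceeded"));                     // False
```

In the event loop, the `if` condition would then read `IsBenignError(error.Error.Message)`, with everything else logged as before.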

Audio configuration

Input format: InputAudioFormat.Pcm16 (16-bit PCM)
Output format: OutputAudioFormat.Pcm16
Sample rate: 24 kHz recommended
Channels: Mono
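
These settings fix the data rate: 24,000 samples per second × 2 bytes per sample × 1 channel = 48,000 bytes per second. The sketch below, which uses only the .NET base library, turns that into buffer sizes. The helper names are illustrative and the 24 kHz rate is the recommended value above, not a guarantee about every deployment.

```csharp
using System;

// Assumed stream format: 24 kHz, mono, 16-bit PCM (2 bytes per sample).
const int SampleRateHz = 24_000;
const int Channels = 1;
const int BytesPerSample = 2;
const int BytesPerSecond = SampleRateHz * Channels * BytesPerSample; // 48,000

// Bytes needed to hold `duration` of audio. Integer tick math avoids
// floating-point rounding on the sample count.
int BytesForDuration(TimeSpan duration)
{
    long samples = duration.Ticks * SampleRateHz / TimeSpan.TicksPerSecond;
    return (int)(samples * Channels * BytesPerSample);
}

// Playback time represented by a buffer of `byteCount` bytes.
TimeSpan DurationOfBytes(int byteCount) =>
    TimeSpan.FromSeconds((double)byteCount / BytesPerSecond);

Console.WriteLine(BytesForDuration(TimeSpan.FromMilliseconds(20)));  // 960
Console.WriteLine(BytesForDuration(TimeSpan.FromMilliseconds(500))); // 24000
Console.WriteLine(DurationOfBytes(48_000).TotalSeconds);             // 1
```

With NAudio, the same format would typically be expressed as `new WaveFormat(24000, 16, 1)` for both capture and playback.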

Related SDKs

SDK	Purpose	Install
`Azure.AI.VoiceLive`	Real-time voice (this SDK)	`dotnet add package Azure.AI.VoiceLive`
`Microsoft.CognitiveServices.Speech`	Speech-to-text, text-to-speech	`dotnet add package Microsoft.CognitiveServices.Speech`
`NAudio`	Audio capture/playback	`dotnet add package NAudio`

Reference links

Resource	URL
NuGet Package	https://www.nuget.org/packages/Azure.AI.VoiceLive
API Reference	https://learn.microsoft.com/dotnet/api/azure.ai.voicelive
GitHub Source	https://github.com/Azure/azure-sdk-for-net/tree/main/sdk/ai/Azure.AI.VoiceLive
Quickstart	https://learn.microsoft.com/azure/ai-services/speech-service/voice-live-quickstart
