One way translation - update images (#1735)

erikakettleson-openai 2025-03-25 12:55:44 -07:00 committed by GitHub
parent 4100fb99d0
commit 985d09d110
2 changed files with 11 additions and 7 deletions


@@ -10,13 +10,14 @@ A real-world use case for this demo is a multilingual, conversational translatio
Let's explore the main functionalities and code snippets that illustrate how the app works. You can find the code in the [accompanying repo](https://github.com/openai/openai-cookbook/tree/main/examples/voice_solutions/one_way_translation_using_realtime_api/README.md) if you want to run the app locally.
## High Level Architecture Overview
This project has two applications - a speaker app and a listener app. The speaker app takes in audio from the browser, forks the audio, and creates a unique Realtime session for each language, sending the audio to the OpenAI Realtime API via WebSocket. Translated audio streams back and is mirrored via a separate WebSocket server to the listener app. The listener app receives all translated audio streams simultaneously, but only the selected language is played. This architecture is designed for a POC and is not intended for a production use case. Let's dive into the workflow!
![Architecture](https://github.com/openai/openai-cookbook/blob/main/examples/voice_solutions/translation_images/Realtime_flow_diagram.png?raw=true)
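
To make the mirroring step concrete, here is a minimal sketch of the relay: a WebSocket server that re-broadcasts whatever the speaker app sends to every connected listener. It assumes the `ws` npm package and a hard-coded port; the names and message framing are illustrative, not the repo's actual implementation.

```javascript
// Minimal relay sketch (assumes the `ws` npm package; port and framing are illustrative).
import WebSocket, { WebSocketServer } from 'ws';

const wss = new WebSocketServer({ port: 8081 });

wss.on('connection', (socket) => {
  socket.on('message', (message) => {
    // Fan translated audio / transcript messages out to every other connected client.
    for (const client of wss.clients) {
      if (client !== socket && client.readyState === WebSocket.OPEN) {
        client.send(message);
      }
    }
  });
});
```
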
## Step 1: Language & Prompt Setup
We need a unique stream for each language - each language requires a unique prompt and session with the Realtime API. We define these prompts in `translation_prompts.js`.
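
As a rough illustration of what those per-language configs might look like, here is a hedged sketch; the real prompts and field names live in `translation_prompts.js` and may differ.

```javascript
// Illustrative only — the actual prompts and config fields are defined in
// translation_prompts.js and may be named differently.
const frenchInstructions =
  'You are a translator. Repeat everything the speaker says in French, and nothing else.';
const spanishInstructions =
  'You are a translator. Repeat everything the speaker says in Spanish, and nothing else.';

const languageConfigs = [
  { code: 'fr', instructions: frenchInstructions },
  { code: 'es', instructions: spanishInstructions },
];
```
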
@@ -37,7 +38,8 @@ const languageConfigs = [
## Step 2: Setting up the Speaker App
![SpeakerApp](https://github.com/openai/openai-cookbook/blob/main/examples/voice_solutions/translation_images/SpeakerApp.png?raw=true)
We need to handle the setup and management of client instances that connect to the Realtime API, allowing the application to process and stream audio in different languages. `clientRefs` holds a map of `RealtimeClient` instances, each associated with a language code (e.g., 'fr' for French, 'es' for Spanish) representing each unique client connection to the Realtime API.
@@ -94,7 +96,8 @@ const connectConversation = useCallback(async () => {
};
```
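
As a rough sketch of how those per-language clients might be created and stored in `clientRefs`, assuming the `@openai/realtime-api-beta` client and the illustrative `languageConfigs` from Step 1 (names and wiring are not the repo's actual implementation):

```javascript
// Sketch only: assumes RealtimeClient from @openai/realtime-api-beta and the
// illustrative languageConfigs above; the repo's actual wiring may differ.
import { useRef, useCallback } from 'react';
import { RealtimeClient } from '@openai/realtime-api-beta';

function useTranslationClients(languageConfigs, apiKey) {
  // One RealtimeClient per target language, keyed by language code ('fr', 'es', ...).
  const clientRefs = useRef({});

  const connectAll = useCallback(async () => {
    for (const { code, instructions } of languageConfigs) {
      const client = new RealtimeClient({
        apiKey,
        dangerouslyAllowAPIKeyInBrowser: true, // acceptable for a POC, not for production
      });
      client.updateSession({ instructions });
      await client.connect();
      clientRefs.current[code] = client;
    }
  }, [languageConfigs, apiKey]);

  return { clientRefs, connectAll };
}
```
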
## Step 3: Audio Streaming
Sending audio with WebSockets requires work to manage the inbound and outbound PCM16 audio streams ([more details on that](https://platform.openai.com/docs/guides/realtime-model-capabilities#handling-audio-with-websockets)). We abstract that using wavtools, a library for both recording and streaming audio data in the browser. Here we use `WavRecorder` to capture audio.
@@ -114,7 +117,9 @@ const startRecording = async () => {
};
```
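
The core of `startRecording` boils down to the pattern below. This is a sketch that assumes wavtools' `WavRecorder` is available as an npm package and reuses the hypothetical `clientRefs` map from Step 2, fanning each PCM16 chunk out to every language session.

```javascript
// Sketch only: WavRecorder capturing PCM16 microphone audio and appending each
// chunk to every language client; names mirror the hypothetical hook above.
import { WavRecorder } from 'wavtools';

const wavRecorder = new WavRecorder({ sampleRate: 24000 });

const startRecording = async () => {
  await wavRecorder.begin(); // request microphone access and start capture
  await wavRecorder.record((data) => {
    // data.mono is a PCM16 chunk; send it to each per-language Realtime session.
    for (const client of Object.values(clientRefs.current)) {
      client.appendInputAudio(data.mono);
    }
  });
};
```
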
## Step 4: Showing Transcripts
We listen for `response.audio_transcript.done` events to update the transcripts of the audio. These input transcripts are generated by the Whisper model in parallel to the GPT-4o Realtime inference that is doing the translations on raw audio.
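
As a sketch of what such a handler might look like, assuming the realtime client re-emits raw server events under a `server.` prefix and a hypothetical `onTranscript` callback for updating UI state:

```javascript
// Sketch only: the `server.` event prefix and field names are assumptions about
// the realtime-api-beta client; onTranscript is a hypothetical UI callback.
function subscribeToTranscripts(client, languageCode, onTranscript) {
  client.realtime.on('server.response.audio_transcript.done', (event) => {
    // event.transcript is the completed translated transcript for this response.
    onTranscript(languageCode, event.transcript);
  });
}
```
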


@@ -1847,4 +1847,3 @@
- audio
- speech