This Go package provides a client for interacting with the Gemini Live API, enabling real-time multimodal interactions including text, audio, and video streaming. This client is inspired by the functionalities demonstrated in the Gemini 2.0 - Multimodal live API: Streaming, showcasing the ability to stream bidirectional audio and handle text-based conversations.
This client allows you to build applications that can:
- Send and receive text messages.
- Stream audio input to the Gemini API and receive audio responses.
- Stream video input (from camera or screen capture) to the Gemini API.
- Real-time Communication: Leverages WebSockets for low-latency, bidirectional communication with the Gemini Live API.
- Text Input: Send textual prompts to the Gemini API for conversational interactions.
- Audio Input & Output:
- Record audio from your microphone and stream it to the API.
- Receive and play back audio responses from the API in real-time.
- Video Input (Camera & Screen Capture):
- Capture frames from your camera or screen and stream them to the API for multimodal prompts.
- Supports configuration to switch between camera and screen capture modes.
- Automatic Reconnection: Attempts to reconnect to the API if the WebSocket connection is interrupted.
- Configurable: Allows for customization of the Gemini model to be used.
- Error Handling & Logging: Provides robust error handling and logging using
logrus
.
- Go: Version 1.18 or higher.
- PortAudio: Required for audio input/output. Installation instructions vary by operating system.
- Linux (Debian/Ubuntu):
sudo apt-get install libportaudio2 libportaudiocpp0 portaudio19-dev
- Linux (Debian/Ubuntu):
- FFmpeg: Required for camera capture and image conversion.
- You can usually install this using your system's package manager (e.g.,
sudo apt-get install ffmpeg
on Debian/Ubuntu,brew install ffmpeg
on macOS).
- You can usually install this using your system's package manager (e.g.,
- GEMINI_API_KEY: You need a valid API key from Google AI Studio, which you can obtain by signing up for the Gemini API.
-
Clone the repository:
git clone <repository_url> cd <repository_directory>
-
Build the application:
go build -o gemini-client .
-
Set the API Key: You need to set the
GEMINI_API_KEY
environment variable. You can do this by running the following command:export GEMINI_API_KEY="YOUR_API_KEY"
Replace
YOUR_API_KEY
with your actual Gemini API key.
The client can be run in different modes by setting the MODE
environment variable. This mirrors the different interaction patterns showcased in the.
setupMsg := map[string]interface{}{
"setup": map[string]interface{}{
"model": "models/gemini-2.0-flash-exp",
"generationConfig": map[string]interface{}{
"responseModalities": []string{"TEXT"}, // "TEXT", "AUDIO"
},
},
}
The client defaults to text-based interaction, similar to the basic chat demonstrated in the "Live API - Websockets Quickstart" notebook. You can type messages in the terminal and send them to the Gemini API.
./gemini-client
You will be prompted with message>:
to enter your text.
To enable real-time bidirectional audio input and output, similar to the "Gemini 2.0 - Multimodal live API: Streaming", run the client without setting a specific responseModalities.
"responseModalities": []string{"AUDIO"}, // "TEXT", "AUDIO"
The client will automatically start recording audio from your microphone and playing back audio responses.
./gemini-client
To send camera frames to the API, enabling multimodal prompts as explored in the examples, set the MODE
environment variable to camera
.
export MODE="camera"
./gemini-client
Note: Ensure your camera is properly configured and accessible by the system. The code currently targets /dev/video1
. You might need to adjust this based on your camera setup.
func (c *Client) openVideoCapture() (*exec.Cmd, error) {
cap := exec.Command("ffmpeg", "-f", "v4l2", "-i", "/dev/video1", "-f", "rawvideo", "-pix_fmt", "rgb24", "-vframes", "1", "-")
return cap, nil
}
To send screen captures to the API, providing visual context to your prompts, set the MODE
environment variable to screen
.
export MODE="screen"
./gemini-client
The client will periodically capture your screen and send it to the API.
- Run the client (e.g., in text mode):
./gemini-client
- You will see the prompt:
message>:
- Type your message and press Enter:
message>: What is the weather like today?
- The response from the Gemini API will be printed in the terminal.
If running in audio or video modes, the client will continuously stream audio or video data to the API, enabling more dynamic and interactive conversations.
Refer to the Gemini API documentation for the available tools and their configurations.
- Windows Support: Add support for Windows, including audio input/output and camera/screen capture.
- MacOS Support: Add support for camera and screen capture on macOS.
- Tools Integrations: Add support for Google Search, etc
- Function Calling: Add support for calling functions in the Gemini API
Contributions are welcome! Please feel free to submit pull requests with improvements or bug fixes.
A Go-based implementation inspired by the functionality of live_api_starter.ipynb from https://github.com/google-gemini/cookbook.