crmne · tpaulshippy · Jun 14, 2025
diff --git a/docs/_core_features/chat.md b/docs/_core_features/chat.md
@@ -129,7 +129,7 @@ Many modern AI models can process multiple types of input beyond just text. Ruby
 
 ### Working with Images
 
-Vision-capable models can analyze images, answer questions about visual content, and even compare multiple images. Common vision models include `gpt-4o`, `claude-3-opus`, and `gemini-1.5-pro`.
+Vision-capable models can analyze images, answer questions about visual content, and even compare multiple images. Some specialized models can also generate and edit images. Common vision models include `gpt-4o`, `claude-3-opus`, and `gemini-1.5-pro`.
 
 ```ruby
 # Ensure you select a vision-capable model
@@ -150,6 +150,34 @@ puts response.content
 
 RubyLLM automatically handles image encoding and formatting for each provider's API. Local images are read and encoded as needed, while URLs are passed directly when supported by the provider.
 
+### Image Generation with Chat
+
+While most vision models analyze images, some specialized models can generate and edit images through the chat interface. This approach is ideal for image editing workflows and iterative refinement:
+
+```ruby
+# Use a model capable of image generation
+chat = RubyLLM.chat(model: 'gemini-2.0-flash-preview-image-generation')
+
+# Edit an existing image
+response = chat.ask('make this look more futuristic', with: 'current_design.png')
+
+# Access generated images from attachments
+if response.content.attachments.any?
+  generated_image = response.content.attachments.first.image
+  puts "Generated image: #{generated_image.mime_type}"
+
+  # Save the generated image
+  generated_image.save('futuristic_design.png')
+end
+
+# Continue refining in the same conversation
+response = chat.ask('add some neon lighting effects')
+refined_image = response.content.attachments.first.image
+refined_image.save('futuristic_with_neon.png')
+```
+
+For simple text-to-image generation without existing images, see the [Image Generation Guide]({% link _core_features/image-generation.md %}).
+
 ### Working with Audio
 
 Audio-capable models can transcribe speech, analyze audio content, and answer questions about what they hear. Currently, models like `gpt-4o-audio-preview` support audio input.

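The documentation examples above call `response.content.attachments.first.image` directly. A reply may carry only text, so a defensive accessor can be useful. The following is a minimal sketch, not part of this PR: `first_generated_image` and the `Fake*` structs are hypothetical stand-ins so the snippet runs offline, and the assumption is that `response.content` is either a plain `String` or an object exposing `attachments`.

```ruby
# Hypothetical stand-ins for RubyLLM's response objects (assumptions, not the real classes)
FakeImage = Struct.new(:mime_type)
FakeAttachment = Struct.new(:image)
FakeContent = Struct.new(:text, :attachments)
FakeResponse = Struct.new(:content)

# Returns the first generated image, or nil when the reply carried none.
def first_generated_image(response)
  content = response.content
  # Plain String content has no attachments at all
  return nil unless content.respond_to?(:attachments)

  # Only attachments that wrap an image are of interest
  attachment = content.attachments.find { |a| a.respond_to?(:image) }
  attachment&.image
end

text_only = FakeResponse.new('Sorry, I cannot edit that image.')
with_image = FakeResponse.new(
  FakeContent.new('Done.', [FakeAttachment.new(FakeImage.new('image/png'))])
)

puts first_generated_image(text_only).inspect
puts first_generated_image(with_image).mime_type
```

With real responses the same guard would let callers branch on `nil` instead of rescuing a `NoMethodError`.
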
diff --git a/docs/_core_features/image-generation.md b/docs/_core_features/image-generation.md
@@ -24,6 +24,8 @@ redirect_from:
 After reading this guide, you will know:
 
 *   How to generate images from text prompts.
+*   How to edit and modify existing images.
+*   How to refine images through multi-turn conversations.
 *   How to select different image generation models.
 *   How to specify image sizes (for supported models).
 *   How to access and save generated image data (URL or Base64).
@@ -98,6 +100,75 @@ end
 
 Refer to the [Working with Models Guide]({% link _advanced/models.md %}) and the [Available Models Guide]({% link _reference/available-models.md %}) to find image models.
 
+## Image Editing & Modification
+
+Beyond generating images from text prompts, you can edit existing images with capable models such as `gemini-2.0-flash-preview-image-generation`. This approach uses the chat interface rather than the `paint` method.
+
+### Basic Image Editing
+
+Use the chat interface with image generation models to edit existing images:
+
+```ruby
+# Start a chat with an image generation model
+chat = RubyLLM.chat(model: 'gemini-2.0-flash-preview-image-generation')
+
+# Edit an existing image
+response = chat.ask('put this in a ring', with: 'path/to/ruby.png')
+
+# Access the generated image from the response
+image = response.content.attachments.first.image
+
+# Check image properties
+puts "Generated image: #{image.mime_type}"
+puts "Base64 encoded: #{image.base64?}"
+puts "Base64 length: #{image.data.length} characters" if image.base64?
+
+# Save the edited image
+saved_path = image.save('ruby_with_ring.png')
+puts "Saved to: #{saved_path}"
+```
+
+### Multi-turn Image Refinement
+
+Because the chat interface keeps conversation context, you can refine a generated image over several turns:
+
+```ruby
+chat = RubyLLM.chat(model: 'gemini-2.0-flash-preview-image-generation')
+
+# First edit - add a ring to the ruby image
+chat.ask('put this in a ring', with: 'path/to/ruby.png')
+
+# Refine the result in the same conversation
+response = chat.ask('change the background to blue')
+
+# The model will modify the previously generated image
+refined_image = response.content.attachments.first.image
+refined_image.save('ruby_ring_blue_background.png')
+
+# Continue refining
+response = chat.ask('make the ring more ornate and golden')
+final_image = response.content.attachments.first.image
+final_image.save('ruby_ornate_golden_ring.png')
+```
+
+### Chat vs Paint Methods
+
+RubyLLM provides two approaches for image generation:
+
+- **`RubyLLM.paint`**: Best for simple text-to-image generation from scratch
+- **`RubyLLM.chat` with image models**: Best for image editing, refinement, and complex workflows
+
+Use the chat interface for:
+- Editing existing images
+- Multi-turn image refinement and iteration
+- Complex image generation workflows
+- When you need conversation context and memory
+
+Use the paint method for:
+- Simple text-to-image generation
+- One-off image creation
+- When you don't need conversation context
+
 ## Image Sizes
 
 Some models, like DALL-E 3, allow you to specify the desired image dimensions via the `size:` argument.
@@ -124,7 +195,7 @@ image_portrait = RubyLLM.paint(
 
 ## Working with Generated Images
 
-The `RubyLLM::Image` object provides access to the generated image data and metadata.
+The `RubyLLM::Image` object provides access to the generated image data and metadata, whether the image was created using `RubyLLM.paint` or retrieved from a chat response.
 
 ### Accessing Image Data
 
@@ -138,10 +209,15 @@ The `RubyLLM::Image` object provides access to the generated image data and meta
 The `save` method works regardless of whether the image was delivered via URL or Base64. It fetches the data if necessary and writes it to the specified file path.
 
 ```ruby
-# Generate an image
+# Generate an image using paint method
 image = RubyLLM.paint("A steampunk mechanical owl")
 
-# Save the image to a local file
+# Or get an image from a chat response
+# chat = RubyLLM.chat(model: 'gemini-2.0-flash-preview-image-generation')
+# response = chat.ask("Create a steampunk mechanical owl")
+# image = response.content.attachments.first.image
+
+# Save the image to a local file (works the same for both methods)
 begin
   saved_path = image.save("steampunk_owl.png")
   puts "Image saved to #{saved_path}"
@@ -275,6 +351,6 @@ Image generation can take several seconds (typically 5-20 seconds depending on t
 
 ## Next Steps
 
-*   [Chatting with AI Models]({% link _core_features/chat.md %}): Learn about conversational AI.
+*   [Chatting with AI Models]({% link _core_features/chat.md %}): Learn about conversational AI and using chat for advanced image workflows.
 *   [Embeddings]({% link _core_features/embeddings.md %}): Explore text vector representations.
-*   [Error Handling]({% link _advanced/error-handling.md %}): Master handling API errors.
+*   [Error Handling]({% link _advanced/error-handling.md %}): Master handling API errors.
diff --git a/lib/ruby_llm/content.rb b/lib/ruby_llm/content.rb
@@ -18,6 +18,11 @@ def add_attachment(source, filename: nil)
       self
     end
 
+    def attach(attachment)
+      @attachments << attachment
+      self
+    end
+
     def format
       if @text && @attachments.empty?
         @text

diff --git a/lib/ruby_llm/image_attachment.rb b/lib/ruby_llm/image_attachment.rb
@@ -0,0 +1,22 @@
+# frozen_string_literal: true
+
+module RubyLLM
+  # A class representing a file attachment that is an image generated by an LLM.
+  class ImageAttachment < Attachment
+    attr_reader :image, :content
+
+    def initialize(data:, mime_type:, model_id:)
+      super(nil, filename: nil)
+      @image = Image.new(data:, mime_type:, model_id:)
+      @mime_type = mime_type
+    end
+
+    def image?
+      true
+    end
+
+    def encoded
+      image.data
+    end
+  end
+end
diff --git a/lib/ruby_llm/message.rb b/lib/ruby_llm/message.rb
@@ -44,7 +44,7 @@ def tool_results
     def to_h
       {
         role: role,
-        content: content,
+        content: content.is_a?(Content) ? content.to_h : content,
         tool_calls: tool_calls,
         tool_call_id: tool_call_id,
         input_tokens: input_tokens,

diff --git a/lib/ruby_llm/providers/gemini/capabilities.rb b/lib/ruby_llm/providers/gemini/capabilities.rb
@@ -219,6 +219,9 @@ def modalities_for(model_id)
 
           modalities[:input] << 'audio' if model_id.match?(/audio/)
           modalities[:output] << 'embeddings' if model_id.match?(/embedding|gemini-embedding/)
+
+          modalities[:output] << 'image' if model_id.match?(/image-generation/)
+
           modalities[:output] = ['image'] if model_id.match?(/imagen/)
 
           modalities

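The order of the two regex checks matters: `image-generation` models gain `image` as an additional output, while `imagen` models are reset to image-only. A minimal standalone sketch of that logic follows; `output_modalities_for` is a hypothetical name, and the `['text']` starting value is an assumption, since the hunk does not show how `modalities[:output]` is initialized.

```ruby
# Hypothetical standalone version of the output-modality checks in the hunk above.
def output_modalities_for(model_id)
  output = ['text'] # assumed default; not shown in the diff

  # Chat models that can also return images keep text and add image
  output << 'image' if model_id.match?(/image-generation/)

  # Imagen models only produce images, so the list is replaced
  output = ['image'] if model_id.match?(/imagen/)

  output
end

puts output_modalities_for('gemini-2.0-flash-preview-image-generation').inspect
puts output_modalities_for('imagen-3.0-generate-002').inspect
puts output_modalities_for('gemini-1.5-pro').inspect
```

Note that `image-generation` does not match `/imagen/` (the hyphen breaks the sequence), so the two branches do not interfere for the model ID used in this PR.
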
diff --git a/lib/ruby_llm/providers/gemini/chat.rb b/lib/ruby_llm/providers/gemini/chat.rb
@@ -15,7 +15,9 @@ def render_payload(messages, tools:, temperature:, model:, stream: false, schema
           @model = model.id
           payload = {
             contents: format_messages(messages),
-            generationConfig: {}
+            generationConfig: {
+              responseModalities: capabilities.modalities_for(model.id)[:output]
+            }
           }
 
           payload[:generationConfig][:temperature] = temperature unless temperature.nil?

diff --git a/lib/ruby_llm/providers/gemini/streaming.rb b/lib/ruby_llm/providers/gemini/streaming.rb
@@ -34,7 +34,21 @@ def extract_content(data)
           return nil unless parts
 
           text_parts = parts.select { |p| p['text'] }
-          text_parts.map { |p| p['text'] }.join if text_parts.any?
+          image_parts = parts.select { |p| p['inlineData'] }
+
+          content = RubyLLM::Content.new(text_parts.map { |p| p['text'] }.join)
+
+          image_parts.each do |p|
+            content.attach(
+              ImageAttachment.new(
+                data: p['inlineData']['data'],
+                mime_type: p['inlineData']['mimeType'],
+                model_id: data['modelVersion']
+              )
+            )
+          end
+
+          content
         end
 
         def extract_input_tokens(data)

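To illustrate what `extract_content` now does with a response, here is a minimal sketch that splits a list of parts into joined text and decoded image payloads. The `text`, `inlineData`, `mimeType`, and `data` keys come from the hunk above; the sample payload is made up, and plain hashes stand in for `Content` and `ImageAttachment`.

```ruby
# Sample parts shaped like the ones the hunk reads; the payload bytes are fake.
parts = [
  { 'text' => 'Here is ' },
  { 'text' => 'your image.' },
  { 'inlineData' => { 'mimeType' => 'image/png', 'data' => ['fake-png-bytes'].pack('m0') } }
]

# Join every text part in order
text = parts.filter_map { |p| p['text'] }.join

# Collect one entry per inline image, decoding the Base64 payload
images = parts.filter_map do |p|
  inline = p['inlineData']
  next unless inline

  { mime_type: inline['mimeType'], bytes: inline['data'].unpack1('m0') }
end

puts text
puts images.length
puts images.first[:mime_type]
```

The PR keeps the Base64 string as-is inside `ImageAttachment`; decoding here is only to show that the payload round-trips.
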
diff --git a/lib/ruby_llm/stream_accumulator.rb b/lib/ruby_llm/stream_accumulator.rb
@@ -6,7 +6,7 @@ class StreamAccumulator
     attr_reader :content, :model_id, :tool_calls
 
     def initialize
-      @content = +''
+      @content = nil
       @tool_calls = {}
       @input_tokens = 0
       @output_tokens = 0
@@ -20,7 +20,7 @@ def add(chunk)
       if chunk.tool_call?
         accumulate_tool_calls chunk.tool_calls
       else
-        @content << (chunk.content || '')
+        accumulate_content(chunk.content)
       end
 
       count_tokens chunk
@@ -30,7 +30,7 @@ def add(chunk)
     def to_message(response)
       Message.new(
         role: :assistant,
-        content: content.empty? ? nil : content,
+        content: final_content,
         model_id: model_id,
         tool_calls: tool_calls_from_stream,
         input_tokens: @input_tokens.positive? ? @input_tokens : nil,
@@ -41,6 +41,50 @@ def to_message(response)
 
     private
 
+    def accumulate_content(new_content)
+      return unless new_content
+
+      if @content.nil?
+        @content = new_content.is_a?(String) ? +new_content : new_content
+      else
+        case [@content.class, new_content.class]
+        when [String, String]
+          @content << new_content
+        when [String, Content]
+          @content = Content.new(@content)
+          merge_content(new_content)
+        when [Content, String]
+          @content.instance_variable_set(:@text, (@content.text || '') + new_content)
+        when [Content, Content]
+          merge_content(new_content)
+        end
+      end
+    end
+
+    def merge_content(new_content)
+      current_text = @content.text || ''
+      new_text = new_content.text || ''
+      @content.instance_variable_set(:@text, current_text + new_text)
+
+      existing_encoded = @content.attachments.map(&:encoded)
+      new_content.attachments.each do |attachment|
+        @content.attach(attachment) unless existing_encoded.include?(attachment.encoded)
+      end
+    end
+
+    def final_content
+      case @content
+      when nil
+        nil
+      when String
+        @content.empty? ? nil : @content
+      when Content
+        @content.text.to_s.empty? && @content.attachments.empty? ? nil : @content
+      else
+        @content
+      end
+    end
+
     def tool_calls_from_stream
       tool_calls.transform_values do |tc|
         arguments = if tc.arguments.is_a?(String) && !tc.arguments.empty?

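The four-way `case` in `accumulate_content` amounts to: concatenate text, and union attachments without duplicates. A non-mutating sketch of the same idea is shown below. `SketchContent`, `normalize`, and `merge` are hypothetical names, attachments are represented by plain strings standing in for their encoded payloads, and de-duplication uses `uniq` instead of comparing `encoded` values.

```ruby
# Hypothetical stand-in for RubyLLM::Content: text plus a list of encoded attachments
SketchContent = Struct.new(:text, :attachments)

# Wrap a plain String so both sides can be merged the same way
def normalize(value)
  value.is_a?(String) ? SketchContent.new(value, []) : value
end

# Merge one streamed chunk into the accumulated value without mutating either
def merge(current, chunk)
  return current if chunk.nil?
  return chunk if current.nil?
  return current + chunk if current.is_a?(String) && chunk.is_a?(String)

  left = normalize(current)
  right = normalize(chunk)
  SketchContent.new(
    left.text.to_s + right.text.to_s,
    (left.attachments + right.attachments).uniq
  )
end

chunks = [
  'Here ',
  SketchContent.new('is ', ['img-a']),
  'it.',
  SketchContent.new(nil, ['img-a', 'img-b'])
]
result = chunks.reduce(nil) { |acc, chunk| merge(acc, chunk) }

puts result.text
puts result.attachments.inspect
```

Building a new value on each step avoids the `instance_variable_set` calls, at the cost of extra allocations per chunk.
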
diff --git a/...vcr_cassettes/image_gemini_gemini-2_0-flash-preview-image-generation_can_paint_images.yml b/...vcr_cassettes/image_gemini_gemini-2_0-flash-preview-image-generation_can_paint_images.yml
diff --git a/..._gemini_gemini-2_0-flash-preview-image-generation_can_refine_images_in_a_conversation.yml b/..._gemini_gemini-2_0-flash-preview-image-generation_can_refine_images_in_a_conversation.yml
diff --git a/.../image_streaming_functionality_handles_content_objects_in_streaming_without_typeerror.yml b/.../image_streaming_functionality_handles_content_objects_in_streaming_without_typeerror.yml