Add Content Moderation Feature #383
Open

iraszl wants to merge 3 commits into crmne:main from iraszl:moderate

@@ -0,0 +1,363 @@

---
layout: default
title: Moderation
nav_order: 6
description: Identify potentially harmful content in text using AI moderation models before sending to LLMs
redirect_from:
  - /guides/moderation
---

# {{ page.title }}
{: .no_toc }

{{ page.description }}
{: .fs-6 .fw-300 }

## Table of contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

After reading this guide, you will know:

* How to moderate text content for harmful material.
* How to interpret moderation results and category scores.
* How to use moderation as a safety layer before LLM requests.
* How to configure moderation models and providers.
* How to integrate moderation into your application workflows.
* Best practices for content safety and user experience.

## Why Use Moderation

Content moderation serves as a crucial safety layer in applications that handle user-generated content. Here's why you should implement moderation before sending content to LLM providers:

**Enforce Terms of Service**: Automatically screen user submissions against harmful or offensive content categories, ensuring your application maintains community standards and complies with your terms of service without manual review of every message.

**Protect Provider Relationships**: Maintain good standing with LLM providers by pre-screening content before API calls. Submitting policy-violating content can result in API key suspension or account termination, disrupting your entire application.

**Enable Proactive Monitoring**: Log and track potentially problematic user activity for review. This creates an audit trail for both automatic filtering and manual moderation workflows, helping you identify patterns and improve your content policies.

**Reduce Unnecessary Costs**: Save money by avoiding LLM API calls that would be rejected anyway. Since moderation requests are typically free or very low cost compared to chat completions, screening content first prevents expensive calls for content that won't generate useful responses.

By implementing moderation, you build a more robust, cost-effective, and compliant application that protects both your users and your business relationships.

## Basic Content Moderation

The simplest way to moderate content is using the global `RubyLLM.moderate` method:

```ruby
# Moderate a text input
result = RubyLLM.moderate("This is a safe message about Ruby programming")

# Check if content was flagged
puts result.flagged? # => false

# Access the full results
puts result.results
# => [{"flagged" => false, "categories" => {...}, "category_scores" => {...}}]

# Get basic information
puts "Moderation ID: #{result.id}" # => "modr-ABC123..."
puts "Model used: #{result.model}" # => "omni-moderation-latest"
```

The `moderate` method returns a `RubyLLM::Moderate` object containing the moderation results from the provider.

## Understanding Moderation Results

Moderation results include categories and confidence scores for different types of potentially harmful content:

```ruby
result = RubyLLM.moderate("Some user input text")

# Check overall flagging status
if result.flagged?
  puts "Content was flagged for: #{result.flagged_categories.join(', ')}"
else
  puts "Content appears safe"
end

# Examine category scores (0.0 to 1.0, higher = more likely)
scores = result.category_scores
puts "Sexual content score: #{scores['sexual']}"
puts "Harassment score: #{scores['harassment']}"
puts "Violence score: #{scores['violence']}"

# Get boolean flags for each category
categories = result.categories
puts "Contains hate speech: #{categories['hate']}"
puts "Contains self-harm content: #{categories['self-harm']}"
```

### Moderation Categories

Current moderation models typically check for these categories:

- **Sexual**: Sexually explicit or suggestive content
- **Hate**: Content that promotes hate based on identity
- **Harassment**: Content intended to harass, threaten, or bully
- **Self-harm**: Content promoting self-harm or suicide
- **Sexual/minors**: Sexual content involving minors
- **Hate/threatening**: Hateful content that includes threats
- **Violence**: Content promoting or glorifying violence
- **Violence/graphic**: Graphic violent content
- **Self-harm/intent**: Content expressing intent to self-harm
- **Self-harm/instructions**: Instructions for self-harm
- **Harassment/threatening**: Harassing content that includes threats

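The keys in `categories` (boolean flags) and `category_scores` (confidence values) line up, so you can walk them together. A minimal sketch for inspecting every category of a single input (the exact key set depends on the moderation model in use):

```ruby
result = RubyLLM.moderate("Some user input")

# Print each category with its score and whether it crossed the provider's threshold.
result.category_scores.each do |category, score|
  flagged = result.categories[category]
  puts format("%-25s score=%.4f flagged=%s", category, score, flagged)
end
```
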
## Alternative Calling Methods

You can also use the class method directly:

```ruby
# Direct class method
result = RubyLLM::Moderate.ask("Your content here")

# With explicit model specification
result = RubyLLM.moderate(
  "User message",
  model: "text-moderation-007",
  provider: "openai"
)

# Using assume_model_exists for custom models
result = RubyLLM.moderate(
  "Content to check",
  provider: "openai",
  assume_model_exists: true
)
```

## Choosing Models

By default, RubyLLM uses OpenAI's latest moderation model (`omni-moderation-latest`), but you can specify different models:

```ruby
# Use a specific OpenAI moderation model
result = RubyLLM.moderate(
  "Content to moderate",
  model: "text-moderation-007"
)

# Configure the default moderation model globally
RubyLLM.configure do |config|
  config.default_moderation_model = "text-moderation-007"
end
```

Refer to the [Available Models Reference]({% link _reference/available-models.md %}) for details on moderation models and their capabilities.

## Integration Patterns

### Pre-Chat Moderation

Use moderation as a safety layer before sending user input to LLMs:

```ruby
def safe_chat_response(user_input)
  # Check content safety first
  moderation = RubyLLM.moderate(user_input)

  if moderation.flagged?
    flagged_categories = moderation.flagged_categories.join(', ')
    return {
      error: "Content flagged for: #{flagged_categories}",
      safe: false
    }
  end

  # Content is safe, proceed with chat
  response = RubyLLM.chat.ask(user_input)
  {
    content: response.content,
    safe: true
  }
end
```

### Batch Moderation

For efficiency, you can moderate multiple messages:

```ruby
messages = [
  "Hello, how are you?",
  "Tell me about Ruby programming",
  "What's the weather like?"
]

results = messages.map { |msg| RubyLLM.moderate(msg) }
safe_messages = messages.zip(results)
                        .select { |msg, result| !result.flagged? }
                        .map(&:first)

puts "#{safe_messages.length} out of #{messages.length} messages are safe"
```

Comment on lines +182 to +199: that's not really batch is it? let's remove this section.

### Custom Threshold Handling

You might want to implement custom logic based on category scores:

```ruby
def assess_content_risk(text)
  result = RubyLLM.moderate(text)
  scores = result.category_scores

  # Custom thresholds for different risk levels
  high_risk = scores.any? { |_, score| score > 0.8 }
  medium_risk = scores.any? { |_, score| score > 0.5 }

  case
  when high_risk
    { risk: :high, action: :block, message: "Content blocked" }
  when medium_risk
    { risk: :medium, action: :review, message: "Content flagged for review" }
  else
    { risk: :low, action: :allow, message: "Content approved" }
  end
end

# Usage
assessment = assess_content_risk("Some user input")
puts "Risk level: #{assessment[:risk]}"
puts "Action: #{assessment[:action]}"
```

## Error Handling

Handle moderation errors gracefully:

```ruby
begin
  content = "User content"
  result = RubyLLM.moderate(content)

  if result.flagged?
    handle_unsafe_content(result)
  else
    process_safe_content(content)
  end
rescue RubyLLM::ConfigurationError => e
  # Handle missing API key or configuration
  logger.error "Moderation not configured: #{e.message}"
  # Fallback: proceed with caution or block all content
rescue RubyLLM::RateLimitError => e
  # Handle rate limits
  logger.warn "Moderation rate limited: #{e.message}"
  # Fallback: temporary approval or queue for later
rescue RubyLLM::Error => e
  # Handle other API errors
  logger.error "Moderation failed: #{e.message}"
  # Fallback: proceed with caution
end
```

## Configuration Requirements

Content moderation currently requires an OpenAI API key:

```ruby
RubyLLM.configure do |config|
  config.openai_api_key = ENV['OPENAI_API_KEY']

  # Optional: set default moderation model
  config.default_moderation_model = "omni-moderation-latest"
end
```

For more details about OpenAI's moderation capabilities and policies, see the [OpenAI Moderation Guide](https://platform.openai.com/docs/guides/moderation).

> Moderation API calls are typically less expensive than chat completions and have generous rate limits, making them suitable for screening all user inputs.
{: .note }

## Best Practices

### Content Safety Strategy

- **Always moderate user-generated content** before sending to LLMs
- **Handle false positives gracefully** with human review processes
- **Log moderation decisions** for auditing and improvement (see the sketch below)
- **Provide clear feedback** to users about content policies

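A minimal sketch of an audit log for moderation decisions. `ModerationDecision` is a hypothetical ActiveRecord model with `flagged`, `categories`, and `scores` columns; adapt the persistence to your own schema:

```ruby
require "digest"

def moderate_and_log(user, content)
  result = RubyLLM.moderate(content)

  # Store the decision alongside a digest of the content for later auditing.
  ModerationDecision.create!(
    user: user,
    content_digest: Digest::SHA256.hexdigest(content),
    flagged: result.flagged?,
    categories: result.flagged_categories,
    scores: result.category_scores
  )

  result
end
```

Storing a digest rather than the raw text keeps the audit trail useful without duplicating potentially harmful content in your database.
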
### Performance Considerations

- **Batch moderate multiple inputs** when possible for efficiency
- **Cache moderation results** for repeated content, with an appropriate TTL (see the sketch below)
- **Use background jobs** for non-blocking moderation of large volumes
- **Implement fallbacks** for when moderation services are unavailable

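A minimal caching sketch, assuming a Rails app where `Rails.cache` and ActiveSupport durations are available. It caches a plain hash rather than the result object so the value serializes cleanly in any cache store:

```ruby
require "digest"

def cached_moderation(content, ttl: 1.hour)
  # Key on a digest of the content so repeated inputs skip the API call.
  key = "moderation/#{Digest::SHA256.hexdigest(content)}"

  Rails.cache.fetch(key, expires_in: ttl) do
    result = RubyLLM.moderate(content)
    { flagged: result.flagged?, categories: result.flagged_categories }
  end
end
```
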
### User Experience

```ruby
def user_friendly_moderation(content)
  result = RubyLLM.moderate(content)

  return { approved: true } unless result.flagged?

  # Provide specific, actionable feedback
  categories = result.flagged_categories
  message = case
            when categories.include?('harassment')
              "Please keep interactions respectful and constructive."
            when categories.include?('sexual')
              "This content appears inappropriate for our platform."
            when categories.include?('violence')
              "Please avoid content that promotes violence or harm."
            else
              "This content doesn't meet our community guidelines."
            end

  {
    approved: false,
    message: message,
    categories: categories
  }
end
```

## Rails Integration

When using moderation in Rails applications:

```ruby
# In a controller or service
class MessagesController < ApplicationController
  def create
    content = params[:message]

    moderation_result = RubyLLM.moderate(content)

    if moderation_result.flagged?
      render json: {
        error: "Message not allowed",
        categories: moderation_result.flagged_categories
      }, status: :unprocessable_entity
    else
      # Process the safe message
      message = Message.create!(content: content, user: current_user)
      render json: message, status: :created
    end
  end
end

# Background job for batch moderation
class ModerationJob < ApplicationJob
  def perform(message_ids)
    messages = Message.where(id: message_ids)

    messages.each do |message|
      result = RubyLLM.moderate(message.content)
      message.update!(
        moderation_flagged: result.flagged?,
        moderation_categories: result.flagged_categories,
        moderation_scores: result.category_scores
      )
    end
  end
end
```

This allows you to build robust content safety systems that protect both your application and your users while maintaining a good user experience.

Review comment: I don't think we need to convince people why to use moderation. let's cut this part