This folder contains all the files needed to run the Loki Error Analyzer independently.
- run_analyzer.sh - Main script to run the analyzer
- loki_error_analyzer.py - Python script that performs the analysis
- llm_error_enhancer.py - AI-powered error analysis enhancer
- config.yaml - Configuration file for the analyzer
- requirements.txt - Python dependencies
- LLM_SETUP.md - LLM enhancer setup guide
- README.md - This documentation file
Before running the analyzer, ensure you have the following installed:
- Python 3 - Required for running the analyzer script
- kubectl - Required for connecting to Kubernetes clusters
- logcli - Grafana Loki command-line interface
  - Install from: https://grafana.com/docs/loki/latest/clients/logcli/
  - Configuration guide: Loki LogCLI Configuration
- PyYAML - Python YAML library
  - Install with: pip3 install -r requirements.txt
- Ollama (Optional) - For AI-powered analysis enhancement
  - Install with: brew install ollama
  - See LLM_SETUP.md for detailed setup
1. Install Python dependencies:

   pip3 install -r requirements.txt

2. Ensure kubectl is configured with access to the target Kubernetes cluster

3. Install logcli following the official documentation

4. (Optional) Install Ollama for AI-powered analysis:

   brew install ollama
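Before the first run, it can help to confirm the required tools are actually on PATH. The sketch below uses only standard shell; it reports `kubectl` and `logcli` as missing until they are installed:

```shell
# Report which required tools are on PATH; sets status=1 if any are missing
status=0
for tool in python3 kubectl logcli; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "OK: $tool"
  else
    echo "MISSING: $tool"
    status=1
  fi
done
echo "prerequisite check finished (status=$status)"
```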
# Run analysis for development environment (default)
./run_analyzer.sh
# Run analysis for production environment
./run_analyzer.sh -e prod
# Run with debug mode enabled
./run_analyzer.sh -e prod -d
# Run without automatic cleanup
./run_analyzer.sh -e prod -c

Options:
- -e, --env ENV - Environment to analyze (dev, prod) [default: dev]
- -d, --debug - Enable debug mode
- -c, --no-cleanup - Disable automatic cleanup
- -t, --timeout SEC - Query timeout validation (for safety warnings) [default: 600]
- --log-level LEVEL - Log level filter (error, warn, info, debug, all) [default: error]
- --loki-query QUERY - Custom Loki query (e.g., 'orgId=loki-tutti-prod')
- --loki-query-params - JSON parameters for custom query
- -h, --help - Show help message
# Development environment analysis
./run_analyzer.sh
# Production environment analysis
./run_analyzer.sh -e prod
# Production analysis with debug mode
./run_analyzer.sh -e prod -d
# Production analysis without cleanup
./run_analyzer.sh -e prod -c
# Production analysis with custom timeout
./run_analyzer.sh -e prod -t 300
# Production analysis with all log levels (not just errors)
./run_analyzer.sh -e prod --log-level all
# Production analysis with warnings and errors only
./run_analyzer.sh -e prod --log-level warn
# Production analysis with custom Loki query
./run_analyzer.sh -e prod --loki-query 'orgId=loki-tutti-prod' --loki-query-params '{"namespace":"live-tutti-services","detected_level":"info"}'

Enhance your error analysis with AI insights:
# Basic AI enhancement (Ollama starts/stops automatically)
python3 llm_error_enhancer.py prod_log.json
# With custom output file
python3 llm_error_enhancer.py prod_log.json --output enhanced_report.md
# With different AI model
python3 llm_error_enhancer.py prod_log.json --model mistral:7b
# With large model and extended timeout (for 70B models)
python3 llm_error_enhancer.py prod_log.json --model llama3.1:70b-instruct-q4_K_M --timeout 600
# Complete workflow: Loki analysis + AI enhancement
python3 loki_error_analyzer.py --env prod
python3 llm_error_enhancer.py prod_log.json

AI Features:
- Root Cause Analysis: Identifies likely causes of errors
- Detailed End-User Impact Analysis: Comprehensive business impact assessment for top 3 error services
- Financial Impact Assessment: Revenue loss, cost implications, and business metrics
- Severity Classification: Critical, High, Medium, Low impact levels
- Immediate Actions: Emergency fixes and urgent remediation steps
- Long-term Recommendations: Strategic improvements and process enhancements
- Communication Strategies: User notification and stakeholder communication plans
- Service-Specific Intelligence: Pre-built impact templates for key services
- Configurable Timeouts: Support for large models (70B) with extended processing time
The analyzer generates the following files:
- log.json - Raw error logs from Loki
- LOKI_ERROR_ANALYSIS_REPORT_DEV.md - Analysis report for dev environment
- LOKI_ERROR_ANALYSIS_REPORT_PROD.md - Analysis report for prod environment
- enhanced_analysis_YYYYMMDD_HHMMSS.md - AI-enhanced analysis with detailed end-user impact analysis
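For ad-hoc inspection of the raw output, a small Python helper can be handy. This is a hedged sketch: the real schema of log.json is whatever loki_error_analyzer.py writes, so the "service" field name and the array layout assumed below are illustrative, not the actual format.

```python
import json
from collections import Counter

# Hypothetical helper for inspecting raw log.json output. The real schema is
# defined by loki_error_analyzer.py -- the "service" field and the top-level
# JSON array assumed here are illustrative assumptions, not the actual format.
def summarize(path="log.json", top=3):
    with open(path) as f:
        entries = json.load(f)  # assumes a JSON array of log entries
    by_service = Counter(e.get("service", "unknown") for e in entries)
    for service, count in by_service.most_common(top):
        print(f"{service}: {count} errors")
```

Adjust the field names to match the actual entries in your log.json before relying on the counts.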
- AI-Powered Analysis: LLM-generated insights and recommendations
- Top 3 Services Impact Analysis: Detailed business impact for highest error services
- Financial Impact Assessment: Revenue loss, cost implications, business metrics
- Severity Classification: Critical, High, Medium, Low with business justification
- Actionable Recommendations: Immediate fixes and long-term strategic improvements
- Communication Strategies: User notification and stakeholder communication plans
- Executive Summary: CTO/executive-ready insights and decision support
The analyzer includes intelligent safety features to prevent timeouts:
- Default Timeout: 600 seconds (10 minutes) for safety validation
- Safety Warnings: Automatic warnings for short timeouts
- Interactive Confirmation: Prompts for potentially problematic configurations
| Use Case | Timeout | Command |
|---|---|---|
| Quick Analysis | 300s | ./run_analyzer.sh -e prod -t 300 |
| Standard Analysis | 600s | ./run_analyzer.sh -e prod |
| Deep Analysis | 900s | ./run_analyzer.sh -e prod -t 900 |
| Emergency Analysis | 180s | ./run_analyzer.sh -e prod -t 180 |
- Monitor Resources: Large queries can consume significant memory
- Use Time Windows: Consider shorter time ranges for large datasets
- Check kubectl: Ensure port-forward is stable before large queries
The config.yaml file contains all configuration options:
- Loki Connection Settings: Kubernetes context, namespace, service details
- Query Parameters: Time range, log level filters, output limits
- Error Categories: Customizable error classification patterns
- Report Settings: Organization details, report customization options
- Error Filtering Thresholds: Minimum occurrence counts for error inclusion
- Grafana Integration: Clickable query URLs for root cause investigation
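As a rough orientation, a config.yaml covering those option groups might look like the sketch below. Every key name here is an illustrative assumption; the authoritative schema is whatever loki_error_analyzer.py actually reads:

```yaml
# Illustrative sketch only -- key names are assumptions, not the real schema.
loki:
  kube_context: my-cluster      # Kubernetes context to use (assumption)
  namespace: loki               # Namespace running Loki (assumption)
  service: loki-gateway         # Service to port-forward to (assumption)
query:
  time_range_hours: 24          # How far back to query
  log_level: error              # error, warn, info, debug, all
  limit: 5000                   # Max log lines returned
report:
  organization: dev-ricardo     # Org ID used in queries
```

Check the shipped config.yaml for the real keys before editing.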
- Python 3 not found: Ensure Python 3 is installed and in PATH
- kubectl not found: Install kubectl and configure it for your cluster
- logcli not found: Install logcli from the official documentation
- Permission denied: Ensure the script has execute permissions: chmod +x run_analyzer.sh
- Query timeout: Increase timeout with the -t option
- Memory issues: Consider using shorter time ranges for large datasets
- Port-forward issues: Restart kubectl port-forward if connections are unstable
Enable debug mode to get more detailed output:
./run_analyzer.sh -e prod -d

If the script doesn't clean up properly, manually stop kubectl port-forward processes:

pkill -f "kubectl port-forward"

Development environment (dev):
- Uses the dev-ricardo organization ID
- Analyzes the last 24 hours of logs
- Generates LOKI_ERROR_ANALYSIS_REPORT_DEV.md

Production environment (prod):
- Uses the prod-ricardo organization ID
- Analyzes yesterday's 19:00-22:00 time window
- Generates LOKI_ERROR_ANALYSIS_REPORT_PROD.md
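If cleanup misbehaves, the manual pkill step above can be paired with a verification check. A minimal sketch, assuming the analyzer is the only thing running kubectl port-forwards on this machine:

```shell
# Stop any leftover port-forward processes, then verify none remain
pkill -f "kubectl port-forward" 2>/dev/null || true
if pgrep -f "kubectl port-forward" >/dev/null 2>&1; then
  echo "WARNING: port-forward processes still running"
else
  echo "clean: no kubectl port-forward processes found"
fi
```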
# 1. Run Loki error analysis
python3 loki_error_analyzer.py --env prod
# 2. View the report
open prod_LOKI_ERROR_ANALYSIS_REPORT.md

# 1. Run Loki error analysis
python3 loki_error_analyzer.py --env prod
# 2. Enhance with AI insights (Ollama auto-starts/stops)
python3 llm_error_enhancer.py prod_log.json
# 3. View the enhanced report
open enhanced_analysis_*.md

The enhanced LLM analyzer automatically generates comprehensive business impact analysis for the top 3 error services in every run. This provides executive-ready insights that go beyond technical error counts.
- Scale of Impact: Error counts, rates, affected pods, percentage of system errors
- Root Cause Analysis: Technical root cause and error distribution patterns
- Business Impact Assessment: Direct and indirect user impact analysis
- Severity Classification: Critical, High, Medium, Low with business justification
- Immediate Actions: Emergency fixes, data investigation, financial reconciliation
- Long-term Recommendations: Strategic improvements and process enhancements
- Communication Strategy: User notification timelines and stakeholder communication
- boost-fee-worker: Boost fee refund processing failures and financial impact
- frontend-mobile-api-v2: Mobile app functionality disruptions and user engagement
- imaginary-wrapper: Image processing failures and listing quality impact
- Default templates: For any other services with generic impact analysis
## End User Impact Analysis: boost-fee-worker

### **Scale of Impact**
- **Total Errors:** 7,317 (12.8% of all system errors)
- **Critical Errors:** 0 (0.0% of service errors)
- **Error Rate:** ~2.0 errors per hour
- **Affected Pods:** 4 pods

### **Root Cause Analysis**
**Primary Error:** Error handler threw an exception
**Error Distribution:** NullPointerException in boost fee refund processing

### **Business Impact Assessment**
#### **Direct User Impact:**
1. **Boost Fee Refunds Not Processed**
   - Users who paid for listing boosts may not receive refunds
   - Affects seller experience and platform trust
   - **Financial impact**: Direct revenue loss from unprocessed refunds

### **Severity Classification**
**CRITICAL** - Business Critical - Immediate action required

### **Immediate Actions Required**
1. **Emergency Fix**
   - Add null checks for getConsentTime() in ListingServiceAdapter
   - Implement fallback logic for missing consent data
   - Deploy hotfix immediately

- Executive Ready: Provides CTO/executive-level insights immediately
- Actionable: Specific steps for immediate and long-term remediation
- Financial Focus: Quantifies business impact and revenue implications
- User-Centric: Focuses on actual end-user experience and impact
- Communication Ready: Includes stakeholder communication strategies
# Basic analysis
./run_analyzer.sh -e prod
# Then enhance with AI
python3 llm_error_enhancer.py prod_log.json

For technical questions or issues, contact the DevOps team.
- LLM Setup Guide: See LLM_SETUP.md for detailed AI enhancement setup
- LogCLI Configuration: Loki LogCLI Configuration
- Automatic Top 3 Analysis: Detailed end-user impact for highest error services
- Business Impact Focus: Financial, user experience, and operational impact assessment
- Executive-Ready Reports: CTO/executive-level insights and decision support
- Service-Specific Intelligence: Pre-built templates for key services
- Communication Strategies: User notification and stakeholder communication plans
# Complete analysis with AI enhancement (using shell script)
./run_analyzer.sh -e prod && python3 llm_error_enhancer.py prod_log.json
# Quick analysis with custom timeout
./run_analyzer.sh -e prod -t 300 && python3 llm_error_enhancer.py prod_log.json
# Deep analysis with large model
./run_analyzer.sh -e prod -t 900 && python3 llm_error_enhancer.py prod_log.json --model llama3.1:70b-instruct-q4_K_M --timeout 600

- Actionable: Specific immediate and long-term recommendations
- Business-Focused: Financial impact and revenue implications
- Data-Driven: Uses actual error metrics for severity classification
- Automated: No manual intervention required
- Communication-Ready: Includes stakeholder communication strategies
For different LLM models, use appropriate timeout values:
| Model | Recommended Timeout | Command Example |
|---|---|---|
| llama3.1:8b | 120-300s | --timeout 300 |
| llama3.1:70b-q4_K_M | 600-900s | --timeout 600 |
| mistral:7b | 300-600s | --timeout 300 |
| qwen2.5:7b | 300-600s | --timeout 300 |
# Fast model (8B)
python3 llm_error_enhancer.py prod_log.json --model llama3.1:8b --timeout 300
# Large model (70B) - needs more time
python3 llm_error_enhancer.py prod_log.json --model llama3.1:70b-instruct-q4_K_M --timeout 600
# If 70B still times out, try longer timeout
python3 llm_error_enhancer.py prod_log.json --model llama3.1:70b-instruct-q4_K_M --timeout 900