generated from dynatrace-oss/template-project
-
Notifications
You must be signed in to change notification settings - Fork 40
Add comprehensive observability analysis rules and DQL patterns #74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
christian-kreuzberger-dtx
merged 10 commits into
dynatrace-oss:main
from
dt-benedict:feature/dql-rules-improvements
Jul 29, 2025
Merged
Changes from 2 commits
Commits
Show all changes
10 commits
Select commit
Hold shift + click to select a range
30ea6c4
Updated rules files with observabilty use cases: problem flow and inv…
cf79899
Apply prettier formatting to rule files
9dd05c1
Address PR review feedback
6b5148a
Apply prettier formatting
ab055b5
Update .gitignore
ebd21a3
Added disclaimer to rules-readme
a0bb615
Run prettier on rules-readme
6fddd68
Remove .DS_Store file from dynatrace-agent-rules folder
5dc54a6
Merge branch 'main' into feature/dql-rules-improvements
dt-benedict 7af9b73
Merge branch 'main' into feature/dql-rules-improvements
dt-benedict File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
Binary file not shown.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,384 @@ | ||
# Dynatrace Log Analysis with DQL | ||
|
||
## Overview | ||
|
||
This guide covers comprehensive log analysis using DQL (Dynatrace Query Language) for troubleshooting, debugging, and root cause analysis. Logs are crucial for understanding application behavior, especially during incidents and deployments. | ||
|
||
## Core DQL Pattern for Logs | ||
|
||
### Basic Log Query Structure | ||
|
||
```dql | ||
fetch logs, from:now() - 2h | ||
| filter loglevel == "ERROR" or loglevel == "WARN" | ||
| fields timestamp, content, loglevel, k8s.pod.name, k8s.namespace.name | ||
| sort timestamp desc | ||
| limit 20 | ||
``` | ||
|
||
## Available Data Points | ||
|
||
### Primary Fields | ||
|
||
- **content** - The actual log message content | ||
- **loglevel** - Log level (ERROR, WARN, INFO, DEBUG, TRACE) | ||
- **timestamp** - When the log entry was created | ||
- **message** - Structured message field (alternative to content) | ||
- **log.source** - Source of the log (e.g., "Container Output") | ||
|
||
### Kubernetes Context Fields | ||
|
||
- **k8s.pod.name** - Pod name generating the log | ||
- **k8s.namespace.name** - Kubernetes namespace | ||
- **k8s.container.name** - Container name within the pod | ||
- **k8s.cluster.name** - Kubernetes cluster name | ||
- **k8s.node.name** - Node where pod is running | ||
- **k8s.workload.name** - Workload (deployment/statefulset) name | ||
|
||
### Dynatrace Entity Fields | ||
|
||
- **dt.entity.service** - Service entity ID | ||
- **dt.entity.process_group_instance** - Process group instance entity ID | ||
- **dt.entity.host** - Host entity ID | ||
- **dt.entity.kubernetes_cluster** - Kubernetes cluster entity ID | ||
|
||
### Trace Correlation Fields | ||
|
||
- **trace_id** - Distributed trace ID | ||
- **span_id** - Span ID within the trace | ||
- **dt.trace_id** - Dynatrace trace ID | ||
- **dt.span_id** - Dynatrace span ID | ||
|
||
### Error Context Fields | ||
|
||
- **exception.message** - Exception message text | ||
- **exception.type** - Exception type/class | ||
- **exception.stack_trace** - Full stack trace | ||
- **status** - Log status (often mirrors loglevel) | ||
|
||
## Common Query Patterns | ||
|
||
### 1. Error Logs from Specific Service | ||
|
||
```dql | ||
fetch logs, from:now() - 4h | ||
| filter dt.entity.service == "SERVICE-YOUR-ID" | ||
| filter loglevel == "ERROR" | ||
| fields timestamp, content, exception.message, trace_id | ||
| sort timestamp desc | ||
| limit 15 | ||
``` | ||
|
||
### 2. Application Errors by Pod | ||
|
||
```dql | ||
fetch logs, from:now() - 2h | ||
| filter matchesPhrase(k8s.pod.name, "payment") | ||
| filter loglevel == "ERROR" or matchesPhrase(content, "error") or matchesPhrase(content, "exception") | ||
| fields timestamp, content, k8s.pod.name, k8s.container.name | ||
| sort timestamp desc | ||
| limit 20 | ||
``` | ||
|
||
### 3. Logs with Stack Traces | ||
|
||
```dql | ||
fetch logs, from:now() - 6h | ||
| filter exception.stack_trace != "" | ||
| fields timestamp, exception.message, exception.type, exception.stack_trace, k8s.pod.name | ||
| sort timestamp desc | ||
| limit 10 | ||
``` | ||
|
||
### 4. Deployment-Related Logs | ||
|
||
```dql | ||
fetch logs, from:now() - 1h | ||
| filter matchesPhrase(content, "deployment") or matchesPhrase(content, "restart") or matchesPhrase(content, "startup") | ||
| fields timestamp, content, k8s.pod.name, k8s.namespace.name | ||
| sort timestamp desc | ||
| limit 25 | ||
``` | ||
|
||
### 5. High-Frequency Error Analysis | ||
|
||
```dql | ||
fetch logs, from:now() - 2h | ||
| filter loglevel == "ERROR" | ||
| fields timestamp, exception.message, k8s.pod.name | ||
| summarize error_count = count(), by: {exception.message, k8s.pod.name} | ||
| sort error_count desc | ||
| limit 15 | ||
``` | ||
|
||
### 6. Trace-Correlated Logs | ||
|
||
```dql | ||
fetch logs, from:now() - 1h | ||
| filter trace_id == "your-trace-id" | ||
| fields timestamp, content, span_id, k8s.pod.name | ||
| sort timestamp asc | ||
``` | ||
|
||
## Business Logic Error Detection | ||
|
||
### Credit Card Processing Errors (Real Example) | ||
|
||
```dql | ||
fetch logs, from:now() - 4h | ||
| filter matchesPhrase(content, "credit card") or matchesPhrase(content, "payment") | ||
| filter loglevel == "WARN" or loglevel == "ERROR" | ||
| fields timestamp, content, exception.message, k8s.pod.name | ||
| sort timestamp desc | ||
| limit 20 | ||
``` | ||
|
||
### Payment Service Specific Analysis | ||
|
||
```dql | ||
fetch logs, from:now() - 4h | ||
| filter matchesPhrase(k8s.pod.name, "payment") and matchesPhrase(content, "American Express") | ||
| fields timestamp, content, exception.message, trace_id | ||
| sort timestamp desc | ||
| limit 10 | ||
``` | ||
|
||
## Integration with Problem Analysis | ||
|
||
### Correlating Logs with Problems | ||
|
||
```dql | ||
fetch logs, from:now() - 2h | ||
| filter loglevel == "ERROR" or loglevel == "WARN" | ||
| filter matchesPhrase(k8s.namespace.name, "astroshop") | ||
| fields timestamp, content, k8s.pod.name, dt.entity.service | ||
| sort timestamp desc | ||
| limit 30 | ||
``` | ||
|
||
### Problem Timeline Correlation | ||
|
||
When analyzing a specific problem timeframe (e.g., 11:54 AM - 12:29 PM): | ||
MrManny marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
```dql | ||
fetch logs, from:now() - 4h | ||
| filter matchesPhrase(k8s.pod.name, "payment-6977fffc7-2r2hb") | ||
| filter loglevel == "WARN" or loglevel == "ERROR" | ||
| fields timestamp, content, exception.message | ||
| sort timestamp desc | ||
| limit 20 | ||
``` | ||
|
||
## Advanced Analysis Patterns | ||
|
||
### Log Rate Analysis | ||
|
||
```dql | ||
fetch logs, from:now() - 2h | ||
| filter loglevel == "ERROR" | ||
| fieldsAdd time_bucket = bin(timestamp, 5m) | ||
| summarize log_count = count(), by: {time_bucket, k8s.pod.name} | ||
| sort time_bucket desc | ||
``` | ||
|
||
### Multi-Service Error Correlation | ||
|
||
```dql | ||
fetch logs, from:now() - 2h | ||
| filter loglevel == "ERROR" | ||
| filter matchesPhrase(k8s.namespace.name, "astroshop") | ||
| summarize error_count = count(), latest_error = max(timestamp), by: {k8s.pod.name, exception.message} | ||
| sort error_count desc | ||
``` | ||
|
||
### Performance Issue Detection | ||
|
||
```dql | ||
fetch logs, from:now() - 1h | ||
| filter matchesPhrase(content, "timeout") or matchesPhrase(content, "slow") or matchesPhrase(content, "performance") | ||
| fields timestamp, content, k8s.pod.name, k8s.container.name | ||
| sort timestamp desc | ||
| limit 15 | ||
``` | ||
|
||
## String Matching Best Practices | ||
|
||
### ✅ Correct String Operations for Logs | ||
|
||
```dql | ||
| filter matchesPhrase(content, "payment") // Text search in content | ||
| filter loglevel == "ERROR" // Exact level match | ||
| filter startsWith(k8s.pod.name, "payment-") // Pod prefix match | ||
| filter endsWith(exception.type, "Exception") // Exception type suffix | ||
``` | ||
|
||
### ❌ Unsupported Operations | ||
|
||
```dql | ||
| filter contains(content, "error") // NOT supported | ||
| filter content like "%payment%" // NOT supported | ||
``` | ||
|
||
### Content Search Techniques | ||
|
||
```dql | ||
// Multiple keyword search | ||
| filter matchesPhrase(content, "error") or matchesPhrase(content, "exception") or matchesPhrase(content, "failed") | ||
|
||
// Case variations | ||
| filter matchesPhrase(content, "Error") or matchesPhrase(content, "ERROR") or matchesPhrase(content, "error") | ||
|
||
// Specific error patterns | ||
| filter matchesPhrase(content, "cannot process") or matchesPhrase(content, "validation failed") | ||
``` | ||
|
||
## Real-World Investigation Examples | ||
|
||
### Payment Service American Express Bug Investigation | ||
|
||
**Context**: 57.95% error rate during deployment | ||
**Timeline**: 11:54 AM - 12:29 PM | ||
|
||
```dql | ||
fetch logs, from:now() - 4h | ||
| filter matchesPhrase(k8s.pod.name, "payment") | ||
| filter loglevel == "WARN" and matchesPhrase(content, "American Express") | ||
| fields timestamp, content, exception.message, trace_id | ||
| sort timestamp desc | ||
| limit 10 | ||
``` | ||
|
||
**Key Findings from Real Data**: | ||
|
||
- **Error Location**: `/usr/src/app/charge.js:73:11` | ||
- **Business Logic Bug**: "Sorry, we cannot process American Express credit cards. Only Visa or Mastercard or American Express are accepted." | ||
- **Trace Correlation**: Each error has associated trace_id for transaction tracking | ||
- **Pod Consistency**: All errors from same pod `payment-6977fffc7-2r2hb` | ||
|
||
### Deployment Impact Analysis | ||
|
||
```dql | ||
fetch logs, from:now() - 4h | ||
| filter matchesPhrase(k8s.namespace.name, "astroshop") | ||
| filter matchesPhrase(content, "ArgoCD") or matchesPhrase(content, "deployment") or matchesPhrase(content, "git#") | ||
| fields timestamp, content, k8s.pod.name | ||
| sort timestamp desc | ||
| limit 15 | ||
``` | ||
|
||
## Structured Log Analysis | ||
|
||
### JSON Log Parsing | ||
|
||
For structured JSON logs, the content field contains JSON that can be analyzed: | ||
|
||
```dql | ||
fetch logs, from:now() - 2h | ||
| filter matchesPhrase(content, "level") | ||
| fields timestamp, content, k8s.pod.name | ||
| sort timestamp desc | ||
| limit 10 | ||
``` | ||
|
||
### Extracting Values from JSON Logs | ||
|
||
```dql | ||
fetch logs, from:now() - 1h | ||
| filter matchesPhrase(content, "\"level\":\"warn\"") | ||
| fields timestamp, content, exception.message, k8s.pod.name | ||
| sort timestamp desc | ||
``` | ||
|
||
## Performance Considerations | ||
|
||
### Optimizing Log Queries | ||
|
||
1. **Always include timeframe**: `from:now() - 2h` (avoid overly broad searches) | ||
2. **Filter early**: Apply restrictive filters first | ||
3. **Use entity filters**: Filter by specific pods/services when possible | ||
4. **Limit results**: Always include reasonable limits | ||
5. **Sort efficiently**: Sort by timestamp desc for recent logs | ||
|
||
### Query Timeframe Recommendations | ||
|
||
- **Real-time debugging**: 15-30 minutes | ||
- **Incident investigation**: 2-4 hours | ||
- **Deployment analysis**: 1-2 hours around deployment time | ||
- **Pattern analysis**: 24 hours maximum | ||
- **Historical research**: Use specific time windows, not broad ranges | ||
MrManny marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
## Troubleshooting Log Queries | ||
|
||
### Common Issues | ||
|
||
1. **Empty Results**: Check timeframe, pod names, and filter criteria | ||
2. **Performance Issues**: Reduce timeframe or add more specific filters | ||
3. **Missing Logs**: Verify log ingestion and entity IDs | ||
4. **Field Access**: Use `| limit 1` to explore available fields | ||
|
||
### Field Discovery Technique | ||
|
||
```dql | ||
fetch logs, from:now() - 1h | ||
| filter matchesPhrase(k8s.pod.name, "your-pod-name") | ||
| limit 1 | ||
``` | ||
|
||
### Debugging Query Performance | ||
|
||
```dql | ||
fetch logs, from:now() - 30m // Shorter timeframe | ||
| filter k8s.pod.name == "specific-pod-name" // Exact match | ||
| filter loglevel == "ERROR" // Specific level | ||
| limit 10 // Small result set | ||
``` | ||
|
||
## Integration with MCP Tools | ||
|
||
### Using Logs After Entity Discovery | ||
|
||
After finding entities with MCP tools, use their names in log queries: | ||
|
||
```dql | ||
fetch logs, from:now() - 2h | ||
| filter dt.entity.service == "SERVICE-BECA49FB15C72B6A" // From MCP entity lookup | ||
| filter loglevel == "ERROR" | ||
| fields timestamp, content, exception.message | ||
| sort timestamp desc | ||
| limit 15 | ||
``` | ||
|
||
### Cross-Referencing with Problems | ||
|
||
1. **Find problems** with `fetch events | filter event.kind == "DAVIS_PROBLEM"` | ||
2. **Extract affected entities** from problem results | ||
3. **Query logs** for those specific entities during problem timeframe | ||
4. **Correlate trace IDs** between problems and logs | ||
|
||
## Follow-up Analysis Workflows | ||
|
||
### After Finding Error Logs | ||
|
||
1. **Extract stack traces** for deeper code analysis | ||
2. **Find related spans** using trace IDs | ||
3. **Check entity health** for affected services | ||
4. **Analyze deployment correlation** with timestamps | ||
5. **Search for similar errors** across other pods/services | ||
|
||
### Log-Driven Root Cause Analysis Process | ||
|
||
1. **Start with problem timeframe** from Davis AI analysis | ||
2. **Filter logs to error level** during that period | ||
3. **Identify error patterns** and affected components | ||
4. **Trace correlation** using trace/span IDs | ||
5. **Business logic validation** through error message analysis | ||
6. **Infrastructure correlation** with deployment events | ||
|
||
## Integration Notes | ||
|
||
- **Always verify DQL syntax** with `verify_dql` before execution | ||
- **Use MCP entity tools** to get precise entity IDs for log filtering | ||
- **Combine with spans** for complete transaction analysis | ||
- **Correlate with metrics** for performance context | ||
- **Reference problem events** for incident context | ||
- **Consider log sampling** for high-volume environments |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.