Commit fc73624

Merge branch 'main' into feat/admonition
2 parents 3ec76d8 + 78bda1a commit fc73624

12 files changed, +486 -8 lines changed

components/ArticleMetaDetailsCard/ArticleMetaDetailsCard.tsx

Lines changed: 1 addition & 1 deletion
@@ -51,7 +51,7 @@ export default function ArticleMetaDetailsCard({
 <Link
 href={author.url}
 target="_blank"
-rel="noopener noreferrer"
+rel="noopener noreferrer nofollow"
 className="!text-gray-200 transition-colors hover:text-signoz_robin-400"
 prefetch={false}
 >

components/Link.tsx

Lines changed: 1 addition & 1 deletion
@@ -48,7 +48,7 @@ const CustomLink = ({ href, ...rest }: LinkProps & AnchorHTMLAttributes<HTMLAnch
 return <a href={href} {...rest} />
 }

-return <a target="_blank" rel="noopener noreferrer" href={href} {...rest} />
+return <a target="_blank" rel="noopener noreferrer nofollow" href={href} {...rest} />
 }

 export default CustomLink

constants/docsSideNav.ts

Lines changed: 24 additions & 1 deletion
@@ -2894,13 +2894,36 @@ const docsSideNav = [
 {
 type: 'doc',
 route: '/docs/ai/signoz-mcp-server',
-label: 'MCP Server',
+label: 'SigNoz MCP Server',
 },
 {
 type: 'doc',
 route: '/docs/ai/agent-skills',
 label: 'Agent Skills',
 },
+{
+type: 'category',
+isExpanded: false,
+route: '/docs/ai/use-cases',
+label: 'MCP Use Cases',
+items: [
+{
+type: 'doc',
+route: '/docs/ai/use-cases/natural-language-log-exploration',
+label: 'Log Exploration',
+},
+{
+type: 'doc',
+route: '/docs/ai/use-cases/latency-spike-explainer',
+label: 'Latency Spike Explainer',
+},
+{
+type: 'doc',
+route: '/docs/ai/use-cases/reconstruct-bug-from-trace-id',
+label: 'Report from Trace ID',
+},
+],
+},
 ],
 },
 {
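
The new `category` entry nests three `doc` items under a parent route. A minimal sketch of that shape follows, assuming only the fields visible in this hunk. The type names and the `collectRoutes` helper are hypothetical and are not taken from `constants/docsSideNav.ts`, which may define more fields.

```typescript
// Hypothetical types inferred from the entries in this hunk.
type NavDoc = { type: 'doc'; route: string; label: string }
type NavCategory = {
  type: 'category'
  isExpanded: boolean
  route: string
  label: string
  items: NavItem[]
}
type NavItem = NavDoc | NavCategory

// Walk the tree depth-first and return every route, parents before children.
function collectRoutes(items: NavItem[]): string[] {
  const routes: string[] = []
  for (const item of items) {
    routes.push(item.route)
    if (item.type === 'category') {
      routes.push(...collectRoutes(item.items))
    }
  }
  return routes
}

// The category added by this commit, copied from the hunk.
const mcpUseCases: NavCategory = {
  type: 'category',
  isExpanded: false,
  route: '/docs/ai/use-cases',
  label: 'MCP Use Cases',
  items: [
    {
      type: 'doc',
      route: '/docs/ai/use-cases/natural-language-log-exploration',
      label: 'Log Exploration',
    },
    {
      type: 'doc',
      route: '/docs/ai/use-cases/latency-spike-explainer',
      label: 'Latency Spike Explainer',
    },
    {
      type: 'doc',
      route: '/docs/ai/use-cases/reconstruct-bug-from-trace-id',
      label: 'Report from Trace ID',
    },
  ],
}

const useCaseRoutes = collectRoutes([mcpUseCases])
```

A walker like this could be used to check that every route in the nav has a matching `.mdx` file, such as the `use-cases.mdx` page added for the category route.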

data/docs/ai/overview.mdx

Lines changed: 11 additions & 1 deletion
@@ -1,5 +1,5 @@
 ---
-date: 2026-03-11
+date: 2026-03-30
 id: overview
 title: AI Tools and Skills
 description: Integrate SigNoz with your AI coding assistants using the MCP Server and Agent Skills.
@@ -29,3 +29,13 @@ The <a href="https://github.com/SigNoz/signoz-mcp-server" target="_blank" rel="n
 - Install with a single command

 [Get started with Agent Skills →](https://signoz.io/docs/ai/agent-skills)
+
+## MCP Use Cases
+
+Once you have the MCP server connected, explore practical workflows:
+
+- Search and analyze logs by asking questions in plain English.
+- Ask "why is this slow?" and get a span breakdown with the bottleneck identified.
+- Paste a trace ID and reconstruct the full request path with root cause analysis.
+
+[Browse all MCP use cases →](https://signoz.io/docs/ai/use-cases)

data/docs/ai/signoz-mcp-server.mdx

Lines changed: 4 additions & 0 deletions
@@ -9,6 +9,10 @@ doc_type: howto

 The SigNoz MCP Server implements the <a href="https://modelcontextprotocol.io/" target="_blank" rel="noopener noreferrer nofollow">Model Context Protocol (MCP)</a> — an open standard that lets AI assistants interact with your SigNoz observability data. Query metrics, traces, logs, alerts, and dashboards through natural language.

+<KeyPointCallout title="Already configured?" defaultCollapsed={true}>
+If you've already set up the MCP server, skip ahead to the [use cases](https://signoz.io/docs/ai/use-cases) to see what you can do with it.
+</KeyPointCallout>
+
 ## Connect to SigNoz's MCP server

 <Tabs>

data/docs/ai/use-cases.mdx

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+---
+date: 2026-03-30
+id: use-cases
+title: MCP Use Cases
+description: Practical workflows for using the SigNoz MCP Server with AI assistants to debug, explore, and analyze your observability data.
+doc_type: explanation
+---
+
+Real-world workflows you can run with the [SigNoz MCP Server](https://signoz.io/docs/ai/signoz-mcp-server) and any MCP-compatible AI assistant.
+
+Each guide walks through a specific scenario - the prompt to try, what to expect, and what the MCP server does under the hood.
+
+<DocCardContainer>
+
+<DocCard
+title="Natural Language Log Exploration"
+description="Search, filter, and analyze logs by asking questions in plain English - no query syntax required."
+href="/docs/ai/use-cases/natural-language-log-exploration/"
+/>
+
+<DocCard
+title="Latency Spike Explainer"
+description="Ask 'why is this slow?' and get a full span breakdown identifying the bottleneck service."
+href="/docs/ai/use-cases/latency-spike-explainer/"
+/>
+
+<DocCard
+title="Reconstruct a Bug from a Trace ID"
+description="Paste a trace ID from a support ticket and reconstruct the full request path with root cause."
+href="/docs/ai/use-cases/reconstruct-bug-from-trace-id/"
+/>
+
+</DocCardContainer>
Lines changed: 121 additions & 0 deletions
@@ -0,0 +1,121 @@
+---
+date: 2026-03-30
+id: latency-spike-explainer
+title: Latency Spike Explainer
+description: Ask your AI assistant "why is this slow?" and get a full span breakdown identifying the bottleneck service.
+doc_type: howto
+---
+
+import GetHelp from '@/components/shared/get-help.md'
+
+PagerDuty fires. The alert reads: `checkout-service p99 latency > 2s (currently 4.7s), triggered 3 min ago`. You already know what is slow. You need to know why.
+
+You open your AI assistant, connected to SigNoz via the MCP server, and start asking.
+
+## Prerequisites
+
+- Connect your AI assistant to SigNoz using the [MCP Server guide](https://signoz.io/docs/ai/signoz-mcp-server).
+- Make sure your services are instrumented with distributed tracing. See [Instrument Your Application](https://signoz.io/docs/instrumentation/) if you haven't set this up.
+
+## Step 1: Inspect a Slow Trace
+
+```
+Show me traces from checkout-service slower than 2 seconds in the last 30 minutes. Break down the spans for the slowest one.
+```
+
+The span tree comes back:
+
+```
+POST /api/checkout (checkout-service, 4,712ms)
+|-- ValidateCart (checkout-service, 8ms)
+|-- GetCustomerProfile (customer-service, 41ms)
+|-- ProcessPayment (payment-service, 4,480ms) <-- 95% of total
+| |-- ChargeCard (stripe-gateway, 4,430ms)
+|-- SendConfirmation (notification-service, skipped, upstream failure)
+```
+
+`ProcessPayment` accounts for 95% of the request, and almost all of that (4,430ms of 4,480ms) is the `ChargeCard` call to the Stripe gateway.
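
The percentage in the span tree is plain arithmetic on span durations. A minimal sketch, using the durations from the slow trace above; the `Span` type and both helper functions are hypothetical and are not part of the MCP server or SigNoz APIs.

```typescript
// Hypothetical shape for a span; real trace payloads carry many more fields.
type Span = { name: string; service: string; durationMs: number }

// Root duration and direct children, copied from the slow trace above.
const rootDurationMs = 4712
const childSpans: Span[] = [
  { name: 'ValidateCart', service: 'checkout-service', durationMs: 8 },
  { name: 'GetCustomerProfile', service: 'customer-service', durationMs: 41 },
  { name: 'ProcessPayment', service: 'payment-service', durationMs: 4480 },
]

// Share of the root duration spent in one span, as a whole percentage.
function shareOfRoot(span: Span, rootMs: number): number {
  return Math.round((span.durationMs / rootMs) * 100)
}

// Longest span in a non-empty list (reduce throws on an empty array).
function slowestSpan(spans: Span[]): Span {
  return spans.reduce((worst, s) => (s.durationMs > worst.durationMs ? s : worst))
}

const bottleneck = slowestSpan(childSpans)
const bottleneckShare = shareOfRoot(bottleneck, rootDurationMs)
```

Here `bottleneck` is `ProcessPayment` and `bottleneckShare` is 95, matching the annotation in the span tree.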
+
+## Step 2: Is This All Requests or Just the Tail?
+
+```
+Show me the p50 and p99 latency for checkout-service /api/checkout over the last 2 hours, broken down in 5-minute intervals.
+```
+
+Both p50 and p99 were stable at ~400ms until 1:47 AM, then both jumped. p50 is at 3.8s, p99 at 4.7s. This is not tail latency. Nearly every request is affected. Something broke at 1:47 AM.
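
The tail-versus-systemic call can be written down as a small rule. The sketch below is illustrative only: the nearest-rank `percentile` helper, the `classify` function, and the 3x threshold are assumptions, and SigNoz computes its percentiles server-side in its own way.

```typescript
// Nearest-rank percentile over raw durations (one of several common definitions).
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b)
  const rank = Math.ceil((p / 100) * sorted.length)
  const value = sorted[Math.max(rank - 1, 0)]
  if (value === undefined) throw new Error('percentile of empty list')
  return value
}

type Verdict = 'healthy' | 'tail' | 'systemic'

// Hypothetical rule: a percentile is "bad" once it exceeds factor x baseline.
function classify(p50Ms: number, p99Ms: number, baselineMs: number, factor = 3): Verdict {
  const p50Bad = p50Ms > baselineMs * factor
  const p99Bad = p99Ms > baselineMs * factor
  if (p50Bad && p99Bad) return 'systemic'
  if (p99Bad) return 'tail'
  return 'healthy'
}

// Numbers from this step: ~400ms baseline, p50 at 3.8s, p99 at 4.7s.
const verdict = classify(3800, 4700, 400)
```

With the figures from this step, `verdict` is `'systemic'`. Had p50 stayed near 400ms while p99 climbed, the same rule would return `'tail'`.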
+
+## Step 3: Compare With a Healthy Trace
+
+```
+Find me a trace from checkout-service between 2 and 3 hours ago where duration was under 500ms.
+```
+
+A healthy trace from before the spike:
+
+```
+POST /api/checkout (checkout-service, 387ms)
+|-- ValidateCart (checkout-service, 6ms)
+|-- GetCustomerProfile (customer-service, 38ms)
+|-- ProcessPayment (payment-service, 291ms)
+| |-- ChargeCard (stripe-gateway, 248ms)
+|-- SendConfirmation (notification-service, 31ms)
+```
+
+Same call chain. The only difference: `ChargeCard` went from 248ms to 4,430ms. The problem is not in your code. It is downstream.
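
One way to make this comparison mechanical is to diff self time per span, meaning a span's duration minus the duration of its direct children. The sketch below assumes children run sequentially without overlap, which may not hold for real traces. The `SpanNode` type and both helpers are hypothetical.

```typescript
// Hypothetical span tree node; durations are copied from the two traces above.
type SpanNode = { name: string; durationMs: number; children?: SpanNode[] }

// Self time = own duration minus direct children (assumes no overlap).
function selfTimes(node: SpanNode, out: Map<string, number> = new Map()): Map<string, number> {
  const children = node.children ?? []
  const childTotal = children.reduce((sum, c) => sum + c.durationMs, 0)
  out.set(node.name, node.durationMs - childTotal)
  children.forEach((c) => selfTimes(c, out))
  return out
}

// Span whose self time grew the most between the two traces.
function largestRegression(before: SpanNode, after: SpanNode): { name: string; deltaMs: number } {
  const base = selfTimes(before)
  let worst = { name: after.name, deltaMs: 0 }
  selfTimes(after).forEach((ms, name) => {
    const delta = ms - (base.get(name) ?? 0)
    if (delta > worst.deltaMs) worst = { name, deltaMs: delta }
  })
  return worst
}

const healthyTrace: SpanNode = {
  name: 'POST /api/checkout',
  durationMs: 387,
  children: [
    { name: 'ValidateCart', durationMs: 6 },
    { name: 'GetCustomerProfile', durationMs: 38 },
    { name: 'ProcessPayment', durationMs: 291, children: [{ name: 'ChargeCard', durationMs: 248 }] },
    { name: 'SendConfirmation', durationMs: 31 },
  ],
}

// SendConfirmation was skipped in the slow trace, so it is omitted here.
const slowTrace: SpanNode = {
  name: 'POST /api/checkout',
  durationMs: 4712,
  children: [
    { name: 'ValidateCart', durationMs: 8 },
    { name: 'GetCustomerProfile', durationMs: 41 },
    { name: 'ProcessPayment', durationMs: 4480, children: [{ name: 'ChargeCard', durationMs: 4430 }] },
  ],
}

const regression = largestRegression(healthyTrace, slowTrace)
```

Using self time keeps `ProcessPayment` from being blamed for time spent in its child: its own self time moves only from 43ms to 50ms, while `ChargeCard` grows by 4,182ms.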
+
+## Step 4: Check the Dependency
+
+```
+Show me p99 latency for payment-service over the last 2 hours in 5-minute intervals. Also pull any error or warning logs from payment-service in the last 30 minutes.
+```
+
+Payment-service latency spiked at the exact same time. The logs show the cause:
+
+```
+01:47:12 WARN Stripe endpoint config reloaded: region changed us-east-1 -> eu-west-1
+01:47:14 WARN ChargeCard latency elevated (2,341ms), retrying
+01:47:15 ERROR ChargeCard timeout after 5000ms
+01:47:18 WARN ChargeCard latency elevated (4,102ms)
+```
+
+A config change at 1:47 AM switched the Stripe endpoint to a different region. Every charge request is now making a cross-Atlantic round trip.
+
+## Step 5: Quantify and Decide
+
+```
+Show me total request count and error rate for checkout-service over the last 2 hours in 5-minute intervals. What percentage of requests are slower than 2 seconds?
+```
+
+847 requests since the spike. 94% are over 2 seconds. Error rate is 12% (timeouts). The trend is flat, not worsening, but nearly every customer is getting a degraded experience. You revert the config change.
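
The two percentages in this step reduce to counting over request samples. A minimal sketch follows; the `RequestSample` type, the `blastRadius` helper, and the five sample rows are made up for illustration and are not real data from this incident.

```typescript
// Hypothetical per-request record.
type RequestSample = { durationMs: number; failed: boolean }

// Share of requests over the threshold and share that failed, as whole percentages.
function blastRadius(samples: RequestSample[], thresholdMs: number) {
  const total = samples.length
  const slow = samples.filter((s) => s.durationMs > thresholdMs).length
  const failed = samples.filter((s) => s.failed).length
  return {
    total,
    slowPct: total === 0 ? 0 : Math.round((slow / total) * 100),
    errorPct: total === 0 ? 0 : Math.round((failed / total) * 100),
  }
}

// Illustrative samples only.
const samples: RequestSample[] = [
  { durationMs: 350, failed: false },
  { durationMs: 2400, failed: false },
  { durationMs: 3900, failed: false },
  { durationMs: 4700, failed: false },
  { durationMs: 5000, failed: true },
]

const impact = blastRadius(samples, 2000)
```

For these five rows, `impact.slowPct` is 80 and `impact.errorPct` is 20. Applied to the figures in this step, 94% of 847 requests is roughly 796 slow requests, and a 12% error rate is roughly 102 timeouts.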
+
+## Tips for Your Own Investigations
+
+- **Check more than one percentile, not just p99.** If p50 is fine but p99 is bad, only a subset of requests is slow. If both are bad, something systemic broke.
+- **Follow the dependency chain.** If the bottleneck span is a call to another service, check that service directly. Correlate latency spikes and error logs across both.
+- **Quantify before you act.** Know the blast radius before you wake someone up or trigger a rollback.
+
+<details>
+<ToggleHeading>
+## Under the Hood
+</ToggleHeading>
+
+During this investigation, the MCP server called these tools:
+
+| Step | MCP Tool | What It Did |
+|------|----------|-------------|
+| 1 | `signoz_search_traces` | Found traces matching the duration and time range filter |
+| 1 | `signoz_get_trace_details` | Returned the full span tree for the slowest trace |
+| 2 | `signoz_aggregate_traces` | Computed p50/p99 latency in time-series buckets |
+| 3 | `signoz_search_traces` | Found a healthy baseline trace from before the spike |
+| 4 | `signoz_get_service_top_operations` | Got latency breakdown for the downstream service |
+| 4 | `signoz_search_logs` | Pulled error and warning logs from payment-service |
+| 5 | `signoz_aggregate_traces` | Computed request counts and error rates over time |
+
+</details>
+
+## Next Steps
+
+- [Natural Language Log Exploration](https://signoz.io/docs/ai/use-cases/natural-language-log-exploration) - Search and analyze logs without writing queries.
+- [Reconstruct a Bug from a Trace ID](https://signoz.io/docs/ai/use-cases/reconstruct-bug-from-trace-id) - Debug a support ticket with a trace ID.
+
+<GetHelp />
Lines changed: 155 additions & 0 deletions
@@ -0,0 +1,155 @@
+---
+date: 2026-03-30
+id: natural-language-log-exploration
+title: Natural Language Log Exploration
+description: Search, filter, and analyze logs in SigNoz by asking questions in plain English through your AI assistant.
+doc_type: howto
+---
+
+import GetHelp from '@/components/shared/get-help.md'
+
+A product manager posts in #incidents:
+
+> Multiple sellers are complaining that products they updated hours ago still show old prices in search results. The catalog page shows the right data, but search is stale.
+
+You have SigNoz collecting logs and the MCP server connected to your AI assistant. You know search is powered by an indexing pipeline, but you don't know the internals.
+
+## Prerequisites
+
+- Connect your AI assistant to SigNoz using the [MCP Server guide](https://signoz.io/docs/ai/signoz-mcp-server).
+- Make sure your services are sending logs to SigNoz. See [Send Logs to SigNoz](https://signoz.io/docs/userguide/logs) if you haven't set this up.
+
+## Step 1: Search for the Symptom
+
+```
+Show me recent error or warning logs from any service related to search indexing or index lag in the last 6 hours.
+```
+
+Results come back from `search-indexer`:
+
+```
+Found 34 logs matching across 2 services:
+
+1. 14:52:11 WARN search-indexer - "Index lag exceeds threshold: 4h12m behind head (threshold: 15m)"
+2. 14:47:03 WARN search-indexer - "Index lag exceeds threshold: 4h07m behind head (threshold: 15m)"
+3. 14:42:01 WARN search-indexer - "Index lag exceeds threshold: 4h02m behind head (threshold: 15m)"
+4. 14:22:18 WARN search-indexer - "Consumer group rebalance completed, partition assignment unchanged"
+5. 13:15:44 WARN search-indexer - "Batch processing rate: 12 events/sec (normal: ~340 events/sec)"
+...
+```
+
+The search indexer is 4+ hours behind. Processing speed has dropped from 340 events/sec to 12. That explains the stale results. But there are no errors, just slowness. Why is it crawling?
+
+## Step 2: Understand Why Throughput Dropped
+
+```
+Show me logs from search-indexer in the last 6 hours that mention "skip", "drop", "malformed", "parse", or "invalid".
+```
+
+The volume is striking:
+
+```
+Found 9,847 logs matching:
+
+1. 14:51:58 WARN "Skipping malformed event: missing required field 'sku_id' (event_source: catalog-pipeline)"
+2. 14:51:57 WARN "Skipping malformed event: field 'price' is not numeric: 'USD29.99' (event_source: catalog-pipeline)"
+3. 14:51:55 WARN "Parse retry exhausted for event, moving to dead letter queue (event_source: catalog-pipeline)"
+...
+```
+
+Nearly 10,000 malformed events in 6 hours. The indexer is spending all its time retrying bad data and dead-lettering it. Valid events are stuck behind the flood. Every bad event comes from `catalog-pipeline`.
+
+```
+How many "Skipping malformed event" warnings has search-indexer logged per hour over the last 24 hours?
+```
+
+The hourly breakdown shows a clear inflection point:
+
+```
+Malformed event warnings per hour (search-indexer):
+
+00:00 - 09:59 UTC: 0-3/hour (baseline noise)
+10:00 - 10:59: 2
+11:00 - 11:59: 1,847 <-- spike
+12:00 - 12:59: 1,923
+13:00 - 13:59: 1,812
+14:00 - 14:59: 1,690 (ongoing)
+```
+
+The malformed events started at 11:00 UTC. Something changed in `catalog-pipeline` around that time.
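
Spotting the inflection point amounts to finding the first bucket that clears the baseline by a wide margin. A minimal sketch follows, using the hourly counts from the output above. The `Bucket` type, the `firstSpike` helper, and the 10x factor are assumptions, not something the MCP server exposes.

```typescript
// Hypothetical shape for one hourly bucket.
type Bucket = { hour: string; count: number }

// Hourly counts copied from the breakdown above (10:00 UTC onward).
const buckets: Bucket[] = [
  { hour: '10:00', count: 2 },
  { hour: '11:00', count: 1847 },
  { hour: '12:00', count: 1923 },
  { hour: '13:00', count: 1812 },
  { hour: '14:00', count: 1690 },
]

// First bucket whose count exceeds factor x the noisiest baseline hour.
function firstSpike(series: Bucket[], baselineMax: number, factor: number): Bucket | undefined {
  return series.find((b) => b.count > baselineMax * factor)
}

// Baseline noise was 0-3 per hour; the 10x factor is an arbitrary choice.
const spike = firstSpike(buckets, 3, 10)
```

Here `spike` is the 11:00 bucket, which gives you the time window to ask about next.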
+
+## Step 3: Trace the Root Cause Upstream
+
+```
+Show me INFO and WARN logs from catalog-pipeline between 10:45 and 11:15 UTC today. I am looking for deployments, config changes, or schema changes.
+```
+
+The deployment logs tell the story:
+
+```
+Found 28 logs:
+
+1. 10:52:03 INFO "Deployment started: catalog-pipeline v2.14.0 -> v2.15.0 (deployer: ci-bot)"
+2. 10:52:18 INFO "Migration applied: product_event_schema_v3"
+3. 10:52:19 INFO "Event format updated: sku_id field moved from root to nested product.identifiers.sku_id"
+4. 10:52:19 INFO "Event format updated: price field changed from cents (int) to formatted string (e.g. 'USD29.99')"
+5. 10:52:31 INFO "Deployment complete: catalog-pipeline v2.15.0 healthy"
+6. 10:53:01 INFO "Backfill started: reprocessing 14,291 products with new schema"
+7. 11:01:12 INFO "Backfill complete: 14,291 events published"
+```
+
+`catalog-pipeline` v2.15.0 changed the event schema in two breaking ways: it moved `sku_id` into a nested path and changed `price` from integer cents to a formatted string. The search indexer still expects the old schema. Every event from the new version fails validation. On top of that, the backfill re-published 14,291 products in the new format, flooding the indexer with unparseable data.
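
To make the two breaking changes concrete, here is a sketch of a consumer that accepts both formats. It is an illustration under stated assumptions, not the indexer's actual code. The field paths and the `'USD29.99'` format come from the log lines above, while `NormalizedEvent`, `normalizeEvent`, and the choice to leave `currency` as `null` for old events are hypothetical.

```typescript
// Hypothetical normalized form the indexer could work with.
type NormalizedEvent = { skuId: string; priceCents: number; currency: string | null }

// Accepts the old shape (root sku_id, integer cents) and the new shape
// (product.identifiers.sku_id, price like 'USD29.99'). Returns null otherwise.
function normalizeEvent(raw: unknown): NormalizedEvent | null {
  if (typeof raw !== 'object' || raw === null) return null
  const event = raw as Record<string, any>

  const skuId = event.sku_id ?? event.product?.identifiers?.sku_id
  if (typeof skuId !== 'string') return null

  // Old schema: price is already integer cents; currency is not carried.
  if (typeof event.price === 'number' && Number.isInteger(event.price)) {
    return { skuId, priceCents: event.price, currency: null }
  }

  // New schema: three-letter code followed by a decimal amount.
  if (typeof event.price === 'string') {
    const match = /^([A-Z]{3})(\d+)\.(\d{2})$/.exec(event.price)
    if (match) {
      // Integer arithmetic avoids floating-point rounding on the amount.
      const priceCents = Number(match[2]) * 100 + Number(match[3])
      return { skuId, priceCents, currency: match[1] ?? null }
    }
  }
  return null
}

const oldStyle = normalizeEvent({ sku_id: 'A1', price: 2999 })
const newStyle = normalizeEvent({ product: { identifiers: { sku_id: 'A1' } }, price: 'USD29.99' })
```

Both calls yield `priceCents` of 2999 for the same product, which is the property a real fix would need to preserve regardless of which schema version produced the event.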
+
+## Step 4: Scope the Impact
+
+```
+How many events has search-indexer moved to the dead letter queue in the last 6 hours? What is the current index lag?
+```
+
+The damage:
+
+```
+Dead letter queue (last 6 hours):
+- Events moved to DLQ: 9,214
+- Estimated unique products affected: ~6,800
+
+Current index lag: 4h17m behind head
+Indexer throughput: 12 events/sec (normal: 340 events/sec)
+```
+
+~6,800 products have stale search data. The lag keeps growing because new events from v2.15.0 continue arriving in the broken format. The fix: either roll back `catalog-pipeline` to v2.14.0, or deploy a hotfix to `search-indexer` to handle both schema versions. The 9,214 dead-lettered events will need to be replayed after the fix.
+
+## Tips for Your Own Investigations
+
+1. **Start with what you know.** The Slack message, the error alert, the customer complaint. Search for that first.
+2. **Follow the thread.** When results mention another service, a timeout, or an error code, ask about that next.
+3. **Scope before you dig.** Once you know what is failing, check how many errors, when they started, and whether they are increasing.
+4. **Find the boundary.** Zoom into the moment errors started. The logs right before the first error often reveal the trigger.
+
+<Admonition type="tip">
+If a field like `service.name` is not available, ask the assistant to discover fields: _"What resource attributes are available for logs?"_ Field availability depends on how your services are instrumented.
+</Admonition>
+
+<details>
+<ToggleHeading>
+## Under the Hood
+</ToggleHeading>
+
+During this investigation, the MCP server called these tools:
+
+| Step | MCP Tool | What It Did |
+|------|----------|-------------|
+| 1 | `signoz_search_logs` | Searched across all services for warning/error logs matching search indexing keywords |
+| 2 | `signoz_search_logs` | Found malformed event warnings in the indexer, revealing upstream data quality issue |
+| 2 | `signoz_aggregate_logs` | Computed malformed event counts per hour to pinpoint when the problem started |
+| 3 | `signoz_search_logs` | Found deployment and schema migration logs in catalog-pipeline around the start time |
+| 4 | `signoz_aggregate_logs` | Counted dead-lettered events to measure blast radius |
+
+</details>
+
+## Next Steps
+
+- [Latency Spike Explainer](https://signoz.io/docs/ai/use-cases/latency-spike-explainer) - Ask "why is this slow?" and trace the bottleneck.
+- [Reconstruct a Bug from a Trace ID](https://signoz.io/docs/ai/use-cases/reconstruct-bug-from-trace-id) - Debug a support ticket with a trace ID.
+
+<GetHelp />
