feat(extract): skip OCR on re-extraction, add extract keybinding, persist extraction metadata #763

Merged
cpcloud merged 4 commits into main from worktree-glowing-herding-nest
Mar 14, 2026

Conversation

Owner

@cpcloud cpcloud commented Mar 13, 2026

Summary

  • Skip OCR/text extraction when a document already has extracted text from a previous run -- feed cached text directly to the LLM, preserving spatial layout data (TSV) for re-extraction
  • Add r keybinding in edit mode on document tabs to trigger extraction on the selected document
  • Improve xdg-open error messages on headless/remote servers: detect missing DISPLAY/WAYLAND_DISPLAY and surface an actionable message instead of a cryptic exit code
  • Persist extraction model name and operations JSON alongside document data for audit/inspection
  • Add visible "Model" column to the documents table showing which LLM model produced the extraction

closes #711
closes #764

Copilot AI review requested due to automatic review settings March 13, 2026 12:18

codecov bot commented Mar 13, 2026

Codecov Report

❌ Patch coverage is 66.15385% with 44 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.18%. Comparing base (6a0b1d6) to head (3ab7e71).
⚠️ Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/data/store.go | 14.28% | 8 Missing and 4 partials ⚠️ |
| internal/app/docopen.go | 77.55% | 8 Missing and 3 partials ⚠️ |
| internal/app/extraction.go | 81.25% | 6 Missing and 3 partials ⚠️ |
| internal/app/model.go | 33.33% | 8 Missing ⚠️ |
| internal/app/view.go | 0.00% | 4 Missing ⚠️ |
Additional details and impacted files
| Files with missing lines | Coverage Δ |
|---|---|
| internal/app/coldefs.go | 100.00% <ø> (ø) |
| internal/app/forms.go | 86.04% <100.00%> (ø) |
| internal/app/tables.go | 98.83% <100.00%> (+<0.01%) ⬆️ |
| internal/data/meta_generated.go | 100.00% <ø> (ø) |
| internal/data/models.go | 88.23% <ø> (ø) |
| internal/app/view.go | 86.06% <0.00%> (-0.09%) ⬇️ |
| internal/app/model.go | 62.07% <33.33%> (-0.05%) ⬇️ |
| internal/app/extraction.go | 76.69% <81.25%> (+5.97%) ⬆️ |
| internal/app/docopen.go | 44.00% <77.55%> (+13.23%) ⬆️ |
| internal/data/store.go | 73.12% <14.28%> (-0.84%) ⬇️ |

... and 3 files with indirect coverage changes

Copilot AI left a comment

Pull request overview

Improves the document extraction workflow by avoiding redundant OCR when a document already has extracted text, preserving OCR spatial (TSV) data for re-extraction, and enhancing OS “open file” error messages for headless environments.

Changes:

  • Skip OCR when ExtractedText is already present and pass through cached OCR TSV (ExtractData) to the LLM prompt.
  • Extend startExtractionOverlay to accept/preserve extractData and update all call sites accordingly.
  • Improve xdg-open / opener failure messaging (headless display detection) and add focused tests.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file
| File | Description |
|---|---|
| internal/app/model.go | Passes ExtractData into extraction overlay and skips OCR early when ExtractedText already exists. |
| internal/app/extraction.go | Adds extractData parameter; skips OCR when cached text exists; preserves TSV data in LLM sources. |
| internal/app/extraction_test.go | Adds coverage for “skip OCR with existing text” and TSV preservation cases. |
| internal/app/docopen.go | Improves actionable error messages for missing opener and headless display failures; adds hasDisplay(). |
| internal/app/docopen_test.go | Adds tests for new opener error wrapping behavior. |
Comments suppressed due to low confidence (1)

internal/app/extraction.go:297

  • For cached text on PDFs, the initial TextSource is always labeled as pdftotext with a “digital text” description. If the prior run was OCR-based (scanned PDF), extractData (TSV) will likely be non-empty and the tool/description should reflect OCR (e.g. tesseract) so the LLM prompt/UI aren’t misleading. Consider branching on len(extractData) > 0 here to pick the correct tool/desc (and possibly step detail).
		case mime == extract.MIMEApplicationPDF:
			tool = "pdftotext"
			desc = "Digital text extracted directly from the PDF."
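
The branching suggested in this comment could look roughly like the sketch below. It is not the code in the PR: `textSource` and `cachedPDFSource` are hypothetical names, and the OCR description string is an assumption made for illustration.

```go
package main

import "fmt"

// textSource is a hypothetical stand-in for the TextSource type
// referenced in the review comment.
type textSource struct {
	Tool string
	Desc string
}

// cachedPDFSource labels cached PDF text. If TSV layout data from a
// previous run is present, the text most likely came from OCR, so the
// label says so instead of claiming pdftotext produced it.
func cachedPDFSource(extractData []byte) textSource {
	if len(extractData) > 0 {
		return textSource{
			Tool: "tesseract",
			Desc: "Text recognized by OCR in a previous run.", // assumed wording
		}
	}
	return textSource{
		Tool: "pdftotext",
		Desc: "Digital text extracted directly from the PDF.",
	}
}

func main() {
	fmt.Println(cachedPDFSource(nil).Tool)
	fmt.Println(cachedPDFSource([]byte("level\tpage_num\ttext")).Tool)
}
```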

When re-extracting a document that already has text from a previous run,
skip the OCR/text extraction steps and feed the existing text directly to
the LLM for structured data extraction. This avoids redundant OCR work
and makes re-extraction fast when only the LLM pass is needed.

- startExtractionOverlay skips OCR when extractedText is non-empty
- Images with cached OCR text now show the text step (normally hidden)
- Previous ExtractData (TSV) is preserved in sources for spatial layout
- afterDocumentSave also skips OCR in its early bailout check

closes #711

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
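
The skip decision described in this commit message can be sketched as a small pure function. This is only an illustration under assumptions: `extractionPlan` and `planExtraction` are hypothetical names, and the real `startExtractionOverlay` works with overlay state rather than a plain struct.

```go
package main

import "fmt"

// extractionPlan is a hypothetical summary of what a (re-)extraction
// would do; it is not a type from the PR.
type extractionPlan struct {
	RunOCR    bool   // true when text/OCR extraction still has to run
	Text      string // cached text handed straight to the LLM
	LayoutTSV []byte // cached TSV layout data carried into the sources
}

// planExtraction skips OCR when cached text is non-empty and carries the
// previous TSV data forward so spatial layout is not lost.
func planExtraction(extractedText string, extractData []byte) extractionPlan {
	if extractedText == "" {
		return extractionPlan{RunOCR: true}
	}
	return extractionPlan{Text: extractedText, LayoutTSV: extractData}
}

func main() {
	p := planExtraction("Invoice 42", []byte("level\tpage_num\ttext"))
	fmt.Printf("run OCR: %v, layout bytes: %d\n", p.RunOCR, len(p.LayoutTSV))
}
```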
Copilot AI review requested due to automatic review settings March 13, 2026 12:50
@cpcloud cpcloud force-pushed the worktree-glowing-herding-nest branch from 52b7569 to 1e52267 Compare March 13, 2026 12:50
@cpcloud cpcloud force-pushed the worktree-glowing-herding-nest branch from 1e52267 to 3943739 Compare March 13, 2026 12:54

Copilot AI left a comment

Pull request overview

This PR improves the document extraction workflow by reusing previously extracted text to avoid redundant OCR work, adds an “extract” keybinding on document tabs, and makes OS “open document” failures more actionable on headless systems.

Changes:

  • Skip OCR when a document already has extracted text, and pass along existing TSV/layout data into the LLM prompt.
  • Add r keybinding (and status hint) in edit mode on document tabs to trigger extraction for the selected document.
  • Persist and display extraction metadata (extraction model + serialized extraction operations), and improve xdg-open/opener error messages for headless environments.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

Show a summary per file
| File | Description |
|---|---|
| internal/data/store.go | Extends document listing columns and updates extraction persistence to include model/ops. |
| internal/data/models.go | Adds ExtractionModel and ExtractionOps fields to Document. |
| internal/data/meta_generated.go | Adds generated column constants for extraction_model and extraction_ops. |
| internal/data/fts_test.go | Updates test callsite for new UpdateDocumentExtraction signature. |
| internal/app/view.go | Adds edit-mode status hint for r extract on document tabs. |
| internal/app/tables.go | Adds “Model” column value rendering for document tables. |
| internal/app/model.go | Wires r in edit mode to trigger extraction; passes ExtractData into overlay. |
| internal/app/extraction_test.go | Adds tests for “skip OCR when existing text” and model-used behavior. |
| internal/app/extraction.go | Implements OCR skip when cached text exists; persists extraction model/ops. |
| internal/app/docopen_test.go | Adds cross-platform ExitError helper and new opener error tests. |
| internal/app/docopen.go | Adds extractSelectedDocument and improves opener error messaging/headless detection. |
| internal/app/columns_generated.go | Adds generated document column index for “Model”. |
| internal/app/coldefs.go | Adds “Model” column definition for document tables. |
Comments suppressed due to low confidence (1)

internal/data/store.go:1234

  • UpdateDocumentExtraction unconditionally updates ocr_data/extraction_model/extraction_ops even when callers pass zero values (e.g. data=nil). In the new “skip OCR on re-extraction” flow, this will overwrite previously stored TSV layout data with NULL when OCR is skipped, which contradicts the goal of preserving spatial data. Build the updates map conditionally (only include ocr_data when a new value is present / explicitly clearing), and similarly avoid overwriting extraction_model/extraction_ops unless a successful LLM run produced new values.
	updates := map[string]any{
		ColExtractData:     data,
		ColExtractionModel: model,
		ColExtractionOps:   ops,
	}
	if text != "" {
		updates[ColExtractedText] = text
	}
	return s.db.Model(&Document{}).Where(ColID+" = ?", id).Updates(updates).Error
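
A conditional version of the updates map, as this comment proposes, might be built along the following lines. This is a sketch, not the merged fix: `buildExtractionUpdates` is a hypothetical helper, the column names `ocr_data`, `extraction_model`, and `extraction_ops` come from the comment above, and `extracted_text` is an assumed value for `ColExtractedText`.

```go
package main

import "fmt"

// Column names. The first is an assumption; the other three are the
// names mentioned in the review comment.
const (
	colExtractedText   = "extracted_text"
	colExtractData     = "ocr_data"
	colExtractionModel = "extraction_model"
	colExtractionOps   = "extraction_ops"
)

// buildExtractionUpdates includes a column only when the caller supplied
// a non-zero value, so an OCR-skipping re-extraction cannot overwrite
// stored TSV data or metadata with NULL.
func buildExtractionUpdates(text string, data []byte, model string, ops []byte) map[string]any {
	updates := map[string]any{}
	if text != "" {
		updates[colExtractedText] = text
	}
	if len(data) > 0 {
		updates[colExtractData] = data
	}
	if model != "" {
		updates[colExtractionModel] = model
	}
	if len(ops) > 0 {
		updates[colExtractionOps] = ops
	}
	return updates
}

func main() {
	// OCR was skipped, so only the LLM metadata is new.
	u := buildExtractionUpdates("", nil, "some-model", []byte(`[]`))
	fmt.Println(len(u), "columns to update")
}
```

A caller would presumably skip the database write entirely when the returned map is empty.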

cpcloud and others added 2 commits March 13, 2026 09:00
…ss server

Instead of showing a cryptic "open: exit status 3", detect when
xdg-open fails because no display server is available and show a
message explaining the likely cause and environment.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
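
A minimal sketch of the headless detection this commit describes follows. The names `hasDisplay` and `wrapOpenerError` appear in the review summaries, but the signatures and the message wording here are assumptions; the environment lookup is injected so the logic can be tested without touching the real environment.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// hasDisplay reports whether an X11 or Wayland display appears to be
// available, judged only by the usual environment variables.
func hasDisplay(getenv func(string) string) bool {
	return getenv("DISPLAY") != "" || getenv("WAYLAND_DISPLAY") != ""
}

// wrapOpenerError turns a bare opener failure into an actionable message
// when no display server is detected. The wording is illustrative.
func wrapOpenerError(err error, getenv func(string) string) error {
	if err == nil {
		return nil
	}
	if !hasDisplay(getenv) {
		return fmt.Errorf("open: no display server detected "+
			"(DISPLAY and WAYLAND_DISPLAY are unset); "+
			"this looks like a headless or remote session: %w", err)
	}
	return fmt.Errorf("open: %w", err)
}

func main() {
	err := wrapOpenerError(errors.New("exit status 3"), os.Getenv)
	fmt.Println(err)
}
```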
Add an explicit `r` keybinding in edit mode on document tabs to trigger
extraction on the selected document. This makes the OCR-skip feature
from the previous commit discoverable and testable -- previously
extraction only ran as a side effect of saving a document form.

The status bar shows `r extract` on document tabs. If no extraction
tools or LLM are configured, a status error is shown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cpcloud cpcloud force-pushed the worktree-glowing-herding-nest branch from 3943739 to e725cb2 Compare March 13, 2026 13:00
@cpcloud cpcloud changed the title from "feat(extract): skip OCR on re-extraction, add extract keybinding, improve headless errors" to "feat(extract): skip OCR on re-extraction, add extract keybinding, persist extraction metadata" Mar 13, 2026
Copilot AI review requested due to automatic review settings March 13, 2026 13:08
@cpcloud cpcloud force-pushed the worktree-glowing-herding-nest branch from e725cb2 to 0c2209f Compare March 13, 2026 13:08

Copilot AI left a comment

Pull request overview

This PR improves the document extraction workflow by enabling faster re-extraction using cached text/OCR artifacts, adds a dedicated extraction keybinding in the TUI, and persists extraction provenance (model + operations) on documents for inspection/audit.

Changes:

  • Skip OCR when a document already has extracted text, while preserving any previously captured TSV/layout data for LLM re-extraction.
  • Add r keybinding (Edit mode, document tabs) to open the extraction overlay for the selected document, plus footer hint updates.
  • Persist extraction_model and extraction_ops on documents, and display the model in the Documents table.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

Show a summary per file
| File | Description |
|---|---|
| internal/data/store.go | Includes extraction_model in document list queries; expands UpdateDocumentExtraction to persist model/ops with sparse updates. |
| internal/data/models.go | Adds ExtractionModel and ExtractionOps fields to the Document model. |
| internal/data/meta_generated.go | Adds generated column constants for extraction_model and extraction_ops. |
| internal/data/fts_test.go | Updates FTS test call sites for the new UpdateDocumentExtraction signature. |
| internal/app/view.go | Adds an “extract” help hint on document tabs in Edit mode. |
| internal/app/tables.go | Adds a visible “Model” column value in document row rendering. |
| internal/app/model.go | Adds r handling in edit-mode key dispatch; passes ExtractData into extraction overlay; skips OCR when cached text exists. |
| internal/app/forms.go | Treats the new “Model” column like other read-only columns for inline edit routing. |
| internal/app/extraction.go | Implements cached-text OCR skipping, carries forward TSV data, and persists extraction model + operations metadata. |
| internal/app/extraction_test.go | Adds tests for r keybinding, OCR skip behavior, TSV preservation, and model-used logic. |
| internal/app/docopen.go | Adds extractSelectedDocument; improves opener error messages for headless environments and adds display detection. |
| internal/app/docopen_test.go | Adds cross-platform ExitError helper and tests for improved opener error messaging paths. |
| internal/app/coldefs.go | Adds “Model” to document column definitions. |
| internal/app/columns_generated.go | Updates generated document column enum to include documentColModel. |

@cpcloud cpcloud force-pushed the worktree-glowing-herding-nest branch from 0c2209f to 024ed7c Compare March 13, 2026 13:29
Copilot AI review requested due to automatic review settings March 13, 2026 13:32
@cpcloud cpcloud force-pushed the worktree-glowing-herding-nest branch from 024ed7c to 3f51ecb Compare March 13, 2026 13:32

Copilot AI left a comment

Pull request overview

This PR improves the document extraction workflow by enabling faster re-extraction (reusing previously extracted text/TSV), adding a dedicated extract keybinding in document edit mode, and persisting/displaying extraction provenance (model + ops) for auditability.

Changes:

  • Skip OCR when a document already has extracted text; preserve prior TSV (layout) data for LLM-only re-extraction.
  • Add r keybinding (+ status hint) on document tabs to open the extraction overlay for the selected document.
  • Persist extraction_model and extraction_ops on documents and show the model in the Documents table; improve xdg-open headless error messaging.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

Show a summary per file
| File | Description |
|---|---|
| internal/data/store.go | Select extraction_model when listing docs; extend UpdateDocumentExtraction to persist model/ops and avoid zero-value overwrites. |
| internal/data/models.go | Add ExtractionModel and ExtractionOps fields to Document. |
| internal/data/meta_generated.go | Add column constants for extraction_model / extraction_ops. |
| internal/data/fts_test.go | Update call site for new UpdateDocumentExtraction signature. |
| internal/app/view.go | Add edit-mode footer hint for the new extract action on document tabs. |
| internal/app/tables.go | Add “Model” cell to document rows. |
| internal/app/model.go | Bind r in edit mode; pass ExtractData into extraction overlay; skip OCR when cached text exists. |
| internal/app/forms.go | Route clicks on the new Model column to the document edit form (read-only column behavior). |
| internal/app/extraction_test.go | Add tests for r keybinding, OCR-skip behavior, TSV preservation, and model-label selection. |
| internal/app/extraction.go | Implement OCR skip when cached text exists; preserve TSV in sources; persist extraction model + ops; add marshalOps + extractionModelUsed. |
| internal/app/docopen_test.go | Add cross-platform helper-process ExitError generation and new wrapOpenerError test cases. |
| internal/app/docopen.go | Add extractSelectedDocument; improve opener error messages for headless/remote environments; add hasDisplay. |
| internal/app/columns_generated.go | Add documentColModel constant. |
| internal/app/coldefs.go | Add “Model” column definition to document table column specs. |

Store which LLM model produced the extraction and the raw operations
JSON alongside the document. The model column is visible in the
documents table; the ops blob is stored for future inspection (see #766).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
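
Serializing the operations for storage could be as simple as the sketch below. The review summaries mention a `marshalOps` helper, but its real signature is not shown in this conversation; the `operation` struct and its fields here are hypothetical and stand in for whatever the LLM extraction actually returns.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// operation is a hypothetical shape for one extraction operation.
type operation struct {
	Action string         `json:"action"`
	Table  string         `json:"table"`
	Data   map[string]any `json:"data"`
}

// marshalOps serializes operations to JSON for storage. It returns nil
// for an empty slice so callers can leave the stored column untouched.
func marshalOps(ops []operation) ([]byte, error) {
	if len(ops) == 0 {
		return nil, nil
	}
	return json.Marshal(ops)
}

func main() {
	b, err := marshalOps([]operation{
		{Action: "create", Table: "vendors", Data: map[string]any{"name": "Acme"}},
	})
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(b))
}
```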
@cpcloud cpcloud force-pushed the worktree-glowing-herding-nest branch from 3f51ecb to 3ab7e71 Compare March 13, 2026 18:38
@cpcloud cpcloud merged commit 959b363 into main Mar 14, 2026
25 checks passed
@cpcloud cpcloud deleted the worktree-glowing-herding-nest branch March 14, 2026 12:48
Development

Successfully merging this pull request may close these issues.

feat(extract): store extraction model name on documents
feat(extract): skip text/OCR when document already has extracted text

2 participants