feat(extract): skip OCR on re-extraction, add extract keybinding, persist extraction metadata by cpcloud · Pull Request #763 · cpcloud/micasa

cpcloud · 2026-03-13T12:18:12Z

Summary

- Skip OCR/text extraction when a document already has extracted text from a previous run -- feed cached text directly to the LLM, preserving spatial layout data (TSV) for re-extraction
- Add `r` keybinding in edit mode on document tabs to trigger extraction on the selected document
- Improve xdg-open error messages on headless/remote servers: detect missing DISPLAY/WAYLAND_DISPLAY and surface an actionable message instead of a cryptic exit code
- Persist extraction model name and operations JSON alongside document data for audit/inspection
- Add visible "Model" column to the documents table showing which LLM model produced the extraction

closes #711
closes #764
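
The OCR-skip behavior in the first summary item can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `Document` struct is a trimmed-down stand-in, and `extractionPlan` and `planExtraction` are hypothetical names introduced only for this example.

```go
package main

import "fmt"

// Document is a hypothetical, trimmed-down stand-in for the PR's model.
type Document struct {
	ExtractedText string // text cached from a previous OCR/pdftotext run
	ExtractData   []byte // cached OCR TSV (spatial layout); may be empty
}

// extractionPlan (hypothetical) describes what a (re-)extraction run would do.
type extractionPlan struct {
	RunOCR bool   // true when text must be extracted from scratch
	Text   string // cached text handed straight to the LLM
	TSV    []byte // cached layout data carried along unchanged
}

// planExtraction skips OCR whenever cached text exists and passes the
// cached TSV through so the LLM still sees spatial layout information.
func planExtraction(doc Document) extractionPlan {
	if doc.ExtractedText != "" {
		return extractionPlan{Text: doc.ExtractedText, TSV: doc.ExtractData}
	}
	return extractionPlan{RunOCR: true}
}

func main() {
	cached := Document{ExtractedText: "Invoice 42", ExtractData: []byte("level\tpage_num")}
	fmt.Println("cached doc runs OCR:", planExtraction(cached).RunOCR)
	fmt.Println("fresh doc runs OCR:", planExtraction(Document{}).RunOCR)
}
```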

codecov · 2026-03-13T12:21:52Z

Codecov Report

❌ Patch coverage is 66.15385% with 44 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.18%. Comparing base (6a0b1d6) to head (3ab7e71).
⚠️ Report is 4 commits behind head on main.

Files with missing lines	Patch %	Lines
internal/data/store.go	14.28%	8 Missing and 4 partials ⚠️
internal/app/docopen.go	77.55%	8 Missing and 3 partials ⚠️
internal/app/extraction.go	81.25%	6 Missing and 3 partials ⚠️
internal/app/model.go	33.33%	8 Missing ⚠️
internal/app/view.go	0.00%	4 Missing ⚠️

Additional details and impacted files

Files with missing lines	Coverage Δ
internal/app/coldefs.go	`100.00% <ø> (ø)`
internal/app/forms.go	`86.04% <100.00%> (ø)`
internal/app/tables.go	`98.83% <100.00%> (+<0.01%)`	⬆️
internal/data/meta_generated.go	`100.00% <ø> (ø)`
internal/data/models.go	`88.23% <ø> (ø)`
internal/app/view.go	`86.06% <0.00%> (-0.09%)`	⬇️
internal/app/model.go	`62.07% <33.33%> (-0.05%)`	⬇️
internal/app/extraction.go	`76.69% <81.25%> (+5.97%)`	⬆️
internal/app/docopen.go	`44.00% <77.55%> (+13.23%)`	⬆️
internal/data/store.go	`73.12% <14.28%> (-0.84%)`	⬇️

... and 3 files with indirect coverage changes


Copilot

Pull request overview

Improves the document extraction workflow by avoiding redundant OCR when a document already has extracted text, preserving OCR spatial (TSV) data for re-extraction, and enhancing OS “open file” error messages for headless environments.

Changes:

- Skip OCR when `ExtractedText` is already present and pass through cached OCR TSV (`ExtractData`) to the LLM prompt.
- Extend `startExtractionOverlay` to accept/preserve `extractData` and update all call sites accordingly.
- Improve xdg-open / opener failure messaging (headless display detection) and add focused tests.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

File	Description
internal/app/model.go	Passes `ExtractData` into extraction overlay and skips OCR early when `ExtractedText` already exists.
internal/app/extraction.go	Adds `extractData` parameter; skips OCR when cached text exists; preserves TSV data in LLM sources.
internal/app/extraction_test.go	Adds coverage for “skip OCR with existing text” and TSV preservation cases.
internal/app/docopen.go	Improves actionable error messages for missing opener and headless display failures; adds `hasDisplay()`.
internal/app/docopen_test.go	Adds tests for new opener error wrapping behavior.

Comments suppressed due to low confidence (1)

internal/app/extraction.go:297

For cached text on PDFs, the initial TextSource is always labeled as pdftotext with a “digital text” description. If the prior run was OCR-based (scanned PDF), extractData (TSV) will likely be non-empty and the tool/description should reflect OCR (e.g. tesseract) so the LLM prompt/UI aren’t misleading. Consider branching on len(extractData) > 0 here to pick the correct tool/desc (and possibly step detail).

		case mime == extract.MIMEApplicationPDF:
			tool = "pdftotext"
			desc = "Digital text extracted directly from the PDF."
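
One way to act on this suggestion is to check for cached TSV before falling back to the PDF label. The sketch below is an assumption-laden illustration: `cachedTextSource` and `mimePDF` are hypothetical names, the OCR description string and the default branch are invented for the example, and only the `pdftotext` label and description come from the snippet above.

```go
package main

import "fmt"

// mimePDF is a stand-in for extract.MIMEApplicationPDF from the snippet above.
const mimePDF = "application/pdf"

// cachedTextSource (hypothetical) picks the tool label and description for
// cached text. Non-empty TSV implies the earlier run was OCR-based, so the
// label reflects OCR even when the document is a PDF.
func cachedTextSource(mime string, extractData []byte) (tool, desc string) {
	switch {
	case len(extractData) > 0:
		// Assumed wording; the PR's actual strings are not shown in the review.
		return "tesseract", "Text recognized by OCR in a previous run."
	case mime == mimePDF:
		return "pdftotext", "Digital text extracted directly from the PDF."
	default:
		// Assumed fallback for other cached text.
		return "cached", "Text cached from a previous extraction."
	}
}

func main() {
	tool, desc := cachedTextSource(mimePDF, []byte("level\tpage_num"))
	fmt.Println(tool, "-", desc)
	tool, desc = cachedTextSource(mimePDF, nil)
	fmt.Println(tool, "-", desc)
}
```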

internal/app/docopen_test.go

When re-extracting a document that already has text from a previous run, skip the OCR/text extraction steps and feed the existing text directly to the LLM for structured data extraction. This avoids redundant OCR work and makes re-extraction fast when only the LLM pass is needed.

- startExtractionOverlay skips OCR when extractedText is non-empty
- Images with cached OCR text now show the text step (normally hidden)
- Previous ExtractData (TSV) is preserved in sources for spatial layout
- afterDocumentSave also skips OCR in its early bailout check

closes #711

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

This PR improves the document extraction workflow by reusing previously extracted text to avoid redundant OCR work, adds an “extract” keybinding on document tabs, and makes OS “open document” failures more actionable on headless systems.

Changes:

- Skip OCR when a document already has extracted text, and pass along existing TSV/layout data into the LLM prompt.
- Add `r` keybinding (and status hint) in edit mode on document tabs to trigger extraction for the selected document.
- Persist and display extraction metadata (extraction model + serialized extraction operations), and improve xdg-open/opener error messages for headless environments.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

File	Description
internal/data/store.go	Extends document listing columns and updates extraction persistence to include model/ops.
internal/data/models.go	Adds `ExtractionModel` and `ExtractionOps` fields to `Document`.
internal/data/meta_generated.go	Adds generated column constants for `extraction_model` and `extraction_ops`.
internal/data/fts_test.go	Updates test callsite for new `UpdateDocumentExtraction` signature.
internal/app/view.go	Adds edit-mode status hint for `r extract` on document tabs.
internal/app/tables.go	Adds “Model” column value rendering for document tables.
internal/app/model.go	Wires `r` in edit mode to trigger extraction; passes ExtractData into overlay.
internal/app/extraction_test.go	Adds tests for “skip OCR when existing text” and model-used behavior.
internal/app/extraction.go	Implements OCR skip when cached text exists; persists extraction model/ops.
internal/app/docopen_test.go	Adds cross-platform ExitError helper and new opener error tests.
internal/app/docopen.go	Adds `extractSelectedDocument` and improves opener error messaging/headless detection.
internal/app/columns_generated.go	Adds generated document column index for “Model”.
internal/app/coldefs.go	Adds “Model” column definition for document tables.

Comments suppressed due to low confidence (1)

internal/data/store.go:1234

UpdateDocumentExtraction unconditionally updates ocr_data/extraction_model/extraction_ops even when callers pass zero values (e.g. data=nil). In the new “skip OCR on re-extraction” flow, this will overwrite previously stored TSV layout data with NULL when OCR is skipped, which contradicts the goal of preserving spatial data. Build the updates map conditionally (only include ocr_data when a new value is present / explicitly clearing), and similarly avoid overwriting extraction_model/extraction_ops unless a successful LLM run produced new values.

	updates := map[string]any{
		ColExtractData:     data,
		ColExtractionModel: model,
		ColExtractionOps:   ops,
	}
	if text != "" {
		updates[ColExtractedText] = text
	}
	return s.db.Model(&Document{}).Where(ColID+" = ?", id).Updates(updates).Error
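
The conditional map this comment asks for could look like the sketch below. It only builds the map and leaves the database call out. `buildExtractionUpdates` and the lowercase constants are hypothetical names; `ocr_data`, `extraction_model`, and `extraction_ops` are the column names mentioned in the comment, while `extracted_text` is an assumed value for `ColExtractedText`.

```go
package main

import "fmt"

// Column names: three are taken from the review comment; extracted_text is
// an assumption about what ColExtractedText holds.
const (
	colExtractedText   = "extracted_text"
	colExtractData     = "ocr_data"
	colExtractionModel = "extraction_model"
	colExtractionOps   = "extraction_ops"
)

// buildExtractionUpdates (hypothetical) includes a column only when the
// caller supplied a non-zero value, so skipping OCR on re-extraction does
// not overwrite previously stored TSV data with NULL.
func buildExtractionUpdates(text string, data []byte, model string, ops []byte) map[string]any {
	updates := make(map[string]any)
	if text != "" {
		updates[colExtractedText] = text
	}
	if len(data) > 0 {
		updates[colExtractData] = data
	}
	if model != "" {
		updates[colExtractionModel] = model
	}
	if len(ops) > 0 {
		updates[colExtractionOps] = ops
	}
	return updates
}

func main() {
	// Re-extraction with OCR skipped: only the model and ops are new.
	u := buildExtractionUpdates("", nil, "some-model", []byte(`[]`))
	fmt.Println(len(u), "columns would be updated")
}
```

A caller would likely also skip the update entirely when the returned map is empty; that choice is left out of the sketch.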

internal/app/extraction.go

…ss server

Instead of showing a cryptic "open: exit status 3", detect when xdg-open fails because no display server is available and show a message explaining the likely cause and environment.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add an explicit `r` keybinding in edit mode on document tabs to trigger extraction on the selected document. This makes the OCR-skip feature from the previous commit discoverable and testable -- previously extraction only ran as a side effect of saving a document form.

The status bar shows `r extract` on document tabs. If no extraction tools or LLM are configured, a status error is shown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
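
The dispatch rule described here can be illustrated without any TUI framework. Everything in this sketch is hypothetical: `editKeyResult`, `handleEditKey`, and the error text are invented for the example and do not reflect how the PR wires the key into its model.

```go
package main

import "fmt"

// editKeyResult (hypothetical) is either an action to run or a status-bar
// error to display; the zero value means the key was not handled.
type editKeyResult struct {
	Action    string
	StatusErr string
}

// handleEditKey sketches the rule from the commit message: in edit mode,
// "r" on a document tab triggers extraction, unless no extraction tools or
// LLM are configured, in which case a status error is reported instead.
func handleEditKey(key string, onDocumentTab, extractionConfigured bool) editKeyResult {
	if key != "r" || !onDocumentTab {
		return editKeyResult{}
	}
	if !extractionConfigured {
		// Assumed wording for the status error.
		return editKeyResult{StatusErr: "no extraction tools or LLM configured"}
	}
	return editKeyResult{Action: "extract"}
}

func main() {
	fmt.Printf("%+v\n", handleEditKey("r", true, true))
	fmt.Printf("%+v\n", handleEditKey("r", true, false))
	fmt.Printf("%+v\n", handleEditKey("r", false, true))
}
```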

Copilot

Pull request overview

This PR improves the document extraction workflow by enabling faster re-extraction using cached text/OCR artifacts, adds a dedicated extraction keybinding in the TUI, and persists extraction provenance (model + operations) on documents for inspection/audit.

Changes:

- Skip OCR when a document already has extracted text, while preserving any previously captured TSV/layout data for LLM re-extraction.
- Add `r` keybinding (Edit mode, document tabs) to open the extraction overlay for the selected document, plus footer hint updates.
- Persist `extraction_model` and `extraction_ops` on documents, and display the model in the Documents table.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

File	Description
internal/data/store.go	Includes `extraction_model` in document list queries; expands `UpdateDocumentExtraction` to persist model/ops with sparse updates.
internal/data/models.go	Adds `ExtractionModel` and `ExtractionOps` fields to the `Document` model.
internal/data/meta_generated.go	Adds generated column constants for `extraction_model` and `extraction_ops`.
internal/data/fts_test.go	Updates FTS test call sites for the new `UpdateDocumentExtraction` signature.
internal/app/view.go	Adds an “extract” help hint on document tabs in Edit mode.
internal/app/tables.go	Adds a visible “Model” column value in document row rendering.
internal/app/model.go	Adds `r` handling in edit-mode key dispatch; passes `ExtractData` into extraction overlay; skips OCR when cached text exists.
internal/app/forms.go	Treats the new “Model” column like other read-only columns for inline edit routing.
internal/app/extraction.go	Implements cached-text OCR skipping, carries forward TSV data, and persists extraction model + operations metadata.
internal/app/extraction_test.go	Adds tests for `r` keybinding, OCR skip behavior, TSV preservation, and model-used logic.
internal/app/docopen.go	Adds `extractSelectedDocument`; improves opener error messages for headless environments and adds display detection.
internal/app/docopen_test.go	Adds cross-platform ExitError helper and tests for improved opener error messaging paths.
internal/app/coldefs.go	Adds “Model” to document column definitions.
internal/app/columns_generated.go	Updates generated document column enum to include `documentColModel`.

internal/app/extraction.go

Copilot

Pull request overview

This PR improves the document extraction workflow by enabling faster re-extraction (reusing previously extracted text/TSV), adding a dedicated extract keybinding in document edit mode, and persisting/displaying extraction provenance (model + ops) for auditability.

Changes:

- Skip OCR when a document already has extracted text; preserve prior TSV (layout) data for LLM-only re-extraction.
- Add `r` keybinding (+ status hint) on document tabs to open the extraction overlay for the selected document.
- Persist `extraction_model` and `extraction_ops` on documents and show the model in the Documents table; improve xdg-open headless error messaging.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

File	Description
internal/data/store.go	Select `extraction_model` when listing docs; extend `UpdateDocumentExtraction` to persist model/ops and avoid zero-value overwrites.
internal/data/models.go	Add `ExtractionModel` and `ExtractionOps` fields to `Document`.
internal/data/meta_generated.go	Add column constants for `extraction_model` / `extraction_ops`.
internal/data/fts_test.go	Update call site for new `UpdateDocumentExtraction` signature.
internal/app/view.go	Add edit-mode footer hint for the new extract action on document tabs.
internal/app/tables.go	Add “Model” cell to document rows.
internal/app/model.go	Bind `r` in edit mode; pass `ExtractData` into extraction overlay; skip OCR when cached text exists.
internal/app/forms.go	Route clicks on the new Model column to the document edit form (read-only column behavior).
internal/app/extraction_test.go	Add tests for `r` keybinding, OCR-skip behavior, TSV preservation, and model-label selection.
internal/app/extraction.go	Implement OCR skip when cached text exists; preserve TSV in sources; persist extraction model + ops; add `marshalOps` + `extractionModelUsed`.
internal/app/docopen_test.go	Add cross-platform helper-process ExitError generation and new wrapOpenerError test cases.
internal/app/docopen.go	Add `extractSelectedDocument`; improve opener error messages for headless/remote environments; add `hasDisplay`.
internal/app/columns_generated.go	Add `documentColModel` constant.
internal/app/coldefs.go	Add “Model” column definition to document table column specs.

internal/app/extraction.go

)

Store which LLM model produced the extraction and the raw operations JSON alongside the document. The model column is visible in the documents table; the ops blob is stored for future inspection (see #766).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings March 13, 2026 12:18

Copilot started reviewing on behalf of cpcloud March 13, 2026 12:18

Copilot AI reviewed Mar 13, 2026

internal/app/docopen_test.go (resolved)

Copilot AI review requested due to automatic review settings March 13, 2026 12:50

cpcloud force-pushed the worktree-glowing-herding-nest branch from 52b7569 to 1e52267 March 13, 2026 12:50

Copilot started reviewing on behalf of cpcloud March 13, 2026 12:51

cpcloud force-pushed the worktree-glowing-herding-nest branch from 1e52267 to 3943739 March 13, 2026 12:54

Copilot AI reviewed Mar 13, 2026

internal/app/extraction.go (resolved)

internal/app/extraction.go (resolved)

cpcloud and others added 2 commits March 13, 2026 09:00

cpcloud force-pushed the worktree-glowing-herding-nest branch from 3943739 to e725cb2 March 13, 2026 13:00

cpcloud changed the title ~~feat(extract): skip OCR on re-extraction, add extract keybinding, improve headless errors~~ feat(extract): skip OCR on re-extraction, add extract keybinding, persist extraction metadata Mar 13, 2026

Copilot AI review requested due to automatic review settings March 13, 2026 13:08

cpcloud force-pushed the worktree-glowing-herding-nest branch from e725cb2 to 0c2209f March 13, 2026 13:08

Copilot started reviewing on behalf of cpcloud March 13, 2026 13:08

Copilot AI reviewed Mar 13, 2026

internal/app/extraction.go (outdated, resolved)

cpcloud force-pushed the worktree-glowing-herding-nest branch from 0c2209f to 024ed7c March 13, 2026 13:29

Copilot AI review requested due to automatic review settings March 13, 2026 13:32

cpcloud force-pushed the worktree-glowing-herding-nest branch from 024ed7c to 3f51ecb March 13, 2026 13:32

Copilot started reviewing on behalf of cpcloud March 13, 2026 13:32

Copilot AI reviewed Mar 13, 2026

internal/app/extraction.go (outdated, resolved)

cpcloud force-pushed the worktree-glowing-herding-nest branch from 3f51ecb to 3ab7e71 March 13, 2026 18:38

cpcloud merged commit 959b363 into main Mar 14, 2026
25 checks passed

cpcloud deleted the worktree-glowing-herding-nest branch March 14, 2026 12:48

BrewTestBot mentioned this pull request Mar 15, 2026

micasa 2.2.0 Homebrew/homebrew-core#272437

Merged

cpcloud merged 4 commits into main from worktree-glowing-herding-nest
