PAGE XML renderer / export #4214
Can you send a result file of this code for some example image?
stweil
left a comment
The CI builds fail because some include statements are missing.
Thanks for the suggestions and fixes, @stweil. Here are examples for eurotext and hebrew with 'page' and 'page-poly':
PAGE XML with box coordinates: PAGE XML with polygon:
Added terminating linefeed, removed trailing whitespace and fixed typos (found by typos-cli).
|
Maybe we can merge this new feature in a new 5.4.0 pre-release.
Co-authored-by: Stefan Weil <sw@weilnetz.de>
Unused variables Co-authored-by: Stefan Weil <sw@weilnetz.de>
Remove unused variables Co-authored-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Add PAGE XML export and documentation. To generate PAGE XML output just add 'page' to the tesseract command. The output is outputname + '.page.xml' to avoid conflicts with ALTO export. The output can be customized with the flags: tessedit_create_page_polygon and tessedit_create_page_wordlevel. Co-authored-by: Stefan Weil <sw@weilnetz.de>
Should we squash long PRs? Upd.:
Fixes: 577e8a8 ("Add PAGE XML renderer / export (tesseract-ocr#4214)") Signed-off-by: Stefan Weil <sw@weilnetz.de>
Use also enum names instead of numeric values where possible. Fixes: 577e8a8 ("Add PAGE XML renderer / export (tesseract-ocr#4214)") Signed-off-by: Stefan Weil <sw@weilnetz.de>
/// Sort baseline points ascending and deleting duplicates
///
Pta *SortBaseline(Pta *baseline_pts,
                  tesseract::WritingDirection writing_direction) {
@JKamlah, writing_direction is not used in this function. Should this parameter be removed?
This input parameter can be removed. Initially, the function depended on the writing direction, but that part has since been removed.
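For illustration, here is a minimal sketch of what the function does once `writing_direction` is dropped: sort the points by ascending x and keep a single point per x value. It is not the code from this PR. `Point` and `SortBaselineSketch` are hypothetical names, `std::vector` stands in for Leptonica's `Pta` so the example is self-contained, and keeping the smallest y among duplicates is an assumption.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for one entry of a Leptonica Pta (sketch only).
struct Point {
  float x;
  float y;
};

// Sort baseline points by ascending x and keep one point per x value.
// Assumption: when several points share an x, the smallest y is kept.
std::vector<Point> SortBaselineSketch(std::vector<Point> pts) {
  std::sort(pts.begin(), pts.end(), [](const Point &a, const Point &b) {
    return a.x < b.x || (a.x == b.x && a.y < b.y);
  });
  // After sorting, points with equal x are adjacent and the smallest y
  // comes first, so std::unique keeps exactly that point.
  auto last = std::unique(
      pts.begin(), pts.end(),
      [](const Point &a, const Point &b) { return a.x == b.x; });
  pts.erase(last, pts.end());
  return pts;
}
```

Since the result depends only on the coordinates, nothing in this sketch needs to know the writing direction.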
///
Pta *RecalcPolygonline(Pta *pts, bool upper) {
  int num_pts, num_bin, index = 0;
  int y, x0, y0, x1, y1;
y is assigned a float value, and it is compared to float or double values. Should it be declared as a float value? Or is it an integer, and type casts should be added?
Related compiler warnings:
../../../src/api/pagerenderer.cpp:139:19: warning: implicit conversion turns floating-point number into integer: 'float' to 'int' [-Wfloat-conversion]
../../../src/api/pagerenderer.cpp:157:21: warning: implicit conversion turns floating-point number into integer: 'float' to 'int' [-Wfloat-conversion]
../../../src/api/pagerenderer.cpp:159:13: warning: implicit conversion increases floating-point precision: 'l_float32' (aka 'float') to 'double' [-Wdouble-promotion]
../../../src/api/pagerenderer.cpp:163:13: warning: implicit conversion increases floating-point precision: 'l_float32' (aka 'float') to 'double' [-Wdouble-promotion]
../../../src/api/pagerenderer.cpp:175:11: warning: implicit conversion turns floating-point number into integer: 'l_float32' (aka 'float') to 'int' [-Wfloat-conversion]
../../../src/api/pagerenderer.cpp:184:11: warning: implicit conversion turns floating-point number into integer: 'l_float32' (aka 'float') to 'int' [-Wfloat-conversion]
It is indeed an int, and a type cast should be applied.
Done now. In addition, I fixed several memory leaks that were detected by a recently added test case contributed by Copilot.
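As a rough sketch of the kind of change discussed in this thread (not the actual patch), explicit casts make both conversions visible, so `-Wfloat-conversion` and `-Wdouble-promotion` have nothing to report. `ToInt` and `Exceeds` are hypothetical helper names used only for this example.

```cpp
// Hypothetical helper: store a float coordinate in an int.
// static_cast<int> truncates toward zero, exactly like the implicit
// conversion it replaces, so behaviour is unchanged; only the intent
// becomes explicit.
int ToInt(float v) {
  return static_cast<int>(v);
}

// Hypothetical helper: compare a float against a double threshold.
// The explicit widening cast spells out the promotion that
// -Wdouble-promotion would otherwise flag.
bool Exceeds(float v, double threshold) {
  return static_cast<double>(v) > threshold;
}
```

If rounding to the nearest pixel were wanted instead of truncation, `std::lround` from `<cmath>` would be the usual choice, but that would change the results, not just silence the warnings.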
Hi everyone,
I've created a PAGE-XML renderer/export.
It's not just a simple PAGE XML export: it can also produce a text line polygon instead of a simple bounding box, and it can output down to word level.
The output can be customised with three bool parameters.
After installing Tesseract, you can use the preconfigured settings 'page'.
page -> Output page.xml file with polygon and line-level
As word-level is more of a niche requirement, you need to enable it via -c:
-c page_xml_polygon=1 // True polygon or False bounding boxes
-c page_xml_level=1 // 0 line or 1 word level
The output is valid for PAGE XML version 2019-07-15.
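To make the difference between the two coordinate modes concrete: in PAGE XML, a region's outline is the `points` attribute of a `Coords` element, written as space-separated `x,y` pairs. The sketch below is not the renderer's code. `PointsAttribute` and `BoxPoints` are hypothetical helpers, and the clockwise corner order starting at the top-left is an assumption.

```cpp
#include <string>
#include <utility>
#include <vector>

// Format points as the value of a PAGE XML Coords "points" attribute:
// "x1,y1 x2,y2 ...". Works for any polygon.
std::string PointsAttribute(const std::vector<std::pair<int, int>> &pts) {
  std::string out;
  for (const auto &p : pts) {
    if (!out.empty()) {
      out += ' ';
    }
    out += std::to_string(p.first) + "," + std::to_string(p.second);
  }
  return out;
}

// A bounding box is the four-corner special case of a polygon.
// Assumption: corners are listed clockwise, starting at the top-left.
std::string BoxPoints(int left, int top, int right, int bottom) {
  return PointsAttribute(
      {{left, top}, {right, top}, {right, bottom}, {left, bottom}});
}
```

With polygon output enabled, the same attribute simply carries more points that follow the outline of the text line.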
If the text lines contain only LTR or only RTL characters, the output is correct, but I am not sure about mixed (BiDi) lines.
Can anyone help me here?