Jj/2027 float no attr strip (#2048)
Closes #2027 

When `.image_to_data()` returns a pandas.DataFrame for a table or page
that contains only numbers, the text values come back as floats rather
than strings. An AttributeError was raised downstream when trying to
`.strip()` those floats. This update converts non-string values to
strings with `str()` and strips string values as before.
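
As a minimal sketch of the behavior described above, the conversion can be isolated in a small function. The helper name `clean_ocr_text` and the sample values are hypothetical and are not part of the library; the body mirrors the one-line change made in `unstructured/partition/ocr.py`.

```python
from typing import Any


def clean_ocr_text(text: Any) -> str:
    """Hypothetical helper mirroring the one-line fix in this commit."""
    # Non-string values (e.g. floats from an all-numeric column) have no
    # .strip() method, so they are converted with str(); strings are
    # stripped exactly as before.
    return str(text) if not isinstance(text, str) else text.strip()


# Illustrative values only: a mix of OCR'd strings and floats.
cells = ["  Total ", 42.0, 3.14, "   "]
cleaned = [clean_ocr_text(cell) for cell in cells]
print(cleaned)  # ['Total', '42.0', '3.14', '']

# The previous code path called .strip() directly, which fails on floats.
try:
    (42.0).strip()
except AttributeError as exc:
    print(f"old behavior: {exc}")
```

Note that `str()` keeps the float formatting, so a value read as `42.0` becomes the text `"42.0"` rather than `"42"`.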

Testing (note: the document used for testing is new, so you will have to
copy it to the main branch in order to see that this snippet raises an
AttributeError on the main branch, but works on this branch)
```
from unstructured.partition.pdf import partition_pdf
filename = "example-docs/all-number-table.pdf"
partition_pdf(filename, strategy="ocr_only")
```

---------

Co-authored-by: cragwolfe <[email protected]>
Coniferish and cragwolfe authored Nov 10, 2023
1 parent fa27408 commit f8c180a
Showing 4 changed files with 11 additions and 1 deletion.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -14,6 +14,7 @@

* **Fix ingest partition parameters not being passed to the api.** When using the --partition-by-api flag via unstructured-ingest, none of the partition arguments are forwarded, meaning that these options are disregarded. With this change, we now pass through all of the relevant partition arguments to the api. This allows a user to specify all of the same partition arguments they would locally and have them respected when specifying --partition-by-api.
* **Support tables in section-less DOCX.** Generalize solution for MS Chat Transcripts exported as DOCX by including tables in the partitioned output when present.
* **Support tables that contain only numbers when partitioning via `ocr_only`** Tables that contain only numbers are returned as floats in a pandas.DataFrame when the image is converted from `.image_to_data()`. An AttributeError was raised downstream when trying to `.strip()` the floats.
* **Improve DOCX page-break detection.** DOCX page breaks are reliably indicated by `w:lastRenderedPageBreak` elements present in the document XML. Page breaks are NOT reliably indicated by "hard" page-breaks inserted by the author and when present are redundant to a `w:lastRenderedPageBreak` element so cause over-counting if used. Use rendered page-breaks only.

## 0.10.29
Binary file added example-docs/all-number-table.pdf
Binary file not shown.
7 changes: 7 additions & 0 deletions test_unstructured/partition/pdf_image/test_pdf.py
@@ -950,3 +950,10 @@ def test_partition_pdf_with_ocr_only_strategy(
    # check detection origin
    if UNSTRUCTURED_INCLUDE_DEBUG_METADATA:
        assert {element.metadata.detection_origin for element in elements} == {"ocr_tesseract"}


def test_partition_pdf_with_all_number_table_and_ocr_only_strategy():
    # AttributeError was previously being raised when partitioning documents that contained only
    # numerical values with `strategy="ocr_only"`
    filename = example_doc_path("all-number-table.pdf")
    assert pdf.partition_pdf(filename, strategy="ocr_only")
4 changes: 3 additions & 1 deletion unstructured/partition/ocr.py
@@ -533,7 +533,9 @@ def parse_ocr_data_tesseract(ocr_data: pd.DataFrame, zoom: float = 1) -> List[Te
  text = idtx.text
  if not text:
      continue
- cleaned_text = text.strip()
+
+ cleaned_text = str(text) if not isinstance(text, str) else text.strip()
+
  if cleaned_text:
      x1 = idtx.left / zoom
      y1 = idtx.top / zoom