Jj/2027 float no attr strip (#2048)

Closes #2027

Tables or pages that contain only numbers are returned as floats in a pandas.DataFrame when the image or page is converted from `.image_to_data()`. An AttributeError was raised downstream when trying to `.strip()` the floats. This update converts those floats to strings if needed and otherwise strips the text.

Testing

Note: the document used for testing is new, so you will have to copy it to the main branch in order to see that this snippet raises an AttributeError on the main branch but works on this branch.

```
from unstructured.partition.pdf import partition_pdf

filename = "example-docs/all-number-table.pdf"
partition_pdf(filename, strategy="ocr_only")
```

---------

Co-authored-by: cragwolfe <[email protected]>
Unstructured-IO · Nov 10, 2023 · f8c180a
1 parent fa27408
commit f8c180a

Showing 4 changed files with 11 additions and 1 deletion.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -14,6 +14,7 @@
 
 * **Fix ingest partition parameters not being passed to the api.** When using the --partition-by-api flag via unstructured-ingest, none of the partition arguments are forwarded, meaning that these options are disregarded. With this change, we now pass through all of the relevant partition arguments to the api. This allows a user to specify all of the same partition arguments they would locally and have them respected when specifying --partition-by-api.
 * **Support tables in section-less DOCX.** Generalize solution for MS Chat Transcripts exported as DOCX by including tables in the partitioned output when present.
+* **Support tables that contain only numbers when partitioning via `ocr_only`** Tables that contain only numbers are returned as floats in a pandas.DataFrame when the image is converted from `.image_to_data()`. An AttributeError was raised downstream when trying to `.strip()` the floats.
 * **Improve DOCX page-break detection.** DOCX page breaks are reliably indicated by `w:lastRenderedPageBreak` elements present in the document XML. Page breaks are NOT reliably indicated by "hard" page-breaks inserted by the author and when present are redundant to a `w:lastRenderedPageBreak` element so cause over-counting if used. Use rendered page-breaks only.
 
 ## 0.10.29

diff --git a/example-docs/all-number-table.pdf b/example-docs/all-number-table.pdf
diff --git a/test_unstructured/partition/pdf_image/test_pdf.py b/test_unstructured/partition/pdf_image/test_pdf.py
@@ -950,3 +950,10 @@ def test_partition_pdf_with_ocr_only_strategy(
     # check detection origin
     if UNSTRUCTURED_INCLUDE_DEBUG_METADATA:
         assert {element.metadata.detection_origin for element in elements} == {"ocr_tesseract"}
+
+
+def test_partition_pdf_with_all_number_table_and_ocr_only_strategy():
+    # AttributeError was previously being raised when partitioning documents that contained only
+    # numerical values with `strategy="ocr_only"`
+    filename = example_doc_path("all-number-table.pdf")
+    assert pdf.partition_pdf(filename, strategy="ocr_only")
diff --git a/unstructured/partition/ocr.py b/unstructured/partition/ocr.py
@@ -533,7 +533,9 @@ def parse_ocr_data_tesseract(ocr_data: pd.DataFrame, zoom: float = 1) -> List[Te
         text = idtx.text
         if not text:
             continue
-        cleaned_text = text.strip()
+
+        cleaned_text = str(text) if not isinstance(text, str) else text.strip()
+
         if cleaned_text:
             x1 = idtx.left / zoom
             y1 = idtx.top / zoom
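
The one-line change above can be illustrated in isolation. The following is a minimal sketch, not code from the repository: `clean_ocr_text` is a hypothetical helper that wraps the patched expression, and the plain Python lists stand in for the `text` column of the OCR DataFrame (floats when every token is numeric, strings otherwise).

```python
from typing import List, Optional


def clean_ocr_text(text: object) -> str:
    """Hypothetical helper wrapping the patched expression from the diff:
    stringify non-str values, strip whitespace from str values."""
    return str(text) if not isinstance(text, str) else text.strip()


# Stand-ins for the `text` column (assumption: an all-number page yields floats).
all_numeric_column: List[object] = [12.0, 3.5, 100.0]
mixed_column: List[object] = ["  Total ", "12", " 3.5"]

# Pre-patch behaviour: calling .strip() directly on a float raises AttributeError.
old_behavior_error: Optional[Exception] = None
try:
    [value.strip() for value in all_numeric_column]  # type: ignore[attr-defined]
except AttributeError as exc:
    old_behavior_error = exc

# Patched behaviour: floats are converted to strings, strings are stripped.
cleaned_numeric = [clean_ocr_text(value) for value in all_numeric_column]
cleaned_mixed = [clean_ocr_text(value) for value in mixed_column]

print(type(old_behavior_error).__name__)
print(cleaned_numeric)
print(cleaned_mixed)
```

Note that a non-string value is converted with `str()` and not stripped afterwards, which is harmless for floats because their string form carries no surrounding whitespace.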