Commit

fix: preserve text after line breaks in PowerPoint table cells
This commit fixes a bug where text following a line break in a PowerPoint table cell was lost during processing. The fix handles cell content the same way it is handled for Word documents: newlines within a cell's text are replaced with spaces. See https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/docx.py#L494
yamazombie authored Jan 18, 2025
1 parent 27cd53b commit bd14590
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion unstructured/partition/pptx.py
@@ -252,7 +252,7 @@ def _iter_table_element(self, graphfrm: GraphicFrame) -> Iterator[Table]:
             return

         html_text = htmlify_matrix_of_cell_texts(
-            [[cell.text for cell in row.cells] for row in rows]
+            [[cell.text.replace("\n", " ") for cell in row.cells] for row in rows]
         )
         html_table = HtmlTable.from_html_text(html_text)
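
The effect of the changed line can be sketched in isolation, using plain strings in place of `cell.text`. The helper name `flatten_cell_texts` and the sample rows are hypothetical and exist only for illustration; the sketch does not call `htmlify_matrix_of_cell_texts` or reproduce how that function treats newlines.

```python
from typing import List


def flatten_cell_texts(rows: List[List[str]]) -> List[List[str]]:
    """Return a new matrix in which each newline inside a cell string
    is replaced with a single space (hypothetical helper)."""
    return [[text.replace("\n", " ") for text in row] for row in rows]


# Stand-in for the `cell.text` values of a two-row table (assumed data).
rows = [
    ["Name", "Notes"],
    ["Widget", "first line\nsecond line"],
]

flat = flatten_cell_texts(rows)
print(flat[1][1])  # first line second line
```

Replacing the newline with a space keeps each cell on a single line, so the text after the break stays in the same cell string passed on to the HTML conversion.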

