chunk: relax table segregation during chunking (#3812)
**Summary**
Relax table-segregation rule applied during chunking such that a `Table`
and `Text`-subtype elements can be combined into a single chunk when the
chunking window allows.

**Additional Context**
Until now, `Table` elements have always been segregated during chunking,
i.e. a chunk that contained a table would never contain any other
element. In certain scenarios, especially when a large chunking window
of say 2000 characters is used, this behavior can reduce retrieval
effectiveness by isolating the table from surrounding context.
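
As a rough illustration of the rule change, here is a toy model of greedy
grouping under a character window, with and without table segregation. It is
only a sketch and not the library's implementation: `Element`, `pre_chunk`, and
`segregate_tables` are hypothetical names invented for this example, and the
`"\n\n"` separator is assumed from the expected chunk strings in the tests.

```python
from dataclasses import dataclass


@dataclass
class Element:
    # Hypothetical stand-in for an element; kind is "Text" or "Table".
    kind: str
    text: str


SEPARATOR = "\n\n"  # assumed joiner between element texts within a chunk


def pre_chunk(elements, max_characters, segregate_tables):
    """Greedily group consecutive elements so each group fits the window.

    With segregate_tables=True a table always sits alone in its group (old rule).
    With segregate_tables=False a table is grouped like any other element.
    """
    groups = []
    current = []
    for element in elements:
        joined = SEPARATOR.join(e.text for e in current + [element])
        table_boundary = segregate_tables and (
            element.kind == "Table" or any(e.kind == "Table" for e in current)
        )
        # Close the current group when the table rule or the window requires it.
        if current and (table_boundary or len(joined) > max_characters):
            groups.append(current)
            current = []
        current.append(element)
    if current:
        groups.append(current)
    return groups


elements = [
    Element("Text", "A Great Day"),
    Element("Text", "Today is a great day."),
    Element("Text", "It is sunny outside."),
    Element("Table", "Heading Cell text"),
    Element("Text", "An Okay Day"),
]

old_groups = pre_chunk(elements, max_characters=500, segregate_tables=True)
new_groups = pre_chunk(elements, max_characters=500, segregate_tables=False)
print([len(g) for g in old_groups])  # [3, 1, 1]
print([len(g) for g in new_groups])  # [5]
```

The sketch does not split oversized elements or honor section boundaries; it
only shows how dropping segregation lets a small table share a chunk with its
neighbors when the window allows.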

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: scanny <[email protected]>
3 people authored Dec 9, 2024
1 parent 18d6c81 commit 4379d88
Showing 15 changed files with 1,049 additions and 907 deletions.
4 changes: 3 additions & 1 deletion CHANGELOG.md
@@ -1,8 +1,10 @@
-## 0.16.11-dev0
+## 0.16.11-dev1

 ### Enhancements

 - **Enhance quote standardization tests** with additional Unicode scenarios
+- **Relax table segregation rule in chunking.** Previously a `Table` element was always segregated into its own pre-chunk such that the `Table` appeared alone in a chunk or was split into multiple `TableChunk` elements, but never combined with `Text`-subtype elements. Allow table elements to be combined with other elements in the same chunk when space allows.
+- **Compute chunk length based solely on `element.text`.** Previously `.metadata.text_as_html` was also considered and since it is always longer than the text (due to HTML tag overhead) it was the effective length criterion. Remove text-as-html from the length calculation such that text-length is the sole criterion for sizing a chunk.

 ### Features

1,027 changes: 468 additions & 559 deletions test_unstructured/chunking/test_base.py

Large diffs are not rendered by default.

40 changes: 20 additions & 20 deletions test_unstructured/chunking/test_basic.py
@@ -25,69 +25,69 @@ def test_it_chunks_a_document_when_basic_chunking_strategy_is_specified_on_parti
     assert chunks == [
         CompositeElement(
             "US Trustee Handbook\n\nCHAPTER 1\n\nINTRODUCTION\n\nCHAPTER 1 – INTRODUCTION"
-            "\n\nA.\tPURPOSE"
+            "\n\nA. PURPOSE"
         ),
         CompositeElement(
             "The United States Trustee appoints and supervises standing trustees and monitors and"
-            " supervises cases under chapter 13 of title 11 of the United States Code. 28 U.S.C."
-            " § 586(b). The Handbook, issued as part of our duties under 28 U.S.C. § 586,"
+            " supervises cases under chapter 13 of title 11 of the United States Code. 28 U.S.C."
+            " § 586(b). The Handbook, issued as part of our duties under 28 U.S.C. § 586,"
             " establishes or clarifies the position of the United States Trustee Program (Program)"
             " on the duties owed by a standing trustee to the debtors, creditors, other parties in"
-            " interest, and the United States Trustee. The Handbook does not present a full and"
+            " interest, and the United States Trustee. The Handbook does not present a full and"
         ),
         CompositeElement(
             "complete statement of the law; it should not be used as a substitute for legal"
-            " research and analysis. The standing trustee must be familiar with relevant"
+            " research and analysis. The standing trustee must be familiar with relevant"
             " provisions of the Bankruptcy Code, Federal Rules of Bankruptcy Procedure (Rules),"
-            " any local bankruptcy rules, and case law. 11 U.S.C. § 321, 28 U.S.C. § 586,"
-            " 28 C.F.R. § 58.6(a)(3). Standing trustees are encouraged to follow Practice Tips"
+            " any local bankruptcy rules, and case law. 11 U.S.C. § 321, 28 U.S.C. § 586,"
+            " 28 C.F.R. § 58.6(a)(3). Standing trustees are encouraged to follow Practice Tips"
             " identified in this Handbook but these are not considered mandatory."
         ),
         CompositeElement(
             "Nothing in this Handbook should be construed to excuse the standing trustee from"
             " complying with all duties imposed by the Bankruptcy Code and Rules, local rules, and"
-            " orders of the court. The standing trustee should notify the United States Trustee"
+            " orders of the court. The standing trustee should notify the United States Trustee"
             " whenever the provision of the Handbook conflicts with the local rules or orders of"
-            " the court. The standing trustee is accountable for all duties set forth in this"
-            " Handbook, but need not personally perform any duty unless otherwise indicated. All"
+            " the court. The standing trustee is accountable for all duties set forth in this"
+            " Handbook, but need not personally perform any duty unless otherwise indicated. All"
         ),
         CompositeElement(
             "statutory references in this Handbook refer to the Bankruptcy Code, 11 U.S.C. § 101"
             " et seq., unless otherwise indicated."
         ),
         CompositeElement(
             "This Handbook does not create additional rights against the standing trustee or"
-            " United States Trustee in favor of other parties.\n\nB.\tROLE OF THE UNITED STATES"
+            " United States Trustee in favor of other parties.\n\nB. ROLE OF THE UNITED STATES"
             " TRUSTEE"
         ),
         CompositeElement(
             "The Bankruptcy Reform Act of 1978 removed the bankruptcy judge from the"
-            " responsibilities for daytoday administration of cases. Debtors, creditors, and"
+            " responsibilities for daytoday administration of cases. Debtors, creditors, and"
             " third parties with adverse interests to the trustee were concerned that the court,"
             " which previously appointed and supervised the trustee, would not impartially"
             " adjudicate their rights as adversaries of that trustee. To address these concerns,"
             " judicial and administrative functions within the bankruptcy system were bifurcated."
         ),
         CompositeElement(
             "Many administrative functions formerly performed by the court were placed within the"
-            " Department of Justice through the creation of the Program. Among the administrative"
+            " Department of Justice through the creation of the Program. Among the administrative"
             " functions assigned to the United States Trustee were the appointment and supervision"
-            " of chapter 13 trustees./ This Handbook is issued under the authority of the"
-            " Program’s enabling statutes. \n\nC.\tSTATUTORY DUTIES OF A STANDING TRUSTEE\t"
+            " of chapter 13 trustees./ This Handbook is issued under the authority of the"
+            " Program’s enabling statutes.\n\nC. STATUTORY DUTIES OF A STANDING TRUSTEE"
         ),
         CompositeElement(
-            "The standing trustee has a fiduciary responsibility to the bankruptcy estate. The"
-            " standing trustee is more than a mere disbursing agent. The standing trustee must"
-            " be personally involved in the trustee operation. If the standing trustee is or"
+            "The standing trustee has a fiduciary responsibility to the bankruptcy estate. The"
+            " standing trustee is more than a mere disbursing agent. The standing trustee must"
+            " be personally involved in the trustee operation. If the standing trustee is or"
             " becomes unable to perform the duties and responsibilities of a standing trustee,"
             " the standing trustee must immediately advise the United States Trustee."
-            " 28 U.S.C. § 586(b), 28 C.F.R. § 58.4(b) referencing 28 C.F.R. § 58.3(b)."
+            " 28 U.S.C. § 586(b), 28 C.F.R. § 58.4(b) referencing 28 C.F.R. § 58.3(b)."
         ),
         CompositeElement(
             "Although this Handbook is not intended to be a complete statutory reference, the"
             " standing trustee’s primary statutory duties are set forth in 11 U.S.C. § 1302, which"
             " incorporates by reference some of the duties of chapter 7 trustees found in"
-            " 11 U.S.C. § 704. These duties include, but are not limited to, the"
+            " 11 U.S.C. § 704. These duties include, but are not limited to, the"
             " following:\n\nCopyright"
         ),
     ]
75 changes: 57 additions & 18 deletions test_unstructured/chunking/test_title.py
@@ -8,7 +8,7 @@

 import pytest

-from test_unstructured.unit_utils import FixtureRequest, Mock, function_mock
+from test_unstructured.unit_utils import FixtureRequest, Mock, function_mock, input_path
 from unstructured.chunking.base import CHUNK_MULTI_PAGE_DEFAULT
 from unstructured.chunking.title import _ByTitleChunkingOptions, chunk_by_title
 from unstructured.documents.coordinates import CoordinateSystem
@@ -20,10 +20,12 @@
     ElementMetadata,
     ListItem,
     Table,
+    TableChunk,
     Text,
     Title,
 )
 from unstructured.partition.html import partition_html
+from unstructured.staging.base import elements_from_json

 # ================================================================================================
 # INTEGRATION-TESTS
@@ -33,7 +35,53 @@
 # ================================================================================================


-def test_it_splits_a_large_element_into_multiple_chunks():
+def test_it_chunks_text_followed_by_table_together_when_both_fit():
+    elements = elements_from_json(input_path("chunking/title_table_200.json"))
+
+    chunks = chunk_by_title(elements, combine_text_under_n_chars=0)
+
+    assert len(chunks) == 1
+    assert isinstance(chunks[0], CompositeElement)
+
+
+def test_it_chunks_table_followed_by_text_together_when_both_fit():
+    elements = elements_from_json(input_path("chunking/table_text_200.json"))
+
+    # -- disable chunk combining so we test pre-chunking behavior, not chunk-combining --
+    chunks = chunk_by_title(elements, combine_text_under_n_chars=0)
+
+    assert len(chunks) == 1
+    assert isinstance(chunks[0], CompositeElement)
+
+
+def test_it_splits_oversized_table():
+    elements = elements_from_json(input_path("chunking/table_2000.json"))
+
+    chunks = chunk_by_title(elements)
+
+    assert len(chunks) == 5
+    assert all(isinstance(chunk, TableChunk) for chunk in chunks)
+
+
+def test_it_starts_new_chunk_for_table_after_full_text_chunk():
+    elements = elements_from_json(input_path("chunking/long_text_table_200.json"))
+
+    chunks = chunk_by_title(elements, max_characters=250)
+
+    assert len(chunks) == 2
+    assert [type(chunk) for chunk in chunks] == [CompositeElement, Table]
+
+
+def test_it_starts_new_chunk_for_text_after_full_table_chunk():
+    elements = elements_from_json(input_path("chunking/full_table_long_text_250.json"))
+
+    chunks = chunk_by_title(elements, max_characters=250)
+
+    assert len(chunks) == 2
+    assert [type(chunk) for chunk in chunks] == [Table, CompositeElement]
+
+
+def test_it_splits_a_large_text_element_into_multiple_chunks():
     elements: list[Element] = [
         Title("Introduction"),
         Text(
@@ -68,29 +116,26 @@ def test_it_splits_elements_by_title_and_table():

     chunks = chunk_by_title(elements, combine_text_under_n_chars=0, include_orig_elements=True)

-    assert len(chunks) == 4
+    assert len(chunks) == 3
     # --
     chunk = chunks[0]
     assert isinstance(chunk, CompositeElement)
     assert chunk.metadata.orig_elements == [
         Title("A Great Day"),
         Text("Today is a great day."),
         Text("It is sunny outside."),
+        Table("Heading\nCell text"),
     ]
     # --
     chunk = chunks[1]
-    assert isinstance(chunk, Table)
-    assert chunk.metadata.orig_elements == [Table("Heading\nCell text")]
-    # ==
-    chunk = chunks[2]
     assert isinstance(chunk, CompositeElement)
     assert chunk.metadata.orig_elements == [
         Title("An Okay Day"),
         Text("Today is an okay day."),
         Text("It is rainy outside."),
     ]
     # --
-    chunk = chunks[3]
+    chunk = chunks[2]
     assert isinstance(chunk, CompositeElement)
     assert chunk.metadata.orig_elements == [
         Title("A Bad Day"),
@@ -119,9 +164,8 @@ def test_chunk_by_title():

     assert chunks == [
         CompositeElement(
-            "A Great Day\n\nToday is a great day.\n\nIt is sunny outside.",
+            "A Great Day\n\nToday is a great day.\n\nIt is sunny outside.\n\nHeading Cell text"
         ),
-        Table("Heading\nCell text"),
         CompositeElement("An Okay Day\n\nToday is an okay day.\n\nIt is rainy outside."),
         CompositeElement(
             "A Bad Day\n\nToday is a bad day.\n\nIt is storming outside.",
@@ -150,10 +194,7 @@ def test_chunk_by_title_separates_by_page_number():
         CompositeElement(
             "A Great Day",
         ),
-        CompositeElement(
-            "Today is a great day.\n\nIt is sunny outside.",
-        ),
-        Table("Heading\nCell text"),
+        CompositeElement("Today is a great day.\n\nIt is sunny outside.\n\nHeading Cell text"),
         CompositeElement("An Okay Day\n\nToday is an okay day.\n\nIt is rainy outside."),
         CompositeElement(
             "A Bad Day\n\nToday is a bad day.\n\nIt is storming outside.",
@@ -178,9 +219,8 @@ def test_chuck_by_title_respects_multipage():
     chunks = chunk_by_title(elements, multipage_sections=True, combine_text_under_n_chars=0)
     assert chunks == [
         CompositeElement(
-            "A Great Day\n\nToday is a great day.\n\nIt is sunny outside.",
+            "A Great Day\n\nToday is a great day.\n\nIt is sunny outside.\n\nHeading Cell text"
         ),
-        Table("Heading\nCell text"),
         CompositeElement("An Okay Day\n\nToday is an okay day.\n\nIt is rainy outside."),
         CompositeElement(
             "A Bad Day\n\nToday is a bad day.\n\nIt is storming outside.",
@@ -206,9 +246,8 @@ def test_chunk_by_title_groups_across_pages():

     assert chunks == [
         CompositeElement(
-            "A Great Day\n\nToday is a great day.\n\nIt is sunny outside.",
+            "A Great Day\n\nToday is a great day.\n\nIt is sunny outside.\n\nHeading Cell text"
         ),
-        Table("Heading\nCell text"),
         CompositeElement("An Okay Day\n\nToday is an okay day.\n\nIt is rainy outside."),
         CompositeElement(
             "A Bad Day\n\nToday is a bad day.\n\nIt is storming outside.",
2 changes: 1 addition & 1 deletion test_unstructured/partition/test_json.py
@@ -37,7 +37,7 @@ def test_it_chunks_elements_when_a_chunking_strategy_is_specified():
         "example-docs/spring-weather.html.json", chunking_strategy="basic", max_characters=1500
     )

-    assert len(chunks) == 10
+    assert len(chunks) == 9
     assert all(isinstance(ch, CompositeElement) for ch in chunks)


32 changes: 32 additions & 0 deletions test_unstructured/testfiles/chunking/full_table_long_text_250.json
@@ -0,0 +1,32 @@
[
{
"type": "Table",
"element_id": "ca96108263324e9d865a98f19cf7c940",
"text": "RFP Number: 2024-PMO-01 RFP Title: PMO Services RFP RFP Due Date and Time: Number of Pages: #189 05/30/2024 by 5:00pm Central Time",
"metadata": {
"category_depth": 1,
"page_number": 1,
"parent_id": "747587de72444235a68c768d544ff5f3",
"text_as_html": "<table class=\"Table\" id=\"ca96108263324e9d865a98f19cf7c940\"> <tbody> <tr> <td>RFP Number: 2024-PMO-01</td><td>RFP Title: PMO Services RFP</td></tr><tr> <td>RFP Due Date and Time:</td><td>Number of Pages: #189</td></tr><tr> <td>05/30/2024 by 5:00pm Central Time</td><td></td></tr></tbody></table>",
"languages": [
"eng"
],
"filetype": "text/html"
}
},
{
"type": "NarrativeText",
"element_id": "5bc93ad5828445f98cac824c750cacfd",
"text": "Format: CSV file for Export and Download Contact: Charles Stringham [email protected] to arrange secure data transfer OR with technical questions [email protected] for other questions",
"metadata": {
"category_depth": 2,
"page_number": 1,
"parent_id": "d8fa364bbfdf42d7b37c7a1dcb90ecf5",
"text_as_html": "<p class=\"NarrativeText\" id=\"5bc93ad5828445f98cac824c750cacfd\">Format: CSV file for Export and Download </p> <p class=\"NarrativeText\" id=\"875c1820b6cd4736a7e699571896b568\">Contact: Charles Stringham [email protected] to arrange secure data transfer OR with technical questions </p> <p class=\"NarrativeText\" id=\"ac41c15812e64e918cbb07c2bc68b5d2\">[email protected] for other questions </p>",
"languages": [
"eng"
],
"filetype": "text/html"
}
}
]
32 changes: 32 additions & 0 deletions test_unstructured/testfiles/chunking/long_text_table_200.json
@@ -0,0 +1,32 @@
[
{
"type": "NarrativeText",
"element_id": "5bc93ad5828445f98cac824c750cacfd",
"text": "Format: CSV file for Export and Download Contact: Charles Stringham [email protected] to arrange secure data transfer OR with technical questions [email protected] for other questions",
"metadata": {
"category_depth": 2,
"page_number": 1,
"parent_id": "d8fa364bbfdf42d7b37c7a1dcb90ecf5",
"text_as_html": "<p class=\"NarrativeText\" id=\"5bc93ad5828445f98cac824c750cacfd\">Format: CSV file for Export and Download </p> <p class=\"NarrativeText\" id=\"875c1820b6cd4736a7e699571896b568\">Contact: Charles Stringham [email protected] to arrange secure data transfer OR with technical questions </p> <p class=\"NarrativeText\" id=\"ac41c15812e64e918cbb07c2bc68b5d2\">[email protected] for other questions </p>",
"languages": [
"eng"
],
"filetype": "text/html"
}
},
{
"type": "Table",
"element_id": "ca96108263324e9d865a98f19cf7c940",
"text": "RFP Number: 2024-PMO-01 RFP Title: PMO Services RFP RFP Due Date and Time: Number of Pages: #189 05/30/2024 by 5:00pm Central Time",
"metadata": {
"category_depth": 1,
"page_number": 1,
"parent_id": "747587de72444235a68c768d544ff5f3",
"text_as_html": "<table class=\"Table\" id=\"ca96108263324e9d865a98f19cf7c940\"> <tbody> <tr> <td>RFP Number: 2024-PMO-01</td><td>RFP Title: PMO Services RFP</td></tr><tr> <td>RFP Due Date and Time:</td><td>Number of Pages: #189</td></tr><tr> <td>05/30/2024 by 5:00pm Central Time</td><td></td></tr></tbody></table>",
"languages": [
"eng"
],
"filetype": "text/html"
}
}
]