[Bug] The latest unstructured-inference version can't extract table normally, while old version can. #400

Open
hardfish82 opened this issue Jan 8, 2025 · 1 comment

hardfish82 commented Jan 8, 2025

The bug occurs with the following versions:

unstructured                             0.16.12
unstructured-inference                   0.8.1

Code:

from unstructured.partition.pdf import partition_pdf
input_path = "../input/"
output_path = "../output/"
file_path = input_path + 'attention.pdf'

chunks = partition_pdf(
    filename=file_path,
    infer_table_structure=True,            # extract tables
    strategy="hi_res",                     # mandatory to infer tables

    extract_image_block_types=["Image", "Table"],   # add "Table" to the list to extract images of tables
    # image_output_dir_path=output_path,   # if None, images and tables will be saved as base64

    extract_image_block_to_payload=True,   # if True, base64 data is included in the payload for API usage

    chunking_strategy="by_title",          # or 'basic'
    max_characters=10000,                  # defaults to 500
    combine_text_under_n_chars=2000,       # defaults to 0
    new_after_n_chars=6000,

    # extract_images_in_pdf=True,          # deprecated
)

No tables found in the chunks:

[<unstructured.documents.elements.CompositeElement at 0x7f86226bc0d0>,
 <unstructured.documents.elements.CompositeElement at 0x7f86226bc2e0>,
 <unstructured.documents.elements.CompositeElement at 0x7f86226bc160>,
 <unstructured.documents.elements.CompositeElement at 0x7f86226bc280>,
 <unstructured.documents.elements.CompositeElement at 0x7f86226bc3a0>,
 <unstructured.documents.elements.CompositeElement at 0x7f86226bc3d0>,
 <unstructured.documents.elements.CompositeElement at 0x7f86226bc580>,
 <unstructured.documents.elements.CompositeElement at 0x7f86226bc5b0>,
 <unstructured.documents.elements.CompositeElement at 0x7f8621e8e530>,
 <unstructured.documents.elements.CompositeElement at 0x7f86226bc640>,
 <unstructured.documents.elements.CompositeElement at 0x7f86226bc310>,
 <unstructured.documents.elements.CompositeElement at 0x7f8621e8d870>]
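
Reading element types off a list of reprs is error-prone, so a small helper that tallies elements by class name makes the comparison between versions easier. This is a minimal sketch: `count_element_types` is a hypothetical helper name, not part of `unstructured`, and it relies only on the class name of each returned object.

```python
from collections import Counter


def count_element_types(elements):
    """Return a Counter mapping each element's class name to its count.

    Hypothetical helper (not part of unstructured): it only inspects
    type(el).__name__, so it works on whatever list partition_pdf returns.
    """
    return Counter(type(el).__name__ for el in elements)
```

Calling `count_element_types(chunks)` on the output above would report only `CompositeElement` entries, while a run that finds tables would also report a `Table` entry.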

The same code works with the following versions:

unstructured                             0.11.5
unstructured-inference                   0.7.19

Four tables found:

[<unstructured.documents.elements.CompositeElement at 0x7fdb74e00dc0>,
 <unstructured.documents.elements.CompositeElement at 0x7fdb74d35060>,
 <unstructured.documents.elements.CompositeElement at 0x7fdb74e018d0>,
 <unstructured.documents.elements.CompositeElement at 0x7fdb74e012a0>,
 <unstructured.documents.elements.CompositeElement at 0x7fdb74e028c0>,
 <unstructured.documents.elements.CompositeElement at 0x7fdb74e011e0>,
 <unstructured.documents.elements.Table at 0x7fdb6c1e02e0>,
 <unstructured.documents.elements.CompositeElement at 0x7fdb74ccfa00>,
 <unstructured.documents.elements.CompositeElement at 0x7fdb74e03250>,
 <unstructured.documents.elements.Table at 0x7fdb6c210ac0>,
 <unstructured.documents.elements.CompositeElement at 0x7fdb74e024d0>,
 <unstructured.documents.elements.CompositeElement at 0x7fdb74e02830>,
 <unstructured.documents.elements.Table at 0x7fdb6c3f49a0>,
 <unstructured.documents.elements.CompositeElement at 0x7fdb74ebda20>,
 <unstructured.documents.elements.Table at 0x7fdb74d37730>,
 <unstructured.documents.elements.CompositeElement at 0x7fdb74e01150>,
 <unstructured.documents.elements.CompositeElement at 0x7fdb74e00be0>]

hardfish82 (Author) commented:

I tested different parameters in the latest version and found that the problem occurs only when the following chunking parameters are included:

chunking_strategy="by_title",
max_characters=10000, # defaults to 500
combine_text_under_n_chars=2000, # defaults to 0
new_after_n_chars=6000,
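
Since the problem appears only with the chunking parameters, one way to narrow it down is to partition without chunking, chunk in a separate step, and count tables at both stages. The sketch below assumes that `chunk_by_title` from `unstructured.chunking.title` accepts the same three size parameters, and that recent versions expose the pre-chunking elements as `chunk.metadata.orig_elements`; the attribute is read with `getattr` because it may be absent in older releases. `count_tables` and `diagnose` are hypothetical helper names.

```python
# Class names treated as tables; TableChunk is what an oversized Table
# is split into during chunking.
TABLE_CLASS_NAMES = {"Table", "TableChunk"}


def count_tables(chunks):
    """Return (direct, nested) table counts for a list of elements.

    direct: elements that are themselves Table/TableChunk.
    nested: tables found in metadata.orig_elements (assumed attribute,
            read defensively because older versions may not have it).
    """
    direct = 0
    nested = 0
    for chunk in chunks:
        if type(chunk).__name__ in TABLE_CLASS_NAMES:
            direct += 1
        metadata = getattr(chunk, "metadata", None)
        originals = getattr(metadata, "orig_elements", None) or []
        nested += sum(type(el).__name__ in TABLE_CLASS_NAMES for el in originals)
    return direct, nested


def diagnose(pdf_path):
    """Partition first, chunk second, and report table counts at each stage."""
    # Imported lazily so count_tables can be used without unstructured installed.
    from unstructured.chunking.title import chunk_by_title
    from unstructured.partition.pdf import partition_pdf

    elements = partition_pdf(
        filename=pdf_path,
        strategy="hi_res",
        infer_table_structure=True,
    )
    chunks = chunk_by_title(
        elements,
        max_characters=10000,
        combine_text_under_n_chars=2000,
        new_after_n_chars=6000,
    )
    return count_tables(elements), count_tables(chunks)
```

If `diagnose("../input/attention.pdf")` reports tables before chunking but none afterwards, the regression is in the chunking step; if no tables are found before chunking either, the table detection itself is the more likely cause.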
