BUG: TypeError: UnstructuredYoloXModel.initialize() got an unexpected keyword argument 'extract_images_in_pdf' #265

Closed
2710932616 opened this issue Oct 20, 2023 · 2 comments

Comments

@2710932616

Question

Running the official LangChain example Semi_structured_multi_modal_RAG_LLaMA2 fails with a TypeError. The error is triggered by the extract_images_in_pdf and image_output_dir_path parameters passed to partition_pdf.

Env

python 3.11.4
langchain 0.0.319
unstructured 0.10.24
unstructured-inference 0.7.7
unstructured.pytesseract 0.3.12

Code

from unstructured.partition.pdf import partition_pdf

# Path to save images
path = "./"

# Get elements
raw_pdf_elements = partition_pdf(filename=path+"LLaVA.pdf",
                                 # Using pdf format to find embedded image blocks
                                 extract_images_in_pdf=True,
                                 # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
                                 # Titles are any sub-section of the document 
                                 infer_table_structure=True, 
                                 # Post processing to aggregate text once we have the title 
                                 chunking_strategy="by_title",
                                 # Chunking params to aggregate text blocks
                                 # Attempt to create a new chunk 3800 chars
                                 # Attempt to keep chunks > 2000 chars 
                                 # Hard max on chunks
                                 max_characters=4000, 
                                 new_after_n_chars=3800, 
                                 combine_text_under_n_chars=2000,
                                 image_output_dir_path=path)

Traceback

      4 path = "./"
      6 # Get elements
----> 7 raw_pdf_elements = partition_pdf(filename=path+"LLaVA.pdf",
      8                                  # Using pdf format to find embedded image blocks
      9                                  extract_images_in_pdf=True,
     10                                  # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
     11                                  # Titles are any sub-section of the document 
     12                                  infer_table_structure=True, 
     13                                  # Post processing to aggregate text once we have the title 
     14                                  chunking_strategy="by_title",
     15                                  # Chunking params to aggregate text blocks
     16                                  # Attempt to create a new chunk 3800 chars
     17                                  # Attempt to keep chunks > 2000 chars 
     18                                  # Hard max on chunks
     19                                  max_characters=4000, 
     20                                  new_after_n_chars=3800, 
     21                                  combine_text_under_n_chars=2000,
     22                                  image_output_dir_path=path)

File ~/miniconda3/lib/python3.11/site-packages/unstructured/documents/elements.py:306, in process_metadata.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    304 @functools.wraps(func)
    305 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
--> 306     elements = func(*args, **kwargs)
    307     sig = inspect.signature(func)
    308     params: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)

File ~/miniconda3/lib/python3.11/site-packages/unstructured/file_utils/filetype.py:551, in add_metadata_with_filetype.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    549 @functools.wraps(func)
    550 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
--> 551     elements = func(*args, **kwargs)
    552     sig = inspect.signature(func)
    553     params: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)

File ~/miniconda3/lib/python3.11/site-packages/unstructured/chunking/title.py:277, in add_chunking_strategy.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    275 @functools.wraps(func)
    276 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
--> 277     elements = func(*args, **kwargs)
    278     sig = inspect.signature(func)
    279     params: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)

File ~/miniconda3/lib/python3.11/site-packages/unstructured/partition/pdf.py:157, in partition_pdf(filename, file, include_page_breaks, strategy, infer_table_structure, ocr_languages, languages, max_partition, min_partition, include_metadata, metadata_filename, metadata_last_modified, chunking_strategy, links, **kwargs)
    151         languages = convert_old_ocr_languages_to_languages(ocr_languages)
    152         logger.warning(
    153             "The ocr_languages kwarg will be deprecated in a future version of unstructured. "
    154             "Please use languages instead.",
    155         )
--> 157 return partition_pdf_or_image(
    158     filename=filename,
    159     file=file,
    160     include_page_breaks=include_page_breaks,
    161     strategy=strategy,
    162     infer_table_structure=infer_table_structure,
    163     languages=languages,
    164     max_partition=max_partition,
    165     min_partition=min_partition,
    166     metadata_last_modified=metadata_last_modified,
    167     **kwargs,
    168 )

File ~/miniconda3/lib/python3.11/site-packages/unstructured/partition/pdf.py:287, in partition_pdf_or_image(filename, file, is_image, include_page_breaks, strategy, infer_table_structure, ocr_languages, languages, max_partition, min_partition, metadata_last_modified, **kwargs)
    285 with warnings.catch_warnings():
    286     warnings.simplefilter("ignore")
--> 287     _layout_elements = _partition_pdf_or_image_local(
    288         filename=filename,
    289         file=spooled_to_bytes_io_if_needed(file),
    290         is_image=is_image,
    291         infer_table_structure=infer_table_structure,
    292         include_page_breaks=include_page_breaks,
    293         languages=languages,
    294         metadata_last_modified=metadata_last_modified or last_modification_date,
    295         **kwargs,
    296     )
    297     layout_elements = []
    298     for el in _layout_elements:

File ~/miniconda3/lib/python3.11/site-packages/unstructured/utils.py:178, in requires_dependencies.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    169 if len(missing_deps) > 0:
    170     raise ImportError(
    171         f"Following dependencies are missing: {', '.join(missing_deps)}. "
    172         + (
   (...)
    176         ),
    177     )
--> 178 return func(*args, **kwargs)

File ~/miniconda3/lib/python3.11/site-packages/unstructured/partition/pdf.py:377, in _partition_pdf_or_image_local(filename, file, is_image, infer_table_structure, include_page_breaks, languages, ocr_mode, model_name, metadata_last_modified, **kwargs)
    373         process_with_model_kwargs[key] = value
    375 if file is None:
    376     # NOTE(christine): out_layout = extracted_layout + inferred_layout
--> 377     out_layout = process_file_with_model(
    378         filename,
    379         is_image=is_image,
    380         extract_tables=infer_table_structure,
    381         model_name=model_name,
    382         pdf_image_dpi=pdf_image_dpi,
    383         **process_with_model_kwargs,
    384     )
    385     if model_name.startswith("chipper"):
    386         # NOTE(alan): We shouldn't do OCR with chipper
    387         final_layout = out_layout

File ~/miniconda3/lib/python3.11/site-packages/unstructured_inference/inference/layout.py:481, in process_file_with_model(filename, model_name, is_image, fixed_layouts, extract_tables, pdf_image_dpi, **kwargs)
    469 def process_file_with_model(
    470     filename: str,
    471     model_name: Optional[str],
   (...)
    476     **kwargs,
    477 ) -> DocumentLayout:
    478     """Processes pdf file with name filename into a DocumentLayout by using a model identified by
    479     model_name."""
--> 481     model = get_model(model_name, **kwargs)
    482     if isinstance(model, UnstructuredObjectDetectionModel):
    483         detection_model = model

File ~/miniconda3/lib/python3.11/site-packages/unstructured_inference/models/base.py:73, in get_model(model_name, **kwargs)
     71 else:
     72     raise UnknownModelException(f"Unknown model type: {model_name}")
---> 73 model.initialize(**initialize_params)
     74 models[model_name] = model
     75 return model

TypeError: UnstructuredYoloXModel.initialize() got an unexpected keyword argument 'extract_images_in_pdf'
@2710932616 changed the title from "TypeError: UnstructuredYoloXModel.initialize() got an unexpected keyword argument 'extract_images_in_pdf'" to "BUG: TypeError: UnstructuredYoloXModel.initialize() got an unexpected keyword argument 'extract_images_in_pdf'" on Oct 20, 2023
@2710932616
Author

Reason

Extra keyword arguments are forwarded all the way down to the model's initialize() method, which does not accept them.
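
To make the failure mode concrete, here is a minimal sketch that reproduces the same kind of error without installing anything. ToyModel and toy_get_model are hypothetical stand-ins, not the real unstructured-inference code; the way defaults and caller kwargs are merged is an assumption based only on the `model.initialize(**initialize_params)` line in the traceback.

```python
# Toy stand-ins; these are NOT the real unstructured-inference classes.
class ToyModel:
    def initialize(self, model_path: str, label_map: dict):
        # Strict signature: only the two expected parameters are accepted.
        self.model_path = model_path
        self.label_map = label_map


def toy_get_model(**kwargs):
    """Merge (assumed) default init params with caller kwargs and forward them all."""
    initialize_params = {
        "model_path": "model.onnx",   # placeholder value
        "label_map": {0: "Text"},     # placeholder value
        **kwargs,                     # unrelated options leak in here
    }
    model = ToyModel()
    model.initialize(**initialize_params)
    return model


error_message = None
try:
    # extract_images_in_pdf is meant for the partitioner, not for the model.
    toy_get_model(extract_images_in_pdf=True)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

The printed message names the unexpected keyword argument, just as the real traceback does.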

Solution

In unstructured-inference/unstructured_inference/models/yolox.py, line 67, this can be solved by adding **kwargs to the initialize method, like this:

# original
    def initialize(self, model_path: str, label_map: dict):
# new
    def initialize(self, model_path: str, label_map: dict, **kwargs):

The same thing happens in

  1. unstructured_inference/models/super_gradients.py line 35:
    def initialize(
        self,
        model_arch: str,
        model_path: str,
        dataset_yaml_path: str,
        callback: Callable[[np.ndarray, "_sgmodels.sg_module.SgModule"], "sv.Detections"],
    ):
  2. unstructured_inference/models/detectron2.py line 35:
    def initialize(
        self,
        config_path: Union[str, Path, LayoutModelConfig],
        model_path: Optional[Union[str, Path]] = None,
        label_map: Optional[Dict[int, str]] = None,
        extra_config: Optional[list] = None,
        device: Optional[str] = None,
    ):
  3. unstructured_inference/models/chipper.py line 51,
  4. unstructured_inference/models/detectron2onnx.py line 93.
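
As a rough illustration of the proposal above, the sketch below compares two ways of tolerating extra keyword arguments: adding **kwargs to initialize (the change suggested in this comment), and filtering the arguments against the method signature before the call. TolerantModel, StrictModel and filter_kwargs are hypothetical names made up for this example; the filtering variant is an alternative, not something the library is known to do.

```python
import inspect


class TolerantModel:
    # Suggested fix: accept and silently ignore unknown keyword arguments.
    def initialize(self, model_path: str, label_map: dict, **kwargs):
        self.model_path = model_path
        self.label_map = label_map


class StrictModel:
    # Unchanged strict signature.
    def initialize(self, model_path: str, label_map: dict):
        self.model_path = model_path
        self.label_map = label_map


def filter_kwargs(func, params: dict) -> dict:
    """Keep only the entries of params that func can accept by name."""
    sig = inspect.signature(func)
    accepts_any = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in sig.parameters.values()
    )
    if accepts_any:
        # func already takes **kwargs, so nothing needs to be dropped.
        return dict(params)
    return {k: v for k, v in params.items() if k in sig.parameters}


params = {
    "model_path": "model.onnx",      # placeholder value
    "label_map": {0: "Text"},        # placeholder value
    "extract_images_in_pdf": True,   # the stray option from the traceback
}

tolerant = TolerantModel()
tolerant.initialize(**params)  # works because of **kwargs

strict = StrictModel()
strict.initialize(**filter_kwargs(strict.initialize, params))  # works after filtering
```

Accepting **kwargs is the smaller change, but it also hides misspelled argument names. Filtering at the call site keeps the model signatures strict at the cost of an extra helper.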

github-merge-queue bot pushed a commit to Unstructured-IO/unstructured that referenced this issue Oct 23, 2023
Closes `unstructured-inference` issue
[#265](Unstructured-IO/unstructured-inference#265).

Cleaned up the kwarg handling, taking opportunities to turn instances of
handling kwargs as dicts to just using them as normal in function
signatures.

#### Testing:

Should just pass CI.
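
One possible reading of "using them as normal in function signatures" is sketched below. It is not the actual change from the referenced commit: toy_partition and toy_load_model are hypothetical functions. The point is that options named explicitly in a signature are consumed where they are declared, so they can no longer travel through **kwargs into a model initializer.

```python
from typing import Optional


def toy_load_model(model_name: Optional[str] = None) -> str:
    # Hypothetical model loader that accepts only what it needs.
    return model_name or "default"


def toy_partition(
    filename: str,
    extract_images_in_pdf: bool = False,
    image_output_dir_path: Optional[str] = None,
    model_name: Optional[str] = None,
    **kwargs,
) -> dict:
    # Image options are named parameters here, so they are bound locally
    # instead of ending up in **kwargs and being forwarded downstream.
    model = toy_load_model(model_name=model_name)
    return {
        "filename": filename,
        "model": model,
        "extract_images": extract_images_in_pdf,
        "image_dir": image_output_dir_path,
        "unused": kwargs,  # whatever was not claimed by the signature
    }


result = toy_partition(
    "LLaVA.pdf",
    extract_images_in_pdf=True,
    image_output_dir_path="./",
)
```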
@qued
Contributor

qued commented Oct 23, 2023

Resolved by unstructured #1810.

@qued qued closed this as completed Oct 23, 2023