BUG: TypeError: UnstructuredYoloXModel.initialize() got an unexpected keyword argument 'extract_images_in_pdf' #265

2710932616 · 2023-10-20T18:14:14Z

Question

Running the official LangChain example Semi_structured_multi_modal_RAG_LLaMA2 fails with a TypeError: the extract_images_in_pdf and image_output_dir_path parameters passed to partition_pdf are not accepted further down the call chain.

Env

python 3.11.4
langchain 0.0.319
unstructured 0.10.24
unstructured-inference 0.7.7
unstructured.pytesseract 0.3.12

Code

from unstructured.partition.pdf import partition_pdf

# Path to save images
path = "./"

# Get elements
raw_pdf_elements = partition_pdf(filename=path+"LLaVA.pdf",
                                 # Using pdf format to find embedded image blocks
                                 extract_images_in_pdf=True,
                                 # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
                                 # Titles are any sub-section of the document 
                                 infer_table_structure=True, 
                                 # Post processing to aggregate text once we have the title 
                                 chunking_strategy="by_title",
                                 # Chunking params to aggregate text blocks
                                 # Attempt to create a new chunk 3800 chars
                                 # Attempt to keep chunks > 2000 chars 
                                 # Hard max on chunks
                                 max_characters=4000, 
                                 new_after_n_chars=3800, 
                                 combine_text_under_n_chars=2000,
                                 image_output_dir_path=path)

Traceback

      4 path = "./"
      6 # Get elements
----> 7 raw_pdf_elements = partition_pdf(filename=path+"LLaVA.pdf",
      8                                  # Using pdf format to find embedded image blocks
      9                                  extract_images_in_pdf=True,
     10                                  # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
     11                                  # Titles are any sub-section of the document 
     12                                  infer_table_structure=True, 
     13                                  # Post processing to aggregate text once we have the title 
     14                                  chunking_strategy="by_title",
     15                                  # Chunking params to aggregate text blocks
     16                                  # Attempt to create a new chunk 3800 chars
     17                                  # Attempt to keep chunks > 2000 chars 
     18                                  # Hard max on chunks
     19                                  max_characters=4000, 
     20                                  new_after_n_chars=3800, 
     21                                  combine_text_under_n_chars=2000,
     22                                  image_output_dir_path=path)

File ~/miniconda3/lib/python3.11/site-packages/unstructured/documents/elements.py:306, in process_metadata.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    304 @functools.wraps(func)
    305 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
--> 306     elements = func(*args, **kwargs)
    307     sig = inspect.signature(func)
    308     params: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)

File ~/miniconda3/lib/python3.11/site-packages/unstructured/file_utils/filetype.py:551, in add_metadata_with_filetype.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    549 @functools.wraps(func)
    550 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
--> 551     elements = func(*args, **kwargs)
    552     sig = inspect.signature(func)
    553     params: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)

File ~/miniconda3/lib/python3.11/site-packages/unstructured/chunking/title.py:277, in add_chunking_strategy.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    275 @functools.wraps(func)
    276 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
--> 277     elements = func(*args, **kwargs)
    278     sig = inspect.signature(func)
    279     params: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)

File ~/miniconda3/lib/python3.11/site-packages/unstructured/partition/pdf.py:157, in partition_pdf(filename, file, include_page_breaks, strategy, infer_table_structure, ocr_languages, languages, max_partition, min_partition, include_metadata, metadata_filename, metadata_last_modified, chunking_strategy, links, **kwargs)
    151         languages = convert_old_ocr_languages_to_languages(ocr_languages)
    152         logger.warning(
    153             "The ocr_languages kwarg will be deprecated in a future version of unstructured. "
    154             "Please use languages instead.",
    155         )
--> 157 return partition_pdf_or_image(
    158     filename=filename,
    159     file=file,
    160     include_page_breaks=include_page_breaks,
    161     strategy=strategy,
    162     infer_table_structure=infer_table_structure,
    163     languages=languages,
    164     max_partition=max_partition,
    165     min_partition=min_partition,
    166     metadata_last_modified=metadata_last_modified,
    167     **kwargs,
    168 )

File ~/miniconda3/lib/python3.11/site-packages/unstructured/partition/pdf.py:287, in partition_pdf_or_image(filename, file, is_image, include_page_breaks, strategy, infer_table_structure, ocr_languages, languages, max_partition, min_partition, metadata_last_modified, **kwargs)
    285 with warnings.catch_warnings():
    286     warnings.simplefilter("ignore")
--> 287     _layout_elements = _partition_pdf_or_image_local(
    288         filename=filename,
    289         file=spooled_to_bytes_io_if_needed(file),
    290         is_image=is_image,
    291         infer_table_structure=infer_table_structure,
    292         include_page_breaks=include_page_breaks,
    293         languages=languages,
    294         metadata_last_modified=metadata_last_modified or last_modification_date,
    295         **kwargs,
    296     )
    297     layout_elements = []
    298     for el in _layout_elements:

File ~/miniconda3/lib/python3.11/site-packages/unstructured/utils.py:178, in requires_dependencies.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    169 if len(missing_deps) > 0:
    170     raise ImportError(
    171         f"Following dependencies are missing: {', '.join(missing_deps)}. "
    172         + (
   (...)
    176         ),
    177     )
--> 178 return func(*args, **kwargs)

File ~/miniconda3/lib/python3.11/site-packages/unstructured/partition/pdf.py:377, in _partition_pdf_or_image_local(filename, file, is_image, infer_table_structure, include_page_breaks, languages, ocr_mode, model_name, metadata_last_modified, **kwargs)
    373         process_with_model_kwargs[key] = value
    375 if file is None:
    376     # NOTE(christine): out_layout = extracted_layout + inferred_layout
--> 377     out_layout = process_file_with_model(
    378         filename,
    379         is_image=is_image,
    380         extract_tables=infer_table_structure,
    381         model_name=model_name,
    382         pdf_image_dpi=pdf_image_dpi,
    383         **process_with_model_kwargs,
    384     )
    385     if model_name.startswith("chipper"):
    386         # NOTE(alan): We shouldn't do OCR with chipper
    387         final_layout = out_layout

File ~/miniconda3/lib/python3.11/site-packages/unstructured_inference/inference/layout.py:481, in process_file_with_model(filename, model_name, is_image, fixed_layouts, extract_tables, pdf_image_dpi, **kwargs)
    469 def process_file_with_model(
    470     filename: str,
    471     model_name: Optional[str],
   (...)
    476     **kwargs,
    477 ) -> DocumentLayout:
    478     """Processes pdf file with name filename into a DocumentLayout by using a model identified by
    479     model_name."""
--> 481     model = get_model(model_name, **kwargs)
    482     if isinstance(model, UnstructuredObjectDetectionModel):
    483         detection_model = model

File ~/miniconda3/lib/python3.11/site-packages/unstructured_inference/models/base.py:73, in get_model(model_name, **kwargs)
     71 else:
     72     raise UnknownModelException(f"Unknown model type: {model_name}")
---> 73 model.initialize(**initialize_params)
     74 models[model_name] = model
     75 return model

TypeError: UnstructuredYoloXModel.initialize() got an unexpected keyword argument 'extract_images_in_pdf'


2710932616 · 2023-10-20T18:29:07Z

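Reason

Extra keyword arguments passed down the call chain are not handled: they are forwarded unchanged to the model's initialize() method, which does not accept them.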
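
The failure can be reproduced in isolation. The sketch below uses hypothetical stand-ins (`FakeModel`, `get_model_sketch`, `partition_sketch`), not the real unstructured or unstructured-inference code, and only illustrates how a keyword argument forwarded through several `**kwargs` layers ends up at an `initialize()` that rejects it.

```python
class FakeModel:
    """Hypothetical stand-in for a model whose initialize() has a fixed signature."""

    def initialize(self, model_path: str, label_map: dict):
        self.model_path = model_path
        self.label_map = label_map


def get_model_sketch(**kwargs):
    """Hypothetical loader: merges default parameters with whatever the caller passed."""
    initialize_params = {"model_path": "model.onnx", "label_map": {}, **kwargs}
    model = FakeModel()
    # Any key that initialize() does not declare raises TypeError here.
    model.initialize(**initialize_params)
    return model


def partition_sketch(**kwargs):
    """Hypothetical top-level function that forwards unknown kwargs downward."""
    return get_model_sketch(**kwargs)


try:
    partition_sketch(extract_images_in_pdf=True)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

The exact prefix of the message depends on the Python version, but it names the offending keyword in the same way as the traceback above.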

Solution

In unstructured-inference/unstructured_inference/models/yolox.py, line 67, this can be solved by adding **kwargs to the initialize method, like this:

# original
    def initialize(self, model_path: str, label_map: dict):
# new
    def initialize(self, model_path: str, label_map: dict, **kwargs):
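
To illustrate the effect of that change, here is a minimal sketch with a hypothetical class (`TolerantModel`), not the real `UnstructuredYoloXModel`. Once `initialize()` declares `**kwargs`, arguments it does not know about are collected there instead of raising. The `ignored` attribute exists only to make that visible.

```python
class TolerantModel:
    """Hypothetical model whose initialize() tolerates extra keyword arguments."""

    def initialize(self, model_path: str, label_map: dict, **kwargs):
        self.model_path = model_path
        self.label_map = label_map
        # Record which extra arguments were swallowed (for illustration only).
        self.ignored = sorted(kwargs)


model = TolerantModel()
model.initialize(
    model_path="model.onnx",
    label_map={0: "Text"},
    extract_images_in_pdf=True,
    image_output_dir_path="./",
)
print(model.ignored)
```

One trade-off of this approach is that misspelled argument names are also silently accepted, so a typo no longer produces an error.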

The same thing happens in

unstructured_inference/models/super_gradients.py line 35:

    def initialize(
        self,
        model_arch: str,
        model_path: str,
        dataset_yaml_path: str,
        callback: Callable[[np.ndarray, "_sgmodels.sg_module.SgModule"], "sv.Detections"],
    ):

unstructured_inference/models/detectron2.py line 35:

    def initialize(
        self,
        config_path: Union[str, Path, LayoutModelConfig],
        model_path: Optional[Union[str, Path]] = None,
        label_map: Optional[Dict[int, str]] = None,
        extra_config: Optional[list] = None,
        device: Optional[str] = None,
    ):

unstructured_inference/models/chipper.py line 51,
unstructured_inference/models/detectron2onnx.py line 93.
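
Since several initialize methods are affected, an alternative to editing each one would be to filter the arguments once at the call site. The sketch below is an assumption for illustration, not what the library does: `call_with_supported_kwargs` and `StrictModel` are hypothetical names, and the helper relies only on the standard-library `inspect.signature`.

```python
import inspect
from typing import Any, Callable


def call_with_supported_kwargs(func: Callable[..., Any], **kwargs: Any) -> Any:
    """Call func with only the keyword arguments its signature declares."""
    params = inspect.signature(func).parameters
    accepts_var_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if accepts_var_kwargs:
        # func already takes **kwargs, so nothing needs to be dropped.
        return func(**kwargs)
    supported = {name: value for name, value in kwargs.items() if name in params}
    return func(**supported)


class StrictModel:
    """Hypothetical model with a fixed initialize() signature."""

    def initialize(self, model_path: str, label_map: dict):
        self.model_path = model_path
        self.label_map = label_map


strict = StrictModel()
call_with_supported_kwargs(
    strict.initialize,
    model_path="model.onnx",
    label_map={},
    extract_images_in_pdf=True,  # dropped: not declared by initialize()
)
```

This keeps each model's signature strict, at the cost of hiding which arguments were discarded unless the helper logs them.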

Closes `unstructured-inference` issue #265 (Unstructured-IO/unstructured-inference#265). Cleaned up the kwarg handling, where possible replacing code that handled kwargs as dicts with ordinary parameters in function signatures.

Testing: Should just pass CI.

qued · 2023-10-23T13:57:23Z

Resolved by unstructured #1810.

2710932616 changed the title ~~TypeError: UnstructuredYoloXModel.initialize() got an unexpected keyword argument 'extract_images_in_pdf'~~ BUG: TypeError: UnstructuredYoloXModel.initialize() got an unexpected keyword argument 'extract_images_in_pdf' Oct 20, 2023

This was referenced Oct 20, 2023

fix:excess transfer parameters are not processed #266

Closed

bug/unstructured-inference bug: TypeError: UnstructuredYoloXModel.initialize() got an unexpected keyword argument 'extract_images_in_pdf' Unstructured-IO/unstructured#1834

Closed

qued mentioned this issue Oct 22, 2023

chore: improve kwarg handling Unstructured-IO/unstructured#1810

Merged

qued closed this as completed Oct 23, 2023
