No inferences found by Textract #87

Closed
nekogeko opened this issue Jan 13, 2025 · 7 comments
Labels
bug Something isn't working

Comments

@nekogeko

Describe the bug
When processing the sample documents packaged with the solution, files are uploaded but no raw data is visible after a file has been processed. Key-value pairs are visible.

To Reproduce
Deploy the application using one of the provided workflows (single-doc-textract.json, default.json), create a case, and upload the management.png document with a document type (generic or Passport, depending on the chosen workflow config). Manually start the job and wait for the case status to show that processing is complete. Then go to the playground view to see the extracted data.

Expected behavior
Since Textract is in the selected workflow, data is expected to be displayed in the Raw data section.

Please complete the following information about the solution:

  • Version: 1.1.7
  • Region: us-east-1
  • Was the solution modified from the version published on this repository? No
  • If the answer to the previous question was yes, are the changes available on GitHub? No
  • Have you checked your service quotas for the services this solution uses? No
  • Were there any errors in the CloudWatch Logs? No errors seen


@nekogeko nekogeko added the bug Something isn't working label Jan 13, 2025
@knihit
Member

knihit commented Jan 13, 2025

@nekogeko thank you for reaching out. Can you please confirm whether you are uploading the documents directly to the S3 bucket or through the UI?

A few things to check in addition to the above:

  1. The solution creates a few Step Functions. Can you please check whether all of them execute without failures?
  2. There is a Lambda function at https://github.com/aws-solutions/enhanced-document-understanding-on-aws/tree/main/source/lambda/text-extract. Can you check the CloudWatch logs for this Lambda to see whether it has any errors? You can also set LOG_LEVEL to DEBUG in the Lambda's environment variables to get more verbose logging.

@nekogeko
Author

Hi,

I'm uploading the document from the UI. I've been testing with management.png, document type generic.

The CloudWatch logs for the Textract Lambda do not show any errors:

2025-01-14T16:02:07.501Z INIT_START Runtime Version: nodejs:20.v51 Runtime Version ARN: arn:aws:lambda:us-east-1::runtime:cb6527bfb6726a080a367eca00e49765ca5abd8cd1a17783fbee683313121ece
2025-01-14T16:02:08.575Z START RequestId: 2b14993e-a95f-5b93-8bdc-f93e25e1973c Version: $LATEST
2025-01-14T16:02:08.577Z 2b14993e-a95f-5b93-8bdc-f93e25e1973c DEBUG S3_MULTI_PAGE_PDF_PREFIX is: multi-page-pdf
2025-01-14T16:02:09.585Z 2b14993e-a95f-5b93-8bdc-f93e25e1973c INFO namespace received: Workflows
2025-01-14T16:02:12.824Z 2b14993e-a95f-5b93-8bdc-f93e25e1973c INFO Publishing cw metrics with params: {"MetricData":[{"MetricName":"TextractWorkflow","Dimensions":[{"Name":"TextractAPI","Value":"Textract-DetectTextSync"},{"Name":"serviceName","Value":"eDUS-2cb98cb7"}],"Timestamp":"2025-01-14T16:02:12.824Z","Unit":"Count","Value":1}],"Namespace":"Workflows"}
2025-01-14T16:02:13.264Z 2b14993e-a95f-5b93-8bdc-f93e25e1973c INFO Published cw metrics to Workflows.
2025-01-14T16:02:13.264Z 2b14993e-a95f-5b93-8bdc-f93e25e1973c DEBUG Textract Sync - Processing Generic Document Type for taskToken AQDEAAAAKgAAAAMAAAAAAAAAAQVEHY50GiW8OK+LNCCV9FGoLbR416Vy9Lf968CemDKwGgycSwFibsDQ9q8Ctz9unGOr7GXgNjlp2CRqsnM5gM8TgsaqOnv/nlZ+6SeCSPkJpGFzEexM8IE+2EaZgCR/2f6pzsulspjiLQs6uqVGspOdo7FHtbReh4T6RVskBT8smA==Mp2FJq+z6+LIujWTpZttI5fS+JbeFdWy98NVMCKU+9gefMO/9JgU3VoihOINvLCWqMIlk5A6nkKL5pqUqQWSDhdm9P5rnS+KkLBlOd0SmLNObzxTM6/FLWcM4T5cis0xy9Z7vyNNMs5eGN9Ov23oHb1Cd2BoJad4rLK1eKikOBEHY9XBQBqoeWt+7q9Bti+JntSU5PHC68Zb0Kw/qhNJVCr0eoNiDtwtLm0SUD+1CJTbRLTwt4XddvR2ZIgrzPqOR7YmhNp65Mcm9qvy4B5yCkw7zH9sGylvkeUgEo49LnVXnRbStkI5TP7pYz+P2WVgg3jJ2VYkcTu7b0E62UVxTWvHVnKNyPfm7ramyYrTEa1PaQrqX94qiv3PXeBrali523y8OefYFp2YwFQbAyw2plF5vXS2+jXOqREWwW/mS1IG5gpKn0jUaxqz2B/Ak468pT0tRTKVzVE8TS/aMEFYOcXNWooezg6YPH8vcSui0/n197waUqurceGyHceDh+SAyHRCvgAH6yGtEamYv6L4
2025-01-14T16:02:13.264Z 2b14993e-a95f-5b93-8bdc-f93e25e1973c INFO Textract AnalyzeDocument request parameters missing or invalid. Using defaults
2025-01-14T16:02:13.265Z 2b14993e-a95f-5b93-8bdc-f93e25e1973c INFO namespace received: Workflows
2025-01-14T16:02:17.504Z 2b14993e-a95f-5b93-8bdc-f93e25e1973c INFO Publishing cw metrics with params: {"MetricData":[{"MetricName":"TextractWorkflow","Dimensions":[{"Name":"TextractAPI","Value":"Textract-AnalyzeDocumentSync"},{"Name":"serviceName","Value":"eDUS-2cb98cb7"}],"Timestamp":"2025-01-14T16:02:17.504Z","Unit":"Count","Value":1}],"Namespace":"Workflows"}
2025-01-14T16:02:17.704Z 2b14993e-a95f-5b93-8bdc-f93e25e1973c INFO Published cw metrics to Workflows.
2025-01-14T16:02:18.485Z 2b14993e-a95f-5b93-8bdc-f93e25e1973c DEBUG S3_INFERENCE_BUCKET_NAME is: docunderstanding-requestprocessorinferences13166f8-pa8r00l6y4cg
2025-01-14T16:02:19.385Z 2b14993e-a95f-5b93-8bdc-f93e25e1973c DEBUG CASE_DDB_TABLE_NAME is: DocUnderstanding-RequestProcessorCaseManagerCreateRecordsLambdaDDbDynamoTable94F42CFC-1E7F3H9RWF15
END RequestId: 2b14993e-a95f-5b93-8bdc-f93e25e1973c
REPORT RequestId: 2b14993e-a95f-5b93-8bdc-f93e25e1973c Duration: 11168.38 ms Billed Duration: 11169 ms Memory Size: 128 MB Max Memory Used: 103 MB Init Duration: 1070.70 ms XRAY TraceId: 1-67868a7f-80b8fb5a3b725a63520242f0 SegmentId: f7fb34c2aceeda5c Sampled: true

The entry in DynamoDB has a case status set to success

I see inferences being created in S3 under a folder hierarchy that looks like

docunderstanding-requestprocessorinferences13166f8-pa8r00l6y4cg
    /nekogeko:f40fa734-dda2-45b7-b365-ddecbbd0bd4d
        /doc-d915c25d-f274-42b6-a96e-233072068ee4

            textract-detectText.json

Alongside this file there are other files, including entity-*.json and textract-analyze.json.
The two Textract files have data in them, with node elements such as
{ "BlockType": "WORD", "Confidence": 96.8800277709961, "Text": "REPORT", "TextType": "PRINTED", "Geometry": { "BoundingBox": { "Width": 0.06701218336820602, "Height": 0.007605433464050293, "Left": 0.5365458130836487, "Top": 0.9628257751464844 }, "Polygon": [ { "X": 0.5365458130836487, "Y": 0.9628257751464844 }, { "X": 0.6035550832748413, "Y": 0.9628543853759766 }, { "X": 0.6035580039024353, "Y": 0.9704312086105347 }, { "X": 0.5365484356880188, "Y": 0.970402717590332 } ] }

The inference files are requested and retrieved by the UI, and the key-value pairs tab shows 10 key-value pairs found, but the Raw Text section says "No Raw Text detected".

@knihit
Member

knihit commented Jan 14, 2025

Thank you for the additional detail. Investigating the issue.

@nekogeko
Author

nekogeko commented Jan 15, 2025

It appears that the issue may be in the JavaScript code in document.js that retrieves the number of pages from the back-end response.

@OmarRad
Member

OmarRad commented Jan 15, 2025

@nekogeko thank you for reaching out. You're right, getDocumentPageCount() is returning undefined in document.ts. This is due to a mistake in the reducer logic in inferenceApiSlice.ts causing the textractDetectResponse object to look like { data: detectTextResponse } instead of simply being detectTextResponse as was expected by the rest of the code.
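
The shape mismatch described here can be reproduced in isolation. The sketch below is illustrative only: `DetectTextLike` and `pageCountOf` are hypothetical stand-ins (not the solution's actual types or its getDocumentPageCount() implementation), and it assumes the page count is read from a `DocumentMetadata.Pages` field of the response.

```typescript
// Hypothetical, simplified response type; the real type in the solution may differ.
interface DetectTextLike {
    DocumentMetadata?: { Pages?: number };
}

// Hypothetical page-count helper that expects the bare response object.
function pageCountOf(response: unknown): number | undefined {
    return (response as DetectTextLike | undefined)?.DocumentMetadata?.Pages;
}

const detectTextResponse: DetectTextLike = { DocumentMetadata: { Pages: 1 } };

// Shape the rest of the code expects: the bare response.
const expectedCount = pageCountOf(detectTextResponse);

// Shape produced by the bug: the response nested under `data`.
const wrappedCount = pageCountOf({ data: detectTextResponse });

console.log(expectedCount, wrappedCount);
```

With the nested shape the lookup finds no `DocumentMetadata` at the top level, so the helper yields undefined, which matches the symptom reported for the page count.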

While we work on releasing the fix for this, we will share it with you here so that you can add it to your code in the meantime.

In inferenceApiSlice.ts in the following lines 25-28:

                if (validInferences.includes(InferenceName.TEXTRACT_DETECT_TEXT)) {
                    unformattedtextractDetectResponse = await baseQuery(
                        `${INFERENCES_PATH}/${arg.selectedCaseId}/${arg.selectedDocumentId}/${InferenceName.TEXTRACT_DETECT_TEXT}`
                    );
                }

you'll need to make these changes:

Line 26

- unformattedtextractDetectResponse = await baseQuery(
+ const response = await baseQuery(

and after line 28 add

+ unformattedtextractDetectResponse = response.data as any;

I've attached a screenshot of what this change should look like
image (1)
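
For readers applying the patch by hand, here is a minimal, self-contained sketch of the same unwrapping step. It assumes only what the fix implies, namely that baseQuery resolves to an object carrying the payload under `data`; the `error` field, `unwrapData`, `stubBaseQuery`, `loadDetectText`, and the `DetectTextLike` type are hypothetical and are not part of the repository.

```typescript
// Assumed result shape: the payload sits under `data`, a failure under `error`.
interface QueryResult {
    data?: unknown;
    error?: unknown;
}

// Hypothetical, simplified payload type for illustration.
interface DetectTextLike {
    DocumentMetadata: { Pages: number };
}

// Hypothetical helper: return the payload, or throw if the request failed.
function unwrapData<T>(result: QueryResult): T {
    if (result.error !== undefined) {
        throw new Error(`inference request failed: ${JSON.stringify(result.error)}`);
    }
    return result.data as T;
}

// Hypothetical stub standing in for the real baseQuery, for demonstration only.
async function stubBaseQuery(path: string): Promise<QueryResult> {
    return { data: { DocumentMetadata: { Pages: 1 }, requestedPath: path } };
}

// Mirrors the patched flow: await the result, then keep only its `data`.
async function loadDetectText(path: string): Promise<DetectTextLike> {
    const response = await stubBaseQuery(path);
    return unwrapData<DetectTextLike>(response);
}
```

Compared with `response.data as any`, a helper like this also surfaces a failed request instead of silently storing undefined; whether that is desirable in the reducer is a design choice for the maintainers.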

@nekogeko
Author

thanks, I will test this and get back to you

@nekogeko
Author

I confirm that the issue is resolved
