Can anyone explain the mapping of plone.richtext behavior? #35

Open
5 of 6 tasks
1letter opened this issue Jan 15, 2025 · 4 comments
Labels
help wanted (Extra attention is needed), question (Further information is requested)

Comments

1letter commented Jan 15, 2025

I have set up the OpenSearch container with the ingest plugin, and I have set up collective.elastic.ingest locally in a Python venv.
I have also set up a simple Plone 6.1 site: not multilingual, language German, no content, with collective.elastic.plone installed. Communication between the instances works.

I use the mappings.json file from the example docker-os directory in this package.

I add a Page with:

  • title: Himbeere
  • description: Birne
  • richtext: Apfel

I add a PDF File with:

  • title: Rot
  • description: Gelb
  • the PDF contains only one word: Grün

Now I use a REST client for easier debugging and send a request to http://localhost:9200/plone/_search

{
  "_source": true,
  "query": {
    "multi_match": {
      "query": "the word i search",
      "fields": [
        "title*^1.9",
        "description*^1.5",
        "file__extracted.content",
        "text__extracted.content"
      ],
      "analyzer": "german",
      "operator": "or",
      "fuzziness": "AUTO",
      "prefix_length": 2,
      "type": "most_fields",
      "minimum_should_match": "80%"
    }
  }
}
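For anyone who wants to repeat these tests without a REST client, here is a minimal Python sketch that builds the same query body and posts it to the endpoint above. The helper names (`build_query`, `search`, `hit_count`) are made up for illustration, and the sketch assumes the cluster accepts unauthenticated requests on localhost.

```python
import json
import urllib.request

# Endpoint taken from the issue; change it if your index or port differs.
SEARCH_URL = "http://localhost:9200/plone/_search"


def build_query(term, fields, minimum_should_match="80%"):
    """Return the multi_match request body used in this issue as a dict."""
    return {
        "_source": True,
        "query": {
            "multi_match": {
                "query": term,
                "fields": list(fields),
                "analyzer": "german",
                "operator": "or",
                "fuzziness": "AUTO",
                "prefix_length": 2,
                "type": "most_fields",
                "minimum_should_match": minimum_should_match,
            }
        },
    }


def search(term, fields, url=SEARCH_URL):
    """POST the query and return the decoded JSON response (needs a running cluster)."""
    body = json.dumps(build_query(term, fields)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def hit_count(result):
    """Read the total number of hits from a search response."""
    return result["hits"]["total"]["value"]
```

Calling `hit_count(search("Apfel", ["text__extracted.content"]))` against a running instance would then show directly whether a single field matches the term.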

My search tests:

  • Rot
  • Gelb
  • Grün
  • Himbeere
  • Birne
  • Apfel -> no hits

When I inspect the result for the term "Himbeere" (the Plone page), I can see the term "Apfel", but not as plain text: the field text__extracted.content still contains the HTML.

{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 2.491389,
    "hits": [
      {
        "_source": {          
          "text__extracted": {
            "content_type": "text/plain; charset=ISO-8859-1",
            "language": "mt",
            "content": "<p>Apfel</p>",
            "content_length": 13
          },
          "text": {
            "data": "<p>Apfel</p>",
            "content-type": "text/html",
            "encoding": "utf-8"
          }
        }
      }
    ]
  }
}

When I inspect the result for the term "grün" (the PDF file in my Plone site), I see the term "Grün" in the field file__extracted.content.

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.6019437,
    "hits": [
      {
        "_source": {          
          "file__extracted": {
            "content_type": "text/plain; charset=UTF-8",
            "language": "de",
            "content": "Grün",
            "content_length": 6
          },
          "file": {
            "download": "http://carusnet.local/farben.pdf/@@download/file",
            "filename": "farben.pdf",
            "size": 6,
            "content-type": "application/pdf"
          }
        }
      }
    ]
  }
}

Two problems:

  • The term in the rich text field is not found.
  • Shouldn't the HTML markup be stripped from the ‘text__extracted.content’ field? Perhaps this would solve the first problem.

Any hints, @jensens or @ksuess?

1letter added the help wanted and question labels Jan 15, 2025
Member

jensens commented Jan 15, 2025

Indeed, text__extracted should not contain any markup. The OpenSearch ingest-attachment plugin (installed in https://github.com/collective/collective.elastic.ingest/blob/main/examples/docker-os/Dockerfile) should convert the HTML to text/plain (it claims the content is text/plain, but it is not). I guess this is the source of the problem. At the moment I have no good idea what actually went wrong here; I have never run into this problem myself.
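To make the expected result concrete, here is a small Python sketch of what the content field should roughly look like once the markup is gone. It uses only the standard library's `html.parser` and is not what the ingest-attachment plugin does internally; `TextExtractor` and `strip_html` are hypothetical names for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment and drop all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Return the text of `markup` with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())


# The value from the search result above, as it should ideally be indexed.
print(strip_html("<p>Apfel</p>"))
```

Note that this sketch joins text nodes without a separator, so adjacent block elements such as `<p>a</p><p>b</p>` would run together; it is only meant as a comparison aid, not as a replacement for the pipeline.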

Author

1letter commented Jan 15, 2025

I found a working solution:

The mappings.json needs a new html_strip processor in the pipeline of the plone.app.textfield.RichText definition, plus a new target field for the result:

"processors": [
    {
        "attachment": {
            "field": "{source}",
            "target_field": "{target}",
            "ignore_missing": true
        }
    },
    {
        "html_strip": {
            "field": "{target}.content",
            "target_field": "stripped_text"
        }
    },
    {
        "remove": {
            "field": "{source}",
            "ignore_missing": true
        }
    }
]
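One way to check the html_strip step in isolation, without reindexing, might be the ingest simulate endpoint (`POST /_ingest/pipeline/_simulate`). The Python sketch below only builds the request body. It assumes that `{target}` resolves to `text__extracted`, as seen in the search results in this thread, and it leaves out the attachment processor so that no base64 input is needed. `build_simulate_body` is a hypothetical helper name.

```python
import json


def build_simulate_body(html, target="text__extracted"):
    """Build a request body for POST /_ingest/pipeline/_simulate.

    Only the html_strip step is simulated; `target` stands in for the
    {target} placeholder of mappings.json (an assumption, see above).
    """
    return {
        "pipeline": {
            "processors": [
                {
                    "html_strip": {
                        "field": f"{target}.content",
                        "target_field": "stripped_text",
                    }
                }
            ]
        },
        # One sample document that already contains the extracted markup.
        "docs": [{"_source": {target: {"content": html}}}],
    }


print(json.dumps(build_simulate_body("<p>Apfel</p>"), indent=2))
```

Sending the printed JSON to the simulate endpoint should return the sample document with an added `stripped_text` field if the processor behaves as described here.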

The search query:

{
  "_source": true,
  "query": {
    "multi_match": {
      "query": "Obst",
      "fields": [
        "title*^1.9",
        "description*^1.5",
        "stripped_text*^1.1",
        "file__extracted.content"
      ],
      "analyzer": "german",
      "operator": "or",
      "fuzziness": "AUTO",
      "prefix_length": 2,
      "type": "most_fields",
      "minimum_should_match": "30%"
    }
  }
}

Member

jensens commented Jan 15, 2025

Interesting! Could you provide a PR for this?

Author

1letter commented Jan 15, 2025

Yes, I can, but I think more investigation is needed. My search result has this structure:

{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.12654667,
    "hits": [
      {
        "_index": "plone",
        "_id": "314ffcf4deb642a3bb39396ff7eaded0",
        "_score": 0.12654667,
        "_source": {
          "stripped_text": "\nKirsche Apfel und noch mehr Obst. Aber kein Gemüse! Auto, Benz und Fahrrad\n",
          "creators": [
            "admin"
          ],
          "description": "Banane",
          "language": "de",
          "section": "suchseite",
          "title": "Himbeere3",
          "rid": -1235753158,
          "portal_type": "Document",
          "text__extracted": {
            "content_type": "text/plain; charset=UTF-8",
            "language": "de",
            "content": "<p>Kirsche <strong>Apfel</strong> und noch mehr Obst. Aber kein <strong>Gemüse</strong>! Auto, Benz und Fahrrad</p>",
            "content_length": 116
          },
          "effective": "2025-01-14T13:17:00",
          "allow_discussion": false,
          "modified": "2025-01-15T13:50:07+00:00",
          "@id": "http://carusnet.local/suchseite",
          "id": "suchseite",
          "text": {
            "data": "<p>Kirsche <strong>Apfel</strong> und noch mehr Obst. Aber kein <strong>Gemüse</strong>! Auto, Benz und Fahrrad</p>",
            "content-type": "text/html",
            "encoding": "utf-8"
          },
          "created": "2025-01-14T12:17:48+00:00",
          "review_state": "published",
          "is_folderish": false,
          "layout": "document_view",
          "UID": "314ffcf4deb642a3bb39396ff7eaded0",
          "type_title": "Seite",
          "allowedRolesAndUsers": [
            "Anonymous"
          ],
          "exclude_from_nav": false
        }
      }
    ]
  }
}

If I search for the term Gemüse, no hits are returned. That's a little bit weird. Perhaps the umlaut is not handled correctly, or an analyzer file for German is missing. I will do more tests before I make a PR.

However, "file__extracted.content*" or "text__extracted.content*" is definitely wrong; that should be corrected in collective.elastic.plone.
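Two quick checks might help narrow down the umlaut question. The Python sketch below tests whether a string uses a decomposed "ü" (u followed by a combining diaeresis), and builds a body for the `_analyze` endpoint (`POST /plone/_analyze`) to see which tokens the german analyzer produces. Whether normalization is actually the cause here is only a guess, and the helper names are made up for this example.

```python
import unicodedata


def is_nfc(text):
    """True if `text` is already in composed (NFC) form."""
    return unicodedata.normalize("NFC", text) == text


def to_nfc(text):
    """Return `text` with decomposed characters composed, e.g. u + U+0308 -> ü."""
    return unicodedata.normalize("NFC", text)


def build_analyze_body(text, analyzer="german"):
    """Build a request body for POST /plone/_analyze."""
    return {"analyzer": analyzer, "text": text}


composed = "Gem\u00fcse"      # "ü" as a single code point
decomposed = "Gemu\u0308se"   # "u" followed by a combining diaeresis

print(is_nfc(composed), is_nfc(decomposed))
print(build_analyze_body(composed))
```

If the indexed text and the query term end up in different forms, they may produce different tokens, so comparing the `_analyze` output for both variants would be a reasonable first test.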
