Can anyone explain the mapping of plone.richtext behavior? #35

1letter · 2025-01-15T07:27:46Z

I have set up the OpenSearch container with the ingest plugin, and I have set up collective.elastic.ingest in a local Python venv.
I have also set up a simple Plone 6.1 site: not multilingual, language German, no content, with collective.elastic.plone installed. Communication between the instances works.

I use the mappings.json file from the example docker-os directory in this package.

I add a Page with:

title: Himbeere
description: Birne
richtext: Apfel

I add a PDF File with:

title: Rot
description: Gelb
the PDF contains only one word: Grün

Now I use a REST client for easier debugging and send a request to http://localhost:9200/plone/_search:

{
  "_source": true,
  "query": {
    "multi_match": {
      "query": "the word i search",
      "fields": [
        "title*^1.9",
        "description*^1.5",
        "file__extracted.content",
        "text__extracted.content"
      ],
      "analyzer": "german",
      "operator": "or",
      "fuzziness": "AUTO",
      "prefix_length": 2,
      "type": "most_fields",
      "minimum_should_match": "80%"
    }
  }
}
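To repeat the request with different search terms outside a REST client, the same body can be built in Python. This is only a minimal sketch: the helper name `build_search_body` is hypothetical, and the field list and parameters are copied unchanged from the query above. It does not contact the server; it only produces the JSON to send.

```python
import json


def build_search_body(term, minimum_should_match="80%"):
    """Return the multi_match request body from the post for one search term."""
    return {
        "_source": True,
        "query": {
            "multi_match": {
                "query": term,
                "fields": [
                    "title*^1.9",
                    "description*^1.5",
                    "file__extracted.content",
                    "text__extracted.content",
                ],
                "analyzer": "german",
                "operator": "or",
                "fuzziness": "AUTO",
                "prefix_length": 2,
                "type": "most_fields",
                "minimum_should_match": minimum_should_match,
            }
        },
    }


if __name__ == "__main__":
    # Print JSON that can be pasted into a REST client and sent to
    # http://localhost:9200/plone/_search
    print(json.dumps(build_search_body("Himbeere"), indent=2))
```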

My search tests:

When I run the query with the term "Himbeere" (the Plone page), I see the term "Apfel" in the result, but not as plain text: the field text__extracted.content still contains the HTML.

{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 2.491389,
    "hits": [
      {
        "_source": {          
          "text__extracted": {
            "content_type": "text/plain; charset=ISO-8859-1",
            "language": "mt",
            "content": "<p>Apfel</p>",
            "content_length": 13
          },
          "text": {
            "data": "<p>Apfel</p>",
            "content-type": "text/html",
            "encoding": "utf-8"
          }
        }
      }
    ]
  }
}

When I run the query with the term "grün" (the PDF file in my Plone site), I see the term "Grün" in the field file__extracted.content.

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.6019437,
    "hits": [
      {
        "_source": {          
          "file__extracted": {
            "content_type": "text/plain; charset=UTF-8",
            "language": "de",
            "content": "Grün",
            "content_length": 6
          },
          "file": {
            "download": "http://carusnet.local/farben.pdf/@@download/file",
            "filename": "farben.pdf",
            "size": 6,
            "content-type": "application/pdf"
          }
        }
      }
    ]
  }
}

Two problems:

The term in the rich text field is not found.
Shouldn't the HTML markup be stripped from the text__extracted.content field? Perhaps that would also solve the first problem.

Any hints, @jensens or @ksuess?

jensens · 2025-01-15T12:30:34Z

Indeed, text__extracted should not contain any markup. The OpenSearch ingest-attachment plugin (installed in https://github.com/collective/collective.elastic.ingest/blob/main/examples/docker-os/Dockerfile) should convert the HTML to text/plain (it claims the content is text/plain, but it is not). I guess this is the source of the problem. I have no good idea at the moment what actually went wrong here. I have never run into this problem myself.

1letter · 2025-01-15T13:54:03Z

I found a working solution:

The pipeline for the plone.app.textfield.RichText definition in mappings.json needs an additional html_strip processor and a new target field for the result:

"processors": [
    {
        "attachment": {
            "field": "{source}",
            "target_field": "{target}",
            "ignore_missing": true
        }
    },
    {
        "html_strip": {
            "field": "{target}.content",
            "target_field": "stripped_text"
        }
    },
    {
        "remove": {
            "field": "{source}",
            "ignore_missing": true
        }
    }
]
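One way to check the html_strip step in isolation, without reindexing content, might be the ingest simulate endpoint (POST /_ingest/pipeline/_simulate). The sketch below only builds the request body. The helper name `build_simulate_body` is hypothetical, and the field name `text__extracted.content` is an assumption about how `{target}.content` expands for the `text` field, inferred from the search results shown earlier in this thread.

```python
import json


def build_simulate_body(html):
    """Build a body for POST /_ingest/pipeline/_simulate that runs only html_strip."""
    return {
        "pipeline": {
            "processors": [
                {
                    "html_strip": {
                        # Assumed expansion of "{target}.content" for the "text" field.
                        "field": "text__extracted.content",
                        "target_field": "stripped_text",
                    }
                }
            ]
        },
        # One sample document whose extracted content still holds the markup.
        "docs": [{"_source": {"text__extracted": {"content": html}}}],
    }


if __name__ == "__main__":
    # Print JSON to send to http://localhost:9200/_ingest/pipeline/_simulate
    print(json.dumps(build_simulate_body("<p>Apfel</p>"), indent=2))
```

The response to such a request lists each simulated document after processing, so the value written to stripped_text can be inspected directly.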

The search query:

{
  "_source": true,
  "query": {
    "multi_match": {
      "query": "Obst",
      "fields": [
        "title*^1.9",
        "description*^1.5",
        "stripped_text*^1.1",
        "file__extracted.content"
      ],
      "analyzer": "german",
      "operator": "or",
      "fuzziness": "AUTO",
      "prefix_length": 2,
      "type": "most_fields",
      "minimum_should_match": "30%"
    }
  }
}

jensens · 2025-01-15T14:03:44Z

Interesting! Could you provide a PR for this?

1letter · 2025-01-15T14:16:17Z

Yes, I can, but I think more investigation is needed. My search result has this structure:

{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.12654667,
    "hits": [
      {
        "_index": "plone",
        "_id": "314ffcf4deb642a3bb39396ff7eaded0",
        "_score": 0.12654667,
        "_source": {
          "stripped_text": "\nKirsche Apfel und noch mehr Obst. Aber kein Gemüse! Auto, Benz und Fahrrad\n",
          "creators": [
            "admin"
          ],
          "description": "Banane",
          "language": "de",
          "section": "suchseite",
          "title": "Himbeere3",
          "rid": -1235753158,
          "portal_type": "Document",
          "text__extracted": {
            "content_type": "text/plain; charset=UTF-8",
            "language": "de",
            "content": "<p>Kirsche <strong>Apfel</strong> und noch mehr Obst. Aber kein <strong>Gemüse</strong>! Auto, Benz und Fahrrad</p>",
            "content_length": 116
          },
          "effective": "2025-01-14T13:17:00",
          "allow_discussion": false,
          "modified": "2025-01-15T13:50:07+00:00",
          "@id": "http://carusnet.local/suchseite",
          "id": "suchseite",
          "text": {
            "data": "<p>Kirsche <strong>Apfel</strong> und noch mehr Obst. Aber kein <strong>Gemüse</strong>! Auto, Benz und Fahrrad</p>",
            "content-type": "text/html",
            "encoding": "utf-8"
          },
          "created": "2025-01-14T12:17:48+00:00",
          "review_state": "published",
          "is_folderish": false,
          "layout": "document_view",
          "UID": "314ffcf4deb642a3bb39396ff7eaded0",
          "type_title": "Seite",
          "allowedRolesAndUsers": [
            "Anonymous"
          ],
          "exclude_from_nav": false
        }
      }
    ]
  }
}

If I search for the term "Gemüse", no hits are returned. That's a bit weird. Perhaps the umlaut is not handled correctly, or an analyzer file for German is missing. I will do more tests before I make a PR.

But "file__extracted.content*" or "text__extracted.content*" is definitely wrong; this should be corrected in collective.elastic.plone.
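A possible way to narrow down the "Gemüse" case is the _analyze endpoint of the index (POST /plone/_analyze), which returns the tokens an analyzer produces for a text. The sketch below builds two request bodies: one that uses whatever analyzer the mapping assigns to the stripped_text field, and one that uses the german analyzer named in the query. The helper name `build_analyze_bodies` is hypothetical. If the two token lists differ, a mismatch between index-time and query-time analysis would be one candidate explanation; this is a guess, not something confirmed in this thread.

```python
import json


def build_analyze_bodies(text, field="stripped_text", analyzer="german"):
    """Return two _analyze request bodies: by field mapping and by named analyzer."""
    by_field = {"field": field, "text": text}  # analyzer taken from the field mapping
    by_analyzer = {"analyzer": analyzer, "text": text}  # analyzer used in the query
    return by_field, by_analyzer


if __name__ == "__main__":
    # Send each printed body to http://localhost:9200/plone/_analyze
    # and compare the returned tokens.
    for body in build_analyze_bodies("Gemüse"):
        print(json.dumps(body, ensure_ascii=False))
```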

1letter added the "help wanted" and "question" labels on Jan 15, 2025
