eZ Publish / Platform: EZP-8765

strip_tags in ezsearch strips too much

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Medium
    • 4.3.0beta2
    • None
    • Misc
    • None
    • Version: All
      PHP Version: NA
      Webserver: All
      Database: All

    Description

      The PHP function strip_tags is used in the default eZ search plugin (ezsearch) to strip HTML tags from the metadata parts. In our experience, this leads in most cases to a massive loss of text in the index. It happens with text extracted from binary files as well as with text fields. For example, a single '<' character often means the loss of everything that follows it.
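
      The loss described above can be reproduced with plain PHP. The following is a minimal sketch; the sample string is made up for illustration and does not come from a real index run. Because strip_tags does not validate markup, a '<' followed directly by a non-space character opens a "tag" that is never closed, and the rest of the string is dropped.

```php
<?php
// Hypothetical sample: text as it might come out of a binary-file extractor.
$extracted = 'Values where x<y are ignored. The remaining pages describe the method.';

// strip_tags() treats "<y are ignored. ..." as an unterminated tag and discards it.
$stripped = strip_tags($extracted);

// Count the words that would reach the search index in each case.
$wordsBefore = str_word_count($extracted);
$wordsAfter  = str_word_count($stripped);

printf("kept %d of %d words: \"%s\"\n", $wordsAfter, $wordsBefore, $stripped);
```

      Note that a '<' followed by whitespace (as in "x < y") is left alone by strip_tags, so the problem mainly hits text where the character is glued to the next word.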

      In my opinion it is not necessary at all: XML text blocks already return plain text, and removing the call won't break indexing or searching.

      So my suggestion: disable strip_tags or make it configurable. In our Lucene search plugin we disabled it altogether, as the parser is intelligent enough to cope even with HTML/XML code in the text.
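
      One way the "make it configurable" suggestion could look is sketched below. This is an assumption-laden illustration, not the fix that went into eZ Publish: the function name prepareIndexText and the mode names are hypothetical, and the 'wellformed' mode is an extra option (a regular expression that removes only complete tags) that the report does not ask for.

```php
<?php
/**
 * Hypothetical helper, not part of eZ Publish: prepares metadata text for indexing.
 * $mode is an assumed setting: 'strip_tags' (the existing behaviour),
 * 'wellformed' (remove only complete tags), or 'none' (leave the text untouched).
 */
function prepareIndexText($text, $mode = 'none')
{
    switch ($mode) {
        case 'strip_tags':
            return strip_tags($text);
        case 'wellformed':
            // Replace only complete <tag ...> or </tag> sequences; a lone '<' is kept.
            return preg_replace('/<\/?[a-zA-Z][^<>]*>/', ' ', $text);
        default:
            return $text;
    }
}

// Made-up sample with a bare '<' glued to the following word.
$sample = 'Load<5kN is safe. See chapter two.';

foreach (array('strip_tags', 'wellformed', 'none') as $mode) {
    printf("%-10s => \"%s\"\n", $mode, prepareIndexText($sample, $mode));
}
```

      In a real plugin the mode would presumably be read from an INI setting rather than passed as an argument; the setting name is left open here.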

      BTW: when eZ Publish is used as a DMS (indexing lots of binary files), the indexed word table typically grows by 50% when strip_tags is disabled: further evidence that in general it really strips too much!

      --paul

          People

            pborgerm pborgerm
            pborgerm pborgerm