Details
-
Bug
-
Resolution: Fixed
-
Medium
-
None
-
None
-
Version: All
PHP Version: NA
Webserver: All
Database: All
Description
The PHP function strip_tags is used in the default eZ search plugin to strip HTML tags from the metadata parts. In our experience, this leads in most cases to a massive loss of text in the index. It happens with text extracted from binary files as well as with text fields. For example, a single '<' character often causes the loss of everything that follows it.
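The effect can be reproduced in isolation. The snippet below is only a minimal sketch: the sample string is invented for illustration and is not taken from the eZ Publish code. It relies on the fact that strip_tags does not validate markup, so a '<' followed directly by a non-whitespace character is treated as the start of a tag, and without a closing '>' the rest of the string is dropped.

```php
<?php
// Hypothetical sample text, such as might come out of a binary file extraction.
$text = 'Valves rated <10 bar must be inspected yearly. See chapter 4.';

// strip_tags() does not validate HTML: "<10 bar ..." looks like an
// unclosed tag, so the text following the "<" is removed.
$stripped = strip_tags($text);

// Compare how much text survives the call.
printf("original: %d chars\n", strlen($text));
printf("stripped: %d chars\n", strlen($stripped));
echo $stripped, "\n";
```

Words such as "inspected" and "chapter" never reach the index in this case, although the input contained no markup at all.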
In my opinion the call is not necessary at all: XML text blocks already return plain text, and removing it won't break indexing or searching.
So my suggestion: disable strip_tags or make it configurable. In our Lucene search plugin we disabled it altogether, as the parser is intelligent enough to cope even with the presence of HTML/XML code.
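One way the configurable variant could look is sketched below. This is not the actual plugin code: the function name prepareIndexText and the $stripTags switch are hypothetical, and in practice the switch would presumably be read from a settings file rather than passed as an argument.

```php
<?php
/**
 * Prepare metadata text for indexing.
 *
 * Hypothetical helper: $stripTags stands in for a configuration setting
 * and defaults to false, so no text is lost unless stripping is requested.
 */
function prepareIndexText($text, $stripTags = false)
{
    if ($stripTags) {
        // Old behaviour: may drop everything after a stray "<".
        $text = strip_tags($text);
    }

    // Collapse runs of whitespace so the indexer sees clean word boundaries.
    return trim(preg_replace('/\s+/', ' ', $text));
}

$sample = "if a<b then   swap them";

echo prepareIndexText($sample), "\n";        // stripping disabled
echo prepareIndexText($sample, true), "\n";  // stripping enabled
```

With stripping disabled the full sentence is passed on to the indexer; with it enabled, the second line shows what strip_tags leaves of the same input.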
BTW: when eZ Publish is used as a DMS (indexing lots of binary files), the indexed word table typically grows by 50% when strip_tags is disabled: further evidence that in general it really strips too much!
--paul