Word Splitting Filter in Solr

Solr has a fantastic set of options for processing text. One of them is the WordDelimiterFilterFactory, which lets you split a single token into multiple tokens. Here is an example of the filter applied to a TextField:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
        splitOnCaseChange="1"
        splitOnNumerics="1"
        generateWordParts="1"
        stemEnglishPossessive="0"
        generateNumberParts="0"
        catenateWords="0"
        catenateNumbers="0"
        catenateAll="0"
        preserveOriginal="0"/>
  </analyzer>
</fieldType>

This would turn the following token:

SpecialWords123More-words

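Into several tokens based on the options that are turned on (splitOnCaseChange, splitOnNumerics, generateWordParts):

Special, Words, More, words

The numeric part 123 is split off but not emitted, because generateNumberParts is set to 0 in the configuration above. Setting it to 1 adds 123 to the output.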

We recently had a request to turn off the splitOnNumerics option so that letter-to-number transitions don't cause a word to split. This seemed like a simple request, but we ended up spending a fair amount of time on it because the Solr docs are inaccurate. Our client was still using the 1.3 release of Solr, and this particular option was not introduced until the 1.4 release. Solr happily accepts the non-existent option and never warns you that it isn't valid! Getting this simple toggle released required an upgrade to the 1.4 release. While the upgrade brings a lot of bug fixes and features, it was an unexpected setback.
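For reference, here is a minimal sketch of what the requested change might look like, assuming the same field type as in the earlier example and a Solr 1.4 or later install. Only the splitOnNumerics value differs; everything else is carried over unchanged.

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- splitOnNumerics="0" needs Solr 1.4 or later. As described above,
         Solr 1.3 accepts the attribute without a warning and it has no effect. -->
    <filter class="solr.WordDelimiterFilterFactory"
        splitOnCaseChange="1"
        splitOnNumerics="0"
        generateWordParts="1"
        stemEnglishPossessive="0"
        generateNumberParts="0"
        catenateWords="0"
        catenateNumbers="0"
        catenateAll="0"
        preserveOriginal="0"/>
  </analyzer>
</fieldType>
```

With this setting, the intent is that Words123More is no longer broken at the letter/number boundaries, while the case change and the hyphen still split the token. Because an unsupported option fails silently, it is worth running a sample token through the analysis page of the Solr admin interface to confirm the tokens that are actually produced.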

If you are new to Solr, I highly recommend reading the book Solr 1.4 Enterprise Search Server. The authors do an excellent job explaining how all the pieces of Solr work.
