Solr has a fantastic set of options for processing text. One of them is the WordDelimiterFilterFactory, which lets you split a single token into multiple tokens. Here is an example of the filter applied to a TextField:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1"
            splitOnNumerics="1"
            generateWordParts="1"
            stemEnglishPossessive="0"
            generateNumberParts="0"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            preserveOriginal="0"/>
  </analyzer>
</fieldType>
This would turn the following token:
SpecialWords123More-words
into several tokens, based on the options that are turned on (splitOnCaseChange, splitOnNumerics, generateWordParts):
Special, Words, 123, More, words
We recently had a request to turn off the splitOnNumerics option so that letter-to-number transitions don't cause a word to split. This seemed like a simple request, but we ended up spending a fair amount of time on it because the Solr docs are inaccurate. Our client was still using the 1.3 release of Solr, and this particular option was not introduced until the 1.4 release. Solr happily accepts the non-existent option and never warns you that it isn't valid! Releasing this simple toggle required an upgrade to the 1.4 release. While the upgrade brings a lot of bug fixes and features, it was an unexpected setback.
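For reference, here is a sketch of what the change might look like. It reuses the field type from the example above and only flips splitOnNumerics to 0. It assumes Solr 1.4 or later; as described above, 1.3 accepts the attribute without complaint but does not act on it.

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Assumes Solr 1.4+: splitOnNumerics is silently ignored on 1.3.
         With it set to 0, letter-to-number transitions such as "Words123"
         should no longer trigger a split; case changes and the hyphen
         still do, because splitOnCaseChange and generateWordParts stay on. -->
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1"
            splitOnNumerics="0"
            generateWordParts="1"
            stemEnglishPossessive="0"
            generateNumberParts="0"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            preserveOriginal="0"/>
  </analyzer>
</fieldType>
```

Since Solr gives no warning about unknown attributes, it is worth running a sample token through the analysis page in the Solr admin UI after a change like this to confirm the option actually took effect.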
If you are new to Solr, I highly recommend reading the book Solr 1.4 Enterprise Search Server. The authors do an excellent job explaining how all the pieces of Solr work.