How to use Apache SolR SQL integration and not get hurt
August 16, 2012 1 Comment
Recently I’ve spent more than one day fixing crazy issues in SolR SQL database integration on my project. You can set everything up using SolR documentation here: http://wiki.apache.org/solr/DataImportHandler. It’s not that difficult and probably is enough to handle data loading for many applications.
We have quite complex query that is feeding SolR with data. It has few sub-selects, group concats etc. We also use SQL database to store original content of documents we’re feeding SolR (so we can recreate index whenever it’s needed). Everything worked fine with simple varchars or integers, but when we wanted to process CLOB/LONGTEXT fields it didn’t work. First for CLOB data SolR was not indexing it’s content, but class name and address of database CLOB handler (i think it was something like org.h2.jdbc.Clob@1c341a for H2 database, oracle.sql.CLOB@24d12a for Oracle). It was Object.toString() call as you probably already guessed and database API was not returning String for CLOB, but some internal representation that SolR should read data from.
Everything should be fixed by using ClobTransformer. Just few changes in data-config.xml and it should be fine… but it wasn’t. I spent quite few hours to find out, that it won’t work if data column you’re feeding ClobTransformer is not written all in uppercase. Yes, it was just that. Adding alias to column name that made it in named in upper case fixed everything.
select col as COLUMN from table
This and sourceColName=”COLUMN” in entity mapping helped. So first advice how to not get hurt is:
Use only upper case names in query result table. Use aliases when needed.
Second issue we had was that after switching database to MySql CLOBs (it’s named LONGTEXT in MySql) again stopped working and again it was some crazy issue nowhere documented. After some time spent in debugger and SolR source I’ve found out that it was not using field name from data-config.xml to map it to schema field, but sourceColName. It also has some logic that was trying to resolve using sourceColName.toLowerCase when it couldn’t match it’s name. I have no idea why it does that way for CLOBs as other fields worked fine. Also switching back to H2 database worked fine. So next advice is:
Use same name as schema field name for query result table columns, but still in upper case. It will keep you safe from first issue described here and work fine because of toLowerCase logic in Solr
Hope it will save you few hours of searching. Will keep posting new crazy things about SolR if I find them.That’s all I have found for now.
Ok, this is not really true. Done some live debugging in SolR transformers and what you really have to do is use exactly same naming as your database will return.
SolR transformers use Map with table column names (as String in case returned by database) as keys and data as values. So make sure you declare
sourceColNames in mapping in exactly same case as your database returns or map.get() won’t match it.