Search

Data search is wonderful. For the user it is a simple input field, but on the back end a somewhat magical process takes place.

We can design, build, deploy and support massive Solr clusters that help you dig through your data. Solr is an open-source, enterprise-grade search engine based on Apache Lucene. It packs a punch right out of the box and is easy to extend.

Some key features include:

  • Advanced Full-Text Search Capabilities
  • Optimized for High Volume Web Traffic
  • Standards Based Open Interfaces - XML and HTTP
  • Comprehensive HTML Administration Interfaces
  • Server statistics exposed over JMX for monitoring
  • Scalability - Efficient Replication to other Solr Search Servers
  • Flexible and Adaptable with XML configuration
  • Extensible Plugin Architecture

 

Solr uses the Lucene Search Library and extends it!

  • A Real Data Schema, with Numeric Types, Dynamic Fields, Unique Keys
  • Powerful Extensions to the Lucene Query Language
  • Support for Dynamic Faceted Browsing and Filtering
  • Advanced, Configurable Text Analysis
  • Highly Configurable and User Extensible Caching
  • Performance Optimizations
  • External Configuration via XML
  • An Administration Interface
  • Monitorable Logging
  • Fast Incremental Updates and Snapshot Distribution
  • Distributed search with sharded index across multiple hosts
  • XML and CSV/delimited-text update formats
  • Easy ways to pull in data from databases and XML files from local disk and HTTP sources
  • Multiple search indices
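
As a rough illustration of the XML update format mentioned above, the following sketch builds an `<add>` message from plain Python dictionaries. The helper name `build_add_message` and the sample field names (`id`, `title`, `tag`) are assumptions made for this example; the fields you can send depend on your own schema.

```python
import xml.etree.ElementTree as ET


def build_add_message(docs):
    """Serialize a list of dicts into a Solr-style XML <add> update message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # A multi-valued field is sent as repeated <field> elements.
            values = value if isinstance(value, (list, tuple)) else [value]
            for v in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(v)
    return ET.tostring(add, encoding="unicode")


# Hypothetical document; the field names are placeholders.
message = build_add_message(
    [{"id": "doc-1", "title": "Hello Solr", "tag": ["a", "b"]}]
)
print(message)
```

The resulting string would typically be sent to the index's update handler in an HTTP POST body, followed by a commit so the change becomes searchable.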

 

Detailed Features

Schema

  • Defines the field types and fields of documents
  • Can drive more intelligent processing
  • Declarative Lucene Analyzer specification
  • Dynamic Fields enable on-the-fly addition of new fields
  • CopyField functionality allows indexing a single field multiple ways, or combining multiple fields into a single searchable field
  • Explicit types eliminate the need for guessing types of fields
  • External file-based configuration of stopword lists, synonym lists, and protected word lists
  • Many additional text analysis components including word splitting, regex and sounds-like filters
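
To make the ideas of dynamic fields and CopyField more concrete, here is a small Python model of the behaviour described above. It is not Solr code: the field names, the patterns and the "first matching pattern wins" rule are all assumptions of this sketch, and Solr's own precedence rules may differ.

```python
from fnmatch import fnmatchcase

# Hypothetical schema: explicitly declared fields and their types.
EXPLICIT_FIELDS = {"id": "string", "title": "text", "body": "text", "text_all": "text"}

# Hypothetical dynamic-field patterns: any field whose name matches gets that type.
DYNAMIC_FIELDS = [("*_i", "int"), ("*_s", "string"), ("*_dt", "date")]

# Hypothetical copy rules: (source field, destination field).
COPY_FIELDS = [("title", "text_all"), ("body", "text_all")]


def resolve_type(name):
    """Return the type for a field name, or None if the schema does not cover it."""
    if name in EXPLICIT_FIELDS:
        return EXPLICIT_FIELDS[name]
    for pattern, field_type in DYNAMIC_FIELDS:
        # Simplification of this sketch: the first matching pattern wins.
        if fnmatchcase(name, pattern):
            return field_type
    return None


def apply_copy_fields(doc):
    """Return a copy of doc with source values also collected in the destinations."""
    out = dict(doc)
    for source, dest in COPY_FIELDS:
        if source in doc:
            out.setdefault(dest, []).append(doc[source])
    return out


print(resolve_type("price_i"))
print(apply_copy_fields({"id": "1", "title": "A", "body": "B"}))
```

In this model a field such as `price_i` never has to be declared up front, and `text_all` becomes a single searchable field combining the title and the body.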

 

Query

  • HTTP interface with configurable response formats (XML/XSLT, JSON, Python, Ruby)
  • Sort by any number of fields
  • Advanced DisMax query parser for high relevancy results from user-entered queries
  • Highlighted context snippets
  • Faceted Searching based on unique field values and explicit queries
  • Spelling suggestions for user queries
  • More Like This suggestions for given document
  • Constant scoring range and prefix queries - no idf, coord, or lengthNorm factors, and no restriction on the number of terms the query matches.
  • Function Query - influence the score by a function of a field's numeric value or ordinal
  • Date Math - specify dates relative to "NOW" in queries and updates
  • Performance Optimizations
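
Several of the query features above come together in a single HTTP request. The sketch below only builds the request URL, so it runs without a Solr server. The host, the index name `articles` and the field names (`title`, `body`, `category`, `published`) are placeholders assumed for this example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_select_url(base_url, user_query):
    """Build a select URL combining DisMax, date math, faceting and highlighting."""
    params = [
        ("q", user_query),                            # raw text typed by the user
        ("defType", "dismax"),                        # use the DisMax query parser
        ("qf", "title^2 body"),                       # fields to search, title boosted
        ("fq", "published:[NOW/DAY-7DAYS TO NOW]"),   # date math relative to NOW
        ("facet", "true"),
        ("facet.field", "category"),                  # facet on unique field values
        ("hl", "true"),
        ("hl.fl", "body"),                            # highlighted context snippets
        ("sort", "score desc, published desc"),       # sort by more than one field
        ("wt", "json"),                               # response format
    ]
    return base_url + "/select?" + urlencode(params)


# Hypothetical endpoint; adjust host and index name to your deployment.
url = build_select_url("http://localhost:8983/solr/articles", "solr clusters")
print(url)

# Decoding the query string shows the parameters survive URL encoding intact.
decoded = parse_qs(urlsplit(url).query)
print(decoded["fq"])
```

Fetching that URL with any HTTP client would return a JSON response. Changing `wt` selects one of the other response formats listed above.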

 

Core

  • Pluggable query handlers and extensible XML data format
  • Document uniqueness enforcement based on unique key field
  • Batches updates and deletes for high performance
  • User configurable commands triggered on index changes
  • Searcher concurrency control
  • Correct handling of numeric types for both sorting and range queries
  • Ability to control where docs with the sort field missing will be placed
  • "Luke" request handler for corpus information

 

Caching

  • Configurable Query Result, Filter, and Document cache instances
  • Pluggable Cache implementations
  • Cache warming in background: when a new searcher is opened, configurable searches are run against it to warm it up and avoid slow first hits. During warming, the current searcher handles live requests.
  • Autowarming in background: the most recently accessed items in the caches of the current searcher are re-populated in the new searcher, enabling high cache hit rates across index/searcher changes.
  • Fast/small filter implementation
  • User level caching with autowarming support
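
The autowarming idea described above can be sketched in a few lines of Python. This is a conceptual model only, not Solr's implementation: the class `LRUCache`, the `autowarm_count` parameter and the `regenerate` callback are names invented for this example.

```python
from collections import OrderedDict


class LRUCache:
    """A tiny least-recently-used cache that can be warmed from an older cache."""

    def __init__(self, capacity, autowarm_count=0):
        self.capacity = capacity
        self.autowarm_count = autowarm_count
        self._items = OrderedDict()  # ordered from least to most recently used

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry

    def warm_from(self, old_cache, regenerate):
        """Re-populate this cache with the most recently used keys of old_cache."""
        if self.autowarm_count <= 0:
            return
        recent_keys = list(old_cache._items)[-self.autowarm_count:]
        for key in recent_keys:
            # Values are recomputed against the new index, not copied over.
            self.put(key, regenerate(key))


# The old searcher's cache has seen some traffic.
old_cache = LRUCache(capacity=3)
for query in ("a", "b", "c"):
    old_cache.put(query, query.upper())
old_cache.get("a")  # "a" becomes the most recently used entry

# The new searcher's cache is warmed with the two most recent entries.
new_cache = LRUCache(capacity=3, autowarm_count=2)
new_cache.warm_from(old_cache, regenerate=lambda q: q.upper())
print(new_cache.get("a"), new_cache.get("c"), new_cache.get("b"))
```

While `warm_from` runs, the old cache keeps serving requests, which is why the first hits against the new searcher do not have to be slow.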

 

Replication

  • Efficient distribution of index parts that have changed via rsync transport
  • Pull strategy allows for easy addition of searchers
  • Configurable distribution interval allows tradeoff between timeliness and cache utilization

 

Admin Interface

  • Comprehensive statistics on cache utilization, updates, and queries
  • Text analysis debugger, showing result of every stage in an analyzer
  • Web Query Interface with debugging output: parsed query output, Lucene explain() document score detailing, and score explanations for documents outside the requested range, to debug why a given document wasn't ranked higher