FAQ
Contents
- What service properties are available?
- What index properties are available?
- How do I perform queries using UTF-8 character encoding?
What service properties are available?
- service.title
- The title of the web service
- service.defaultfield
- The service-wide default field used when performing searches.
- service.defaultoperator
- The service-wide default operator used when performing searches ('OR', 'AND').
- service.debugging
- A boolean property which determines whether or not the service is in debugging mode.
- query.expand
- A boolean property determining whether or not the expansion mechanism should be enabled for providing related search queries. Defaults to "false".
- query.spellcheck
- A boolean property determining whether or not the spell checking mechanism should be enabled for providing related search queries. Defaults to "false".
- query.suggest
- A boolean property determining whether or not the similar document mechanism should be enabled for providing related search queries. Defaults to "false".
What index properties are available?
- index.analyzer
- The class name of the preferred analyzer for the index.
- index.author
- Given any document from this index, this field will contain the name of its author.
- index.defaultoperator
- The preferred default operator when performing search requests on this index. ('OR', 'AND')
- index.image
- The URL of an image representing the index. Typically a small icon.
- index.image.height
- The pixel height of the image representing the index.
- index.image.width
- The pixel width of the image representing the index.
- index.image.type
- The content type of the image representing the index.
- index.readonly
- Specifies whether or not the index may only be read.
- index.title
- The title of the index. This is typically set if the index name is not satisfactory.
- document.defaultfield
- The default field used when searching this index.
- document.identifier
- Given any document (within this index), this field contains its unique identifier.
- document.identifier.validator
- A regular expression for the purposes of validating identifiers being associated with documents.
- document.author
- A template for building the author name of a particular document. (e.g. '[last_name], [first_name]')
- document.title
- A template for building the title of a particular document. (e.g. '[item_id]: [title]')
- document.updated
- The name of a document field containing an epoch number corresponding to the last time it was updated.
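To illustrate what the document.author and document.title templates describe, here is a minimal sketch of how a bracketed template could be expanded. It assumes that each [name] placeholder is replaced by the value of the document field with that name. The class TemplateSketch, the method expand, and the sample field values are hypothetical and are not part of the Lucene Web Service; the service's actual expansion rules may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemplateSketch {
    // Matches a bracketed field name such as [title] or [item_id].
    private static final Pattern PLACEHOLDER = Pattern.compile("\\[([^\\]]+)\\]");

    // Replaces each [field] placeholder with that field's value.
    // Assumption: fields missing from the document expand to an empty string.
    public static String expand(String template, Map<String, String> fields) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = fields.getOrDefault(m.group(1), "");
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical document fields, for illustration only.
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("item_id", "42");
        doc.put("title", "Lucene in Action");
        System.out.println(expand("[item_id]: [title]", doc));
    }
}
```

Under these assumptions, the template '[item_id]: [title]' applied to the sample document yields "42: Lucene in Action".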
How do I perform queries using UTF-8 character encoding?
Because of the limitations of the Java Servlet API, the Lucene Web Service does not itself decode incoming HTTP requests into text. That is handled by whatever deployment platform (e.g. Tomcat) it has been deployed on. Please see your deployment platform's documentation for how to enable UTF-8 decoding of incoming requests.
How to enable UTF-8 encoding with Apache Tomcat
By default, Tomcat will not decode the parameters of your GET requests as UTF-8. Here is how to change that:
- Open the file at TomcatDirectory/conf/server.xml. This is the configuration file for the Tomcat server. Scan down to around line 77 (in my copy of the file) until you see an XML tag called "Connector". It should already have several attributes such as "port", "maxHttpHeaderSize", "maxThreads", etc. You must add another attribute called "URIEncoding", setting its value to "UTF-8". My copy looks something like this:
<Connector port="8080" maxHttpHeaderSize="8192" maxThreads="150" minSpareThreads="25" maxSpareThreads="75" enableLookups="false" redirectPort="8443" acceptCount="100" connectionTimeout="20000" disableUploadTimeout="true" URIEncoding="UTF-8"/>
- Save the file and restart your Tomcat server.
Now, whenever Tomcat receives a request via HTTP, it will attempt to decode the request URI as UTF-8. What you must do is take whatever query you're searching for and break it down into its underlying bytes (the UTF-8 bytes representing the string). Your requested query must then be the URL-encoded (percent-encoded) representation of those bytes, not of the Unicode character codes.
For example, suppose I want to search for "gâteau". The non-ASCII character in question is the "â". According to the Unicode standard, this character's code point is U+00E2 (226 in decimal). In UTF-8, this character is stored as two bytes: 0xC3 0xA2 (195 162 in decimal). The query that gets sent to the server must look as follows:
GET /lucene/some_index?query=g%C3%A2teau
This way, Tomcat will understand that what you're submitting to it are UTF-8 encoded strings and the Lucene Web Service will behave correctly.
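If your client is written in Java, you do not have to work out the bytes by hand. The following is a minimal sketch using only the standard library: it prints the UTF-8 bytes of "â" and then builds the request line from the example above. The class name Utf8QueryExample and its helper methods are made up for illustration; the path /lucene/some_index is the same placeholder used above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class Utf8QueryExample {
    // Returns the UTF-8 bytes of the text as space-separated hex pairs.
    public static String utf8Hex(String text) {
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    // Percent-encodes a query value using its UTF-8 bytes.
    public static String encodeQuery(String query) {
        try {
            return URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "g", a-circumflex (U+00E2), "teau"; written as an escape so the
        // source file itself stays plain ASCII.
        String query = "g\u00e2teau";
        System.out.println(utf8Hex("\u00e2"));  // C3 A2
        System.out.println("GET /lucene/some_index?query=" + encodeQuery(query));
    }
}
```

Note that URLEncoder performs form-style encoding, so a space in the query comes out as "+" rather than "%20".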