"The way to find a needle in a haystack is to sit down." - Beryl Markham, West with the Night
Do your customer's clients find what they need? Or is finding the relevant information a pain in the you know what?
Google Search Appliance (GSA)
The GSA is a device that delivers search results. These results are based on the documents in your CMS and their metadata. The software is produced by Google, and can be considered a black box by the user. (It is actually a yellow box, see below.)
Functional Design Decisions
The GSA can be used, for example, to let a company's clients find relevant information quickly, and to maintain content in one place. Because relevant content can be found quickly and easily, it only needs to be created once. The need for frequently asked questions is also much smaller, as the relevant pages are found quickly.
Another advantage shows up when content is already optimized for either google.com or your own search implementation - both will show similar results, so the optimization only has to be done once.
Technical Design Decisions
Feed vs Crawl
When a document is published from a Web Content Management (WCM) system, the index in the GSA is updated automatically. There are two ways to do this: either the GSA crawls the whole website, or you "feed" the index data to the GSA.
An advantage of feeding the data to the GSA is that extra metadata can be made available to the GSA - metadata which you do not want to show on the website. This is useful, for example, when you want to add extra metadata to change the order of the search results.
Based on the URL, different collections can be created within the GSA.
It is generally recommended to use as few collections as possible. The GSA is fast enough, and filtering on metadata works fine. Only when the content is very different can it be useful to create an extra collection, e.g. for frequently asked questions.
For example, create one collection that includes the information for both the 'business' and the 'consumer' segment. This way the client directly sees all results for both segments, and has the option to filter by segment.
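To make the "one collection, filter on metadata" idea concrete, here is a minimal sketch of how a search URL with and without a segment filter could be built. It assumes the segment is fed as a meta tag named "segment" and that the collection is called "all_content"; both names, the host and the front end name are made up for illustration. The query parameters (q, site, client, output, requiredfields) are the ones I know from the GSA search protocol, but check them against your GSA version. Only simple filter values without spaces or special characters are assumed here.

```python
from urllib.parse import urlencode


def build_search_url(host, query, collection, segment=None):
    """Build a GSA search URL, optionally filtered on the 'segment' meta tag."""
    params = {
        "q": query,
        "site": collection,            # the collection to search in
        "client": "default_frontend",  # assumed front end name
        "output": "xml_no_dtd",        # ask for XML results
    }
    if segment is not None:
        # Only return documents whose 'segment' meta tag has this value.
        params["requiredfields"] = "segment:" + segment
    return "http://" + host + "/search?" + urlencode(params)


# All results of both segments, from the single collection.
all_results = build_search_url("gsa.example.com", "invoice", "all_content")
# The same search, narrowed to the business segment.
business_only = build_search_url("gsa.example.com", "invoice", "all_content", "business")
```

The point of the design is that the filter is just an extra parameter on the same collection, so no second collection is needed.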
Each item in the GSA must have a unique URL. Do not base the URL on a name that can be changed by editing the item in the WCM: the item would then get a different URL, and the GSA would store it as an extra, new item next to the old one.
It is recommended to use an internal unique id, e.g. the internal id of the item in the WCM. This makes sure the item in the GSA remains unique when it is updated in the WCM.
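As a small illustration, the sketch below derives the URL from the internal id only, so that editing the title does not change it. The base URL, the helper name record_url and the id value are assumptions made for this example.

```python
from urllib.parse import quote


def record_url(base_url, item_id):
    """Return a feed URL that depends only on the item's internal id."""
    # Percent-encode the id so characters such as ':' are safe in a path.
    return base_url.rstrip("/") + "/" + quote(item_id, safe="")


item = {"id": "tcm:5-123-64", "title": "Opening hours"}
url_before = record_url("http://www.example.com/item", item["id"])

# An editor renames the item in the WCM; the id stays the same.
item["title"] = "Opening hours (updated)"
url_after = record_url("http://www.example.com/item", item["id"])

print(url_before == url_after)  # True: the GSA sees the same item
```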
How to Integrate GSA With SDL Tridion
The best way to integrate the GSA with SDL Tridion is to use an existing framework like Search Integration for Tridion (SI4T). SI4T needs to be modified to support the GSA; a few of the changes are highlighted below.
Feeding to the GSA is done by sending an XML feed with both the text content of the document (as HTML) and the relevant metadata to the GSA feed URL. The storage extension must be modified in order to create XML feeds.
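To give an idea of what such a feed contains, here is a minimal sketch that builds a feed for one HTML document with extra metadata. It is written in Python for brevity and is not the SI4T code itself; the function name build_feed, the datasource name and the metadata names are assumptions. The element layout (gsafeed, header, group, record, metadata, content) follows the GSA feed format as I know it, so verify it against the feeds documentation of your GSA version.

```python
import xml.etree.ElementTree as ET

DOCTYPE = '<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">'


def build_feed(datasource, url, html, metadata):
    """Build an incremental feed (as a string) for a single HTML document."""
    gsafeed = ET.Element("gsafeed")
    header = ET.SubElement(gsafeed, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = "incremental"

    group = ET.SubElement(gsafeed, "group")
    record = ET.SubElement(
        group, "record", {"url": url, "action": "add", "mimetype": "text/html"}
    )
    # Extra metadata that is fed to the GSA but not shown on the website.
    meta_parent = ET.SubElement(record, "metadata")
    for name, value in metadata.items():
        ET.SubElement(meta_parent, "meta", {"name": name, "content": value})
    # ElementTree escapes the HTML markup when serializing.
    ET.SubElement(record, "content").text = html

    body = ET.tostring(gsafeed, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + DOCTYPE + "\n" + body


feed = build_feed(
    "tridion",
    "http://www.example.com/item/tcm%3A5-123-64",
    "<html><body><p>Hello & welcome</p></body></html>",
    {"segment": "business", "boost": "high"},
)
```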
XML Feed Example
An example of how to send an XML feed, "feed.xml", to the GSA from the command line:
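A minimal sketch of such an upload, written as a small Python module that can be called from a script or an interactive session. It assumes the GSA accepts feeds as a multipart form post on port 19900 at /xmlfeed with the fields feedtype, datasource and data, which is how I know the feed interface; the host name and datasource name are made up, and the helper names are my own.

```python
import urllib.request
import uuid


def encode_multipart(fields, boundary):
    """Encode a dict of name -> bytes as a multipart/form-data body."""
    marker = b"--" + boundary.encode("ascii")
    lines = []
    for name, value in fields.items():
        lines.append(marker)
        lines.append(
            ('Content-Disposition: form-data; name="%s"' % name).encode("ascii")
        )
        lines.append(b"")
        lines.append(value)
    lines.append(marker + b"--")
    lines.append(b"")
    return b"\r\n".join(lines)


def send_feed(gsa_host, datasource, feed_path, feedtype="incremental"):
    """Post the feed file to the GSA and return the response text."""
    with open(feed_path, "rb") as handle:
        feed_xml = handle.read()
    boundary = uuid.uuid4().hex
    body = encode_multipart(
        {
            "feedtype": feedtype.encode("ascii"),
            "datasource": datasource.encode("ascii"),
            "data": feed_xml,
        },
        boundary,
    )
    request = urllib.request.Request(
        "http://%s:19900/xmlfeed" % gsa_host,  # assumed feed URL of the GSA
        data=body,
        headers={"Content-Type": "multipart/form-data; boundary=" + boundary},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", "replace")
```

A call would then look like send_feed("gsa.example.com", "tridion", "feed.xml"). Note that the GSA only accepts feeds from hosts that are allowed in its admin console.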
SI4T is a storage extension. Therefore, the configuration for the GSA must be added to the cd_storage_conf.xml file.
XML Format (DD4T-specific)
For DD4T, the data is sent from the templates to the deployer as XML. The "indexdata" (the information needed for the GSA) is therefore added as an XML comment. XML comments must not contain "--" (two hyphens), so this sequence has to be replaced by another symbol on the template side, and translated back to "--" on the deployer side.
If binaries, e.g. PDF documents, need to be indexed by the GSA, an extra file containing the indexdata XML has to be added to the zip package. This is because PDFs are not in XML format, so the indexdata cannot be embedded in them.
The deployer must then read this indexdata file when the deployment of the binary is committed, in order to create the correct feed.
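The replace-and-restore step can be sketched as follows. This is not the original implementation: instead of swapping only the "--" pair for a placeholder, the sketch percent-escapes every hyphen, which also covers a comment that would otherwise end in a hyphen (also not allowed in XML) and survives text that already contains the escape sequence. The function names are assumptions.

```python
def escape_for_comment(text):
    """Template side: make text safe to place inside an XML comment."""
    # Escape the escape character first, then every hyphen.
    return text.replace("%", "%25").replace("-", "%2D")


def unescape_from_comment(text):
    """Deployer side: restore the original text."""
    # Reverse order of the escaping above.
    return text.replace("%2D", "-").replace("%25", "%")


def wrap_index_data(index_xml):
    """Wrap the indexdata in an XML comment."""
    return "<!--" + escape_for_comment(index_xml) + "-->"


original = '<indexdata><title>Price list -- 100% up-to-date</title></indexdata>'
comment = wrap_index_data(original)
restored = unescape_from_comment(comment[len("<!--"):-len("-->")])
print(restored == original)  # True
```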
Note that when publishing pages with many large binaries, the feeds must not all be built up in memory, as large binaries can quickly use up the available memory.
When feeding a PDF, for example, the complete feed is best created just before it is sent to the GSA. This can be done by reading the PDF at that moment, and releasing the whole feed from memory after sending.
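One way to keep memory use low is to never hold the whole binary in memory at all, and to write the feed piece by piece. The sketch below goes a step further than the approach described above: it streams the PDF into the feed in chunks, base64-encoded, to any writable binary stream (a temporary file, for instance). The function name, chunk size and datasource are assumptions; the encoding="base64binary" attribute on the content element is how I know binary content is marked in GSA feeds.

```python
import base64
from xml.sax.saxutils import escape, quoteattr

# A multiple of 3 bytes, so the base64 output of each chunk can be
# concatenated without padding characters in the middle.
CHUNK_SIZE = 3 * 16 * 1024


def write_binary_feed(out, datasource, url, binary_path, mimetype="application/pdf"):
    """Stream a feed for one binary to `out`, a writable binary stream."""
    out.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
    out.write(b'<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">\n')
    out.write(
        (
            "<gsafeed><header><datasource>%s</datasource>"
            "<feedtype>incremental</feedtype></header><group>" % escape(datasource)
        ).encode("utf-8")
    )
    out.write(
        (
            '<record url=%s action="add" mimetype=%s>'
            % (quoteattr(url), quoteattr(mimetype))
        ).encode("utf-8")
    )
    out.write(b'<content encoding="base64binary">')
    with open(binary_path, "rb") as binary:
        while True:
            chunk = binary.read(CHUNK_SIZE)
            if not chunk:
                break
            # Only one chunk of the binary is in memory at any time.
            out.write(base64.b64encode(chunk))
    out.write(b"</content></record></group></gsafeed>\n")
```

With this shape, peak memory depends on the chunk size rather than on the size of the PDF.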
Google has decided to discontinue the hardware-based Google Search Appliance (GSA) and focus their engineering efforts on cloud-based solutions. The GSA will remain supported for the next two years.
In the meantime, you can evaluate whether Google's cloud offering or a different enterprise search product is the best fit for your customer.