Indexing Search Engine Spider Features
How The Search Engine Works
Indexed vs. Unindexed Searching: Distributed Searching, Email Filtering, Security Classifications, Forensics
- The Spider can index and search publicly available sites, secure content HTTPS sites, and password-accessible sites. The Spider also supports forms-based authentication.
- A single search request can return fully-integrated search results, spanning local and remote content, including:
- hit-highlighted display of Web-ready file types such as HTML, PDF and XML, including display of images, formatting and links.
- conversion of other file types ("Office," Unicode, ZIP, etc.) to HTML for browser display with highlighted hits.
- support for dynamically-generated content (ASP.NET, MS CMS, SharePoint, etc.) with highlighted hits.
- The Spider can perform "vertical" searching of pages linked from a URL, as well as "horizontal" crawling of sites linked to a URL.
- The Spider can limit indexed data by file size, number of files, time spent on a Web site, etc.
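The limits in the list above can be pictured with a small sketch. This is not the Spider's implementation: the function name `crawl`, its parameters, and the in-memory `links` and `sizes` tables are all assumptions made for illustration, and the time limit is left out to keep the example short.

```python
from collections import deque

def crawl(start, links, sizes, max_depth=2, max_files=100, max_bytes=1_000_000):
    """Breadth-first walk of a link table, honouring depth, count and size limits.

    links maps a page to the pages it links to; sizes maps a page to its size in bytes.
    Both are hypothetical stand-ins for real HTTP requests.
    """
    seen = {start}
    queue = deque([(start, 0)])
    indexed = []
    while queue and len(indexed) < max_files:
        url, depth = queue.popleft()
        # Skip pages over the size limit, but still follow their links.
        if sizes.get(url, 0) <= max_bytes:
            indexed.append(url)
        # Stop following links once the depth limit is reached.
        if depth < max_depth:
            for target in links.get(url, []):
                if target not in seen:
                    seen.add(target)
                    queue.append((target, depth + 1))
    return indexed

links = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["e"]}
sizes = {"a": 10, "b": 10, "c": 5000, "d": 10, "e": 10}
pages = crawl("a", links, sizes, max_depth=2, max_files=10, max_bytes=1000)
```

Here page "c" is skipped because it exceeds the size limit, and page "e" is never reached because it lies beyond the depth limit.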
This software module can instantly search terabytes of text because it builds a search index that stores the location of words in documents.
Indexing is easy - simply select folders or entire drives to index and the software does the rest.
The search engine can instantly search terabytes of text across a desktop, network, Internet or Intranet site.
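An index that "stores the location of words in documents" is commonly built as a positional inverted index. The sketch below shows the general idea only; the function names and the sample documents are assumptions, and the product's actual index format is not described here.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_index(documents):
    """Map each word to {document name: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for name, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            index[word][name].append(position)
    return index

def indexed_search(index, word):
    """Look the word up in the index without reading the documents again."""
    return {name: list(pos) for name, pos in index.get(word.lower(), {}).items()}

def unindexed_search(documents, word):
    """Scan every document directly; needs no index but reads all the text."""
    hits = {}
    for name, text in documents.items():
        positions = [i for i, w in enumerate(tokenize(text)) if w == word.lower()]
        if positions:
            hits[name] = positions
    return hits

documents = {
    "memo.txt": "Due process of law requires notice. Process matters.",
    "notes.txt": "The law of large numbers.",
}
index = build_index(documents)
```

Because the index records word positions, a viewer can use them to highlight hits in the original document, which is itself left unchanged.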
- Once it has built an index, it can automatically update it using the Windows Task Scheduler to reflect additions, deletions and modifications to your document collection.
- Updating an index is even faster, since the indexer checks each file and reindexes only the files that have been added or changed.
- The indexer automatically recognizes and supports all popular file formats, and never alters original files.
- A single index can hold over a terabyte of text, and the software can create, and search with a single search request, an unlimited number of indexes.
- Since you may sometimes want to search files that the software has not indexed, the software also supports unindexed searching as well as "combination" indexed/unindexed searching.
- Searching and document display (like indexing) do not in any way affect original files.
- When the search engine does an indexed search, it searches directly on the index that it has built.
- An unindexed search, in contrast, searches directly through the documents.
- In either case, when the software displays a retrieved document, it refers to the original document, using information in the index to highlight hits.
This search engine also serves as a tool for publishing large document collections, with instant text searching, to Web sites or CD/DVDs.
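The incremental update described in the list above (check each file, reindex only what was added or changed) might be planned as follows. This is a minimal sketch under assumptions: the function `plan_update` and the choice of modification time plus size as the change signature are illustrative, not the product's documented method.

```python
import os
import tempfile

def plan_update(folder, indexed_state):
    """Compare files on disk against the state recorded at the last index run.

    indexed_state maps path -> (modification time in ns, size in bytes).
    Returns (files to reindex, files to remove from the index, new state).
    """
    current = {}
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            info = os.stat(path)
            current[path] = (info.st_mtime_ns, info.st_size)
    to_reindex = sorted(p for p, sig in current.items() if indexed_state.get(p) != sig)
    to_remove = sorted(p for p in indexed_state if p not in current)
    return to_reindex, to_remove, current

with tempfile.TemporaryDirectory() as folder:
    first = os.path.join(folder, "a.txt")
    second = os.path.join(folder, "b.txt")
    for path in (first, second):
        with open(path, "w") as handle:
            handle.write("original")
    initial, _, state = plan_update(folder, {})       # first run: everything is new
    with open(first, "w") as handle:
        handle.write("changed text")                  # size changes, so the signature differs
    os.remove(second)                                 # a deletion since the last run
    changed, removed, state = plan_update(folder, state)
```

A scheduled job (for example, one started by the Windows Task Scheduler) could call a routine like this and pass only the returned files to the indexer.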
- over two dozen indexed, unindexed, fielded and full-text search options
- highlights hits in HTML, XML and PDF, while displaying embedded links, formatting and images
- converts other file types (word processor, database, spreadsheet, email and full text of email attachments, ZIP, Unicode, etc.) to HTML for display with highlighted hits
- Spider supports Web-based content (HTML, PDF, XML, etc.) as well as dynamically-generated content (ASP.NET, MS CMS, SharePoint, etc.)
Supported file formats
- Adobe Acrobat (*.pdf)
- Ami Pro (*.sam)
- Ansi Text (*.txt)
- ASCII Text (See note 3)
- ASF media files (metadata only) (*.asf)
- CSV (Comma-separated values) (*.csv)
- DBF (*.dbf)
- EML files (emails saved by Outlook Express) (*.eml)
- Enhanced Metafile Format (*.emf)
- Eudora MBX message files (*.mbx)
- GZIP (*.gz)
- HTML (*.htm, *.html)
- JPEG (*.jpg)
- Lotus 1-2-3 (*.123, *.wk?)
- MBOX email archives (including Thunderbird) (*.mbx)
- MHT archives (HTML archives saved by Internet Explorer) (*.mht)
- MIME messages
- MSG files (emails saved by Outlook) (*.msg)
- Microsoft Access MDB files (see note 1) (*.mdb)
- Microsoft Document Imaging (*.mdi)
- Microsoft Excel (*.xls)
- Microsoft Excel 2003 XML (*.xml)
- Microsoft Excel 2007 (*.xlsx)
- Microsoft Outlook/Exchange (See note 2)
- Microsoft Outlook Express 5 and 6 (*.dbx) message stores
- Microsoft PowerPoint
- Microsoft PowerPoint 2007 (*.pptx)
- Microsoft Rich Text Format (*.rtf)
- Microsoft Searchable Tiff (*.tiff)
- Microsoft Word for DOS (*.doc)
- Microsoft Word for Windows (*.doc)
- Microsoft Word 2003 XML (*.xml)
- Microsoft Word 2007 (*.docx)
- Microsoft Works (*.wks)
- MP3 (metadata only) (*.mp3)
- Multimate Advantage II (*.dox)
- Multimate version 4 (*.doc)
- OpenOffice 2.x and 1.x documents, spreadsheets, and presentations (*.sxc, *.sxd, *.sxi, *.sxw, *.sxg, *.stc, *.sti, *.stw, *.stm, *.odt, *.ott, *.odg, *.otg, *.odp, *.otp, *.ods, *.ots, *.odf) (includes OASIS Open Document Format for Office Applications)
- Quattro Pro (*.wb1, *.wb2, *.wb3, *.qpw)
- TAR (*.tar)
- TIFF (*.tif)
- TNEF (winmail.dat files)
- Treepad HJT files (*.hjt)
- Unicode (UCS16, Mac or Windows byte order, or UTF-8)
- Windows Metafile Format (*.wmf)
- WMA media files (metadata only) (*.wma)
- WMV video files (metadata only) (*.wmv)
- WordPerfect 4.2 (See note 3) (*.wpd, *.wpf)
- WordPerfect (5.0 and later) (*.wpd, *.wpf)
- WordStar version 1, 2, 3 (See note 3) (*.ws)
- WordStar versions 4, 5, 6 (*.ws)
- WordStar 2000
- Write (*.wri)
- XBase (including FoxPro, dBase, and other XBase-compatible formats) (*.dbf)
- XML (*.xml)
- XML Paper Specification (*.xps) (version 7.40)
- XyWrite (See note 3)
- ZIP (*.zip)
Note 1. Databases. Using ODBC, the software can also index and display records in Access databases. Each record is treated as a separate document. XBase databases are indexed without using ODBC.
Note 2. Outlook and Exchange. The software can index Outlook and Exchange message stores using MAPI.
Note 3. Older Word Processor Formats. The software can index and display, but cannot automatically recognize, documents in the following formats:
- WordPerfect 4.2
- WordStar versions before 4
- ASCII Text
Other File Formats
- This browser-based search engine will index, search, and display other file formats, but they will be treated as binary file types. In other words, all binary codes, etc. will be displayed along with the text.
- This browser-based search engine can display images in the following formats:
- WPG (WPG version 1.0 only)
- When viewing multipage images, use PgUp and PgDn to navigate between the pages. The image viewer also includes viewing options such as Zoom In, Zoom Out, Invert, Rotate, etc.
Basic Search Types
- All search options on this page work with indexed, unindexed and "combination" indexed/unindexed searching.
- Phrase searching finds phrases like: due process of law.
- Boolean operators like and/or/not can join words and phrases: due process of law and not (equal protection or civil rights).
- Proximity searching finds a word or phrase within "n" words of another word or phrase: apple pie w/38 peach cobbler.
- Directed Proximity searching finds a word or phrase "n" words before another word or phrase: apple pie pre/38 peach cobbler.
- Phonic searching finds words that sound alike, like Smythe in a search for Smith.
- Stemming finds variations on endings, like applies, applied, applying in a search for apply.
- Numeric range searching finds any number between two numbers, such as between 6 and 36.
- Macro capabilities make it easy to include frequently used items in a search request.
- Wildcard support allows ? to hold a single letter place, and * to hold multiple letter places: apple* and not appl?sauce.
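Proximity and directed proximity searching from the list above can be illustrated with word positions. The sketch below is an assumption-laden simplification: the helper names are invented, and distance is measured between the starting words of the two phrases, which may differ from how the search engine counts.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def phrase_positions(tokens, phrase):
    """Return the start positions where the phrase occurs as consecutive tokens."""
    words = tokenize(phrase)
    return [i for i in range(len(tokens) - len(words) + 1)
            if tokens[i:i + len(words)] == words]

def within(text, first, second, n, directed=False):
    """True if `first` occurs within n words of `second` (like w/n).

    With directed=True, `first` must start before `second` (like pre/n).
    """
    tokens = tokenize(text)
    for a in phrase_positions(tokens, first):
        for b in phrase_positions(tokens, second):
            distance = b - a
            if directed and distance <= 0:
                continue
            if abs(distance) <= n:
                return True
    return False

text = "She served apple pie and then a warm peach cobbler"
near = within(text, "apple pie", "peach cobbler", 38)
before = within(text, "peach cobbler", "apple pie", 38, directed=True)
```

In this sample, `near` is true because the phrases start six words apart, while `before` is false because "peach cobbler" does not precede "apple pie".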
Fuzzy Searching
- Fuzzy searching uses a proprietary algorithm to find search terms even if they are misspelled.
- Search fuzziness adjusts from 0 to 10 so you can fine-tune fuzziness to the level of OCR or typographical errors in your files.
- A search for alphabet with a fuzziness of 1 would find alphaqet; with a fuzziness of 3, it would find both alphaqet and alpkaqet.
- Fuzziness is not built into the index, so you can vary fuzziness at the time of each search.
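The search engine's fuzzy algorithm is proprietary, so the sketch below swaps in a well-known stand-in, Levenshtein edit distance, purely to show how an adjustable fuzziness level can be applied at search time. Treating the fuzziness level as a maximum edit distance is an assumption, not the product's rule.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def fuzzy_matches(term, words, fuzziness):
    """Return the words whose distance from the term is within the fuzziness level."""
    return [w for w in words if edit_distance(term, w) <= fuzziness]

words = ["alphaqet", "alpkaqet", "alphabet", "apple"]
strict = fuzzy_matches("alphabet", words, 1)
loose = fuzzy_matches("alphabet", words, 3)
```

Because the comparison happens when the search runs, the same word list serves any fuzziness level, in line with the point that fuzziness is not built into the index.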
Concept / Synonym / Thesaurus Searching
- Concept searching lets you look for fast and find quick, speedy, etc.
- The search engine offers variable levels of automatic synonym expansion based on a comprehensive semantic network of the English language.
- You can also add your own thesaurus terms.
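Synonym expansion can be pictured as widening the search term before matching. In this sketch the tiny `thesaurus` dictionary stands in for the semantic network, and `user_terms` stands in for user-added thesaurus entries; both names and the functions are hypothetical.

```python
def expand_query(term, thesaurus, user_terms=None):
    """Return the term plus its built-in and user-defined synonyms, sorted."""
    expanded = {term}
    expanded.update(thesaurus.get(term, ()))
    if user_terms:
        expanded.update(user_terms.get(term, ()))
    return sorted(expanded)

def concept_search(documents, term, thesaurus, user_terms=None):
    """Return the names of documents containing the term or any synonym."""
    wanted = set(expand_query(term, thesaurus, user_terms))
    return [name for name, text in documents.items()
            if wanted & set(text.lower().split())]

thesaurus = {"fast": ["quick", "speedy"]}
documents = {
    "a.txt": "a quick reply",
    "b.txt": "slow progress",
    "c.txt": "rapid growth",
}
built_in = concept_search(documents, "fast", thesaurus)
with_user = concept_search(documents, "fast", thesaurus, {"fast": ["rapid"]})
```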
Combining Search Types
- Nearly all search types are combinable.
- You can make your search request as complex as you want.
Search Features - Relevancy-Ranking
- The search engine can sort and instantly re-sort search results by relevancy (with respect to number of hits), file name, file date, etc.
- Natural language algorithms provide automatic term weighting, following a "plain English" or unstructured indexed search request.
- Automatic term weighting is based on the frequency and density of hits in your files.
- For example, in the search request get me Sam's memo on the 1999 CorpX takeover, if 1999 appeared in 3,000 files, and Sam appeared in only two files, then Sam would get a much higher relevancy rating, taking you straight to the most "relevant" files.
- A positional scoring option works with the search engine's natural language relevancy ranking to rank documents more highly when hits are near the top of a file, or otherwise clustered in a file.
- It also includes variable term weighting options for both indexed and unindexed searches:
- Positive term weighting can place extra emphasis on one or more words: soup:8 or recipe:3
- Negative term weighting can assign negative emphasis to one or more words: red or green or yellow:-7
- Variable term weighting can also apply to fields: (description:5 contains (apple and pear)) or (author:2 contains smith)
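The weighting ideas above can be sketched in a few lines. This is an illustration under assumptions, not the search engine's ranking formula: the automatic weight uses a standard inverse-document-frequency style logarithm, the collection size of 10,000 files is invented for the example, and the parser handles only simple `term:weight` requests joined by `or` (field weighting is not covered).

```python
import math
import re

def parse_weighted_terms(request):
    """Parse 'soup:8 or recipe:3' into {'soup': 8, 'recipe': 3}; default weight is 1."""
    weights = {}
    for token in request.lower().split():
        if token == "or":
            continue
        match = re.fullmatch(r"([a-z0-9]+)(?::(-?\d+))?", token)
        if match:
            weights[match.group(1)] = int(match.group(2) or 1)
    return weights

def automatic_weight(total_files, files_with_term):
    """Rarer terms get a higher weight (inverse document frequency style)."""
    return math.log(total_files / files_with_term)

def score(text, weights):
    """Sum each term's weight times the number of times it occurs in the text."""
    words = text.lower().split()
    return sum(weight * words.count(term) for term, weight in weights.items())

positive = parse_weighted_terms("soup:8 or recipe:3")
negative = parse_weighted_terms("red or green or yellow:-7")
sam = automatic_weight(10_000, 2)        # "Sam" appears in only two files
year = automatic_weight(10_000, 3_000)   # "1999" appears in 3,000 files
```

With these assumed numbers, `sam` is much larger than `year`, matching the example in which the rare term drives the ranking.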