Full-Text Search
What is Full-Text Search?
Overview of Full-Text Search
Full-text search is a method of searching a database that looks at the actual content of the data. A search engine examines all of the words in every stored document as it attempts to match the search criteria. This lets you search for specific terms within the text of stored documents, making it much easier to find the information you need.
String Search vs Full-Text Search
A string search, or serial scan, reads each document from start to finish in response to a query and finds matching strings and substrings. In SQL, this takes the form of the LIKE operator. For large datasets, this approach can be far too slow, because the work grows with the total amount of text that has to be scanned. Full-text indexing, on the other hand, catalogs the contents of a document or database ahead of time in a way that makes data retrieval efficient. A good index can greatly improve the search capabilities of a database.
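To make the cost of a serial scan concrete, here is a minimal Python sketch of the approach. The sample documents and the `serial_scan` helper are hypothetical and exist only for illustration; the substring test plays roughly the role of a `LIKE '%quick%'` condition.

```python
# Hypothetical sample documents, used only for illustration.
documents = {
    1: "The quick brown fox",
    2: "A lazy dog sleeps",
    3: "Quick thinking saves the day",
}


def serial_scan(docs, needle):
    """Return ids of documents whose text contains `needle` (case-insensitive)."""
    needle = needle.lower()
    # Every document is read in full on every query, so the cost of a
    # single search grows with the total amount of stored text.
    return [doc_id for doc_id, text in docs.items() if needle in text.lower()]


print(serial_scan(documents, "quick"))  # [1, 3]
```

Nothing is prepared ahead of time here, which is exactly the work a full-text index is meant to avoid.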
How does Full-Text Indexing work?
Full-text indexing is the process of building a search index that allows a full-text search engine to quickly search through all of the text in a set of documents. This is done by breaking the text down into individual words (called tokens) and storing them in a dedicated index structure. When you run a full-text search, the search engine can quickly look through this index to find all of the documents that contain your desired terms.
Inverted Index
An inverted index is a data structure used to index documents for full-text search. It maps each term to the list of documents that contain it, often along with the position of each occurrence. Indexing typically starts with a forward index, which lists the words in each document. Answering a query from the forward index alone would mean iterating through every document and every word, which takes too long and consumes too much memory and processing power. Inverting the forward index produces a structure that lists documents by term rather than terms by document. This allows the search engine to quickly find all of the documents that contain a specific term without reading through every document.
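The inversion step can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine stores its index; the sample documents, the `tokenize` helper, and the choice of nested dictionaries are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical sample documents, used only for illustration.
documents = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick dog",
}


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


# Forward index: document id -> list of tokens in that document.
forward_index = {doc_id: tokenize(text) for doc_id, text in documents.items()}

# Inverted index: term -> {document id: [positions of the term]}.
inverted_index = defaultdict(dict)
for doc_id, tokens in forward_index.items():
    for position, term in enumerate(tokens):
        inverted_index[term].setdefault(doc_id, []).append(position)

print(sorted(inverted_index["quick"]))  # [1, 3]
print(inverted_index["dog"])            # {2: [2], 3: [1]}
```

Looking up a term is now a single dictionary access, and the stored positions are what later makes phrase and proximity queries possible.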
Common Indexing Tools
There are a few general components that are involved in a full-text indexing process:
- Tokenization or word breaking: This is the process of breaking down text into individual words, or tokens, by finding the word boundaries. This is important because it allows the search engine to track each instance of each word in a document.
- Stop word removal: This is the process of removing common words that are unlikely to be useful in a search, such as articles like ‘a’ and ‘the’ and conjunctions like ‘and’. Because these words appear in almost every document, leaving them out keeps the search index smaller and more efficient.
- Thesaurus: This is a tool that can be used to group together synonyms and related terms. This can be helpful in optimizing search results, by accounting for different ways that users might search for the same information.
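- Stemming: This is the process of reducing inflected forms of a word (such as “running” and “runs”) to their base form (“run”). Irregular forms such as “ran” usually cannot be handled by stripping suffixes and need dictionary-based normalization instead. Stemming can also help to optimize the efficiency of the search index.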
- Index writing: This is the process of creating a searchable index of all of the terms in a document.
- Filtering: A filter converts the content of a document into plain text that the indexing tool can read. In some database systems, the source data is stored in columns of types such as char, varchar, nchar, nvarchar, text, ntext, image, or xml. Filtering can involve things like removing formatting, converting character encodings, or extracting text from images.
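Several of the components above can be chained into a small analysis pipeline. The following Python sketch is only an illustration under stated assumptions: the stop word list and the thesaurus entry are made up, and `naive_stem` is a deliberately crude suffix stripper, whereas real engines use established algorithms such as the Porter stemmer.

```python
import re

STOP_WORDS = {"a", "an", "and", "the", "of", "in"}  # tiny illustrative list
SYNONYMS = {"automobile": "car"}                    # hypothetical thesaurus entry


def naive_stem(token):
    """Strip a few common suffixes; far cruder than a real stemmer."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Turn raw text into the list of terms that would be written to the index."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())      # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word removal
    tokens = [SYNONYMS.get(t, t) for t in tokens]        # thesaurus lookup
    return [naive_stem(t) for t in tokens]               # stemming


print(analyze("The automobile and the jumping dogs"))  # ['car', 'jump', 'dog']
```

The same `analyze` function would normally be applied to queries as well, so that a search for "dogs" is matched against the indexed term "dog".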
How does Full-Text Querying work?
Once you have created a full-text index, you can start querying it to find the information you need. A full-text search query is simply a request for information that contains one or more terms that you want to search for. The search engine will then return all of the documents that contain those terms, making it easy to find what you are looking for.
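As a rough illustration of how a query is answered from an index, the sketch below intersects the posting sets of each query term. The `inverted_index` contents and the `search` helper are hypothetical, and requiring every term to match is just one possible interpretation of a multi-term query.

```python
# Hypothetical inverted index: term -> set of ids of documents containing it.
inverted_index = {
    "quick": {1, 3},
    "brown": {1},
    "dog": {2, 3},
    "lazy": {2},
}


def search(index, query):
    """Return sorted ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    # Terms missing from the index contribute an empty set, so no document matches.
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


print(search(inverted_index, "quick dog"))  # [3]
print(search(inverted_index, "quick cat"))  # []
```

Only the posting sets for the query terms are touched; the text of the documents is never read at query time.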
Common Querying Tools
There are many different types of querying algorithms that can be used to improve the search functionality of a website or application. Each of these tools has its own advantages and disadvantages, so it is important to choose the right tool for the job.
- Fuzzy search allows you to find documents that contain terms similar in spelling to the ones you are searching for. This can be helpful if you are not sure of the exact spelling of a term, and it is commonly used to handle misspelled search terms. For example, a fuzzy query for "prescription grasses" would also match "prescription glasses".
- Regular expressions are a set of symbols that you can use to create patterns that describe the text you are looking for. This can be helpful if you want to find all of the documents that contain a specific format, or if you want to find all of the terms that match certain criteria.
- Boolean queries are a type of query that allows you to combine multiple terms to create a more complex search. This can be helpful if you want to find all of the documents that contain multiple terms, or if you want to find all of the documents that contain one term but not another. Boolean operators are typically used to create Boolean queries, and the most common ones are AND, OR, and NOT.
- Wildcard search is a type of search where the user puts a wildcard character in a search term to stand in for other characters (a “*” commonly matches any sequence of zero or more characters). This can be useful when the user is unsure of the spelling of a word, or if they are looking for multiple variations of a word. For example, if a user wanted to find all articles that contain variations of the word “query”, they could use a wildcard search for “quer*”. This would return results for “query”, “queries”, “querying”, etc.
- Proximity search is a type of search that allows you to find documents where the terms you are searching for appear close to each other, for example within a given number of words. This can be helpful if you want to find documents where the terms are used in the same context. Phrase search is a stricter form of proximity search that finds documents where the terms appear next to each other in a specific order.
- Term weighting is a technique that is used to assign a numeric value to each term in a document. This value represents the importance of the term in the document, and it is used to determine which documents are more relevant to a specific query.
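A few of the query types above can be sketched against a small positional index using only the Python standard library. Everything here is an assumption made for illustration: the index contents, the helper names, and the use of `difflib` similarity ratios as a stand-in for the edit-distance matching that search engines typically use for fuzzy search.

```python
import difflib
import fnmatch

# Hypothetical positional index: term -> {document id: [token positions]}.
index = {
    "prescription": {1: [0], 2: [3]},
    "glasses": {1: [1], 2: [0], 6: [0]},
    "query": {3: [0]},
    "queries": {4: [2]},
}
vocabulary = list(index)


def docs_for(term):
    """Ids of documents that contain the term."""
    return set(index.get(term, {}))


def fuzzy_terms(term, cutoff=0.8):
    """Indexed terms whose similarity ratio to `term` is at least `cutoff`."""
    return difflib.get_close_matches(term, vocabulary, n=5, cutoff=cutoff)


def wildcard_terms(pattern):
    """Indexed terms that match a shell-style pattern such as 'quer*'."""
    return fnmatch.filter(vocabulary, pattern)


def within(term_a, term_b, max_distance):
    """Documents where the two terms occur at most `max_distance` tokens apart."""
    hits = set()
    for doc_id in docs_for(term_a) & docs_for(term_b):
        if any(abs(a - b) <= max_distance
               for a in index[term_a][doc_id]
               for b in index[term_b][doc_id]):
            hits.add(doc_id)
    return hits


print(fuzzy_terms("grasses"))                                    # ['glasses']
print(wildcard_terms("quer*"))                                   # ['query', 'queries']
print(sorted(docs_for("prescription") & docs_for("glasses")))    # AND -> [1, 2]
print(sorted(docs_for("query") | docs_for("queries")))           # OR  -> [3, 4]
print(sorted(docs_for("glasses") - docs_for("prescription")))    # NOT -> [6]
print(sorted(within("prescription", "glasses", 1)))              # [1]
```

Fuzzy and wildcard matching both work by expanding the query into concrete indexed terms first, after which the lookup proceeds like any other term query. Boolean operators map directly onto set intersection, union, and difference.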
Summary
A full-text search engine can be very beneficial for web pages and applications because it allows users to find information quickly and easily. Full-text search engines can find documents that contain a specific term or a combination of terms. The variety of query functions they provide allows for more relevant results, which in turn makes searching the database more efficient.