Similarity Search
Matching documents via their Semantic Signatures is similar, in principle, to matching documents via the terms contained in them. Instead of using the terms contained in the documents and their frequencies, TSV matching uses the dimensions in the documents’ Semantic Signatures and their associated weights. Matching treats each Semantic Signature as a vector of semantic dimensions in an N-dimensional space and measures the distance between those vectors.
Much of the math involved in the comparison is performed in advance during the process of generating the semantic dictionary and the Semantic Signatures. The steps to matching are:
- Finding the semantic dimensions that are shared between two documents.
- Calculating a weight (a similarity factor) for each shared semantic dimension.
- Calculating a matching score. The higher the score, the more similar the documents are in the semantic space.
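The three steps above can be sketched in a few lines of Python. This is only an illustration: the text does not give the actual similarity factor or scoring formula, so the sketch assumes the factor is the product of the two weights and the score is their sum (a plain dot product over shared dimensions). The `Signature` type, the `match_score` name, and the sample dimension IDs are hypothetical.

```python
from typing import Dict

# Hypothetical representation: dimension ID -> weight in the signature.
Signature = Dict[int, float]


def match_score(sig_a: Signature, sig_b: Signature) -> float:
    """Score two signatures by summing a per-dimension similarity factor."""
    # Step 1: find the semantic dimensions shared by both documents.
    shared = sig_a.keys() & sig_b.keys()
    # Step 2: similarity factor per shared dimension.
    # Assumption: the factor is the product of the two weights.
    factors = {dim: sig_a[dim] * sig_b[dim] for dim in shared}
    # Step 3: combine the factors into a single score.
    # Assumption: the score is the plain sum of the factors.
    return sum(factors.values())


doc_a = {101: 0.5, 205: 0.25, 309: 0.5}
doc_b = {101: 0.5, 309: 0.25, 412: 0.75}
print(match_score(doc_a, doc_b))  # dimensions 101 and 309 are shared
```

Documents with no shared dimension score zero under this sketch, which matches the requirement that at least one dimension be shared for a non-zero score.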
The Index
The Semantic Signatures matching logic uses techniques that are similar to those used in a text retrieval system. The system uses an index that contains:
- A list of the semantic dimensions.
- For each semantic dimension, the list of IDs of the documents that contain it.
- For each document in each list, the weight associated with the semantic dimension in that document's Semantic Signature.
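The index described above resembles an inverted index in text retrieval, with semantic dimensions in place of terms. The following is a minimal in-memory sketch of that shape, not the system's actual storage format. The names `build_index` and `Postings`, and the sample document IDs, are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Signature = Dict[int, float]          # dimension ID -> weight (hypothetical)
Postings = List[Tuple[str, float]]    # (document ID, weight) pairs


def build_index(signatures: Dict[str, Signature]) -> Dict[int, Postings]:
    """Map each semantic dimension to the documents that contain it."""
    index = defaultdict(list)
    for doc_id, signature in signatures.items():
        for dim_id, weight in signature.items():
            # Store the document ID together with the dimension's weight
            # in that document's signature.
            index[dim_id].append((doc_id, weight))
    return dict(index)


corpus = {
    "doc-1": {101: 0.5, 205: 0.25},
    "doc-2": {101: 0.75, 309: 0.5},
}
index = build_index(corpus)
print(index[101])  # every document containing dimension 101, with its weight
```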
A list of our current indexes can be found in the Documentation.
Matching and The Match Query
When searching in an index for the documents that best match an incoming document, the following operations are performed:
- The document is filtered to remove HTML and boilerplate, if the incoming document is a Web page.
- A Semantic Signature® is generated for the incoming document.
- The Semantic Signature® is converted to a weighted term query: each dimension ID is used as a query term, and the dimension’s weight in the signature is used as a weighting factor in the query.
- The query is run against the index. Two documents must share at least one semantic dimension to have a match score greater than zero.
To achieve adequate performance, the TextWise matching system limits the number of semantic dimensions used for each document to the top 30 and uses a non-standard, weight-ordered index to perform the search.
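A rough sketch of the query conversion and the top-30 limit follows. It assumes that "top 30" means the 30 highest-weighted dimensions and that "weight-ordered" means each dimension's document list is sorted by descending weight; the text does not spell out either detail. The function names are hypothetical.

```python
from typing import Dict, List, Tuple

Signature = Dict[int, float]  # dimension ID -> weight (hypothetical)
MAX_DIMENSIONS = 30           # the limit stated in the text


def to_weighted_query(
    signature: Signature, limit: int = MAX_DIMENSIONS
) -> List[Tuple[int, float]]:
    """Use each dimension ID as a query term and its weight as the term weight."""
    # Assumption: "top" dimensions are those with the highest weights.
    ranked = sorted(signature.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


def order_postings_by_weight(
    postings: List[Tuple[str, float]]
) -> List[Tuple[str, float]]:
    """One plausible reading of a weight-ordered index: heaviest entries first."""
    return sorted(postings, key=lambda entry: entry[1], reverse=True)


# A made-up signature with 50 dimensions and strictly decreasing weights.
signature = {dim_id: 1.0 / (dim_id + 1) for dim_id in range(50)}
query = to_weighted_query(signature)
print(len(query), query[0])
```

Ordering entries by weight would let a search visit the strongest candidates for each query term first, which is one way such an index could trade exhaustiveness for speed.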
The following is an example of a match between two Web pages about the Hubble space telescope:


The Matching API Service
The match service call matches the content provided in the call against a set of content held in one of our matching indexes. The content is converted into a Semantic Signature and compared against the Semantic Signatures of all items in the specified index ID. The service returns a list of the content items from the index that are most relevant to the content provided, ordered by their "match score" from highest to lowest.
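The behavior of the result list can be imitated locally, as in the sketch below. It is not the service's API: the `rank_matches` function, its `top_k` parameter, and the dot-product score are all assumptions chosen to illustrate "compare against every item, then list by match score from highest to lowest".

```python
from typing import Dict, List, Tuple

Signature = Dict[int, float]  # dimension ID -> weight (hypothetical)


def rank_matches(
    content: Signature, index_items: Dict[str, Signature], top_k: int = 10
) -> List[Tuple[str, float]]:
    """Return (item ID, match score) pairs, highest score first."""
    scored = []
    for item_id, item_signature in index_items.items():
        shared = content.keys() & item_signature.keys()
        # Assumption: the score is a dot product over shared dimensions.
        score = sum(content[dim] * item_signature[dim] for dim in shared)
        if score > 0:  # items with no shared dimension are not matches
            scored.append((item_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


items = {
    "item-a": {1: 0.5, 2: 0.5},
    "item-b": {1: 1.0},
    "item-c": {9: 1.0},
}
results = rank_matches({1: 0.5, 2: 0.25}, items)
print(results)  # item-c shares no dimension with the content and is omitted
```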