Similarity Tool


The similarity tool computes the similarity (also called the match score or relevance) between two Semantic Signatures®. The similarity score is a value between 0 and 1. Apart from floating-point error, a signature matches itself with a score of exactly 1. A score below 0.4 should be considered suspicious, while a score above 0.8 should be considered very good.

Pseudocode

The following is pseudocode for computing the similarity between two signatures.

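// Sum the weight products over dimensions present in both signatures.
float score = 0;
for (dimension in signature1) {
	// Dimensions missing from signature2 contribute nothing.
	if (dimension in signature2) {
		score += signature1[dimension] * signature2[dimension];
	}
}
return score;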
		

In other words, the score is the sum, over the dimensions the two signatures have in common, of the product of the two weights for that dimension. Dimensions that appear in only one signature contribute nothing.
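
The pseudocode above could be written in Java roughly as follows. This is a minimal sketch, not the tool's actual implementation: the class name SimilaritySketch, the representation of a signature as a Map from dimension ID to weight, and the toy weights in main are all assumptions made for illustration. The real tool reads signatures from the XML returned by the API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class; a signature is assumed to be a map of dimension ID -> weight.
class SimilaritySketch {

    // Sum the products of the weights for every dimension present in both signatures.
    static double similarity(Map<Integer, Double> signature1,
                             Map<Integer, Double> signature2) {
        double score = 0.0;
        for (Map.Entry<Integer, Double> entry : signature1.entrySet()) {
            Double otherWeight = signature2.get(entry.getKey());
            if (otherWeight != null) {
                // The dimension is shared, so it contributes to the score.
                score += entry.getValue() * otherWeight;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        // Toy signatures with made-up weights, not real API output.
        Map<Integer, Double> a = new HashMap<>();
        a.put(1, 0.6);
        a.put(2, 0.8);

        Map<Integer, Double> b = new HashMap<>();
        b.put(2, 0.6);
        b.put(3, 0.8);

        // Only dimension 2 is shared between a and b.
        System.out.println("a vs a: " + similarity(a, a));
        System.out.println("a vs b: " + similarity(a, b));
    }
}
```

With these toy weights, comparing a with itself gives a score of approximately 1 (subject to floating-point error), and comparing a with b gives approximately 0.48, because only dimension 2 is shared.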

Help Text for similarity

$ java -jar sh-tools.jar similarity
Usage: java com.semantichacker.api.tools.Similarity [OPTIONS] file1 file2
Compute the similarity score of two Semantic Signatures

Option Summary:
	file1        	The first signature (XML from API) to compute the
	             	similarity of                                      
	file2        	The second signature (XML from API) to compute the
	             	similarity of                                      
	-v, --verbose	Print each matching dimension and rank the score
	--labels FILE	The list of labels from the SemanticHacker
	             	Datafile, this allows labels without needing to get
	             	them from the API                                  
	--nolabels   	Do not show labels in dimension printout.
	-h, --help   	Display this help

Homepage: http://www.semantichacker.com
		

Examples

The following example retrieves two signatures from the API and computes their similarity score.


$ java -jar sh-tools.jar signature -t TOKEN -c java --xmlout --outfile java.xml
$ java -jar sh-tools.jar signature -t TOKEN -c jdk --xmlout --outfile jdk.xml
$ java -jar sh-tools.jar similarity -v java.xml jdk.xml
Dim ID	Sig1		Sig2		Weight		Label
 9442	0.301207	0.287189	0.086503	Computers/Programming/Languages/Java/Resources
 9465	0.291356	0.142665	0.041566	Computers/Programming/Languages/Java/News_and_Media/Books
 9443	0.234836	0.087548	0.020559	Computers/Programming/Languages/Java/Resources/Certification
 9422	0.233443	0.126003	0.029415	Computers/Programming/Languages/Java
 9467	0.209163	0.225452	0.047156	Computers/Programming/Languages/Java/Official_Documentation
 9427	0.201700	0.160814	0.032436	Computers/Programming/Languages/Java/Development_Tools/Performance_and_Testing
 9440	0.200207	0.398539	0.079791	Computers/Programming/Languages/Java/Implementations
 9446	0.187670	0.242634	0.045535	Computers/Programming/Languages/Java/FAQs,_Help,_and_Tutorials/Tutorials
 9445	0.185879	0.288379	0.053604	Computers/Programming/Languages/Java/FAQs,_Help,_and_Tutorials/FAQs
 9423	0.173341	0.316496	0.054862	Computers/Programming/Languages/Java/Development_Tools
 9441	0.172246	0.115069	0.019820	Computers/Programming/Languages/Java/Personal_Pages
 9474	0.162494	0.087845	0.014274	Computers/Programming/Languages/Java/Applications
 9456	0.156624	0.105102	0.016461	Computers/Programming/Languages/Java/Class_Libraries/Data_Formats
 9449	0.152345	0.103168	0.015717	Computers/Programming/Languages/Java/Mailing_Lists
 9453	0.151250	0.148467	0.022456	Computers/Programming/Languages/Java/Class_Libraries/Graphics
 9452	0.150852	0.148467	0.022397	Computers/Programming/Languages/Java/Class_Libraries
 9466	0.139011	0.081895	0.011384	Computers/Programming/Languages/Java/News_and_Media/Magazines_and_E-zines/Articles
 9755	0.138812	0.181120	0.025142	Computers/Programming/Threads/Java
 9433	0.133239	0.090449	0.012051	Computers/Programming/Languages/Java/Server-Side/JavaServer_Pages
0.6511292
		
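In the verbose output above, each Weight value is the product of the Sig1 and Sig2 values on that row, and the final line is the sum of the Weight column. The sketch below recomputes that sum for the first three rows only, so it yields a partial sum rather than the full score. The class name VerboseOutputCheck and the assumption that columns are whitespace-separated in the order Dim ID, Sig1, Sig2, Weight, Label are illustrative and not part of the tool.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for reading rows of the verbose printout.
class VerboseOutputCheck {

    // Recompute a score by summing Sig1 * Sig2 over the given rows.
    // Assumes whitespace-separated columns: Dim ID, Sig1, Sig2, Weight, Label.
    static double scoreFromRows(List<String> rows) {
        double score = 0.0;
        for (String row : rows) {
            String[] cols = row.trim().split("\\s+");
            double sig1 = Double.parseDouble(cols[1]);
            double sig2 = Double.parseDouble(cols[2]);
            score += sig1 * sig2;
        }
        return score;
    }

    public static void main(String[] args) {
        // First three rows copied from the sample output above.
        List<String> rows = Arrays.asList(
            " 9442\t0.301207\t0.287189\t0.086503\tComputers/Programming/Languages/Java/Resources",
            " 9465\t0.291356\t0.142665\t0.041566\tComputers/Programming/Languages/Java/News_and_Media/Books",
            " 9443\t0.234836\t0.087548\t0.020559\tComputers/Programming/Languages/Java/Resources/Certification"
        );
        // Partial sum; should be close to 0.086503 + 0.041566 + 0.020559.
        System.out.println(scoreFromRows(rows));
    }
}
```

Because the printed values are rounded to six decimal places, a recomputed sum may differ from the tool's printed score in the last digits.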

The following shows the output without the -v parameter.

$ java -jar sh-tools.jar similarity java.xml jdk.xml
0.6511292