Category Tool


The category tool computes the category for a Semantic Signature. The algorithm used here is deliberately simple and is not necessarily the best one; which algorithm works best depends on your application.

Algorithm

The algorithm is shown in pseudocode below. Note that 'blocked' refers to the blocked item list described under Removing Categories.

//First determine the label to use
parts = null
for each dimension in the signature
	label = the label of the dimension
	parts = label.split("/")
	if (parts[0] is blocked)
		parts = null
		continue to the next dimension
	else
		stop
end

if parts is null //No dimensions, or all of the labels start with a blocked name
	print "Other"
else
	//Now build the category, stopping at maxLen parts or at the first blocked part
	category = parts[0]
	for (i = 1; i < parts.length && i < maxLen; i++)
		if (parts[i] is blocked)
			stop
		else
			category += "/" + parts[i]
	end
	print category
end
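
The pseudocode above can be sketched in Java roughly as follows. This is a minimal illustration, not the tool's actual source: the class name CategorySketch, the method categorize, and the idea of passing the dimension labels as an ordered list are all assumptions made for this example. The two labels in main are hypothetical stand-ins for the dimensions of a signature, chosen to mirror the sample outputs in this document.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the selection algorithm; not the tool's real code.
class CategorySketch {

    // labels:  dimension labels in signature order (assumed input shape)
    // blocked: label parts that must not be used
    // maxLen:  maximum number of parts in the resulting category
    static String categorize(List<String> labels, Set<String> blocked, int maxLen) {
        // First determine the label to use: the first one whose
        // leading part is not blocked. Labels are assumed non-empty.
        String[] parts = null;
        for (String label : labels) {
            String[] candidate = label.split("/");
            if (!blocked.contains(candidate[0])) {
                parts = candidate;
                break;
            }
        }
        if (parts == null) {
            // No labels, or every label starts with a blocked part.
            return "Other";
        }

        // Now build the category, stopping at maxLen parts
        // or just before the first blocked part.
        StringBuilder category = new StringBuilder(parts[0]);
        for (int i = 1; i < parts.length && i < maxLen; i++) {
            if (blocked.contains(parts[i])) {
                break;
            }
            category.append('/').append(parts[i]);
        }
        return category.toString();
    }

    public static void main(String[] args) {
        // Hypothetical labels standing in for a signature's dimensions.
        List<String> labels = List.of(
            "Business/Consumer_Goods_and_Services/Home_and_Garden/Decor_and_Design",
            "Home/Cooking/Baking_and_Confections/Pies_and_Pastry");

        System.out.println(categorize(labels, Set.of(), 4));
        System.out.println(categorize(labels, Set.of(), 2));
        System.out.println(categorize(labels, Set.of("Home_and_Garden"), 4));
        System.out.println(categorize(labels, Set.of("Home_and_Garden", "Business"), 4));
    }
}
```

Note that blocking is matched on whole parts, so in this sketch blocking Home_and_Garden does not block a label whose first part is Home.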
		

Help Text for category

$ java -jar sh-tools.jar category
Usage: java com.semantichacker.api.tools.Categorize [OPTIONS] signature
Compute the category of a Semantic Signature

Option Summary:
	signature     	XML Signature file from the SemanticHacker API
	--labels FILE 	Labels file from the SemanticHacker datafile
	--max-len MAX 	Maximum number of elements in output category
	--blocked FILE	File containing a list of dimension components
	              	which are not to be considered                    
	-h, --help    	Display this help

Homepage: http://www.semantichacker.com
		

The --max-len option

The --max-len option sets the maximum number of label parts (the values delimited by / characters) in the category that is returned. The default is 4.


$ java -jar sh-tools.jar signature -t TOKEN -c "hot apple pie" --xmlout > hotapplepie.xml
$ java -jar sh-tools.jar category hotapplepie.xml
Business/Consumer_Goods_and_Services/Home_and_Garden/Decor_and_Design
$ java -jar sh-tools.jar category --max-len 2 hotapplepie.xml
Business/Consumer_Goods_and_Services
		

Removing Categories

For some applications, certain category names (such as News and Media) do not make sense or result in poor categories. The category tool accepts a file, passed via the --blocked option, that lists the label parts (the values delimited by / characters in a label) to block. If the first part of a label is blocked, the whole label is skipped and the next label (in the order of the signature) is considered instead. If a later part of a label is blocked, the category is truncated just before that part.

The blocked file should contain a list of label parts to be removed, one part per line.

$ java -jar sh-tools.jar category hotapplepie.xml
Business/Consumer_Goods_and_Services/Home_and_Garden/Decor_and_Design
$ echo Home_and_Garden > blocked
$ java -jar sh-tools.jar category hotapplepie.xml --blocked blocked
Business/Consumer_Goods_and_Services
$ echo Business >> blocked
$ java -jar sh-tools.jar category hotapplepie.xml --blocked blocked
Home/Cooking/Baking_and_Confections/Pies_and_Pastry
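
For readers building their own categorizer, a blocked file in the one-part-per-line format described above could be loaded along these lines. This is only a sketch: the class BlockedFileSketch and the method loadBlocked are hypothetical names, and trimming whitespace and skipping blank lines are assumptions of this example, not documented behavior of the tool.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper; not part of sh-tools.jar.
class BlockedFileSketch {

    // Reads one label part per line and returns them as a set.
    // Assumption: surrounding whitespace is ignored and blank lines are skipped.
    static Set<String> loadBlocked(Path file) throws IOException {
        Set<String> blocked = new LinkedHashSet<>();
        for (String line : Files.readAllLines(file)) {
            String part = line.trim();
            if (!part.isEmpty()) {
                blocked.add(part);
            }
        }
        return blocked;
    }
}
```

The resulting set can then be checked part by part against each label, as in the algorithm described earlier.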