HTTP Request Reference
The SemanticHacker API uses HTTP as its transport and follows a REST pattern for web services. The API accepts several HTTP request types and structures, so any HTTP client should be able to work with it.
Each call to the API will use the following base URL:
http://api.semantichacker.com/sh/api
API Parameters
Text Types
There are three ways to have text processed once it reaches the API. The type chosen can greatly affect the quality of the result. The html type causes the API to strip out all HTML tags so that only the non-markup text of the document remains. The text type indicates that the API should not process the input in any special way, as it is already plain text. The wp type indicates that the provided content is MediaWiki source and should be stripped of tags and formatting. The wp scraper was optimized for the English Wikipedia but may work on other wikis, particularly MediaWiki wikis.
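As a small illustration of the three text types, the sketch below validates a type value before adding it to the request parameters. It assumes the parameter is named type and that its literal values are html, text and wp, as the descriptions above suggest; the helper name with_text_type is hypothetical.

```python
from urllib.parse import urlencode

# The three text types named in this reference (literal values assumed).
TEXT_TYPES = ("html", "text", "wp")


def with_text_type(params, text_type=None):
    """Return a copy of params, adding a validated `type` entry if one is given."""
    result = dict(params)
    if text_type is not None:
        if text_type not in TEXT_TYPES:
            raise ValueError(f"unknown text type: {text_type!r}")
        result["type"] = text_type
    return result


# Omitting text_type leaves the API default (HTML) in effect.
query = urlencode(with_text_type({"token": "TOKEN", "content": "plain words"}, "text"))
print(query)
```

Leaving the type out entirely keeps the documented default, so the helper only adds the parameter when the caller asks for it.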
Accepted Methods
There are six ways to send us text from which a signature is generated. All methods must include a token parameter. Replace TOKEN in the examples below with the access token provided in the email you received when you signed up for access to the SemanticHacker API.
- GET request with a URI parameter. We'll crawl the URI for text content.
- POST request with a URI parameter as application/x-www-form-urlencoded. We'll crawl the URI for text content.
- GET with a content parameter.
- POST with content parameter as application/x-www-form-urlencoded.
- POST with content as multipart/form-data.
- POST or PUT with content as the request body.
1) GET request with a URI parameter
This method lets our system do the work of getting the text from the URL behind the scenes. If the type parameter is not passed, HTML is assumed, and tags will be stripped away before processing. The method is easy and fast because you don't have to upload the content. Here is a simple example:
GET /sh/api?token=TOKEN&showLabels=true&uri=http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FNeil_young HTTP/1.1
Host: api.semantichacker.com
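The request above can be assembled with Python's standard library. This is a minimal sketch, not an official client: the function name build_uri_get_url is hypothetical, and TOKEN stands in for your real access token.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://api.semantichacker.com/sh/api"


def build_uri_get_url(token, uri, show_labels=True):
    """Build the method 1 URL: token, optional showLabels, and the encoded uri."""
    params = {"token": token}
    if show_labels:
        params["showLabels"] = "true"
    params["uri"] = uri
    # quote_via=quote percent-encodes ':' and '/' as in the example request.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


url = build_uri_get_url("TOKEN", "http://en.wikipedia.org/wiki/Neil_young")
print(url)
# To actually send it: urllib.request.urlopen(url).read()
```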
2) POST request with a URI parameter as application/x-www-form-urlencoded
This is similar to #1 above, but uses POST and a form-urlencoded content type. Again, the type defaults to HTML unless you explicitly provide a type parameter.
POST /sh/api HTTP/1.1
Host: api.semantichacker.com
Content-Type: application/x-www-form-urlencoded

token=TOKEN&showLabels=true&uri=http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FNeil_young
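A hedged sketch of the same form POST using urllib.request. It only builds the request object; sending is left to the caller. The function name build_uri_post_request is hypothetical.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

API_URL = "http://api.semantichacker.com/sh/api"


def build_uri_post_request(token, uri):
    """Build (but do not send) the method 2 form POST."""
    # The parameters travel in the body rather than in the query string.
    body = urlencode(
        {"token": token, "showLabels": "true", "uri": uri}, quote_via=quote
    ).encode("ascii")
    return Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


req = build_uri_post_request("TOKEN", "http://en.wikipedia.org/wiki/Neil_young")
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))
# To actually send it: urllib.request.urlopen(req).read()
```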
3) GET with a content parameter
This method can be used for text that is shorter than 1000 characters. Although the RFC places no limit on the size of a GET parameter, we limit this type of request for performance reasons.
GET /sh/api?token=TOKEN&showLabels=true&content=the%20art%20of%20computer%20science HTTP/1.1
Host: api.semantichacker.com
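Because content parameters are limited to fewer than 1000 characters, a client may want to check the length before building the URL. The sketch below assumes the limit is counted on the raw text before percent-encoding, which this reference does not state; build_content_get_url is a hypothetical name.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://api.semantichacker.com/sh/api"
MAX_PARAM_CHARS = 1000  # content parameters must be shorter than this


def build_content_get_url(token, text):
    """Build the method 3 URL, refusing text that is too long for a parameter."""
    # Assumption: the limit applies to the raw text, not the encoded form.
    if len(text) >= MAX_PARAM_CHARS:
        raise ValueError("text too long for a content parameter; upload it instead")
    params = {"token": token, "showLabels": "true", "content": text}
    # quote_via=quote encodes spaces as %20, matching the example request.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


content_url = build_content_get_url("TOKEN", "the art of computer science")
print(content_url)
```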
4) POST with content parameter as application/x-www-form-urlencoded
This method is similar to #3. It is also limited to text shorter than 1000 characters for performance reasons.
POST /sh/api HTTP/1.1
Host: api.semantichacker.com
Content-Type: application/x-www-form-urlencoded

token=TOKEN&showLabels=true&content=the%20art%20of%20computer%20science
5) POST with content as multipart/form-data
This method can, and should, be used to upload larger content. It also integrates easily with existing tools that upload files using multipart forms. The content length is capped at 100,000 characters, again mostly for performance reasons. Note that the Content-Type does not affect how the API itself processes the text. The default is to treat all incoming text as HTML and thus remove any markup tags.
POST /sh/api HTTP/1.1
Host: api.semantichacker.com
Content-Type: multipart/form-data; boundary=x42x

--x42x
Content-Disposition: form-data; name="token"

TOKEN
--x42x
Content-Disposition: form-data; name="file"; filename="content"
Content-Type: text/plain

the art of computer science
--x42x--
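The multipart body above can be built by hand when no upload library is available. This is a sketch under stated assumptions: the field names token and file and the filename content are copied from the example, UTF-8 encoding is assumed, and build_multipart_body is a hypothetical helper.

```python
MAX_UPLOAD_CHARS = 100_000  # cap stated for multipart uploads


def build_multipart_body(token, text, boundary="x42x"):
    """Return (content_type, body_bytes) for the method 5 upload."""
    if len(text) > MAX_UPLOAD_CHARS:
        raise ValueError("content exceeds 100,000 characters")
    if boundary in text or boundary in token:
        raise ValueError("boundary string must not occur in the payload")
    lines = [
        f"--{boundary}",
        'Content-Disposition: form-data; name="token"',
        "",
        token,
        f"--{boundary}",
        'Content-Disposition: form-data; name="file"; filename="content"',
        "Content-Type: text/plain",
        "",
        text,
        f"--{boundary}--",
        "",
    ]
    # Multipart parts are separated by CRLF line endings.
    body = "\r\n".join(lines).encode("utf-8")  # UTF-8 is an assumption
    return f"multipart/form-data; boundary={boundary}", body


content_type, body = build_multipart_body("TOKEN", "the art of computer science")
print(content_type)
print(body.decode("utf-8"))
```

The returned content type goes in the Content-Type request header, and the bytes are sent as the POST body, for example through urllib.request.Request with method="POST".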
6) POST or PUT with content as the request body
This method is also for larger content. Up to 100,000 characters is acceptable.
PUT /sh/api?token=TOKEN HTTP/1.1
Host: api.semantichacker.com
Content-Type: text/plain

the art of computer science
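A minimal sketch of the raw-body request above: the token stays in the query string and the text becomes the request body. UTF-8 encoding and the helper name build_body_request are assumptions, not part of this reference.

```python
from urllib.parse import urlencode
from urllib.request import Request

API_URL = "http://api.semantichacker.com/sh/api"
MAX_BODY_CHARS = 100_000  # up to 100,000 characters is acceptable


def build_body_request(token, text, method="PUT"):
    """Build (but do not send) the method 6 request with text as the body."""
    if method not in ("POST", "PUT"):
        raise ValueError("method must be POST or PUT")
    if len(text) > MAX_BODY_CHARS:
        raise ValueError("content exceeds 100,000 characters")
    url = API_URL + "?" + urlencode({"token": token})
    return Request(
        url,
        data=text.encode("utf-8"),  # UTF-8 is an assumption
        headers={"Content-Type": "text/plain"},
        method=method,
    )


body_req = build_body_request("TOKEN", "the art of computer science")
print(body_req.get_method(), body_req.full_url)
# To actually send it: urllib.request.urlopen(body_req).read()
```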