Over the last week I tried to represent Twitter spatial data on a map of some sort. I originally opted for open source solutions that I’d just have to configure to fit my needs. As it turned out, customizing one was going to take me longer than writing some kind of plugin from scratch, especially since the best solution I was able to find (Swiftriver) wasn’t built for this and lacked certain features that I badly needed. Also, due to the way it was built, I’d have to sift through gigabytes of data to extract a week’s worth of spatial information.
The first issue I needed to solve was tweet clustering: I wanted tweets that are in the same logical area to be grouped together. Raw longitude and latitude wouldn’t mean anything on their own, and grouping on them directly would mean using a resource-intensive algorithm to calculate the proximity of tweets. So I decided to use the Google Reverse Geocoding API, which supports both JSON and XML: you feed it coordinates and it returns an address with several levels of granularity.
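To make the reverse geocoding step concrete, here is a minimal Python sketch of how a lookup URL could be built and how one level of granularity could be pulled out of the JSON reply. It is not the code I actually ran. The endpoint and the field names (`status`, `results`, `address_components`, `types`, `long_name`) follow the Google Geocoding API as I understand it today, the function names are made up for illustration, and `SAMPLE_RESPONSE` is a hand-written stand-in rather than real API output.

```python
from urllib.parse import urlencode

# Assumed endpoint for the JSON flavour of the Google Geocoding API.
GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def build_reverse_geocode_url(lat, lng, api_key):
    """Build the request URL for a reverse geocoding lookup (hypothetical helper)."""
    query = urlencode({"latlng": f"{lat},{lng}", "key": api_key})
    return f"{GEOCODE_URL}?{query}"


def extract_area(response, level="locality"):
    """Return the long_name of the first address component tagged with `level`.

    Returns None if the lookup failed or no component has that type.
    """
    if response.get("status") != "OK":
        return None
    for result in response.get("results", []):
        for component in result.get("address_components", []):
            if level in component.get("types", []):
                return component.get("long_name")
    return None


# Hand-written sample shaped like an API reply; NOT real output.
SAMPLE_RESPONSE = {
    "status": "OK",
    "results": [
        {
            "formatted_address": "Example Street, Cairo, Egypt",
            "address_components": [
                {"long_name": "Example Street", "short_name": "Example St",
                 "types": ["route"]},
                {"long_name": "Cairo", "short_name": "Cairo",
                 "types": ["locality", "political"]},
                {"long_name": "Egypt", "short_name": "EG",
                 "types": ["country", "political"]},
            ],
        }
    ],
}
```

Tweets can then be grouped on whatever `extract_area` returns for the chosen level (for example `"locality"` for city-level clusters or `"country"` for coarser ones) instead of comparing coordinates pairwise.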
My logic works as follows. Every set period, Twitter_crawler goes in and collects tweets with locations attached to them, placing them into an array list that holds each tweet and its metadata (including the raw location). Once the crawler is done, address_translator works through the array list, converting each raw location to an address. Once that is done, the contents of the array list are inserted into the database.
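The flow above can be sketched roughly as follows. This is an illustration under assumptions, not my actual implementation: the function names, the table layout, and the use of SQLite are all invented for the example, the crawler is replaced by a stub that returns two made-up tweets, and the address lookup is passed in as a function so a fake one can stand in for the geocoding call.

```python
import sqlite3


def crawl_tweets():
    """Stand-in for Twitter_crawler: returns tweets that carry a raw location.

    The two tweets below are made-up placeholders, not collected data.
    """
    return [
        {"id": 1, "text": "sample tweet one", "lat": 30.0444, "lng": 31.2357},
        {"id": 2, "text": "sample tweet two", "lat": 31.2001, "lng": 29.9187},
    ]


def translate_addresses(tweets, lookup):
    """Stand-in for address_translator: attach an address to every tweet."""
    for tweet in tweets:
        tweet["address"] = lookup(tweet["lat"], tweet["lng"])
    return tweets


def store_tweets(conn, tweets):
    """Insert the translated tweets into a (hypothetical) tweets table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets "
        "(id INTEGER PRIMARY KEY, text TEXT, lat REAL, lng REAL, address TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO tweets VALUES (:id, :text, :lat, :lng, :address)",
        tweets,
    )
    conn.commit()


def run_once(conn, lookup):
    """One crawl cycle: collect, translate, store. Returns the tweet count."""
    tweets = crawl_tweets()
    translate_addresses(tweets, lookup)
    store_tweets(conn, tweets)
    return len(tweets)


def fake_lookup(lat, lng):
    """Placeholder for the reverse geocoding call, used for dry runs."""
    return f"area near {lat:.1f},{lng:.1f}"
```

A scheduler of some kind would call `run_once(conn, lookup)` every set period; for a dry run, `run_once(sqlite3.connect(":memory:"), fake_lookup)` exercises the whole cycle without touching Twitter or the geocoding service.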
I completed it, and the result was that only 1% of the tweets were geo-tagged. This piece of information alone was worth the trouble. I’m currently running it over several Egypt-related search phrases, hoping to collect enough data to make some sense of it.