Difference between revisions of "URLTeam"
(Monkeyshines scraper could help here.)
Revision as of 18:48, 19 July 2009
Too many people using TinyURL and similar services
Twitter is a great example of what's wrong with trusting an online service with something of value. Check out some 'tweets':
- Hah, I'm a Zombie! http://tinyurl.com/8gnnb7 Ahh, the fun we all have with each other. about 1 hour ago from web
- Health privacy is dead. Here's why: http://ff.im/GMpx about 14 hours ago from FriendFeed
- Hmm, friendfeed released a new "import Twitter" feature today. It is taking a LONG time on my account. I wonder why.... http://ff.im/GM5W about 14 hours ago from FriendFeed
If these TinyURL services go away, there's not much content here. See Link Rot (http://en.wikipedia.org/wiki/Link_rot).
So, the project: scrape the TinyURL (and similar) services.
STATUS (as of mid-April, 2009):
- tinyurl.com: 1M urls ripped
- ff.im: 1M urls ripped
- bit.ly: just started mid-April, 2009
- NOTE: ripping is going slowly so I don't get banned and/or overwhelm the service. ff.im banned me for 24 hours once for ripping too quickly. Also, I'm ripping random URLs, not sequential.
- This looks like it would be a good task for distributed computing. Majestic-12 (http://www.majestic12.co.uk/) is a project whose main bottleneck is bandwidth, and they are doing quite well. You'd just need to give people a block of URLs to check, and have them report back the results.
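The "give people a block of URLs" idea could be sketched roughly like this. It is only an illustration, not an existing tool: make_blocks and merge_results are hypothetical names, and it assumes each short hash can be numbered so that a block is just a range of integer IDs.

```python
def make_blocks(total, block_size):
    """Split the ID range [0, total) into half-open (start, stop) blocks.

    Each block is what one volunteer would be asked to check.
    (Hypothetical helper; numbering hashes as integers is an assumption.)
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [(start, min(start + block_size, total))
            for start in range(0, total, block_size)]


def merge_results(reports):
    """Combine per-volunteer {hash: target_url} reports into one mapping.

    Later reports overwrite earlier ones for the same hash.
    """
    merged = {}
    for report in reports:
        merged.update(report)
    return merged
```

A coordinator would hand out the tuples from make_blocks, and each volunteer would report back a dictionary of the hashes that resolved. Throttling so nobody gets banned would still have to happen on each volunteer's side.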
HOWTO
It's actually not as hard as it sounds. We don't need to scrape any web pages or parse any HTML: when queried for a hash, the services just send a Location: header. So we ask the service for the hash and parse the response headers for the redirect URL:
(18) swebb@swebb.cluster Wed 11:10am [~] % curl -LLIs http://tinyurl.com/6dvm2t | grep Location
Location: http://www.readwriteweb.com/archives/too_many_people_use_tinyurl.php
(19) swebb@swebb.cluster Wed 11:10am [~] % curl -LLIs http://ff.im/GMpx | grep Location
Location: http://friendfeed.com/e/08954685-00fe-4e55-b28f-4b99f83bfb0d/Health-privacy-is-dead-Here-s-why/
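The same lookup can be sketched in Python using only the standard library. This is a minimal sketch, not what was actually used for the ripping: parse_redirect and resolve are hypothetical names, and unlike the curl calls above it sends a single HEAD request and does not follow the redirect chain.

```python
import http.client
from urllib.parse import urlsplit

# Status codes treated as redirects in this sketch.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def parse_redirect(status, headers):
    """Return the Location target, or None if this is not a redirect.

    `headers` is a list of (name, value) pairs, the shape returned by
    http.client.HTTPResponse.getheaders().
    """
    if status not in REDIRECT_CODES:
        return None
    for name, value in headers:
        if name.lower() == "location":
            return value
    return None


def resolve(short_url, timeout=10):
    """Send one HEAD request for short_url and return its redirect target."""
    parts = urlsplit(short_url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        conn.request("HEAD", parts.path or "/")
        resp = conn.getresponse()
        return parse_redirect(resp.status, resp.getheaders())
    finally:
        conn.close()
```

Calling resolve("http://tinyurl.com/6dvm2t") would return the target URL if the service answers with a redirect, and None otherwise. Whether a given service answers HEAD requests the same way it answers GET is something to check per service.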
Walk through all possible hashes, check for errors, and we're good to go.
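Walking the hash space might look something like the sketch below. It assumes hashes are fixed-length strings over a base-36 or base-62 alphabet; the alphabet order, and the helper names all_hashes, encode and random_hash, are assumptions for illustration, since each service picks its own scheme.

```python
import itertools
import random
import string

# Assumed alphabets; real services may order or restrict characters differently.
BASE36 = string.digits + string.ascii_lowercase
BASE62 = BASE36 + string.ascii_uppercase


def all_hashes(length, alphabet=BASE36):
    """Yield every candidate hash of the given length, in sequential order."""
    for combo in itertools.product(alphabet, repeat=length):
        yield "".join(combo)


def encode(n, alphabet=BASE36):
    """Convert a non-negative integer to its representation in the alphabet."""
    if n == 0:
        return alphabet[0]
    base = len(alphabet)
    digits = []
    while n:
        n, rem = divmod(n, base)
        digits.append(alphabet[rem])
    return "".join(reversed(digits))


def random_hash(length, alphabet=BASE36, rng=random):
    """Pick one random candidate hash, for ripping in random order."""
    return "".join(rng.choice(alphabet) for _ in range(length))
```

all_hashes gives the sequential walk, while random_hash matches the random-order ripping mentioned in the status notes. Each candidate still has to be sent to the service, and hashes that do not redirect count as errors to skip.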
Monkeyshines
The Monkeyshines algorithmic scraper (http://github.com/mrflip/monkeyshines) has a tool to efficiently scrape URL shortening services -- see the examples/shorturls directory. It scales efficiently to tens of millions of saved URLs. It uses a read-through cache to avoid re-requesting URLs, and it allows multiple scrapers to run on different machines while sharing the same lookup cache. You can either feed it a list of bare URLs or have it randomly try base-36 or base-62 URLs. With it, I've gathered about 6M valid URLs pulled from Twitter messages so far.
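For anyone unfamiliar with the term, a read-through cache can be sketched in a few lines. This is not Monkeyshines code (Monkeyshines is a separate project with its own implementation); ReadThroughCache is a hypothetical class, and the in-memory dict stands in for whatever shared store several machines would actually use.

```python
class ReadThroughCache:
    """Look a key up in the store first; only call fetch() on a miss."""

    def __init__(self, fetch, store=None):
        # fetch: function taking a short URL (or hash) and returning its target.
        # store: any dict-like object; a plain dict is assumed here.
        self._fetch = fetch
        self._store = {} if store is None else store

    def get(self, key):
        if key in self._store:
            return self._store[key]
        value = self._fetch(key)
        # Failed lookups (None) are stored too, so dead hashes
        # are not requested again.
        self._store[key] = value
        return value
```

Passing the same store object to several scrapers is what would keep them from re-requesting each other's URLs. With a plain dict that only works inside one process, so sharing across machines would need a networked store.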
URL shortening services:
1link.in 4url.cc 6url.com adjix.com ad.vu bellypath.com bit.ly bkite.com budurl.com canurl.com chod.sk cli.gs decenturl.com dn.vc doiop.com dwarfurl.com easyuri.com easyurl.net ff.im go2cut.com gonext.org hulu.com hypem.com ifood.tv ilix.in is.gd ix.it jdem.cz jijr.com kissa.be kurl.us litturl.com lnkurl.com memurl.com metamark.net miklos.dk minilien.com minurl.org muhlink.com myurl.in myurl.us notlong.com ow.ly plexp.com poprl.com qurlyq.com redirx.com s3nt.com shorterlink.com shortlinks.co.uk short.to shorturl.com shrinklink.co.uk shrinkurl.us shrt.st shurl.net simurl.com shorl.com smarturl.eu snipr.com snipurl.com snurl.com sn.vc starturl.com surl.co.uk tighturl.com timesurl.at tiny123.com tiny.cc tinylink.com tinyurl.com tobtr.com traceurl.com tr.im tweetburner.com twitpwr.com twitthis.com twurl.nl u.mavrev.com ur1.ca url9.com urlborg.com urlbrief.com urlcover.com urlcut.com urlhawk.com url-press.com urlsmash.com urltea.com urlvi.be vimeo.com wlink.us xaddr.com xil.in xrl.us x.se xs.md yatuc.com yep.it yweb.com zi.ma w3t.org