Difference between revisions of "URLTeam"

Revision as of 06:40, 31 December 2011

Urlteam
url shortening was a fucking awful idea url shortening was a fucking awful idea
URL	http://urlte.am
Status	Online!
Archiving status	In progress...
Archiving type	Unknown
IRC channel	#urlteam (on hackint)

TinyURL, bit.ly and other similar services allow long URLs to be converted to smaller ones on their specific service; the small URL is visited by a consumer and their web browser is redirected to the long URL.

Such services are a ticking timebomb. If they go away, get hacked or sell out millions of links will be lost (see Wikipedia: Link Rot). Archive.org/301Works is acting as an escrow for URL shortener databases, but they rely on URL shorteners to actually give them their databases. Even 301Works founding member bit.ly does not actually share their databases and most other big shorteners don't share theirs either.

Who did this?

You can join us in our IRC channel: #urlteam on EFNet

User:Scumola started this wiki page
User:Chronomex started the Urlteam scraping effort
User:Soult Helps with scraping
User:Jeroenz0r Helps with scraping (and stalking Soult)

301Work cooperation

The fine folks at archive.org have provides us with upload permissions to the 301Works archive: http://www.archive.org/details/301utm. They unfortunately do not want to make them downloadable.

Tools

TinyBack (written in ruby by User:Soult)
User:Chronomex wrote his own Perl-based scraper: [1]
The Monkeyshines algorithmic scraper has a tool to efficiently scrape URL shortening services -- see the examples/shorturls directory. It scales efficiently to tens of millions of saved URLs. It uses a read-through cache to prevent re-requesting urls, it allows multiple scrapers to run on different machines while sharing the same lookup cache. You can either feed it a list of bare urls, or have it randomly try either base-36 or base-62 URLs. With it, User:Mrflip gathered about 6M valid URLs pulled from twitter messages so far.

Or just ask!

Here's a template that worked for me at least once. Well, data is pending but the site owner is gung-ho.

Try sending an email to the website owner:

Hello!

I'm working with Jason Scott of textfiles.org and other members of the
Archive Team.

Since the recent scare involving http://tr.im/'s announced (and then
retracted) imminent demise, we've been working to archive all the
links from URL shorteners around the Internet.

If I'm not mistaken, you operate urlx.org.  Would you be so kind as to
share with us a copy of your URL database?  We'll do our best to
preserve this data forever in a useful way.

We are already very far along in scraping links from tr.im, but it's
faster (and friendlier!) to contact site owners asking for a copy of
their data than it is to scrape.

We've got a domain registered, urlte.am, and all links will be
available for redirect in the format:

http://urlx.org.urlte.am/av3

If you could help us, that would be excellent!

Thank you,

URL shorteners

New table

The new table includes shorteners we have already started to scrape.

Name	Est. number of shorturls	Scraping done by	Status	Comments
TinyURL	1000000000	User:Soult	5-letter codes done, on halt due to being banned (2010-12-20)	non-sequential, bans IP for requesting too many non-existing shorturls
bit.ly	4000000000	User:Soult	lots and lots of scraping needed (2011-03-25)	non-sequential
goo.gl	??	User:Scumola	started (2011-03-04)	goo.gl throttles pulls
is.gd	534183259	User:Chronomex/User:Soult	probably got about 95% before switch to non-sequential	now non-sequential, new software version added crappy rate limiting
ff.im	?	User:Chronomex		only used by FriendFeed, no interface to shorten new URLs
4url.cc	1279 (2009-08-14)^[1]	User:Chronomex	done (2009-08-14)	dead (2011-02-15)
litturl.com	17096^[2]	User:Chronomex	done	dead (2010-11-18)
xs.md	3084 (2009-08-15)^[3]	User:Chronomex	done	dead (2010-11-18)
url.0daymeme.com	14867 (2009-08-14)^[4]	User:Chronomex	done	dead (2010-11-18)
tr.im	1990425	User:Soult	got what we could	dead (2011-12-31)
adjix.com	?	User:Jeroenz0r	Already done: 00-zz, 000-zzz, 0000-izzz.	case-insensitive, incremental
rod.gs	?	User:Jeroenz0r	Done: 00-ZZ, 000-2Qc	case-sensitive, incremental, server can't keep up with all the requests.
biglnk.com	?	User:Jeroenz0r	Done: 0-Z, 00-ZZ, 000-ZZZ	case-sensitive, incremental
go.to	60000	User:Asiekierka	Done: ~45000 (go.to network links only: goto_dump.zip)	no codes, only names, google-fu only gives the first 1000 results for each, thankfully most domains have less
Name	Number of shorturls	Scraping done by	Status	Comments

Old list^[5]

List last updated 2009-08-14.

6url.com - HTML redirect
ad.vu - mirror of adjix.com
awe.sm
budurl.com - Appears non-incremental
cli.gs - Appears non-incremental
decenturl.com - Not at all easy to scrape.
dlvr.it
doiop.com - Appears non-incremental
easyurl.net - Appears non-incremental: http://easyurl.net/afd2f
ilix.in - HTML redirect
imfy.us - requires a recaptcha to get to the linked site, and avast goes nuts.
jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388
metamark.net / xrl.us - ? http://xrl.us/bfabog
myurl.in - http://myurl.in/xtP5H / http://urlgator.com/xtP5H /http://ug4.me/xtP5H / http://link-ed.in/xtP5H - HTML redirect
minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh
notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/
nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.
ow.ly - I can't get it to work.
pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc
qurlyq.com - Javascript redirect. Appears sequential: http://qurlyq.com/5nf
redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok
s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab
shortlinks.co.uk - Working again.
short.to - Domain is parked - Probably sequential/loweralpha: http://short.to/msmp
shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok
shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp
shrinkurl.us - Alway telling URL is malformed
shrt.st - Appears incremental: http://shrt.st/vpz
simurl.com - Doesn't appear guessable: http://simurl.com/panpes
shorl.com - Doesn't appear guessable: http://shorl.com/tisikestibahu
smarturl.eu / joturl.com / zip.sm - Doesn't appear guessable, HTML redirect.
snipr.com - Appears incremental: http://snipr.com/27nvst http://snipr.com/27nvtt
snipurl.com - See above ^
snurl.com - See above above ^^
surl.co.uk - Many shortening options.
tighturl.com - Appears incremental: http://tighturl.com/30xu http://tighturl.com/30xv
tiny.cc - Appears non-incremental
traceurl.com
tr.im
tweetburner.com / twurl.nl - Appears incremental
twitpwr.com
twitthis.com
twurl.nl
u.mavrev.com
ur1.ca - Database is downloadable from website directly.
url9.com - Sequential, alphanumeric. Leading 0s are significant.
urlborg.com
urlbrief.com
urlcover.com
urlcut.com
urlhawk.com
url-press.com
urlsmash.com
urltea.com
urlvi.be
urlx.org - Owner has agreed to share his database
vimeo.com
wlink.us
xaddr.com
xil.in
xrl.us - see metamark.net
xym.kr
x.se
yatuc.com
yep.it
yweb.com
zi.ma
w3t.org

"Official" shorteners

goo.gl - Google
fb.me - Facebook
y.ahoo.it - Yahoo
youtu.be - YouTube
t.co? - Twitter
post.ly - Posterous
wp.me - Wordpress.com
flic.kr - Flickr
lnkd.in - LinkedIn
su.pr - StumbleUpon
go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)

bit.ly aliases

amzn.to - Amazon
binged.it - Bing (bonus points for being longer than bing.com)
1.usa.gov - USA Government
tcrn.ch - Techcrunch

Dead or Broken Shorteners

chod.sk - Appears non-incremental, not resolving
gonext.org - not resolving
ix.it - Not resolving
jijr.com - Doesn't appear to be a shortener, now parked
kissa.be - "Kissa.be url shortener service is shutdown"
kurl.us - Parked.
miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..."
minurl.org - Presently in ERROR 404
muhlink.com - Not resolving
myurl.us - cpanel frontend
1link.in - Website dead
canurl.com - Website dead
dwarfurl.com - Website dead/Numeric, appears incremental: http://dwarfurl.com/08041
easyuri.com - Website dead/Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3
go2cut.com - Website dead
lnkurl.com - Website dead
minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh - Website dead
memurl.com - Pronounceable. Broken.
nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own). Taken by squatters
digg.com - discontinued - [2]

Hueg list

[3]

References

Weblinks

[1] ttp://github.com/chronomex/urlteam

[2] ttp://github.com/chronomex/urlteam

[3] ttp://github.com/chronomex/urlteam

[4] ttp://github.com/chronomex/urlteam

[5] ttp://blog.go2.me/2009/01/exhausting-review-of-link-shorteners.html

[1]

[2]

[3]

[4]

[5]

@@ Line 65: / Line 65: @@
 {| class="sortable wikitable" style="width: auto; text-align: center"
 ! Name
-! Number of shorturls
+! Est. number of shorturls
 ! Scraping done by
 ! Status
@@ Line 71: / Line 71: @@
 |-
 | [http://tinyurl.com TinyURL]
-| 1,000,000,000
+| 1000000000
 | [[User:Soult]]
 | 5-letter codes done, on halt due to being banned (2010-12-20)
@@ Line 77: / Line 77: @@
 |-
 | [http://bit.ly bit.ly]
-| 4,000,000,000
+| 4000000000
 | [[User:Soult]]
 | lots and lots of scraping needed (2011-03-25)
@@ Line 89: / Line 89: @@
 |-
 | [http://is.gd is.gd]
-| 354,527,352
+| 534183259
 | [[User:Chronomex]]/[[User:Soult]]
 | probably got about 95% before switch to non-sequential
@@ Line 125: / Line 125: @@
 |-
 | [http://tr.im tr.im]
-| ?
+| 1990425
 | [[User:Soult]]
-| 5-letter codes finished, 6-letter codes in progress
+| got what we could
-| resolving still works despite planning to completely shut down at the end of 2010 (2011-02-15), whoever owns that thing is a major pain in the ass
+| dead (2011-12-31)
 |-
 | adjix.com
@@ Line 149: / Line 149: @@
 |-
 | go.to
-| >60,000
+| 60000
 | [[User:Asiekierka]]
 | Done: ~45000 (go.to network links only: [http://64pixels.org/goto_dump.zip goto_dump.zip])

Difference between revisions of "URLTeam"

Revision as of 06:40, 31 December 2011

Contents

Who did this?

301Work cooperation

Tools

Or just ask!

URL shorteners

New table

Old list^[5]

"Official" shorteners

bit.ly aliases

Dead or Broken Shorteners

Hueg list

References

Weblinks

Navigation menu

Difference between revisions of "URLTeam"

Revision as of 06:40, 31 December 2011

Who did this?

301Work cooperation

Tools

Or just ask!

URL shorteners

New table

Old list[5]

"Official" shorteners

bit.ly aliases

Dead or Broken Shorteners

Hueg list

References

Weblinks

Navigation menu

Search

Old list^[5]