URLTeam

Revision as of 17:41, 20 October 2012

Urlteam
url shortening was a fucking awful idea

URL: http://urlte.am
Status: Online!
Archiving status: In progress...
Archiving type: Unknown
Project source: https://github.com/ArchiveTeam/urlteam-stuff
Project tracker: http://tracker.tinyarchive.org/
IRC channel: #urlteam (on hackint)

TinyURL, bit.ly, and other similar services allow long URLs to be converted to much shorter ones on their specific service; when a consumer visits the short URL, their web browser is redirected to the original long URL.
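
As a purely illustrative sketch (not part of any URLTeam tool), resolving one short URL boils down to requesting it without following the redirect and recording the Location header. The hostname and short code below are placeholders:

 # Hypothetical example: resolve a short URL by reading its redirect target.
 # The host and code are placeholders, not a real shortener entry.
 from urllib.parse import urlsplit
 import http.client
 
 def resolve(short_url):
     parts = urlsplit(short_url)
     conn_cls = http.client.HTTPSConnection if parts.scheme == "https" else http.client.HTTPConnection
     conn = conn_cls(parts.netloc, timeout=10)
     # Some shorteners reject HEAD; falling back to GET works the same way.
     conn.request("HEAD", parts.path or "/")
     resp = conn.getresponse()
     if resp.status in (301, 302, 303, 307, 308):
         return resp.getheader("Location")   # the original long URL
     return None                             # no redirect: code likely unused
 
 print(resolve("http://example-shortener.invalid/abc123"))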

Such services are a ticking time bomb. If they go away, get hacked, or sell out, millions of links will be lost (see Wikipedia: Link Rot). Archive.org/301Works is acting as an escrow for URL shortener databases, but they rely on the URL shorteners to actually hand over their databases. Even 301Works founding member bit.ly does not actually share its database, and most other big shorteners don't share theirs either.

Who did this?

You can join us in our IRC channel: #urlteam on EFNet

  • User:Soult - Helps with scraping
  • User:Jeroenz0r - Helps with scraping (and stalking Soult)
  • ... many ArchiveTeam people who run the scrapers

301Work cooperation

The fine folks at archive.org have provided us with upload permissions to the 301Works archive: http://www.archive.org/details/301utm. They unfortunately do not want to make the uploads downloadable, but the same data is in our torrents too, just in a different format (we use tab-delimited, xz-compressed files, while 301Works uses comma-delimited, uncompressed files).
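
For illustration, here is a minimal Python 3 sketch of that format difference. It assumes each dump line is "<shortcode><tab><long URL>", which is an assumption about the layout rather than a documented spec:

 # Hypothetical converter: tab-delimited, xz-compressed dump in,
 # comma-delimited (CSV), uncompressed file out.
 import csv
 import lzma
 import sys
 
 def convert(xz_path, csv_path):
     with lzma.open(xz_path, "rt", encoding="utf-8", errors="replace") as src, \
          open(csv_path, "w", newline="", encoding="utf-8") as dst:
         writer = csv.writer(dst)
         for line in src:
             code, sep, url = line.rstrip("\n").partition("\t")
             if sep:                       # skip lines without a tab
                 writer.writerow([code, url])
 
 if __name__ == "__main__":
     convert(sys.argv[1], sys.argv[2])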

Tools

  • fetcher.pl: Perl-based scraper by User:Chronomex (https://github.com/chronomex/urlteam)
  • TinyBack: Python 2.x-based, distributed scraper, also works with the Warrior (https://github.com/soult/tinyback)

TinyBack

The easiest way to help with scraping is to run the Warrior and select the URLTeam project. You can also run TinyBack outside the Warrior, though Python 2.6 or newer is required:

 git clone https://github.com/soult/tinyback
 cd tinyback
 # Use ./run.py --help for more information on command-line options
 ./run.py --tracker=http://tracker.tinyarchive.org/v1/ --num-threads=3 --sleep=180

URL shorteners

New table

The new table includes shorteners we have already started to scrape. Several entries note whether a service assigns its codes incrementally; a short enumeration sketch follows the table.

Name | Est. number of shorturls | Scraping done by | Status | Comments
TinyURL | 1000000000 | User:Soult | 5-letter codes done, on halt due to being banned (2010-12-20) | non-sequential, bans IP for requesting too many non-existing shorturls
bit.ly | 4000000000 | User:Soult | lots and lots of scraping needed (2011-03-25) | non-sequential
goo.gl | ?? | User:Scumola | started (2011-03-04) | goo.gl throttles pulls
is.gd | 534183259 | User:Chronomex/User:Soult | probably got about 95% before switch to non-sequential | now non-sequential, new software version added crappy rate limiting
ff.im | ? | User:Chronomex | | only used by FriendFeed, no interface to shorten new URLs
4url.cc | 1279 (2009-08-14)[1] | User:Chronomex | done (2009-08-14) | dead (2011-02-15)
litturl.com | 17096[2] | User:Chronomex | done | dead (2010-11-18)
xs.md | 3084 (2009-08-15)[3] | User:Chronomex | done | dead (2010-11-18)
url.0daymeme.com | 14867 (2009-08-14)[4] | User:Chronomex | done | dead (2010-11-18)
tr.im | 1990425 | User:Soult | got what we could | dead (2011-12-31)
adjix.com | ? | User:Jeroenz0r | Already done: 00-zz, 000-zzz, 0000-izzz | case-insensitive, incremental
rod.gs | ? | User:Jeroenz0r | Done: 00-ZZ, 000-2Qc | case-sensitive, incremental, server can't keep up with all the requests
biglnk.com | ? | User:Jeroenz0r | Done: 0-Z, 00-ZZ, 000-ZZZ | case-sensitive, incremental
go.to | 60000 | User:Asiekierka | Done: ~45000 (go.to network links only: goto_dump.zip) | no codes, only names, google-fu only gives the first 1000 results for each, thankfully most domains have less
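
Several of the services above are marked incremental: their codes are just base-36 or base-62 counters (00-zz, 000-ZZZ, and so on), so a scraper can enumerate them in order. Here is a minimal sketch of that enumeration; the alphabets, code lengths, and hostname are illustrative assumptions, not the exact settings of any particular shortener:

 from itertools import product
 
 # Hypothetical enumeration of sequential short codes.
 BASE36 = "0123456789abcdefghijklmnopqrstuvwxyz"         # case-insensitive services
 BASE62 = BASE36 + "ABCDEFGHIJKLMNOPQRSTUVWXYZ"          # case-sensitive services
 
 def codes(alphabet, length):
     """Yield every code of the given length in sequential order."""
     for combo in product(alphabet, repeat=length):
         yield "".join(combo)
 
 # Walk "00".."zz" and then "000".."zzz", the way a sequential scraper would.
 for length in (2, 3):
     for code in codes(BASE36, length):
         short_url = "http://example-shortener.invalid/" + code
         # ...fetch short_url here and record where it redirects...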

Old list[5]

List last updated 2009-08-14.

"Official" shorteners

  • goo.gl - Google
  • fb.me - Facebook
  • y.ahoo.it - Yahoo
  • youtu.be - YouTube
  • t.co? - Twitter
  • post.ly - Posterous
  • wp.me - WordPress.com
  • flic.kr - Flickr
  • lnkd.in - LinkedIn
  • su.pr - StumbleUpon
  • go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)
bit.ly aliases
  • amzn.to - Amazon
  • binged.it - Bing (bonus points for being longer than bing.com)
  • 1.usa.gov - USA Government
  • tcrn.ch - TechCrunch

Dead or Broken Shorteners

  • chod.sk - Appears non-incremental, not resolving
  • gonext.org - not resolving
  • ix.it - Not resolving
  • jijr.com - Doesn't appear to be a shortener, now parked
  • kissa.be - "Kissa.be url shortener service is shutdown"
  • kurl.us - Parked.
  • miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..." ("We are working on it...")
  • minurl.org - Presently in ERROR 404
  • muhlink.com - Not resolving
  • myurl.us - cpanel frontend
  • 1link.in - Website dead
  • canurl.com - Website dead
  • dwarfurl.com - Website dead/Numeric, appears incremental: http://dwarfurl.com/08041
  • easyuri.com - Website dead/Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3
  • go2cut.com - Website dead
  • lnkurl.com - Website dead
  • minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh - Website dead
  • memurl.com - Pronounceable. Broken.
  • nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own). Taken by squatters
  • digg.com - discontinued - [1]
  • u.nu - "The shortest URLs. period." Website dead since at least 1 October 2010 (http://web.archive.org/web/20100104023208/http://u.nu/)

Hueg list

[2]

References

Weblinks