URLTeam/History

From Archiveteam

This is a history of the URLTeam, written using wiki history, IRC chatlogs, git commit logs and news articles.

History

Beginning: 2009/01 - 2009/08

On January 21st, 2009 the URLTeam wiki page was created by swebb.[1] He started crawling tinyurl.com and ff.im (Friendfeed) and at the end of April had already backed up a couple million URLs.

fetcher.pl: 2009/08 - 2010/08

A second scraper was created by chronomex in August 2009 in Perl.[2] The output format pioneered by his script is still used by the URLTeam for releases. He used the scraper to save various smaller shorteners such as 4url.cc and surl.ws, but also bigger shorteners like is.gd. Back then is.gd was still using sequential shortcodes and had no rate limiting.

At the same time, on August 12th, 2009 the domain urlte.am was registered by SketchCow.[3] The website only displayed the logo and the tag line "url shortening was a fucking awful idea".[4]

tinyback: 2010/08 - 2011/01

Almost exactly a year after chronomex created his Perl-based scraper, soultcer decided to write a scraper in Ruby.[5] Development was slow and sporadic due to his day job, with the first runnable but buggy version ready on October 26th, 2010. Tinyback originally used MessagePack as its output format and only supported tinyurl.com. Support for is.gd, bit.ly and tr.im was added in the following months, along with many other improvements and a switch to the same output format as chronomex' fetcher.pl.

One goal of tinyback was to handle special cases and bugs in many URL shorteners, like tinyurl.com's "TinyURL redirects to a TinyURL" error page or bit.ly's STOP page. On the one hand this led to more complete backups of a URL shortener, but on the other hand it made adding support for new shorteners more difficult.

Also on October 26th, 2010, when tinyback development was only getting started, soultcer asked SketchCow about putting up the scraped data for download on the urlte.am website. SketchCow approved, but it would still be a long time until the first release.

File handling and shortcode assignment were done manually back then, with the resulting output files being copied around using scp before being sorted and merged with some rather slow C tools. Following the creation of tinyback, both chronomex and soultcer did lots of scraping using their respective tools. To get around IP bans, cheap low-end VPSes were used for scraping.

First release: 2011/01 - 2011/06

is.gd changes

On January 12th, 2011 is.gd migrated to a new architecture[6]. Shortcodes changed from sequential to random and a very strict 60 requests per minute limit was added. This made scraping more difficult, but luckily chronomex had fetched almost all the sequential shortcodes before the switch was made.

Release preparation

In January 2011 soultcer started an effort to combine the scraped data from chronomex and himself, which he finished in February. On February 9th, 2011 soultcer made the first commit to the urlteam-stuff repository, which holds the urlte.am website.

Also in 2011 Jeroenz0r and underscor joined the URLTeam and helped with scraping and various other stuff. In March 2011 swebb also discovered the IRC channel and uploaded his scraped data for soultcer to merge. His scrape of tr.im was especially useful, but more on that later.

301works cooperation

In March 2011 SketchCow arranged for Jeff Kaplan from 301Works.org to give soultcer an upload slot to the 301works collection on archive.org. From March 15th to July 6th soultcer uploaded all the data he had scraped from bit.ly so far, in a CSV-based format, to the 301utm collection. The list of files and the codes they contain is also stored in the urlteam-stuff git repository. Unfortunately no further updates to the data have been made since then.

Release

In April 2011 soultcer stated on IRC that he wanted to create the release torrent, but he had trouble merging swebb's data from tr.im with his own scrapes. As it turned out, tr.im was very broken and returned some bad data, which is why the URLTeam settled on only putting parts of the tr.im scrape in the torrent. After the contents were finalized in May 2011, underscor provided a server to upload the 40 GB of compressed data files that had been collected. The upload finished on May 31st, and underscor created a torrent from it, marking our very first release. It took another couple of days to update the homepage, but after over 2 years of scraping, the first results were finally available for download.

Second release: 2011/06 - 2012/01

After the first release, most people were rather busy with other stuff, so when the self-imposed deadline of December 2011 approached, the only new files were a couple gigabytes of scraped data from tinyurl.com and the merged data from tr.im (see below). The release was created on the last day of 2011.

tr.im (a short digression)

In August 2009, when popular shortener bit.ly became the default shortener for Twitter, Eric Woodward, owner of the not quite so popular shortener tr.im, was rather butthurt and decided to shut down tr.im out of spite.[7] This caused a massive uproar, because it made people realize that once tr.im shut down, millions of URLs would just stop working, or even worse, redirect to some spam site. This not only affected tr.im, but undermined the claim to legitimacy of every other URL shortener as well. bit.ly offered to continue hosting tr.im, but Eric Woodward was having none of it.[8][9] In the end he reopened tr.im and a few days later announced it would live on as an open source project.[10][11] While he did release the source, nothing ever came of his "community-owned" URL shortener idea.[12] Shortening of new URLs was disabled and redirecting barely worked, breaking when too many requests (more than about 5 per minute) were made. In March 2011 soultcer removed the Trim class from tinyback, effectively ceasing any further backup efforts.[5]

Since the first release did not include the full scraped data from tr.im, soultcer decided to do it right for the second release. In May 2011 he tried to merge the scrapes that he and swebb had done, which turned out to be a complicated process. Using the source code released by Woodward, he was able to understand some of the weird quirks that tr.im had: Short URL codes could either be autogenerated or custom codes. Autogenerated ones were case-sensitive, custom ones were not. If a new autogenerated code was the same as a custom code, it might overwrite the custom code. Also, URLs would be randomly truncated for no understandable reason.[12] With that (half-)knowledge, he pieced together a final backup of the tr.im shortener, which was included in the second release.
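The code rules described above can be illustrated with a small sketch. It is not taken from the tr.im source or from soultcer's merge tools; the class and method names are hypothetical, and it assumes that an autogenerated code always replaces a colliding custom code (the text only says it "might"), with collisions compared case-insensitively.

```python
class TrimLikeStore:
    """Toy model of the tr.im code rules (hypothetical, for illustration only)."""

    def __init__(self):
        self.auto = {}    # autogenerated codes, stored as-is (case-sensitive)
        self.custom = {}  # custom codes, keyed by lowercase (case-insensitive)

    def add_custom(self, code, url):
        # Custom codes ignore case, so normalize the key.
        self.custom[code.lower()] = url

    def add_auto(self, code, url):
        self.auto[code] = url
        # Assumption: a colliding custom code is always overwritten.
        self.custom.pop(code.lower(), None)

    def resolve(self, code):
        # Exact (case-sensitive) match on autogenerated codes first,
        # then a case-insensitive match on custom codes.
        if code in self.auto:
            return self.auto[code]
        return self.custom.get(code.lower())


store = TrimLikeStore()
store.add_custom("News", "http://example.com/a")
store.add_auto("aB3", "http://example.com/b")
```

Under this model a custom code such as "News" resolves regardless of case, while "aB3" resolves only with its exact capitalization, which is one reason why merging scrapes taken at different times is error-prone.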

A new approach: 2012/01 - 2013/01

After the second release, work on the URLTeam slowed down once more. Soultcer became unhappy with his Ruby-based scraper and rewrote it in Python 3, which was not very widely used at the time and made character encoding handling more difficult. In August 2012 he rewrote tinyback again, based on his Python 3 version, but this time for Python 2. To distinguish the new version from the old Ruby tinyback, it was called tinyback v2.

Tinyarchive database and tracker

The process of creating the release was rather cumbersome and error-prone: URL shorteners and ranges were assigned manually to scraping hosts, using a text file for coordination. The results were tracked with git-annex, and then merged using unintuitive command-line tools. With the release of tinyback v2, soultcer also created a small tracker. It was written in September 2012, also using Python 2, and sqlite3 as database backend. The tracker was responsible for handing out tasks to tinyback instances using a simple HTTP API, making sure that only one task for each URL shortener was handed out per IP address, to avoid IP blocks. The results were then uploaded back to the tracker.
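The assignment rule described above (at most one task per URL shortener per IP address) can be sketched as follows. This is a minimal in-memory illustration, not the actual tracker code: the real tracker used an HTTP API and sqlite3, and the class, method and field names here are hypothetical.

```python
class Tracker:
    """Hands out scraping tasks, one per shortener per IP (illustrative sketch)."""

    def __init__(self, tasks):
        # Each task is a hypothetical (shortener, first_code, last_code) tuple.
        self.pending = list(tasks)
        self.active = {}    # (ip, shortener) -> task currently assigned
        self.finished = {}  # task -> uploaded results

    def get_task(self, ip):
        # Return the first pending task whose shortener this IP
        # is not already scraping, or None if there is no such task.
        for index, task in enumerate(self.pending):
            shortener = task[0]
            if (ip, shortener) not in self.active:
                self.active[(ip, shortener)] = task
                del self.pending[index]
                return task
        return None

    def put_result(self, ip, task, results):
        # Accept results only for a task that was assigned to this IP.
        key = (ip, task[0])
        if self.active.get(key) != task:
            raise ValueError("task was not assigned to this IP")
        del self.active[key]
        self.finished[task] = results


tracker = Tracker([
    ("tinyurl", "aaaa", "aazz"),
    ("tinyurl", "abaa", "abzz"),
    ("isgd", "aaaa", "aazz"),
])
first = tracker.get_task("192.0.2.1")   # a tinyurl task
second = tracker.get_task("192.0.2.1")  # skips the second tinyurl task
third = tracker.get_task("192.0.2.1")   # nothing left for this IP
other = tracker.get_task("192.0.2.2")   # a different IP may take tinyurl
```

Keying the active assignments by (IP, shortener) means a single host never sends parallel requests to the same shortener, which is the property the text credits with avoiding IP blocks.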

Previously all data was stored in sorted and unsorted text files, often compressed to save space. Using the name tinyarchive, soultcer created some tools to manage all scraped URLs in a database instead, using Python 2 and BerkeleyDB. The resulting database was from then on used as the canonical method for storing URL shortener backups, with releases being generated directly from the database.

Third release

The planned release cycle of 6 months put the next release in June 2012, but work on the new tinyback and tinyarchive only started in August 2012. Almost no scraping was done beforehand, so we needed some time to scrape new data for the third release. The new tools sped up scraping, especially since the automatic task assignment lowered the barrier to entry. The tinyback project was also added to the ArchiveTeam Warrior.

Preparation for the next release started mid-December and the release was created on January 1st, 2013.

Fourth release: 2013/01 - now

The new tinyback code made scraping easier, and as a result more people started doing so, even outside the Warrior. The code also received patches by alard, chfoo, ersi and jopie91 that fixed bugs, improved log output, and added support for new URL shorteners.

tr.im relaunches

In December 2012 soultcer discovered a posting on a job board looking for a programmer to work on a relaunch of tr.im. The domain name had been bought by a domain name "investment company" (= domain squatter), but it was unclear if they had also acquired the database. On January 30th, 2013, tr.im was back online, and it turned out that the database had not been part of the deal. Shortcode generation was sequential, but with weird gaps in between, probably in an effort to leave many codes unused. Unused codes would automatically redirect to advertisements after a couple of seconds. To keep at least some of the original tr.im links alive, soultcer started submitting the data he had from the old tr.im into the new tr.im. It was not a perfect solution, but at least a part of the links were preserved that way.

Fourth release

On May 16th, 2013 soultcer announced that he would be stepping down from his role in the URLTeam after the following release. Chronomex and ersi volunteered to take over the duties of running the tracker, maintaining the tinyarchive database and creating the releases. The fourth release was published on July 20th, 2013.

Sources