Blogger

URL: http://www.blogger.com/
Project status: Online!
Archiving status: In progress...
Project source: blogger-discovery
Project tracker: bloggerdisco
IRC channel: #frogger

Blogger is a blog hosting service. On February 23, 2015, it announced that "sexually explicit" blogs would be restricted from public access in a month. It soon withdrew the plan and said it would not change its existing policies.[1] By then, however, we had already decided to download everything.

Strategy

Find as many http://foobar.blogspot.com domains as possible and download them. Blogs often link to other blogs, so each blog saved helps discover others. A small-scale crawl of Blogger profiles (e.g. http://www.blogger.com/profile/{random number up to 35217655}) also provides links to the blogs authored by each user (e.g. https://www.blogger.com/profile/5618947 links to http://hintergedanke.blogspot.com/). Note that this does not cover ALL bloggers or ALL blogs; it is merely a starting point for further discovery.
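The discovery idea above can be sketched in shell. This is only an illustration, not the blogger-discovery code: the function names extract_blogspot_hosts and random_profile_url are made up here, the sketch assumes GNU grep and shuf are available, and it only matches plain .blogspot.com links.

```shell
#!/bin/bash
# Hypothetical helpers illustrating the discovery strategy; not part of
# the official blogger-discovery scripts.

# Print the unique *.blogspot.com hostnames found in HTML read from stdin.
extract_blogspot_hosts() {
    grep -oE 'https?://[A-Za-z0-9-]+\.blogspot\.com' \
        | sed -E 's#^https?://##' \
        | tr 'A-Z' 'a-z' \
        | sort -u
}

# Print a profile URL with a random ID between 1 and the given upper bound
# (default: the 35217655 figure quoted above). Requires GNU shuf.
random_profile_url() {
    local max="${1:-35217655}"
    echo "http://www.blogger.com/profile/$(shuf -i "1-${max}" -n 1)"
}
```

For example, wget -q -O - "$(random_profile_url)" | extract_blogspot_hosts would list the blogs linked from one randomly chosen profile page, assuming that profile exists and is public.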

Country Redirect

Accessing http://whatever.blogspot.com will usually redirect to a country-specific subdomain depending on your IP address (e.g. whatever.blogspot.co.uk, whatever.blogspot.in, etc.), which in some cases may be censored or edited to meet local laws and standards. This can be bypassed by requesting http://whatever.blogspot.com/ncr as the root URL.[2] [3]
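As a small sketch of the /ncr trick, the helper below rewrites any Blogspot address to its "no country redirect" root URL. The function name ncr_url is hypothetical, and the sketch assumes the blog name is the same on the country-specific domain and on blogspot.com.

```shell
#!/bin/bash
# Hypothetical helper: turn any Blogspot URL or hostname into the
# country-neutral root URL described above.
ncr_url() {
    local host name
    # Strip the scheme (if any) and everything after the hostname.
    host="${1#*://}"
    host="${host%%/*}"
    # Keep the blog name (the part before ".blogspot.") and force the .com domain.
    name="${host%%.blogspot.*}"
    echo "http://${name}.blogspot.com/ncr"
}
```

For instance, ncr_url http://whatever.blogspot.co.uk/2015/01/post.html prints http://whatever.blogspot.com/ncr, which can then be used as the starting URL of a crawl.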

Downloading a single blog with Wget

These Wget parameters can download a BlogSpot blog, including comments and any on-site dependencies. They should also reject redundant pages such as the /search/ directory and multiple occurrences of the same page with different query strings. They have only been tested on blogs using a Blogger subdomain (e.g. http://foobar.blogspot.com), not custom domains (e.g. http://foobar.com). Both instances of [URL] should be replaced with the same URL. A simple Perl wrapper is available here.

wget --recursive --level=2 --no-clobber --no-parent --page-requisites --continue --convert-links --user-agent="" -e robots=off --reject "*\\?*,*@*" --exclude-directories="/search,/feeds" --referer="[URL]" --wait 1 [URL]

UPDATE:

Use this improved bash script instead; it bypasses the adult content confirmation. BLOGURL should be in http://someblog.blogspot.com format.

#!/bin/bash
blogspoturl="BLOGURL"
# Follow the guestAuth link from the confirmation page and keep the resulting session cookies.
wget -O - "blogger.com/blogin.g?blogspotURL=$blogspoturl" | grep guestAuth | cut -d'"' -f 4 | wget -i - --save-cookies cookies.txt --keep-session-cookies
# Crawl the blog with those cookies, using the same options as the single-blog command above.
wget --load-cookies cookies.txt --recursive --level=2 --no-clobber --no-parent --page-requisites --continue --convert-links --user-agent="" -e robots=off --reject "*\\?*,*@*" --exclude-directories="/search,/feeds" --referer="$blogspoturl" --wait 1 "$blogspoturl"

Export XML trick

Append this to a blog URL to download the most recent posts as an Atom feed, up to 499 posts (that is the limit): /atom.xml?redirect=false&max-results=
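A minimal sketch of building such an export URL follows. The function name atom_export_url is made up for illustration, and passing the post count as the value of max-results is an assumption based on the 499-post limit mentioned above.

```shell
#!/bin/bash
# Hypothetical helper: build the Atom export URL for a blog.
# Assumption: max-results takes the number of posts, capped at 499.
atom_export_url() {
    local blog count
    blog="${1%/}"          # drop a trailing slash, if any
    count="${2:-499}"      # default to the stated limit
    if [ "$count" -gt 499 ]; then
        count=499
    fi
    echo "${blog}/atom.xml?redirect=false&max-results=${count}"
}
```

The result could then be fetched with something like wget -O posts.xml "$(atom_export_url http://foobar.blogspot.com)".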

How can I help?

Running the Warrior

Start up the Warrior and select the Blogger Discovery project. Do not increase the concurrency above the default of 2: Google limits requests aggressively, and you will get blocked for about 45 minutes (maybe less). If you see "503 Service Unavailable" messages, decrease the concurrency to 1.

Running the script manually

See details here: http://github.com/ArchiveTeam/blogger-discovery

Do not increase the concurrency above 2: Google limits requests aggressively, and you will get blocked for about 45 minutes (maybe less). If you see "503 Service Unavailable" messages, decrease the concurrency to 1.

Speeding things up

Disclaimer

The following method is not ArchiveTeam's official recommendation. You are solely responsible for any consequences of using this abusive method.

Once you solve Google's captcha and use the resulting cookie, the request limit does not apply for three hours; during that time one can hammer Blogger as intensely as they like.

You can find a modified discover.py script here. It uses a cookies file, sleeps for a shorter time when it encounters a captcha (so that you don't need to wait 45 minutes if you have the cookie), and does not sleep between requests at all. Replace the original discover.py script with it.

You also have to do the following every three hours:

  • When you see the script bump into a captcha, go to http://blogger.com/ and solve it
  • Export your cookies with some tool, e.g. for Firefox there is this extension. Save the file as cookies.txt into the folder where discover.py resides.

If you want to renew the cookie before the three hours expire, find and delete the cookie named GOOGLE_ABUSE_EXEMPTION in your browser, and then repeat the steps above. Note: changing the cookie's expiry date has no effect.
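Before starting the script, it may help to confirm that the exported file actually contains the exemption cookie. The sketch below assumes cookies.txt is in the tab-separated Netscape format that wget and most export extensions use, where the sixth field is the cookie name; the function name has_exemption_cookie is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the given cookie file (default: cookies.txt)
# contains a cookie named GOOGLE_ABUSE_EXEMPTION.
# Assumes the Netscape cookie file format: tab-separated, name in field 6.
has_exemption_cookie() {
    awk -F '\t' '
        $6 == "GOOGLE_ABUSE_EXEMPTION" { found = 1 }
        END { exit !found }
    ' "${1:-cookies.txt}"
}
```

A possible use is has_exemption_cookie cookies.txt || echo "no exemption cookie; solve the captcha and export again". Note that this only checks that the cookie is present; it cannot tell whether the three hours have already run out.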

DO NOT leave this script unattended past the cookie's expiry: solve the captcha again right after the cookie expires at the latest, otherwise items will be garbaged continuously. If you have to leave the script unattended, schedule its stop by issuing the sleep 10400; touch STOP command in its folder right when you renew the cookie. (This stops the script after 10400 seconds, a little under 3 hours; before restarting the script, issue rm STOP.)
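The scheduled stop can also be wrapped in a small function that runs in the background, so the terminal stays usable. This is only a sketch: the name schedule_stop is made up, the 10400-second default is taken from the command above, and a background job started this way may still be killed if the terminal is closed.

```shell
#!/bin/bash
# Hypothetical helper: create the STOP file in the given folder (default:
# the current one) after a delay in seconds (default: 10400).
schedule_stop() {
    local delay dir
    delay="${1:-10400}"
    dir="${2:-.}"
    # Run in a background subshell so the caller is not blocked.
    ( sleep "$delay"; touch "$dir/STOP" ) &
}
```

Run schedule_stop from the folder where discover.py resides right after renewing the cookie, and issue rm STOP before starting the script again.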

References

  1. https://support.google.com/blogger/answer/6170671?p=policy_update&hl=en&rd=1
  2. https://support.google.com/blogger/answer/2402711?hl=en
  3. http://www.bbc.co.uk/news/technology-16852920
