Revision as of 10:53, 10 July 2016
Blogger
URL: http://www.blogger.com/
Status: Online!
Archiving status: Not saved yet
Archiving type: Unknown
Project source: blogger-discovery
Project tracker: bloggerdisco
IRC channel: #frogger (on hackint)
Blogger is a blog hosting service. On February 23, 2015, it announced that "sexually explicit" blogs would be restricted from public access within a month. It soon withdrew the plan, however, and said it would not change its existing policies.[1]
ArchiveTeam ran a discovery project between February and May 2015, but the actual content has not been downloaded yet.
Strategy
Find as many http://foobar.blogspot.com domains as possible and download them. Blogs often link to other blogs, so each individual blog saved helps discover others. A small-scale crawl of Blogger profiles (e.g. http://www.blogger.com/profile/{random number up to 35217655}) will also provide links to the blogs each user authors (e.g. https://www.blogger.com/profile/5618947 links to http://hintergedanke.blogspot.com/). Note that this does not cover all Blogger users or all blogs; it is merely a starting point for further discovery.
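As a rough sketch of the profile-crawl step, the snippet below extracts *.blogspot.com links from a profile page's HTML. The profile URL format comes from the examples above; the helper name and extraction regex are illustrative assumptions, not part of any official tooling.

```shell
#!/bin/sh
# extract_blogs: read Blogger profile HTML on stdin and print the unique
# *.blogspot.com blog URLs it links to. The regex is an assumption for
# illustration and may miss custom-domain blogs.
extract_blogs() {
  grep -oE 'https?://[a-z0-9-]+\.blogspot\.com' | sort -u
}

# Usage (requires network; profile URL format is from the wiki text):
#   curl -s "https://www.blogger.com/profile/5618947" | extract_blogs
```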
Country Redirect
Accessing http://whatever.blogspot.com will usually redirect to a country-specific domain depending on your IP address (e.g. whatever.blogspot.co.uk, whatever.blogspot.in, etc.), which in some cases may be censored or edited to meet local laws and standards. This can be bypassed by requesting http://whatever.blogspot.com/ncr as the root URL.[2][3]
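A minimal sketch of building the no-country-redirect form of a blog URL; the /ncr suffix is from the text above, and the variable names are illustrative:

```shell
#!/bin/sh
# Build the "no country redirect" URL for a blog by appending /ncr to the
# root, so the request is served from the .com domain.
blog="http://whatever.blogspot.com"
ncr="${blog%/}/ncr"    # strip any trailing slash, then append /ncr
echo "$ncr"            # http://whatever.blogspot.com/ncr

# To fetch it (requires network):
#   wget "$ncr"
```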
Downloading a single blog with Wget
These Wget parameters can download a BlogSpot blog, including comments and any on-site dependencies. They should also reject redundant pages such as the /search/ directory and multiple occurrences of the same page with different query strings. They have only been tested on blogs using a Blogger subdomain (e.g. http://foobar.blogspot.com), not custom domains (e.g. http://foobar.com). Both instances of [URL] should be replaced with the same URL. A simple Perl wrapper is available here.
wget --recursive --level=2 --no-clobber --no-parent --page-requisites \
  --continue --convert-links --user-agent="" -e robots=off \
  --reject "*\\?*,*@*" --exclude-directories="/search,/feeds" \
  --referer="[URL]" --wait 1 [URL]
UPDATE:
Use this improved bash script instead, in order to bypass the adult-content confirmation. BLOGURL should be in http://someblog.blogspot.com format.
#!/bin/bash
blogspoturl="BLOGURL"
wget -O - "blogger.com/blogin.g?blogspotURL=$blogspoturl" \
  | grep guestAuth | cut -d'"' -f 4 \
  | wget -i - --save-cookies cookies.txt --keep-session-cookies
wget --load-cookies cookies.txt --recursive --level=2 --no-clobber --no-parent \
  --page-requisites --continue --convert-links --user-agent="" -e robots=off \
  --reject "*\\?*,*@*" --exclude-directories="/search,/feeds" \
  --referer="$blogspoturl" --wait 1 "$blogspoturl"
Export XML trick
Append /atom.xml?redirect=false&max-results=499 to a blog URL to download its most recent 499 posts (that is the limit).
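A sketch of the feed-download step, assuming the 499-post limit mentioned above; the blog URL and output filename are placeholders:

```shell
#!/bin/sh
# Build the Atom feed URL that returns up to the 499 most recent posts
# (499 is the per-request limit stated in the wiki text).
blog="http://foobar.blogspot.com"
feed="${blog}/atom.xml?redirect=false&max-results=499"
echo "$feed"

# To save the feed (requires network):
#   wget -O posts.xml "$feed"
```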