{{Infobox project
| title = Blogger
| logo = Blogger-logo.png
| image = Blogger- Crea tu blog gratuito 1303511108785.png
| description = 
| URL = http://www.blogger.com/
| project_status = {{online}}
| archiving_status = {{notsavedyet}}
| source = [https://github.com/ArchiveTeam/blogger-discovery blogger-discovery]
| tracker = [http://tracker.archiveteam.org/bloggerdisco/ bloggerdisco]
| irc = frogger
}}


'''Blogger''' is a blog hosting service. On February 23, 2015, it announced that "sexually explicit" blogs would be restricted from public access within a month, but it soon withdrew the plan and said its existing policies would not change.<ref>https://support.google.com/blogger/answer/6170671?p=policy_update&hl=en&rd=1</ref> Apart from the removal of [[Google+]] comments, there has not been an update on the [https://blogger.googleblog.com/ official blog] in around a year, and Google has also moved away from Blogger for its own company blogs. For these reasons, Blogger is at risk of shutting down.
 
'''ArchiveTeam ran a discovery project between February and May 2015, but the actual content has not been downloaded yet.'''


== Strategy ==


Find as many http://foobar.blogspot.com domains as possible and download them. [https://archive.org/details/all_blogger_subdomains Here is a full list] (as of February 2020), and a new list can be generated with [[User:Trumad|these instructions]]; otherwise, manual discovery can be attempted. Blogs often link to other blogs, so each individual blog saved will help discover others. A small-scale crawl of Blogger profiles (e.g. <nowiki>http://www.blogger.com/profile/{random number up to 35217655}</nowiki>) will also provide links to blogs authored by each user (e.g. https://www.blogger.com/profile/5618947 links to http://hintergedanke.blogspot.com/). Note that this does not cover all bloggers or all blogs; it is merely a starting point for further discovery.
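For example, the profile crawl can be sketched as a small shell loop (a rough, untested sketch only: the output file name is a placeholder and the regular expression will not catch every way a blog can be linked):
<pre style="white-space: pre-wrap">
#!/bin/bash
# Rough discovery sketch: fetch a few randomly chosen Blogger profile pages
# and collect any *.blogspot.com hostnames they link to.
for i in $(seq 1 10); do
  profile_id=$(( (RANDOM * 32768 + RANDOM) % 35217655 + 1 ))
  wget -q -O - "https://www.blogger.com/profile/$profile_id" \
    | grep -oE 'https?://[A-Za-z0-9-]+\.blogspot\.com' \
    >> discovered-blogs.txt
  sleep 1
done
sort -u discovered-blogs.txt -o discovered-blogs.txt
</pre>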


== Country Redirect ==


Accessing http://whatever.blogspot.com will usually redirect to a country-specific subdomain depending on your IP address (e.g. whatever.blogspot.co.uk, whatever.blogspot.in, etc.), which in some cases may be censored or edited to meet local laws and standards. This can be bypassed by requesting http://whatever.blogspot.com/ncr ("no country redirect") as the root URL.<ref>https://support.google.com/blogger/answer/2402711?hl=en</ref> <ref>http://www.bbc.co.uk/news/technology-16852920</ref>
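The difference can be checked from the command line (a minimal sketch; <code>whatever.blogspot.com</code> is a placeholder, and the exact redirect target depends on where you connect from):
<pre style="white-space: pre-wrap">
# Print the Location header for the plain URL (usually a country-specific
# domain) and for the /ncr form (which should not point at a country domain).
wget --max-redirect=0 -S -O /dev/null "http://whatever.blogspot.com/" 2>&1 | grep -i "location:"
wget --max-redirect=0 -S -O /dev/null "http://whatever.blogspot.com/ncr" 2>&1 | grep -i "location:"
</pre>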


== Downloading a single blog with Wget ==


These Wget parameters can download a BlogSpot blog, including comments and any on-site dependencies. They should also reject redundant pages, such as the /search/ directory and repeated copies of the same page under different query strings. This has only been tested on blogs using a Blogger subdomain (e.g. http://foobar.blogspot.com), not custom domains (e.g. http://foobar.com). Both instances of [URL] should be replaced with the same URL. A simple Perl wrapper is available here.

<tt>wget --recursive --level=2 --no-clobber --no-parent --page-requisites --continue --convert-links --user-agent="" -e robots=off --reject "*\\?*,*@*" --exclude-directories="/search,/feeds" --referer="[URL]" --wait 1 [URL]</tt>
'''UPDATE''':
Use this improved bash script instead, in order to bypass the adult content confirmation. BLOGURL should be in <code><nowiki>http://someblog.blogspot.com</nowiki></code> format.
<pre style="white-space: pre-wrap">
#!/bin/bash
# Base URL of the blog to download, e.g. http://someblog.blogspot.com
blogspoturl="BLOGURL"
# Fetch Blogger's interstitial page for the blog, pull the guestAuth URL out
# of it and follow that URL once, saving the session cookies it sets (this is
# what skips the adult content confirmation).
wget -O - "blogger.com/blogin.g?blogspotURL=$blogspoturl" | grep guestAuth | cut -d'"' -f 4 | wget -i - --save-cookies cookies.txt --keep-session-cookies
# Mirror the blog itself, reusing the saved cookies so the confirmation page is not shown again.
wget --load-cookies cookies.txt --recursive --level=2 --no-clobber --no-parent --page-requisites --continue --convert-links --user-agent="" -e robots=off --reject "*\\?*,*@*" --exclude-directories="/search,/feeds" --referer="$blogspoturl" --wait 1 $blogspoturl
</pre>
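To run it (a hypothetical example; the script file names and the blog URL below are placeholders), write the blog's base URL into a copy of the script and execute it:
<pre style="white-space: pre-wrap">
# Substitute the real blog URL for BLOGURL, then run the resulting script.
sed 's|BLOGURL|http://someblog.blogspot.com|' grab-blog.sh > grab-someblog.sh
bash grab-someblog.sh
</pre>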


== Export XML trick ==
Add this to a blog URL and it will download the most recent 499 posts (that is the limit): <code><nowiki>/atom.xml?redirect=false&max-results=499</nowiki></code>
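For example (a one-line sketch; the blog name and output file are placeholders):
<pre style="white-space: pre-wrap">
# Save the newest 499 posts of a blog as a single Atom feed.
wget -O someblog-posts.atom "http://someblog.blogspot.com/atom.xml?redirect=false&max-results=499"
</pre>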
 
== Your own blogs ==
 
Download them at https://takeout.google.com/settings/takeout
 
We have not tested whether the exported data is suitable for importing into other software such as WordPress.


== External links ==
* {{url|1=http://www.blogger.com/|2=Blogger}}
== References ==
<references/>


{{Navigation box}}
[[Category:Google]]
[[Category:Blogging]]
