Blogger

From Archiveteam
URL: https://www.blogger.com/
Status: Endangered
Archiving status: In progress...
Archiving type: DPoS
Project source: blogger-grab (2023), blogger-discovery (2015)
Project tracker: blogger (2023), bloggerdisco (2015)
IRC channel: #frogger (on hackint)
Data: archiveteam_blogger

Blogger is a blog hosting service. On February 23, 2015, they announced that "sexually explicit" blogs would be restricted from public access in a month. They soon withdrew the plan and said they would not change their existing policies.[1] In 2019, Google removed Google+ comments placed on Blogger blogs. In May 2020, Blogger announced a redesign on their official blog, and posted weekly updates on their community support forum in August and September 2020. This redesign covered most of the Blogger experience, but pages related to user profiles, following blogs, cookie management, video management, and classic-style blog widget management still use the old Blogger design. No updates to Blogger have been documented since September 2020. Google has also moved away from Blogger for their own company blogs. For these reasons, Blogger may be at risk of shutting down.

In May 2023, Google announced that inactive accounts would be deleted starting on 2023-12-01 across their platform, including Blogger blogs.

Archive Team ran a discovery project between February and May 2015, but did not begin downloading actual content until November 2023.

Strategy

Find as many http://foobar.blogspot.com domains as possible and download them. Here is a full list (as of February 2020) and a new list can be generated with these instructions. Otherwise, manual discovery can be attempted. Blogs often link to other blogs, so each individual blog saved will help discover others. A small-scale crawl of Blogger profiles (e.g. http://www.blogger.com/profile/{random number up to 35217655}) will also provide links to blogs authored by each user (e.g. https://www.blogger.com/profile/5618947 links to http://hintergedanke.blogspot.com/). Note that this does not cover all bloggers or all blogs, and is merely a starting point for further discovery.

Another strategy is to scrape Blogspot sites from Blogger profiles and vice versa (example script) and you will get an almost ever-expanding list. There will be captchas to deal with though, so you may need to distribute the scraping. There is a list of blogs discovered using this method.
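The profile-to-blog step can be sketched as a small shell filter. This is only an illustration: extract_blogspot_hosts is a hypothetical helper name, and it assumes blog links appear in the profile HTML as plain http(s)://NAME.blogspot.com URLs.

```shell
#!/bin/sh
# Hypothetical helper: read HTML on stdin and print the unique
# blogspot.com hostnames it links to, lowercased, one per line.
extract_blogspot_hosts() {
  grep -oiE 'https?://[a-z0-9-]+\.blogspot\.com' \
    | sed -E 's#^[a-zA-Z]+://##' \
    | tr '[:upper:]' '[:lower:]' \
    | sort -u
}

# Example usage (needs network access; the profile ID is the one cited above):
# wget -q -O - "https://www.blogger.com/profile/5618947" | extract_blogspot_hosts
```

Each hostname found this way can be fed back in as a new blog to scan for profile links, which is what makes the list keep growing. Keep a delay between requests, since captchas are likely.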

How to help if you have lists of URLs

For other ArchiveTeam projects that can use this kind of help, see Projects requiring URL lists.

This project requires lists of URLs for content on the target website. If you have a source of URLs, please:

  1. Use the PCRE regular expression (\S+\.blogspot|\S*blogger)\.\S+ for filtering.
    • Note that this regex is intentionally broad to cover many different URL formats. Please do not try to use a more narrow pattern, as it may miss valid URLs. We can always filter or transform the results as needed later.
    • Enable case-insensitive matching (e.g. grep's -i) to catch URLs with capitalization.
    • If using grep or similar, enable text matching (-a or --text) to catch URLs in files with apparent binary data.
    • Example command (GNU grep): grep -Pahoi '(\S+\.blogspot|\S*blogger)\.\S+' FILENAME FILENAME...
  2. If the output exceeds a few megabytes, compress it, preferably using zstd -10.
  3. Give the file a descriptive name and upload it to https://transfer.archivete.am/.
  4. Share the resulting URL in the project IRC channel.
    • If you wish your list to remain private, please get in touch with a channel op (e.g. arkiver or JustAnotherArchivist). Items generated from your list will still be processed publicly, but they will be mixed in with all other items and channel logs will not associate them with you.
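Steps 1 and 2 above can be combined into a short script. This is a sketch rather than an official tool: the function names are made up for illustration, the 5 MB threshold is one assumed reading of "a few megabytes", and it needs GNU grep with PCRE support plus zstd.

```shell
#!/bin/sh
# Sketch only: filter_blogger_urls and compress_if_large are hypothetical names.

# Print every unique match of the project regex found in the given files.
filter_blogger_urls() {
  grep -Pahoi '(\S+\.blogspot|\S*blogger)\.\S+' "$@" | LC_ALL=C sort -u
}

# Compress a list with zstd -10 once it exceeds a size threshold.
# The 5 MB default is an assumption, not a project rule.
compress_if_large() {
  file="$1"
  limit="${2:-5000000}"
  size="$(wc -c < "$file" | tr -d ' ')"
  if [ "$size" -gt "$limit" ]; then
    zstd -10 -q "$file"   # writes "$file.zst" and keeps the original
  fi
}

# Example usage (file names are placeholders):
# filter_blogger_urls crawl1.log crawl2.warc > blogger-urls.txt
# compress_if_large blogger-urls.txt
```

Deduplicating with sort -u is optional, but it keeps the upload smaller when the same URL appears many times in the source files.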

Country Redirect

Accessing http://whatever.blogspot.com used to redirect to a country-specific domain depending on your IP address (e.g. whatever.blogspot.co.uk, whatever.blogspot.in, etc.), which in some cases may have been censored or edited to meet local laws and standards. This could be bypassed by requesting http://whatever.blogspot.com/ncr as the root URL.[2][3] As of May 2018, all international Blogger domains redirect to blogspot.com.
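Older URL lists may still contain country-specific hosts, so it can help to fold them back onto blogspot.com before deduplicating. The sketch below is an assumption-laden illustration: both function names are hypothetical, and the pattern only covers two-letter country codes with an optional "co." or "com." prefix, which may not match every domain Blogger used.

```shell
#!/bin/sh
# Hypothetical helper: map a country-specific Blogspot host to blogspot.com.
# Assumes country domains look like blogspot.XX, blogspot.co.XX or blogspot.com.XX.
normalize_blogspot_host() {
  printf '%s\n' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/\.blogspot\.(co\.|com\.)?[a-z]{2}$/.blogspot.com/'
}

# Hypothetical helper: build the "no country redirect" root URL for a host.
ncr_url() {
  printf 'http://%s/ncr\n' "$1"
}

# Example usage:
# normalize_blogspot_host whatever.blogspot.co.uk
# ncr_url whatever.blogspot.com
```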

Downloading a single blog with Wget

These Wget parameters can download a BlogSpot blog, including comments and any on-site dependencies. They should also reject redundant pages such as the /search/ directory and multiple occurrences of the same page with different query strings. They have only been tested on blogs using a Blogger subdomain (e.g. http://foobar.blogspot.com), not custom domains (e.g. http://foobar.com). Both instances of [URL] should be replaced with the same URL. A simple Perl wrapper is available here.

wget --recursive --level=2 --no-clobber --no-parent --page-requisites --continue --convert-links --user-agent="" -e robots=off --reject "*\\?*,*@*" --exclude-directories="/search,/feeds" --referer="[URL]" --wait 1 [URL]

UPDATE:

Use this improved bash script instead, in order to bypass the adult content confirmation. BLOGURL should be in http://someblog.blogspot.com format.

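#!/bin/bash
# Replace BLOGURL with the blog address, e.g. http://someblog.blogspot.com
blogspoturl="BLOGURL"
# Fetch the interstitial page, extract the guestAuth link, and request it to store the confirmation cookies.
wget -O - "blogger.com/blogin.g?blogspotURL=$blogspoturl" | grep guestAuth | cut -d'"' -f 4 | wget -i - --save-cookies cookies.txt --keep-session-cookies
# Mirror the blog with the saved cookies, using the same options as the plain Wget command.
wget --load-cookies cookies.txt --recursive --level=2 --no-clobber --no-parent --page-requisites --continue --convert-links --user-agent="" -e robots=off --reject "*\\?*,*@*" --exclude-directories="/search,/feeds" --referer="$blogspoturl" --wait 1 "$blogspoturl"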
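Because the script expects BLOGURL in http://someblog.blogspot.com form, a quick format check before running it can catch bare hostnames or custom domains. The function below is a hypothetical addition, not part of the original script, and its pattern is an assumption about what counts as a valid address.

```shell
#!/bin/sh
# Hypothetical check: succeed only for URLs shaped like
# http://someblog.blogspot.com (https and a trailing slash are also accepted).
is_blogspot_url() {
  printf '%s\n' "$1" | grep -qE '^https?://[A-Za-z0-9-]+\.blogspot\.com/?$'
}

# Example usage:
# if is_blogspot_url "$blogspoturl"; then
#   echo "format looks fine"
# else
#   echo "expected http://someblog.blogspot.com" >&2
# fi
```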

Export XML trick

Append this to a blog URL to download the most recent 499 posts (that is the limit): /atom.xml?redirect=false&max-results=
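As a rough sketch, the feed URL can be built and the result sanity-checked from the shell. Both function names are hypothetical, and passing the post count as the max-results value is an assumption based on the 499 limit mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: build the Atom export URL for a blog.
# $1 = blog URL, $2 = number of posts to request (assumed to be at most 499).
atom_export_url() {
  printf '%s/atom.xml?redirect=false&max-results=%s\n' "${1%/}" "$2"
}

# Hypothetical helper: count <entry> elements in an Atom feed read from stdin.
count_atom_entries() {
  grep -o '<entry[ >]' | wc -l | tr -d ' '
}

# Example usage (needs network access; foobar is a placeholder blog name):
# wget -q -O feed.xml "$(atom_export_url http://foobar.blogspot.com 499)"
# count_atom_entries < feed.xml
```

If the entry count comes back at the limit, the blog probably has more posts than a single export returns, so a full crawl is still needed.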

Your own blogs

Download them at https://takeout.google.com/settings/takeout

We have not tested whether the output is suitable for importing into other software such as WordPress.

References