Difference between revisions of "Nifty"

From Archiveteam
(add e-shuushuu.net to todo)
(fix github urls)
Line 31:


* On 2016-09-12, [[User:Sanqui]] harvested 8884 *.nifty.com URLs from Wikimedia sites using [[Site exploration#MediaWiki wikis|mwlinkscrape]]
* On 2016-09-13, root homepages were added to this list, making it 11423 URLs: https://raw.githubusercontent.com/Sanqui/archiveteam-nifty/master/urls/wikimedia.txt.  ArchiveBot job ident <tt>21z8da69732jgmp4g6pn949p4</tt>
* On 2016-09-13, root homepages were added to this list, making it 11423 URLs: https://raw.githubusercontent.com/ArchiveTeam/nifty-discovery/master/urls/wikimedia.txt.  ArchiveBot job ident <tt>21z8da69732jgmp4g6pn949p4</tt>
* On 2016-09-15, Hatena bookmarks were scraped with [https://github.com/Sanqui/archiveteam-nifty/blob/master/scrape_hatena.py a script] and derived, producing a list of 19973 URLs: https://raw.githubusercontent.com/Sanqui/archiveteam-nifty/master/urls/hatena.txt.  ArchiveBot job ident <tt>3i04vcsil92hl80yxbxiimncn</tt>
* On 2016-09-15, Hatena bookmarks were scraped with [https://github.com/ArchiveTeam/nifty-discovery/blob/master/scrape_hatena.py a script] and derived, producing a list of 19973 URLs: https://raw.githubusercontent.com/ArchiveTeam/nifty-discovery/master/urls/hatena.txt.  ArchiveBot job ident <tt>3i04vcsil92hl80yxbxiimncn</tt>


Next steps

Revision as of 07:24, 16 September 2016

Nifty
Japanese ISP with web hosting
URL: homepage.nifty.com
Status: Closing
Archiving status: Not saved yet
Archiving type: Unknown
IRC channel: #archiveteam-bs (on hackint)

Nifty is a Japanese ISP that provides web hosting. It will close about 140,000 unclaimed homepages by 2016-09-29. Termination notice (Japanese)

http://homepage1.nifty.com/USERNAME/
http://homepage2.nifty.com/USERNAME/
http://homepage3.nifty.com/USERNAME/
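
Given the URL pattern above, a harvested deep link can be reduced to the root homepage of its user. The following is a minimal sketch, not the method actually used for the lists on this page. It assumes every user site lives on a host of the form homepageN.nifty.com with the username as the first path segment; the function name `root_homepage` is hypothetical.

```python
import re
from urllib.parse import urlsplit

# Assumption: user sites live at http://homepageN.nifty.com/USERNAME/
# (the digit is optional so that plain homepage.nifty.com also matches).
HOST_RE = re.compile(r"^homepage\d*\.nifty\.com$")


def root_homepage(url):
    """Return the root homepage URL for a harvested URL.

    Returns None when the URL does not fit the assumed pattern
    (wrong host, or no username segment in the path).
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if not HOST_RE.match(host):
        return None
    # Keep only non-empty path segments; the first one is taken as USERNAME.
    segments = [s for s in parts.path.split("/") if s]
    if not segments:
        return None
    return "http://%s/%s/" % (host, segments[0])
```

A URL such as `http://homepage2.nifty.com/someuser/dir/page.html` (a made-up example) would map to `http://homepage2.nifty.com/someuser/`. A file placed directly at the host root would be mistaken for a username, so the output may need a manual check.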

URL harvesting

Let's follow Site exploration.

<polm> One thing I would recommend is searching Hatena Bookmarks, which is like a Japanese free Pinboard
<polm> Like so: http://b.hatena.ne.jp/entrylist?url=homepage2.nifty.com
<polm> the "of" query parameter paginates like so: http://b.hatena.ne.jp/entrylist?url=homepage2.nifty.com&of=20
<zout> there's some here. https://archive.is/homepage2.nifty.com
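
The pagination polm describes can be sketched as a small URL generator. This is an illustration only and is not the scrape_hatena.py script referenced elsewhere on this page. It assumes the `of` parameter is an offset that grows in steps of 20 (inferred from the single `of=20` example above); the function name `entrylist_urls` and the `page_size` parameter are hypothetical.

```python
from urllib.parse import urlencode

BASE = "http://b.hatena.ne.jp/entrylist"


def entrylist_urls(host, pages, page_size=20):
    """Yield entrylist URLs for the first `pages` pages of bookmarks for `host`.

    Assumption: `of` is an entry offset and each page holds `page_size`
    entries. The first page is requested without an `of` parameter.
    """
    for page in range(pages):
        params = {"url": host}
        offset = page * page_size
        if offset:
            params["of"] = offset
        yield BASE + "?" + urlencode(params)


if __name__ == "__main__":
    for u in entrylist_urls("homepage2.nifty.com", 3):
        print(u)
```

Fetching and parsing each page is left out on purpose, since the markup of the result pages is not described here.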

Progress

Next steps

  • GoogleScraper is no good. Attempt scraping Bing and Twitter using the hints on Site exploration
  • Scrape archive.is
  • Scrape http://e-shuushuu.net/ (DoomTay)
  • Put chunks of up to 100k URLs onto high speed (20160911.01) ArchiveBot pipelines
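
The chunking step in the last item can be sketched as follows. This is a minimal illustration, assuming the input is an iterable of URL strings (for example, lines of a text file) and that duplicates and blank lines should be dropped before chunking; the function name `chunk_urls` is hypothetical, and submitting the chunks to ArchiveBot is not covered.

```python
def chunk_urls(urls, max_size=100_000):
    """Yield lists of at most `max_size` unique, non-empty URLs.

    Order of first appearance is preserved. Deduplication is an
    assumption of this sketch, not a stated requirement.
    """
    seen = set()
    chunk = []
    for url in urls:
        url = url.strip()
        if not url or url in seen:
            continue
        seen.add(url)
        chunk.append(url)
        if len(chunk) == max_size:
            yield chunk
            chunk = []
    # Emit the final, possibly shorter, chunk.
    if chunk:
        yield chunk
```

Each yielded list could then be written to its own file, one URL per line, before being handed to a pipeline.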