Difference between revisions of "Nifty"

From Archiveteam
Line 35:
* On 2016-09-16, archive.is pages were scraped with [https://github.com/ArchiveTeam/nifty-discovery/blob/master/scrape_archive_is.py a script], derived and deduplicated, producing a list of a mere 1165 URLs: https://raw.githubusercontent.com/ArchiveTeam/nifty-discovery/master/urls/archiveis.txt (ArchiveBot job ident <tt>2bkvkya714zxqkity2cmw1w10</tt>)
* [[User:DoomTay]] has plucked more URLs from [http://e-shuushuu.net/wiki/index.php?title=Special:LinkSearch&target=http%3A%2F%2F%2A.nifty.com&limit=500&offset=0 e-shuushuu wiki] (ArchiveBot job ident <tt>3spkhvzhep0azp811nk4zelw5</tt>) and from {{url|http://award.surpara.com/misssp/|Miss Surfersparadise}} (ArchiveBot job ident <tt>ew3a0olovf2e2pq20ki2fwgra</tt>)
* On 2016-09-23, almost 80 URLs were scraped from [[Portalgraphics.net]] artist data (ArchiveBot job ident <tt>6gjq81kbvhhcjvf6v5z4ysv4i</tt>)


Next steps

Revision as of 04:42, 24 September 2016

Nifty
Japanese ISP with web hosting
URL: homepage.nifty.com
Status: Closing
Archiving status: In progress...
Archiving type: Unknown
Project source: https://github.com/ArchiveTeam/nifty-discovery
IRC channel: #niftyjanai (on hackint)

Nifty is a Japanese ISP that provides web hosting. It will close about 140,000 unclaimed homepages by 2016-09-29. See the termination notice (Japanese).

http://homepage1.nifty.com/USERNAME/
http://homepage2.nifty.com/USERNAME/
http://homepage3.nifty.com/USERNAME/
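
Since every homepage lives under one of the three hosts above, a harvested deep link can be reduced to its homepage root before deduplication. The following is a minimal sketch, not part of the project's tooling: the function name `homepage_root` is hypothetical, and it assumes that the first path segment is always the username.

```python
import re
from urllib.parse import urlsplit

# Only the three hosts listed above are treated as Nifty homepage hosts.
HOMEPAGE_HOST = re.compile(r"^homepage[123]\.nifty\.com$")


def homepage_root(url):
    """Return http://homepageN.nifty.com/USERNAME/ for a Nifty URL, else None.

    Assumption: the first path segment is the username.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if not HOMEPAGE_HOST.match(host):
        return None
    segments = [s for s in parts.path.split("/") if s]
    if not segments:
        return None  # bare host, no username to extract
    return f"http://{host}/{segments[0]}/"
```

For example, `homepage_root("http://homepage2.nifty.com/USERNAME/sub/page.html")` gives `http://homepage2.nifty.com/USERNAME/`, and URLs on other hosts give `None`.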

URL harvesting

Let's follow Site exploration.

<polm> One thing I would recommend is searching Hatena Bookmarks, which is like a Japanese free Pinboard
<polm> Like so: http://b.hatena.ne.jp/entrylist?url=homepage2.nifty.com
<polm> the "of" query parameter paginates like so: http://b.hatena.ne.jp/entrylist?url=homepage2.nifty.com&of=20
<zout> there's some here. https://archive.is/homepage2.nifty.com
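
The Hatena Bookmarks pagination described above can be sketched as a small URL generator. This is an illustration only: the helper name `entrylist_pages` is hypothetical, and the step of 20 entries per page is an assumption inferred from the single `of=20` example. The sketch only builds URLs; fetching and parsing the result pages is left out.

```python
from urllib.parse import urlencode

BASE = "http://b.hatena.ne.jp/entrylist"


def entrylist_pages(site, pages, step=20):
    """Yield entrylist URLs for `site`, one per result page.

    Assumption: the "of" offset advances by `step` entries per page.
    """
    for i in range(pages):
        params = {"url": site}
        if i:
            # The first page carries no offset, as in the example above.
            params["of"] = i * step
        yield f"{BASE}?{urlencode(params)}"


for page_url in entrylist_pages("homepage2.nifty.com", 3):
    print(page_url)
```

The same generator can be run once per host (homepage1, homepage2, homepage3) to cover all three URL patterns.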

Progress

Next steps

  • GoogleScraper is no good. Try scraping Bing and Twitter instead, using the hints on Site exploration
  • Put chunks of up to 100k URLs onto high-speed (20160911.01) ArchiveBot pipelines
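
Splitting a harvested URL list into chunks of up to 100k could look like the sketch below. It is a minimal example under stated assumptions: the helper names `dedupe` and `chunked` are hypothetical, and how each chunk is handed to an ArchiveBot pipeline is not covered here.

```python
from itertools import islice


def dedupe(urls):
    """Drop repeated URLs while keeping first-seen order."""
    return list(dict.fromkeys(urls))


def chunked(urls, size=100_000):
    """Yield consecutive lists of at most `size` URLs."""
    it = iter(urls)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Tiny demonstration with a chunk size of 2 instead of 100k.
sample = [
    "http://homepage1.nifty.com/a/",
    "http://homepage2.nifty.com/b/",
    "http://homepage1.nifty.com/a/",
    "http://homepage3.nifty.com/c/",
]
for n, chunk in enumerate(chunked(dedupe(sample), size=2)):
    print(n, chunk)
```

Deduplicating before chunking keeps each URL in exactly one chunk, so no homepage is queued twice across pipelines.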