Difference between revisions of "Nifty"

Revision as of 04:42, 24 September 2016

Nifty
Japanese ISP with web hosting
URL	homepage.nifty.com
Status	Closing
Archiving status	In progress...
Archiving type	Unknown
Project source	https://github.com/ArchiveTeam/nifty-discovery
IRC channel	#niftyjanai (on hackint)

Nifty is a Japanese ISP that provides web hosting. It will be closing about 140,000 unclaimed homepages by 2016-09-29. Termination notice (Japanese)

http://homepage1.nifty.com/USERNAME/
http://homepage2.nifty.com/USERNAME/
http://homepage3.nifty.com/USERNAME/

URL harvesting

Let's follow Site exploration.

<polm> One thing I would recommend is searching Hatena Bookmarks, which is like a Japanese free Pinboard
<polm> Like so: http://b.hatena.ne.jp/entrylist?url=homepage2.nifty.com
<polm> the "of" query parameter paginates like so: http://b.hatena.ne.jp/entrylist?url=homepage2.nifty.com&of=20
<zout> there's some here. https://archive.is/homepage2.nifty.com
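
The pagination polm describes could be walked roughly like this. This is only a sketch: the step of 20 is inferred from the single `of=20` example above, and `hatena_page_urls`, `pages` and `page_size` are made-up names, not part of any existing script.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URLs in the chat above.
BASE = "http://b.hatena.ne.jp/entrylist"


def hatena_page_urls(host, pages, page_size=20):
    """Yield entrylist URLs for the first `pages` result pages for `host`.

    Assumption: the "of" offset grows by `page_size` (20) per page;
    only of=20 is actually shown in the chat.
    """
    for i in range(pages):
        params = {"url": host}
        offset = i * page_size
        if offset:
            # The first page carries no "of" parameter at all.
            params["of"] = offset
        yield BASE + "?" + urlencode(params)


if __name__ == "__main__":
    for url in hatena_page_urls("homepage2.nifty.com", 3):
        print(url)
```

Fetching and parsing each page is left out on purpose; when to stop (for example, at the first page with no entries) would need to be checked against the live site.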

Progress

On 2016-09-12, User:Sanqui harvested 8884 *.nifty.com URLs from Wikimedia sites using mwlinkscrape
On 2016-09-13, root homepages were added to this list, making it 11423 URLs: https://raw.githubusercontent.com/ArchiveTeam/nifty-discovery/master/urls/wikimedia.txt. ArchiveBot job ident 21z8da69732jgmp4g6pn949p4
On 2016-09-15, Hatena bookmarks were scraped with a script and derived, producing a list of 19973 URLs: https://raw.githubusercontent.com/ArchiveTeam/nifty-discovery/master/urls/hatena.txt. ArchiveBot job ident 3i04vcsil92hl80yxbxiimncn
On 2016-09-16, archive.is pages were scraped with a script, derived and deduplicated, producing a list of a mere 1165 URLs: https://raw.githubusercontent.com/ArchiveTeam/nifty-discovery/master/urls/archiveis.txt. ArchiveBot job ident 2bkvkya714zxqkity2cmw1w10
User:DoomTay has plucked more URLs from the e-shuushuu wiki (ArchiveBot job ident 3spkhvzhep0azp811nk4zelw5) and from Miss Surfersparadise (ArchiveBot job ident ew3a0olovf2e2pq20ki2fwgra)
On 2016-09-23, almost 80 URLs were scraped from Portalgraphics.net artist data (ArchiveBot job ident 6gjq81kbvhhcjvf6v5z4ysv4i)
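
A rough sketch of the "derived and deduplicated" step used for the lists above, assuming "derived" means adding each URL's root homepage (as in the 2016-09-13 entry) and that homepages follow the `http://homepageN.nifty.com/USERNAME/` pattern. The function names are hypothetical and this is not the code from the nifty-discovery repository.

```python
import re

# Matches the scheme, a homepage/homepage1/homepage2/... host and the
# first path segment, which is assumed to be the USERNAME.
ROOT_RE = re.compile(
    r"^(https?://homepage\d*\.nifty\.com/[^/?#]+)", re.IGNORECASE
)


def root_homepage(url):
    """Return the root homepage URL for a Nifty page, or None if no match."""
    match = ROOT_RE.match(url.strip())
    return match.group(1) + "/" if match else None


def derive_and_dedupe(urls):
    """Add each URL's root homepage and drop duplicates, keeping order."""
    seen = set()
    result = []
    for url in urls:
        url = url.strip()
        if not url:
            continue
        for candidate in (url, root_homepage(url)):
            if candidate and candidate not in seen:
                seen.add(candidate)
                result.append(candidate)
    return result


if __name__ == "__main__":
    sample = [
        "http://homepage1.nifty.com/USERNAME/page.html",
        "http://homepage1.nifty.com/USERNAME/",
    ]
    for line in derive_and_dedupe(sample):
        print(line)
```

A file placed directly at the host root (no USERNAME directory) would be mistaken for a username by this pattern, so the output is worth a quick look before it is queued.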

Next steps

GoogleScraper is no good. Attempt scraping Bing and Twitter using the hints on Site exploration
Put chunks of up to 100k URLs onto high speed (20160911.01) ArchiveBot pipelines
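
Splitting a long URL list into chunks of at most 100k could look roughly like this. It is a sketch only: `chunked` is a hypothetical helper, and how each chunk is then handed to an ArchiveBot pipeline is not covered here.

```python
from itertools import islice


def chunked(urls, size=100_000):
    """Yield successive lists of at most `size` URLs from any iterable."""
    iterator = iter(urls)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    # Tiny demonstration with a chunk size of 2 instead of 100k.
    demo = ["http://homepage%d.nifty.com/USERNAME/" % n for n in (1, 2, 3)]
    for number, chunk in enumerate(chunked(demo, size=2), start=1):
        print(number, len(chunk))
```

Because the helper reads lazily, it can be fed an open file of URLs without loading the whole list into memory.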

@@ Line 35: / Line 35: @@
 * On 2016-09-16, archive.is pages were scraped with [https://github.com/ArchiveTeam/nifty-discovery/blob/master/scrape_archive_is.py a script], derived and deduplicated, producing a list of mere 1165 URLs: https://raw.githubusercontent.com/ArchiveTeam/nifty-discovery/master/urls/archiveis.txt.  ArchiveBot job ident <tt>2bkvkya714zxqkity2cmw1w10</tt>
 * [[User:DoomTay]] has plucked more URLs from [http://e-shuushuu.net/wiki/index.php?title=Special:LinkSearch&target=http%3A%2F%2F%2A.nifty.com&limit=500&offset=0 e-shuushuu wiki] (ArchiveBot job ident <tt>3spkhvzhep0azp811nk4zelw5</tt>) and from {{url|http://award.surpara.com/misssp/|Miss Surfersparadise}} (ArchiveBot job ident <tt>ew3a0olovf2e2pq20ki2fwgra</tt>)
+* On 2016-09-23, almost 80 URLs were scraped from [[Portalgraphics.net]] artist data (ArchiveBot job ident <tt>6gjq81kbvhhcjvf6v5z4ysv4i</tt>)
 Next steps
