Difference between revisions of "GeoCities Japan"
(Adding some issues to keep track of.)
Revision as of 05:31, 5 November 2018
GeoCities Japan | |
URL | http://www.geocities.jp/, http://www.geocities.co.jp/ |
Status | Closing |
Archiving status | In progress... |
Archiving type | Unknown |
IRC channel | #notagain (on hackint) |
GeoCities Japan is the Japanese version of GeoCities. It survived the 2009 shutdown of the global platform.
Shutdown
On 2018-10-01, Yahoo! Japan announced that they would be closing GeoCities at the end of March 2019. (New accounts can still be created until 2019-01-10.)
Discovery Info
- DNS CNAMEs for geocities (JSON format) (dead link): https://transfer.sh/QYWEG/geocities-dns-data
- Several records available at: https://anonfile.com/z1z62ak8ba/records_zip
  - geocities_jp_first.txt: First-level subdirectory list under geocities.jp, compiled from IA CDX data. 566,690 records in total.
  - geocities_co_jp_first.txt: Same as above, for geocities.co.jp. 12,470 records in total.
    - NOTE: The majority of sites under geocities.co.jp are not first-level sites but second-level "neighborhood" sites (in theory there could be 1.79M of them; how many actually exist is unknown). See the URL format under geocities_co_jp_fields.txt below.
  - blogs_yahoo_co_jp_first.txt: Same as above, for blogs.yahoo.co.jp. 646,901 records in total.
  - geocities_co_jp_fields.txt: List of neighborhood names under geocities.co.jp.
    - Individual websites are listed in the following format: "http://www.geocities.co.jp/[NeighborhoodName]/[AAAA]" where AAAA ranges from 1000 to 9999.
  - include-surts.txt: List of subdomains that should be allowed by your crawler.
- geocities.jp grab from E-Shuushuu Wiki, crawled as job:cu6azkjwy45qmo1wwdxsdfusj: https://pastebin.com/raw/17hLpsN5
- geocities.jp grab from Danbooru, crawled as job:5x0pf7wloqgeqc2r9rddino2l: https://gist.githubusercontent.com/DoomTay/12a146e35fcee745b764ba3ae3c7545f/raw/863a021e43e0c93cb6f8943725a2ef5d1a699477/geocities-danbooru.txt
- geocities.co.jp and missed geocities.jp URLs grabbed from the above targets, crawled as job:31ges4c4c96k140sp6zah5vcc: https://transfer.sh/CLtZc/geocities-patch.txt (dead link)
- geocities.co.jp and geocities.jp crawl from Miss Surfersparadise, crawled as job:e8ynrp5a7p4vwjkyxw9eph9p0: https://archive.org/download/archiveteam_archivebot_go_20181021150002/urls-transfer.sh-geocities-misssp.txt-inf-20181007-102152-3ntkw-urls.txt
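Because neighborhood sites follow the fixed "[NeighborhoodName]/[AAAA]" pattern described above, candidate URLs can be generated directly from the neighborhood list. The following is a minimal sketch, assuming geocities_co_jp_fields.txt holds one neighborhood name per line (the file layout is an assumption, and the function name is hypothetical). Each name yields 9,000 candidates, most of which may not exist.

```python
from typing import Iterable, Iterator

BASE = "http://www.geocities.co.jp"


def neighborhood_urls(names: Iterable[str],
                      start: int = 1000,
                      end: int = 9999) -> Iterator[str]:
    """Yield one candidate URL per (neighborhood, number) pair.

    `names` is any iterable of neighborhood names, e.g. the lines of
    geocities_co_jp_fields.txt (assumed: one name per line).
    """
    for name in names:
        name = name.strip()
        if not name:
            continue  # skip blank lines
        for number in range(start, end + 1):  # inclusive upper bound
            yield f"{BASE}/{name}/{number}"


if __name__ == "__main__":
    # "ExampleTown" is a placeholder, not a real neighborhood name.
    for url in neighborhood_urls(["ExampleTown"], start=1000, end=1002):
        print(url)
```

Using a generator keeps memory use flat even when the full candidate list runs into the millions; the output can be streamed straight into a crawler's input file.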
Issues
- Hidden-entry sites (Importance: Low): A few sites do not use index.htm/index.html as their entry point, so requesting the first-level directory will fail to reach them.
  - However, as long as other GeoCities sites link to them, the crawler should still discover them.
  - So the only problem is pages whose inlinks are all dead. There should be very few of those. To be absolutely sure, we can diff IA's current CDX against the CDX from the crawl.
  - Note that this is not a problem for the neighborhood sites, since we can enumerate their URLs.
- Deduplication (Importance: Low): If we are going to release a torrent as we did with GeoCities, then it may be worth deduplicating. It most likely won't make a major difference.
- Final Snapshot (Importance: Moderate): Page contents may still change between now and March 31, 2019, so we need to do another crawl close to that date.
  - Note that many users will be setting up 301/302 redirects before the server shuts down. According to Yahoo, we have until Sep 30, 2019 to record those 301/302s.
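The CDX comparison suggested for hidden-entry sites amounts to a set difference between two URL lists. Below is a rough sketch, assuming both CDX dumps have already been reduced to plain lists of absolute URLs (one per entry). The function names and the normalization rules (drop the scheme, a leading "www.", the query string and the fragment) are illustrative choices, not an established procedure.

```python
from typing import Iterable, Set
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Reduce an absolute URL to 'host/path' for comparison.

    The host is lowercased and a leading 'www.' is dropped; the scheme,
    query string and fragment are ignored. Path case is preserved.
    """
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host + (parts.path or "/")


def missing_from_crawl(ia_urls: Iterable[str],
                       crawl_urls: Iterable[str]) -> Set[str]:
    """Return normalized URLs known to IA but absent from the crawl."""
    crawled = {normalize(u) for u in crawl_urls if u.strip()}
    known = {normalize(u) for u in ia_urls if u.strip()}
    return known - crawled


if __name__ == "__main__":
    # Placeholder URLs for illustration only.
    ia = ["http://www.geocities.jp/a/index.html",
          "http://geocities.jp/b/top.htm"]
    crawl = ["http://geocities.jp/a/index.html"]
    for url in sorted(missing_from_crawl(ia, crawl)):
        print(url)
```

Anything this reports is only a candidate for a follow-up crawl; some entries in IA's CDX may already have been gone from the live site.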