Talk:Robots.txt

From Archiveteam
Revision as of 17:49, 4 May 2019 by ATrescue (talk | contribs) (Bravo!: Referencing feature suggestion at find more URL's to archive!.)

There are a few issues with this decision which have strong consequences.

  • Robots.txt is an established protocol. Changing its meaning will break users' expectations.
  • Robots.txt is a simple (dumb) protocol which targets indexing and harvesting bots without having to rely on user agent sniffing.
  • Robots.txt is a very simple mechanism for managing a certain level of opacity.
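
To make the point about expectations concrete, here is a minimal sketch of what a well-behaved bot does today, using Python's standard urllib.robotparser. The robots.txt content, the bot names (SomeBot, ExampleArchiver) and the example.org URLs are all made up for illustration; this is not a description of any real crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for example.org (not taken from a real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleArchiver
Disallow:
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A generic bot falls under the "*" group and is asked to stay out of /private/.
generic_allowed = parser.can_fetch("SomeBot", "http://example.org/private/page.html")

# The hypothetical archiver has its own group with an empty Disallow (allow all).
archiver_allowed = parser.can_fetch("ExampleArchiver", "http://example.org/private/page.html")

# Paths that no rule mentions stay open to everyone.
public_allowed = parser.can_fetch("SomeBot", "http://example.org/index.html")

print(generic_allowed, archiver_allowed, public_allowed)
```

The site owner writes the rules expecting exactly this behaviour; a bot that silently reinterprets the file breaks that expectation.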

The way forward is not to ditch robots.txt but to create something better, and to develop the tools and protocols that help people adopt it. Let's give users better control. We could start something with a W3C community group.

Issues to solve

  • Robots.txt exists only in the root of a website, which makes it unusable on a Web site with multiple owners. For example, it doesn't address the differences between example.org/mike and example.org/suzie
  • Robots.txt makes the directory visible, when that should not necessarily be the case. You have to reveal a path in order to hide it.
  • A protocol should make it possible to target classes of user agents (browsers, bots, etc.), so that a Web site can be public and at the same time not indexable.
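
The first two issues can be sketched in a few lines of Python. The helper names (robots_url_for, disallowed_paths) and the sample file are hypothetical, written only for this illustration, and the parsing is deliberately naive (it ignores User-agent groups and Allow lines).

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical robots.txt for a shared site; the paths are invented.
SAMPLE = """\
User-agent: *
Disallow: /mike/drafts/        # Mike's unfinished posts
Disallow: /suzie/family-photos/
Disallow:
"""


def robots_url_for(page_url: str) -> str:
    """Return the only place a robots.txt can live for this page: the site root."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def disallowed_paths(robots_text: str) -> list[str]:
    """List every non-empty Disallow value, i.e. what the file reveals."""
    paths = []
    for raw in robots_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        value = value.strip()
        if field.strip().lower() == "disallow" and value:
            paths.append(value)
    return paths


# Mike and Suzie share one file, whoever controls the root controls both.
print(robots_url_for("http://example.org/mike/blog.html"))
print(robots_url_for("http://example.org/suzie/index.html"))

# Anyone can read the file and learn which paths the owners wanted hidden.
print(disallowed_paths(SAMPLE))
```

Both users resolve to the same root file, and the list of "hidden" paths is readable by anyone who asks for it.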

Karlcow

Bravo!

This manifesto truly stole my words! I have always despised the message “This page cannot be crawled or displayed due to robots.txt”! Thank you for pointing out the truth about R*b*ts.t**, whose only function for ArchiveTeam is to find more URLs to archive (similar to sitemap.xml).--ATrescue (talk) 17:47, 4 May 2019 (UTC)