Cost of archiving a website

Using wget to archive a website is simple and is ideal for small and medium websites. For very large websites, however, you will end up downloading a lot of pages and files that are not needed. Unneeded pages include duplicate pages, search pages, login pages and "edit message" pages. Downloading those unneeded pages wastes a lot of disk space. In such cases, one may have to spend a little time studying the website's organization in order to determine which pages should be downloaded and which should be skipped.
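As a rough illustration of such filtering, the Python sketch below decides whether a URL is worth fetching by rejecting a few common patterns for search, login and edit pages. The patterns are only examples; the right ones depend entirely on how the website in question is organized.

 import re

 # Illustrative patterns for pages that are usually not worth archiving.
 # They must be adjusted after studying the target website's URL structure.
 SKIP_PATTERNS = [
     re.compile(r"[?&]action=edit"),   # "edit message" style pages
     re.compile(r"/search\b"),         # search result pages
     re.compile(r"/login\b"),          # login pages
     re.compile(r"[?&]sort="),         # duplicate listings, merely re-sorted
 ]

 def should_download(url):
     """Return True if the URL does not match any skip pattern."""
     return not any(p.search(url) for p in SKIP_PATTERNS)

 print(should_download("http://example.com/wiki/Main_Page"))         # True
 print(should_download("http://example.com/index.php?action=edit"))  # False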

Real-time archiving

Real-time archiving requires scripts that poll for new content continuously. Writing such scripts may take a lot of time. One may have to parse the HTML to discover pages with new or updated content. The HTML may be invalid, and the page's character encoding may also be invalid, so one may have to convert the content to another encoding before storing it. One also has to handle connection errors and missing or invalid pages. If the pages contain media such as images, a suitable data structure or file layout must be designed to store them.
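The following Python sketch shows the skeleton of such a polling script, using only the standard library. The URL, polling interval and output directory are placeholders, and link or media extraction from the fetched HTML is left out.

 import time
 import urllib.error
 import urllib.request
 from pathlib import Path

 URL = "http://example.com/news"   # placeholder: the page to poll
 OUT_DIR = Path("snapshots")       # placeholder: where snapshots are written
 INTERVAL = 300                    # placeholder: poll every 5 minutes

 def fetch(url):
     """Download one page, returning decoded text or None on failure."""
     try:
         with urllib.request.urlopen(url, timeout=30) as resp:
             raw = resp.read()
             charset = resp.headers.get_content_charset() or "utf-8"
     except (urllib.error.URLError, OSError):
         # Covers connection errors as well as missing pages (HTTP 4xx/5xx).
         return None
     try:
         return raw.decode(charset, errors="replace")
     except LookupError:
         # The declared charset is not a real codec; fall back to UTF-8.
         return raw.decode("utf-8", errors="replace")

 def poll_forever():
     OUT_DIR.mkdir(exist_ok=True)
     while True:
         text = fetch(URL)
         if text is not None:
             # Store every successful fetch under a timestamped file name.
             name = time.strftime("%Y%m%dT%H%M%S") + ".html"
             (OUT_DIR / name).write_text(text, encoding="utf-8")
         time.sleep(INTERVAL)

 if __name__ == "__main__":
     poll_forever()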

This gets even more complicated when the website's content is edited often. A data structure must be designed to store different revisions of the content in a space-efficient manner. One has to periodically re-download content and compare it with the prior revisions to see whether anything has been edited.
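One simple way to keep re-downloads space-efficient, assuming whole files are stored rather than diffs, is to hash each download and write a new revision only when the hash differs from the previous one. The Python function below is a sketch of that idea; the directory layout and the choice of SHA-256 are arbitrary.

 import hashlib
 import time
 from pathlib import Path

 def store_revision(content: bytes, page_dir: Path) -> bool:
     """Write a timestamped revision only if the content has changed.

     Returns True if a new revision was written, False if the download was
     identical to the most recently stored revision.
     """
     page_dir.mkdir(parents=True, exist_ok=True)
     digest = hashlib.sha256(content).hexdigest()

     last_hash_file = page_dir / "last.sha256"
     if last_hash_file.exists() and last_hash_file.read_text() == digest:
         return False   # unchanged since the previous download

     name = time.strftime("%Y%m%dT%H%M%S") + ".html"
     (page_dir / name).write_bytes(content)
     last_hash_file.write_text(digest)
     return True

Storing diffs between revisions instead of full copies would save more space, at the cost of a more complex data structure.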