Cost of archiving a website

From Archiveteam

Using wget to archive a website is simple and works well for small and medium websites. For very large websites, however, a plain recursive download also fetches many pages and files that are not needed, such as duplicate pages, search pages, login pages and "edit message" pages. Downloading those unneeded pages wastes a lot of disk space. In this case, one may have to spend some time studying the website's organization to determine which pages should be downloaded and which should be skipped.
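As a rough illustration of that kind of study, the sketch below filters candidate URLs before they are queued for download. The path fragments and the `action=edit` parameter are hypothetical examples, not rules taken from any particular website; a real site needs its own patterns. The `canonical` helper shows one possible way to make duplicate URLs compare equal.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical patterns for illustration; every site uses different ones.
SKIP_PATH_PARTS = ("/search", "/login")


def should_download(url: str) -> bool:
    """Return False for URLs that look like search, login or edit pages."""
    parts = urlsplit(url)
    path = parts.path.lower()
    if any(fragment in path for fragment in SKIP_PATH_PARTS):
        return False
    # Assumed convention: edit forms are reached through ?action=edit
    if "edit" in parse_qs(parts.query).get("action", []):
        return False
    return True


def canonical(url: str) -> str:
    """Normalize a URL so that trivially different duplicates compare equal."""
    parts = urlsplit(url)
    # Sort query parameters and drop the fragment, which never reaches the server.
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, query, ""))
```

A crawler could keep a set of `canonical(url)` values it has already seen and only fetch a URL when `should_download(url)` is true and its canonical form is new.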

Real-time archiving

Real-time archiving requires scripts that poll for new content continuously. Writing such scripts may take a lot of time. One may have to parse the HTML to discover pages with new or updated content, and the HTML may be invalid. The encoding of the page may also be invalid, so one may have to convert the text to another encoding before storing it. One has to handle connection errors and missing or invalid pages. If the site has media such as images, one has to design a data structure or file layout good enough to store them.
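Two of these chores, connection errors and bad encodings, can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions, not a complete poller: the function names `fetch` and `decode_page`, the 30-second timeout, and the fallback order (declared encoding, then UTF-8, then Latin-1) are all choices made for illustration.

```python
import http.client
import urllib.request
from typing import Optional


def decode_page(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode page bytes, tolerating a wrong or unknown declared encoding."""
    for encoding in (declared, "utf-8"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong encoding, or an encoding name Python does not know.
            continue
    # Latin-1 maps every byte to a character, so this last resort cannot fail.
    return raw.decode("latin-1")


def fetch(url: str, timeout: float = 30.0) -> Optional[str]:
    """Download one page; return None instead of raising on network errors."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            declared = response.headers.get_content_charset()
            return decode_page(response.read(), declared)
    except (OSError, http.client.HTTPException) as exc:
        # URLError, HTTPError (e.g. 404) and timeouts are all OSError subclasses.
        print(f"skipping {url}: {exc}")
        return None
```

A polling loop would call `fetch` for each known URL on a schedule and treat a `None` result as "try again later". Because the decoded text is a Python string, it can then be written out in a single encoding such as UTF-8 regardless of what the site served.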

This gets even more complicated when the website content is edited often. A data structure must be designed to store different revisions of the content in a space-efficient manner. One has to periodically re-download the content and compare it with prior revisions to see whether anything has been edited.
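One possible shape for such a data structure is a content-addressed store: each distinct version of a page is kept once under its hash, and the per-URL history only records timestamps and hashes. The sketch below is an assumption about how this could be done, not a description of any existing Archiveteam tool; the class name `RevisionStore` and its in-memory dictionaries are hypothetical, and a real archive would persist the data to disk.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class RevisionStore:
    # digest -> content bytes; identical content is stored only once
    blobs: Dict[str, bytes] = field(default_factory=dict)
    # url -> list of (timestamp, digest), oldest first
    history: Dict[str, List[Tuple[float, str]]] = field(default_factory=dict)

    def record(self, url: str, content: bytes, timestamp: float) -> bool:
        """Store a freshly downloaded copy; return True if it is a new revision."""
        digest = hashlib.sha256(content).hexdigest()
        revisions = self.history.setdefault(url, [])
        if revisions and revisions[-1][1] == digest:
            return False  # unchanged since the last poll, nothing to store
        self.blobs.setdefault(digest, content)
        revisions.append((timestamp, digest))
        return True
```

Comparing hashes avoids keeping the previous copy in memory for the comparison, and a page that is reverted to an earlier state costs only one extra history entry. Storing only the differences between revisions would save more space for large pages with small edits, at the price of a more complex design.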