The Internet Archive

Intro

Archives are good. Original documents and original materials, well taken care of, can provide invaluable information to researchers who are trying to find out why and how something happened or who are trying to get a perspective on a subject. Our US National Archives and Records Administration (http://www.archives.gov/) is one great example, but almost every governmental organization, every not-for-profit organization, every academic institution, every body that needs to document its history, and many people who have made a profound intellectual or aesthetic impact on our lives need to develop some kind of archive.
But how do you archive the entire Internet, the entire World Wide Web? How do you archive transient electronic information?

Solving those problems and creating such an archive is the goal of the not-for-profit Internet Archive (http://www.archive.org), which is headquartered in the Presidio in San Francisco. To quote its home page, "The Internet Archive is building a digital library of Internet sites and other cultural artifacts in digital form. Like a paper library, we provide free access to researchers, historians, scholars, and the general public."

To that end, the organization's Web crawlers (also called Web spiders) download everything that they can from the Internet. The Archive says it has over one petabyte (1,000,000,000,000,000 bytes) of information stored on its disk arrays and is downloading about 20 terabytes each month.

This translates into over 40 billion Web pages in the Archive's database. But remember, this is a very rough estimate.
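To get a feel for those numbers, here is a small back-of-the-envelope check in Python. It is only a sketch: it assumes decimal units (1 petabyte = 10^15 bytes, 1 terabyte = 10^12 bytes) and pretends that every stored byte belongs to a Web page, which overstates the page size because the Archive also holds movies, audio, and texts. The variable names are made up for this illustration.

```python
# Rough arithmetic on the figures quoted above (decimal units assumed).
PETABYTE = 10**15   # bytes
TERABYTE = 10**12   # bytes

archive_bytes = 1 * PETABYTE           # "over one petabyte"
web_pages = 40 * 10**9                 # "over 40 billion Web pages"
monthly_download = 20 * TERABYTE       # "about 20 terabytes each month"

# Upper bound on the average page size, assuming ALL bytes are Web pages.
avg_bytes_per_page = archive_bytes // web_pages

# Months needed to download another petabyte at the quoted rate.
months_per_petabyte = archive_bytes // monthly_download

print(f"Average bytes per page (upper bound): {avg_bytes_per_page:,}")
print(f"Months to add one more petabyte: {months_per_petabyte}")
```

Under these assumptions the average works out to about 25,000 bytes per page, and the Archive would take on another petabyte in roughly 50 months if the download rate stayed constant.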

Note: The Internet Archive is mirrored at the Bibliotheca Alexandrina (the new Library of Alexandria) in Egypt at: http://www.bibalex.org/English/initiatives/internetarchive/web.htm.

Internet Archive Tools

The Wayback Machine (anybody remember the reference?)

The Wayback Machine (http://www.archive.org/web/web.php) enables you to look at the previous states of Web sites and Web pages.

You just enter a Web page's address, and the Wayback Machine displays a table of the copies of that page collected on different days through the years.

It only goes back to 1996, though.

One fun thing to do is look at the early Google.com or Yahoo.com.
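You can also go straight to an archived copy by building its address yourself. The sketch below assumes the Wayback Machine's address pattern http://web.archive.org/web/<timestamp>/<original address>, where the timestamp is a date written as YYYYMMDD and an asterisk in its place asks for the table of all collected copies. The function name wayback_url and the sample dates are made up for this illustration, and the code only builds the address; it does not contact the Archive.

```python
from datetime import date
from typing import Optional

WAYBACK_PREFIX = "http://web.archive.org/web"
EARLIEST_YEAR = 1996  # the Wayback Machine only goes back to 1996


def wayback_url(page_url: str, when: Optional[date] = None) -> str:
    """Build a Wayback Machine address for page_url (hypothetical helper).

    With no date, return the address of the table of all collected copies.
    With a date, return the address of the copy closest to that day.
    """
    if when is None:
        return f"{WAYBACK_PREFIX}/*/{page_url}"
    if when.year < EARLIEST_YEAR:
        raise ValueError(f"No copies exist before {EARLIEST_YEAR}")
    # The timestamp is the date written as YYYYMMDD.
    return f"{WAYBACK_PREFIX}/{when:%Y%m%d}/{page_url}"


# The early Google.com, as of an arbitrarily chosen day in 1998.
print(wayback_url("http://www.google.com/", date(1998, 12, 2)))
# The table of every collected copy of Yahoo.com.
print(wayback_url("http://www.yahoo.com/"))
```

If no copy was collected on the exact day you ask for, the Wayback Machine shows the nearest one it has, so the date does not have to be precise.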

Advanced Search is at: http://web.archive.org/collections/web/advanced.html

Recall (Currently inactive)

Recall is a search engine that searches the entire content of the Internet Archive, rather than looking up specific Web sites or pages as the Wayback Machine does.

Unfortunately, as of October 2004, Recall is down. It is supposed to be back up in a few weeks with a new index.

http://recall.archive.org

Movie Archive

Currently, this is an evolving collection of advertising, educational, industrial, and amateur films, plus all of the episodes of the (deceased) Computer Chronicles, animated films from Siggraph 2001, the episodes of Net Cafe, and other short films.

http://www.archive.org/movies/movies.php

Live Music Archive

Currently, this is an evolving collection of live music concerts.

http://www.archive.org/audio/etree.php

Text Archive

The Text Archive is aggregating all of the free, online books and other texts from the Million Book Project (currently 14,631 texts), the Children's Library (1,582 texts, including works from the International Children's Digital Library), Project Gutenberg (9,848 texts), Arpanet documents (251 texts), and other Open Source Books (413 texts). Numbers are current as of October 2004.

http://www.archive.org/texts/texts.php