Peer-to-peer web

Please email comments to the author at the above address.

First draft 12 Nov 2001
Second draft 7 May 2002
Third draft 15 Feb 2003
Fourth draft 31 Jul 2007

Introduction

I assume readers of this document are familiar with the many advantages of the World Wide Web (henceforth abbreviated as "the web"). However there are still many severe technical shortcomings which I hope to address. The present proposal for peer-to-peer enhancements to the web is intended primarily to address scalability, as discussed in my 1995 presentation at the Asia-Pacific Web Conference.

Goals

The goals of this proposal are as follows:

Reduce load on web servers (and hence hardware and bandwidth expenses)

As discussed in the above-cited paper, web servers are subject to unpredictable bursts of high activity called "flash crowds", aka "the Slashdot effect" after the effects of publicity in highly popular online journals and news sites such as slashdot.org. This is a major issue for many web publishers who can not afford the hardware and bandwidth costs required to meet the level of demand, or even if they can afford it cannot cost justify it since the average traffic levels are typically many orders of magnitude smaller. In some cases popular sites have been forced to shut down either temporarily (due to load on their web server and/or their and even their provider's bandwidth) or permanently (due to unanticipated bandwidth expenses).
Improve availability for web users

This leads directly to the second goal - it benefits both the publishers and the readers if popular web sites become more available rather than less! If the content is replicated across multiple web servers in different geographical locations connected to different service providers and backbones, the content is far more likely to remain available over longer periods of time thus providing a much-needed increase in the longer-term stability of hyperlinks.
Improve performance for web users

In addition to increasing the total amount of bandwidth available for distribution of popular web content, replication also increases the probability that the content will be available from a source that is less distant in the network topology and thus has both lower latency and possibly also higher (and perhaps also cheaper) bandwidth. All of these factors improve the download performance for web users.

Roles

Publisher

The publisher will often be the author and/or copyright holder of the documents being published, but not always. Whoever is responsible for making the content available online is by definition the publisher. One of the strengths of the Internet and the web is the increasing ease with which people can publish content online and achieve global reach at increasingly little difficulty and expense. This leads to an increase in the amount and diversity of available content, which benefits all Internet users.
Archivist/mirror

Persons other than the author, copyright holder and publisher also have an interest in the continued and widespread availability of documents, either for their own continued use over a lengthy period of time or because they value the work and wish to encourage its wider dissemination. These people may wish to make one or more replicas (mirrors) of documents available over extended periods of time either publically or to a restricted audience in order to assist these goals.
Cache

Some people and organisations also benefit by providing temporary replicas of web content in order to improve the performance and decrease the cost of their Internet service, either for their own use or as provided to others.
Reader/viewer/user/client

Practically all Internet users will require access to content made available on the web at some time - if not constantly. Easier and faster retrieval of desired content is beneficial to everyone.

Note that one person and indeed one computer system can and often will take on any combination of one or more of these roles.

Content need not be made unrestrictedly available; it can also be restricted by (for example) the IP address, domain name or user authentication of the client in order to limit the resource usage on a particular server or network. Restricting the circulation of a document is better achieved by encrypting the content with keys available only to the desired audience.

For example, Internet Service Providers often set up caches and in some cases mirrors for their customers. Mirrors are also provided by libraries and other people and institutions with an interest in the long-term preservation of content such as Google and the Internet Archive. Internet Content Hosts may wish to enter into mutual agreements with mirrors elsewhere in order to gain the benefits outlined above. Presently this has to be manually configured for each web site. This proposal also incorporates methods for automating mirroring agreements.

Implementation

I propose that these goals can be achieved by making content available from other hosts in addition to those operated by the content publisher, thus moving from a centralised to a decentralised model of web publishing. In addition to transient caching, this proposal also permits web users to become archivists/mirrors to guarantee stable availability.

The proposed technical implementation described in more detail below consists of methods for implementing mirrors and caches for both public and restricted use, methods for automatically determining the availability of content at these mirrors and caches, and methods for authenticating the validity of the copies of documents made available. Mechanisms are also proposed to permit publishers to actively encourage the replication of their content.

The proposal consists of "client functions" to be implemented in software that retrieves web content such as web servers acting as mirrors, proxy caches and web browsers and "server functions" to be implemented in software that provides web content such as web servers and proxy caches. Web browsers also typically include a local cache and could be extended to provide network access to the cache.

Client functions

The client should try to retrieve from fastest/nearest/cheapest source
The client should ask the source "tell me some public mirrors" and try fittest first on next request
Propose new HTTP extension headers? Also ICP extension?
Standard fitness function: within local net first, then highest download speed. Can add % hit ratio, traceroute #hops, partial IP address match etc. later.
Rotate through all known mirrors, moving fittest to front, least fit towards back and offline to very back. Use front of list more often than back of list.

Server functions

The server should maintain a list of mirror clients who have requested content. If contacted by an enhanced client, the server should provide a (partial? random?) list of mirror clients.
The server should periodically poll clients on the distribution list (oldest first) and check if they are online and still have valid data. This can be achieved by selecting a random resource and requesting a random byte range from that resource via a random proxy client to confirm that it matches the corresponding part of the original resource.
A mirror client can send an HTTP error to indicate that a resource is no longer present, out of date or the client is not making service available. A lack of response, however, just indicates that the client is temporarily offline.
When the list of mirror clients is too large, the clients that have been offline the longest should be removed first.
Archivist functionality is offered by selecting match criteria to keep the content indefinitely (exempt from LRU cache deletion). Match criteria are normally start-of-url but others could also be implemented such as title, author or filetype.
Servers can "push-cache", offering additional related data to a client that seems to be frequently requesting related material. Clients can reject "pushed" content for any reason - a server not already known to host the resource must be rejected to avoid malicious content spoofing, but a client can also reject content for lack of space etc.

Client - Server HTTP (& ICP?) request protocol

Client: GET <uri>, new headers (or options?): List-mirrors, Will-mirror
Will-mirror header includes URI prefix(es) offered

Server: Normal HTTP response, plus map of original → mirror URIs

Server - Mirror validation protocol

Server: GET <uri> plus Byte-Range header

Proxy: error response or valid data

Server - Mirror push-cache protocol

Server: PUT <uri>

Proxy: success or error response

Client - Server cross-check protocol

Client: HEAD <uri>

Server: Usual response, including document hash header (etag?)

Security implications

Only static resources can be supported with this system. Dynamic resources must not be offered for caching by servers (use existing HTTP headers, also proxy caches can refuse to fetch dynamic requests.) Proxy cache servers (not web servers) can have request limits from any one source and overall request limits from "outsiders".

Malicious caches that respond with valid data to the original server but altered data to some or all other requests can be detected by the original server performing verification using random proxies as described above. This requires clients to support proxy requests for content originating from servers the client is already caching.

Micropayment

TBD

Additional notes

Note added 2 Dec 2001 - see also IBM YouServ (no longer available at http://www.almaden.ibm.com/cs/people/bayardo/userv/ as of 2007)

Note added 4 May 2002 - see also http://cogitive.com/nl/main.html (no longer available as of 2004)

Note added 18 May 2002 - open-content.net looks very similar!!

Note added 15 Feb 2003 - also support out-of-band mirroring using eg. rsync

Note added 24 Nov 2004 - see also Dijjer

Note added 12 Jan 2005 - see also World Free Web (Freshmeat editorial) and Kenosis