    Part 4: File Sharing

Capabilities

File sharing is the best-known application of P2P. As the name implies, the idea is that users share their files with fellow peers in the network. Each user makes a portion of his or her files available to other users in the network, so users can freely share and receive files of interest. The first file-sharing application was Napster, which allowed swapping of MP3 music files. Today file-sharing applications support swapping files of all types. Popular file-sharing applications include Gnutella, KaZaA, iMesh, Music Nation (Morpheus), AudioGalaxy, eDonkey2000, Scour Exchange and numerous others.

The file sharing world has many interesting aspects in addition to the technological one. One is the resistance to new information distribution techniques from old content distributors such as the established film and music industries. Unfortunately, P2P file sharing systems have been labeled as a technology for illegal distribution of copyrighted material. It is worth noting that the technology has uses other than piracy. One example is as a tool for in-house project share spaces. Several of the technologies also use a decentralized operation model with no single central point of failure, which makes them suitable as a tool for tying together numerous heterogeneous information systems. An example of this application already exists in the "Care Data Exchange" system from CareScience Inc., which applies the techniques of P2P file sharing as glue for searching and exchanging medical journals between self-contained systems.

The world of P2P file sharing technologies is a constantly changing field of new ideas and systems. For example, file-sharing applications now allow chat between users and sophisticated fuzzy searches on files' meta-data. In this section, we will try to give an introduction to the technical aspects of P2P file sharing technologies by describing the Napster technology and the P2P architectures that replaced it.

First, one must understand that the communication channels in P2P networks are application-level logical channels, independent of the physical or network level. As a result, your closest peer may be physically located on the other side of the world, while the computer in the office next door may be too distant to reach. We should mention that research is being done on P2P systems that try to align the logical infrastructure with the physical one for performance purposes.



Napster

 

Napster has undoubtedly been the first killer application of P2P technology. As the user sees it, Napster is a music exchange community. More precisely, it is a commercial, centrally coordinated, P2P, MP3 file exchange system with chat support.

Peer operation
A participating peer goes through four different phases: connection establishment, search, download and connection termination. Upon connection establishment to the central Napster service, a peer uploads the list of his locally stored MP3 files. The information, including additional data in the form of the MP3-tag, is merged into the central Napster database of currently available music files. The central Napster database gives at any moment an updated view of the MP3-files of the peers currently connected to the Napster service. A peer searches the central database for music to download and will receive a list of matches including additional information about the location of each matching file. Attributes describing the network connection of the serving peers are also returned to help the user select a specific download source from the returned list of matches.

Downloading is carried out directly between peers without the involvement of the central Napster service. A peer wishing to download a file addresses the specific data unit on a serving peer in the form of the IP-address and the filename of the MP3-file. Upon connection termination, the central Napster server removes MP3 file entries in its database relating to this specific peer.
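The phases above can be sketched in a few lines of Python. This is only an illustration of the centrally coordinated model, not Napster's actual protocol or database: the class name `CentralIndex`, the `Entry` fields, the addresses and the case-insensitive substring match are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    peer_ip: str      # address of the serving peer
    filename: str     # name of the MP3 file on that peer
    link_speed: str   # connection attribute returned to help pick a source


class CentralIndex:
    """Hypothetical stand-in for the central database of available files."""

    def __init__(self):
        self._by_peer = {}  # peer_ip -> list of Entry

    def connect(self, peer_ip, link_speed, filenames):
        # Connection establishment: the peer uploads its local file list.
        self._by_peer[peer_ip] = [Entry(peer_ip, f, link_speed) for f in filenames]

    def search(self, text):
        # Search: return every entry whose filename contains the text
        # (substring matching is an assumption of this sketch).
        needle = text.lower()
        return [e for entries in self._by_peer.values()
                for e in entries if needle in e.filename.lower()]

    def disconnect(self, peer_ip):
        # Connection termination: drop all entries for this peer.
        self._by_peer.pop(peer_ip, None)


index = CentralIndex()
index.connect("10.0.0.1", "cable", ["blue_train.mp3", "so_what.mp3"])
index.connect("10.0.0.2", "modem", ["so_what.mp3"])
hits = index.search("so_what")        # both peers are listed as sources
index.disconnect("10.0.0.1")
remaining = index.search("so_what")   # only the still-connected peer is left
```

The download itself never touches `CentralIndex`: the requesting peer would contact `peer_ip` directly and ask for `filename`.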

Network
The Napster network is a simple two-level hierarchical network with direct P2P communication during the actual download process. Napster uses a proprietary TCP-based communication protocol. Although the protocol has never been officially released, it has been reverse-engineered and compatible clients are now available from other sources. The reverse engineering has also led to OpenNap (Open Napster), a separate but compatible Napster service. OpenNap is an open source project that has also extended the original system with a backbone of distributed, coordinated OpenNap servers. Although the single central point of coordination is removed, the new distributed backbone is still somewhat static and still cannot be described as a "true P2P" system.

The most significant advantages and disadvantages of the Napster technology are just two sides of the same property. The centralized service makes Napster as scalable as any other traditional Internet service. Scalability is simply a matter of more bandwidth to a backbone system with enough CPU power. The downside is the vulnerability of a centralized system and the dependence on a closed technology under the control of a commercial actor. This is what allowed the RIAA to shut down Napster: once the main server was closed, Napster was virtually eliminated in a second. Several other file-sharing technologies are based on the same design as Napster. Some clones specialize in the exchange of other or more generic file types (Scour Exchange, AudioGalaxy, File Rouge), "improvement" of the basic functionality (eDonkey2000, iMesh, etc.) or in community/content.

From a technological point of view, all we can say about Napster is that it was an inspiration for other people to create better file-sharing architectures. Napster employed a semi-P2P infrastructure, simple dynamic addressing and a rather obvious peer discovery mechanism.



Gnutella

Gnutella is a fully decentralized P2P file sharing system. Another, more interesting way of describing Gnutella is as a network of simple stand-alone web servers, interconnected and searchable through the Gnutella network. Gnutella lived its first year in the shadow of its famous P2P cousin Napster. The source was never released and the original program binary was online for only a few hours before someone in the Nullsoft/AOL/Warner Corporation silently removed it. As another example of the viral characteristics of the Internet and of P2P technologies in general, information once released is almost impossible to retract, and the original Gnutella client spread widely, mostly thanks to the high impact of the Slashdot forum. Soon after the initial release, the Gnutella protocol was reverse-engineered, which resulted in alternative implementations and initiated further development of the technology.

Gnutella is under constant development to remove bottlenecks and to expand the user functionality. The following explanation is mostly based on its original design that supported searching and downloading of files in a fully decentralized network of small stand alone web servers. The term Gnutella is used to describe:

1. The original client implementation from Nullsoft (or alternative implementations).
2. The specification of the protocol that makes interoperation possible between different implementations.
3. The network of interconnected Gnutella clients often denoted as the GnutellaNet.

Gnutella introduced the term Servent for a peer participating in a P2P network. The word is a blend of the words "server" and "client" and indicates the dual nature of a peer in the network, where the client/server separation has disappeared and the peer operates both as a server and as a client.

The Gnutella Rave
This phrase has been used to describe the extremely dynamic and flexible nature of the Gnutella network. A Gnutella servent is commonly interconnected through direct connections to a few other servents, i.e. the direct neighbours of the servent. All communication, with the exception of direct file download, is done through these immediate neighbours. The whole network is just the collection of servents that are able to communicate indirectly through the global set of such interconnected neighbours. Gnutella is described as a mesh-type network, i.e. a graph with a very large number of cycles.

The term Rave also hints at the constant reformation of the network as servents enter and leave the mesh or as network links drop due to communication problems. Each servent tries to maintain a number of connections, and the "repair" process results in a dynamically changing "neighborhood". In practice, the design makes only a subset of the whole network reachable to a single servent. Since a servent also has its private view of the neighborhood, each servent has an individual view of what the network looks like. The reachable region for a peer is commonly denoted as its horizon. The horizon of a Gnutella servent makes all data beyond this region both invisible and inaccessible. At first sight it may seem quite a limitation for a participant to reach only a part of the available information. However, due to the super-distribution property of a P2P system, information will spread around and will in many cases be available within the horizon of a specific peer.

The dynamic nature of the Gnutella network, together with the network horizon of servents, makes it almost impossible to get an exact view of the whole network. It is not strictly correct to talk about "the GnutellaNet", because several separate networks exist with no interconnection to "the Gnutella network". Since the early days of Gnutella, implementations have supported the formation of separate communities by giving the user the ability to change the protocol name used in the connection request string. The Gnutella implementations simply describe this as the 8-character name of the network. Some client implementations support password-protected private networks, but access to a private network will typically only require knowledge of the clear-text network ID and the location of an active servent. Future Gnutella implementations can be expected to improve their support for private locked networks.

The rules of the game
To join the Gnutella network, a servent's first challenge is to locate other servents already connected to the network, i.e. finding and establishing connections to its immediate neighbours. This process is not a part of the Gnutella protocol specification and was initially done manually by picking from a static list of known and possibly connected peers. The bootstrap process was soon automated, and servent implementations are now configured to automatically connect to one of several peer brokers available on the Internet. These peer brokers, or "rendezvous peers" or "super peers", present themselves to a connecting peer as an ordinary servent, but will disconnect when the new servent is bound to the network.

A running servent maintains a local dynamic cache, or list, of known active servents. It picks candidates for its neighbours from this list. This cache is built with the help of the "Ping" and "Pong" protocol messages. Both during the bootstrap and throughout its lifetime, a servent generates "Ping" messages to inform others of its existence. This message is broadcast to the peers within its network horizon. The region is specified with a TTL (time-to-live) parameter telling how deep into the mesh the message should be distributed. TTL is just a counter that a servent decreases when it forwards a message. When the counter reaches zero, the message is no longer forwarded.
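The effect of the TTL counter can be illustrated with a small simulation. The sketch below is not taken from any Gnutella implementation: the function `flood`, the adjacency-list mesh and the servent names are hypothetical, and it assumes that a servent drops a message it has already seen.

```python
from collections import deque


def flood(graph, origin, ttl):
    """Return the set of servents that a broadcast from origin reaches."""
    seen = {origin}                      # servents that already handled the message
    queue = deque([(origin, ttl)])
    while queue:
        node, remaining = queue.popleft()
        if remaining == 0:
            continue                     # TTL exhausted: do not forward further
        for neighbour in graph[node]:
            if neighbour not in seen:    # duplicates are dropped (assumption)
                seen.add(neighbour)
                queue.append((neighbour, remaining - 1))
    return seen - {origin}


# Hypothetical mesh: a chain a-b-c-d-e with an extra link a-c.
mesh = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}

# Every servent reached by the "Ping" would answer with a "Pong",
# so this set is what ends up in the local cache of servent "a".
host_cache = flood(mesh, "a", ttl=2)
```

With `ttl=2` the servent `e` stays beyond the horizon of `a`; raising the TTL to 3 brings it into view.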

A servent receiving a "Ping" message acknowledges this by returning a "Pong" protocol message that is routed back through the same path as the "Ping" message. A servent can then do an active discovery using a "Ping" message, waiting to receive the acknowledgment "Pong" messages, or just stay passively listening for "Ping" messages (which it of course should acknowledge with a "Pong").

Not all messages in the Gnutella network are broadcast. Some messages ("Pong", "QueryHit" and "Push") are routed to a single destination, back along the path of the message they answer. Routing of these messages through the network is based on a mechanism where a servent temporarily stores the "semi-unique" message identifier ("Descriptor ID") together with the network link on which the message was received. When a reply with a corresponding ID is received, the servent looks in this cache to locate the outgoing link on which the message should be pushed.
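A minimal sketch of this reverse-path mechanism is shown below. The class `Servent` and its methods are hypothetical names chosen for the illustration, the recursive calls stand in for real network links, and every reached servent simply answers with its own name where a real servent would send a "Pong".

```python
class Servent:
    """Hypothetical servent that remembers where each descriptor ID came from."""

    def __init__(self, name):
        self.name = name
        self.links = {}     # neighbour name -> Servent
        self.routes = {}    # descriptor ID -> link the request arrived on
        self.replies = []   # replies delivered to this servent as originator

    def connect(self, other):
        self.links[other.name] = other
        other.links[self.name] = self

    def broadcast(self, descriptor_id, ttl, came_from=None):
        if descriptor_id in self.routes:
            return                                  # already seen: drop duplicate
        self.routes[descriptor_id] = came_from      # remember the incoming link
        if came_from is not None:
            self.reply(descriptor_id, self.name)    # answer the request
        if ttl == 0:
            return                                  # TTL exhausted: stop forwarding
        for name, peer in self.links.items():
            if name != came_from:
                peer.broadcast(descriptor_id, ttl - 1, came_from=self.name)

    def reply(self, descriptor_id, payload):
        if descriptor_id not in self.routes:
            return                                  # unknown ID: nothing to route on
        back = self.routes[descriptor_id]
        if back is None:
            self.replies.append(payload)            # we originated the request
        else:
            self.links[back].reply(descriptor_id, payload)


# Hypothetical chain a - b - c - d.
a, b, c, d = (Servent(n) for n in "abcd")
a.connect(b)
b.connect(c)
c.connect(d)
a.broadcast("id-1", ttl=2)
```

No servent needs to know the address of the originator: each one only knows the neighbouring link on which the request arrived.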

Searching
Searching in Gnutella is based on a distributed broadcast-and-forward technique. The search is entirely distributed: each servent processes a query individually and replies to the initiator if any matches are found locally. A query message is broadcast to the immediate neighbours of a servent, which forward it to all their neighbours, and so on. This process is repeated until the TTL counter reaches zero, i.e. the query is broadcast to the peers within the horizon of the servent. A successful match by a servent is reported using a QueryHit protocol message containing a set of responses to the corresponding Query.

As stated above, the QueryHit message is routed back to the originator of the search by matching its descriptor ID against the identical Query ID. Existing Gnutella implementations interpret the search criteria of a Query as a search string to match against the names of local files made available to the GnutellaNet. The matching algorithm varies amongst implementations, but the search string is commonly treated as a substring of a potential file name. The interpretation of a Gnutella query as a file search is also substantiated by the specification of the format of a result set in a QueryHit message. Although the format restricts the utilization of Gnutella to file searching purposes only, further references to a match are made using an ID that uniquely identifies the resource at a specific servent. It is worth noting that this opens the Gnutella technology to any type of distributed search. One example is a distributed patient information retrieval system using Gnutella as the glue to connect all the non-cooperating information systems typically found in e.g. larger hospitals.
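The local matching step can be sketched as follows. Since the matching algorithm varies between implementations, this is only one plausible reading: the function `match_query` is a hypothetical name, the case-insensitive comparison is an assumption, and the list position is used as the ID that identifies the resource at this servent.

```python
def match_query(search_string, shared_files):
    """Return (resource ID, file name) pairs whose name contains the search string."""
    needle = search_string.lower()
    return [(file_id, name)
            for file_id, name in enumerate(shared_files)
            if needle in name.lower()]


# Hypothetical list of files shared by one servent.
shared = ["Kind of Blue.mp3", "notes.txt", "Blue Train.mp3"]
result_set = match_query("blue", shared)
no_match = match_query("giant steps", shared)
```

A servent would send a QueryHit only when the result set is non-empty; an empty list means the query is forwarded (TTL permitting) but not answered.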

Downloading
The servents are simple web servers and the download is done P2P, using a standard HTTP GET request. An advantage of selecting an existing open standard for data transmission is related to the use of firewalls and P2P. In some cases, firewalls do not filter HTTP traffic and it is possible for outside peers to initiate downloads from an inside servent. If this is not the case, the Gnutella protocol supports remotely initiated transfers, i.e. the servent inside the firewall that serves the file initiates the connection establishment. The outside peer starts this process by generating a "Push" message that is routed back to the serving servent.
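The choice between a direct download and a "Push" can be written as a small decision function. This is a sketch under assumptions: the function name `plan_transfer` and its two boolean inputs are invented for the illustration, and the third outcome (both sides unable to accept incoming connections) is an inference from the description above rather than something the text states.

```python
def plan_transfer(server_accepts_incoming, requester_accepts_incoming):
    """Decide how a download between two servents could be set up."""
    if server_accepts_incoming:
        # The requester connects to the serving servent and sends an HTTP GET.
        return "direct GET"
    if requester_accepts_incoming:
        # The requester sends a "Push"; the serving servent connects outwards.
        return "push"
    # Neither side can accept a connection (inference, not from the text).
    return "not possible"


behind_firewall = plan_transfer(server_accepts_incoming=False,
                                requester_accepts_incoming=True)
```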

Anonymity
Describing the anonymity property of Gnutella in regard to the roles of publisher, searcher and reader, Gnutella can be summarized as non-anonymous publishing, anonymous search and non-anonymous reading.

Publishing: Publishing is a "passive action" in Gnutella. A file is "published to the Gnutella network" by making it available to the local servent. The presence of the document is not reported anywhere and other servents must actively detect it by issuing a query. This means that file browsing, as in the "Direct Connect" client, is not supported. Publishing is a non-anonymous process since any searcher can discover the location of the document.

Searching: The originator of a query is hidden from a servent receiving the message, and searching the Gnutella network is consequently anonymous. However, immediate neighbours know the source, since this is indicated by a "Hops" field of 0 (the number of times the message has been forwarded) and they know the link address of the initiator. As long as servents deeper into the mesh do not cooperate to reveal the message distribution chain, searching should be classified as anonymous. Since the searches are visible to each servent that forwards them, a servent may implement filtering rules deciding which searches should be broadcast further. Although a central censoring mechanism is impossible to implement, Gnutella is a rare example of a self-governed community where the individual members actually have direct influence on how others use the system.

Downloading: Downloading is not anonymous; the requestor is visible to the serving peer. This was exploited by an initiative to stop the distribution of illegal material on the Gnutella network: the "Gnutella Wall of Shame" revealed the identity of requestors who tried to download harmless material that had been published as clearly illegal material. Guerrilla Network Trading is another decentralized file sharing technology; it uses a modification of the Gnutella design that supports anonymous downloads. The transfers are encrypted and routed through the network. Downloading through the network increases the load, but this would probably not be a problem in practice, given the kind of information expected to flow through a network designed to "make a system for spreading political propaganda in countries without the freedom of speech".

Scalability
The story goes that an employee at Nullsoft said in an IRC chat that GnutellaNet probably would not scale to more than 250 or so clients. Gene Kan, a high-profile spokesman in the Gnutella community, states that the technology was initially designed to support file sharing in a small network between friends. Mr. Kan further says that with improvements the technology may be able to support thousands of users, but never millions like Napster. Gnutella has indisputable limitations in its design that make it non-scalable; the problems are related to the broadcast-and-forward mechanism of messages.
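A back-of-the-envelope calculation shows why broadcast-and-forward is costly. The sketch below gives an upper bound on how many servents a single broadcast can reach, under two simplifying assumptions that are not from the text: every servent keeps the same number of connections, and the mesh contains no cycles (real meshes do, so the true reach is smaller). The function name and the sample values are illustrative only.

```python
def horizon_upper_bound(connections, ttl):
    """Upper bound on the servents reached by one broadcast (cycle-free mesh)."""
    reached = 0
    frontier = connections            # servents reached at the first hop
    for _ in range(ttl):
        reached += frontier
        frontier *= connections - 1   # each forwards on every link but the incoming one
    return reached


# Illustrative values: 4 connections per servent, TTL of 7.
bound = horizon_upper_bound(connections=4, ttl=7)
```

Each reached servent receives at least one copy of the message, so the same number is also a rough lower bound on the traffic generated by one query that reaches this many peers. The growth is geometric in the TTL, which is why a slow link in the middle of the mesh is quickly overwhelmed.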

1st Generation - From Nullsoft to the August 2000 breakdown:
Gnutella in the original Nullsoft design treated every peer as equal and the GnutellaNet was one unified network. A peer connected by a low bandwidth connection could be as important in the message passing process as a node with a high bandwidth connection.

In the period from March to August 2000, a Gnutella servent typically reported between 1000 and 4000 available servents, and periodically up to 8000. Then suddenly, users started to experience abnormalities. The number of visible peers dropped dramatically, although the peer brokers did not report any reduction in the number of servents connecting to the network. This incident is known as the August breakdown of Gnutella. After the breakdown the network lived in a semi-collapsed state, fragmented into numerous disconnected segments. It did not collapse entirely but existed in an intermediate position between scaling and collapse. Gnutella collapsed due to its popularity. Analyses of the August incident showed that the initial Gnutella had reached its maximum and that further expansion was impossible, since peers connected through low bandwidth connections could not keep up with the traffic load. Gnutella had reached its "modem bandwidth barrier".

2nd Generation - Connection logic:
After the breakdown, servent implementations were extended with logic to reduce the role of bandwidth-limited peers in the network. The strategy was to drop connections to peers that could not keep up with their traffic load. Such servents would then be pushed away from the central parts of the network. Based on the user specification of the network connection type, servents also implemented simple logic to automatically configure the number of simultaneous incoming and outgoing connections that the bandwidth could typically support. By specifying a maximum of 1 outgoing and disallowing incoming connections, a peer could be placed on the far edge of the network. After the introduction of the new connection management logic, the Gnutella network stabilized through segmentation, with partitions disjoining and rejoining periodically.

3rd Generation - Dynamic or static hierarchies:
As stated above, the disadvantage of Gnutella's decentralized, distributed, flat architecture is its lack of scalability. If you want to reach a larger number of peers, i.e. put the horizon further away, you need to start building hierarchies with a well-connected backbone. This is happening just now in Gnutella with the introduction of both statically managed and dynamic, spontaneous hierarchies. One example is the "reflector", a network access point for servents with low bandwidth connections. This is a type of server, or "super-peer", which processes the network load on behalf of the clients it serves and shields them from the huge load of network messages. The reflector is also an index server for the files of its clients and serves incoming queries on their behalf. Although it currently has some compatibility problems in its role as an indexing server, it is a transparent proxy for both clients and remote servents. Another interesting feature of the reflector is its ability to work as a hub in a small private Gnutella network segment.

Reflectors require installation and configuration of specialized software. Their availability must also be announced by out-of-band means so that servents requiring this functionality can be manually configured to use them as GnutellaNet access points. The reflector is a method to manually build a static 2-level network hierarchy in the Gnutella network. What GnutellaNet really needs is a fully automatic and transparent solution. By this we mean that Gnutella must become "self-aware": some mechanism must elect servents to become reflectors automatically, by some form of sensitivity. It is also very likely that caching will boost performance, because traces have shown that the searches actually performed cover only a small portion of the available selection. We believe multi-level reflectors (adding more than two levels should be a trivial generalization) and smart caching are the key to Gnutella's growth in the future. New, more advanced P2P networks like FastTrack have those features and present strong competition to the various Gnutella clients.

