Added JC's changes to the paper.

author Cameron Dale <camrdale@gmail.com>

Tue, 26 Aug 2008 16:39:37 +0000 (09:39 -0700)

committer Cameron Dale <camrdale@gmail.com>

Tue, 26 Aug 2008 16:39:37 +0000 (09:39 -0700)
author Cameron Dale <camrdale@gmail.com>
Tue, 26 Aug 2008 16:39:37 +0000 (09:39 -0700)
committer Cameron Dale <camrdale@gmail.com>
Tue, 26 Aug 2008 16:39:37 +0000 (09:39 -0700)
diff --git a/docs/abstract/apt-p2p-abstract.kilepr b/docs/abstract/apt-p2p-abstract.kilepr

index 80fc672bbdef5265a7773fc81df701e266440402..45f925b0d8027415b561cbb538aef6ad8b1638bd 100644 (file)
--- a/docs/abstract/apt-p2p-abstract.kilepr
+++ b/docs/abstract/apt-p2p-abstract.kilepr
@@ -21,7 +21,7 @@ column=42
  encoding=UTF-8
  highlight=LaTeX
  line=39
  encoding=UTF-8
  highlight=LaTeX
  line=39
-open=true
+open=false
  
  [item:all.bib]
  archive=true
  
  [item:all.bib]
  archive=true
diff --git a/docs/paper/apt-p2p-paper.kilepr b/docs/paper/apt-p2p-paper.kilepr

index d0db04bebd2ee994493f2f8d627df19a78a456c9..9065c3929081665f6543f19e2df9f3fa714d9f6f 100644 (file)
--- a/docs/paper/apt-p2p-paper.kilepr
+++ b/docs/paper/apt-p2p-paper.kilepr
@@ -21,7 +21,7 @@ column=0
  encoding=UTF-8
  highlight=BibTeX
  line=234
  encoding=UTF-8
  highlight=BibTeX
  line=234
-open=true
+open=false
  
  [item:apt-p2p-paper.kilepr]
  archive=true
  
  [item:apt-p2p-paper.kilepr]
  archive=true
@@ -33,8 +33,8 @@ open=false
  
  [item:paper.tex]
  archive=true
  
  [item:paper.tex]
  archive=true
-column=0
+column=16
  encoding=UTF-8
  highlight=LaTeX
  encoding=UTF-8
  highlight=LaTeX
-line=1
+line=35
  open=true
  open=true
diff --git a/docs/paper/paper.tex b/docs/paper/paper.tex

index 33bcb182e001ece5105908f2b2abb737d4d4b69e..8423d82a69a173791bb0b0f6ad79acaa746fd30c 100644 (file)
--- a/docs/paper/paper.tex
+++ b/docs/paper/paper.tex
@@ -1,961 +1,901 @@
-\documentclass[conference]{IEEEtran}
-
-%% INFOCOM addition:
-\makeatletter
-\def\ps@headings{%
-\def\@oddhead{\mbox{}\scriptsize\rightmark \hfil \thepage}%
-\def\@evenhead{\scriptsize\thepage \hfil\leftmark\mbox{}}%
-\def\@oddfoot{}%
-\def\@evenfoot{}}
-\makeatother
-\pagestyle{headings}
-
-\usepackage[noadjust]{cite}
-
-\usepackage[dvips]{graphicx}
-
-\usepackage{url}
-
-\begin{document}
-
-\title{Peer-to-Peer Distribution of Free Software Packages}
-\author{\IEEEauthorblockN{Cameron Dale}
-\IEEEauthorblockA{School of Computing Science\\
-Simon Fraser University\\
-Burnaby, British Columbia, Canada\\
-Email: camerond@cs.sfu.ca}
-\and
-\IEEEauthorblockN{Jiangchuan Liu}
-\IEEEauthorblockA{School of Computing Science\\
-Simon Fraser University\\
-Burnaby, British Columbia, Canada\\
-Email: jcliu@cs.sfu.ca}}
-
-\maketitle
-
-\begin{abstract}
-A large amount of free software packages are available over the
-Internet from many different distributors. Most of these
-distributors use the traditional client-server model to handle
-requests from users. However, there is an excellent opportunity to
-use peer-to-peer techniques to reduce the cost of much of this
-distribution, especially due to the altruistic nature of many of
-these users. There are no existing solution suitable for this
-situation, so we present a new technique for satisfying the needs of
-this P2P distribution, which is generally applicable to many of
-these distributors' systems. Our method makes use of a DHT for
-storing the location of peers, using the cryptographic hash of the
-package as a key. To show the simplicity and functionality, we
-implement a solution for the distribution of Debian software
-packages, including the many DHT customizations needed. Finally, we
-analyze our system to determine how it is performing and what effect
-it is having.
-\end{abstract}
-
-%%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%
-
-\section{Introduction}
-\label{intro}
-
-There are a large number of free software package distributors using
-package distribution systems over the Internet to distribute
-software to their users. These distributors have developed many
-different methods for this distribution, but they almost exclusively
-use a client-server model to satisfy user requests. The popularity
-and number of users results in a large number of requests, which
-usually requires a network of mirrors to handle. Due to the free
-nature of this software, many users are willing and able to
-contribute upload bandwidth to this distribution, but have no
-current way to do this.
-
-We present a new peer-to-peer distribution model to meet these
-demands. It is based on many previous implementations of successful
-peer-to-peer protocols, especially distributed hash tables (DHT) and
-BitTorrent. The model relies on the pre-existence of cryptographic
-hashes of the packages, which should uniquely identify it for a
-request from other peers. If the peer-to-peer download fails, then
-the original request to the server is used as a fallback to prevent
-any dissatisfaction from users. The peer can then share this new
-package with others through the P2P system.
-
-First, we examine the opportunity that is available for many of
-these free software package distributors. We present an overview of
-a system that would efficiently satisfy the demands of a large
-number of users, and should significantly reduce the currently
-substantial bandwidth requirements of hosting this software. We then
-present an example implementation based on the Debian package
-distribution system. This implementation will be used by a large
-number of users, and serves as an example for other free software
-distributors of the opportunity that can be met with such a system.
-
-This paper is organized as follows. The current situation and its
-problems are presented in section \ref{situation}. We analyze some
-related solutions in section \ref{related}, and then present our
-solution in section \ref{opportunity}. We then detail our sample
-implementation for Debian-based distributions in section
-\ref{implementation}, including an in-depth look at our DHT
-customizations in section \ref{custom_dht}. We analyze our deployed
-sample implementation in section \ref{analysis}, suggest some future
-enhancements in section \ref{future}, and conclude the paper in
-section \ref{conclusions}.
-
-%%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%
-
-\section{Problem/Situation}
-\label{situation}
-
-There are a large number of groups using the Internet to distribute
-their free software packages. These packages are usually available
-from a free download site, which usually requires a network of
-mirrors to support the large number of requests. These distribution
-systems almost always support some kind of file verification,
-usually cryptographic hashes, to verify the completed downloads
-accuracy or authenticity.
-
-\subsection{Examples}
-\label{examples}
-
-Most Linux distributions use a software package management system
-that fetches packages to be installed from an archive of packages
-hosted on a network of mirrors. The Debian project, and other
-Debian-based distributions such as Ubuntu and Knoppix, use the
-\texttt{apt} (Advanced Package Tool) program, which downloads Debian
-packages in the \texttt{.deb} format from one of many HTTP mirrors.
-The program will first download index files that contain a listing
-of which packages are available, as well as important information
-such as their size, location, and a hash of their content. The user
-can then select which packages to install or upgrade, and
-\texttt{apt} will download and verify them before installing them.
-
-There are also several similar frontends for the RPM-based
-distributions. Red Hat's Fedora project uses the \texttt{yum}
-program, SUSE uses \texttt{YAST}, while Mandriva has
-\texttt{Rpmdrake}, all of which are used to obtain RPMs from
-mirrors. Other distributions use tarballs (\texttt{.tar.gz} or
-\texttt{.tar.bz2}) to contain their packages. Gentoo's package
-manager is called \texttt{portage}, SlackWare Linux uses
-\texttt{pkgtools}, and FreeBSD has a suite of command-line tools,
-all of which download these tarballs from web servers.
-
-Other software distributors also use a similar system. CPAN
-distributes software packages for the PERL
-programming language, using SOAP RPC requests to find and download
-files. Cygwin provides many of the
-standard Unix/Linux tools in a Windows environment, using a
-package management tool that requests packages from websites. There
-are two software distribution systems for Mac OSX, fink and
-MacPorts, that also retrieve packages in this way.
-
-Also, some systems use direct web downloads, but with a hash
-verification file also available for download next to the desired
-file. These hash files usually have the same file name, but with an
-added extension identifying the hash used (e.g. \texttt{.md5} for
-the MD5 hash). This type of file downloading and verification is
-typical of free software hosting facilities that are open to anyone
-to use, such as SourceForge.
-
-\subsection{Similarities}
-\label{similarities}
-
-The important things to note for each of the systems mentioned in
-section \ref{examples}, is that they all have the following in
-common:
-\begin{itemize}
- \item The content is divided into distinct units (packages), each
-       of which contains a small independent part of all the
-       content.
- \item The packages are avaliable for anyone to freely download.
- \item Users are typically not interested in downloading all of the
-       packages available.
- \item Hashes of the packages are available before the download is
-       attempted.
- \item Requests to download the packages are sent by a tool, not
-       directly by the user (though the tool is responding to
-       requests from the user).
-\end{itemize}
-
-We also expect that there are a number of users of these systems
-that are motivated by altruism to want to help out with this
-distribution. This is common in these systems due to the free nature
-of the software being delivered, which encourages some to want to
-help out in some way. A number of the systems are also used by
-groups that are staffed mostly, or sometimes completely, by
-volunteers. This also encourages users to want to give back
-to the volunteer community that has created the software they are
-using.
-
-\subsection{Differences}
-\label{differences}
-
-Although at first it might seem that a single all-reaching solution
-is possible for this situation, there are some differences in each
-system that require independent solutions. The systems all use
-different tools for their distribution, so any implementation would
-have to be specific to the tool it is to integrate with. In
-particular, how to obtain requests from the tool or user, and how to
-find a hash that corresponds to the file being requested, is very
-specific to each system.
-
-Also, there may be no benefit in creating a single large solution to
-integrate all these problems. Although they sometimes distribute
-nearly identical packages (e.g. the same software available in
-multiple Linux distributions), it is not exactly identical and has
-usually been tailored to the system by the distributor. The small
-differences will change the hash of the files, and so will make it
-impossible to distribute similar packages across systems. And,
-although peer-to-peer systems can scale very well with the number of
-peers in the system, there is some overhead involved, so having a
-much larger system of peers would mean that requests could take
-longer to complete.
-
-\subsection{Problems}
-\label{problems}
-
-There are some attributes of software package distribution that make
-it difficult to use an existing peer-to-peer solution with, or to
-design a new solution for this need.
-
-\subsubsection{Archive Dimensions}
-
-The dimensions of the software archive makes it difficult to
-distribute efficiently. While most of the packages are very small in
-size, there are some that are quite large. There are too many
-packages to distribute each individually, but the archive is also
-too large to distribute in its entirety. In some archives there are
-also divisions of the archive into sections, e.g. by the computer
-architecture that the package is intended for. 
-
-\begin{figure}
-\centering
-\includegraphics[width=\columnwidth]{apt_p2p_simulation-size_CDF.eps}
-\caption{The CDF of the size of packages in a Debian system, both
-for the actual size and adjusted size based on the popularity of
-the package.}
-\label{size_CDF}
-\end{figure}
-
-For example, Figure~\ref{size_CDF} shows the size of packages in the
-Debian distribution. While 80\% of the packages are less than
-512~KB, some of the packages are hundreds of megabytes. The entire
-archive consists of 22,298 packages and is approximately 119,000 MB
-in size. Some packages are divided by architecture, but there are a
-large number of packages that are not architecture specific and so
-can be installed on any computer architecture.
-
-\subsubsection{Package Updates}
-
-The software packages being distributed are being constantly
-updated. These updates could be the result of the software creators
-releasing a new version of the software with improved functionality,
-or the result of the distributor updating their packaging of the
-software to meet new requirements. Even if the distributor
-periodically makes \emph{stable} releases, which are snapshots of
-all the packages in the archive at a certain time, updates are still
-released for security issues or serious bugs.
-
-\begin{figure}
-\centering
-\includegraphics[width=\columnwidth]{size-quarter.eps}
-\caption{The amount of data in the Debian archive that is updated
-each day, broken down by architecture.}
-\label{update_size}
-\end{figure}
-
-For example, Figure~\ref{update_size} shows the amount of data in
-the Debian archive that was updated each day over a period of 3
-months. Each day approximately 1.5\% of the 119,000 MB archive is
-updated with new versions of packages.
-
-\subsubsection{Limited Interest}
-
-Though there are a large number of packages, and a large number of
-users, interest in a given version of a package could be very
-limited. If the package is necessary for the system (e.g. the
-package distribution software), then it is likely that everyone will
-have it installed, but there are very few of these packages. Most
-packages fall in the category of optional or extra, and so are
-interesting to only a limited number of people.
-
-\begin{figure}
-\centering
-\includegraphics[width=\columnwidth]{apt_p2p_popularity-cdf.eps}
-\caption{The CDF of the popularity of packages in a Debian system.}
-\label{popularity_CDF}
-\end{figure}
-
-For example, the Debian distribution tracks the popularity of its
-packages using popcon \cite{popcon}. Figure~\ref{popularity_CDF}
-shows the cumulative distribution function of the percentage of all
-users who install each package. Though some packages are installed
-by everyone, 80\% of the packages are installed by less than 1\% of
-users.
-
-\subsubsection{Interactive Users}
-
-The package management software that downloads packages displays
-some kind of indication of speed and completeness for users to
-watch. Since previous client-server downloads occur in a sequential
-fashion, the package management software also measures the speed
-based on sequential downloading. This requires the peer-to-peer
-solution to be reasonably responsive at retrieving packages
-sequentially.
-
-%%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%
-
-\section{Related Work/Existing Solutions}
-\label{related}
-
-There are many peer-to-peer implementations available today, but
-none is very well suited to this specific problem.
-
-\subsection{BitTorrent}
-\label{bittorrent}
-
-Many distributors make their software available in some form using
-BitTorrent \cite{COHEN03}, e.g. for the distribution of CD
-images. This is not an ideal situation though, as it requires the
-peers to download large numbers of packages that they are not
-interested in, and prevents them from updating to new packages
-without downloading another image containing a lot of the same
-packages they already have. Unfortunately, BitTorrent is not ideally
-suited to an application such as this one.
-
-First, there is no obvious way to divide the packages into torrents.
-Most of the packages are too small, and there are too many packages
-in the entire archive, to create individual torrents for each one.
-However, all the packages together are too large to track
-efficiently as a single torrent. Some division of the archive's
-packages into torrents is obviously necessary, but wherever that
-split occurs it will cause either some duplication of connections,
-or prevent some peers from connecting to others who do have the same
-content. Also, a small number of the packages can be updated every
-day which would add new files to the torrent, thereby changing its
-\emph{infohash} identifier and making it a new torrent. This will
-severely fracture the download population, even though peers in the
-new torrent share 99\% of the packages in common with peers in the
-old torrent.
-
-Other issues also prevent BitTorrent from being a good solution to
-this problem. BitTorrent's fixed piece sizes that disregard file
-boundaries are bigger than many of the packages in the archive. This
-will waste peers' downloading bandwidth as they will end up
-downloading parts of other packages just to get the piece that
-contains the package they do want. Incentives to share (upload) are
-no longer needed, as the software is freely available for anyone to
-download without sharing. Seeds are also not needed as the mirrors
-can serve in that capacity. Finally, BitTorrent downloads files
-randomly, which does not work well with the interactive package
-management tools expectation of sequential downloads.
-
-\subsection{Other Solutions}
-\label{others}
-
-There have also been other attempted implementations, usually based
-on some form of BitTorrent, to address this problem. Some other
-implementations have used BitTorrent almost unmodified, while others
-have only looked at using DHTs to replace the tracker in a
-BitTorrent system. apt-torrent \cite{apttorrent} creates torrents
-for some of the larger packages available, but it ignores the
-smaller packages, which are often the most popular. DebTorrent
-\cite{debtorrent} makes widespread modifications to a traditional
-BitTorrent client, to try and fix the drawbacks mentioned in section
-\ref{bittorrent}. However, these changes also require some
-modifications to the distribution system to support it.
-
-Others have also used DHTs to support this type of functionality.
-Kenosis \cite{kenosis} is a P2P Remote Procedure Call
-client also based on the Kademlia DHT, but it is meant as a P2P
-primitive system on which other tools can be built, and so it has no
-file sharing functionality at all. Many have used a DHT as a drop-in
-replacement for the tracker in a BitTorrent system
-\cite{bittorrent-dht, azureus-dht}, but such systems only use the
-DHT to find peers for the same torrent, so the file sharing uses
-traditional BitTorrent and so is not ideal for the reasons listed
-above.
-
-%%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%
-
-\section{Opportunity/Our Proposal}
-\label{opportunity}
-
-The situation described in section \ref{situation} presents a clear
-opportunity to use some form of peer-to-peer file-sharing protocol
-to allow willing users to contribute upload bandwidth. This sparse
-interest in a large number of packages undergoing constant updating
-is well suited to the functionality provided by a Distributed Hash
-Table (DHT). DHTs require unique keys to store and retrieve strings
-of data, for which the cryptographic hashes used by these package
-management systems are perfect for. The stored and retrieved strings
-can then be pointers to the peers that have the package that hashes
-to that key. A downloading peer can lookup the package hash in the
-DHT and, if it is found, download the file from those peers and
-verify the package with the hash. Once the download is complete, the
-peer will add its entry to the DHT indicating that it now has the
-package.
-
-The fact that this package is also available to download for free
-from a server is very important to our proposal. If the package hash
-can not be found in the DHT, the peer can then fallback to
-downloading from the original location (i.e. the network of
-mirrors). The mirrors thus, with no modification to their
-functionality, serve as seeds for the packages in the peer-to-peer
-system. Any packages that have just been updated, or that are very
-rare, and so don't have any peers available can always be found on
-the mirror. Once the peer has completed the download from the mirror
-and verified the package, it can then add itself to the DHT as the
-first peer for the new package, so that future requests from peers
-will not need the mirror.
-
-The trust of the package is also always guaranteed through the use
-of the cryptographic hashes. Nothing can be downloaded from a peer
-until the hash is looked up in the DHT, so a hash must first come
-from a trusted source (i.e. a mirror). Most distributors use index
-files that contain hashes for a large number of the packages in
-their archive, and which are also hashed. After retrieving the
-index's hash from the mirror, the index file can be downloaded from
-peers and verified. Then the program has access to all the hashes of
-the packages it will be downloading, all of which can be verified
-with a \emph{chain of trust} that stretches back to the original
-distributor's server.
-
-\subsection{Implementation Options}
-\label{imp_options}
-
-There are several ways to implement the desired P2P functionality
-into the existing package management software. The functionality can
-be directly integrated through modifications to the software, though
-this could be difficult as the P2P functionality should be running
-at all times. This is needed both for efficient lookups and to
-support uploading of already downloaded packages. Unfortunately, the
-package management tools typically only run until the download and
-install request is complete.
-
-Many of the package management software implementations use HTTP
-requests from web servers to download the packages, which makes it
-possible to implement the P2P aspect as an almost standard HTTP
-caching proxy. This proxy will run as a daemon in the background,
-listening for requests from the package management tool for package
-files. It will get uncached requests first from the P2P system, or
-falling back to the normal HTTP request from a server should it not
-be found. For methods that don't use HTTP requests, other types of
-proxies may also be possible.
-
-\subsection{Downloading From Peers}
-\label{downloading}
-
-Although not necessary, we recommend implementing a download
-protocol that is similar to the protocol used to fetch packages from
-the distributor's servers. This simplifies the P2P program, as it
-can then treat peers and mirrors almost identically when requesting
-packages. In fact, the mirrors can be used when there are only a few
-slow peers available for a file to help speed up the download
-process.
-
-Downloading a file efficiently from a number of peers is where
-BitTorrent shines as a peer-to-peer application. Its method of
-breaking up larger files into pieces, each with its own hash,
-makes it very easy to parallelize the downloading process and
-maximize the download speed. For very small packages (i.e. less than
-the piece size), this parallel downloading is not necessary, or
-even desirable. However, this method should still be used, in
-conjunction with the DHT, for the larger packages that are
-available.
-
-Since the package management system only stores a hash of the entire
-package, and not of pieces of that package, we will need to be able
-to store and retrieve these piece hashes using the P2P protocol. In
-addition to storing the file download location in the DHT (which
-would still be used for small files), a peer will store a
-\emph{torrent string} containing the peer's hashes of the pieces of
-the larger files. These piece hashes could be compared ahead of time
-to determine which peers have the same piece hashes (they all
-should), and then used during the download to verify the pieces of
-the downloaded package.
-
-%%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%
-
-\section{Sample Implementation/Implementation}
-\label{implementation}
-
-We have created a sample implementation that functions as described
-in section \ref{opportunity}, and is freely available for other
-distributors to download and modify \cite{apt-p2p}. This software,
-called \texttt{apt-p2p}, interacts with the \texttt{apt} tool, which
-is found in most Debian-based Linux distributions. \texttt{apt} uses
-SHA1 hashes to verify most downloaded files, including the large
-index files that contain the hashes of the individual packages. We
-chose this distribution system as it is familiar to us, it allows
-software contributions, and there are interesting statistics
-available for analyzing the popularity of the software packages
-\cite{popcon}.
-
-Since all requests from apt are in the form of HTTP downloads from a
-server, the implementation takes the form of a caching HTTP proxy.
-Making a standard \texttt{apt} implementation use the proxy is then
-as simple as prepending the proxy location and port to the front of
-the mirror name in \texttt{apt}'s configuration file (i.e.
-``localhost:9977/mirrorname.debian.org/\ldots'').
-
-We created a customized DHT based on Khashmir \cite{khashmir}, which
-is an implementation of Kademlia \cite{kademlia} using methods
-familiar to BitTorrent developers. Khashmir is also the same DHT
-implementation used by most of the existing BitTorrent clients to
-implement trackerless operation. The communication is all handled by
-UDP messages, and RPC (remote procedure call) requests and responses
-are all \emph{bencoded} in the same way as BitTorrent's
-\texttt{.torrent} files. Khashmir uses the high-level Twisted
-event-driven networking engine \cite{twisted}, so we also use
-Twisted in our sample implementation for all other networking needs.
-More details of this customized DHT can be found below in section
-\ref{custom_dht}.
-
-Downloading is accomplished by sending simple HTTP requests to the
-peers identified by lookups in the DHT to have the desired file.
-Requests for a package are made using the package's hash, properly
-encoded, as the URL to request from the peer. The HTTP server used
-for the proxy also doubles as the server listening for requests for
-downloads from other peers. All peers support HTTP/1.1, both in the
-server and the client, which allows for pipelining of multiple
-requests to a peer, and the requesting of smaller pieces of a large
-file using the Range request header.
-
-%%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%
-
-\section{Customized DHT}
-\label{custom_dht}
-
-A large contribution of our work is in the customization and use of
-a Distributed Hash Table (DHT). Although our DHT is based on
-Kademlia, we have made many improvements to it to make it suitable
-for this application. In addition to a novel storage technique to
-support piece hashes, we have improved the response time of looking
-up queries, allowed the storage of multiple values for each key, and
-incorporated some improvements from BitTorrent's tracker-less DHT
-implementation.
-
-\subsection{Kademlia Background}
-\label{kademlia}
-
-The Kademlia DHT, like most other DHTs, assigns IDs to peers from
-the same space that is used for keys. The peers with IDs closest to
-the desired key will then store the values for that key. Lookups are
-recursive, as nodes are queried in each step that are exponentially
-closer to the key than in the previous step.
-
-Nodes in a Kademlia system support four primitive requests.
-\texttt{ping} will cause a peer to return nothing, and is only used
-to determine if a node is still alive, while \texttt{store} tells a
-node to store a value associated with a given key. The most
-important primitives are \texttt{find\_node} and
-\texttt{find\_value}, which both function recursively to find nodes
-close to a key. The queried nodes will return a list of the nodes
-they know about that are closest to the key, allowing the querying
-node to quickly traverse the DHT to find the nodes close to the
-desired key. The only difference between them is that the
-\texttt{find\_value} query will cause a node to return a value, if
-it has one for that key, instead of a list of nodes.
-
-\subsection{Piece Hash Storage}
-\label{pieces}
-
-Hashes of pieces of the larger package files are needed to support
-their efficient downloading from multiple peers.
-For large files (5 or more pieces), the torrent strings described in
-section \ref{downloading}
-are too long to store with the peer's download info in the DHT. This
-is due to the limitation that a single UDP packet should be less
-than 1472 bytes to avoid fragmentation.
-% For example, for a file of 4 pieces, the peer would need to store a
-% value 120 bytes in size (IP address, port, four 20-byte pieces and
-% some overhead), which would limit a return value from including more
-% than 10 values to a request.
-
-Instead, the peers will store the torrent string for large files
-separately in the DHT, and only contain a reference to it in their
-stored value for the hash of the file. The reference is an SHA1 hash
-of the entire concatenated length of the torrent string. If the
-torrent string is short enough to store in the DHT (i.e. less than
-1472 bytes, or about 70 pieces for the SHA1 hash), then a lookup of
-that hash in the DHT will return the torrent string. Otherwise, a
-request to the peer for the hash (using the same method as file
-downloads, i.e. HTTP), will cause the peer to return the torrent
-string.
-
-Figure \ref{size_CDF} shows the package size of the 22,298 packages
-available in Debian in January 2008. We can see that most of the
-packages are quite small, and so most will therefore not require
-piece hash information to download. We have chosen a piece
-size of 512 kB, which means that 17,515 (78\%) of the packages will
-not require this information. There are 3054 packages that will
-require 2 to 4 pieces, for which the torrent string can be stored
-directly with the package hash in the DHT. There are 1667 packages
-that will require a separate lookup in the DHT for the longer
-torrent string as they require 5 to 70 pieces. Finally, there are
-only 62 packages that require more than 70 pieces, and so will
-require a separate request to a peer for the torrent string.
-
-\subsection{Response Time}
-\label{response_time}
-
-Most of our customizations to the DHT have been to try and improve
-the time of the recursive \texttt{find\_value} requests, as this can
-cause long delays for the user waiting for a package download to
-begin. The one problem that slows down such requests is waiting for
-timeouts to occur before marking the node as failed and moving on.
-
-Our first improvement is to retransmit a request multiple times
-before a timeout occurs, in case the original request or its
-response was lost by the unreliable UDP protocol. If it does not
-receive a response, the requesting node will retransmit the request
-after a short delay. This delay will increase exponentially for
-later retransmissions, should the request again fail. Our current
-implementation will retransmit the request after 2 seconds and 6
-seconds (4 seconds after the first retransmission), and then timeout
-after 9 seconds.
-
-We have also added some improvements to the recursive
-\texttt{find\_node} and \texttt{find\_value} queries to speed up the
-process when nodes fail. If enough nodes have responded to the
-current query such that there are many new nodes to query that are
-closer to the desired key, then a stalled request to a node further
-away will be dropped in favor of a new request to a closer node.
-This has the effect of leap-frogging unresponsive nodes and
-focussing attention on the nodes that do respond. We will also
-prematurely abort a query while there are still oustanding requests,
-if enough of the closest nodes have responded and there are no
-closer nodes found. This prevents a far away unresponsive node from
-making the query's completion wait for it to timeout.
-
-Finally, we made all attempts possible to prevent firewalled and
-NATted nodes from being added to the routing table for future
-requests. Only a node that has responded to a request from us will
-be added to the table. If a node has only sent us a request, we
-attempt to send a \texttt{ping} to the node to determine if it is
-NATted or not. Unfortunately, due to the delays used by NATs in
-allowing UDP packets for a short time if one was recently sent by
-the NATted host, the ping is likely to succeed even if the node is
-NATted. We therefore also schedule a future ping to the node to make
-sure it is still reachable after the NATs delay has hopefully
-elapsed. We also schedule future pings of nodes that fail once to
-respond to a request, as it takes multiple failures (currently 3)
-before a node is removed from the routing table.
-
-\subsection{Multiple Values}
-\label{multiple_values}
-
-The original design of Kademlia specified that each keywould have
-only a single value associated with it. The RPC to find this value
-was called \texttt{find\_value} and worked similarly to
-\texttt{find\_node}, iteratively finding nodes with ID's closer to
-the desired key. However, if a node had a value stored associated
-with the searched for key, it would respond to the request with that
-value instead of the list of nodes it knows about that are closer.
-
-While this works well for single values, it can cause a problem when
-there are multiple values. If the responding node is no longer one
-of the closest to the key being searched for, then the values it is
-returning will probably be the staler ones in the system, and it
-will not have the latest stored values. However, the search for
-closer nodes will stop here, as the queried node only returned the
-values and not a list of nodes to recursively query. We could have
-the request return both the values and the list of nodes, but that
-would severely limit the size and number of the values that could be
-returned.
-
-Instead, we have broken up the original \texttt{find\_value}
-operation into two parts. The new \texttt{find\_value} request
-always returns a list of nodes that the node believes are closest to
-the key, as well as a number indicating the number of values that
-this node has for the key. Once a querying node has finished its
-search for nodes and found the closest ones to the key, it can issue
-\texttt{get\_value} requests to some nodes to actually retrieve the
-values they have. This allows for much more control of when and how
-many nodes to query for values. For example, a querying node could
-abort the search once it has found enough values in some nodes, or
-it could choose to only request values from the nodes that are
-closest to the key being searched for.
-
-\subsection{BitTorrent's Improvements}
-\label{bittorrent_dht}
-
-In the many years that some BitTorrent clients have been using a
-Kademlia based DHT for tracker-less operation, the developers have
-made many enhancements which we can take advantage of. One of the
-most important is a security feature added to stop malicious nodes
-from subscribing other nodes as downloaders. When a node issues a
-request to another node to store its download info for that key, it
-must include a \emph{token} returned by the storing node in a recent
-\texttt{find\_node} request. The token is created by hashing the IP
-address of the requesting peer with a temporary secret that expires
-after several minutes. This prevents the requesting peer from faking
-its IP address in the store request, since it must first receive a
-response from a \texttt{find\_node} on that IP.
-
-We also made some BitTorrent-inspired changes to the parameters of
-the DHT originally specified by the authors of Kademlia. Since, as
-we will show later, peers stay online in our system for much longer
-periods of time, we reduced Kademlia's \emph{k} value from 20 to 8.
-The value is supposed to be large enough such that any given
-\emph{k} nodes are unlikely to fail within an hour of each other,
-which is very unlikely in our system given the long uptimes of
-nodes. We also increased the number of concurrent outstanding
-requests allowed from 3 to 6 to speed up the recursive key finding
-processes.
-
-\subsection{Other Changes}
-\label{other_changes}
-
-We added one other new RPC request that nodes can make:
-\texttt{join}. This request is only sent on first loading the DHT,
-and is usually only sent to the bootstrap nodes that are listed for
-the DHT. These bootstrap nodes will respond to the request with the
-requesting peer's IP and port, so that the peer can determine what
-its oustide IP address is and whether port translation is being
-used. In the future, we hope to add functionality similar to STUN
-\cite{STUN}, so that nodes can detect whether they are NATted and
-take appropriate steps to circumvent it.
-
-In addition, we have allowed peers to store values in the DHT, even
-if the hash they are using is not the correct length. Most of the
-keys used in the DHT are based on the SHA1 hash, and so they are 20
-bytes in length. However, some files in the targetted Debian package
-system are only hashed using the MD5 algorithm, and so the hashes
-retrieved from the server are 16 bytes in length. The DHT will still
-store values using these keys by padding the end of them with zeroes
-to the proper length, as the loss of uniqueness from the 4 less
-bytes is not an issue. Also, the DHT will happily store hashes that
-are longer than 20 bytes, such as those from the 32 byte SHA256
-algorithm, by truncating them to the correct length. Since the
-requesting peer will have the full length of the hash, this will not
-affect its attempt to verify the downloaded file.
-
-%%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%
-
-\section{Analysis}
-\label{analysis}
-
-Our \texttt{apt-p2p} implementation supporting the Debian package
-distribution system has been available to all Debian users since May
-3rd, 2008 \cite{apt-p2p-debian}, and will also be available in the
-next release of Ubuntu \cite{apt-p2p-ubuntu}. We have since created
-a \emph{walker} that will navigate the DHT and find all the peers
-currently connected to it. This allows us to analyze many aspects of
-our implementation.
-
-\subsection{Peer Lifetimes}
-\label{peer_life}
-
-\begin{figure}
-\centering
-\includegraphics[width=\columnwidth]{AptP2PWalker-peers.eps}
-\caption{The number of peers found in the system, and how many are
-behind a firewall or NAT.}
-\label{walker_peers}
-\end{figure}
-
-We first began analyzing the DHT on June 24th, and continue to this
-day, giving us 2 months of data so far. Figure~\ref{walker_peers}
-shows the number of peers we have seen in the DHT during this time.
-The peer population is very steady, with just over 50 regular users
-participating in the DHT at any time. We also note that we find 100
-users who connect regularly (weekly), and we have found 186 unique
-users in the 2 months of our analysis. We determined which users are
-behind a firewall or NAT, which is one of the main problems of
-implementing a peer-to-peer network. These peers will be
-unresponsive to DHT requests from peers they have not contacted
-recently, which will cause the peer to wait for a timeout to occur
-(currently set at 9 seconds) before moving on. They will also be
-unable to contribute any upload bandwidth to other peers, as all
-requests for packages from them will also timeout. From
-Figure~\ref{walker_peers}, we see that approximately half of all
-peers suffered from this restriction.
-
-\begin{figure}
-\centering
-\includegraphics[width=\columnwidth]{AptP2PDuration-peers.eps}
-\caption{The CDF of how long an average session will last.}
-\label{duration_peers}
-\end{figure}
-
-Figure~\ref{duration_peers} shows the cumulative distribution of how
-long a connection from a peer can be expected to last. Due to our
-software being installed as a daemon that is started by default
-every time their computer boots up, peers are expected to stay for a
-long period in the system. 50\% of connections last longer than 5
-hours, and 20\% last longer than 10 hours. These connections are
-much longer than those reported by Saroiu et. al. \cite{saroiu2001}
-for other P2P systems, which had 50\% of Napster and Gnutella
-sessions lasting only 1 hour.
-
-\begin{figure}
-\centering
-\includegraphics[width=\columnwidth]{AptP2PDuration-ind_peers.eps}
-\caption{The CDF of the average time individual peers stay in the
-system.}
-\label{duration_ind_peers}
-\end{figure}
-
-We also examined the average time each individual peer spends in the
-system. Figure~\ref{duration_peers} shows the cumulative
-distribution of how long each individual peer remains in the system.
-Here we see that 50\% of peers have average stays in the system
-longer than 10 hours.
-
-\begin{figure}
-\centering
-\includegraphics[width=\columnwidth]{AptP2PDuration-online_1.eps}
-\caption{The fraction of peers that, given their current duration in
-the system, will stay online for another hour.}
-\label{duration_online_1}
-\end{figure}
-
-Since our DHT is based on Kademlia, which was designed based on the
-probability that a node will remain up another hour, we also
-analyzed our system for this parameter.
-Figure~\ref{duration_online_1} shows the fraction of peers will
-remain online for another hour, as a function of how long they have
-been online so far. Maymounkov and Mazieres found that the longer a
-node has been online, the higher the probability that it will stay
-online \cite{kademlia}. Our results also show this behavior. In
-addition, similar to the Gnutella peers, over 90\% of our peers that
-have been online for 10 hours, will remain online for another hour.
-Our results also show that, for our system, over 80\% of all peers
-will remain online another hour, compared with around 50\% for
-Gnutella.
-
-\begin{figure}
-\centering
-\includegraphics[width=\columnwidth]{AptP2PDuration-online_6.eps}
-\caption{The fraction of peers that, given their current duration in
-the system, will stay online for another 6 hours.}
-\label{duration_online_6}
-\end{figure}
-
-Since our peers are much longer-lived than other P2P systems, we
-also looked at the fraction of peers that stay online for another 6
-hours. Figure~\ref{duration_online_6} shows that over 60\% of peers
-that are online for 10 hours will stay online for another 6.
-However, we see an interesting decrease in this fraction between 8
-and 12 hours, which can also be seen in
-Figure~\ref{duration_online_1}. We believe this to be due to desktop
-users, who regularly turn off their computers at night.
-
-\subsection{Peer Statistics}
-\label{peer_stats}
-
-On July 31st we enhanced our walker to retrieve additional
-information from each contacted peer. The peers are configured, by
-default, to publish some statistics on how much they are downloading
-and uploading, and their measured response times for DHT queries.
-Our walker can extract this information if the peer is not
-firewalled or NATted, it has not disabled this functionality, and if
-it uses the same port for both its DHT (UDP) requests and download
-(TCP) requests (which is also a configuration parameter).
-
-\begin{figure}
-\centering
-\includegraphics[width=\columnwidth]{AptP2PDownloaded-peers.eps}
-\caption{The number of peers that were contacted to determine their
-bandwidth, and the total number of peers in the system.}
-\label{down_peers}
-\end{figure}
-
-Figure~\ref{down_peers} shows the total number of peers we have been
-able to contact since starting to gather this additional
-information, as well as how many total peers were found. We were
-only able to contact 30\% of all the peers that connected to the
-system during this time.
-
-\begin{figure}
-\centering
-\includegraphics[width=\columnwidth]{AptP2PDownloaded-bw.eps}
-\caption{The bandwidth of data that the contacted peers have
-downloaded and uploaded.}
-\label{down_bw}
-\end{figure}
-
-Figure~\ref{down_bw} shows the amount of data the peers we were able
-to contact have downloaded. Peers measure their downloads from other
-peers and mirrors separately, so we are able to get an idea of how
-much savings our system is generating for the mirrors. We see that
-the peers are downloading approximately 20\% of their package data
-from other peers, which is saving the mirrors from supplying that
-bandwidth. The actual numbers are only a lower bound, since we have
-only contacted 30\% of the peers in the system, but we can estimate
-that \texttt{apt-p2p} has already saved the mirrors 15 GB of
-bandwidth, or 1 GB per day.
-
-We also collected the statistics on the measured response time peers
-were experiencing when sending requests to the DHT. We found that
-the recursive \texttt{find\_value} query, which is necessary before
-a download can occur, is taking 17 seconds on average. This
-indicates that, on average, requests are experiencing almost 2
-stalls while waiting for the 9 second timeouts to occur on
-unresponsive peers. This is longer than our target of 10 seconds,
-although it will only lead to a slight average delay in downloading
-of 1.7 seconds when the default 10 concurrent downloads are
-occurring.This increased response time is due to the number of peers
-that were behind firewalls or NATs, which was much higher than we
-anticipated. We do have plans to improve this through better
-informing of users of their NATted status, the use of STUN
-\cite{STUN} to circumvent the NATs, and by better exclusion of
-NATted peers from the DHT (which will not prevent them from using
-the system).
-
-We were also concerned that the constant DHT requests and responses,
-even while not downloading, would overwhelm some peers' network
-connections. However, we found that peers are using 200 to 300 bytes
-per second of bandwidth in servicing the DHT. These numbers are
-small enough to not affect any other network services the peer would
-be running.
-
-%%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%
-
-\section{Future Work}
-\label{future}
-
-We feel that our P2P software package distribution model is
-well-suited to be used by many free software distributors. We hope
-to convince them to adopt such a model in their distribution, or we
-may port our existing system to some of the other groups for them to
-try.
-
-One aspect missing from our model is the removal of old packages
-from the cache. Since our implementation is still relatively young,
-we have not had to deal with the problems of a growing cache of
-obsolete packages consuming all of a user's hard drive. We plan to
-implement some form of least recently used (LRU) cache removal
-technique, in which packages that are no longer available on the
-server, no longer requested by peers, or simply are the oldest in
-the cache, will be removed.
-
-The most significant area of improvement still needed in our sample
-implementation is to further speed up some of the slower recursive
-DHT requests. We hope to accomplish this by further tuning the
-parameters of our current system, better exclusion of NATted peers
-from the routing tables, and through the use of STUN \cite{STUN} to
-circumvent the NATs of the 50\% of the peers that have not
-configured port forwarding.
-
-%%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%
-
-\section{Conclusions}
-\label{conclusions}
-
-We have designed a generally applicable peer-to-peer content
-distribution model to be used by many of the free software
-distributors operating today. It makes use of already existing
-features of most package management systems to create a
-peer-to-peer distribution that should substantially reduce the costs
-of hosting the software packages.
-
-We have also implemented our design in freely available software to
-be used in conjuction with Debian-based distribution of Linux
-software packages. It is currently in use by some users of the
-Debian project's distribution, and so serves as an example of the
-possibilities that exist.
-
-\bibliographystyle{IEEEtran}
-\bibliography{./IEEEabrv,./all}
-
-\end{document}
+\documentclass[conference]{IEEEtran}\r
+\r
+%% INFOCOM addition:\r
+\makeatletter\r
+\def\ps@headings{%\r
+\def\@oddhead{\mbox{}\scriptsize\rightmark \hfil \thepage}%\r
+\def\@evenhead{\scriptsize\thepage \hfil\leftmark\mbox{}}%\r
+\def\@oddfoot{}%\r
+\def\@evenfoot{}}\r
+\makeatother\r
+\pagestyle{headings}\r
+\r
+\usepackage[noadjust]{cite}\r
+\r
+\usepackage[dvips]{graphicx}\r
+\r
+\usepackage{url}\r
+\r
+\begin{document}\r
+\r
+\title{\texttt{apt-p2p}: A Peer-to-Peer Distributor for Free Software Package Release and Update}\r
+\author{\IEEEauthorblockN{Cameron Dale}\r
+\IEEEauthorblockA{School of Computing Science\\\r
+Simon Fraser University\\\r
+Burnaby, British Columbia, Canada\\\r
+Email: camerond@cs.sfu.ca}\r
+\and\r
+\IEEEauthorblockN{Jiangchuan Liu}\r
+\IEEEauthorblockA{School of Computing Science\\\r
+Simon Fraser University\\\r
+Burnaby, British Columbia, Canada\\\r
+Email: jcliu@cs.sfu.ca}}\r
+\r
+\maketitle\r
+\r
+\begin{abstract}\r
+A large amount of free software packages are available over the\r
+Internet from many different distributors. Most of these\r
+distributors use the traditional client-server model to handle\r
+requests from users. However, there is an excellent opportunity to\r
+use peer-to-peer techniques to reduce the cost of much of this\r
+distribution, especially due to the altruistic nature of many of\r
+these users. There are no existing solution suitable for this\r
+situation, so we present a new technique for satisfying the needs of\r
+this P2P distribution, which is generally applicable to many of\r
+these distributors' systems. Our method makes use of a DHT for\r
+storing the location of peers, using the cryptographic hash of the\r
+package as a key. To show the simplicity and functionality, we\r
+implement a solution for the distribution of Debian software\r
+packages, including the many DHT customizations needed. Finally, we\r
+analyze our system to determine how it is performing and what effect\r
+it is having.\r
+\end{abstract}\r
+\r
+%%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%\r
+\r
+\section{Introduction}\r
+\label{intro}\r
+\r
+There are a large number of free software package distributors using\r
+package distribution systems over the Internet to distribute\r
+software to their users. These distributors have developed many\r
+different methods for this distribution, but they almost exclusively\r
+use a client-server model to satisfy user requests. The popularity\r
+and number of users results in a large number of requests, which\r
+usually requires a network of mirrors to handle. Due to the free\r
+nature of this software, many users are willing and able to\r
+contribute upload bandwidth to this distribution, but have no\r
+current way to do this.\r
+\r
+We present a new peer-to-peer distribution model to meet these\r
+demands. It is based on many previous implementations of successful\r
+peer-to-peer protocols, especially distributed hash tables (DHT) and\r
+BitTorrent. The model relies on the pre-existence of cryptographic\r
+hashes of the packages, which should uniquely identify it for a\r
+request from other peers. If the peer-to-peer download fails, then\r
+the original request to the server is used as a fallback to prevent\r
+any dissatisfaction from users. The peer can then share this new\r
+package with others through the P2P system.\r
+\r
+First, we examine the opportunity that is available for many of\r
+these free software package distributors. We present an overview of\r
+a system that would efficiently satisfy the demands of a large\r
+number of users, and should significantly reduce the currently\r
+substantial bandwidth requirements of hosting this software. We then\r
+present an example implementation based on the Debian package\r
+distribution system. This implementation will be used by a large\r
+number of users, and serves as an example for other free software\r
+distributors of the opportunity that can be met with such a system.\r
+\r
+The rest of this paper is organized as follows. The background and motivation are presented in Section \ref{situation}, together with related works in  Section \ref{related}. We propose\r
+our solution in Section \ref{opportunity}. We then detail our sample\r
+implementation for Debian-based distributions in Section\r
+\ref{implementation}, including an in-depth look at our DHT\r
+customizations in Section \ref{custom_dht}. Its performance is evaluated in Section \ref{analysis}. Finally, \r
+Section \ref{conclusions} concludes the paper and offers some future directions.\r
+\r
+%%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%\r
+\r
+\section{Background and Motivations}\r
+\label{situation}\r
+\r
+In the free software society, there are a large number of groups using the Internet to \r
+collaboratively develop and release their software. The ever increasing power of\r
+modern programming languages and operating systems has made these software, like commercial software, extremely large and complex, which often\r
+consists of many small units (packages). Together with their popularity among users, \r
+an efficient and reliable management and distribution of these packages over the Internet has become an daunting task. In this section, we offer concrete examples illustrating the \r
+unique challenges in this context. \r
+\r
+\subsection{Free Software Package Distributor: Examples}\r
+\label{examples}\r
+\r
+\r
+Most Linux distributions use a software package management system\r
+that fetches packages to be installed from an archive of packages\r
+hosted on a network of mirrors. The Debian project, and other\r
+Debian-based distributions such as Ubuntu and Knoppix, use the\r
+\texttt{apt} (Advanced Package Tool) program, which downloads Debian\r
+packages in the \texttt{.deb} format from one of many HTTP mirrors.\r
+The program will first download index files that contain a listing\r
+of which packages are available, as well as important information\r
+such as their size, location, and a hash of their content. The user\r
+can then select which packages to install or upgrade, and\r
+\texttt{apt} will download and verify them before installing them.\r
+\r
+There are also several similar frontends for the RPM-based\r
+distributions. Red Hat's Fedora project uses the \texttt{yum}\r
+program, SUSE uses \texttt{YAST}, while Mandriva has\r
+\texttt{Rpmdrake}, all of which are used to obtain RPMs from\r
+mirrors. Other distributions use tarballs (\texttt{.tar.gz} or\r
+\texttt{.tar.bz2}) to contain their packages. Gentoo's package\r
+manager is called \texttt{portage}, SlackWare Linux uses\r
+\texttt{pkgtools}, and FreeBSD has a suite of command-line tools,\r
+all of which download these tarballs from web servers.\r
+\r
+Similar tools have been used for other software packages. CPAN\r
+distributes software packages for the PERL\r
+programming language, using SOAP RPC requests to find and download\r
+files. Cygwin provides many of the\r
+standard Unix/Linux tools in a Windows environment, using a\r
+package management tool that requests packages from websites. There\r
+are two software distribution systems for Mac OSX, fink and\r
+MacPorts, that also retrieve packages in this way.\r
+\r
+Direct web downloading is also a common way, often coupled with a hash\r
+verification file to be downloaded next to the desired\r
+file. The hash file usually have the same file name, but with an\r
+added extension identifying the hash used (e.g. \texttt{.md5} for\r
+the MD5 hash). This type of file downloading and verification is\r
+typical of free software hosting facilities that are open to anyone\r
+to use, such as SourceForge.\r
+\r
+\r
+Given the free nature of these software, there are often a number of users \r
+motivated by altruism to want to help out with their distribution. \r
+This is particularly true considering that many of these software are used by\r
+groups that are staffed mostly, or sometimes completely, by\r
+volunteers. They are thus motivated to contribute their network resources, so as to promote the healthy development \r
+of the volunteer community that released the software. As a matter of fact,\r
+we have see many free mirror sites hosting these software packages for downloading. \r
+We also naturally expect that peer-to-peer distribution can be implementation in \r
+this context, which scale well with large user bases and can easily explore the network resources made available by\r
+the volunteers. \r
+\r
+\r
+\r
+\subsection{Unique Characteristics}\r
+\label{problems}\r
+\r
+While it seems straightforward to use an existing peer-to-peer file sharing tool like BitTorrent for\r
+free software package distribution, there are indeed a series of new challenges in this unique scenario:\r
+\r
+\subsubsection{Archive Dimensions}\r
+\r
+While most of the packages of a software release are very small in\r
+size, there are some that are quite large. There are too many\r
+packages to distribute each individually, but the archive is also\r
+too large to distribute in its entirety. In some archives there are\r
+also divisions of the archive into sections, e.g. by the operating system (OS) or computer\r
+architecture that the package is intended for.\r
+\r
+\begin{figure}\r
+\centering\r
+\includegraphics[width=\columnwidth]{apt_p2p_simulation-size_CDF.eps}\r
+\caption{The CDF of the size of packages in a Debian system, both\r
+for the actual size and adjusted size based on the popularity of\r
+the package.}\r
+\label{size_CDF}\r
+\end{figure}\r
+\r
+For example, Figure~\ref{size_CDF} shows the size of the packages in the\r
+current Debian distribution. While 80\% of the packages are less than\r
+512~KB, some of the packages are hundreds of megabytes. The entire\r
+archive consists of 22,298 packages and is approximately 119,000 MB\r
+in size. Most of the packages are to be installed in any computer environment, but there are \r
+also OS- or architecture-specific packages, as shown by the adjusted sizes based on popularity of the packages. ((((more words on how the size is adjusted by popularity))))\r
+\r
+\subsubsection{Package Updates}\r
+\r
+The software packages being distributed are being constantly\r
+updated. These updates could be the result of the software creators\r
+releasing a new version with improved functionality,\r
+or the distributor updating their packaging of the\r
+software to meet new requirements. Even if the distributor\r
+periodically makes \emph{stable} releases, which are snapshots of\r
+all the packages in the archive at a certain time, updates are still\r
+released for security issues or serious bugs.\r
+\r
+\begin{figure}\r
+\centering\r
+\includegraphics[width=\columnwidth]{size-quarter.eps}\r
+\caption{The amount of data in the Debian archive that is updated\r
+each day, broken down by architecture.}\r
+\label{update_size}\r
+\end{figure}\r
+\r
+For example, Figure~\ref{update_size} shows the amount of data in\r
+the Debian archive that was updated each day over a period of 3\r
+months. In every single day, approximately 1.5\% of the 119,000 MB archive is\r
+updated with new versions of packages. Note that this frequency is much higher than\r
+that of most commercial software, mainly because many free software are\r
+developed in a loosely management environment with developers working\r
+asynchronously from worldwide. \r
+\r
+\subsubsection{Limited Interest}\r
+\r
+Finally, though there are a large number of packages and a large number of\r
+users, the interest in a particular version of a package can be very \r
+limited. Specifically, there are core packages that every user has to download, but most\r
+packages would fall in the category of optional or extra, and so are\r
+interesting to only a limited number of people.\r
+\r
+\begin{figure}\r
+\centering\r
+\includegraphics[width=\columnwidth]{apt_p2p_popularity-cdf.eps}\r
+\caption{The CDF of the popularity of packages in a Debian system.}\r
+\label{popularity_CDF}\r
+\end{figure}\r
+\r
+For example, the Debian distribution tracks the popularity of its\r
+packages using popcon \cite{popcon}. Figure~\ref{popularity_CDF}\r
+shows the cumulative distribution function of the percentage of all\r
+users who install each package. Though some packages are installed\r
+by everyone, 80\% of the packages are installed by less than 1\% of\r
+users.\r
+\r
+\r
+%%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%\r
+\r
+\subsection{Why BitTorrent Doesn't Work Well}\r
+\label{related}\r
+\r
+Recently, many distributors make their software available using\r
+BitTorrent \cite{COHEN03}, in particular, for the distribution of CD\r
+images. This straightforward use however is far ineffective, as it requires the\r
+peers to download large numbers of packages that they are not\r
+interested in, and prevents them from updating to new packages\r
+without downloading another image containing a lot of the same\r
+packages they already have. \r
+\r
+An alternative is to create torrents tracking individual packages. Unfortunately, we find that this enhancement can be\r
+quite difficult given the unique characteristic of free software packages. \r
+\r
+First, there is no obvious way to divide the packages into torrents.\r
+Most of the packages are too small, and there are too many packages\r
+in the entire archive to create individual torrents for each one.\r
+On the other hand, all the packages together are too large to track\r
+efficiently as a single torrent. Hence, some division of the archive's\r
+packages into torrents is obviously necessary, but wherever that\r
+split occurs it will cause either some duplication of connections,\r
+or prevent some peers from connecting to others who do have the same\r
+content. In addition, a small number of the packages can be updated every\r
+day which would add new files to the torrent, thereby changing its\r
+\emph{infohash} identifier and making it a new torrent. This will\r
+severely fracture the download population, even though peers in the\r
+new torrent may share 99\% of the packages in common with peers in the\r
+old torrent.\r
+\r
+Other issues also prevent BitTorrent from being a good solution to\r
+this problem. In particular, BitTorrent's fixed piece sizes (?KB) that disregard file\r
+boundaries are bigger than many of the packages in the archive. This\r
+will waste peers' downloading bandwidth as they will end up\r
+downloading parts of other packages just to get the piece that\r
+contains the package they do want. \r
+\r
+Finally, note that BitTorrent downloads files\r
+randomly, which does not work well with the interactive package\r
+management tools expectation of sequential downloads. On the other hand, with altruistic peers, incentives to share (upload) \r
+become a less important issue, and the availability of seeds are not critical, either, as the mirror sites \r
+can serve in that capacity.\r
+\r
+\r
+\r
+%%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%\r
+\r
+\section{Peer-to-Peer Assisted Distributor: An Overview}\r
+\label{opportunity}\r
+\r
+The situation described in section \ref{situation} presents a clear\r
+opportunity to use some form of peer-to-peer file-sharing protocol\r
+to allow willing users to contribute upload bandwidth. This sparse\r
+interest in a large number of packages undergoing constant updating\r
+is well suited to the functionality provided by a Distributed Hash\r
+Table (DHT). DHTs require unique keys to store and retrieve strings\r
+of data, for which the cryptographic hashes used by these package\r
+management systems are perfect for. The stored and retrieved strings\r
+can then be pointers to the peers that have the package that hashes\r
+to that key. A downloading peer can lookup the package hash in the\r
+DHT and, if it is found, download the file from those peers and\r
+verify the package with the hash. Once the download is complete, the\r
+peer will add its entry to the DHT indicating that it now has the\r
+package.\r
+\r
+The fact that this package is also available to download for free\r
+from a server is very important to our proposal. If the package hash\r
+can not be found in the DHT, the peer can then fallback to\r
+downloading from the original location (i.e. the network of\r
+mirrors). The mirrors thus, with no modification to their\r
+functionality, serve as seeds for the packages in the peer-to-peer\r
+system. Any packages that have just been updated, or that are very\r
+rare, and so don't have any peers available can always be found on\r
+the mirror. Once the peer has completed the download from the mirror\r
+and verified the package, it can then add itself to the DHT as the\r
+first peer for the new package, so that future requests from peers\r
+will not need the mirror.\r
+\r
+The trust of the package is also always guaranteed through the use\r
+of the cryptographic hashes. Nothing can be downloaded from a peer\r
+until the hash is looked up in the DHT, so a hash must first come\r
+from a trusted source (i.e. a mirror). Most distributors use index\r
+files that contain hashes for a large number of the packages in\r
+their archive, and which are also hashed. After retrieving the\r
+index's hash from the mirror, the index file can be downloaded from\r
+peers and verified. Then the program has access to all the hashes of\r
+the packages it will be downloading, all of which can be verified\r
+with a \emph{chain of trust} that stretches back to the original\r
+distributor's server.\r
+\r
+\subsection{Implementation Options}\r
+\label{imp_options}\r
+\r
+There are several ways to implement the desired P2P functionality\r
+into the existing package management software. The functionality can\r
+be directly integrated through modifications to the software, though\r
+this could be difficult as the P2P functionality should be running\r
+at all times. This is needed both for efficient lookups and to\r
+support uploading of already downloaded packages. Unfortunately, the\r
+package management tools typically only run until the download and\r
+install request is complete.\r
+\r
+Many of the package management software implementations use HTTP\r
+requests from web servers to download the packages, which makes it\r
+possible to implement the P2P aspect as an almost standard HTTP\r
+caching proxy. This proxy will run as a daemon in the background,\r
+listening for requests from the package management tool for package\r
+files. It will get uncached requests first from the P2P system, or\r
+falling back to the normal HTTP request from a server should it not\r
+be found. For methods that don't use HTTP requests, other types of\r
+proxies may also be possible.\r
+\r
+\subsection{Downloading From Peers}\r
+\label{downloading}\r
+\r
+Although not necessary, we recommend implementing a download\r
+protocol that is similar to the protocol used to fetch packages from\r
+the distributor's servers. This simplifies the P2P program, as it\r
+can then treat peers and mirrors almost identically when requesting\r
+packages. In fact, the mirrors can be used when there are only a few\r
+slow peers available for a file to help speed up the download\r
+process.\r
+\r
+Downloading a file efficiently from a number of peers is where\r
+BitTorrent shines as a peer-to-peer application. Its method of\r
+breaking up larger files into pieces, each with its own hash,\r
+makes it very easy to parallelize the downloading process and\r
+maximize the download speed. For very small packages (i.e. less than\r
+the piece size), this parallel downloading is not necessary, or\r
+even desirable. However, this method should still be used, in\r
+conjunction with the DHT, for the larger packages that are\r
+available.\r
+\r
+Since the package management system only stores a hash of the entire\r
+package, and not of pieces of that package, we will need to be able\r
+to store and retrieve these piece hashes using the P2P protocol. In\r
+addition to storing the file download location in the DHT (which\r
+would still be used for small files), a peer will store a\r
+\emph{torrent string} containing the peer's hashes of the pieces of\r
+the larger files. These piece hashes could be compared ahead of time\r
+to determine which peers have the same piece hashes (they all\r
+should), and then used during the download to verify the pieces of\r
+the downloaded package.\r
+\r
+%%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%\r
+\r
+\section{\texttt{apt-p2p}: A Practical Implementation}\r
+\label{implementation}\r
+\r
+We have created a sample implementation that functions as described\r
+in section \ref{opportunity}, and is freely available for other\r
+distributors to download and modify \cite{apt-p2p}. This software,\r
+called \texttt{apt-p2p}, interacts with the \texttt{apt} tool, which\r
+is found in most Debian-based Linux distributions. \texttt{apt} uses\r
+SHA1 hashes to verify most downloaded files, including the large\r
+index files that contain the hashes of the individual packages. We\r
+chose this distribution system as it is familiar to us, it allows\r
+software contributions, and there are interesting statistics\r
+available for analyzing the popularity of the software packages\r
+\cite{popcon}.\r
+\r
+Since all requests from apt are in the form of HTTP downloads from a\r
+server, the implementation takes the form of a caching HTTP proxy.\r
+Making a standard \texttt{apt} implementation use the proxy is then\r
+as simple as prepending the proxy location and port to the front of\r
+the mirror name in \texttt{apt}'s configuration file (i.e.\r
+``localhost:9977/mirrorname.debian.org/\ldots'').\r
+\r
+We created a customized DHT based on Khashmir \cite{khashmir}, which\r
+is an implementation of Kademlia \cite{kademlia} using methods\r
+familiar to BitTorrent developers. Khashmir is also the same DHT\r
+implementation used by most of the existing BitTorrent clients to\r
+implement trackerless operation. The communication is all handled by\r
+UDP messages, and RPC (remote procedure call) requests and responses\r
+are all \emph{bencoded} in the same way as BitTorrent's\r
+\texttt{.torrent} files. Khashmir uses the high-level Twisted\r
+event-driven networking engine \cite{twisted}, so we also use\r
+Twisted in our sample implementation for all other networking needs.\r
+More details of this customized DHT can be found below in section\r
+\ref{custom_dht}.\r
+\r
+Downloading is accomplished by sending simple HTTP requests to the\r
+peers identified by lookups in the DHT to have the desired file.\r
+Requests for a package are made using the package's hash, properly\r
+encoded, as the URL to request from the peer. The HTTP server used\r
+for the proxy also doubles as the server listening for requests for\r
+downloads from other peers. All peers support HTTP/1.1, both in the\r
+server and the client, which allows for pipelining of multiple\r
+requests to a peer, and the requesting of smaller pieces of a large\r
+file using the Range request header.\r
+\r
+%%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%\r
+\r
+\section{Customized DHT}\r
+\label{custom_dht}\r
+\r
+A large contribution of our work is in the customization and use of\r
+a Distributed Hash Table (DHT). Although our DHT is based on\r
+Kademlia, we have made many improvements to it to make it suitable\r
+for this application. In addition to a novel storage technique to\r
+support piece hashes, we have improved the response time of looking\r
+up queries, allowed the storage of multiple values for each key, and\r
+incorporated some improvements from BitTorrent's tracker-less DHT\r
+implementation.\r
+\r
+\subsection{Kademlia Background}\r
+\label{kademlia}\r
+\r
+The Kademlia DHT, like most other DHTs, assigns IDs to peers from\r
+the same space that is used for keys. The peers with IDs closest to\r
+the desired key will then store the values for that key. Lookups are\r
+recursive, as nodes are queried in each step that are exponentially\r
+closer to the key than in the previous step.\r
+\r
+Nodes in a Kademlia system support four primitive requests.\r
+\texttt{ping} will cause a peer to return nothing, and is only used\r
+to determine if a node is still alive, while \texttt{store} tells a\r
+node to store a value associated with a given key. The most\r
+important primitives are \texttt{find\_node} and\r
+\texttt{find\_value}, which both function recursively to find nodes\r
+close to a key. The queried nodes will return a list of the nodes\r
+they know about that are closest to the key, allowing the querying\r
+node to quickly traverse the DHT to find the nodes close to the\r
+desired key. The only difference between them is that the\r
+\texttt{find\_value} query will cause a node to return a value, if\r
+it has one for that key, instead of a list of nodes.\r
+\r
+\subsection{Piece Hash Storage}\r
+\label{pieces}\r
+\r
+Hashes of pieces of the larger package files are needed to support\r
+their efficient downloading from multiple peers.\r
+For large files (5 or more pieces), the torrent strings described in\r
+section \ref{downloading}\r
+are too long to store with the peer's download info in the DHT. This\r
+is due to the limitation that a single UDP packet should be less\r
+than 1472 bytes to avoid fragmentation.\r
+% For example, for a file of 4 pieces, the peer would need to store a\r
+% value 120 bytes in size (IP address, port, four 20-byte pieces and\r
+% some overhead), which would limit a return value from including more\r
+% than 10 values to a request.\r
+\r
+Instead, the peers will store the torrent string for large files\r
+separately in the DHT, and only contain a reference to it in their\r
+stored value for the hash of the file. The reference is an SHA1 hash\r
+of the entire concatenated length of the torrent string. If the\r
+torrent string is short enough to store in the DHT (i.e. less than\r
+1472 bytes, or about 70 pieces for the SHA1 hash), then a lookup of\r
+that hash in the DHT will return the torrent string. Otherwise, a\r
+request to the peer for the hash (using the same method as file\r
+downloads, i.e. HTTP), will cause the peer to return the torrent\r
+string.\r
+\r
+Figure \ref{size_CDF} shows the package size of the 22,298 packages\r
+available in Debian in January 2008. We can see that most of the\r
+packages are quite small, and so most will therefore not require\r
+piece hash information to download. We have chosen a piece\r
+size of 512 kB, which means that 17,515 (78\%) of the packages will\r
+not require this information. There are 3054 packages that will\r
+require 2 to 4 pieces, for which the torrent string can be stored\r
+directly with the package hash in the DHT. There are 1667 packages\r
+that will require a separate lookup in the DHT for the longer\r
+torrent string as they require 5 to 70 pieces. Finally, there are\r
+only 62 packages that require more than 70 pieces, and so will\r
+require a separate request to a peer for the torrent string.\r
+\r
+\subsection{Response Time}\r
+\label{response_time}\r
+\r
+Most of our customizations to the DHT have been to try and improve\r
+the time of the recursive \texttt{find\_value} requests, as this can\r
+cause long delays for the user waiting for a package download to\r
+begin. The one problem that slows down such requests is waiting for\r
+timeouts to occur before marking the node as failed and moving on.\r
+\r
+Our first improvement is to retransmit a request multiple times\r
+before a timeout occurs, in case the original request or its\r
+response was lost by the unreliable UDP protocol. If it does not\r
+receive a response, the requesting node will retransmit the request\r
+after a short delay. This delay will increase exponentially for\r
+later retransmissions, should the request again fail. Our current\r
+implementation will retransmit the request after 2 seconds and 6\r
+seconds (4 seconds after the first retransmission), and then timeout\r
+after 9 seconds.\r
+\r
+We have also added some improvements to the recursive\r
+\texttt{find\_node} and \texttt{find\_value} queries to speed up the\r
+process when nodes fail. If enough nodes have responded to the\r
+current query such that there are many new nodes to query that are\r
+closer to the desired key, then a stalled request to a node further\r
+away will be dropped in favor of a new request to a closer node.\r
+This has the effect of leap-frogging unresponsive nodes and\r
+focussing attention on the nodes that do respond. We will also\r
+prematurely abort a query while there are still oustanding requests,\r
+if enough of the closest nodes have responded and there are no\r
+closer nodes found. This prevents a far away unresponsive node from\r
+making the query's completion wait for it to timeout.\r
+\r
+Finally, we made all attempts possible to prevent firewalled and\r
+NATted nodes from being added to the routing table for future\r
+requests. Only a node that has responded to a request from us will\r
+be added to the table. If a node has only sent us a request, we\r
+attempt to send a \texttt{ping} to the node to determine if it is\r
+NATted or not. Unfortunately, due to the delays used by NATs in\r
+allowing UDP packets for a short time if one was recently sent by\r
+the NATted host, the ping is likely to succeed even if the node is\r
+NATted. We therefore also schedule a future ping to the node to make\r
+sure it is still reachable after the NATs delay has hopefully\r
+elapsed. We also schedule future pings of nodes that fail once to\r
+respond to a request, as it takes multiple failures (currently 3)\r
+before a node is removed from the routing table.\r
+\r
+\subsection{Multiple Values}\r
+\label{multiple_values}\r
+\r
+The original design of Kademlia specified that each keywould have\r
+only a single value associated with it. The RPC to find this value\r
+was called \texttt{find\_value} and worked similarly to\r
+\texttt{find\_node}, iteratively finding nodes with ID's closer to\r
+the desired key. However, if a node had a value stored associated\r
+with the searched for key, it would respond to the request with that\r
+value instead of the list of nodes it knows about that are closer.\r
+\r
+While this works well for single values, it can cause a problem when\r
+there are multiple values. If the responding node is no longer one\r
+of the closest to the key being searched for, then the values it is\r
+returning will probably be the staler ones in the system, and it\r
+will not have the latest stored values. However, the search for\r
+closer nodes will stop here, as the queried node only returned the\r
+values and not a list of nodes to recursively query. We could have\r
+the request return both the values and the list of nodes, but that\r
+would severely limit the size and number of the values that could be\r
+returned.\r
+\r
+Instead, we have broken up the original \texttt{find\_value}\r
+operation into two parts. The new \texttt{find\_value} request\r
+always returns a list of nodes that the node believes are closest to\r
+the key, as well as a number indicating the number of values that\r
+this node has for the key. Once a querying node has finished its\r
+search for nodes and found the closest ones to the key, it can issue\r
+\texttt{get\_value} requests to some nodes to actually retrieve the\r
+values they have. This allows for much more control of when and how\r
+many nodes to query for values. For example, a querying node could\r
+abort the search once it has found enough values in some nodes, or\r
+it could choose to only request values from the nodes that are\r
+closest to the key being searched for.\r
+\r
+\subsection{BitTorrent's Improvements}\r
+\label{bittorrent_dht}\r
+\r
+In the many years that some BitTorrent clients have been using a\r
+Kademlia based DHT for tracker-less operation, the developers have\r
+made many enhancements which we can take advantage of. One of the\r
+most important is a security feature added to stop malicious nodes\r
+from subscribing other nodes as downloaders. When a node issues a\r
+request to another node to store its download info for that key, it\r
+must include a \emph{token} returned by the storing node in a recent\r
+\texttt{find\_node} request. The token is created by hashing the IP\r
+address of the requesting peer with a temporary secret that expires\r
+after several minutes. This prevents the requesting peer from faking\r
+its IP address in the store request, since it must first receive a\r
+response from a \texttt{find\_node} on that IP.\r
+\r
+We also made some BitTorrent-inspired changes to the parameters of\r
+the DHT originally specified by the authors of Kademlia. Since, as\r
+we will show later, peers stay online in our system for much longer\r
+periods of time, we reduced Kademlia's \emph{k} value from 20 to 8.\r
+The value is supposed to be large enough such that any given\r
+\emph{k} nodes are unlikely to fail within an hour of each other,\r
+which is very unlikely in our system given the long uptimes of\r
+nodes. We also increased the number of concurrent outstanding\r
+requests allowed from 3 to 6 to speed up the recursive key finding\r
+processes.\r
+\r
+\subsection{Other Changes}\r
+\label{other_changes}\r
+\r
+We added one other new RPC request that nodes can make:\r
+\texttt{join}. This request is only sent on first loading the DHT,\r
+and is usually only sent to the bootstrap nodes that are listed for\r
+the DHT. These bootstrap nodes will respond to the request with the\r
+requesting peer's IP and port, so that the peer can determine what\r
+its oustide IP address is and whether port translation is being\r
+used. In the future, we hope to add functionality similar to STUN\r
+\cite{STUN}, so that nodes can detect whether they are NATted and\r
+take appropriate steps to circumvent it.\r
+\r
+In addition, we have allowed peers to store values in the DHT, even\r
+if the hash they are using is not the correct length. Most of the\r
+keys used in the DHT are based on the SHA1 hash, and so they are 20\r
+bytes in length. However, some files in the targetted Debian package\r
+system are only hashed using the MD5 algorithm, and so the hashes\r
+retrieved from the server are 16 bytes in length. The DHT will still\r
+store values using these keys by padding the end of them with zeroes\r
+to the proper length, as the loss of uniqueness from the 4 less\r
+bytes is not an issue. Also, the DHT will happily store hashes that\r
+are longer than 20 bytes, such as those from the 32 byte SHA256\r
+algorithm, by truncating them to the correct length. Since the\r
+requesting peer will have the full length of the hash, this will not\r
+affect its attempt to verify the downloaded file.\r
+\r
+%%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%\r
+\r
+\section{Performance Evaluation}\r
+\label{analysis}\r
+\r
+Our \texttt{apt-p2p} implementation supporting the Debian package\r
+distribution system has been available to all Debian users since May\r
+3rd, 2008 \cite{apt-p2p-debian}, and will also be available in the\r
+next release of Ubuntu \cite{apt-p2p-ubuntu}. We have since created\r
+a \emph{walker} that will navigate the DHT and find all the peers\r
+currently connected to it. This allows us to analyze many aspects of\r
+our implementation.\r
+\r
+\subsection{Peer Lifetimes}\r
+\label{peer_life}\r
+\r
+\begin{figure}\r
+\centering\r
+\includegraphics[width=\columnwidth]{AptP2PWalker-peers.eps}\r
+\caption{The number of peers found in the system, and how many are\r
+behind a firewall or NAT.}\r
+\label{walker_peers}\r
+\end{figure}\r
+\r
+We first began analyzing the DHT on June 24th, and continue to this\r
+day, giving us 2 months of data so far. Figure~\ref{walker_peers}\r
+shows the number of peers we have seen in the DHT during this time.\r
+The peer population is very steady, with just over 50 regular users\r
+participating in the DHT at any time. We also note that we find 100\r
+users who connect regularly (weekly), and we have found 186 unique\r
+users in the 2 months of our analysis. We determined which users are\r
+behind a firewall or NAT, which is one of the main problems of\r
+implementing a peer-to-peer network. These peers will be\r
+unresponsive to DHT requests from peers they have not contacted\r
+recently, which will cause the peer to wait for a timeout to occur\r
+(currently set at 9 seconds) before moving on. They will also be\r
+unable to contribute any upload bandwidth to other peers, as all\r
+requests for packages from them will also timeout. From\r
+Figure~\ref{walker_peers}, we see that approximately half of all\r
+peers suffered from this restriction.\r
+\r
+\begin{figure}\r
+\centering\r
+\includegraphics[width=\columnwidth]{AptP2PDuration-peers.eps}\r
+\caption{The CDF of how long an average session will last.}\r
+\label{duration_peers}\r
+\end{figure}\r
+\r
+Figure~\ref{duration_peers} shows the cumulative distribution of how\r
+long a connection from a peer can be expected to last. Due to our\r
+software being installed as a daemon that is started by default\r
+every time their computer boots up, peers are expected to stay for a\r
+long period in the system. 50\% of connections last longer than 5\r
+hours, and 20\% last longer than 10 hours. These connections are\r
+much longer than those reported by Saroiu et. al. \cite{saroiu2001}\r
+for other P2P systems, which had 50\% of Napster and Gnutella\r
+sessions lasting only 1 hour.\r
+\r
+\begin{figure}\r
+\centering\r
+\includegraphics[width=\columnwidth]{AptP2PDuration-ind_peers.eps}\r
+\caption{The CDF of the average time individual peers stay in the\r
+system.}\r
+\label{duration_ind_peers}\r
+\end{figure}\r
+\r
+We also examined the average time each individual peer spends in the\r
+system. Figure~\ref{duration_peers} shows the cumulative\r
+distribution of how long each individual peer remains in the system.\r
+Here we see that 50\% of peers have average stays in the system\r
+longer than 10 hours.\r
+\r
+\begin{figure}\r
+\centering\r
+\includegraphics[width=\columnwidth]{AptP2PDuration-online_1.eps}\r
+\caption{The fraction of peers that, given their current duration in\r
+the system, will stay online for another hour.}\r
+\label{duration_online_1}\r
+\end{figure}\r
+\r
+Since our DHT is based on Kademlia, which was designed based on the\r
+probability that a node will remain up another hour, we also\r
+analyzed our system for this parameter.\r
+Figure~\ref{duration_online_1} shows the fraction of peers will\r
+remain online for another hour, as a function of how long they have\r
+been online so far. Maymounkov and Mazieres found that the longer a\r
+node has been online, the higher the probability that it will stay\r
+online \cite{kademlia}. Our results also show this behavior. In\r
+addition, similar to the Gnutella peers, over 90\% of our peers that\r
+have been online for 10 hours, will remain online for another hour.\r
+Our results also show that, for our system, over 80\% of all peers\r
+will remain online another hour, compared with around 50\% for\r
+Gnutella.\r
+\r
+\begin{figure}\r
+\centering\r
+\includegraphics[width=\columnwidth]{AptP2PDuration-online_6.eps}\r
+\caption{The fraction of peers that, given their current duration in\r
+the system, will stay online for another 6 hours.}\r
+\label{duration_online_6}\r
+\end{figure}\r
+\r
+Since our peers are much longer-lived than other P2P systems, we\r
+also looked at the fraction of peers that stay online for another 6\r
+hours. Figure~\ref{duration_online_6} shows that over 60\% of peers\r
+that are online for 10 hours will stay online for another 6.\r
+However, we see an interesting decrease in this fraction between 8\r
+and 12 hours, which can also be seen in\r
+Figure~\ref{duration_online_1}. We believe this to be due to desktop\r
+users, who regularly turn off their computers at night.\r
+\r
+\subsection{Peer Statistics}\r
+\label{peer_stats}\r
+\r
+On July 31st we enhanced our walker to retrieve additional\r
+information from each contacted peer. The peers are configured, by\r
+default, to publish some statistics on how much they are downloading\r
+and uploading, and their measured response times for DHT queries.\r
+Our walker can extract this information if the peer is not\r
+firewalled or NATted, it has not disabled this functionality, and if\r
+it uses the same port for both its DHT (UDP) requests and download\r
+(TCP) requests (which is also a configuration parameter).\r
+\r
+\begin{figure}\r
+\centering\r
+\includegraphics[width=\columnwidth]{AptP2PDownloaded-peers.eps}\r
+\caption{The number of peers that were contacted to determine their\r
+bandwidth, and the total number of peers in the system.}\r
+\label{down_peers}\r
+\end{figure}\r
+\r
+Figure~\ref{down_peers} shows the total number of peers we have been\r
+able to contact since starting to gather this additional\r
+information, as well as how many total peers were found. We were\r
+only able to contact 30\% of all the peers that connected to the\r
+system during this time.\r
+\r
+\begin{figure}\r
+\centering\r
+\includegraphics[width=\columnwidth]{AptP2PDownloaded-bw.eps}\r
+\caption{The bandwidth of data that the contacted peers have\r
+downloaded and uploaded.}\r
+\label{down_bw}\r
+\end{figure}\r
+\r
+Figure~\ref{down_bw} shows the amount of data the peers we were able\r
+to contact have downloaded. Peers measure their downloads from other\r
+peers and mirrors separately, so we are able to get an idea of how\r
+much savings our system is generating for the mirrors. We see that\r
+the peers are downloading approximately 20\% of their package data\r
+from other peers, which is saving the mirrors from supplying that\r
+bandwidth. The actual numbers are only a lower bound, since we have\r
+only contacted 30\% of the peers in the system, but we can estimate\r
+that \texttt{apt-p2p} has already saved the mirrors 15 GB of\r
+bandwidth, or 1 GB per day.\r
+\r
+We also collected the statistics on the measured response time peers\r
+were experiencing when sending requests to the DHT. We found that\r
+the recursive \texttt{find\_value} query, which is necessary before\r
+a download can occur, is taking 17 seconds on average. This\r
+indicates that, on average, requests are experiencing almost 2\r
+stalls while waiting for the 9 second timeouts to occur on\r
+unresponsive peers. This is longer than our target of 10 seconds,\r
+although it will only lead to a slight average delay in downloading\r
+of 1.7 seconds when the default 10 concurrent downloads are\r
+occurring.This increased response time is due to the number of peers\r
+that were behind firewalls or NATs, which was much higher than we\r
+anticipated. We do have plans to improve this through better\r
+informing of users of their NATted status, the use of STUN\r
+\cite{STUN} to circumvent the NATs, and by better exclusion of\r
+NATted peers from the DHT (which will not prevent them from using\r
+the system).\r
+\r
+We were also concerned that the constant DHT requests and responses,\r
+even while not downloading, would overwhelm some peers' network\r
+connections. However, we found that peers are using 200 to 300 bytes\r
+per second of bandwidth in servicing the DHT. These numbers are\r
+small enough to not affect any other network services the peer would\r
+be running.\r
+\r
+\r
+\section{Related Work}\r
+\label{others}\r
+\r
+There have also been preliminary attempts to implement peer-to-peer distributors for\r
+software packages. apt-torrent \cite{apttorrent} creates torrents\r
+for some of the larger packages available, but it ignores the\r
+smaller packages, which are often the most popular. DebTorrent\r
+\cite{debtorrent} makes widespread modifications to a traditional\r
+BitTorrent client, to try and fix the drawbacks mentioned in section\r
+\ref{bittorrent}. However, these changes also require some\r
+modifications to the distribution system to support it.\r
+\r
+Others have also used DHTs to support this type of functionality.\r
+Kenosis \cite{kenosis} is a P2P Remote Procedure Call\r
+client also based on the Kademlia DHT, but it is meant as a P2P\r
+primitive system on which other tools can be built, and so it has no\r
+file sharing functionality at all. Many have used a DHT as a drop-in\r
+replacement for the tracker in a BitTorrent system\r
+\cite{bittorrent-dht, azureus-dht}, but such systems only use the\r
+DHT to find peers for the same torrent, so the file sharing uses\r
+traditional BitTorrent and so is not ideal for the reasons listed\r
+above.\r
+\r
+Our solution differs from them in that ...\r
+\r
+\r
+%%%%%%%%%%%%%%%%%%%%%%%%%%%  Section  %%%%%%%%%%%%%%%%%%%%%%%%%%%%\r
+\r
+\section{Conclusion and Future Work}\r
+\label{conclusions}\r
+\r
+We have designed a generally applicable peer-to-peer content\r
+distribution model to be used by many of the free software\r
+distributors operating today. It makes use of already existing\r
+features of most package management systems to create a\r
+peer-to-peer distribution that should substantially reduce the costs\r
+of hosting the software packages.\r
+\r
+We have also implemented our design in freely available software to\r
+be used in conjuction with Debian-based distribution of Linux\r
+software packages. It is currently in use by some users of the\r
+Debian project's distribution, and so serves as an example of the\r
+possibilities that exist.\r
+\r
+We feel that our P2P software package distribution model is\r
+well-suited to be used by many free software distributors. We hope\r
+to convince them to adopt such a model in their distribution, or we\r
+may port our existing system to some of the other groups for them to\r
+try.\r
+\r
+One aspect missing from our model is the removal of old packages\r
+from the cache. Since our implementation is still relatively young,\r
+we have not had to deal with the problems of a growing cache of\r
+obsolete packages consuming all of a user's hard drive. We plan to\r
+implement some form of least recently used (LRU) cache removal\r
+technique, in which packages that are no longer available on the\r
+server, no longer requested by peers, or simply are the oldest in\r
+the cache, will be removed.\r
+\r
+The most significant area of improvement still needed in our sample\r
+implementation is to further speed up some of the slower recursive\r
+DHT requests. We hope to accomplish this by further tuning the\r
+parameters of our current system, better exclusion of NATted peers\r
+from the routing tables, and through the use of STUN \cite{STUN} to\r
+circumvent the NATs of the 50\% of the peers that have not\r
+configured port forwarding.\r
+\r
+\bibliographystyle{IEEEtran}\r
+\bibliography{./IEEEabrv,./all}\r
+\r
+\end{document}\r
author	Cameron Dale <camrdale@gmail.com>
	Tue, 26 Aug 2008 16:39:37 +0000 (09:39 -0700)
committer	Cameron Dale <camrdale@gmail.com>
	Tue, 26 Aug 2008 16:39:37 +0000 (09:39 -0700)
docs/abstract/apt-p2p-abstract.kilepr		patch \| blob \| history
docs/paper/apt-p2p-paper.kilepr		patch \| blob \| history
docs/paper/paper.tex		patch \| blob \| history