HTTP (HyperText Transfer Protocol) is the main protocol behind the World Wide Web. It defines how clients access web servers and retrieve documents and other data from them. Below is a brief description of how HTTP works.
HTTP has seen several revisions since it was originally proposed. The revision in use today is HTTP/1.1, specified in RFC 2616 (June 1999). The differences between HTTP/1.0 and HTTP/1.1 will be indicated where appropriate.
HTTP is a client-server protocol that runs over TCP, where commands and replies (excluding the transferred data itself) are plain text. For secure communication, HTTP can be carried over an encrypted SSL/TLS connection, which uses certificates to authenticate the server. In this case, the protocol is known as HTTPS, where S stands for 'secure'. The protocol is message-based, and forms a dialog between the client and the server. However, HTTP is stateless - every message is treated separately. For that reason, any data that must persist between requests, such as session variables, has to be carried in the messages themselves, typically in headers.
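An HTTP request message has the following form: a request line containing the method, the Request-URI and the protocol version, then zero or more header lines, then an empty line, and finally an optional message body.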
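As a minimal sketch, the layout of a request (a request line, header lines, an empty line, then an optional body) can be assembled in Python as shown here. The function name build_request and the host www.example.com are illustrative assumptions, not part of the protocol or of any library.

```python
def build_request(method, request_uri, headers, body=b"", version="HTTP/1.1"):
    """Serialize a request: request line, header lines, empty line, body.

    build_request is a hypothetical helper written for this illustration.
    """
    # Request line: method, Request-URI and protocol version, space-separated.
    lines = [f"{method} {request_uri} {version}"]
    # One "Name: value" line per header.
    lines += [f"{name}: {value}" for name, value in headers.items()]
    # Lines end with CRLF; an empty line separates the headers from the body.
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("ascii") + body


# www.example.com is a placeholder host used only for this example.
request = build_request("GET", "/", {"Host": "www.example.com"})
print(request.decode("ascii"))
```

The function only formats the message. Actually sending it would require opening a TCP connection to the server, which is left out here.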
The protocol defines eight basic commands, or 'methods' (three in HTTP/1.0). The methods are also classified as 'safe' or 'unsafe' - safe methods do not have side effects and do not change the state of the server.
HEAD - similar to GET, but only retrieves the headers of a particular resource. Useful for checking resources for changes.
PUT - uploads a resource. Unlike POST, which uses the Request-URI to indicate the URI of the resource that should process the data, PUT uses it to indicate the location of the uploaded resource - the web server must not pass the data on for further processing.
OPTIONS - when sent with an asterisk (*) as the Request-URI, the web server returns a list of methods that are supported server-wide. This is most useful for checking a proxy server for HTTP/1.1 compliance - an HTTP/1.0 proxy would ignore this method, as it is unknown in the HTTP/1.0 specification.
TRACE - useful for debugging client-server interaction. Upon execution, the last server to receive the request (either the web server holding the indicated resource, or a proxy that receives the request with the Max-Forwards header set to 0) should return the request data that it got as the body of the reply. This is useful both for checking whether the target web server is receiving the requests correctly, and for probing the proxy chain between the client and the server for potential infinite loops.
Upon receiving and processing a request, the web server must send a reply to the client. A reply consists of a status line (the protocol version, a three-digit status code and a short reason phrase), followed by header lines, an empty line and an optional message body.
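To illustrate the reply layout (a status line, header lines, an empty line, then the body), here is a rough Python sketch that splits a raw reply into its parts. The name parse_reply and the sample reply are assumptions made for this example. The sketch assumes the whole reply is already in memory and does not handle malformed input.

```python
def parse_reply(raw):
    """Split a raw reply into (version, code, reason, headers, body).

    parse_reply is a hypothetical helper, not a library function.
    """
    # The first empty line separates the status line and headers from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")

    # Status line: protocol version, three-digit code, reason phrase.
    version, _, rest = status_line.partition(" ")
    code, _, reason = rest.partition(" ")

    # Header names are case-insensitive, so store them in lower case.
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers, body


# A made-up reply used only to exercise the parser.
raw_reply = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
)
version, code, reason, headers, body = parse_reply(raw_reply)
print(version, code, reason, headers, body)
```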
The most important part of the reply is the status code. The status code indicates success or failure and, in the case of a failure, whether the error lies with the client or with the server. The status codes are divided into five groups, based on the first digit.
Codes starting with 1 are informational - they contain information about the current request, but do not constitute a full reply. The HTTP/1.0 protocol does not define any such codes, thus a web server must not send any to a client that identifies itself as HTTP/1.0. An example of such a code is 100 Continue, which indicates that the server has received the first part of the request, and that the client should continue sending the rest.
Codes starting with 2 indicate success. The most popular code in this group is 200 OK, which indicates that the action requested by the client was successful. This code is followed by the appropriate reply to the client's request. Other commonly used codes include 204 No Content and 206 Partial Content (HTTP/1.1 only).
Codes starting with 3 specify that a redirection has taken place, and that another action is required on the part of the client to reach the intended resource. It is the client's job to detect a redirection loop. The most popular codes are 302 Found, signifying a temporary change in the resource location, and 301 Moved Permanently, signifying a permanent change in the resource location.
The next two classes are error codes. Codes starting with 4 indicate client-side errors, that is, errors caused by the request the client has sent. The most popular are 401, 403 and 404, meaning Unauthorized, Forbidden and Not Found respectively. For example, a 404 code is sent when the client attempts to access a URL that does not exist, and a 403 code is sent when the client doesn't have rights to access the resource.
Finally, codes starting with 5 indicate an error that occurred on the server - for example, as a result of bad configuration, or a script that returned an error upon execution. Other causes of such errors are an attempt to use functionality that the server doesn't support (501 Not Implemented), or simply an attempt to access an overloaded server (503 Service Unavailable).
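The grouping of status codes by their first digit can be sketched in a few lines of Python. The names STATUS_CLASSES and status_class are illustrative assumptions, and the class labels simply paraphrase the descriptions above.

```python
# Map the first digit of a status code to the class described in the text.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_class(code):
    """Return the class of a three-digit status code (hypothetical helper)."""
    if not 100 <= code <= 599:
        raise ValueError(f"not a valid status code: {code}")
    # Integer division by 100 leaves only the first digit.
    return STATUS_CLASSES[code // 100]


for code in (100, 200, 302, 404, 500):
    print(code, status_class(code))
```

Because only the first digit is inspected, a client written this way can still react sensibly to a code it does not recognize, by treating it like any other code of the same class.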
An example of loading the Google.com home page follows:
Request:
GET / HTTP/1.1
Host: www.google.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.6)
Accept: text/xml,application/xhtml+xml,text/html;q=0.9
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

Reply:
HTTP/1.1 200 OK
Cache-Control: private
Content-Type: text/html
Content-Encoding: gzip
Server: GWS/2.1
Content-Length: 1426

<html><head>...
A few things in the above dump are worth mentioning. First, the browser sends a string identifying itself in the User-Agent header. This header can be used to serve different content depending on the requesting browser. Second, the browser specifies which types of content it expects to receive, and whether it supports compression. Finally, it indicates that it wants to keep the TCP connection open for further requests, instead of having it closed as soon as the reply has been sent. Such persistent connections became the default behavior in HTTP/1.1.
In the reply, the two most important headers are Content-Type and Content-Length. The former specifies the MIME type of the content - a standard identifier that a web browser or other program can use to decide how to treat and display the incoming data. The latter states the length, in bytes, of the data following the headers, and is vital for the browser to know when it has received the whole page.
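The role of Content-Length on a connection that stays open can be shown with a short Python sketch. It reads two replies back to back from one byte stream, using Content-Length to find where each body ends. The name read_reply and the sample replies are assumptions made for this example, and an in-memory io.BytesIO stands in for a real connection. Other ways of delimiting the body are ignored.

```python
import io


def read_reply(stream):
    """Read one reply from a binary stream (hypothetical helper).

    Returns the status line, the MIME type and the body.
    """
    status_line = stream.readline().decode("iso-8859-1").rstrip("\r\n")

    # Read header lines until the empty line that ends the header section.
    headers = {}
    while True:
        line = stream.readline().decode("iso-8859-1").rstrip("\r\n")
        if not line:
            break
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    # Content-Length says how many bytes of body follow the headers.
    # Note: a real socket may need repeated reads to collect this many bytes.
    length = int(headers.get("content-length", "0"))
    body = stream.read(length)

    # The MIME type is the part of Content-Type before any ";parameter".
    mime_type = headers.get("content-type", "").split(";")[0].strip()
    return status_line, mime_type, body


# Two made-up replies sent one after the other on the same "connection".
stream = io.BytesIO(
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"<html></html>"
    b"HTTP/1.1 404 Not Found\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 9\r\n"
    b"\r\n"
    b"not found"
)
first = read_reply(stream)
second = read_reply(stream)
print(first)
print(second)
```

Without the Content-Length value, the reader in this sketch would have no way of telling where the first body stops and the second status line begins.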