EFFICIENT WEB SERVER LOG PROCESSING
TECHNICAL FIELD
The present invention is directed to data storage, retrieval, and processing techniques.
BACKGROUND
As computer use, and particularly the use of the World Wide Web, becomes more and more prevalent, the volume of traffic processed by web servers grows larger and larger.
Figure 1 is a high-level block diagram of the environment in which a web server operates. The diagram shows client computer systems, such as client computer systems 110 and 120. A user of client computer system 110 uses a web client 111, such as a web browser, to send an HTTP request via the Internet 130 to a server computer system 140. The HTTP request is directed to the server computer system by including a URL identifying the server computer system in the HTTP request. A web server program 141 executing in the server computer system processes the HTTP request by retrieving content referenced in the HTTP request from a content database 151 and returning it to the web client on the client computer system. Also in response to the HTTP request, the web server typically records information relating to the HTTP request in a web server log database 152.
By executing queries against the web server log database using a relational database engine, the contents of the log may be used to analyze the performance of the web server, or to discern the types of requests issued by client computer systems.
As the number of users using the World Wide Web grows, the rate at which a typical server computer system receives HTTP requests grows considerably. Accordingly, techniques for more efficiently storing, accessing, and analyzing web server log information would be of significant utility.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is a high-level block diagram of the environment in which a web server operates.
Figure 2 is a high-level block diagram that shows the environment in which the facility operates.
Figure 3 is a chart diagram showing the contents of a typical placement web server log record.
Figure 4 is a chart diagram showing the contents of a typical advertiser web server log record.
Figure 5 is a data structure diagram showing the human-readable contents of a group of typical element server log records.
Figure 6 is a data structure diagram showing the human-readable contents of a group of typical advertiser server log records.
DETAILED DESCRIPTION
A software facility for efficiently storing and accessing web server logs ("the facility") is provided.
In certain embodiments, the facility distributes web server log entries across multiple log files to facilitate parallel reading and analysis of the web server log by a number of processors simultaneously. The facility achieves such distribution of web server log entries using a mapping that maps each new web server log entry into a single log file. In various embodiments, the mapping is based upon a unique identifier identifying a user that originated the web serving request to which the entry corresponds; the type of the web serving request to which the entry corresponds; and/or an effective time for the web serving request to which the entry corresponds. The facility also preferably stores web server log files without maintaining indices thereon, further reducing the processing requirements for storing the log.
The facility preferably processes the logs using specialized analysis programs developed in an efficient general-purpose programming environment, rather than using a conventional relational database engine for such analysis. In some embodiments, these programs are written in C++, and call the Standard Template Library.
In some embodiments, the facility stores each record of the web server log in a condensed binary form to minimize the amount of space on the web server log occupied by each record. The facility preferably further compresses each record using a compression technique optimized for high-speed decompression. By storing and analyzing log entries in this manner, the facility achieves new levels of efficiency in performing this logging and analysis function. Further, the facility is highly scalable, and provides powerful log analysis capabilities.
Figure 2 is a high-level block diagram that shows the environment in which the facility operates. From this diagram, it can be appreciated that the facility 210 executes on the server computer system 140 within the environment described above in conjunction with Figure 1. The diagram also shows a number of log analysis computer systems, such as log analysis computer systems 221 and 222, that are connected to the server computer system 140, and that may be used by the facility to execute log analysis programs in parallel against different log partitions. Each log analysis computer system may have a single processor, or may have multiple processors.
While preferred embodiments are described in terms of the environment described above, those skilled in the art will appreciate that the facility may be implemented in a variety of other environments, including a single, monolithic computer system, as well as various other combinations of computer systems or similar devices.
To more fully illustrate its implementation and operation, the facility is described in conjunction with an example. The example relates to a specialized web server for managing and monitoring Internet advertising activity. In this specialized environment, the web server receives two types of HTTP requests: placement requests and advertiser requests. A placement request is sent when a user is visiting a web page that contains advertising, such as a banner advertising message. Retrieval of such a web page from its publisher by the user causes the client to issue a placement request to the web server to return the content of the advertising message. Placement requests are also received at the web server when the user clicks on the advertising message in order to browse to the web site of the advertiser. The web server receives an advertiser request when the user visits a particular page on the advertiser's web site. While the records stored in the web server log for placement and advertiser requests have several fields in common, they diverge in several respects.
Figure 3 is a chart diagram showing the contents of a typical placement web server log record. Each row of the chart corresponds to a different field defined for placement web server records. The Unix timestamp field in position one indicates the date and time that the placement request was received. In the example record, the Unix timestamp field contains the decimal value "921746335." This human-readable value is preferably stored in binary form as the hexadecimal value "36 F0 BB 9F."
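As a minimal sketch of the binary encoding described above, the following C++ fragment converts a 32-bit value such as the Unix timestamp into four bytes and renders them as hexadecimal. The function names encode_u32_be, decode_u32_be, and to_hex are hypothetical. The most-significant-byte-first order is an assumption: the description gives the hexadecimal digits but does not state the byte order used in the log file.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Encode a 32-bit value as four bytes, most significant byte first
// (byte order is an assumption; the description does not specify it).
std::array<std::uint8_t, 4> encode_u32_be(std::uint32_t value) {
    std::array<std::uint8_t, 4> out = {{
        static_cast<std::uint8_t>(value >> 24),
        static_cast<std::uint8_t>(value >> 16),
        static_cast<std::uint8_t>(value >> 8),
        static_cast<std::uint8_t>(value)}};
    return out;
}

// Inverse of encode_u32_be.
std::uint32_t decode_u32_be(const std::array<std::uint8_t, 4>& bytes) {
    return (static_cast<std::uint32_t>(bytes[0]) << 24) |
           (static_cast<std::uint32_t>(bytes[1]) << 16) |
           (static_cast<std::uint32_t>(bytes[2]) << 8) |
           static_cast<std::uint32_t>(bytes[3]);
}

// Render bytes as space-separated uppercase hexadecimal pairs.
std::string to_hex(const std::uint8_t* data, std::size_t count) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < count; ++i) {
        if (i != 0) out += ' ';
        out += digits[data[i] >> 4];
        out += digits[data[i] & 0x0F];
    }
    return out;
}
```

Under these assumptions, encoding the example timestamp 921746335 and passing the result to to_hex yields the string "36 F0 BB 9F", matching the value given for the example record.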
The placement field in position two identifies the Internet publisher from which the placement request was received. In the example record, the human-readable placement label is "247_garden_030199sj_4."
This human-readable value is preferably stored in binary form as the hexadecimal value "00 00 12 9C."
The advertising message ID field in position three indicates the particular advertising message to which the record relates. The example record has an advertising message ID of "27440," which is preferably stored in binary form as the hexadecimal value "00 00 6B 30."
The cookie field in the fourth position is an 8-byte value stored on the user's computer system to uniquely identify it to the web server. In the example record, the cookie is "918503847-16924218," preferably stored in binary form as hexadecimal value "36 BF 41 A7 01 02 3E 3A."
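The example suggests that the cookie text consists of two decimal numbers joined by a hyphen, each of which fits in four bytes. The sketch below packs such a cookie into eight bytes under that reading. The names parse_u32 and pack_cookie are hypothetical, and both the "high-low" interpretation and the big-endian byte order are assumptions inferred from the single example, not a format stated in the description.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Parse a non-empty string of decimal digits into a 32-bit value.
// Returns false on any non-digit character or on overflow.
bool parse_u32(const std::string& text, std::uint32_t& out) {
    if (text.empty() || text.size() > 10) return false;
    std::uint64_t value = 0;
    for (char c : text) {
        if (c < '0' || c > '9') return false;
        value = value * 10 + static_cast<std::uint64_t>(c - '0');
    }
    if (value > 0xFFFFFFFFull) return false;
    out = static_cast<std::uint32_t>(value);
    return true;
}

// Pack a cookie of the assumed form "<decimal>-<decimal>" into 8 bytes:
// the first number fills bytes 0..3, the second fills bytes 4..7.
// On failure, `out` is left unchanged.
bool pack_cookie(const std::string& text, std::array<std::uint8_t, 8>& out) {
    const std::size_t dash = text.find('-');
    if (dash == std::string::npos) return false;
    std::uint32_t high = 0;
    std::uint32_t low = 0;
    if (!parse_u32(text.substr(0, dash), high) ||
        !parse_u32(text.substr(dash + 1), low)) {
        return false;
    }
    const std::uint64_t value = (static_cast<std::uint64_t>(high) << 32) | low;
    for (std::size_t i = 0; i < 8; ++i) {
        out[i] = static_cast<std::uint8_t>(value >> (56 - 8 * i));
    }
    return true;
}
```

With the example cookie "918503847-16924218", this sketch produces the bytes 36 BF 41 A7 01 02 3E 3A, which agrees with the hexadecimal value given for the example record.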
The IP address field in position five indicates the Internet protocol address at which the placement request originated. In the example record, the IP address field contains "130.166.82.106," which is preferably represented in binary form as the hexadecimal value "82 A6 52 6A." The hash value of user-agent field in position 6 contains a hashed indication of the user-agent, and is optional.
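The four-byte form of the IP address is simply one byte per dotted-decimal component. A minimal sketch of that conversion follows; the function name pack_ipv4 is hypothetical, and the strict validation (exactly four components, each 0 to 255, no extra characters) is a design choice of this sketch rather than something the description requires.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Convert dotted-decimal IPv4 text such as "130.166.82.106" into four bytes,
// one byte per component in the order written. Returns false on malformed input.
bool pack_ipv4(const std::string& text, std::array<std::uint8_t, 4>& out) {
    std::array<std::uint8_t, 4> bytes = {{0, 0, 0, 0}};
    std::size_t pos = 0;
    for (std::size_t part = 0; part < 4; ++part) {
        unsigned value = 0;
        std::size_t digits = 0;
        // Read up to three decimal digits for this component.
        while (pos < text.size() && text[pos] >= '0' && text[pos] <= '9' && digits < 3) {
            value = value * 10 + static_cast<unsigned>(text[pos] - '0');
            ++pos;
            ++digits;
        }
        if (digits == 0 || value > 255) return false;
        bytes[part] = static_cast<std::uint8_t>(value);
        if (part < 3) {
            // A dot must separate this component from the next.
            if (pos >= text.size() || text[pos] != '.') return false;
            ++pos;
        }
    }
    if (pos != text.size()) return false;  // reject trailing characters
    out = bytes;
    return true;
}
```

For the example address "130.166.82.106" the resulting bytes are 82 A6 52 6A in hexadecimal, as stated above.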
The flags field in position 7 contains flags identifying the specific nature of the placement request, such as an "impression" flag. Placement log entries do not have an eighth field.
Figure 4 is a chart diagram showing the contents of a typical advertiser web server log record. The Unix timestamp field in position one indicates the date and time that the request was received. In the example record, the Unix timestamp field contains the value "942933252," which is preferably stored in binary form as the hexadecimal value "38 34 05 04."
The action tag field in position 2 indicates the nature of the action that the user took on the advertiser's web page. In the example record, the action tag has a value of "ubid_prd," which is preferably stored in binary form as the hexadecimal value "00 00 43 A1." The third position contains a 4-byte field that is not used. The cookie field in the fourth position is an 8-byte value stored on the user's computer system to uniquely identify it to the web server. In the example record, the cookie contains the value "942120440-2699748," preferably stored in binary form as hexadecimal value "38 27 9D F8 00 29 31 E4." The IP address field in position five indicates the Internet protocol address at which the advertiser request originated. In the example record, the IP address field contains the value "24.218.79.61," which is preferably represented in binary form as the hexadecimal value "18 DA 4F 3D." The hash value of user-agent field in position 6 contains a hashed indication of the user-agent, and is optional.
The flags field in position 7 contains flags identifying the specific nature of the advertiser request, such as an "impression" flag. The flags field for the sample rows preferably includes both impression and action flags.
The extended data field in position 8 contains "extended data" further describing the user's action on the advertiser's web site. For example, the extended data field may contain coded indications of the amount of money that the user spent or the type of item that the user viewed information about or purchased. The extended data is textual in nature, and preferably is comprised of extended data pairs, separated by slashes, that are of the form "[subfield name].[subfield data]." In the example advertiser request, the extended data field contains the extended data pair "b.17," which indicates that subfield b has value 17. Because extended data is of a textual nature, is of variable length, and is of potentially substantial length, when an advertiser request is received by the web server, the facility preferably stores two different log records to represent each advertiser request: a log record in binary form excluding the extended data field, and a log record in textual form including the extended data field. Retrieval operations where the contents of the extended data field are not of interest may be performed quickly against a log containing the binary version, while retrieval operations in which the contents of the extended data field are significant may be performed against the textual version that contains the extended data.
Figure 5 is a data structure diagram showing the human-readable contents of a group of typical element server log rows.
Figure 6 is a data structure diagram showing the human-readable contents of a group of typical advertiser server log rows.
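The extended data format described above, slash-separated pairs of the form "[subfield name].[subfield data]", can be parsed with a short routine such as the following sketch. The names ExtendedPair and parse_extended_data are hypothetical. Two details are assumptions not settled by the description: the pair is split at its first period, and tokens that lack a subfield name or a period are silently skipped.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One extended data pair: (subfield name, subfield data).
using ExtendedPair = std::pair<std::string, std::string>;

// Split an extended data field such as "b.17/c.shoes" into its pairs.
// Each slash-separated token is split at its first '.' (an assumption);
// tokens without a name or without a '.' are skipped.
std::vector<ExtendedPair> parse_extended_data(const std::string& field) {
    std::vector<ExtendedPair> pairs;
    std::size_t start = 0;
    while (start <= field.size()) {
        std::size_t end = field.find('/', start);
        if (end == std::string::npos) end = field.size();
        const std::string token = field.substr(start, end - start);
        const std::size_t dot = token.find('.');
        if (dot != std::string::npos && dot > 0) {
            pairs.emplace_back(token.substr(0, dot), token.substr(dot + 1));
        }
        start = end + 1;
    }
    return pairs;
}
```

Applied to the example field "b.17", the routine returns a single pair whose name is "b" and whose data is "17".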
In order to compact the contents of each web server row, the facility preferably compresses each row with a compression algorithm optimized for fast decompression. In a preferred embodiment, the facility utilizes a variant of the Lempel-Ziv-Oberhumer ("LZO") compression algorithm described at:
http://wildshu.idv.uni-linz.ac.at/mfx/lzodoc.html
In particular, the facility preferably uses the LZO1X-1 variant of the LZO compression algorithm.
Some embodiments of the facility partition the log into a number of different log files, also called log partitions. The log files are partitioned based upon one or more of the following bases: a time period in which the effective time of each entry in the log file is contained; one or more log entry types that the log file contains; and the range of unique identifiers contained by entries in the log. The time period basis for partitioning the logs results in different sets of logs for different periods of time, such as each hour, each day, or each month. The entry type basis for partitioning the log produces different sets of log files for different entry types. This permits many log files to utilize fixed-length entries of the minimum necessary size, and minimizes the number of logs that must be read when analyzing entries of fewer than all of the types. The unique identifier basis for partitioning the log minimizes the number of log files that must be read in order to analyze the behavior of a single user.
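One way to picture the three partitioning bases is as a function from a log entry to a partition key. The sketch below is illustrative only: the names EntryType, PartitionKey, id_bucket, partition_for, and partition_file_name, the choice of hourly periods, the division of the 64-bit identifier space into equal ranges, and the file naming scheme are all assumptions, not details taken from the description.

```cpp
#include <cstdint>
#include <limits>
#include <string>

enum class EntryType { Placement, Advertiser };

// Identifies the log file (partition) that an entry belongs to.
struct PartitionKey {
    std::uint32_t hour_index;  // time basis: hours since the Unix epoch
    EntryType type;            // entry type basis
    std::uint32_t id_bucket;   // unique identifier basis: index of the ID range
};

// Divide the 64-bit identifier space into `bucket_count` equal ranges and
// return the index of the range containing `user_id`.
std::uint32_t id_bucket(std::uint64_t user_id, std::uint32_t bucket_count) {
    if (bucket_count <= 1) return 0;
    const std::uint64_t width =
        std::numeric_limits<std::uint64_t>::max() / bucket_count + 1;
    return static_cast<std::uint32_t>(user_id / width);
}

// Map an entry's effective time, type, and user identifier to one partition.
PartitionKey partition_for(std::uint32_t unix_time, EntryType type,
                           std::uint64_t user_id, std::uint32_t bucket_count) {
    PartitionKey key;
    key.hour_index = unix_time / 3600;
    key.type = type;
    key.id_bucket = id_bucket(user_id, bucket_count);
    return key;
}

// Hypothetical file naming scheme, e.g. "placement-h256040-b3.log".
std::string partition_file_name(const PartitionKey& key) {
    const std::string prefix =
        key.type == EntryType::Placement ? "placement" : "advertiser";
    return prefix + "-h" + std::to_string(key.hour_index) +
           "-b" + std::to_string(key.id_bucket) + ".log";
}
```

Because every entry for a given user lands in the same identifier range, an analysis of that user's behavior over a period needs to read only the files for that range. If identifiers are not spread evenly over the identifier space, equal ranges will produce unevenly sized files, and the range boundaries would need to be chosen differently.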
Partitioning the log into separate files in this manner facilitates parallel processing of the log. For example, in performing an analysis of a portion of the log corresponding to 50 log files, the facility may distribute the processing of each of the 50 log files to a different processor, or to a different computer system.
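The distribution of partitions to processors can be sketched on a single machine with standard C++ facilities. In the following hedged example each partition is represented as an in-memory vector and handed to its own asynchronous task; the names LogEntry, LogPartition, and analyze_in_parallel are hypothetical, and dispatching work to separate computer systems, as the description also contemplates, is outside the scope of this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <future>
#include <vector>

// Minimal stand-in for a decoded log entry (fields are illustrative).
struct LogEntry {
    std::uint32_t unix_time;
    std::uint64_t user_id;
};

using LogPartition = std::vector<LogEntry>;

// Run `analyze` on every partition concurrently, one task per partition,
// and return the per-partition results in partition order.
std::vector<std::size_t> analyze_in_parallel(
    const std::vector<LogPartition>& partitions,
    const std::function<std::size_t(const LogPartition&)>& analyze) {
    std::vector<std::future<std::size_t>> pending;
    pending.reserve(partitions.size());
    for (const LogPartition& partition : partitions) {
        pending.push_back(std::async(std::launch::async,
            [&analyze, &partition] { return analyze(partition); }));
    }
    std::vector<std::size_t> results;
    results.reserve(pending.size());
    for (std::future<std::size_t>& task : pending) {
        results.push_back(task.get());  // blocks until that partition is done
    }
    return results;
}
```

No locking is needed here because each task reads only its own partition. Launching one task per partition is reasonable for a few dozen files; for much larger partition counts a fixed-size pool of workers would likely be preferable.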
Rather than analyzing the contents of the log and log files conventionally by submitting queries to a relational database engine, the facility preferably uses programs constructed in a general-purpose programming environment to directly read and analyze the log as partitioned into log files. In particular, embodiments of the facility utilize the Standard Template Library (STL) now provided as part of the C++ programming language to provide this functionality. The STL, which is well known to those skilled in the art, is described succinctly in the document "An Overview of The Standard Template Library" by G. Bowden Wise, currently available at http://www.cs.rpi.edu/~wiseb/xrds/ovp2-3b.html, and is more formally defined in chapters 23-25 of International Standard ISO/IEC 14882 for "Programming Languages - C++", adopted by the American National Standards Institute on July 27, 1998 and adopted by the International Standardization Organization on September 1, 1998, currently available at http://webstore.ansi.org/.
The STL provides an orthogonal component structure, in which data is collected in containers, accessed there by iterators, and thereby provided to algorithms for processing. In order to perform analysis on a log file, the facility first reads the log file into a container such as a vector so that each entry in the log file occupies an item in the vector. The facility sorts the vector first by unique identifier, then by effective time. The facility then uses an iterator to iterate through the sorted vector in order. Because the vector is sorted first by unique identifier, all of the entries relating to a particular unique identifier occur contiguously in the sorted vector. Because the vector is sorted second by effective time, these contiguous entries relating to a single unique identifier occur in the order of their effective times. Thus, an algorithm receiving data from an iterator iterating through the sorted vector receives log entries for each unique identifier in turn. For a particular unique identifier, the entries are received in the order of their effective times. The algorithm utilized by the facility performs the desired analysis, such as analyzing a correspondence between a web serving request for placing a product in a shopping cart, and a later web serving request for authorizing payment for the product. Those skilled in the art will appreciate that algorithms performing virtually any type of analysis may be incorporated in the facility.
It will be understood by those skilled in the art that the above-described facility could be adapted or extended in various ways. For example, the facility may log and analyze transactions of various types other than web serving transactions. Also, log contents may be analyzed using tools other than C++ programs calling the standard template library.
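The sort-then-iterate analysis described above can be illustrated with a small STL-based sketch. It counts the users for whom a shopping cart request is followed by a payment authorization request. The names Action, LogEntry, sort_for_analysis, and count_cart_to_payment, and the idea of tagging each entry with an action, are assumptions made for illustration; the description does not specify how request kinds are represented in an entry.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <vector>

enum class Action { AddToCart, AuthorizePayment, Other };

// Illustrative decoded log entry.
struct LogEntry {
    std::uint64_t user_id;    // unique identifier (cookie)
    std::uint32_t unix_time;  // effective time
    Action action;
};

// Sort first by unique identifier, then by effective time.
// stable_sort keeps the input order of entries that tie on both keys.
void sort_for_analysis(std::vector<LogEntry>& log) {
    std::stable_sort(log.begin(), log.end(),
        [](const LogEntry& a, const LogEntry& b) {
            return std::tie(a.user_id, a.unix_time) <
                   std::tie(b.user_id, b.unix_time);
        });
}

// Count users who have an AddToCart entry followed, later in their own
// time-ordered run of entries, by an AuthorizePayment entry.
std::size_t count_cart_to_payment(std::vector<LogEntry> log) {
    sort_for_analysis(log);
    std::size_t converted = 0;
    std::vector<LogEntry>::const_iterator it = log.begin();
    while (it != log.end()) {
        const std::uint64_t user = it->user_id;
        bool saw_cart = false;
        bool paid_after_cart = false;
        // All entries for `user` are contiguous and in time order.
        for (; it != log.end() && it->user_id == user; ++it) {
            if (it->action == Action::AddToCart) {
                saw_cart = true;
            } else if (it->action == Action::AuthorizePayment && saw_cart) {
                paid_after_cart = true;
            }
        }
        if (paid_after_cart) ++converted;
    }
    return converted;
}
```

Because the sort makes each user's entries contiguous, the analysis needs only a single forward pass and a few flags of per-user state, with no index or lookup table over the log.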
While the foregoing description makes reference to preferred embodiments, the scope of the invention is defined solely by the claims that follow and the elements recited therein.