WO2002015520A1

WO2002015520A1 - Packet scheduling methods and apparatus

Info

Publication number: WO2002015520A1
Application number: PCT/CA2000/000937
Authority: WO
Inventors: Renwei Li; Peter Wai-Tong Kwong; Paul Terry; Ronald Leonard Westfall
Original assignee: Redback Networks Inc.
Priority date: 2000-08-17
Filing date: 2000-08-17
Publication date: 2002-02-21
Also published as: AU2000266748A1

Abstract

Providing different levels of quality of service for different data flows being transported over a data link requires a very fast way to schedule individual packets for forwarding on the data link. The invention provides scheduling methods which give preference to higher priority packets while treating lower priority packets fairly. The methods can provide shorter latencies for higher priority packets than can many prior scheduling methods. The methods and apparatus of the invention are readily adaptable for use with scheduling rules provided in the form of hierarchical policy trees.

Description

PACKET SCHEDULING METHODS AND APPARATUS

Field of the Invention

This invention relates to the transmission of data over communications networks including wide area networks. More specifically, this invention relates to methods and apparatus for scheduling data packets for transmission over a data link. The scheduling methods and apparatus may be used in systems for providing a plurality of differentiated services each providing a different level of Quality of Service ("QoS") over wide area networks. The scheduling methods and apparatus have particular application in Internet Protocol ("IP") networks.

Background of the Invention

Maintaining efficient flow of information over data communication networks is becoming increasingly important in today's economy. Telecommunications networks are evolving toward a connectionless model from a model whereby the networks provide end-to-end connections between specific points. In a network which establishes specific end-to-end connections to service the needs of individual applications the individual connections can be tailored to provide a desired bandwidth for communications between the end points of the connections. This is not possible in a connectionless network. The connectionless model is desirable because it saves the overhead implicit in setting up connections between pairs of endpoints and also provides opportunities for making more efficient use of the network infrastructure through statistical gains. Many networks today provide connectionless routing of data packets, such as Internet Protocol ("IP") data packets, over a network which includes end-to-end connections for carrying data packets between certain parts of the network. The end-to-end connections may be provided by technologies such as Asynchronous Transfer Mode ("ATM"), Time Division Multiplexing ("TDM") and SONET/SDH.

A Wide Area Network ("WAN") is an example of a network in which the methods of the invention may be applied. WANs are used to provide interconnections capable of carrying many different types of data between geographically separated nodes. For example, the same WAN may be used to transmit video images, voice conversations, e-mail messages, data to and from database servers, and so on. Some of these services place different requirements on the WAN. For example, transmitting a video signal for a video conference requires fairly large bandwidth, short delay (or "latency"), small delay jitter, and reasonably small data loss ratio. On the other hand, transmitting e-mail messages or application data can generally be done with lower bandwidth but can tolerate no data loss. Further, it is not usually critical that e-mail be delivered instantly. E-mail services can usually tolerate longer latencies and lower bandwidth than other services.

A typical WAN comprises a shared network which is connected by access links to two or more geographically separated customer premises. Each of the customer premises may include one or more devices connected to the network. More typically each customer premise has a number of computers connected to a local area network ("LAN"). The LAN is connected to the WAN access link at a service point. The service point is generally at a "demarcation" unit or "interface device" which collects data packets from the LAN which are destined for transmission over the WAN and sends those packets across the access link. The demarcation unit also receives data packets coming from the WAN across the access link and forwards those data packets to destinations on the LAN.

Currently an enterprise which wishes to link its operations by a WAN obtains an unallocated pool of bandwidth for use in carrying data over the WAN. While it is possible to vary the amount of bandwidth available in the pool (by purchasing more bandwidth on an as-needed basis), there is no control over how much of the available bandwidth is taken by each application.

As noted above, guaranteeing the Quality of Service ("QoS") needed by applications which require low latency is typically done by dedicating end-to-end connection-oriented links to each application. This tends to result in an inefficient allocation of bandwidth. Network resources which are committed to a specific link are not readily shared, even if there are times when the link is not using all of the resources which have been allocated to it. Thus committing resources to specific end-to-end links reduces or eliminates the ability to achieve statistical gains. Statistical gains arise from the fact that it is very unlikely that every application on a network will be generating a maximum amount of network traffic at the same time. If applications are not provided with dedicated end-to-end connections but share bandwidth then each application can, in theory, share equally in the available bandwidth. In practice, however, the amount of bandwidth available to each application depends on things such as router configuration, the location(s) where data for each application enters the network, the speeds at which the application can generate the data that it wishes to transmit on the network and so on. The result is that bandwidth may be allocated in a manner that bears no relationship to the requirements of individual applications or to the relative importance of the applications. There are similar inequities in the latencies in the delivery of data packets over the network.

The term Quality of Service is used in various different ways by different authors. In general, QoS refers to a set of parameters which describe the required traffic characteristics of a data connection. In this specification the term QoS refers to a set of one or more of the following interrelated parameters which describe the way that a data connection treats data packets generated by an application:

Minimum Bandwidth - a minimum rate at which a data connection must be capable of forwarding data originating from the application. The data connection might be incapable of forwarding data at a rate faster than the minimum bandwidth but should always be capable of forwarding data at a rate equal to the rate specified by the minimum bandwidth;

Maximum Delay - a maximum time taken for data from an application to completely traverse the data connection. QoS requirements are met only if data packets traverse the data connection in a time equal to or shorter than the maximum delay;

Maximum Loss - a maximum fraction of data packets from the application which may not be successfully transmitted across the data connection; and,

Jitter - a measure of how much variation there is in the delay experienced by different packets from the application being transmitted across the data connection. In an ideal case, where all packets take exactly the same amount of time to traverse the data connection, the jitter is zero. Jitter may be defined, for example, as any one of various statistical measures of the width of a distribution function which expresses the probability that a packet will experience a particular delay in traversing the data connection.
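By way of illustration only, two candidate jitter measures of the kind contemplated above may be sketched as follows. The function names, the choice of standard deviation and range as the statistical measures, and the sample delay values are assumptions made for this sketch and are not taken from the specification.

```python
import statistics


def jitter_stddev(delays_ms):
    """One possible jitter measure: population standard deviation of delays."""
    return statistics.pstdev(delays_ms)


def jitter_range(delays_ms):
    """Another possible measure: spread between the slowest and fastest packet."""
    return max(delays_ms) - min(delays_ms)


# Hypothetical per-packet delays in milliseconds.
identical = [20.0, 20.0, 20.0, 20.0]   # ideal case: every packet takes the same time
varied = [18.0, 22.0, 19.0, 25.0]

print(jitter_stddev(identical), jitter_range(identical))
print(jitter_stddev(varied), jitter_range(varied))
```

Either measure is zero in the ideal case where all packets experience the same delay, which is the only property the definition above requires.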

Different applications require different levels of QoS.

Recent developments in core switches for WANs have made it possible to construct WANs capable of quickly and efficiently transmitting vast amounts of data. There is a need for a way to provide network users with control over the QoS provided to different data services which may be provided over the same network.

Service providers who provide access to WANs wish to provide their customers with Service Level Agreements rather than raw bandwidth. This will permit the service providers to take advantage of statistical gain to more efficiently use the network infrastructure while maintaining levels of QoS that customers require. To do this, the service providers need a way to manage and track usage of these different services. There is a particular need for relatively inexpensive apparatus and methods for facilitating the provision of services which take advantage of different levels of QoS.

Applications connected to a network generate packets of data for transmission on the network. In providing different levels of service it is necessary to be able to sort or "classify" data packets from one or more applications into different classes which will be accorded different levels of service. The data packets can then be transmitted in a way which maintains the required QoS for each application. Data packets generated by one or more applications may belong to the same class.

There are many known methods for scheduling the transmission of packets over a data link. These include simple round robin schemes, HSCF, CBQ, WF²Q and WF²Q+. All of these methods have disadvantages. HSCF is difficult to implement. CBQ, WF²Q and WF²Q+ all introduce undesirably long queuing delays. A problem with many of these scheduling protocols is that they introduce too much delay into the transmission of those packets which must be delivered with minimum latency. There is a need for a fast scheduling method and apparatus which can transmit "real time" packets with very small delays but which can also schedule the transmission of non-real time packets fairly.

Summary of the Invention

This invention provides methods and apparatus for scheduling the forwarding of data packets over a data link. The methods of the invention involve receiving classified data packets. In one embodiment of the invention, the methods include selecting one of a plurality of data packets by selecting an eligible group of data packets and determining whether data packets in the eligible group all belong to classes having the same priority or belong to classes having different priorities. If the data packets in the eligible group belong to two or more classes having different priorities the method selects one data packet by applying a selection criterion to an eligible sub-group containing those one or more data packets in the eligible group which belong to classes having a highest priority. If the data packets in the eligible group all belong to classes having the same priority, the method selects one data packet by applying a selection criterion to all data packets in the eligible group. The method provides reduced queuing delays for packets belonging to higher priority classes.

In preferred embodiments the selection criterion comprises a first to finish selection criterion. The method preferably includes maintaining a virtual time value. Selecting the eligible group preferably comprises selecting packets having a start time less than or equal to the virtual time value.
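By way of illustration only, the selection described above (an eligible group chosen by start time, narrowed to the highest priority present, then a first to finish criterion) may be sketched as follows. The names `HeadPacket` and `select_packet`, the convention that a larger number means a higher priority, and the choice to return `None` when no packet is eligible are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class HeadPacket:
    name: str
    start: float     # start time of the packet
    finish: float    # finish time of the packet
    priority: int    # assumption: larger number = higher priority


def select_packet(heads, virtual_time):
    """Pick one packet from those at the heads of the queues."""
    # Eligible group: packets whose start time is <= the virtual time value.
    eligible = [p for p in heads if p.start <= virtual_time]
    if not eligible:
        return None  # assumption: nothing is selected when no packet is eligible
    # Eligible sub-group: packets of the highest priority present. When all
    # eligible packets share one priority this is the whole eligible group.
    top = max(p.priority for p in eligible)
    subgroup = [p for p in eligible if p.priority == top]
    # First to finish selection criterion.
    return min(subgroup, key=lambda p: p.finish)


heads = [
    HeadPacket("voice", start=0.0, finish=5.0, priority=2),
    HeadPacket("http", start=0.0, finish=2.0, priority=1),
    HeadPacket("mail", start=3.0, finish=4.0, priority=2),
]
print(select_packet(heads, virtual_time=1.0).name)  # voice
print(select_packet(heads, virtual_time=3.0).name)  # mail
```

In the first call the lower priority packet has the earlier finish time but is passed over, which is how the method shortens queuing delay for higher priority classes.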

The invention may be practised with a plurality of scheduling engines interlinked to form a hierarchical tree, the tree including at least a parent scheduling engine and a plurality of child scheduling engines linked to the parent scheduling engine. The parent scheduling engine selects one data packet from the data packets being held by the child scheduling engines. In some embodiments, whenever a data packet belonging to a high priority class becomes available for selection by a child scheduling engine and a data packet already selected and being held by that child scheduling engine belongs to a lower priority class, the data packet belonging to the high priority class is made available for selection by the parent scheduling engine in place of the already selected data packet.

The invention also provides apparatus for scheduling transmission of data packets on a data link, the apparatus comprising: a) a memory capable of holding a plurality of data packets queued in a plurality of queues; b) means for keeping a start time, a finish time and a priority for a packet at a head of each of the queues; c) a scheduling engine adapted to select one packet from a plurality of packets at the heads of the queues, the scheduling engine comprising: i) a counter for maintaining a virtual time for the scheduling engine; ii) means for comparing the start time for each packet to the virtual time for the scheduling engine to select an eligible group of packets; iii) means for comparing the priorities of packets in the eligible group of packets and eliminating from the eligible group packets having a priority lower than a priority for another packet in the eligible group; and, iv) means for selecting one packet from the eligible group having an earliest finish time.

Other aspects and features of the invention are described below.

Brief Description of the Drawings

In the attached drawings which illustrate non-limiting embodiments of the invention:

Figure 1 is a schematic view of a wide area network according to the invention which comprises enterprise service point ("ESP") devices for providing packet scheduling functions according to the invention;

Figure 2 is a schematic view illustrating two flows in a communications network according to the invention;

Figure 3 is a diagram illustrating the various data fields in a prior art IP data packet;

Figure 4 is a schematic view showing an example of a policy which may be implemented with the methods and apparatus of the invention;

Figure 5 is a schematic view of apparatus for scheduling packets according to the invention;

Figure 5A is a schematic illustration showing a structure of a scheduler according to the invention;

Figure 6 is a flow chart illustrating a method according to the invention by which leaf scheduling engines may select and transmit packets;

Figure 6A is a flow chart illustrating a method according to the invention by which non-leaf scheduling engines may select and transmit packets;

Figure 7 is a diagram of a scheduler implemented by a number of hierarchically arranged scheduling engines according to the invention; and,

Figure 8 is a flow chart illustrating a simplified embodiment of the invention.

Detailed Description

This invention may be applied in many different situations where data packets are scheduled and dispatched. The following description discusses the application of the invention to scheduling onward transmission of data packets received at an Enterprise Service Point ("ESP"). The invention is not limited to use in connection with ESP devices but can be applied in almost any situation where classified data packets are scheduled and dispatched.

Figure 1 shows a generalized view of a pair of LANs 20, 21 connected by a WAN 22. Each LAN 20, 21 has an Enterprise Service Point unit ("ESP") 24 which connects LANs 20, 21 to WAN 22 via an access link 26. LAN 20 may, for example, be an Ethernet network or a token ring network. Access link 26 may, for example, be an Asynchronous Transfer Mode ("ATM") link. Each LAN has a number of connected devices 28 which are capable of generating and/or receiving data for transmission on the LAN. Devices 28 typically include network connected computers. As required, various devices 28 on network 20 may establish connections with devices 28 on network 21 and vice versa. Each connection may be called a session. Each session comprises one or more flows. Each flow is a stream of data from a particular source to a particular destination. For example, Figure 2 illustrates a session between a computer 28A on network 20 and a computer 28B on network 21. The session comprises two flows 32 and 33. Flow 32 originates at computer 28A and goes to computer 28B through WAN 22. Flow 33 originates at computer 28B and goes to computer 28A over WAN 22. Computers 28A and 28B each have an address. Most typically data in a great number of flows will pass through each ESP 24 in any short period.

Each flow consists of a series of data packets. In general the data packets may have different sizes. Each packet comprises a header portion which contains information about the packet and a payload or datagram. For example, the packets may be Internet protocol ("IP") packets.

Figure 3 illustrates the format of an IP packet 35 according to the currently implemented IP version 4. Packet 35 has a header 36 and a data payload 38. The header contains several fields. The "version" field contains an integer which identifies the version of IP being used. The current IP version is version 4. The "header length" field contains an integer which indicates the length of header 36 in 32 bit words. The "type of service" field contains a number which can be used to indicate a level of Quality of Service required by the packet. The "total length" field specifies the total length of packet 35. The "identification" field contains a number which identifies the data in payload 38. The "flags" field contains 3 bits which are used to determine whether the packet can be fragmented. The "time-to-live" field contains a number which is decremented as the packet is forwarded. When this number reaches zero the packet may be discarded. The "protocol" field indicates which upper layer protocol applies to packet 35. The "header checksum" field contains a checksum which can be used to verify the integrity of header 36. The "source address" field contains the IP address of the sending node. The "destination address" field contains the IP address of the destination node. The "options" field may contain information related to packet 35.
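By way of illustration only, the fixed 20-byte portion of the IP version 4 header described above may be unpacked as sketched below. The function name `parse_ipv4_header`, the dictionary keys and the sample addresses are assumptions made for this sketch. The 13-bit fragment offset, which shares a 16-bit word with the "flags" field, is not discussed above but is included so that the layout is complete. The "options" field is not parsed.

```python
import socket
import struct


def parse_ipv4_header(raw: bytes) -> dict:
    """Unpack the fixed 20-byte part of an IPv4 header (options not parsed)."""
    (ver_ihl, tos, total_length, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": ver_ihl >> 4,                 # upper 4 bits
        "header_length_words": ver_ihl & 0x0F,   # length in 32 bit words
        "type_of_service": tos,
        "total_length": total_length,
        "identification": ident,
        "flags": flags_frag >> 13,               # top 3 bits
        "fragment_offset": flags_frag & 0x1FFF,  # remaining 13 bits
        "time_to_live": ttl,
        "protocol": proto,
        "header_checksum": checksum,
        "source_address": socket.inet_ntoa(src),
        "destination_address": socket.inet_ntoa(dst),
    }


# A hand-built sample header (checksum left at zero for simplicity).
sample = struct.pack(
    "!BBHHHBBH4s4s",
    0x45, 0, 40, 1, 0x4000, 64, 6, 0,
    socket.inet_aton("192.0.2.1"), socket.inet_aton("198.51.100.7"),
)
fields = parse_ipv4_header(sample)
print(fields["version"], fields["header_length_words"], fields["protocol"])
```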

Each ESP 24 receives streams of packets from its associated LAN and from WAN 22. These packets typically belong to at least several different flows. The combined bandwidth of the input ports of an ESP 24 is typically greater than the bandwidth of any single output port of ESP 24. Therefore, ESP 24 typically represents a queuing point where packets belonging to various flows may become backlogged while waiting to be transmitted through a port of ESP 24. Backlogs may occur at any output port of ESP 24. While this invention is preferably used to manage the scheduling of packets at all output ports of ESP 24, the invention could be used at any one or more output ports of ESP 24.

For example, if the output port which connects ESP 24 to WAN 22 is backlogged then ESP 24 must determine which packets to send over access link 26, in which order, to make the best use of the bandwidth available in access link 26 and to provide guaranteed levels of service to individual flows. To do this, ESP 24 must be able to classify each packet, as it arrives, according to certain rules. ESP 24 can then identify those packets which are to be given priority access to link 26. After the packets are classified they can be scheduled for transmission.

The packets must be classified, scheduled and forwarded extremely quickly. For example, a delay of much more than 1 millisecond is unacceptable for two-way voice conversations. If classifying and scheduling a packet takes 2 milliseconds then it would be impossible to provide a QoS sufficient for two-way voice conversations.

This invention provides methods and apparatus for scheduling packets for transmission over a data connection in a data communication network. By way of example only, packets transmitted via the data connection may be carried over an ATM link. Incoming packets are sorted by a classifier into classes according to a policy which includes a set of classification rules. The rules set conditions on the values of one or more parameters which characterize the packets which belong to each class. A packet is assigned to a class if the parameter values for that packet match the conditions set by the classification rules for the class. The policy also establishes a QoS level which will be accorded to the packets in each of the different classes. Data packets in some classes may be treated differently from data packets in other classes to provide guaranteed levels of QoS to applications which generate data packets in selected classes.

There is preferably a separate policy for each output port of ESP 24. For example, there is a policy for the port of ESP 24 connected to outgoing link 26. There may be separate policies for classifying and scheduling packets which are received at an ESP 24 from a data link 26 and which are destined for each one of the one or more ports of ESP 24 connected to a LAN. The methods and apparatus of the invention may also be used in other network devices which schedule the forwarding of data packets.
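By way of illustration only, a policy whose classification rules set conditions on packet parameter values may be sketched as a first-match rule list. The rule set, the field names `protocol` and `dst_port`, the class names, the use of port 5004 for voice traffic and the fall-through class `other` are all hypothetical and are not taken from the specification. This sketch is also far simpler than the classifiers incorporated by reference below.

```python
# Each rule pairs a class name with conditions on packet parameter values.
# Protocol numbers 6 and 17 are the IP protocol numbers for TCP and UDP.
RULES = [
    ("voice", {"protocol": 17, "dst_port": 5004}),  # hypothetical voice port
    ("http", {"protocol": 6, "dst_port": 80}),
]
DEFAULT_CLASS = "other"  # assumption: unmatched packets fall into a catch-all class


def classify(packet_fields: dict) -> str:
    """Return the first class whose conditions all match the packet."""
    for class_name, conditions in RULES:
        if all(packet_fields.get(key) == value for key, value in conditions.items()):
            return class_name
    return DEFAULT_CLASS


print(classify({"protocol": 6, "dst_port": 80}))     # http
print(classify({"protocol": 17, "dst_port": 5004}))  # voice
print(classify({"protocol": 6, "dst_port": 25}))     # other
```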

Any suitable classifier may be used to classify data packets for scheduling according to this invention. For example, the classification methods and apparatus described in a co-pending commonly owned application entitled METHODS AND APPARATUS FOR PACKET CLASSIFICATION WITH MULTI-LEVEL DATA STRUCTURE which is incorporated herein by reference, or the methods and apparatus described in METHODS AND APPARATUS FOR PACKET CLASSIFICATION WITH MULTIPLE ANSWER SETS which is incorporated herein by reference, may be used to classify packets so that the packets may be scheduled by the methods and apparatus of this invention.

At any given time ESP 24 may hold backlogged data packets which are waiting to be forwarded to a destination and which are classified in one or more of the classes. The relationship between different classes in a policy and the QoS accorded to different classes may be represented by a "classification tree" or "policy" tree 39 (Fig. 4). The leaf nodes of one or more policy trees 39 correspond to the individual classes identified by the classification rules of the policy. Other nodes of the policy tree may also be called classes.

Figure 4 schematically illustrates one possible policy tree 39. Policy tree 39 has a number of leaf nodes 40, 42, 44, 46. In the example policy tree of Figure 4 class 40 contains voice traffic. Class 40 may be termed a "real time" class because it is important to deliver packets in class 40 quickly enough to allow a real time voice conversation between two people. Packets in class 40 will be scheduled so that each flow in class 40 will be guaranteed sufficient bandwidth to support a real time voice session. This may be done, for example, by specifying a particular minimum amount of bandwidth to be shared by the packets classified in class 40. Each flow in class 40 will be guaranteed a level of QoS sufficient for voice communication.

Classes 42 and 44 contain flows of Hyper Text Transfer Protocol ("HTTP") packets. Class 42 contains HTTP flows which originate in MARKETING. MARKETING may be, for example, sources 28 associated with a company's marketing department. Other HTTP flows fall into class 44. As indicated at 48, in the policy of Figure 4, classes 42 and 44 will share between themselves at least 40% of the bandwidth. 15% of the bandwidth is allocated to satisfy the flows of class 40. The other 45% of the bandwidth is allocated to class 46 which covers all other flows. Of the bandwidth shared by classes 42 and 44, at least 30% is allocated to class 42 and at least 70% is allocated to class 44. The actual bandwidth available at a node may be greater than the minimum bandwidth allocated by policy 39. For example, packets coming through node 42 may enjoy more than 30% of the bandwidth of node 48 which is shared between nodes 42 and 44 if there is no backlog of packets at node 44 (i.e. node 44 is not using all of the minimum bandwidth to which it is entitled). If, for example, at some time there are no packets for transmission which are associated with node 44 then all of the bandwidth shared by nodes 42 and 44 is available to packets associated with node 42.
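By way of illustration only, the minimum bandwidth which the policy of Figure 4 guarantees to each leaf class may be worked out by multiplying the shares along the path from the root, as sketched below. The nested dictionary representation, the function name `leaf_minimums` and the 10 Mbit/s link rate are assumptions made for this sketch. The percentages are those given above.

```python
# Policy tree 39 of Figure 4; "share" is a node's fraction of its parent's bandwidth.
policy_tree = {
    "name": "root 49", "share": 1.0, "children": [
        {"name": "class 40 (voice)", "share": 0.15, "children": []},
        {"name": "node 48 (HTTP)", "share": 0.40, "children": [
            {"name": "class 42 (MARKETING HTTP)", "share": 0.30, "children": []},
            {"name": "class 44 (other HTTP)", "share": 0.70, "children": []},
        ]},
        {"name": "class 46 (other)", "share": 0.45, "children": []},
    ],
}


def leaf_minimums(node, link_rate, inherited=1.0):
    """Return {leaf name: minimum bandwidth} by multiplying shares down the tree."""
    fraction = inherited * node["share"]
    if not node["children"]:
        return {node["name"]: fraction * link_rate}
    result = {}
    for child in node["children"]:
        result.update(leaf_minimums(child, link_rate, fraction))
    return result


minimums = leaf_minimums(policy_tree, link_rate=10_000_000)  # hypothetical link rate
for name, rate in minimums.items():
    print(f"{name}: {rate / 1e6:.2f} Mbit/s")
```

Under these assumptions class 42 is guaranteed 30% of 40%, that is 12% of the link, and class 44 is guaranteed 28%. These are minimums only. As noted above, a class may receive more when a sibling class is not backlogged.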

A policy tree typically has two or more levels. The policy tree 39 of Figure 4 has 3 levels. Nodes which are in the same level are all separated from link 26 by the same number of nodes above them in policy tree 39. We can refer to the levels in increasing ordinality starting from node 49 which can be termed a first level, or "root" level node. Nodes 40, 46 and 48 may be termed "second" level nodes because they are one node removed from link 26. Nodes 42 and 44 are third level nodes which are two nodes removed from link 26, and so on.

In Figure 4 lower level nodes of policy tree 39 are depicted as being above higher level nodes. Nodes in policy tree 39 are connected to one another as indicated in Figure 4 by lines 41. A higher level node connected to a lower level node by a line 41 is said to be a child of the lower level node. A lower level node connected to a higher level node by a line 41 is said to be a parent of the higher level node.

The policy represented by a policy tree 39 may specify QoS by providing a desired distribution of bandwidth between different higher level nodes which depend from the same lower level node. This may be done, for example, by specifying absolute amounts of bandwidth to be provided to individual higher level nodes, specifying percentages of available bandwidth to be shared by each of two or more higher level nodes (as described above with respect to nodes 42 and 44), a combination of these measures or any equivalent measure. In preferred embodiments of the invention, packets are classified and inserted into a scheduler which has a structure mirroring that of the policy tree. The packets enter the scheduler at a leaf node corresponding to the class. From there, the packets "percolate" from node to node up through the scheduler, until they reach a node corresponding to the root node of the policy tree. From there, the packets are sent out on the data link.

After a packet has been classified, the classification information for the packet is forwarded to a scheduler 50 (Fig. 5). Scheduler 50 schedules the transmission of the packet out an output port. Scheduler 50 uses the policy associated with the port to determine the sequence in which to send any packets which are backlogged waiting to be sent through the output port.

As shown in Figures 5 and 6, a scheduler 50 receives each incoming packet 51 together with a class identifier 53 generated by a classifier 52 (step 102). Scheduler 50 then places each packet in a queue 55 (step 104). Each queue 55 is associated with a leaf class. The particular queue 55 into which a packet is inserted is determined by the classification of the packet and, possibly, by the flow to which the packet belongs. Each queue 55 may contain zero, one, or more packets. Each active flow may have its own queue or, in the alternative, the packets for two or more flows may all be directed to a single queue.

Queues 55 do not need to be physical queues in the sense that all packets in each queue 55 are located in sequence in the same storage device. Queues 55 are logical first in, first out ("FIFO") queues. Packets 51 are stored somewhere in a storage device accessible to scheduler 50. In Figure 5, the packets are stored in a RAM memory 64 accessible to scheduler 50. Scheduler 50 maintains a record of what packets 51 belong to each queue 55 and what is the order of packets 51 within each queue 55. Scheduler 50 selects packets which are at the heads of their respective queues 55 and a forwarder 58 associated with scheduler 50 sequentially transmits the selected packets over a data link 26. As is known in the art, data link 26 may include an adaptation layer. Each packet 51 may be transmitted on data link 26 as one or more data packets of the type carried by data link 26.

As shown in Figure 5A, the scheduler 50 of this invention preferably has a structure which mirrors that of a policy tree 39. Scheduler 50 has a scheduling engine 60 corresponding to each node of policy tree 39. The scheduling engines 60 are connected by data pathways 61 which permit one scheduling engine to forward data packets to its parent scheduling engine. It is not necessary for data packets 51 to be physically transmitted from one scheduling engine 60 to another. It is only necessary for information identifying individual data packets 51 to be sent from one scheduling engine 60 to another. The data packet 51 in question could continue to reside in the same location in a storage device, such as RAM 64, until it is forwarded by forwarder 58.

Each group 56 of queues 55 corresponds to a leaf class in the policy tree 39. A scheduling engine 60 corresponding to each leaf node (a "leaf scheduling engine") selects packets from the queue(s) 55 in the group 56 corresponding to the same leaf node for passing to the scheduling engine 60 corresponding to the parent of the leaf node (a "parent scheduling engine"). For example, leaf scheduling engine 60A selects packets from the group 56 consisting of queues 55A, 55B, and 55C to be passed to parent scheduling engine 60B along data path 61A. A child scheduling engine 60 corresponding to a first node of a policy tree 39 can pass responsibility for data packets 51 to a parent scheduling engine 60 which corresponds to the parent node of the first node of the policy tree. A parent scheduling engine corresponding to a first node of a policy tree can receive data packets 51 from one or more child scheduling engines which correspond to child nodes of the first node of the policy tree. A scheduling engine 60 may be a child of another scheduling engine 60 and, at the same time, may be a parent of one or more other scheduling engines 60.

Scheduler 50 passes responsibility for each packet 51 from one scheduling engine 60 to another upwards through the tree in stages until the packet 51 is associated with scheduling engine 60C which corresponds to the first level node 49 of policy tree 39. The scheduling engine 60C associated with the first level node 49 of policy tree 39 selects packets from its child scheduling engines to be sent out the logical output port by forwarder 58. Each scheduling engine 60 can pass one packet at a time to its parent (lower level) scheduling engine. A scheduling engine 60 which receives packets from more than one source (e.g. which corresponds to a node in a policy tree which has two or more child nodes, or which corresponds to a leaf node having a plurality of corresponding queues) interleaves packets from the different sources so that all packets 51 will eventually be passed by the scheduling engine 60.

Packets 51 are transmitted through a scheduling engine 60 at a rate R that corresponds to the bandwidth assigned to the scheduling engine in policy tree 39. The bandwidth assigned to a parent scheduling engine 60 must be equal to the aggregate bandwidth allocated to the child scheduling engines 60 of that parent scheduling engine.

The bandwidth assigned to a leaf scheduling engine 60 is shared equally by all queues associated with the leaf scheduling engine. Each queue is assigned a bandwidth R_q of:

R_q = \frac{R_{lc}}{N_q} \qquad (1)

where R_lc is the bandwidth for the leaf class and N_q is the number of queues associated with the leaf class.

In general, the packets in different queues 55 will not be equal in length. Therefore, a leaf scheduling engine 60 cannot fairly allocate bandwidth by simply transmitting one or more packets 51 from each active queue 55 with the number of packets 51 transmitted from each queue in a ratio equal to the proportion of bandwidth available for each one of the active queues. In the preferred embodiment of the invention, a notion of time is used to measure whether packets are being transmitted at an assigned rate. If a packet 51 of length L were transmitted at a rate R, its transmission would be completed after an interval I given by:

$$I = \frac{L}{R} \qquad (2)$$

Each scheduling engine 60 maintains a virtual time V which advances by the interval I each time it passes a packet to its parent scheduling engine (or to forwarder 58 in the case of scheduling engine 60C). Each interval is calculated from the length of the packet being passed. The virtual time V of each scheduling engine 60 is initialized to 0 when scheduler 50 is initialized. The virtual time of each scheduling engine 60 is stored in an associated memory 64A as shown in Figure 5.
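The bookkeeping described so far (a per-queue rate equal to the leaf rate divided by the number of queues, a transmission interval equal to packet length divided by rate, and a virtual time that starts at 0 and advances by one interval per packet passed) can be sketched as follows. This is a minimal illustration only: the class and function names are hypothetical, and lengths and rates are treated as plain numbers in consistent units.

```python
from dataclasses import dataclass


def per_queue_rate(leaf_rate: float, num_queues: int) -> float:
    """Leaf bandwidth shared equally by its queues: R_q = R_lc / N_q."""
    return leaf_rate / num_queues


def interval(length: float, rate: float) -> float:
    """Interval needed to send `length` units at `rate`: I = L / R."""
    return length / rate


@dataclass
class SchedulingEngine:
    """Hypothetical stand-in for a scheduling engine's virtual clock."""
    rate: float                 # bandwidth assigned in the policy tree
    virtual_time: float = 0.0   # initialized to 0 with the scheduler

    def on_packet_passed(self, length: float) -> None:
        """Advance virtual time by I = L / R for the packet just passed."""
        self.virtual_time += interval(length, self.rate)


engine = SchedulingEngine(rate=1000.0)
engine.on_packet_passed(500.0)    # advances V by 0.5
engine.on_packet_passed(1500.0)   # advances V by 1.5
print(engine.virtual_time)
```

Because the interval depends on packet length, a long packet moves the virtual time further than a short one, which is what lets the later selection steps account for unequal packet sizes.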

The packets in a queue 55 associated with a leaf class of tree 39 should ideally be transmitted out of the queue 55 at the rate given by Equation (1). In a preferred implementation of scheduler 50, each leaf scheduling engine 60 calculates a start time S and a finish time F for packets 51 at the heads of its queues 55 (step 106). The start and finish times for a packet can be considered to be measures of when a packet 51 at the head of a queue 55 should ideally start to be transmitted and when it should finish transmission. S and F are used by leaf scheduling engines 60 to select which packet to transmit next.

When a packet 51 first reaches the head of a queue 55, it is assigned a start time S and a finish time F. A packet 51 can reach the head of a queue 55 by being placed into an empty queue 55. In this case the packet 51 is assigned the virtual time of the leaf scheduler 60 to which the queue belongs as its start time. The other way a packet 51 can reach the head of a queue 55 is for it to replace a previous packet 51 that has just been transmitted out of the queue. In this case the start time of the packet 51 will be set to the finish time of the previous packet 51. When the start time for a packet 51 is known then the finish time for the packet 51 will be given by the equation:

$$F = S + \frac{L}{R_q}$$

Scheduler 50 keeps a record of V for each scheduling engine 60 and also keeps records of S and F for the packets at the head of each non-empty queue 55 managed by scheduler 50. In the embodiment of Figure 5, this information is kept in an associated memory 64A. While S, F and V have been called "times", these parameters do not necessarily bear any relationship to actual time. S, F and V are similar to time in that they always increase. In commercial embodiments, S, F and V will typically be values stored in memory locations. The values are periodically added to by scheduler 50.
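The two ways a packet can reach the head of a queue can be illustrated with a short sketch. It assumes the finish time is the start time plus the packet length divided by the per-queue rate, which follows from the statement that start and finish times are calculated on the basis of the per-queue rate; the function name and argument names are hypothetical.

```python
from typing import Optional, Tuple


def head_times(length: float, queue_rate: float, engine_virtual_time: float,
               previous_finish: Optional[float]) -> Tuple[float, float]:
    """Return (S, F) for a packet that has just reached the head of a queue.

    previous_finish is None when the packet was placed into an empty queue;
    otherwise it is the finish time of the packet just transmitted.
    """
    if previous_finish is None:
        start = engine_virtual_time      # empty queue: S = V of the leaf engine
    else:
        start = previous_finish          # replaces a sent packet: S = previous F
    finish = start + length / queue_rate  # assumed form: F = S + L / R_q
    return start, finish


# A packet enters an empty queue while the leaf's virtual time is 4.0.
s1, f1 = head_times(200.0, 100.0, 4.0, None)
# The next packet replaces it at the head; the leaf's virtual time is ignored.
s2, f2 = head_times(300.0, 100.0, 9.0, f1)
print((s1, f1), (s2, f2))
```

Chaining the start time to the previous finish time keeps a continuously backlogged queue on its ideal per-queue schedule regardless of when the engine actually served it.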

As noted above, start times S and finish times F for each queue are calculated on the basis of the rate R_lc/N_q. However, leaf schedulers 60 extract packets from queues 55 and forward those extracted packets at a rate R_lc. The virtual time V for the leaf scheduler 60 is advanced on the basis of the rate R_lc. This means that the values of S and F for a packet at the head of a queue 55 will tend to be in the future relative to the virtual time V of the associated leaf scheduling engine 60. This gives the leaf scheduling engine 60 time to service any other queues 55.

Where a leaf scheduling engine 60 services more than one queue, the leaf scheduling engine 60 selects a next packet to be transmitted by using the start and finish times of the packets at the heads of the queues 55 associated with the leaf class. According to the preferred embodiment of the invention, each leaf scheduling engine 60 selects a group of eligible packets 51 from the group of all packets 51 at the heads of the queues 55 in the group 56 associated with that leaf scheduling engine 60 (step 110). The eligible group comprises a set of packets which are eligible for transmission according to an eligibility criterion.

Preferably the set of eligible packets is constructed by selecting those packets which have a start time S smaller than or equal to the virtual time V of the scheduler 60.

When this eligibility criterion is used, the eligible packets are packets whose predicted start times have passed. If the scheduling engine 60 does not send a packet 51 from that queue 55 soon, the queue 55 will not have the benefit of the bandwidth calculated by equation (1). If a packet 51 at the head of a queue 55 is not eligible, its start time is greater than the virtual time V of the scheduling engine 60. This indicates that the queue 55 has already received the benefit of its assigned bandwidth. If there are no eligible packets in any queue 55 associated with a leaf class (i.e. the set of eligible packets is empty), but there are packets in one or more of the queues 55 associated with the leaf class, then the virtual time V of the scheduling engine 60 associated with the leaf class is advanced to the start time S of the packet or packets with the earliest start time S. A set of eligible packets is then identified by applying the eligibility criterion to the packets using the new virtual time V (step 110).

In preferred embodiments of the invention, the leaf scheduling engine 60 will select for transmission the eligible packet 51 which meets a selection criterion (step 114). Preferably the selection criterion is a first to finish selection criterion so that the eligible packet that has the earliest finish time F is selected. An alternative, less preferable, approach is to use a selection criterion which selects for transmission the eligible packet with the earliest start time S. If two or more packets have the same finish time (or start time), scheduling engine 60 may select one of the two or more packets at random (step 114).
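The leaf selection procedure described above (build the eligible set from head packets whose start time has passed, advance the virtual time if that set is empty, then pick the earliest finish time with a random tie-break) might be sketched as below. The `Head` record and `select_leaf` function are hypothetical names introduced for illustration.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Head:
    """Packet at the head of a queue, with its start and finish times."""
    queue_id: int
    start: float
    finish: float


def select_leaf(heads: List[Head], virtual_time: float,
                rng: Optional[random.Random] = None
                ) -> Tuple[Optional[Head], float]:
    """Return (selected head packet, possibly advanced virtual time)."""
    if not heads:
        return None, virtual_time
    # Eligibility criterion: start time S <= virtual time V.
    eligible = [h for h in heads if h.start <= virtual_time]
    if not eligible:
        # Nothing eligible: advance V to the earliest start time and retry.
        virtual_time = min(h.start for h in heads)
        eligible = [h for h in heads if h.start <= virtual_time]
    # First to finish selection criterion, ties broken at random.
    earliest = min(h.finish for h in eligible)
    ties = [h for h in eligible if h.finish == earliest]
    chooser = rng or random
    return chooser.choice(ties), virtual_time


heads = [Head(0, 0.0, 8.0), Head(1, 2.0, 3.0), Head(2, 1.0, 5.0)]
picked, v = select_leaf(heads, 1.0)
print(picked.queue_id, v)

later = [Head(0, 5.0, 9.0), Head(1, 7.0, 8.0)]
picked_later, v_later = select_leaf(later, 3.0)
print(picked_later.queue_id, v_later)
```

In the first call queue 1 holds the packet with the smallest finish time, but its start time is still in the future, so it is not eligible and queue 2 is chosen instead. In the second call no packet is eligible at first, so the virtual time moves forward to 5.0 before a selection is made.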

A simplified method is possible whereby leaf scheduling engine 60 simply selects for transmission the packet which has the smallest finish time F (or earliest start time S) without considering eligibility. The use of only finish time (or start time) provides coarse-grained control over bandwidth usage, but there will be short term fluctuations either side of the assigned bandwidth.

After leaf scheduling engine 60 selects a packet 51, the selected packet 51 is removed from its queue 55 and is held at leaf scheduling engine 60. In preferred embodiments of the invention only a single packet 51 can be held at a scheduling engine 60. Once again, it is not necessary for the packet 51 to be physically moved. Eventually the selected packet will be passed to the parent of the leaf scheduling engine 60 (step 122). At that time, the virtual time V of the leaf scheduling engine 60 will be updated (step 125) and leaf scheduling engine 60 will select a new packet 51 (step 114) from a queue 55 for eventual transmission.

In the preferred embodiment of the invention, scheduling engines 60 corresponding to non-leaf classes use a similar method to select a packet for transmission as shown in Figure 6A. Each scheduling engine 60 which corresponds to a non-leaf class selects packets 51 from among those packets 51 which are being held by its child scheduling engine(s) 60 (step 109). In a preferred implementation of the invention, each child scheduling engine 60 assigns new start and finish times to a packet 51 when the packet is transferred to the child scheduling engine 60. If a child scheduling engine 60 passes a packet to its parent scheduling engine 60 and immediately receives a new packet 51 in the same operation then the new packet 51 is assigned a start time that is the same as the finish time of the previously passed packet. Otherwise, the virtual time V of the child scheduling engine 60 is set equal to that of the parent scheduling engine 60 and the new packet 51 is assigned a start time equal to the newly assigned virtual time V of the child scheduling engine 60.
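The start time rule for a packet newly held by a child scheduling engine can be sketched as follows. The sketch assumes the finish time is the start time plus the packet length divided by the child's assigned rate; `ChildEngine`, `accept` and the `back_to_back` flag are hypothetical names, the flag standing for the case where a new packet is received in the same operation in which the previous one was passed up.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ChildEngine:
    rate: float                              # rate assigned in the policy tree
    virtual_time: float = 0.0
    previous_finish: Optional[float] = None  # F of the most recently held packet

    def accept(self, length: float, parent_virtual_time: float,
               back_to_back: bool) -> Tuple[float, float]:
        """Assign (S, F) to a packet newly held by this engine."""
        if back_to_back and self.previous_finish is not None:
            # Received in the same operation that passed the previous packet.
            start = self.previous_finish
        else:
            # Otherwise resynchronize with the parent's virtual time.
            self.virtual_time = parent_virtual_time
            start = self.virtual_time
        finish = start + length / self.rate   # assumed form: F = S + L / rate
        self.previous_finish = finish
        return start, finish


child = ChildEngine(rate=50.0)
first = child.accept(100.0, parent_virtual_time=10.0, back_to_back=False)
second = child.accept(50.0, parent_virtual_time=11.0, back_to_back=True)
third = child.accept(100.0, parent_virtual_time=20.0, back_to_back=False)
print(first, second, third)
```

Resynchronizing with the parent after an idle period prevents a child that has been empty from claiming the bandwidth it did not use while idle.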

First level scheduling engine 60C has no parent scheduling engine 60. Scheduling engine 60C does not need to maintain start and finish times for the packet that it is holding because forwarder 58 simply forwards the packets held by scheduling engine 60C as quickly as possible.

The finish time for a packet 51 being held at a child scheduling engine 60 will be given by the equation:

$$F = S + \frac{L}{R_{cc}}$$

where R_cc is the data rate assigned to the child scheduling engine in policy tree 39. The start and finish times of packets 51 held at all scheduling engines 60 are stored in associated memory 64A.

Start and finish times for a packet 51 being held at a child scheduling engine 60 are calculated on the basis of the rate R_cc. A parent scheduling engine 60 is assigned a greater data rate R_pc in policy tree 39 than its child scheduling engines. The virtual time of the parent scheduling engine 60 will advance on the basis of the rate R_pc. This means that the packet's calculated start and finish times will tend to be in the future relative to the virtual time of the parent scheduling engine. This gives the parent class time to service other child scheduling engines.

Each leaf class of policy tree 39 has a priority. Each packet that passes through a leaf scheduling engine 60 is assigned the priority of the leaf class. Information identifying the priority of a packet is passed to each scheduling engine 60 which handles the packet. A scheduler 50 may support two or more levels of priority. A simple two level priority scheme, as shown in the priority tree of Figure 4, designates high priority classes as "real-time" and lower priority classes as "best effort". A non-leaf scheduling engine 60 selects the next packet to be transmitted to its parent scheduling engine 60 from among the zero or more packets which are being held by its child scheduling engines 60. If there are two or more packets being held by its child scheduling engines 60 then the non-leaf scheduling engine 60 uses the priority, start time, and finish time of the two or more packets to select one packet to hold and eventually transmit to its parent scheduling engine 60. As a strategy, high priority is assigned to classes that require small transmission delays. Lower priorities are assigned to classes that can tolerate larger delays.

Each parent scheduling engine 60 selects a group of packets which are eligible for transmission according to an eligibility criterion. Preferably the set of eligible packets is constructed by identifying those packets being held by child scheduling engines 60 of the parent scheduling engine 60 whose start times are smaller than or equal to the virtual time of the parent scheduling engine 60 (step 110). In other words a packet is eligible if its predicted start time has passed.

If one or more packets are being held by child scheduling engines 60 but none of them are eligible then the virtual time of the parent scheduling engine is advanced to the start time of the packet or packets being held by child scheduling engines 60 which have the earliest start time. The set of eligible packets is then identified based on the new virtual time (step 110).

After a set of eligible packets has been identified, the parent scheduling engine 60 determines whether the eligible packets all have the same priority or have different priorities (step 112). If the set of eligible packets includes packets which have two or more different priorities, parent scheduling engine 60 identifies the highest priority assigned to one or more packets in the eligible set. Any packet in the eligible set which does not have the highest priority is removed from the set (step 118).

As an alternative to constructing an initial set of eligible packets and subsequently modifying the set to create a sub-set which contains only the highest priority eligible packets, a scheduling engine 60 could take priority into consideration while identifying eligible packets. The eligible set would then contain only those packets which have a start time which makes them eligible to be transmitted and which also have a highest priority. After an eligible set has been constructed then the parent scheduling engine 60 selects one packet to pass on next to its parent scheduling engine according to a selection criterion (step 114 or 120). For example, in preferred embodiments of the invention, the scheduling engine 60 selects for transmission the highest priority eligible packet 51 which has the earliest finish time. A less preferable selection criterion selects the highest priority eligible packet with the earliest start time. If two or more packets have the same finish time (or start time), the scheduling engine 60 may select one of the packets at random.
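The parent selection steps described above (eligibility by start time, advancing the virtual time when nothing is eligible, discarding all but the highest priority eligible packets, then applying first to finish) can be sketched as follows. The names are hypothetical, larger integers are assumed to mean higher priority, and ties are resolved by list order here rather than at random to keep the example deterministic.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Held:
    """Packet being held by a child scheduling engine."""
    child_id: int
    priority: int   # assumption: larger value = higher priority
    start: float
    finish: float


def select_parent(held: List[Held], virtual_time: float
                  ) -> Tuple[Optional[Held], float]:
    """Return (selected packet, possibly advanced virtual time)."""
    if not held:
        return None, virtual_time
    eligible = [p for p in held if p.start <= virtual_time]
    if not eligible:
        # No eligible packet: advance V to the earliest start time held.
        virtual_time = min(p.start for p in held)
        eligible = [p for p in held if p.start <= virtual_time]
    # Keep only the highest priority present in the eligible set.
    top = max(p.priority for p in eligible)
    best = [p for p in eligible if p.priority == top]
    # First to finish among the remaining packets (first listed wins ties).
    return min(best, key=lambda p: p.finish), virtual_time


held = [Held(0, 0, 0.0, 2.0), Held(1, 1, 1.0, 6.0), Held(2, 1, 5.0, 5.5)]
chosen, v = select_parent(held, 1.0)
print(chosen.child_id, v)
```

Here child 0 holds the packet with the earliest finish time, but it is low priority; child 2 holds a high priority packet that is not yet eligible. The packet from child 1 is selected because it is the only eligible packet at the highest eligible priority.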

Parent scheduling engines 60 could use a simplified method which does not use start time to determine eligibility. Figure 8 illustrates this simplified embodiment of the invention being used in a situation where packets have one of two priority levels. Each packet may be a high priority (or "real time") packet or a low priority (or "best effort") packet. Simplified method 200 begins by selecting all high priority packets which are currently queued (step 204). The method continues by passing the one high priority packet having the smallest finish time F (step 206). In the alternative, step 206 could pass the packet having the smallest start time S. If there are no queued high priority packets then the method selects all queued low priority packets (step 208) and continues by forwarding the low priority packet with the smallest finish time F (step 210). In the alternative, step 210 could pass the packet having the smallest start time S. If there are no packets in any queue then the scheduling engine simply waits. The steps of selecting and forwarding high priority packets may be performed as a single step (e.g. if there are any queued high priority packets, selecting and forwarding the queued high priority packet with the smallest finish time) as indicated by 207 and the step of selecting and forwarding the low priority packet may also be performed as a single step (e.g. if there are any queued low priority packets, selecting and forwarding the queued low priority packet with the smallest finish time) as indicated by 211. The use of finish time as a selection criterion still provides coarse-grained control over bandwidth usage, but there will be short term fluctuations either side of the assigned bandwidth. A disadvantage of the simplified method of Figure 8 is that no lower priority packets will be forwarded over the data link as long as there are higher priority packets to be sent.

Each time a parent scheduling engine 60 selects a packet being held by one of its child scheduling engines, scheduler 50 moves the selected packet from the child scheduling engine to the parent scheduling engine, where it is held. After the packet moves from a child scheduling engine 60 to the scheduling engine which is the parent of that child scheduling engine 60 (step 122) then the virtual time of the child scheduling engine is updated (step 125) and the child scheduling engine will select a new packet.
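The simplified two-priority method of Figure 8 described above, which ignores start times and eligibility, reduces to a few lines. The sketch below uses hypothetical names and represents each queued packet by a name, a priority level and a finish time.

```python
from typing import List, NamedTuple, Optional

REAL_TIME = 1     # high priority
BEST_EFFORT = 0   # low priority


class Queued(NamedTuple):
    name: str
    priority: int
    finish: float


def simplified_select(queued: List[Queued]) -> Optional[Queued]:
    """Pick the smallest finish time among high priority packets,
    falling back to low priority packets only when none are queued."""
    for level in (REAL_TIME, BEST_EFFORT):
        candidates = [p for p in queued if p.priority == level]
        if candidates:
            return min(candidates, key=lambda p: p.finish)
    return None   # nothing queued: the scheduling engine simply waits


packets = [
    Queued("be-a", BEST_EFFORT, 1.0),
    Queued("rt-a", REAL_TIME, 4.0),
    Queued("rt-b", REAL_TIME, 3.0),
]
print(simplified_select(packets).name)
```

The example makes the stated disadvantage visible: the best effort packet has the smallest finish time of all, yet it is not chosen while any real time packet is queued.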

As noted above, first level scheduling engine 60C, which may be termed a "root" scheduling engine, does not have a parent class that pulls packets upwards. Instead a forwarder 58 iteratively retrieves packets from root scheduling engine 60C and sends the packets out the logical output port. Each time a packet is retrieved by forwarder 58, root scheduling engine 60C selects another packet from among packets being held by its child scheduling engines for transmission.

There are two main ways of implementing scheduler 50. Scheduler 50 could be a single entity that traverses policy tree 39, stopping at each node to provide the function of each scheduling engine 60. Such a scheduler 50 could be implemented as software running on a general purpose CPU or it could be implemented as a hardware device (e.g. an ASIC). In the alternative, scheduler 50 could be implemented as a set of much simpler entities, with a separate entity providing the function of each scheduling engine 60. Each simple scheduling engine 60 could be implemented as a software entity running on a general purpose CPU. Alternatively each simple scheduler could be implemented as a hardware entity and combined with other simple schedulers into a parallel processing hardware device.

In some cases it is desirable to expedite the transmission of high priority packets which arrive after a packet has been selected by a scheduler 50. Consider, for example, the scheduler 150 of Figure 7. Scheduler 150 has 9 leaf scheduling engines, 160A through 160I. Each leaf scheduling engine receives packets which have been classified in a particular class by a classifier. Scheduler 150 has 5 non-leaf scheduling engines 160J through 160N. Each scheduling engine uses the methods of the invention to select and hold one data packet. That one packet is then available for selection by the parent of the scheduling engine holding the packet.

In Figure 7, leaf scheduling engines 160D and 160G correspond to real time classes. The other leaf scheduling engines correspond to best effort classes. Consider the situation that would exist for a high priority packet received at scheduling engine 160D when scheduler 150 is backlogged. If the high priority packet is received after scheduling engine 160E has already selected a lower priority packet to be held for future selection by scheduling engine 160L then the high priority packet would normally need to wait until after the selected lower priority packet has been selected by scheduling engine 160L before it can itself become eligible to be selected and held by scheduling engine 160E. This might unduly delay transmission of the high priority packet.

According to an alternative embodiment of the invention, scheduling engines could pass a newly arrived high priority packet in place of an already selected lower priority packet. The virtual time V at scheduling engine 160E is updated after the higher priority packet is sent. The already selected lower priority packet retains its place in line and will be forwarded to scheduling engine 160L next (as long as another higher priority packet does not arrive in the meantime). If each scheduling engine encountered by the higher priority packet implements this alternative embodiment of the invention then high priority packets can flow quickly upward through scheduler 150 along lines 137. This alternative embodiment of the invention provides lower latency for high priority packets at the possible expense of unfairness to lower priority packets. This method for expediting the scheduling of high priority data packets may be combined with the simplified method for selecting data packets, which is described above.

For example, to implement this alternative embodiment of the invention each non-leaf scheduling engine 60 may be capable of holding a packet for each of two or more priority levels supported by scheduler 50. In a scheduler 50 that supports two priorities, real time and best effort, each non-leaf scheduling engine 60 would be capable of holding two packets. Since leaf scheduling engines 60 are associated with a single priority in preferred embodiments of the invention it is not necessary for leaf scheduling engines 60 to hold more than a single packet at a time. Each scheduling engine 60 continues to have a single virtual time. Each packet that is held by a non-leaf scheduling engine 60 has its own start and finish time.
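One way to picture a non-leaf scheduling engine that holds one packet per priority level is a small table of slots keyed by priority. The sketch below is illustrative only: the names are hypothetical, larger integers are assumed to mean higher priority, and start times, finish times and eligibility are deliberately left out so that only the holding behaviour is shown.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Packet:
    name: str
    priority: int   # assumption: larger value = higher priority


@dataclass
class NonLeafEngine:
    """Holds at most one packet per priority level."""
    slots: Dict[int, Packet] = field(default_factory=dict)

    def hold(self, packet: Packet) -> bool:
        """Hold the packet if the slot for its priority is free."""
        if packet.priority in self.slots:
            return False
        self.slots[packet.priority] = packet
        return True

    def pass_up(self) -> Optional[Packet]:
        """Hand the highest priority held packet to the parent.

        A lower priority packet that was selected earlier stays in its
        slot and keeps its place in line.
        """
        if not self.slots:
            return None
        return self.slots.pop(max(self.slots))


engine = NonLeafEngine()
engine.hold(Packet("best-effort-1", 0))   # selected first
engine.hold(Packet("real-time-1", 1))     # arrives afterwards
order = [engine.pass_up().name, engine.pass_up().name]
print(order)
```

The real time packet overtakes the best effort packet that was selected before it arrived, while the best effort packet is passed up next without having to be reselected.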

When a parent scheduling engine 60 selects a packet from one of its child scheduling engines 60, it initially considers only the highest priority packets being held by the child scheduling engines 60. If none of those packets are eligible, it considers the next highest priority packets being held by the child scheduling engines 60. The parent scheduling engine 60 continues checking for packets of ever lower priority until it finds an eligible packet. If no eligible packets are found, but the child scheduling engines 60 are holding on to one or more packets, the virtual time of the parent scheduling engine 60 is advanced to the earliest start time of those packets being held. The selection algorithm is then repeated starting at the highest priority.

Those skilled in the art will appreciate that with the methods of this invention one can provide a scheduler for forwarding a mixture of higher and lower priority data packets. The algorithm used by the preferred embodiment of this invention is similar to a WF²Q+ algorithm, but with the methods of this invention, packets can be scheduled in a manner that simultaneously takes into consideration bandwidth allocation and priorities. Previous implementations of WF²Q+ algorithms have been able to schedule on the basis of bandwidth allocation, but not on the basis of priority.

Another advantage of preferred embodiments of this invention is that unused bandwidth in one part of a policy tree can be used by another part of the policy tree. A sub-tree of the policy tree may hold no packets. At the top of the sub-tree will be a single class which does not hold a packet. Its parent class will use the bandwidth assigned to the sub-tree by selecting packets from its other child classes more frequently.

As will be apparent to those skilled in the art in the light of the foregoing disclosure, many alterations and modifications are possible in the practice of this invention without departing from the spirit or scope thereof. For example, while the invention has been described primarily with reference to IP packets, the invention could also be practised with packets formatted for other network protocols.

While the invention has been described as providing a separate scheduling engine corresponding to each leaf class in a priority tree, some benefits of the invention could be obtained by providing a single leaf scheduling engine 60 responsible for selecting and forwarding packets from two or more sets of queues containing packets classified in two or more different classes. Where packets classified in the two or more different classes have different priorities then the leaf scheduling engine could be implemented in a manner similar to that described above for a non-leaf scheduling engine. While this approach is not generally desirable it does provide a method for scheduling packets in a manner that simultaneously takes into consideration bandwidth allocation and priorities.

For example, where it is desired to forward data packets which may be classified in a high priority class or a lower priority class over a data link, one could practice the invention by providing a plurality of queues each capable of holding one or more of the data packets. If there is a data packet which is classified in a high priority class at the head of any of the queues, that data packet, or another data packet at the head of a queue and classified in a class having the same high priority, should be sent next. The method therefore selects one data packet from a first eligible group consisting of the one or more data packets which are at heads of the queues and are classified in the one or more equally high priority classes to forward over the data link. The method preferably applies a first to finish selection criterion to the data packets in the first eligible group.

If there are no data packets in the first eligible group but there are data packets in the queues which are classified in one or more lower priority classes, the method selects one data packet from a second eligible group consisting of data packets which are at heads of the queues and are classified in the one or more lower priority classes to forward over the data link. Once again, the method preferably applies a first to finish selection criterion to data packets in the second eligible group. The selected data packet is then forwarded over the data link. This variant of the invention is considered to come within the scope of the invention.

Preferred implementations of the invention may include a computer system programmed to execute a method of the invention. The invention may also be provided in the form of a program product. The program product may comprise any medium which carries a set of computer-readable signals corresponding to instructions which, when run on a computer, cause the computer to execute a method of the invention. The program product may be distributed in any of a wide variety of forms. The program product may comprise, for example, physical media such as floppy diskettes, CD ROMs, DVDs, hard disk drives, flash RAM or the like or transmission-type media such as digital or analog communication links.

Accordingly, the scope of the invention is to be construed in accordance with the substance defined by the following claims.

Claims

WE CLAIM:

1. A method for scheduling transmission of data packets on a data link, the method comprising: a) receiving data packets, each data packet belonging to one of a plurality of classes, the classes having priorities, and assigning each data packet to one of a plurality of queues, each queue capable of accommodating at least one data packet; b) from a group comprising data packets in the plurality of queues selecting an eligible group of data packets, the eligible group comprising data packets which satisfy an eligibility criterion; c) determining whether data packets in the eligible group all belong to one or more classes having the same priority or belong to two or more classes having different priorities; d) if the data packets in the eligible group belong to two or more classes having different priorities, selecting one data packet for transmission on the data link by applying a selection criterion to an eligible sub-group, the eligible sub-group containing those one or more data packets which are in the eligible group and belong to one or more classes having a highest priority; e) if the data packets in the eligible group all belong to classes having the same priority, selecting one data packet for transmission on the data link by applying a selection criterion to all data packets in the eligible group; and, f) forwarding the selected packet.

2. The method of claim 1 wherein the selection criterion comprises a first to finish selection criterion.

3. The method of claim 2 wherein the first to finish selection criterion comprises selecting a packet having a smallest finish time F, where F is given by:

$$F_i = S_i + \frac{L_i}{p_i \times R}$$

where S_i is a start time for the packet, L_i is a length of the packet, R is a data rate of the data link, and p_i is a proportion of the capacity of the data link to which the packet is entitled.

4. The method of claim 3 wherein p_i = Q/N, where Q is a proportion of the capacity of the data link to which a leaf node with which the packet is associated is entitled and N is a number of active queues at the leaf node.

5. The method of claim 1 wherein each queue is associated with a single class and receives only packets classified in the single class.

6. The method of claim 1 wherein each class has one of two priorities.

7. The method of claim 2 comprising maintaining a virtual time value wherein selecting packets which satisfy the eligibility criterion comprises selecting packets having a start time less than or equal to the virtual time value.

8. The method of claim 7 comprising updating the virtual time value after each time a packet is forwarded.

9. The method of claim 8 wherein the updated virtual time value, V_i, is given by:

$$V_i = V_{i-1} + \frac{L_i}{R}$$

where V_{i-1} is a previous virtual time value, L_i is a length of the forwarded packet and R is a data rate of the link on which the packet is forwarded.

10. A method for scheduling the forwarding of data packets over a data link, the data packets comprising data packets classified in one or more high priority classes and data packets classified in one or more low priority classes, the method comprising: a) providing a plurality of queues, each queue capable of holding one or more data packets; b) if there is a data packet which is classified in a high priority class at a head of any of the queues, selecting one data packet from a first group consisting of one or more data packets which are at heads of the queues and are classified in the high priority class to forward over the data link by applying a first to finish selection criterion to the data packets in the first group; c) if there are no data packets in the first group but there are data packets classified in a lower priority class in the queues, selecting one data packet from a second group consisting of data packets which are at heads of the queues and are classified in the lower priority class to forward over the data link by applying a first to finish selection criterion to data packets in the second group; and, d) forwarding the selected data packet over the data link.

11. A method for scheduling transmission of data packets on a data link, the method comprising: a) providing a plurality of scheduling engines interlinked to form a hierarchical tree, the tree including at least a parent scheduling engine and a plurality of child scheduling engines linked to the parent scheduling engine, each of the child scheduling engines adapted to select and hold a data packet for eventual selection by the parent scheduling engine, the data packets each belonging to one of a plurality of classes, the classes each having a priority; b) in the parent scheduling engine selecting one data packet from among the data packets being held by the child scheduling engines: i) if there are any high priority data packets being held by any of the child scheduling engines, selecting one high priority data packet by applying a selection criterion to high priority data packets held by the child scheduling engines; ii) if there are no high priority data packets held by any of the child scheduling engines but there are low priority data packets held by one or more of the child scheduling engines, selecting one low priority data packet by applying a selection criterion to low priority data packets being held by the child scheduling engines.

12. The method of claim 11 wherein selecting one data packet from the data packets being held by the child scheduling engines comprises selecting an eligible group of data packets, the eligible group consisting of fewer than all of the data packets being held by the child scheduling engines and then selecting the one data packet from among data packets in the eligible group.

13. The method of claim 12 wherein selecting the eligible group comprises selecting data packets being held by the child scheduling engines which have a finish time less than a virtual time value for the parent scheduling engine.

14. The method of claim 13 comprising updating the virtual time value each time a packet is passed on by the parent scheduling engine.

15. The method of claim 14 wherein the updated virtual time value, V_i, is given by:

$$V_i = V_{i-1} + \frac{L_i}{R}$$

where V_{i-1} is a previous virtual time value, L_i is a length of the packet passed on and R is a data rate of the link on which the packet is forwarded.

16. The method of claim 12 wherein the selection criterion is a first to finish selection criterion.

17. The method of claim 11 comprising, whenever a data packet belonging to a high priority class becomes available for selection by a child scheduling engine and a data packet already selected and being held by that child scheduling engine belongs to a lower priority class, making the data packet belonging to the high priority class available for selection by the parent scheduling engine in place of the already selected data packet.

18. The method of claim 11 wherein the tree comprises a plurality of leaf nodes, one or more queues are associated with each leaf node, the one or more queues associated with one leaf node receive only data packets belonging to a class having a high priority and the one or more queues associated with another leaf node receive only data packets belonging to a class having a lower priority.

19. The method of claim 11 comprising passing a value representing a priority of a class to which the selected packet belongs to the parent scheduling engine.

20. The method of claim 11 wherein the selection criterion is a first to finish selection criterion.

21. A method for scheduling transmission of data packets on a data link, the method comprising: a) providing a plurality of schedulers interlinked to form a hierarchical tree, the tree including a first scheduler adapted to select data packets from among data packets selected by one or more child schedulers, the first scheduler having a parent scheduler adapted to select data packets from a group of one or more data packets including a data packet selected by the first scheduler, each child scheduler adapted to select data packets from data packets at heads of one or more queues, each queue capable of receiving one or more data packets, the data packets each belonging to a class, each class having one of two or more priorities; in the first scheduler: i) from a group comprising data packets selected by the child schedulers, selecting an eligible group of data packets, the eligible group comprising data packets eligible for transmission according to an eligibility criterion; ii) if the data packets in the eligible group do not all belong to classes having the same priority, selecting one data packet from the eligible group by applying a selection criterion to an eligible sub-group, the eligible sub-group containing those one or more data packets which are in the eligible group and belong to classes having a priority higher than or equal to a priority of every other class of packet in the eligible group; iii) if the data packets in the eligible group all belong to classes having the same priority, selecting one data packet by applying a selection criterion to all data packets in the eligible group; and, iv) making the selected data packet available for forwarding by the parent scheduler.

22. The method of claim 21 wherein the eligibility criterion selects packets having a finish time smaller than or equal to a virtual time of the first scheduler.

23. The method of claim 21 wherein the eligibility criterion selects packets having a start time smaller than or equal to a virtual time of the first scheduler.

24. The method of claim 22 wherein the selection criterion comprises a first to finish selection criterion.

25. The method of claim 24 wherein the first to finish selection criterion comprises selecting a packet having a smallest finish time $F_i$, where $F_i$ is given by:

$$F_i = S_i + \frac{L_i}{p_i \times R}$$

where $S_i$ is a start time for the packet, $L_i$ is a length of the packet, $R$ is a data rate associated with the first scheduler, and $p_i$ is a proportion of the data rate to which the child scheduler is entitled.
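
As a worked illustration of the finish-time relation in claim 25, the sketch below assumes the finish time is the start time plus the packet length divided by the child scheduler's share of the data rate. The function name, parameter names, and units (bits and bits per second) are hypothetical and chosen only for the example.

```python
def finish_time(start: float, length_bits: float, share: float, rate_bps: float) -> float:
    """Return start + length / (share * rate), under the assumptions stated above."""
    if not 0.0 < share <= 1.0:
        raise ValueError("share must be in the interval (0, 1]")
    if rate_bps <= 0.0:
        raise ValueError("rate must be positive")
    # The child is entitled to share * rate, so the packet occupies
    # length / (share * rate) units of virtual time after its start time.
    return start + length_bits / (share * rate_bps)


# A 12,000-bit packet served at a quarter of a 1 Mbit/s link, starting at time 0.
f = finish_time(start=0.0, length_bits=12_000, share=0.25, rate_bps=1_000_000)
```

A smaller share gives a later finish time for the same packet, which is how a first-to-finish criterion ends up favouring children entitled to more of the data rate.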

26. Apparatus for scheduling transmission of data packets on a data link, the apparatus comprising: a) a memory capable of holding a plurality of data packets queued in a plurality of queues; b) means for keeping a start time, a finish time and a priority for a packet at a head of each of the queues; c) a scheduling engine adapted to select one packet from a plurality of packets at the heads of the queues, the scheduling engine comprising: i) a counter for maintaining a virtual time for the scheduling engine; ii) means for comparing the start time for each packet to the virtual time for the scheduling engine to select an eligible group of packets; iii) means for comparing the priorities of packets in the eligible group of packets and eliminating from the eligible group packets having a priority lower than a priority for another packet in the eligible group; and, iv) means for selecting one packet from the eligible group having an earliest finish time.

27. The apparatus of claim 26 comprising a plurality of scheduling engines linked to form a hierarchical tree, the tree comprising one or more parent scheduling engines each linked to one or more child scheduling engines, each parent scheduling engine comprising i) a counter for maintaining a virtual time for the parent scheduling engine; ii) means for comparing the start time for each packet held by a child scheduling engine linked to the parent scheduling engine to the virtual time for the parent scheduling engine to select an eligible group of packets; iii) means for comparing the priorities of packets in the eligible group of packets and eliminating from the eligible group packets having a priority lower than a priority for another packet in the eligible group; and, iv) means for selecting one packet from the eligible group having an earliest finish time.

28. Apparatus for scheduling the transmission of data packets on a data link, the apparatus comprising a plurality of scheduling engines linked to form a hierarchical tree, the tree comprising one or more parent scheduling engines each linked to one or more child scheduling engines, the one or more parent scheduling engines comprising: i) a counter for maintaining a virtual time for the parent scheduling engine; ii) means for comparing the start time for each packet held by a child scheduling engine linked to the parent scheduling engine to the virtual time for the parent scheduling engine to select an eligible group of packets; and, iv) means for selecting one packet having a first priority from the eligible group; and, v) means for selecting another packet having a second priority different from the first priority from the eligible group.

29. A method for scheduling transmission of data packets on a data link, the method comprising: a) providing a plurality of scheduling engines interlinked to form a hierarchical tree, the tree including at least a parent scheduling engine and a plurality of child scheduling engines linked to the parent scheduling engine, each of the child scheduling engines adapted to select and hold a data packet for eventual selection by the parent scheduling engine, the data packets each belonging to one of a plurality of classes, the classes each having a priority; b) in the parent scheduling engine: i) if any of the child scheduling engines are holding any data packets classified as having a first priority and the parent scheduling engine is not already holding a first priority data packet, selecting one of the first priority data packets by applying a selection criterion to first priority data packets held by the child scheduling engines; and, ii) if any of the child scheduling engines are holding any data packets classified as having a second priority and the parent scheduling engine is not already holding a second priority data packet, selecting one of the second priority data packets by applying a selection criterion to second priority data packets held by the child scheduling engines.

30. A method for scheduling transmission of data packets on a data link, the method comprising: a) providing a plurality of schedulers interlinked to form a hierarchical tree, the tree including a first scheduler adapted to select data packets from among data packets selected by one or more child schedulers, the first scheduler having a parent scheduler adapted to select data packets from a group of one or more data packets including a data packet selected by the first scheduler, each child scheduler adapted to select data packets from data packets at heads of one or more queues, each queue capable of receiving one or more data packets, the data packets each belonging to a class, each class having one of two or more priorities; b) in the first scheduler: i) providing a plurality of locations each able to hold one data packet, each of the locations corresponding to a different one of the two or more priorities; ii) whenever one or more of the locations is vacant, selecting an eligible group of data packets from a group comprising data packets selected by the child schedulers, the eligible group comprising data packets eligible for transmission according to an eligibility criterion; and, iii) for each of the vacant locations for which the eligible group comprises one or more packets belonging to a class having a priority equal to the priority of the vacant location, selecting one data packet from the eligible group by applying a selection criterion to an eligible sub-group, the eligible sub-group containing those one or more data packets which are in the eligible group and belong to classes having a priority equal to the priority of the vacant location; and, iv) holding the selected data packets available for forwarding by the parent scheduler.
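
The per-priority holding locations of claim 30 can be sketched as a mapping from priority to an optional held packet. This is only an illustrative reading: the `Packet` type, the function name `fill_vacant_locations`, the start-time eligibility test, and the first-to-finish selection criterion are assumptions, and removal of a selected packet from its child scheduler is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Packet:
    name: str
    priority: int
    start: float    # virtual start time
    finish: float   # virtual finish time


def fill_vacant_locations(
    locations: Dict[int, Optional[Packet]],
    candidates: List[Packet],
    virtual_time: float,
) -> None:
    """Fill each vacant per-priority location from the eligible candidates."""
    # Step ii): build the eligible group (assumed start-time eligibility).
    eligible = [p for p in candidates if p.start <= virtual_time]
    for priority in list(locations):
        if locations[priority] is not None:
            continue  # location already occupied; leave it alone
        # Step iii): the sub-group holds eligible packets of exactly this priority.
        sub_group = [p for p in eligible if p.priority == priority]
        if sub_group:
            # Assumed selection criterion: smallest finish time.
            locations[priority] = min(sub_group, key=lambda p: p.finish)


# One location per priority, both vacant to begin with.
locations: Dict[int, Optional[Packet]] = {2: None, 1: None}
offered = [
    Packet("a", priority=1, start=0.0, finish=4.0),
    Packet("b", priority=1, start=0.0, finish=3.0),
    Packet("c", priority=2, start=5.0, finish=6.0),
]
fill_vacant_locations(locations, offered, virtual_time=1.0)
```

After this call the priority-1 location holds "b" while the priority-2 location stays vacant, since "c" is not yet eligible. Holding one packet per priority lets the parent scheduler take a high-priority packet as soon as one becomes available, without it waiting behind a lower-priority packet selected earlier.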

31. The method of claim 21 comprising, in the first scheduler, providing locations for holding one data packet from each of a plurality of different priorities, and, if any of the locations is vacant and the eligible group includes one or more data packets belonging to classes having the same priority as a priority corresponding to the vacant location, selecting from the eligible group one data packet belonging to a class having the same priority as the priority corresponding to the vacant location.