SYSTEMS AND METHODS FOR PACKET FLOW REGULATION AND TRANSMISSION INTEGRITY VERIFICATION OF A SWITCHING ENTITY
CROSS-REFERENCES TO RELATED APPLICATION(S)
The present invention claims the benefit under 35 USC §119(e) of prior U.S. provisional patent application Serial No. 60/407,356 to Sammour et al., filed on September 3, 2002, and U.S. provisional patent application Serial No. 60/407,357 to De Maria et al., filed on September 3, 2002, both incorporated by reference herein.
FIELD OF THE INVENTION
The present invention relates generally to switching of packets by a switching entity and, in particular, to methods and apparatus for regulating the flow of packets through the switching entity and verifying the integrity of transmission through the switching entity.
BACKGROUND OF THE INVENTION
Switch fabrics are often used to route traffic between end points in a network. The need for regulating the flow of traffic through a switching entity arises whenever there is a risk of congestion in the switch fabric. Congestion results in the ingress or switch fabric buffers holding a large number of packets which, for some reason, cannot leave the switch fabric at the rate at which they are entering. This leads to two major problems. Firstly, the degree of out-of-order packets that may be received at the egress will be higher than what the egress is dimensioned for, which can cause significant reordering problems. Disadvantageously, receipt of out-of-order packets can lead to corruption in the reassembled frames. Secondly, when packets are transmitted in groups (i.e., frames), the degree of congestion in the switch fabric may cause the frame reassembly process of certain frames to take a very long time, resulting in a reduction in the outgoing bit rate, effectively introducing blocking.
In parallel with the need to manage congestion, there is also the need to verify transmission integrity through the switch fabric, especially when there is a reliability issue with the switch fabric, e.g., when there is a risk that the switch fabric will lose packets or when lost
packets are detected. The loss of packets is of course detrimental to the integrity of the traffic being transmitted between the end points, as it may cause an interruption in information flow between the end points of a traffic connection.
Current solutions to these problems are not satisfactory and thus there remains a need in the industry to regulate packet flow through a switching entity so as to reduce congestion and also there remains a need in the industry to verify transmission integrity of the switch fabric.
SUMMARY OF THE INVENTION
In accordance with a broad aspect, the present invention provides a scheme used for regulating packet flow through the switch fabric. The principle behind this scheme is based on exchanging tokens between the egress and ingress. Accordingly, the present invention may be broadly summarized as a system for regulating packet flow through a switching entity. The system comprises an ingress capable of sending packets to the switching entity in a designated order and an egress capable of receiving packets from the switching entity, re-ordering the packets in the designated order and sending an acknowledgement of receipt of the packets to the ingress upon re-ordering. The ingress is adapted to receive acknowledgements of receipt of packets from the egress, maintain an indication of a number of packets for which an acknowledgement of receipt has not yet been received from the egress, perform a comparison between the number of packets for which an acknowledgement of receipt has not yet been received from the egress and a threshold, and regulate sending packets to the switching entity on the basis of the comparison.
In accordance with another broad aspect, the present invention provides a method of regulating packet flow through a switching entity. The method comprises sending packets from an ingress to the switching entity in a designated order, receiving packets from the switching entity at an egress, re-ordering the packets in the designated order and sending an acknowledgement of receipt of the re-ordered packets to the ingress, receiving acknowledgements of receipt of packets from the egress, maintaining an indication of a number of packets for which an acknowledgement of receipt has not yet been received from the egress, performing a comparison between the number of packets for which an acknowledgement of receipt has not yet been received from the egress and a threshold, and regulating sending packets to the switching entity on the basis of the comparison.
In accordance with yet another broad aspect, the present invention provides computer-readable storage media containing a program element for execution by a computing device to implement an ingress for regulating packet flow through a switching entity. The ingress comprises a control entity operative to send packets to the switching entity in a designated order, receive acknowledgements of receipt of re-ordered packets from an egress, maintain an indication of a number of packets for which an acknowledgement of receipt has not yet been received from the egress, perform a comparison between the number of packets for which an acknowledgement of receipt has not yet been received from the egress and a threshold, and regulate sending packets to the switching entity on the basis of the comparison.
In accordance with another broad aspect, the present invention provides a scheme used for verifying the transmission integrity of the switch fabric. The principle behind this scheme is based on exchanging marker packets at identifiable reference instants. Accordingly, the present invention may be broadly summarized as a system for assessing integrity of a flow of packets through a switching entity. The system comprises an ingress capable of sending packets to the switching entity, the packets including a reference packet sent to the switching entity at a reference instant, and an egress capable of receiving packets from the switching entity and sending to the ingress an acknowledgement of receipt of packets from the switching entity. The ingress is adapted to receive acknowledgements of receipt of packets from the egress, maintain a current indication of packets for which an acknowledgement of receipt has not yet been received from the egress, store a first data element indicative of packets for which an acknowledgement of receipt had not yet been received from the egress at the reference instant, maintain a second data element indicative of packets for which an acknowledgement of receipt is received from the egress between the reference instant and the instant at which an acknowledgement of receipt of the reference packet is received from the egress, perform a comparison of the first and second data elements, and assess integrity of the flow of packets on the basis of the comparison.
In accordance with still another broad aspect of the present invention, there is provided a method of assessing integrity of a flow of packets through a switching entity. The method comprises sending packets from an ingress to the switching entity, the packets including a reference packet sent to the switching entity at a reference instant, receiving packets from
the switching entity at an egress and sending to the ingress an acknowledgement of receipt of packets from the switching entity, receiving acknowledgements of receipt of packets from the egress, maintaining a current indication of packets for which an acknowledgement of receipt has not yet been received from the egress, storing a first data element indicative of packets for which an acknowledgement of receipt had not yet been received from the egress at the reference instant, maintaining a second data element indicative of packets for which an acknowledgement of receipt from the egress is received between the reference instant and the instant at which an acknowledgement of receipt of the reference packet is received from the egress, performing a comparison of the first and second data elements, and assessing integrity of the flow of packets on the basis of the comparison.
In accordance with still another broad aspect, the present invention provides computer-readable storage media containing a program element for execution by a computing device to implement an ingress for regulating packet flow through a switching entity. The ingress comprises a control entity operative to send packets to the switching entity, the packets including a reference packet sent to the switching entity at a reference instant, receive acknowledgements of receipt of packets from an egress, maintain a current indication of packets for which an acknowledgement of receipt has not yet been received from the egress, store a first data element indicative of packets for which an acknowledgement of receipt had not yet been received from the egress at the reference instant, maintain a second data element indicative of packets for which an acknowledgement of receipt from the egress is received between the reference instant and the instant at which an acknowledgement of receipt of the reference packet is received from the egress, perform a comparison of the first and second data elements, and assess integrity of the flow of packets on the basis of the comparison.
Various other aspects of the invention address complexity issues arising from the addition of further functionality to each scheme.
These and other aspects and features of the present invention will now become apparent to those of ordinary skill in the art upon review of the following description of specific embodiments of the invention in conjunction with the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
In the accompanying drawings:
Fig. 1 is a block diagram of a switching entity disposed between a plurality of ingresses and a plurality of egresses;
Fig. 2 illustrates steps executed by the ingress (on the left-hand side) and egress (on the right-hand side) in order to implement a packet flow regulation scheme in accordance with an embodiment of the present invention;
Figs. 3 and 4 show additional steps executed by the ingress in order to implement a transmission integrity verification scheme in accordance with an embodiment of the present invention.
DETAILED DESCRIPTION
With reference to Fig. 1, there is shown a system 10 that comprises plural ingresses 12 and plural egresses 14 connected on either side of a switching entity, such as a switch fabric 16. The ingresses 12 and the egresses 14 are functional entities that are logically interconnected to one another although they may or may not be physically distinct. Typically, the ingresses
12 and the egresses 14 are implemented using a combination of hardware, software and control logic.
The ingresses 12 receive data from a source external to the system 10. The data is received via a plurality of physical or logical input queues (IQ) 24. The ingresses 12 process the data and provide packets 13 to the switch fabric 16. The ingresses 12 can be uniquely associated with individual input ports 18 of the switch fabric 16. The ingresses 12 provide the packets
13 to the switch fabric 16 via its input ports 18.
The switch fabric 16 can be a centralized entity or a distributed entity made up of smaller interconnected entities, and those smaller entities may physically reside in different chassis. Thus, it will be appreciated that the present invention applies to both centralized and distributed architectures of the switch fabric 16. The switch fabric 16 generally has the
capacity to switch packets from its input ports 18 to selected ones of a plurality of output ports 20, in accordance with routing instructions. The routing instructions provided to the switch fabric 16 may arrive from an external source and may take the form of instructions to "switch input port A to output port B". However, it is more common in the field of Internet routing that the routing instructions will be implicit in the information carried by the received packets themselves. For example, a given packet provided at input port A may specify an end destination, which is examined by the switch fabric 16 and translated by the switch fabric 16 into an output port B based on routing tables and the like. Such routing tables could be stored in a memory 21 that is internal or external to the switch fabric 16.
The egresses 14 can be uniquely associated with individual ones of the output ports 20 of the switch fabric 16. Each of the egresses 14 receives packets 13' from the output ports 20 of the switch fabric 16, performs processing and provides processed data to an opto-electronic converter (O/E) 28 via an output queue (OQ) 26. The opto-electronic converter 28 transforms the processed data received from the output queue 26 into an optical signal that is sent to an entity external to the system 10.
It is quite common for different ones of the packets 13 being released by a particular one of the ingresses 12 to be destined for different end destinations and, thus, different ones of the output ports 20 of the switch fabric 16. For example, it is possible that a first packet being received at input port A of the switch fabric 16 will need to be routed to output port B, but that the next packet received at that same input port A will need to be routed to output port B'. The relationship of A to B for the first packet and A to B' for the next packet can be referred to as a "source-destination pair".
For simplicity, but without limiting the scope of the invention, it may be useful to provide features for regulating the flow of packets on a per-source-destination pair basis. In other words, multiple packet flow regulation schemes are implemented, each for regulating the flow of packets sharing a common source-destination pair. These various packet flow regulation schemes may share some common resources, such as total memory or capacity, but the individual processes that implement flow regulation operate independently.
In fact, it is within the scope of the present invention to implement multiple packet flow regulation schemes for regulating the flow of packets sharing common characteristics other
than, or in addition to, the source-destination pair. For example, packets sharing a common quality of service requirement may have their own packet flow regulation schemes, irrespective of source-destination pair. Similarly, the characteristic may be priority, bandwidth class, and so on, or any combination of two or more characteristics.
In the following, the specific case is considered where the characteristic is the source-destination pair (also known as a "flow"), so that the ingress-side functionality of the packet flow regulation scheme is localized to a single one of the ingresses, denoted 12, and the egress-side functionality of the packet flow regulation scheme is localized to a single one of the egresses, denoted 14. However, it will be appreciated that in other embodiments, the ingress-side functionality and the egress-side functionality of each one of a plurality of packet flow regulation schemes may be distributed across multiple ones of the ingresses 12 and egresses 14, respectively.
With reference now to Fig. 2, the ingress 12 is adapted to send the packets 13 to the switch fabric 16 in a designated order. This can be referred to as performing a "sequencing operation" on a series of packets arriving at the ingress from an external entity. The sequencing operation is shown as box 202. The designated order is maintained and differentiated through the use of numbers (called sequence numbers). The size of the sequence number space is N. The value of N is chosen to be sufficiently large so as to account for the maximum delay through the switch fabric 16.
As the incoming packets typically consist of a header containing control information (e.g., implicit or explicit routing instructions) and a body containing data, the designated order can be established, for example, by modifying the header of each packet 13 so as to implement a linked list of numbers from zero to (N-1) and back to zero again. This is but one simple technique that permits the packets 13 to be eventually re-sequenced upon receipt at the egress 14 in a possibly out-of-order fashion.
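The sequencing operation described above may be sketched in illustrative Python (the class and field names are assumptions for the purpose of illustration and are not drawn from the specification):

```python
# Illustrative sketch (not from the specification): an ingress stamping
# packets with sequence numbers drawn from a space of size N, wrapping
# from N-1 back to zero so that the egress can later re-sequence them.

N = 256  # size of the sequence number space; chosen large enough to
         # account for the maximum delay through the switch fabric


class SequencingIngress:
    def __init__(self):
        self.next_seq = 0

    def stamp(self, packet: dict) -> dict:
        """Write the next sequence number into the packet header."""
        packet["seq"] = self.next_seq
        self.next_seq = (self.next_seq + 1) % N  # wrap from N-1 back to 0
        return packet
```

In this sketch a packet is modeled as a simple dictionary; an actual implementation would write the sequence number into a header field of the packet 13.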
Other boxes pertaining to the ingress 12 in Fig. 2 represent operations that form part of a packet flow regulation scheme that controls the rate at which the packets 13 are released into the switch fabric 16. The packet flow regulation scheme will be described in further detail herein below. It should also be appreciated that the ingress 12 may include various other features, such as an arbiter or scheduler connected to the ingress queues 24, which
controls the release of packets into the switch fabric 16, subject to the packet flow regulation scheme.
At the egress 14, the order of the packets 13' received from the switch fabric 16 may be different from the designated order. This can be the result of queuing, arbitration or other processing features of the switch fabric 16. Accordingly, the egress 14 receives packets 13' from the switch fabric 16 and is adapted to re-order the packets 13' so that they re-acquire the designated order. This is known as a re-sequencing operation, as shown in box 252. The re-sequencing operation can be effected by, for example, examining the contents of the header in each packet 13' received from the switch fabric 16. The re-sequenced packets undergo further processing and the processed data is then sent to the opto-electronic converter 28 via the output queue 26 (see box 254). It will be understood that independent re-sequencing operations may be performed for each of the packet flow regulation schemes being implemented, e.g., on the basis of a characteristic such as source-destination pair, quality of service, priority, bandwidth class, etc.
The egress 14 keeps track of successfully re-sequenced packets by updating a "token credit counter" (see box 256), which may be implemented in software, for example. When the value of the token credit counter has reached a certain level, i.e., upon re-sequencing a certain number of the received packets 13', an acknowledgement of receipt of the re-sequenced packets is sent back to the ingress 12. The acknowledgement can be sent to the ingress 12 via the switch fabric 16 or through a separate external link (not shown). In a specific embodiment, the acknowledgement is sent back to the ingress 12 after the processed data corresponding to the packets 13' has exited the output queue 26 (i.e., just prior to opto-electronic conversion by the opto-electronic converter 28).
Specifically, the acknowledgement may take the form of a special packet, referred to as a "token credit packet" (TCP) 22, which distinguishes itself from other packets by, e.g., a code in its header. A token credit packet 22 may be sent each time a packet is re-sequenced at the egress 14; however, this approach may create congestion in the direction from the egress 14 to the ingress 12. Thus, in a specific embodiment, a token credit packet 22 is sent once the value of the token credit counter exceeds a value M, where M is a desired integer, possibly although not necessarily less than N. This is shown in box 258. M can be fixed or time-varying. One rationale for varying M would be to ensure that the token credit packet 22 will not be unduly congested on its way back to the ingress 12. Thus, M could be made to vary on the basis of a measure of the amount of available return bandwidth through the switch fabric 16. This measure of the amount of available return bandwidth may be effected by the switch fabric 16 and communicated to the egress 14. Alternatively, M can remain fixed, but an additional condition for sending a token credit packet could be that the available return bandwidth be above a particular threshold. Other embodiments contemplate allocating the token credit packet 22 with the highest possible priority in order to ensure that it is timely received at the ingress 12 so as not to unduly slow down the flow through the switch fabric 16.
A function of the token credit packet 22 is to acknowledge the receipt of the M successfully re-sequenced packets with which a particular token credit packet 22 is associated. In one embodiment, where M would be fixed, it is not necessary to transmit the value M, as this information can be implicitly known by the ingress 12 from the mere fact that a token credit packet 22 is received (e.g., which will impliedly convey that the token credit counter has reached the value M). In other cases, where M is time-varying, this information can be specifically embedded in the body of the token credit packet 22. Once the token credit packet 22 is sent to the ingress 12, the token credit counter is updated again. This is shown in box 260. In particular, if the egress 14 is designed so as to send a token credit packet 22 after the token credit counter reaches a value M, then updating the token credit counter consists of decrementing it by M.
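The egress-side accounting of boxes 256, 258 and 260 may be sketched as follows (an illustrative sketch under the assumption of a fixed M; the names are not drawn from the specification):

```python
# Hypothetical sketch of the egress-side token credit accounting: the token
# credit counter is incremented for each successfully re-sequenced packet
# (box 256); once it reaches a fixed value M, a token credit packet is
# emitted (box 258) and the counter is decremented by M (box 260).

M = 5  # fixed threshold for emitting a token credit packet


class EgressCredits:
    def __init__(self):
        self.token_credit_counter = 0
        self.sent_tcps = 0  # stand-in for transmitting TCPs 22 to the ingress

    def on_resequenced_packet(self):
        self.token_credit_counter += 1          # box 256
        if self.token_credit_counter >= M:
            self.sent_tcps += 1                 # box 258: send one TCP 22
            self.token_credit_counter -= M      # box 260: update the counter
```

After twelve re-sequenced packets, for example, two token credit packets would have been emitted and a residue of two credits would remain in the counter.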
Another option is to send an acknowledgement on a periodic, irregular or spontaneous basis, regardless of the value of the token credit counter. In such instances, the acknowledgement can take the form of a "token refresh packet" (TRP) 40. The appropriate moment for triggering the transmittal of a token refresh packet may be dependent on received information (e.g., receipt of a special type of packet, called a "marker packet", as will be described later on, or an amount of elapsed time since the previous transmittal of an acknowledgement). Once the appropriate moment is reached, the token refresh packet 40 is sent. This is shown in box 262. Once the token refresh packet 40 is sent to the ingress 12, the token credit counter is updated. This is shown in box 264 and basically consists of resetting the token credit counter to zero.
A function of the token refresh packet 40 is to acknowledge that a certain number of packets have been successfully received and re-sequenced at the egress 14. This number is equal to the value of the token credit counter at the time the acknowledgement is generated. Since this number is not known ahead of time, the value of the token credit counter can be specifically embedded in the body of the token refresh packet 40. Once the token refresh packet 40 is sent to the ingress 12, the token credit counter is updated again. This is shown in box 264. In particular, this consists of resetting the token credit counter to zero.
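The token refresh path of boxes 262 and 264 may be sketched in the same illustrative style (names are assumptions, not from the specification):

```python
# Illustrative sketch of the token refresh path: on a trigger (e.g., a timer
# or receipt of a marker packet), the egress embeds the current value of the
# token credit counter in a token refresh packet (box 262) and then resets
# the counter to zero (box 264).

class EgressRefresh:
    def __init__(self):
        self.token_credit_counter = 0

    def on_resequenced_packet(self):
        self.token_credit_counter += 1

    def send_token_refresh_packet(self) -> dict:
        # Box 262: the count is not known ahead of time, so it is carried
        # explicitly in the body of the TRP 40.
        trp = {"type": "TRP", "count": self.token_credit_counter}
        self.token_credit_counter = 0  # box 264: reset after sending
        return trp
```

The dictionary here stands in for the body of the token refresh packet 40; an actual implementation would serialize the count into the packet body.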
It will be appreciated that because the token credit counter is updated, read and updated again only once the egress 14 has ensured that the received packets 13' have re-acquired the designated order, new traffic can be fed to the switch fabric 16 by the ingress 12 without endangering the integrity of the re-sequencing operation.
In accordance with an embodiment of the present invention, the ingress 12 applies a packet flow regulation scheme in order to alleviate congestion at the switch fabric 16. In order to apply the packet flow regulation scheme, the ingress 12 keeps track of the number of packets that have been fed to the switch fabric 16 and for which an acknowledgement of receipt has not yet been received from the egress 14. Let the number of such as-yet-unacknowledged packets equal L(t), where t is a time variable denoting that the quantity L(t) will change over time. The ingress 12 keeps track of L(t) by, for example, implementing what can be referred to as a "reverse token bucket", where a "token" refers to a packet for which an acknowledgement has not yet been received. As will be shown in greater detail below (see also boxes 204, 206 and 208 in Fig. 2), the fill level L(t) of the reverse token bucket is generally increased upon sending packets to the switch fabric 16 and is reduced upon receipt of a token credit packet 22 or token refresh packet 40 from the egress 14. The reverse token bucket may be one of many similar buckets stored in a table in memory, such table being referred to as a "token bucket table". The token bucket table is indexed on a per-flow basis and thus, for each different flow being handled by the ingress, the appropriate entry in the token bucket table is consulted.
By way of a simplistic example, let M be equal to 5 and let the expected delay through the switch fabric 16 be equivalent to the duration of ten (10) packets. Also, it will be assumed that the packets 13' received by the egress 14 are already in the designated order so that re-sequencing is not required. In such a scenario, the first fifteen (15) packets will be sent out of the ingress 12 without yet receiving an acknowledgement. The reverse token bucket would thus hold a fill level L(t) of 15. Meanwhile, due to the delay through the switch fabric 16, five (5) packets 13' will have emerged from the switch fabric 16 at the egress 14. Assuming that the egress 14 instantaneously recognizes that it has received M = 5 packets 13', that the egress 14 instantaneously transmits a token credit packet 22 to the ingress 12, and that the ingress 12 is instantaneously capable of updating the reverse token bucket, the reverse token bucket will, after the fifteenth (15th) packet 13 is sent into the switch fabric 16, drop to the value 10. Subsequently, the fill level L(t) of the reverse token bucket will climb gradually again to 15 and then drop to 10; this pattern will continue indefinitely, assuming the ideal circumstances whereby there is no congestion through the switch fabric 16. Thus, under ideal circumstances, the fill level L(t) of the reverse token bucket is roughly indicative of the expected latency of the switch fabric 16.
However, under non-ideal circumstances, the fill level L(t) of the reverse token bucket will rise above the expected delay through the switch fabric 16. Thus, the fill level L(t) of the reverse token bucket effectively becomes a measure of the congestion through the switch fabric 16. Accordingly, in an embodiment of the present invention, the fill level of the reverse token bucket is compared to a threshold, denoted a, which represents a demarcation between an acceptable and an unacceptable delay through the switch fabric 16. This is shown in box 210 in Fig. 2. The comparison can be effected in hardware, software or a combination thereof. The comparison can be effected for each packet or for each group of packets. On the basis of this comparison, the ingress 12 can regulate the transmission of packets being fed to the switch fabric 16. For example, if the fill level L(t) of the reverse token bucket is less than the threshold a, then a next group of packets could continue to be sent to the switch fabric 16 (see box 212), while if the fill level L(t) of the reverse token bucket is greater than or equal to the threshold a, then the ingress 12 could be made to refrain from sending the next group of packets to the switch fabric 16 and, instead, place the packets in temporary storage, such as a buffer (see box 214). The act of refraining from sending packets to the switch fabric 16 need not be applied to the next packet (or group of packets) but rather to the following one. Thus, for example, transmission of the next packet (or group of packets) may be permitted regardless of the fill level L(t) of the reverse token bucket, although transmission of the subsequent packet (or group of packets) will be placed on hold.
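The ingress-side regulation loop of boxes 206, 208, 210, 212 and 214 may be sketched as follows (a hedged, illustrative sketch; class and method names are assumptions):

```python
# Illustrative sketch of the reverse token bucket: the fill level L(t) rises
# by the number of packets released into the fabric (box 208), falls upon
# receipt of acknowledgements (box 206), and is compared against the
# backpressure watermark a before each group of packets is released (box 210).

class ReverseTokenBucket:
    def __init__(self, watermark: int):
        self.fill = 0               # L(t): as-yet-unacknowledged packets
        self.watermark = watermark  # threshold a (backpressure watermark)

    def may_send(self, group_size: int) -> bool:
        """Box 210: compare L(t) against a before releasing a group."""
        if self.fill < self.watermark:
            self.fill += group_size  # box 208: count the released packets
            return True              # box 212: continue sending
        return False                 # box 214: buffer instead of sending

    def on_acknowledgement(self, acked: int):
        """Box 206: a TCP 22 (acked = M) or a TRP 40 (acked = counter value)."""
        self.fill = max(0, self.fill - acked)
```

With a watermark of 15 and groups of 5 packets, for instance, three groups are released before the bucket reaches the watermark and further groups are held until an acknowledgement drains the bucket.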
A similar effect could be achieved if a memory element in the ingress is made to contain a value S, which is the time-varying result of the operation sgn(L(t) - a). In this case, a binary decision to continue to send, or refrain from sending, packets to the switch fabric 16 is made on the basis of whether S is or is not equal to -1. The value of S may be conveniently stored in a table, which may be termed a "backpressure table" 30. The backpressure table 30 is indexed on a per-flow basis and thus, for each flow being handled by the ingress 12, the entry in the backpressure table corresponding to the appropriate flow is consulted. An advantage of using the backpressure table 30 instead of performing the comparison of the fill level L(t) of the reverse token bucket to the threshold a stems from the fact that the entries in the backpressure table 30 can be modified from "behind the scenes", namely as a result of the occurrence of other events, not only on the basis of the difference between the fill level L(t) of the reverse token bucket and the threshold a. For example, it may be useful to allow a software module that monitors error conditions or that performs debugging to have access to the backpressure table 30 so as to allow it to exert control over the transmission of packets to the switch fabric 16.
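The backpressure-table variant may be sketched as follows (illustrative only; the function and table names are assumptions, not from the specification):

```python
# Illustrative sketch of the backpressure table 30: the send/hold decision
# is precomputed as S = sgn(L(t) - a) and stored per flow, so the transmit
# path merely reads a table entry. Other modules (e.g., a debugging module)
# may overwrite entries directly, from "behind the scenes".

def sgn(x: int) -> int:
    """Signum: -1, 0 or +1."""
    return (x > 0) - (x < 0)


backpressure_table = {}  # indexed on a per-flow basis


def update_entry(flow: str, fill_level: int, watermark: int):
    """Recompute S = sgn(L(t) - a) for one flow."""
    backpressure_table[flow] = sgn(fill_level - watermark)


def may_send(flow: str) -> bool:
    # Sending continues only while S == -1, i.e., L(t) is below the
    # watermark a; entries default to -1 (send) for unseen flows.
    return backpressure_table.get(flow, -1) == -1
```

A flow identifier here is a simple string standing in for a source-destination pair.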
The decision to continue to send, or refrain from sending, packets to the switch fabric 16 can be made continually, periodically or sporadically, depending on various design parameters, such as the desired responsiveness of the ingress 12 to changes in the latency through the switch fabric 16, the available processing power of the ingress 12, and so on. For instance, this may lead to a design in which a comparison between the fill level L(t) of the reverse token bucket and the threshold a, or alternatively the reading of the value S, is performed by the ingress 12 for each group of successive packets, so that the result of the comparison will lead to a decision to transmit or not transmit the next (or subsequent group of) packets. It is noted that upon transmission of a group of packets to the switch fabric 16, the reverse token bucket is incremented by the number of packets in the group (see box 208). This update to the reverse token bucket may be done shortly before or shortly after packets have in actuality been sent to the switch fabric 16. It is recalled that the fill level of the reverse token bucket is decremented upon receipt of a token credit packet 22 or token refresh packet 40 from the egress 14 (see box 206). Thus, it is envisaged that the fill level L(t) of the reverse token bucket may be increased in increments greater than one and decremented in increments of M.
It is recalled that the threshold a represents the maximum number of unacknowledged packets that are allowed to leave the ingress 12. The threshold a can thus be referred to as a "backpressure watermark". While in some embodiments a may be fixed, in other embodiments it may vary over time from amongst a set of possible backpressure watermark values, for example. This means that traffic can be throttled even though the switch fabric 16 is not operating at maximum capacity. This enhancement may be used by the egress 14 to slow down (regulate) the rate at which each ingress 12 is sending traffic to it, based on a set of regulation criteria.
Examples of such regulation criteria include an indication of resource (e.g., memory) utilization at the egress 14. In one embodiment, the egress 14 may be equipped with suitable circuitry, software and/or control logic for monitoring usage of a resource (such as memory) to determine a resource utilization level. If multiple egresses share the same resources (e.g., memory space), then the same resource utilization level could be used by all of the egresses concerned. The egress 14 is adapted to send the resource utilization level to the ingress 12. The ingress 12 then determines the backpressure watermark in accordance therewith, e.g., by assessing whether the resource utilization is considered low, medium or high.
In another embodiment, the egress 14 is adapted to perform the further step of determining the backpressure watermark. This backpressure watermark is then sent to the ingress 12, e.g., via the switch fabric 16 or an external link. Alternatively, the backpressure watermark may be selected from a fixed set of watermarks that are associated with respective codes known to the ingress 12. In such a scenario, the egress 14 may send the code to the ingress 12, allowing the ingress 12 to set the backpressure watermark in accordance with the code. Again, transmission of such information from the egress 14 to the ingress 12 can be achieved through the switch fabric 16 or via a separate link (not shown).
Also, there may exist a plurality of thresholds, at least one of which is time-varying. For example, a first threshold may be set by the current level of resource utilization and a second threshold may be set by a known storage capacity of the switch fabric 16. In this case, it is reasonable for the actual threshold used by the ingress 12 as the backpressure watermark to be the minimum of the two thresholds.
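The watermark selection just described can be sketched as follows. This is an illustrative sketch only, not taken from the specification itself: the utilization band boundaries and the candidate watermark values are assumptions chosen for illustration, and the operative watermark is taken as the minimum of the utilization-derived threshold and the switch fabric storage capacity.

```python
# Candidate backpressure watermarks keyed by utilization band.
# The specific values are illustrative assumptions.
UTILIZATION_WATERMARKS = {"low": 64, "medium": 32, "high": 8}

def utilization_band(utilization: float) -> str:
    """Classify an egress resource utilization level in [0.0, 1.0]
    as low, medium or high (band boundaries are assumptions)."""
    if utilization < 0.5:
        return "low"
    if utilization < 0.8:
        return "medium"
    return "high"

def backpressure_watermark(utilization: float, fabric_capacity: int) -> int:
    """Operative watermark: the minimum of the utilization-derived
    threshold and the known storage capacity of the switch fabric."""
    first_threshold = UTILIZATION_WATERMARKS[utilization_band(utilization)]
    return min(first_threshold, fabric_capacity)
```

The ingress would then compare the reverse token bucket fill level against the value returned here rather than against a fixed threshold.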
In other embodiments, it is envisaged that the above feature could also be used to implement a slow-start mechanism or a token-based bandwidth allocation and control mechanism.
In yet another embodiment, instead of implementing a reverse token bucket, which keeps track of as-yet-unacknowledged packets, it is within the scope of the present invention to implement a "positive" token bucket. For example, a positive token bucket keeps track of the number of packets that the ingress 12 is still allowed to emit. Each time that a packet is sent to the switch fabric 16 by the ingress 12, the number of packets so transmitted is subtracted from the positive token bucket. Meanwhile, acknowledgements received from the egress 14 will tend to add to the fill level of the positive token bucket. If the fill level of the positive token bucket falls to zero, the ingress 12 is adapted to refrain from sending further packets to the switch fabric 16. Thus, the positive token bucket is used to regulate the flow of packets through the switch fabric 16.
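A minimal sketch of the "positive" token bucket described above follows. The class name, method names and the initial credit value are illustrative assumptions; the logic mirrors the text: sending consumes credit, acknowledgements restore it, and a fill level of zero blocks further transmission.

```python
class PositiveTokenBucket:
    """Tracks how many more packets the ingress may emit before it
    must wait for acknowledgements from the egress."""

    def __init__(self, initial_credit: int):
        self.credit = initial_credit

    def can_send(self) -> bool:
        # The ingress refrains from sending when the fill level is zero.
        return self.credit > 0

    def on_send(self, n_packets: int = 1) -> None:
        # Packets sent into the switch fabric are subtracted from
        # the positive token bucket.
        self.credit -= n_packets

    def on_ack(self, n_packets: int) -> None:
        # Acknowledgements from the egress add to the fill level.
        self.credit += n_packets
```

Note the symmetry with the reverse token bucket: the two fill levels would sum to a constant, so either representation carries the same information.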
The above packet flow regulation scheme implemented by the cooperation of the ingress 12 and egress 14 in the system 10 controls the extent to which out-of-order packets occur, and hence can control congestion through the switch fabric 16 by ensuring that the egress 14 will not receive more packets than what it is prepared to accept.
The packet flow regulation scheme of the present invention can also be made to operate in a frame-based mode. Specifically, each of the "packets" described above can be considered to be a "frame" consisting of multiple "segments". Multiple frames are assumed to be received from an external entity by the ingress 12 and to occupy the designated order. In addition, the segments within a given frame also define a specific order within their
"parent" frame. Thus, the segments received at the egress 14 (in an unknown order) need to be reassembled into the appropriate parent frame, and the frames themselves need to be reordered into the designated order prior to being released by the egress 14.
In a first variant, the ingress 12 and egress 14 implement a pure frame-based mode of the packet flow regulation scheme, which can be specified by a software flag at the ingress 12. The re-ordering of segments within a frame is not considered in the pure frame-based mode of operation, although in another embodiment, both frame-based and non-frame-based modes of the packet flow regulation scheme may coexist simultaneously and independently.
To implement the pure frame-based mode of operation of the packet flow regulation scheme, a sequencing and re-sequencing operation is performed amongst the frames themselves. It is recalled that the "packets" referred to above now represent "frames". Each frame has a sequence number which ranges from zero to (N-1). The designated order referred to above refers to the order in which the frames are sent into the switching entity 16 by the ingress 12. Thus, the reverse token bucket now counts unacknowledged frames sent by the ingress 12 into the switch fabric 16.
The comparison between the fill level L(t) of the reverse token bucket and threshold a needs to take the frame-based mode of operation into account. Thus, L(t) represents the number of as yet unacknowledged frames from the point of view of the ingress 12. For its part, the token credit counter tracks the number of properly re-sequenced frames at the egress 14. When the token credit counter reaches a value of M, the egress 14 sends a token credit packet 22 to the ingress 12. Because the ingress 12 operates a frame-based packet flow regulation scheme, the ingress will know that the token credit packet 22 represents a total of M properly re-sequenced frames and updates the fill level L(t) of the reverse token bucket accordingly.
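The egress-side token credit counter in frame-based mode can be sketched as below. This is a hypothetical illustration: the value of M and the send callback are assumptions, but the behavior follows the text, namely that a single token credit packet worth M frames is emitted each time M frames have been properly re-sequenced, after which the counter resets.

```python
def make_egress_counter(M, send_token_credit):
    """Return a callback invoked once per properly re-sequenced frame.
    Every M frames, one token credit packet acknowledging M frames is
    sent to the ingress and the counter restarts from zero."""
    state = {"count": 0}

    def on_frame_resequenced():
        state["count"] += 1
        if state["count"] == M:
            send_token_credit(M)   # one packet acknowledges M frames
            state["count"] = 0

    return on_frame_resequenced
```

This batching is what yields the bandwidth saving noted below: only one egress-to-ingress packet per M re-sequenced frames, at the cost of coarser (hence slower) feedback.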
Although the above-described frame-based mode of operation tends to be characterized by a slower responsiveness to congestion, it tends to utilize less bandwidth in the egress-to-ingress direction than a non-frame-based mode of operation of the packet flow regulation scheme, since a token credit packet is sent by the egress 14 only once M frames have been successfully re-sequenced.
In another variant of the present invention, which can be used with either mode of operation, the ingress 12 is adapted to perform a monitoring operation on the token credit packets 22 received from the egress 14. The token credit packets 22 can be enhanced so that they identify not only the number of packets for which they are acknowledging receipt, but also the identity of those packets themselves (e.g., by specifying a range of packet numbers). Thus, if the ingress 12 detects that there is a gap in the packets being acknowledged by the token credit packets 22, the ingress 12 can re-transmit the missing packets 13. It should also be noted that missing token credit packets 22 may be symptomatic of a more significant transmission integrity problem, which could necessitate
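The gap-monitoring variant above can be sketched as follows. The representation of an acknowledgement as a (first, last) range of packet sequence numbers is an assumption made for illustration; the specification says only that the enhanced token credit packets identify the acknowledged packets, e.g., by specifying a range.

```python
def find_gaps(acked_ranges):
    """Given acknowledged (first, last) sequence-number ranges in the
    order received, return the sequence numbers skipped between
    consecutive ranges; these are candidates for retransmission."""
    missing = []
    prev_last = None
    for first, last in acked_ranges:
        if prev_last is not None and first > prev_last + 1:
            # A gap in the acknowledged packets: packets prev_last+1
            # through first-1 were never acknowledged.
            missing.extend(range(prev_last + 1, first))
        prev_last = last
    return missing
```

An ingress applying this check could retransmit the returned sequence numbers, or treat a persistent gap as a symptom warranting the verification process described next.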
the invocation of a packet verification process. A verification process may also be triggered by the egress 14 which can be designed to monitor transmission integrity through the switch fabric 16 and to detect anomalies such as missing packets or extensive delays, as well as generalized failures of the ingress 12, egress 14 or switch fabric 16.
With reference to Fig. 3, a verification process, or "transmission integrity verification scheme", in accordance with an embodiment of the present invention contemplates the use of reference, or "marker", packets. Specifically, the ingress 12 is adapted to send a marker packet 50 through the switch fabric 16 at a reference instant, hereinafter denoted TREF. Although the reference instant may be a reference instant in time, it may also represent, e.g., the sequence number of the most recently received packet from the outside world. Generally, the reference instant TREF refers to an event whose origins are detectable by the ingress 12 and need not be an absolute time reference.
It will be appreciated that different transmission integrity verification schemes may be operating in parallel, each such scheme being applied to packets that undergo a common packet flow regulation scheme. It is recalled that a variety of packet flow regulation schemes may be implemented to handle packets sharing common characteristics such as source-destination pair, bandwidth class, priority and quality of service, to name a few. However, it is also within the scope of the present invention to implement a transmission integrity verification scheme without recourse to an associated packet flow regulation scheme, or, alternatively, to use a packet flow regulation scheme that does not require sequencing and re-sequencing of packets.
The reference instant TREF may occur at predetermined intervals or it may be determined dynamically by a control entity inside or outside the ingress 12. For example, TREF may be an instant in the future that includes an offset indicative of a maximum elapsed time (or number of received packets from the external world) since receipt of the most recent acknowledgement of received packets from the egress 14. In another embodiment, TREF may be triggered by a condition in the ingress 12, the egress 14 or the switch fabric 16, such as attaining a maximum permitted resource utilization (e.g., memory or processing). In yet another embodiment, TREF is the time at which an integrity problem is detected, such as a missing acknowledgement or detection of a lost packet.
On the other hand, TREF may be neither pre-determined nor determined by a control entity. Instead, it may be set by the arrival of a marker packet 50 from outside the ingress 12, in which case TREF is taken to be the current time. In order to determine whether a packet received by the ingress 12 is a marker packet, the contents (e.g., header or body) of each received packet can be examined. It should also be noted that the marker packet 50 could originate as an ordinary (non-marker) packet received from outside the ingress 12 which is then modified (e.g., by changing its header) to turn it into a marker packet; again TREF is taken to be the current time.
In any event, the ingress 12 checks to see whether the current time has reached TREF (box 302) and, if so, sends the marker packet 50 into the switch fabric 16. This is shown in Fig. 3 at box 302. It is noted that the steps in Fig. 3 could be executed by the ingress 12 at a point marked by the circled number "3" in Fig. 2, i.e., upon receipt of a packet from an entity external to the ingress 12. Also at the reference instant TREF, the ingress 12 resets a second counter (see box 304), which starts counting the number of packets for which an acknowledgement of receipt is received from the egress 14, starting at time TREF. The value of the second counter at a given time t can be denoted D(t). The manner in which the second counter is incremented will be described in further detail below.
Also upon transmittal of the marker packet 50 at the reference instant TREF, the ingress 12 takes note of the current fill level of the reverse token bucket (see box 306). This value is denoted L(TREF) and represents the number of packets for which an acknowledgement had not been received at the reference instant TREF. It will be appreciated that L(TREF) is in fact the sum of the number of packets in four categories:
A) packets in transit between the ingress 12 and the egress 14 at time TREF;
B) packets accounted for by the value of the token credit counter at time TREF;
C) packets accounted for by token credit packets 22 on their way towards the ingress 12; and
D) lost packets.
Meanwhile, the egress 14 continues to operate in the manner described above with reference to Fig. 2. Specifically, the egress 14 sends token credit packets 22 at certain instants when the token credit counter in the egress 14 reaches the value M (see box 258). In addition, the
egress 14 is adapted to send a token refresh packet 40, which, as recalled, is sent to the ingress 12 without waiting for the token credit counter to reach the value M (see box 262). In accordance with the present embodiment, one of the circumstances under which a token refresh packet 40 is sent to the ingress 12 (i.e., one of the conditions under which box 262 will yield a result of "YES") is upon receipt of the marker packet 50 from the switch fabric 16. For the purposes of more clearly describing the present embodiment, the token refresh packet sent to acknowledge successful receipt of the marker packet 50 is specially denoted 4050. Such a token refresh packet 4050 indicates to the ingress 12 the value of the token credit counter in the egress 14 at the time the token refresh packet 4050 is being sent to the ingress 12. Following transmission of the token refresh packet 4050, the token credit counter in the egress 14 is reset as for any other token refresh packet (see box 264).
Now, returning to the description of operation of the ingress 12, and with reference to Figs. 2 and 4, the value, D(t), of the second counter increases as acknowledgements are received by the ingress 12. The circle in Fig. 2 which contains the numeral "4" indicates a possible point in the processing effected by the ingress 12 to execute the steps indicated by the boxes in Fig. 4. Box 308 is indicative of the fact that the second counter is increased by the number of acknowledgements registered by the ingress 12. The acknowledgements registered in this manner include those received by the ingress 12 in the form of a token credit packet 22 that explicitly or implicitly acknowledges successful receipt of M packets by the egress 14, as well as those received by the ingress 12 in the form of a token refresh packet 40 (or 4050) that acknowledges successful receipt of a number of packets specified in the token refresh packet itself. It is noted that the second counter is incremented by an amount equal to that by which the reverse token bucket counter is decremented.
Thus, each time such a token credit packet 22 or token refresh packet 40 (or 4050) is received, the second counter is incremented (see box 308). Let TREF+dt denote the time at which the token refresh packet 4050 is received by the ingress 12 (see boxes 310 and 312). It is at this point that the ingress 12 can be assured that it has received acknowledgements for all packets that belong to categories A), B) and C) above. Moreover, the number of packets for which these acknowledgements have been received is stored in the value D(TREF+dt) of the second counter, since the second counter started counting when there were still no acknowledgements from any of the packets belonging to categories A), B) and C) above.
Stated differently, the marker packet 50 transmitted at time TREF flushes out the token credit counter in the egress 14 and thus, none of the packets which may have been in transit prior to the transmission of the marker packet 50 should be unaccounted for at the time that the acknowledgement of the marker packet 50 is received at the ingress 12. Hence, if there is a difference (see box 314) between L(TREF) and D(TREF+dt), this difference is attributable to the number of lost packets, i.e., category D) above. In fact, this difference can be used to correct (i.e., calibrate) the current value L(t) of the reverse token bucket at time t, by subtracting from it the difference L(TREF) - D(TREF+dt) (see box 316). If the difference is negative, then the net change to L(t) will be positive and signifies the possibility that packets were not lost but that excessive / erroneous acknowledgements may have been produced.
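The calibration arithmetic just described reduces to a few lines, sketched below with illustrative variable names: the fill level recorded at the reference instant, minus the acknowledgements counted since then, gives the lost-packet count of category D), which is then subtracted from the current fill level of the reverse token bucket.

```python
def calibrate(L_tref, D_tref_dt, L_current):
    """Return (lost_packets, corrected_fill_level).

    L_tref     -- fill level L(TREF) of the reverse token bucket when
                  the marker packet was sent
    D_tref_dt  -- second-counter value D(TREF+dt) when the marker
                  acknowledgement arrived
    L_current  -- current fill level L(t) to be corrected
    """
    lost = L_tref - D_tref_dt     # category D): packets lost in transit
    # A negative 'lost' increases L, indicating excessive or
    # erroneous acknowledgements rather than losses.
    return lost, L_current - lost
```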
In a variant, all token refresh packets 40 acknowledge the receipt of marker packets 50 at the egress 14 (i.e., 40 is equivalent to 4050). In this variant, the token credit counter in the egress 14 is not reset following the transmission of a token refresh packet 40 (i.e., box 264 is eliminated). In addition, box 206 will only apply when a token credit packet 22 is received (and not when a token refresh packet 40 is received) at the ingress 12. Meanwhile, the algorithm/arithmetic used by the ingress 12 to assess the integrity of the flow of packets will continue to operate as shown in Figs. 3 and 4. This provides yet further independence between the flow regulation and transmission integrity verification schemes.
In a further variant, the system 10 can be made to account for loss of the token refresh packet 4050. Specifically, according to this variant, the ingress 12 starts a timer at time TREF, i.e., when the marker packet 50 is sent into the switch fabric 16. The timer has an expiry time that is arbitrary and may be designed to take into account an expected reasonable delay before receiving the token refresh packet 4050 from the egress 14. If no token refresh packet is received prior to the expiry time of the timer, then this may mean that the token refresh packet 4050 is lost. At this point, the ingress 12 may decide to send a second marker packet at an instant TREF2. The ingress 12 then considers TREF2 as being the reference instant.
Hence, upon receipt of a token refresh packet acknowledging the second marker packet at time TREF2+dt, the ingress 12 computes the number of lost packets (which refers to the number of packets that had been lost at time TREF2) as being L(TREF2) - D(TREF2+dt).
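The timer-and-retry behavior of this variant can be sketched as follows. The function names, the timeout parameter and the retry bound are assumptions for illustration; the specification leaves the expiry time arbitrary and does not bound the number of attempts.

```python
def await_marker_ack(send_marker, wait_for_ack, timeout, max_tries=3):
    """Send marker packets until one is acknowledged by a token
    refresh packet, moving the reference instant forward on each
    retry. Returns the index of the successful attempt (0 for the
    original marker, 1 for the second marker sent at TREF2, etc.),
    or None if every attempt times out."""
    for attempt in range(max_tries):
        send_marker()                 # marker sent at the new TREF
        if wait_for_ack(timeout):     # True if the refresh arrived
            return attempt
        # Timer expired: the token refresh packet may be lost;
        # fall through and send another marker.
    return None
```

On success, the lost-packet computation proceeds exactly as before, but relative to the reference instant of the acknowledged marker.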
Those of ordinary skill in the art will also appreciate that the ingress 12 and egress 14 may be implemented as a processor having access to a code memory which stores program instructions for the operation of the processor. The program instructions could be stored on a medium which is fixed, tangible and readable directly by the processor (e.g., removable diskette, CD-ROM, ROM, or fixed disk), or the program instructions could be stored remotely but transmittable to the processor via a modem or other interface device (e.g., a communications adapter) connected to a network over a transmission medium. The transmission medium may be either a tangible medium (e.g., optical or analog communications lines) or a medium implemented using wireless techniques (e.g., microwave, infrared or other transmission schemes).
Those skilled in the art should also appreciate that the program instructions stored in the code memory can be compiled from a high level program written in a number of programming languages for use with many computer architectures or operating systems. For example, the high level program may be written in assembly language, while other versions may be written in a procedural programming language (e.g., "C") or an object oriented programming language (e.g., "C++" or "JAVA").
Those skilled in the art should further appreciate that in some embodiments of the invention, the functionality of the processor may be implemented as pre-programmed hardware or firmware elements (e.g., application specific integrated circuits (ASICs), electrically erasable programmable read-only memories (EEPROMs), etc.), or other related components.
While specific embodiments of the present invention have been described and illustrated, it will be apparent to those skilled in the art that numerous modifications and variations can be made without departing from the scope of the invention as defined in the appended claims.