WO2006020232A1 - Network interconnect crosspoint switching architecture and method

Network interconnect crosspoint switching architecture and method

Info

Publication number
WO2006020232A1
WO2006020232A1 (PCT/US2005/025438)
Authority
WO
WIPO (PCT)
Prior art keywords
packet
scheduler
packets
crosspoint
serial link
Prior art date
Application number
PCT/US2005/025438
Other languages
French (fr)
Inventor
Jacob V. Nielsen
Claus F. Hoyer
Jacob J. Schroeder
Original Assignee
Enigma Semiconductor
Priority date
Filing date
Publication date
Application filed by Enigma Semiconductor filed Critical Enigma Semiconductor
Priority to EP05772308A (EP1779607B1)
Priority to DE602005012278T (DE602005012278D1)
Publication of WO2006020232A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 49/00 Packet switching elements
    • H04L 49/10 Packet switching elements characterised by the switching fabric construction
    • H04L 49/101 Packet switching elements characterised by the switching fabric construction using crossbar or matrix
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 47/00 Traffic control in data switching networks
    • H04L 47/50 Queue scheduling
    • H04L 47/52 Queue scheduling by attributing bandwidth to queues
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 49/00 Packet switching elements
    • H04L 49/25 Routing or path finding in a switch fabric
    • H04L 49/253 Routing or path finding in a switch fabric using establishment or release of connections between ports
    • H04L 49/254 Centralised controller, i.e. arbitration or scheduling
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 49/00 Packet switching elements
    • H04L 49/70 Virtual switches
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 49/00 Packet switching elements
    • H04L 49/15 Interconnection of switching modules
    • H04L 49/1515 Non-blocking multistage, e.g. Clos
    • H04L 49/1523 Parallel switch fabric planes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 49/00 Packet switching elements
    • H04L 49/30 Peripheral units, e.g. input or output ports
    • H04L 49/3045 Virtual queuing

Definitions

  • the invention relates generally to backplane interconnect switching
  • a central switch device implements a
  • the buffer is shared by all input and output ports of the
  • central switch device and the buffer allows for simultaneous reads and writes
  • variable size packets and the memory buffer already holds variable size packets for all output ports, the memory buffer must provide a bandwidth equivalent to
  • N the number of ports (N input ports and N output ports equals 2N total
  • VoQs virtual output queues
  • switch devices provided, e.g., by redundant switch cards.
  • the shared memory architecture has several advantages including that it
  • QoS Quality of Service
  • the crosspoint buffer architecture is very similar to the shared
  • crosspoint buffer architecture is less of a constraint as compared to the shared
  • N the number of individual crosspoint buffers
  • a buffer is shared among a
  • crosspoint buffer architecture has similar advantages to those discussed
  • arbitrated crosspoint architecture is based on a crosspoint switch device that
  • any buffering for traffic as it passes through the crosspoint switch device is located on the ingress and egress line cards.
  • VoQs virtual output queues
  • switch device provided, e.g., by redundant switch cards.
  • the scheduler can either be implemented in a stand-alone device, or
  • the arbitration decision is updated on a regular time slotted
  • crosspoint architecture is dependent on the type of scheduling algorithm
  • the scheduler function can either be performed by a single scheduler unit
  • the centralized scheduler configures all crosspoints in the system with the result of its
  • a distributed scheduler typically only configures
  • the arbitrated crosspoint architecture scales better since the
  • crosspoint device does not use an integrated memory, and can relatively easily be
  • crosspoint configuration is performed on a per timeslot basis. This introduces a
  • the device also includes a plurality of input line cards.
  • transceiver devices respectively provided for a plurality of output line cards.
  • the device further includes a switch device communicatively coupled to each of
  • the switch device is capable of
  • the device also includes
  • the device further includes a switch device communicatively
  • the switch device includes a crosspoint matrix for communicatively
  • the device also includes at least one output queue for temporarily storing at least one
  • At least one cell or variable size packet being sent to one of the output line cards.
  • the switch device outputs the at least one cell or variable size packet based on
  • scheduling commands provided by way of a scheduling unit.
  • the method also includes temporarily storing the received information in
  • the method includes outputting the first
  • the method further includes temporarily storing, if it is determined to be
  • the method also includes
  • Figure 1 shows a switch device configured as a number of independent
  • Figure 2 shows an interconnect topology for a system having parallel
  • Figure 3 shows an interconnect topology for a system having parallel
  • FIG. 4 is a functional block diagram of a switch device according to at
  • FIG. 5 is a functional block diagram of a switch device according to at
  • Figure 6 shows a system level overview of a queuing system according to
  • Figure 7 shows a queue structure of an ingress transceiver according to at
  • Figure 8 shows an example of transceiver packet collapsing and scheduler
  • Figure 9 shows a structure of an egress transceiver according to at least
  • Figure 10 shows a structure of a crosspoint queue according to at least one
  • Figure 11 shows an example of a request and acknowledge handshake
  • Figure 12 shows a scheduler's request FIFO structure according to an aspect of the invention
  • Figure 13 shows a scheduler's multicast request FIFO and multicast
  • Figure 14 shows the relationship between the scheduler's serial link status
  • Figure 15 shows an overview of the scheduler's egress serial link
  • Figure 16 shows how the scheduler's Egress Serial Link Allocation Status
  • Timers operate, according to at least one aspect of the invention.
  • Figure 17 shows an overview of the scheduler's ingress serial link
  • Figure 18 shows how the scheduler's Ingress Serial Link Allocation Status
  • Timers operate, according to at least one aspect of the invention.
  • FIG. 19 shows an overview of the scheduler's long connection monitors
  • Figure 20 shows the transmission path for ingress flow control signals
  • Figure 21 shows the transmission path for egress flow control signals
  • Figure 22 shows a Scheduler Decision Time Line versus a Serial Link Path
  • Figure 23 shows a time line example of worst-case CQ contention
  • An aspect of present invention provides
  • At least one embodiment of the present invention is directed
  • crosspoint implements a buffer structure that traffic passes through.
  • This buffer structure according to at least one embodiment of the invention is different from a shared memory architecture and a crosspoint buffer architecture in structure and in purpose.
  • According to at least one embodiment of the present invention, a scheduler
  • arbitration process is performed on variable size packet units, and combining the
  • interconnect architecture provides for back-to-back switching of variable size
  • crosspoints and the scheduler is not required, since the crosspoint buffers can
  • variable size packets from inputs to outputs of the crosspoint.
  • variable size packets can be distributed and switched back-to-back across parallel crosspoints
  • one cell equals
  • variable size packet when a variable size packet does not equal an integer number of cell units.
  • variable size packet tail when estimating the size of the variable size
  • variable size packets have a tail, i.e., where the
  • variable size packet is not an integer multiple of a fixed size unit (e.g. ,
  • variable size packets may collide when they arrive at ingress
  • An embodiment of the present invention provides a
  • collision buffer per ingress and egress serial link, which absorbs (e.g. , is
  • variable size packets back-to-back with byte-level granularity towards a destination. If data accumulates in the collision buffers due to variable
  • control mechanism is used to temporarily halt the central scheduling mechanism
  • a scheduler is capable of
  • the scheduler accordingly represents multiple outstanding requests in
  • connection-oriented switching according to at least one
  • At least one embodiment ensures that the amount of congestion per crosspoint output has a small upper limit, and thus substantial buffering per crosspoint is not
  • a bit expresses the choice between two possibilities and is represented by a logical one (1) or zero (0).
  • Byte An 8-bit unit of information. Each bit in a byte has the value 0 or 1.
  • Control Record A link layer data unit, which is transmitted on control serial links.
  • Control Serial Link A serial link connecting a transceiver to a scheduler.
  • Crosspoint A virtual switch device operating in crosspoint mode
  • Crosspoint Mode A virtual switch device mode of operation, see Crosspoint.
  • Cut-Through Switching A low-latency packet switching technique, where the packet switching can begin as soon as the packet header has arrived.
  • Data Serial Link A serial link connecting a transceiver to a crosspoint.
  • Egress The direction of the information flow from the virtual switch devices to the egress transceiver user interface via the egress serial links and egress transceiver.
  • Egress Port A logical transmission channel, which carries data across the transceivers egress user interface. See also Port.
  • Egress Serial Link A simplex serial electrical connection providing transmission capacity between two devices.
  • Egress Tick An egress serial link flow control command.
  • Egress Transceiver The part of a physical transceiver device, which handles the forwarding of data from the virtual switch devices to the egress user interfaces.
  • Egress User Interface The part of a user interface, which transfers packets in the egress direction. See also User Interface.
  • Head of Line Refers to the element positioned as the next outgoing element in a FIFO mechanism.
  • Ingress The direction of the information flow originating from the ingress user interfaces to the virtual switch devices via the ingress serial links and ingress transceiver.
  • Ingress Serial Link A simplex serial electrical connection providing transmission capacity between two devices.
  • Ingress Tick An ingress serial link flow control command.
  • Ingress Transceiver The part of a physical transceiver device, which handles the forwarding of data from the ingress user interfaces to the virtual switch devices.
  • Ingress User Interface The part of a user interface, which transfers packets in the ingress direction. See also User Interface.
  • Long Connection A switch connection established to accommodate the switching of a long packet.
  • Long Packet A packet of size Long Min bytes or more.
  • Long Release A command generated by the ingress serial link flow control mechanism that can generate Long Release commands.
  • Packet Per Second (pps) Applied to packet streams. This is the number of transmitted packets on a data stream per second.
  • Prefix M or G (Mega or Giga).
  • Packet Tail The remaining packet size when excluding the maximum number of fixed size N-byte units that the packet can be divided into.
  • Physical Switch Device A physical device, which can be logically partitioned into multiple virtual switch devices. The physical switch device is typically located on the switch cards in a modular system.
  • Physical Transceiver Device A physical device, which is divided into two logical components: Ingress and an Egress Transceiver. The transceiver device is typically located on the line cards in a modular system.
  • Port A logical definition of user traffic streams, which enter and exit the switching system via the physical transceiver device.
  • a port can provide QoS services and flow control mechanisms for the user.
  • QoS Class A QoS class of service definition which may include strict priority, weighted allocations, guaranteed bandwidth allocation service schemes, as well as any other bandwidth allocation service scheme.
  • QoS Quality of Service
  • Scheduler A virtual switch device operating in scheduler mode Scheduler Mode A virtual switch device mode of operation, see Scheduler.
  • Serial Link A serial electrical connection providing transmission capacity between a transceiver and a virtual switch device, which includes an ingress and egress serial link.
  • Short Connection A switch connection established to accommodate the switching of a short packet.
  • Short Packet A packet of size (Long Min - 1) bytes or less.
  • Store-and-Forward Switching A packet switching technique, where the packet forwarding can begin once the entire packet has arrived and has been buffered.
  • User Interface A physical interface on the physical transceiver device, across which packets can be transferred in both the ingress and egress direction. See also Ingress User Interface and Egress User Interface.
  • Virtual Output Queue A data queue representing an egress port, but which is maintained on the ingress side of the switching system.
  • Virtual Switch Device A logical switch element, which can operate, for example, either as a crosspoint or as a scheduler. See also "Physical Switch Device".
  • Word An N-bit (e.g., 16 bit, 32 bit, etc.) unit of information. Each bit in a word has the value 0 or 1.
  • invention provides backplane connectivity and switching functionality between
  • line cards and switch cards that may be disposed, for example, in a modular
  • PCB chassis printed circuit board
  • switch cards are
  • switch card implements one or more switch devices to perform the actual
  • switch card With a 1:1, N+1 or N+M redundancy scheme, two or
  • construction of the first embodiment includes two different devices: a physical
  • transceiver device and a physical switch device.
  • a physical switch device One or more physical
  • transceiver devices are implemented per line card.
  • One or more physical switch devices are implemented per switch card.
  • a switch device can be configured as a number of virtual switch devices
  • switch device is partitioned into V x virtual switch devices, with L vx ingress and
  • Such a switch device 100 is
  • Figure 1 which includes Virtual Switch Devices 0, 1, . . . , V x - 1.
  • V x which is an integer
  • Each virtual switch device is connected to L vx ingress serial links and L vx egress serial
  • L vx is an integer greater than or equal to one.
  • Each virtual switch device utilized in the first embodiment can operate in
  • switch devices in the system operate as crosspoints in the crosspoint mode.
  • a physical switch device may correspond to a single
  • a physical switch device can be partitioned into
  • Figure 2 shows the serial link connectivity, i.e. , the interconnect topology
  • the system 200 has four parallel physical switch devices 0, 1, 2, 3, each physical switch device implemented with one virtual switch device. Note the
  • transceiver (#0, #1, . . . , or #(L - 1)) is capable of outputting packets to each of
  • Figure 3 shows the serial link connectivity for an example system
  • a transceiver connects one serial link to one virtual switch device.
  • transceivers in the systems are not required to be connected to all virtual switch
  • transceiver operate with the same serial link speed and physical layer encoding overhead, but multiple physical layer encoding schemes can also be
  • each transceiver or switch device can operate from
  • Figure 4 is a functional block diagram of a virtual switch device 400 that
  • the switch device is connected to N
  • serial inputs also referred to as ingress serial links
  • N serial outputs also referred to as egress serial links
  • the switch device includes a crosspoint matrix
  • the crosspoint matrix may be implemented as a memory storage device that stores packets and that
  • matrix 410 provides connectivity between all inputs and outputs, as shown in
  • An input driver is used to forward packets from each ingress serial link to a corresponding input port of the crosspoint matrix, in a manner known to
  • An output driver is used to provide packets from each
  • Figure 5 is a functional block diagram of a virtual switch device 500 that
  • N is an integer representing the
  • the switch device
  • 500 also includes for each of the N inputs, a number of request FIFOs 560 equal
  • the Scheduler Unit 580 performs
  • a) Ingress port to egress port store-and-forward switch mode; b) Ingress port to egress port cut-through switch mode
  • the ingress transceiver implements the following queues for packet
  • VoQ Virtual Output Queue
  • ILQ Ingress Serial Link Queue
  • ERQ Egress Reordering Queue
  • EPQ Egress Port Queues
  • crosspoints 620A, 620B, . . . , 620n-1 and scheduler 620n are part of a single
  • An incoming packet arrives at an ingress port and is directly coupled to
  • the incoming packet is received at an ingress transceiver.
  • crosspoint 620A, . . . , or 620n-1 under command of the scheduler 620n.
  • VoQ to an ILQ also referred to herein as "input FIFO"
  • Each egress transceiver 630A, . . ., 630n comprises an EPQ, whereby a packet is then output
  • the packet is removed from the buffer memory.
  • a unicast packet entering the ingress transceiver is written into a buffer
  • the ingress transceiver forwards a request to the scheduler, when the following conditions are met:
  • VoQ Reqs The current number of pending requests for the VoQ is less than VoQ Reqs .
  • the value of VoQ Reqs is defined depending on the required latency and jitter packet switching performance for the individual VoQs.
  • the size of the packets for which a corresponding request is currently pending is within a defined range. The value of this range is defined depending on the required latency and jitter packet switching performance for the individual VoQs. [0087] The forwarding of the packets from the VoQs to the ILQs is controlled by
  • FIG. 11 shows an
  • a new request is forwarded to the scheduler for that new packet. Also, a head-of-line packet unit is forwarded to the ILQ when an
  • unicast packets are
  • the egress transceivers perform re-ordering; the re-ordering function is based on a sequence ID, which is carried by every packet as
  • the ingress transceiver devices maintain
  • sequence IDs preferably
  • a sequence ID counter is incremented each time a packet is
  • a packet is given a sequence ID "stamp".
  • a sequence ID counter is incremented and stamped into
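The fragments above describe sequence-ID stamping at the ingress transceivers and re-ordering at the egress transceivers (the ERQ). The following is a minimal sketch of that idea only; the class names, the dictionary packet representation, and the per-destination counter granularity are illustrative assumptions rather than details taken from the specification.

```python
from heapq import heappush, heappop

class IngressSequencer:
    """Stamps each outgoing packet with a per-destination sequence ID (illustrative model)."""
    def __init__(self):
        self.counters = {}                # destination -> next sequence ID

    def stamp(self, packet, destination):
        seq = self.counters.get(destination, 0)
        packet["seq_id"] = seq            # carried in the packet header across the crosspoint
        self.counters[destination] = seq + 1
        return packet

class EgressReorderQueue:
    """Releases packets from one ingress transceiver in sequence-ID order,
    absorbing the small skew introduced by the parallel crosspoints."""
    def __init__(self):
        self.expected = 0
        self.pending = []                 # min-heap keyed on sequence ID (IDs are unique per source)

    def push(self, packet):
        heappush(self.pending, (packet["seq_id"], packet))
        ready = []
        while self.pending and self.pending[0][0] == self.expected:
            ready.append(heappop(self.pending)[1])
            self.expected += 1
        return ready                      # packets now in original order
```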
  • An Ingress Serial Link queue (ILQ) is implemented for each ingress serial
  • Each ILQ can accept packets at the performance dictated by the scheduler
  • the size of the ILQ is preferably determined based on an analysis of worst
  • the scheduler operation ensures
  • the transceiver is capable of
  • the transceiver unicast packet collapse function is performed per VoQ.
  • the incoming packet is smaller than Request Min bytes, or the size of the current last packet in the VoQ is smaller than Request Min bytes, where the last packet in the VoQ may itself include packets that have previously been collapsed.
  • the current last packet in the VoQ includes no more than T Col - 1 collapsed packets, where T Col is an integer value greater than one.
  • a Request for the current last packet in the VoQ has not yet been generated and forwarded to the scheduler.
  • the transceiver multicast packet collapse function is performed per VoQ.
  • the incoming packet's egress transceivers destination fanout is identical to the egress transceiver fanout of the current last packet in the VoQ.
  • the incoming packet's size is smaller than Request Min bytes, or the size of the current last packet in the VoQ is smaller than Request Min bytes, where the last packet in the VoQ may itself include packets which have previously been collapsed.
  • the current last packet in the VoQ includes no more than T Col - 1 collapsed packets.
  • a Request for the current last packet in the VoQ has not yet been generated and forwarded to the scheduler.
  • scheduler collapses requests 4, 5, and 6 into a single equivalent packet unit in the
  • transceiver has collapsed 2 VoQ packets as shown by label "A", and collapsed 3
  • VoQ packets as shown by label "B". Only a single request is sent to the
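A minimal sketch of the unicast packet collapse test outlined above, assuming a VoQ modelled as a Python list of packet records; the Request Min and T Col values and the field names are illustrative placeholders, not values from the specification.

```python
REQUEST_MIN = 64      # bytes; illustrative stand-in for Request Min
T_COL = 4             # maximum number of packets that may be merged into one collapsed packet

def can_collapse_unicast(voq, incoming_size):
    """True if an incoming unicast packet may be collapsed with the current
    last (possibly already collapsed) packet of the same VoQ."""
    if not voq:
        return False
    last = voq[-1]
    small_enough = incoming_size < REQUEST_MIN or last["size"] < REQUEST_MIN
    room_left = last["collapsed_count"] <= T_COL - 1
    no_request_yet = not last["request_sent"]
    return small_enough and room_left and no_request_yet

def enqueue_unicast(voq, incoming_size):
    """Append the packet to the VoQ, collapsing it with the tail entry when allowed."""
    if can_collapse_unicast(voq, incoming_size):
        voq[-1]["size"] += incoming_size
        voq[-1]["collapsed_count"] += 1
    else:
        voq.append({"size": incoming_size, "collapsed_count": 1, "request_sent": False})
```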
  • buffer memory is assigned a buffer memory location, and written into buffer memory (not
  • the egress transceiver may need to perform packet sequence re-ordering
  • class > basis between the systems ingress and egress ports.
  • ILQs and CQs are sized so as to be shallow queues in a preferred
  • the re-ordering can therefore be performed per <ingress transceiver>,
  • the packet buffer capacity of the CQ is preferably determined based on
  • the egress transceiver maintains an Egress Port Queue (EPQ) 830 per QoS
  • the entire packet is stored in the egress memory buffer.
  • the packet is qualified for final forwarding
  • one CQ is implemented per egress serial link in the crosspoint.
  • a CQ buffers unicast and multicast packets, and can accept up to CQ L packets
  • the packets are forwarded out of the CQ towards its
  • the lane buffer structure 920 of the CQ 900 is preferably implemented
  • word can be marked as empty per lane.
  • Each packet can occupy one lane.
  • Each lane supports a throughput equivalent to the
  • CQ LS dimensioning guideline
  • dimensioning of each lane is made so as to ensure that a short packet
  • CQ LS is defined as
  • This dimensioning guideline provides more buffer space than what is
  • a lane buffer write pointer 940 is
  • a packet is read out from one or more of the lanes of the CQ 900 in the
  • short packets are immediately qualified for readout, while long
  • the packet is scheduled for transmission across the same egress serial link, the packet
  • an additional amount (e.g., 40 bytes) is added to the
  • each serial link in the physical switch device has an associated
  • resulting number represents the egress serial link across which the crosspoint will
  • scheduler mode is
  • control serial link protocol is running on the serial links connecting the
  • transceivers with the scheduler, and exchanges ingress transceiver queue status
  • the data serial link protocol is running on the serial links connecting the
  • transceivers with the crosspoints, and includes a small header per packet carrying
  • Each ingress transceiver is capable of forwarding multiple requests
  • packets are switched using a 'one packet
  • the first step is the establishment
  • For such packets, the scheduler
  • connection is referred to as a "short connection.”
  • Congestion may occur at the ingress serial link queues (ILQ) and egress
  • serial link queues (CQ) for the following reasons:
  • When traffic is accumulated in such a crosspoint queue, the crosspoint sends a flow control command (Egress Tick) to the scheduler via the transceiver
  • the ingress transceiver implements a small queue per ingress serial link (ILQ).
  • ILQ ingress serial link
  • transceiver sends a flow control command (Ingress Tick) to the scheduler, and
  • the scheduler then delays any re-allocation of the ingress serial link for a small
  • serial link resources are made available to the scheduler
  • the scheduler calculates the next scheduling decision.
  • the scheduler preferably
  • the scheduler preferably maintains a serial link state table
  • the crosspoint queue's (CQ) fill level determine the switching delay across
  • the scheduler ensures an upper bound on the latency and fill level of the CQ.
  • a switching connection includes a switch path across an ingress and egress
  • the scheduler supports the
  • a) Short connections A short connection is requested for packets shorter than Long Min .
  • the scheduler will allocate the ingress and egress serial link(s) for a period of time equivalent to the packet size ignoring any packet tail (because the size of the corresponding Request did not include any packet tail), which may cause the corresponding short packet to overlap with the previous packet at both the ingress and egress serial link.
  • b) Long connections A long connection is requested for packets that are equal to, or longer than Long Min .
  • the transceiver will request the scheduler to de-allocate the connection when only T ILQ bytes remain in the ingress serial link ILQ, which may cause the corresponding long packet to overlap with the next packet at both the ingress and egress serial link.
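The two connection types can be summarized in a short sketch; Long Min, the (Tail Max + 1) unit size, and the function name are illustrative stand-ins for the parameters named above.

```python
LONG_MIN = 512          # bytes; illustrative value of Long Min
TAIL_UNIT = 20          # bytes; illustrative (Tail Max + 1) scheduling unit

def classify_request(packet_size):
    """Short packets are scheduled for a known duration (tail ignored);
    long packets hold the links until a Long Release arrives."""
    if packet_size < LONG_MIN:
        # Short connection: the scheduler allocates the ingress and egress
        # serial links for the packet size rounded down to whole units,
        # deliberately ignoring the packet tail.
        units = packet_size // TAIL_UNIT
        return ("short", units)
    # Long connection: allocated open-ended; the ingress transceiver requests
    # de-allocation when only T ILQ bytes remain in the ILQ.
    return ("long", None)
```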
  • transceiver and scheduler related to the establishment of serial link connections.
  • [Size] A specific size of a short packet (Request Min ≤ Size < Long MIN) excluding any packet tail, or an unknown size long packet (Long MIN).
  • [QoS] The QoS priority of the packets pending for acknowledgement in the VoQ.
  • the scheduler Once the scheduler has granted a serial link connection for a unicast packet, it will send an acknowledge back to the transceiver.
  • the acknowledge has the following format: Unicast acknowledge: <Switch, Transceiver, Size, Timer>, where:
  • Transceiver An integer number defining the destination transceiver and VoQ.
  • [Size] A specific size short packet (Request Min ≤ Size < Long MIN), or an unknown size long packet (Long MIN).
  • Timer The value of ingress serial link allocation state variable Timer, just before the acknowledge was generated and Timer was updated accordingly. This field is used by the ingress serial link flow control function. [0126] After receipt of the acknowledge, forwarding of the corresponding unicast
  • the ingress transceiver may receive an
  • the ingress transceiver calculates how many
  • connection requests the scheduler collapsed, and then forwards all of these as a
  • acknowledge is different and multiple acknowledges may be generated as a result
  • Multicast acknowledge <Multicast ID, Switch, Size>, where:
  • MulticastID The multicast ID associated with the multicast packet.
  • Switch The virtual switch device across which the multicast packet shall be switched
  • [Size] A specific size short packet (Request Min ≤ Size < Long MIN), or an unknown size long packet (Long MIN).
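Since the acknowledge formats are enumerated field by field above, they can be captured directly as records. The Python types below are assumptions, chosen only to illustrate the <Switch, Transceiver, Size, Timer> and <Multicast ID, Switch, Size> layouts.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UnicastAcknowledge:
    switch: int                 # virtual switch device chosen for the connection
    transceiver: int            # destination transceiver / VoQ
    size: Optional[int]         # short-packet size, or None for an unknown-size long packet
    timer: int                  # ingress serial link Timer value sampled before the grant

@dataclass
class MulticastAcknowledge:
    multicast_id: int           # multicast ID associated with the multicast packet
    switch: int                 # virtual switch device across which the packet is switched
    size: Optional[int]         # short-packet size, or None for an unknown-size long packet
```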
  • Figure 11 shows an example of a
  • the scheduler implements a unicast request FIFO for each VoQ in the
  • the unicast request FIFO structure is shown in Figure 12.
  • request FIFO is dimensioned to hold a number of requests equal to the maximum
  • connection requests except the last FIFO entry that can be either a short
  • a) If the last entry in the connection request FIFO is a long request, the incoming connection request is added to the connection request FIFO as a new entry. b) If the last entry in the connection request FIFO is a short request, it is collapsed with the incoming connection request: the size of the incoming request is added to the size of the last entry, and if the result of the addition is > Long Min, the last entry is converted to a long request.
  • Each unicast request FIFO also maintains a QoS variable.
  • the QoS
  • variable is preferably set equal to the QoS priority of the last incoming request.
  • the request FIFO structure can be implemented with a low gate count using the
  • the scheduler implements one multicast request FIFO per ingress
  • the multicast request FIFO is identical to the unicast request FIFO except
  • Each entry in the multicast request FIFO can be either short or long and each entry also holds the multicast ID associated with the request.
  • the incoming request can only be collapsed with the last entry in the multicast request FIFO if the two multicast IDs are identical.
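A sketch of the scheduler-side request FIFO collapse rule described above, for the unicast case. It assumes that a long request carries a size equal to Long Min and that the conversion threshold coincides with the long-packet boundary; the value of Long Min and the entry layout are illustrative.

```python
from collections import deque

LONG_MIN = 512   # bytes; illustrative Long Min; long requests are assumed to carry size == LONG_MIN

class UnicastRequestFifo:
    """Per-VoQ request FIFO kept by the scheduler; at most the last entry is a short request."""
    def __init__(self):
        self.entries = deque()   # each entry: {"size": int, "long": bool}
        self.qos = None          # QoS variable, set from the most recent incoming request

    def add_request(self, size, qos):
        self.qos = qos
        if self.entries and not self.entries[-1]["long"]:
            # Last entry is short: collapse the incoming request into it, and convert
            # the entry to a long request once it grows past the long-packet boundary.
            self.entries[-1]["size"] += size
            if self.entries[-1]["size"] >= LONG_MIN:
                self.entries[-1]["long"] = True
        else:
            # FIFO empty or last entry is long: append a new entry.
            self.entries.append({"size": size, "long": size >= LONG_MIN})
```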
  • Requests are forwarded from the request FIFOs to the request queue by a strict
  • the ingress transceiver that generated the request is also added to the
  • the multicast request queue supports collapsing of requests in a manner
  • the scheduler maintains allocation status per serial links in the system
  • the scheduler algorithm also relates to
  • each egress serial link allocation status receives egress serial
  • scheduler algorithm uses the information it receives to generate connection
  • the allocation status per egress serial link tracks the allocation and de-allocation
  • the allocation status is
  • Timer based on an integer variable named Timer, which is defined in the range from 0
  • the scheduler algorithm can allocate the egress serial link for a
  • Timer is allocated for a long connection
  • Timer is assigned a value of T if no Egress Tick has arrived while
  • serial link flow control de-allocates long connections before their transmission is
  • T is dimensioned such that if there is a very high probability
  • Timer which tracks the allocation and de-allocation of the egress serial link for both short and long connections (implemented with Log2(N + 1) register bits).
  • EgressTick which remembers if one or more Egress Ticks have arrived (implemented with 1 register bit).
  • Timer = Timer + (ET Size / (Tail Max + 1) - 1);
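A hedged sketch of the per-egress-serial-link allocation state described above. Only the two state variables (Timer and an EgressTick flag), the (Tail Max + 1)-byte Timer unit, and the tick adjustment Timer = Timer + (ET Size / (Tail Max + 1) - 1) are taken from the text; the per-unit countdown, the constant values, and the method names are assumptions.

```python
ET_SIZE = 80          # bytes idled per Egress Tick; the text mentions 80-byte ticks as one example
TAIL_MAX = 19         # bytes; maximum packet tail, so one Timer unit is Tail_Max + 1 = 20 bytes
T_LONG = 100          # value T loaded for a long connection; illustrative

class EgressLinkStatus:
    """Per egress serial link allocation state kept by the scheduler (sketch)."""
    def __init__(self):
        self.timer = 0            # remaining allocation, in units of (Tail_Max + 1) bytes
        self.egress_tick = False  # remembers that one or more Egress Ticks arrived

    def allocate_short(self, packet_size):
        # Short connection: allocate for the packet size, ignoring the tail.
        self.timer += packet_size // (TAIL_MAX + 1)

    def allocate_long(self):
        # Long connection: load the fixed value T; released later by a Long Release.
        self.timer = T_LONG

    def on_egress_tick(self):
        self.egress_tick = True

    def tick_timeslot(self):
        # One (Tail_Max + 1)-byte unit elapses.
        if self.egress_tick:
            # Apply a pending Egress Tick: the net update from the text,
            # Timer = Timer + (ET_Size / (Tail_Max + 1) - 1), extends the allocation
            # and stalls new grants on this link for roughly ET_Size bytes.
            self.timer += ET_SIZE // (TAIL_MAX + 1) - 1
            self.egress_tick = False
        elif self.timer > 0:
            self.timer -= 1

    def is_free(self):
        return self.timer == 0
```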
  • the allocation status per ingress serial link tracks the allocation and de-allocation
  • the allocation status is
  • Timer based on an integer variable named Timer, which is identical to the equivalent
  • Timer is assigned a value of T. [0151] When the ingress serial link is allocated for a short connection, Timer is
  • T is preferably dimensioned such that if there is a very high
  • Timer which tracks the allocation and de-allocation of the egress serial link for both short and long connections (can be implemented using Log2(N+1) register bits).
  • Long Allocate which detects if a long connection is never released due to a bit transmission error.
  • the scheduler tracks the allocated ⁇ ingress, egress > serial link pairs for
  • the scheduler itself tracks which egress serial link was associated with the long connection and then de-allocates both the ingress and egress serial link. This is
  • serial link is setup to point to the allocated ingress serial link.
  • the ingress serial link flow control mechanism uses two
  • Ingress Tick - compensates for tails accumulated during transmission of short packets.
  • the ingress transceiver generates a Long Release command for a specific
  • B ILQ The current number of bytes in the ILQ queue.
  • T ILQ An ILQ queue threshold (bytes)
  • Data Parameter T ILQ is preferably dimensioned to be as small as possible, to reduce
  • T ILQ ≥ D LR-Data.
  • the Ingress Tick flow control command is used to compensate for the
  • the scheduler When the scheduler generates a short acknowledge, it inserts
  • the allocation status provided by the scheduler is compared with the actual fill level of the ILQ, to
  • ILQ accumulation This can occur when the packet tails are greater than zero.
  • the ingress transceiver generates one or more
  • BILQ Number of bytes in the ILQ queue before the short packet is added.
  • PTail Size of the short packets tail.
  • Timer The ingress serial link allocation status (Timer) right before the scheduler made the scheduler decision for the short connection (Timer T) multiplied by the Timer unit which is (TailMax + 1).
  • PTicks Number of previously generated Ingress Ticks, which have not yet expired, multiplied by ITSize. A previously generated Ingress Tick expires after a period of DTick-Ack, and it is the ingress transceiver which tracks the expiration of previous Ingress Ticks per ingress serial link.
  • the number of generated Ingress Ticks is:
  • the egress serial link flow control mechanism uses one signaling
  • Egress Tick - compensates for tails accumulated during transmission of short packets.
  • the transceiver collects the Egress Tick messages from the parallel
  • the flow control loop latency is preferably minimized.
  • B CQ The current number of bytes in the CQ queue.
  • T CQ is an integer variable that is incremented by ET Size when condition 1 below is true, and T CQ is decremented by ET Size when B CQ < T CQ - ET Size, but T CQ can never be assigned a value less than 2×ET Size, which is also the reset value of T CQ.
  • M CQ is an integer with a value slightly lower than ET Size to ensure that back-to-back generated Egress Ticks over a long period of time will completely halt the allocation of the serial link throughout that entire period of time. This is done due to small ppm differences in clock frequency between the transceiver and crosspoint device and non-deterministic latency in the forwarding of tick signals across clock domains in the transceiver.
  • Egress Tick at a time may therefore be 'lost' upon arrival at
  • the scheduler if the serial link by then is allocated for a long connection.
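The adaptive threshold bookkeeping for T CQ can be sketched as follows. The tick-generating trigger referred to as "condition 1" is not reproduced in the fragments above, so it is left as a caller-supplied flag; the 2×ET Size floor and reset value and the decrement rule follow the description, while the ET Size constant is illustrative.

```python
ET_SIZE = 80   # bytes; illustrative Egress Tick size

class CqThreshold:
    """Adaptive threshold T_CQ maintained per crosspoint queue (sketch)."""
    def __init__(self):
        self.t_cq = 2 * ET_SIZE          # reset value, which is also the lower bound

    def update(self, b_cq, condition_1):
        """b_cq: current bytes in the CQ; condition_1: the tick-generating condition
        from the specification (not reproduced above), supplied by the caller."""
        if condition_1:
            self.t_cq += ET_SIZE
        if b_cq < self.t_cq - ET_SIZE:
            self.t_cq = max(self.t_cq - ET_SIZE, 2 * ET_SIZE)
        return self.t_cq
```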
  • connection oriented scheduler approach is utilized in a preferred embodiment
  • connection means a switch path between an ingress transceiver and an egress
  • connection is established, at least one packet is streamed across the connection, and then the connection is removed.
  • Figure 22 shows a Scheduler Decision Time Line 1510 versus a Serial
  • the scheduler makes a first decision (Decision 1)
  • a long connection termination request is received at the scheduler, which effectively notifies the scheduler that the packet has been switched, and
  • the scheduler assigns a short connection for a fixed
  • connection is removed by the scheduler itself, and the allocated serial link
  • termination request is output by the corresponding ingress serial link flow control
  • transceiver will request for a connection excludes any potential tail of the short
  • Tail Max 19 bytes
  • the ingress transceiver will request a short connection for a
  • the packet tails ignored by the scheduler may accumulate at the ILQs on
  • the structure of the ILQs and the CQs allows for absorbing the packet collisions
  • an Egress Tick is output by the CQ where the packet overlapping is occurring.
  • the Egress Tick is provided to the scheduler by the transceiver of the destination
  • the Egress Tick causes the scheduler to force the corresponding egress
  • each Egress Tick causes an 80 byte idle time for the particular
  • Tick or a 120 byte Egress Tick could be utilized, for example.
  • the scheduler by it first being sent to an egress transceiver from the congested
  • Ingress Tick control signal
  • Ingress Tick flow control message forces the corresponding link idle for a period
  • Figure 20 shows an ingress
  • Tick is sent directly from a congested ILQ of an ingress transceiver to the
  • the left side portion 1710 of Figure 23 shows an example whereby worst-
  • the scheduler schedules a first and a second 99-byte packet from two
  • This ILQ delay is the worst case equivalent to a current ILQ fill
  • transceiver is less than the delay for the first and second 99-byte packets.
  • the CQ of the crosspoint unit receives six separate
  • the CQ is capable of storing each of these six separate 99-byte packets in a respective one of its six
  • number of lanes for the CQ is configurable for any given implementation.

Abstract

A network switching system includes transceiver devices respectively provided for a plurality of input line cards. The switching system also includes transceiver devices respectively provided for a plurality of output line cards. The switching system further includes a switch device communicatively coupled to each of the plurality of input line cards and the plurality of output line cards. The switch device includes a crosspoint matrix for communicatively connecting one of the input line cards to one of the output line cards. The switch device is capable of operating in either a crosspoint mode for routing cells or packets from one of the input line cards to one of the output line cards, or a scheduler mode for controlling flow of cells and/or packets through at least one other switch device.

Description

NETWORK INTERCONNECT CROSSPOINT SWITCHING ARCHITECTURE AND METHOD
BACKGROUND OF THE INVENTION
A. Field of the Invention
[0001] The invention relates generally to backplane interconnect switching
architectures, and more particularly it relates to an arbitrated crosspoint switching
architecture for use within a network switch.
B. Description of the Related Art
[0002] Conventional modular backplane interconnect architectures can be
classified into three separate groups: 1) shared memory, 2) crosspoint buffered,
and 3) arbitrated crosspoint.
[0003] In the shared memory architecture, a central switch device implements a
relatively large pool of memory (also referred to as a buffer or a memory buffer).
Any incoming variable size packet or cell is stored in this buffer until the variable
size packet or cell is read from the buffer and forwarded out of the switch device
(towards its destination). The buffer is shared by all input and output ports of the
central switch device, and the buffer allows for simultaneous reads and writes
from all ports. In a scenario where all input ports have simultaneously arriving
variable size packets and the memory buffer already holds variable size packets for all output ports, the memory buffer must provide a bandwidth equivalent to
the bandwidth of 2N system ports to support full traffic throughput on all ports,
where N is the number of ports (N input ports and N output ports equals 2N total
input and output ports) of the switch device. This means that memory bandwidth
limits the switch capacity per switch device for the shared memory architecture.
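As an illustration of this 2N scaling (using arbitrary example numbers, not figures from the specification), the aggregate memory bandwidth a shared memory switch device must sustain grows linearly with the port count:

```python
# Illustrative only: arbitrary port count and line rate, not values from the patent.
N = 32                      # number of input ports (and also output ports)
port_rate_gbps = 10         # line rate per port in Gb/s

# Worst case: all N inputs write while all N outputs read simultaneously,
# so the shared buffer must sustain 2N ports' worth of bandwidth.
memory_bandwidth_gbps = 2 * N * port_rate_gbps
print(memory_bandwidth_gbps)   # 640 Gb/s for this example
```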
[0004] The shared memory switch element is typically combined with additional
buffering, or queuing, on both ingress and egress line cards, since the amount of
buffering it is possible to implement on the central switch chip cannot meet the
overall buffer requirements of the system. The queues located on the ingress line
cards are typically called virtual output queues (VoQs), which eliminate head-of-
line blocking effects, and the amount of buffering required on the egress line
cards is influenced by the amount of speedup capacity available in the central
switch devices provided, e.g., by redundant switch cards. The memory
limitations of the central switch device typically introduce some extra internal
control overhead to the overall system.
[0005] The shared memory architecture has several advantages including that it
can switch both variable size packets and cells in their native format, thereby
avoiding having to segment and reassemble a variable size packet into a cell
format. It is also easy to ensure that no bandwidth is wasted regardless of the packet or cell size, which minimizes backplane over-speed requirements. Other
advantages include that Quality of Service (QoS) capabilities and low latency cut-
through switching can relatively easily be provided.
[0006] The second type of conventional modular backplane interconnect
architecture, the crosspoint buffer architecture, is very similar to the shared
memory architecture. However, instead of a single shared memory, the switch
device of the crosspoint buffer architecture implements a matrix of buffers with
one buffer per input/output combination. This reduces the bandwidth
requirements per individual buffer to the equivalent of only two (2) system ports
(one input port and one output port), as compared to 2N system ports for the
shared memory architecture, which means that memory bandwidth for the
crosspoint buffer architecture is less of a constraint as compared to the shared
memory architecture.
[0007] However, a drawback of the crosspoint buffer architecture is that the
number of individual crosspoint buffers is proportional to N², where N is the
number of ports (e.g., N input ports and N output ports) in the switch. Since it
is difficult to statistically share memory between the individual crosspoint
buffers, the total memory requirements of the crosspoint buffer architecture
exceed that of the shared memory architecture, and the amount of memory and number of memory building blocks per switch device therefore limit the switch
capacity per switch device.
[0008] Practical implementations also include hybrid shared and crosspoint
buffered architectures. In these hybrid architectures, a buffer is shared among a
number of crosspoint nodes to achieve the optimal tradeoff between memory
bandwidth and memory size requirements from a die size and performance
perspective.
[0009] For the shared and crosspoint buffered architectures, capacity is scaled
using multiple switch devices in parallel. This can be done using byte slicing or
by using a flow control scheme between the switch devices to avoid cell or packet
re-ordering problems at the output ports. In general, both of these schemes are
difficult to scale due to timing constraints and due to the complexity of flow
control algorithms.
[0010] The crosspoint buffer architecture has similar advantages to those discussed
above with respect to the shared memory architecture.
[0011] The third type of conventional modular backplane architecture, the
arbitrated crosspoint architecture, is based on a crosspoint switch device that
provides connectivity between any input to any output, but that does not provide
any buffering for traffic as it passes through the crosspoint switch device. The buffering is located on the ingress and egress line cards. The queues located on
the ingress line cards are virtual output queues (VoQs), which eliminate head-of-
line blocking effects, and the amount of buffering required on the egress line
cards is influenced by the amount of speedup capacity available to the central
switch device provided, e.g., by redundant switch cards.
[0012] The crosspoint portion of the switch device is managed by a scheduler
function, which differentiates arbitrated crosspoint architectures from other
techniques. The scheduler can either be implemented in a stand-alone device, or
integrated into the crosspoint device itself. The latter approach improves the
redundancy features and eliminates the need for dedicated communication links
between the scheduler and crosspoint switch devices. The scheduler function
performs the arbitration process of assigning an input-to-output pair in the
crosspoint device. The arbitration decision is updated on a regular time slotted
basis, which corresponds to the transmission period of a single fixed sized cell
unit of data. The performance, capacity, and QoS support of the arbitrated
crosspoint architecture is dependent on the type of scheduling algorithm
employed.
[0013] The scheduler function can either be performed by a single scheduler unit
(centralized) or by multiple schedulers operating in parallel (distributed). The centralized scheduler configures all crosspoints in the system with the result of its
crosspoint arbitration decision. A distributed scheduler typically only configures
a single crosspoint device with the result of its crosspoint arbitration decision. A
centralized scheduler approach typically offers better control of basic QoS
features but can be difficult to scale in capacity and number of ports. A
distributed scheduler approach can offer improved scalability since the scheduling
decisions have reduced timing constraints relative to the centralized scheduler
approach, but it introduces synchronization issues among the parallel schedulers.
[0014] Compared with the shared memory architecture and the crosspoint buffered
architecture, the arbitrated crosspoint architecture scales better since the
crosspoint device does not use an integrated memory, and can relatively easily be
scaled in capacity by adding crosspoints in parallel.
[0015] The configuration of the crosspoint in the arbitrated crosspoint architecture
is locked to the arbitration process, such that the arbitration process and the
crosspoint configuration is performed on a per timeslot basis. This introduces a
cell-tax when the switched data units do not occupy a full timeslot switching
bandwidth across a given crosspoint. The majority of scheduler algorithms are a
replication or a derivation of the widely adopted iSLIP algorithm developed by
Nick McKeown of Stanford University, whereby this algorithm uses an iterative approach to find a good match between inputs and outputs in a timeslot-based
arbitration process. See, for example, U.S. Patent Number 6,515,991 , of which
Nick McKeown is the listed inventor.
[0016] As described above, each of the conventional backplane interconnect
architectures has disadvantages. Accordingly, it is desirable to improve the
backplane interconnect architecture and to reduce or eliminate the disadvantages
of conventional backplane interconnect architectures.
SUMMARY OF THE INVENTION
[0017] According to one aspect of the invention, there is provided a network
switching device, which includes a plurality of transceiver devices respectively
provided for a plurality of input line cards. The device also includes a plurality
of transceiver devices respectively provided for a plurality of output line cards.
The device further includes a switch device communicatively coupled to each of
the plurality of input line cards and the plurality of output line cards, the switch
device including a crosspoint for communicatively connecting one of the input
line cards to one of the output line cards. The switch device is capable of
operating in either a crosspoint mode for routing cells or variable size packets
from one of the input line cards to one of the output line cards, or for operating in a scheduler mode for controlling flow of cells and/or variable size packets
through at least one other switch device.
[0018] According to another aspect of the invention, there is provided a network
switching device, which includes a plurality of first transceiver devices
respectively provided for a plurality of input line cards. The device also includes
a plurality of second transceiver devices respectively provided for a plurality of
output line cards. The device further includes a switch device communicatively
coupled to each of the plurality of input line cards and the plurality of output line
cards. The switch device includes a crosspoint matrix for communicatively
connecting one of the input line cards to one of the output line cards. The switch
device also includes at least one output queue for temporarily storing at least one
partial cell or variable size packet output from the crosspoint matrix prior to the
at least one cell or variable size packet being sent to one of the output line cards.
The switch device outputs the at least one cell or variable size packet based on
scheduling commands provided by way of a scheduling unit.
[0019] According to yet another aspect of the invention, there is provided a
method of switching variable size packets or cells via a switch device, which
includes receiving a variable size packet or cell, as received information, from
one of a plurality of input line cards communicatively connected to the switch device. The method also includes temporarily storing the received information in
an ingress queue of the ingress transceiver, as first stored information. Based on
a command provided by a scheduler unit, the method includes outputting the first
stored information, as output information, from the ingress queue to a crosspoint
unit. The method further includes temporarily storing, if is determined to be
required by the switch unit, the output information in an egress queue of the
switch device, as second stored information. The method also includes
outputting the second stored information, to one of a plurality of output line cards
communicatively connected to the switch device.
BRIEF DESCRIPTION OF THE DRAWINGS
[0020] The foregoing advantages and features of the invention will become
apparent upon reference to the following detailed description and the
accompanying drawings, of which:
[0021] Figure 1 shows a switch device configured as a number of independent
virtual switch devices, according to at least one aspect of the invention;
[0022] Figure 2 shows an interconnect topology for a system having parallel
physical switch devices, in accordance with at least one aspect of the invention; [0023] Figure 3 shows an interconnect topology for a system having parallel
virtual switch devices that correspond to a single physical switch device, in
accordance with at least one aspect of the invention;
[0024] Figure 4 is a functional block diagram of a switch device according to at
least one aspect of the invention that is operating in the crosspoint mode;
[0025] Figure 5 is a functional block diagram of a switch device according to at
least one aspect of the invention that is operating in the scheduler mode;
[0026] Figure 6 shows a system level overview of a queuing system according to
at least one aspect of the invention;
[0027] Figure 7 shows a queue structure of an ingress transceiver according to at
least one aspect of the invention;
[0028] Figure 8 shows an example of transceiver packet collapsing and scheduler
request collapsing according to an aspect of the invention;
[0029] Figure 9 shows a structure of an egress transceiver according to at least
one aspect of the invention;
[0030] Figure 10 shows a structure of a crosspoint queue according to at least one
aspect of the invention;
[0031] Figure 11 shows an example of a request and acknowledge handshake
according to an aspect of the invention; [0032] Figure 12 shows a scheduler's request FIFO structure according to an
aspect of the invention;
[0033] Figure 13 shows a scheduler's multicast request FIFO and multicast
request queue structure according to at least one aspect of the invention;
[0034] Figure 14 shows the relationship between the scheduler's serial link status,
scheduler algorithm, and ingress and egress serial link flow control, according to
at least one aspect of the invention;
[0035] Figure 15 shows an overview of the scheduler's egress serial link
allocation status, according at least one aspect of the invention;
[0036] Figure 16 shows how the scheduler's Egress Serial Link Allocation Status
Timers operate, according to at least one aspect of the invention;
[0037] Figure 17 shows an overview of the scheduler's ingress serial link
allocation status, according to at least one aspect of the invention;
[0038] Figure 18 shows how the scheduler's Ingress Serial Link Allocation Status
Timers operate, according to at least one aspect of the invention;
[0039] Figure 19 shows an overview of the scheduler's long connection monitors,
according to at least one aspect of the invention;
[0040] Figure 20 shows the transmission path for ingress flow control signals,
according to at least one aspect of the invention; [0041] Figure 21 shows the transmission path for egress flow control signals,
according to at least one aspect of the invention;
[0042] Figure 22 shows a Scheduler Decision Time Line versus a Serial Link Path
Allocation Time Line, according to at least one aspect of the invention; and
[0043] Figure 23 shows a time line example of worst-case CQ contention,
according to at least one aspect of the invention.
DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS
[0044] Different embodiments of the present invention will be described below
with reference to the accompanying figures. Each of these different
embodiments is combinable with one or more other embodiments to produce
additional embodiments of the invention. An aspect of the present invention provides
a backplane interconnect architecture that has the attractive benefits of the shared
memory architecture, while greatly reducing concerns about scalability into the
terabit capacity range in standard single chassis form factors.
[0045] Briefly stated, at least one embodiment of the present invention is directed
to a crosspoint, which is managed by a scheduler that performs a crosspoint
arbitration process, similar to the conventional arbitrated crosspoint architecture.
In contrast to the conventional arbitrated crosspoint architecture, however, the
crosspoint according to at least one embodiment of the invention implements a buffer structure that traffic passes through. This buffer structure according to at
least one embodiment of the invention is different from a shared memory
architecture and a crosspoint buffer architecture in structure and in purpose, and
is constructed such that it does not limit scalability due to memory constraints.
Accordingly, the embodiments of the present invention are much different from any
of the conventional backplane interconnect architectures.
[0046] According to at least one embodiment of the present invention, a scheduler
arbitration process is performed on variable size packet units, and combining the
scheduler arbitration process with the buffer structure of the crosspoint ensures
that bandwidth is not wasted when switching variable size packets across the
crosspoints regardless of the packet size. This is different than a conventional
arbitrated crosspoint architecture where a cell-tax is introduced due to a timeslot-
based scheduler operation.
[0047] According to at least one embodiment of the present invention, an
interconnect architecture provides for back-to-back switching of variable size
packets across crosspoints, using a central scheduler to distribute the traffic
across the parallel crosspoints. The combination of the central scheduler, buffers
per ingress serial link in the ingress transceiver and buffers per egress serial link
in the crosspoint to absorb overlapping packet tails, and a flow control mechanism to synchronize the central scheduler and these buffers, effectively
provides efficient byte-level scheduling and back-to-back switching of variable
size packets across parallel crosspoints.
[0048] Since the crosspoint buffers used in accordance with at least one
embodiment of the invention are relatively much smaller than the conventional
shared memory type full-length packet buffers, the crosspoint latency variation
(e.g. , due to the size of the buffer) is comparably negligible and the amount of
packet sequence re-ordering memory that is needed to compensate for the load
balancing is therefore also very low. Also, synchronous operation of the
crosspoints and the scheduler is not required, since the crosspoint buffers can
absorb non-synchronous timing, and since the crosspoints themselves route the
variable size packets from inputs to outputs of the crosspoint.
[oo49] According to at least one embodiment of the invention, variable size
packets can be distributed and switched back-to-back across parallel crosspoints,
requiring only very small amounts of packet buffering per crosspoint egress serial
link and no over-speed requirements to compensate for cell-tax or cell-timeslot
alignment (e.g. , the packet tail problem).
[0050] The central scheduling algorithm and tight timing synchronization
requirements are traditionally the primary bottlenecks for achieving fully efficient utilization of the serial links in an arbitrated crosspoint switching system. The
complexity of even the simplest scheduling algorithms cannot be executed at a
speed equivalent to the transmission of a byte unit and the scheduler-to-crosspoint
synchronization cannot be performed with byte level granularity. The scheduling
and scheduler-to-crosspoint timeslot synchronization is therefore typically
performed at intervals corresponding to one cell unit of data (e.g. , one cell equals
40 bytes or 80 bytes), which leads to inefficient utilization of switching capacity
when a variable size packet does not equal an integer number of cell units.
[0051] According to at least one embodiment of the invention, a central scheduler
schedules a variable size packet for an ingress and an egress serial link while
ignoring the variable size packet tail when estimating the size of the variable size
packet to be scheduled. When variable size packets have a tail, i.e., where the
size of the variable size packet is not an integer multiple of a fixed size unit (e.g. ,
a cell unit of data), variable size packets may collide when they arrive at ingress
or egress serial links. An embodiment of the present invention provides a
"collision" buffer per ingress and egress serial link, which absorbs (e.g. , is
capable of simultaneously holding some or all data from multiple packets output
by the crosspoint matrix) the collision without any loss of information, and
forwards the variable size packets back-to-back with byte-level granularity towards a destination. If data accumulates in the collision buffers due to variable
size packet tails not accounted for by the scheduler, a relatively simple flow
control mechanism is used to temporarily halt the central scheduling mechanism,
using "tick" flow control messages. This effectively provides scheduling with
byte level granularity, even though the scheduler only needs to perform
scheduling decisions per serial link on a standard cell-timeslot basis.
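By way of illustration only, the following Python sketch models how cell-granular scheduling decisions combined with a per-link collision buffer and tick flow control can approximate byte-level scheduling. The names CELL_BYTES and TICK_BYTES, and the bookkeeping model itself, are assumptions made for this example and are not taken from the embodiments described herein.

# Illustrative model only: the scheduler books whole cells and ignores packet
# tails; the ignored tails accumulate in a per-link collision buffer, and once a
# tick's worth of tail bytes has accumulated, a tick message halts the scheduler
# for an equivalent amount of link time.
CELL_BYTES = 64   # assumed scheduling granularity (one cell)
TICK_BYTES = 64   # assumed amount of link time skipped per tick message

class CollisionBufferModel:
    def __init__(self):
        self.tail_debt = 0      # tail bytes not accounted for by the scheduler
        self.halt_bytes = 0     # link bytes the scheduler must skip due to ticks

    def schedule_packet(self, packet_len):
        """Scheduler's view: allocate whole cells, ignore the tail."""
        booked = (packet_len // CELL_BYTES) * CELL_BYTES
        self.tail_debt += packet_len - booked
        return booked

    def maybe_send_ticks(self):
        """Convert accumulated tail bytes into tick messages for the scheduler."""
        ticks = 0
        while self.tail_debt >= TICK_BYTES:
            self.tail_debt -= TICK_BYTES
            self.halt_bytes += TICK_BYTES
            ticks += 1
        return ticks

Over time the amount of link time the scheduler is told to skip equals the amount of tail data it ignored, which is the essence of the byte-level efficiency described above.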
[0052] The collision buffers mitigate the tight clock level timing requirements
otherwise required between the scheduler operation and corresponding crosspoint
configuration.
[0053] Further, according to at least one embodiment, a scheduler is capable of
maintaining multiple outstanding requests per destination, whereby these requests
can be assigned to one or more of the parallel crosspoints in a load balancing
operation. The scheduler accordingly represents multiple outstanding requests in
a compact representation, which minimizes gate counts for a circuit implementing
the scheduler functions.
[0054] Still further, connection-oriented switching according to at least one
embodiment of the invention is capable of providing support for cut-through
switching. Also, the central scheduler and flow control operation according to at
least one embodiment ensures that the amount of congestion per crosspoint output has a small upper limit, and thus substantial buffering per crosspoint is not
required.
[0055] The lists of acronyms and definitions provided in Table 1 are helpful in
understanding the description of the various embodiments of the invention. This
list of acronyms and definitions provides a general description, and is not
intended to be limiting on the present invention, as these terms may have other
meanings as are known to those skilled in the art.
[0056] Table 1 - Glossary of Terms
Bit: A unit of information used by digital computers. Represents the smallest piece of addressable memory within a computer. A bit expresses the choice between two possibilities and is represented by a logical one (1) or zero (0).
Byte: An 8-bit unit of information. Each bit in a byte has the value 0 or 1.
Collapsed Packet: The resulting logical packet entity when the transceiver collapses packets or when the scheduler collapses packet requests.
Control Record: A link layer data unit, which is transmitted on control serial links.
Control Serial Link: A serial link connecting a transceiver to a scheduler.
Crosspoint: A virtual switch device operating in crosspoint mode.
Crosspoint Mode: A virtual switch device mode of operation; see Crosspoint.
Cut-Through Switching: A low-latency packet switching technique, where the packet switching can begin as soon as the packet header has arrived.
Data Serial Link: A serial link connecting a transceiver to a crosspoint.
Egress: The direction of the information flow from the virtual switch devices to the egress transceiver user interface via the egress serial links and egress transceiver.
Egress Port: A logical transmission channel, which carries data across the transceiver's egress user interface. See also Port.
Egress Serial Link: A simplex serial electrical connection providing transmission capacity between two devices.
Egress Tick: An egress serial link flow control command.
Egress Transceiver: The part of a physical transceiver device which handles the forwarding of data from the virtual switch devices to the egress user interfaces.
Egress User Interface: The part of a user interface which transfers packets in the egress direction. See also User Interface.
Head of Line (HOL): Refers to the element positioned as the next outgoing element in a FIFO mechanism.
Ingress: The direction of the information flow originating from the ingress user interfaces to the virtual switch devices via the ingress serial links and ingress transceiver.
Ingress Serial Link: A simplex serial electrical connection providing transmission capacity between two devices.
Ingress Tick: An ingress serial link flow control command.
Ingress Transceiver: The part of a physical transceiver device which handles the forwarding of data from the ingress user interfaces to the virtual switch devices.
Ingress User Interface: The part of a user interface which transfers packets in the ingress direction. See also User Interface.
Long Connection: A switch connection established to accommodate the switching of a long packet.
Long Packet: A packet of size LongMin bytes or more.
Long Release: The ingress serial link flow control mechanism that can generate Long Release commands.
Packet: One variable sized unit of binary data, which is switched transparently across the system.
Packets Per Second (pps): Applied to packet streams, this is the number of transmitted packets on a data stream per second. Prefix M or G (Mega or Giga).
Packet Tail: The remaining packet size when excluding the maximum number of fixed size N-byte units that the packet can be divided into.
Parts per Million (ppm): Applied to frequency, this is the difference, in millionths of a Hertz, between some stated ideal frequency and the measured long-term average of a frequency.
Physical Switch Device: A physical device which can be logically partitioned into multiple virtual switch devices. The physical switch device is typically located on the switch cards in a modular system.
Physical Transceiver Device: A physical device which is divided into two logical components: an ingress and an egress transceiver. The transceiver device is typically located on the line cards in a modular system.
Port: A logical definition of user traffic streams, which enter and exit the switching system via the physical transceiver device. A port can provide QoS services and flow control mechanisms for the user.
Quality of Service (QoS) Class: A QoS class of service definition, which may include strict priority, weighted allocations, guaranteed bandwidth allocation service schemes, as well as any other bandwidth allocation service scheme.
Quality of Service (QoS) Priority: A strict priority level associated with each QoS class definition.
Scheduler: A virtual switch device operating in scheduler mode.
Scheduler Mode: A virtual switch device mode of operation; see Scheduler.
Serial Link: A serial electrical connection providing transmission capacity between a transceiver and a virtual switch device, which includes an ingress and an egress serial link.
Short Connection: A switch connection established to accommodate the switching of a short packet.
Short Packet: A packet of size (LongMin - 1) bytes or less.
Store-and-Forward Switching: A packet switching technique, where the packet forwarding (switching) can begin once the entire packet has arrived and has been buffered.
User Interface: A physical interface on the physical transceiver device, across which packets can be transferred in both the ingress and egress directions. See also Ingress User Interface and Egress User Interface.
Virtual Output Queue (VoQ): A data queue representing an egress port, but which is maintained on the ingress side of the switching system.
Virtual Switch Device: A logical switch element, which can operate, for example, either as a crosspoint or as a scheduler. See also Physical Switch Device.
Word: An N-bit (e.g., 16 bit, 32 bit, etc.) unit of information. Each bit in a word has the value 0 or 1.
[0057] The parameter definitions included in Table 2 are used in the description of
the various embodiments of the invention provided herein.
[0058] Table 2 - Parameter Definitions
[0059] The switch interconnect architecture according to a first embodiment of the
invention provides backplane connectivity and switching functionality between
line cards and switch cards that may be disposed, for example, in a modular
slotted chassis. By way of example and not by way of limitation, these cards are
connected via serial link electrical traces running across a passive
chassis printed circuit board (PCB) backplane.
[0060] According to at least one embodiment of the invention, switch cards are
responsible for the switching and exchange of data between the line cards. The
line cards are connected with high-speed serial links to the switch cards. Each
switch card implements one or more switch devices to perform the actual
switching function.
[0061] The number of switch cards found in the system according to the first
embodiment depends on the type of system redundancy utilized, if any. In a
system without any redundancy of the switching function, typically only one
switch card is provided. With a 1:1, N+1, or N+M redundancy scheme, two or
more switch cards are provided. [0062] The line cards utilized in a preferred construction of the first embodiment
implement one or more optical and/or electrical-based user chassis I/O interfaces
that the user will typically connect to via the front panel of the chassis.
[0063] The backplane interconnect system that is utilized in a preferred
construction of the first embodiment includes two different devices: a physical
transceiver device and a physical switch device. One or more physical
transceiver devices are implemented per line card. One or more physical switch
devices are implemented per switch card.
[0064] An explanation will now be provided with regard to how these two devices
are interconnected to provide a complete backplane interconnect system
according to the first embodiment.
[0065] A switch device can be configured as a number of virtual switch devices,
which may be independent from one another. For example, a single physical
switch device is partitioned into Vx virtual switch devices, with Lvx ingress and
egress serial links per virtual switch device. Such a switch device 100 is
illustrated in Figure 1, which includes Virtual Switch Devices 0, 1, . . . , Vx - 1.
[0066] At least one, or preferably each, physical switch device in the backplane
interconnect system according to the first embodiment is partitioned into the same
number (Vx, which is an integer) of virtual switch devices 0 through Vx - 1. Each virtual switch device is connected to Lvx ingress serial links and Lvx egress serial
links, whereby Lvx is an integer greater than or equal to one.
[0067] Each virtual switch device utilized in the first embodiment can operate in
one of the following two modes of operation: a) Crosspoint mode, which
switches packets from ingress to egress serial links; or b) Scheduler mode, which
generates scheduling decisions for other virtual switch devices operating in the
crosspoint mode.
[0068] As an example, one of the virtual switch devices implemented in the system
operates as a scheduler in the scheduler mode, while all of the other virtual
switch devices in the system operate as crosspoints in the crosspoint mode.
Other configurations are possible while remaining within the scope of the
invention.
[0069] For larger systems, a physical switch device may correspond to a single
virtual switch device, which operates in either scheduler mode or in crosspoint
mode. For smaller systems, a physical switch device can be partitioned into
multiple virtual switch devices.
[0070] Figure 2 shows the serial link connectivity, i.e. , the interconnect topology,
for an example system 200 constructed in accordance with the first embodiment.
The system 200 has four parallel physical switch devices 0, 1, 2, 3, each physical switch device implemented with one virtual switch device. Note the
corresponding numbering of the serial links on the transceivers 0 through LT - 1, and
the parallel physical switch devices 0, 1, 2, 3. In this implementation, one
transceiver (#0, #1, . . . , or #LT - 1) is capable of outputting packets to each of
the four physical switch devices 0, 1, 2, 3, under control of a scheduler.
[0071] Figure 3 shows the serial link connectivity for an example system
constructed in accordance with the first embodiment having four parallel virtual
switch devices, whereby the four parallel virtual switch devices correspond to a
single physical switch device 300. Note the corresponding numbering of the serial links on the transceivers 0 through LT - 1, and the numbering of the parallel
switch devices 0, 1, 2, 3.
[0072] A transceiver connects one serial link to one virtual switch device. Not all of the transceivers are required to connect to the same number of virtual switch devices. Transceivers supporting less traffic throughput relative to other transceivers in the system are not required to be connected to all virtual switch
devices.
[0073] In a preferred construction of the first embodiment, the serial links on a
transceiver operate with the same serial link speed and physical layer encoding overhead, but multiple physical layer encoding schemes can also be
implemented for the same transceiver.
[0074] In a preferred construction of the first embodiment, a reference clock per
device is externally provided. There is no requirement for distribution of a
common or partially shared reference clock in the system and method according to the first embodiment, since each transceiver or switch device can operate from its own local reference clock.
[0075] Figure 4 is a functional block diagram of a virtual switch device 400 that
may be utilized in a system according to a preferred construction of the first
embodiment, where the switch device operates in the crosspoint mode (herein
also referred to as a crosspoint device). The switch device is connected to N
serial inputs (also referred to as ingress serial links) and N serial outputs (also
referred to as egress serial links). The switch device includes a crosspoint matrix
410, and N crosspoint queues (CQs) 420 on the output side of the crosspoint
matrix (the CQs are also referred to herein as "output queues" , whereby the CQs
may be implemented as a memory storage device that stores packets and that
outputs the stored packets according to a particular scheme). The crosspoint
matrix 410 provides connectivity between all inputs and outputs, as shown in
Figure 4. An input driver is used to forward packets from each ingress serial link to a corresponding input port of the crosspoint matrix, in a manner known to
those skilled in the art. An output driver is used to provide packets from each
crosspoint queue to a corresponding egress serial link, in a manner known to
those skilled in the art.
[0076] Figure 5 is a functional block diagram of a virtual switch device 500 that
may be utilized in the system according to a preferred construction of the first
embodiment, where the virtual switch device operates in the scheduler mode. It
includes N Control Record Termination units 510 on the input side of the
crosspoint matrix 410, and N Control Record Generation units 540 on the output
side of the crosspoint matrix 410 (whereby N is an integer representing the
number of inputs and outputs of the crosspoint matrix 410). The switch device
500 also includes for each of the N inputs, a number of request FIFOs 560 equal
to the number of VoQs in the corresponding ingress transceiver, N acknowledge
FIFOs 570, and a Scheduler Unit 580. The Scheduler Unit 580 performs
scheduling for the crosspoint units, such as by utilizing a variation of the iterative
scheduling algorithm of N. McKeown discussed earlier.
[0077] The backplane interconnect system according to a preferred construction of
the first embodiment supports two modes of switching operation at the system
level, which are:
a) Ingress port to egress port store-and-forward switch mode
b) Ingress port to egress port cut-through switch mode
[0078] Regardless of the selected switch mode, the switch architecture internally is
capable of performing cut-through switching along a path from the ingress
transceiver to the egress transceiver for both unicast and multicast packets. This
way, packets can be transferred as quickly as possible across the crosspoints.
[0079] The queuing structures implemented in an ingress transceiver, an egress
physical transceiver, and a scheduler, in a preferred construction of the first
embodiment, are described in detail below.
[0080] The ingress transceiver implements the following queues for packet
buffering and transmission:
VoQ (Virtual Output Queue)
ILQ (Ingress Serial Link Queue)
[0081] The egress transceiver implements the following queues for packet buffering and transmission:
ERQ (Egress Reordering Queue)
EPQ (Egress Port Queues)
[0082] The crosspoint implements the following queues for packet switching and transmission:
CQ (Crosspoint Queue)
[0083] A system level overview of the queuing system according to a preferred
implementation of the first embodiment is shown in Figure 6. In Figure 6, crosspoints 620A, 620B, . . . , 620n-1 and scheduler 620n are part of a single
physical switch device (or corresponding parts of several physical switch
devices). An incoming packet arrives at an ingress port and is directly coupled to
an ingress transceiver. The incoming packet is received at an ingress transceiver
610A, . . . , or 610n and is stored in a VoQ, where the packet awaits transfer to a
crosspoint 620A, . . . , or 620n-1 under command of the scheduler 620n. Based
on a scheduling decision made by the scheduler 620n, a packet is output from the
VoQ to an ILQ (also referred to herein as "input FIFO"), and it is then provided
to a corresponding crosspoint 620A, . . ., or 620n-1, to be then switched self-
routing through the crosspoint matrix of the crosspoint and to be then stored in a
corresponding CQ at the output side of the crosspoint. The packet is then output
from the CQ to a designated egress transceiver 630A, . . . , or 630n. Each egress
transceiver 630A, . . ., 630n comprises an EPQ, whereby a packet is then output
from a corresponding egress transceiver to an egress port directly coupled to the
egress transceiver, and the packet is sent on its way to its eventual destination.
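The end-to-end queue traversal described above may be summarized in the following Python sketch. The class and method names (IngressTransceiver, Crosspoint, EgressTransceiver, and so on) are illustrative assumptions; only the queue roles (VoQ, ILQ, CQ, EPQ) come from the description.

from collections import deque

class IngressTransceiver:
    def __init__(self, num_links):
        self.voq = {}                                   # one VoQ per <egress port, QoS class>
        self.ilq = [deque() for _ in range(num_links)]  # one ILQ per ingress serial link

    def enqueue(self, packet, dest_port, qos):
        self.voq.setdefault((dest_port, qos), deque()).append(packet)

    def on_acknowledge(self, dest_port, qos, link):
        """Scheduler acknowledge moves the head-of-line packet from a VoQ to an ILQ."""
        packet = self.voq[(dest_port, qos)].popleft()
        self.ilq[link].append(packet)
        return packet

class Crosspoint:
    def __init__(self, num_links):
        self.cq = [deque() for _ in range(num_links)]   # one CQ per egress serial link

    def switch(self, packet, egress_link):
        """Self-routing through the matrix: the packet lands in the CQ of its egress link."""
        self.cq[egress_link].append(packet)

class EgressTransceiver:
    def __init__(self):
        self.epq = {}                                   # EPQs per <port, QoS class>

    def receive(self, packet, port, qos):
        self.epq.setdefault((port, qos), deque()).append(packet)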
[0084] The queue structure of an ingress transceiver 610 is shown in Figure 7.
When a packet arrives at an ingress transceiver port 710, it is assigned a buffer
memory location and written into a buffer memory (not shown), and a pointer to
that buffer location is passed to a VoQ 740, and eventually to an ILQ 750 by exchanging packet pointers. Once the packet gets forwarded across the ingress
serial link, the packet is removed from the buffer memory.
[0085] A unicast packet entering the ingress transceiver is written into a buffer
memory. Preferably, as quickly as possible, it is determined whether the packet
is long or short (as defined by LongMin), and the packet is then qualified for
forwarding to the VoQ corresponding to its destination port and associated QoS
class as defined by any scheduler algorithms implemented in the central scheduler
and transceiver units.
[0086] For packets in a VoQ, the ingress transceiver forwards a request to the scheduler, when the following conditions are met:
1. The current number of pending requests for the VoQ is less than VoQReqs. The value of VoQReqs is defined depending on the required latency and jitter packet switching performance for the individual VoQs.
2. The size of the packets for which a corresponding request is currently pending is within a defined range. The value of this range is defined depending on the required latency and jitter packet switching performance for the individual VoQs. [0087] The forwarding of the packets from the VoQs to the ILQs is controlled by
the scheduler via the acknowledge messages transmitted from the scheduler to the
VoQ of the ingress serial link in response to the request. Figure 11 shows an
example of a Request and Acknowledge handshake. In Figure 11, a new packet
enters the VoQ, and a new request is forwarded to the scheduler for that new packet. Also, a head-of-line packet unit is forwarded to the ILQ when an
acknowledge arrives from the scheduler at the VoQ.
[0088] In a preferred construction of the first embodiment, unicast packets are
forwarded one-by-one from the unicast Virtual Output Queues (VoQs) to the
ILQs corresponding to the order that the unicast acknowledge messages are
received from the scheduler.
[0089] On the other side of the crosspoint unit, the egress transceivers perform re-ordering of the packets, if necessary, to ensure that the original packet sequence is maintained on a per <port, QoS class> basis. The egress transceiver re-ordering function is based on a sequence ID, which is carried by every packet as
it is switched across the crosspoints. The ingress transceiver devices maintain
and "stamp" the sequence IDs into the packets. The sequence IDs preferably
only need to be maintained on a per transceiver basis (not per port and/or QoS
class).
[0090] Preferably, a sequence ID counter is incremented each time a packet is
transferred from its associated VoQ to an ILQ, and the sequence IDs therefore
completely correspond to the packet sequence dictated by the flow of
acknowledge messages from the scheduler to the ingress transceiver. In a preferred construction of the first embodiment, a packet is sequence ID
"stamped" when it leaves the VoQ, and not when it leaves the ILQ.
[0091] When multiple packets have been collapsed into a single packet by the
ingress transceiver or the scheduler, only the resulting collapsed packet receives a
sequence ID "stamp". A sequence ID counter is incremented and stamped into
the collapsed packet when the corresponding acknowledge arrives at the ingress
transceiver, and the collapsed packet is transferred from its associated VoQ to an
ILQ. The sequence IDs of such collapsed packets therefore completely
correspond to the packet sequence dictated by the flow of acknowledges from the
scheduler to the ingress transceiver.
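A sequence-ID based re-ordering of the kind described above can be sketched as follows; the counter width SEQ_BITS and the dictionary-based bookkeeping are assumptions made for this example.

SEQ_BITS = 16                      # assumed counter width, not from the specification

class IngressStamper:
    """Stamps a per-ingress-transceiver sequence ID when a packet leaves its VoQ."""
    def __init__(self):
        self.next_seq = 0

    def stamp(self, packet):
        packet["seq"] = self.next_seq
        self.next_seq = (self.next_seq + 1) % (1 << SEQ_BITS)
        return packet

class EgressReorderer:
    """Releases packets from one ingress transceiver in stamped order."""
    def __init__(self):
        self.expected = 0
        self.pending = {}          # seq -> packet, for packets that arrived early

    def receive(self, packet):
        self.pending[packet["seq"]] = packet
        released = []
        while self.expected in self.pending:
            released.append(self.pending.pop(self.expected))
            self.expected = (self.expected + 1) % (1 << SEQ_BITS)
        return released            # in-order packets ready for the EPQs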
[0092] An Ingress Serial Link queue (ILQ) is implemented for each ingress serial
link. Each ILQ can accept packets at the performance dictated by the scheduler
generating the acknowledge messages, and is capable of mapping packets back-
to-back onto the corresponding ingress serial link in FIFO order regardless of
packet size, thereby fully utilizing the bandwidth on its corresponding ingress
serial link.
[0093] The size of the ILQ is preferably determined based on an analysis of worst
case ILQ contention. In any event, the size of the ILQ is preferably very small. [0094] A description will now be made of a transceiver packet collapsing function
that may be utilized in the first embodiment. The scheduler operation ensures
efficient back-to-back switching of packets, for packets that are at least
RequestMin bytes long when transmitted on a serial link. To also support effective
switching of packets smaller than RequestMin bytes, the transceiver is capable of
collapsing these packets into larger packets.
[0095] The transceiver unicast packet collapse function is performed per VoQ.
When a unicast packet is inserted into a VoQ, it is collapsed with the last packet
in the VoQ if the following conditions are met:
a) The incoming packet is smaller than RequestMin bytes, or the size of the current last packet in the VoQ is smaller than RequestMin bytes, where the last packet in the VoQ may itself include packets that have previously been collapsed.
b) The current last packet in the VoQ includes no more than TCol - 1 collapsed packets, where TCol is an integer value greater than one.
c) A Request for the current last packet in the VoQ has not yet been generated and forwarded to the scheduler.
[0096] The transceiver multicast packet collapse function is performed per VoQ.
When a multicast packet is inserted into a VoQ, it is collapsed with the last
packet in the VoQ if the following conditions are met:
a) The incoming packet's egress transceiver destination fanout is identical to the egress transceiver fanout of the current last packet in the VoQ.
b) The incoming packet's size is smaller than RequestMin bytes, or the size of the current last packet in the VoQ is smaller than RequestMin bytes, where the last packet in the VoQ may itself include packets which have previously been collapsed.
c) The current last packet in the VoQ includes no more than TCol - 1 collapsed packets.
d) A Request for the current last packet in the VoQ has not yet been generated and forwarded to the scheduler.
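The unicast and multicast collapse tests listed above can be expressed compactly as follows; REQUEST_MIN and T_COL stand in for the RequestMin and TCol parameters, and their values, like the packet dictionary layout, are assumptions made for this example.

REQUEST_MIN = 64    # assumed value for RequestMin (bytes)
T_COL = 4           # assumed value for TCol

def can_collapse_unicast(incoming, voq):
    if not voq:
        return False
    last = voq[-1]
    return ((incoming["size"] < REQUEST_MIN or last["size"] < REQUEST_MIN)
            and last["collapsed_count"] <= T_COL - 1     # at most TCol - 1 already collapsed
            and not last["request_sent"])                # no request forwarded yet

def can_collapse_multicast(incoming, voq):
    if not voq:
        return False
    last = voq[-1]
    return (incoming["fanout"] == last["fanout"]          # identical egress transceiver fanout
            and (incoming["size"] < REQUEST_MIN or last["size"] < REQUEST_MIN)
            and last["collapsed_count"] <= T_COL - 1
            and not last["request_sent"])

def collapse(incoming, voq):
    """Merge the incoming packet into the current last packet of the VoQ."""
    last = voq[-1]
    last["size"] += incoming["size"]
    last["collapsed_count"] += 1
    last.setdefault("members", []).append(incoming)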
[0097] An example of VoQ packet collapsing is shown in Figure 8, where the
scheduler collapses requests 4, 5, and 6 into a single equivalent packet unit in the
ILQ and where the scheduler collapses requests 1 and 2 into a single equivalent
packet unit in the ILQ. Prior to the collapsing by the scheduler, an ingress
transceiver has collapsed 2 VoQ packets as shown by label "A", and collapsed 3
VoQ packets as shown by label "B". Only a single request is sent to the
scheduler for the "A" collapsed packets, and only a single request is provided
for the "B" collapsed packets.
[0098] The logical structure of the queues of an egress transceiver according to a
preferred construction of the first embodiment is shown in Figure 9. [0099] When a unicast or multicast packet arrives at the egress transceiver 630, it
is assigned a buffer memory location, and written into buffer memory (not
shown), with a pointer to that buffer memory location being passed to an ERQ
810. Then, the packet is moved to an EPQ 830 by exchanging packet pointers
between these queues.
[00100] Since packets can experience different delays as they are switched across parallel crosspoints, for example due to experiencing different ILQ and CQ fill levels, the egress transceiver may need to perform packet sequence re-ordering to ensure that the original packet sequence is maintained on a per <port, QoS class> basis between the system's ingress and egress ports.
[00101] Since the ILQs and CQs are sized so as to be shallow queues in a preferred
construction of this embodiment, the delay variation across the crosspoints is
small. The re-ordering can therefore be performed per <ingress transceiver>, as opposed to per <ingress transceiver port, QoS class>, since the latency contributions from the ERQ 810 are insignificant from a QoS class and port
fairness perspective.
[00102] The packet buffer capacity of the CQ is preferably determined based on
worst case CQ contention scenarios, though, in any event, the size of the CQ is
relatively small compared to queue structures found in conventional shared memory architecture type switch devices (e.g. , typically in the range from 100
bytes to 1400 bytes).
[00103] The egress transceiver maintains an Egress Port Queue (EPQ) 830 per QoS
class or group of classes per transceiver port for both unicast and multicast
traffic.
[00104] When the transceiver is operating in store-and-forward switch mode, the
packet is qualified for readout and final forwarding across the egress port once
the entire packet is stored in the egress memory buffer. When the transceiver is
operating in cut-through switch mode, the packet is qualified for final forwarding
across the egress port as described previously.
[00105] Preferably, one CQ is implemented per egress serial link in the crosspoint.
A CQ buffers unicast and multicast packets, and can accept up to CQL packets
arriving overlapping each other from any of the crosspoint inputs, where CQL =
6 in the preferred construction regardless of the number of crosspoint inputs. Of
course, other values for CQL may be contemplated while remaining within the
scope of the invention. The packets are forwarded out of the CQ towards its
corresponding egress serial link one at a time. Figure 10 shows one possible
structure of a CQ 900, where CQL = 5. [00106] The lane buffer structure 920 of the CQ 900 is preferably implemented
with dual ported single clocked memories where the data words are structured as
CQL parallel lanes with a write cycle disable input per lane, and where each data
word can be marked as empty per lane.
[00107] When a packet arrives at the CQ 900 it is assigned an empty lane, if one is
available. If the assigned lane runs full, the packet is assigned an additional
empty lane, and so forth until the entire packet is written in the CQ 900. A
single packet can therefore occupy all CQL lanes, but a single lane never holds more than one packet. Each lane supports a throughput equivalent to the
maximum serial link speed and holds up to CQLS bytes.
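The lane assignment rule of the CQ can be sketched as follows; the values chosen for CQL and for the per-lane capacity (CQLS), like the Python data structure itself, are assumptions made for this example.

# Sketch of the lane assignment rule: a packet takes an empty lane and spills into
# further empty lanes if it outgrows one; a lane never holds more than one packet.
CQL = 6                 # number of lanes per CQ (per the preferred construction)
CQ_LANE_BYTES = 256     # assumed CQLS: capacity of one lane in bytes

class CrosspointQueue:
    def __init__(self):
        self.lanes = [None] * CQL          # None = empty, else the id of the occupying packet
        self.fill = [0] * CQL              # bytes written into each lane

    def write(self, packet_id, num_bytes):
        """Write packet data, claiming additional empty lanes as earlier ones fill up."""
        remaining = num_bytes
        while remaining > 0:
            lane = self._lane_for(packet_id)
            if lane is None:
                raise RuntimeError("CQ overflow: no empty lane available")
            chunk = min(remaining, CQ_LANE_BYTES - self.fill[lane])
            self.fill[lane] += chunk
            remaining -= chunk

    def _lane_for(self, packet_id):
        # Reuse a lane already owned by this packet if it still has room.
        for i, owner in enumerate(self.lanes):
            if owner == packet_id and self.fill[i] < CQ_LANE_BYTES:
                return i
        # Otherwise claim an empty lane.
        for i, owner in enumerate(self.lanes):
            if owner is None:
                self.lanes[i] = packet_id
                return i
        return None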
[00108] To prevent CQ overflow during normal operation, the following very
conservative CQ size (CQLS) dimensioning guideline may be adopted. For
example, dimensioning of each lane is made so as to ensure that a short packet
never occupies more than a single CQ lane. This means that CQLS is defined as
the maximum size short packet including packet tails, which is a function of
LongMin. This dimensioning guideline provides more buffer space than what is required based on worst case CQ contention scenarios, which may be
acceptable depending on the specific implementation. [00109] In a preferred implementation, a lane buffer write pointer 940 is
incremented every clock cycle, and any packet word is written into a particular
lane buffer identified by the write pointer 940 regardless of the lane. Once the
readout of a packet has been initiated, it will continue until the last byte of the
packet has been read, and lane change may occur during the readout process.
[00110] A packet is read out from one or more of the lanes of the CQ 900 in the
order they are qualified for readout. In a preferred construction of the first
embodiment, short packets are immediately qualified for readout, while long
packets are not qualified for readout until at least CQL.Q bytes or the entire packet
has been written into the lane buffer, where CQL.Q = RILQ.
[00111] This ensures that, in the scenario where a short packet followed by a long
packet is scheduled for transmission across the same egress serial link, the packet
order is maintained even when the short packet experiences a maximum ILQ
delay and the long packet experiences a minimum ILQ delay. In one possible
implementation, an additional amount (e.g., 40 bytes) is added to the
qualification threshold to accommodate for variations in serial link transmission
delays and internal synchronization between different clock domains in the
crosspoint. [00112] When readout of a packet is initiated, the lane buffer read pointer 950 is
set equal to the lane buffer address of the beginning of the packet, and is
incremented in every clock cycle until the packet readout is completed.
[00113] When a physical switch device is partitioned into multiple virtual switch
devices, each serial link in the physical switch device has an associated
programmable base address offset. When a packet enters a crosspoint, this offset
is added to the destination transceiver identified by the packet header, and the
resulting number represents the egress serial link across which the crosspoint will
transmit the packet out of the physical switch device. This provides a simple method for partitioning a switch device into multiple virtual crosspoints.
An example of a single switch device partitioned into multiple virtual crosspoints
is shown in Figure 1.
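By way of illustration, the base-address-offset partitioning can be sketched as follows; the offset values, the number of links per virtual device, and the table layout are assumptions made for this example.

LVX = 8   # assumed number of serial links per virtual switch device (Lvx)

# Assumed programmable base address offsets, one per virtual crosspoint partition.
base_offset = {v: v * LVX for v in range(4)}

def egress_serial_link(virtual_device, dest_transceiver):
    """Physical egress serial link = destination transceiver ID + partition base offset."""
    return base_offset[virtual_device] + dest_transceiver

For example, with these assumed values a packet addressed to transceiver 3 and entering virtual crosspoint 2 would be transmitted on physical egress serial link 19.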
[00114] The scheduling function performed by a switch device when operating in
scheduler mode, according to a preferred construction of the first embodiment, is
described in detail below.
[00115] The control serial link protocol is running on the serial links connecting the
transceivers with the scheduler, and exchanges ingress transceiver queue status
information and scheduler decisions between the devices. [00116] The data serial link protocol is running on the serial links connecting the
transceivers with the crosspoints, and includes a small header per packet carrying
packet-addressing information used by the crosspoints and egress transceiver
queue systems. Each ingress transceiver is capable of forwarding multiple
packets simultaneously across multiple parallel crosspoints, such that full
throughput can be achieved when the transceiver traffic capacity exceeds the
serial link speed.
[00117] In a preferred implementation, packets are switched using a 'one packet
per serial link' protocol, combined with a connection-oriented scheduling
mechanism. When a packet is to be switched, the first step is the establishment
of a dedicated serial link connection from the ingress transceiver to the packet's
destination egress transceiver. Then follows the actual transport of the packet
across one of the crosspoints via the established data serial link connection.
Finally, once the tail of the packet is about to leave or has already left the ingress
transceiver, the serial link connection is removed. This type of connection is
referred to as a "long connection."
[00118] The delay associated with the tear down of a long connection and
establishment of a new connection is equivalent to the serial link transmission
period for a packet of size LongMin. So to achieve full throughput in terms of packets per second and bandwidth utilization, a special scheduler mechanism is
utilized for packets shorter than LongMin. For such packets, the scheduler
automatically de-allocates these serial link connections when a predetermined
period of time, such as based on the size of the short packet, has elapsed. This
contrasts with de-allocating the connection only when a termination notification
has been received from the ingress transceiver. This latter type of serial link
connection is referred to as a "short connection."
[00119] Congestion may occur at the ingress serial link queues (ILQ) and egress
serial links queues (CQ) for the following reasons:
a) Short connection scheduler decisions are performed ignoring the packet tail, since the packet sizes included in the requests forwarded to the scheduler by the transceivers do not include the packet tail.
b) Long connections are de-allocated in advance such that the long packet tails may overlap with the following transmitted packets.
c) The scheduler is allowed to perform slightly overlapping scheduling decisions.
d) Packets experience different delays as they are forwarded out of the ingress transceiver towards one of the crosspoints.
[00120] To accommodate for egress serial link congestion, the crosspoint outputs
packets to a small queue per egress serial link (where the small queue is called a
CQ). When traffic is accumulated in such a crosspoint queue, the crosspoint sends a flow control command (Egress Tick) to the scheduler via the transceiver
device, and the scheduler then delays any re-allocation of the egress serial link
for a small period of time. To accommodate for ingress serial link congestion,
the ingress transceiver implements a small queue per ingress serial link (ILQ).
When traffic is accumulated in such an ingress transceiver queue, the ingress
transceiver sends a flow control command (Ingress Tick) to the scheduler, and
the scheduler then delays any re-allocation of the ingress serial link for a small
period of time.
[00121] The scheduler algorithm used according to a preferred implementation is
an iterative matching algorithm, but it does not operate on a traditional time
slotted basis where the serial link switching resources are aligned to time slot
boundaries. Instead, the serial link resources are made available to the scheduler
as soon as they become available, and the scheduler is allowed to perform
overlapping scheduling decisions to prevent a serial link from sitting idle while
the scheduler calculates the next scheduling decision. The scheduler preferably
includes requests queues, grant arbiters, and accept arbiters to arbitrate between
packet requests. The scheduler preferably maintains a serial link state table
spanning all serial links in the entire system, which defines when a serial link is
available for a new scheduling decision. [00122] The crosspoint queue (CQ) fill levels determine the switching delay across
the crosspoint. The flow control mechanism operating between these CQs and
the scheduler ensures an upper bound on the latency and fill level of the CQ.
[00123] A switching connection includes a switch path across an ingress and egress
serial link pair to facilitate the switching of a packet. The scheduler supports the
following two connection types:
a) Short connections. A short connection is requested for packets shorter than LongMin. The scheduler will allocate the ingress and egress serial link(s) for a period of time equivalent to the packet size ignoring any packet tail (because the size of the corresponding Request did not include any packet tail), which may cause the corresponding short packet to overlap with the previous packet at both the ingress and egress serial link.
b) Long connections. A long connection is requested for packets that are equal to, or longer than, LongMin. The transceiver will request the scheduler to de-allocate the connection when only TILQ bytes remain in the ingress serial link ILQ, which may cause the corresponding long packet to overlap with the next packet at both the ingress and egress serial link.
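The two connection types can be modeled as follows from the scheduler's point of view; LONG_MIN and UNIT_BYTES stand in for LongMin and the (TailMax + 1) allocation unit, and the LinkStatus bookkeeping is an assumption made for this example.

LONG_MIN = 160      # assumed LongMin (bytes)
UNIT_BYTES = 20     # assumed allocation unit, i.e. (TailMax + 1) bytes

class LinkStatus:
    def __init__(self):
        self.busy_units = 0          # remaining short-connection allocation, in units
        self.long_allocated = False  # held until a Long Release arrives

def allocate(ingress, egress, request_size):
    """Allocate an <ingress, egress> serial link pair for one connection request."""
    if request_size >= LONG_MIN:
        ingress.long_allocated = egress.long_allocated = True   # long connection
    else:
        units = request_size // UNIT_BYTES    # the packet tail is ignored by design
        ingress.busy_units += units           # short connection expires by itself
        egress.busy_units += units

def release_long(ingress, egress):
    """Driven by the ingress serial link flow control Long Release command."""
    ingress.long_allocated = egress.long_allocated = False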
[00124] A description will now be made of the flow of information between the
transceiver and scheduler related to the establishment of serial link connections.
[00125] When a unicast packet arrives at a VoQ, a unicast connection request is
immediately forwarded to the scheduler in the following format:
Unicast Connection request: <Transceiver, Size, QoS>, where:
[Transceiver] = An integer number defining the destination transceiver.
[Size] = A specific size of a short packet (RequestMin < Size < LongMin) excluding any packet tail, or an unknown size long packet (≥ LongMin).
[QoS] = The QoS priority of the packets pending for acknowledgement in the VoQ.
Once the scheduler has granted a serial link connection for a unicast packet, it will send an acknowledge back to the transceiver. The acknowledge has the following format:
Unicast acknowledge: <Switch, Transceiver, Size, Timer>, where:
[Switch] = The virtual switch device across which the unicast packet shall be switched.
[Transceiver] = An integer number defining the destination transceiver and VoQ.
[Size] = A specific size short packet (RequestMin < Size < LongMin), or an unknown size long packet (≥ LongMin).
[Timer] = The value of ingress serial link allocation state variable Timer, just before the acknowledge was generated and Timer was updated accordingly. This field is used by the ingress serial link flow control function.
[00126] After receipt of the acknowledge, forwarding of the corresponding unicast
packet from the VoQ to the ILQ begins immediately.
[00127] Since the scheduler attempts to collapse short connection requests with
other short or long connection requests, the ingress transceiver may receive an
acknowledge for a connection that does not match the size of the current VoQ HOL packet. When this happens, the ingress transceiver calculates how many
connection requests the scheduler collapsed, and then forwards all of these as a
single packet unit with multiple embedded packets to the ILQ.
[00128] A description will now be made of the request and acknowledge
communication flow for a multicast VoQ, which is the same as the
communication flow for a unicast VoQ, except that the format of the request and
acknowledge is different and multiple acknowledges may be generated as a result
of a single request.
[00129] When a multicast packet arrives at a VoQ, a multicast connection request is
immediately forwarded to the scheduler in the following format:
Multicast Connection request: <Multicast ID, Size, QoS>, where:
[MulticastID] = The multicast ID associated with the multicast packet.
[Size] = A specific size of a short packet (RequestMin < Size < LongMin) excluding any packet tail, or an unknown size long packet (≥ LongMin).
[QoS] = The QoS priority of packets pending for acknowledgement in the VoQ.
[00130] Once the scheduler has granted a serial link connection for a multicast
packet, it will send an acknowledge back to the transceiver. The multicast
acknowledge has the following format:
Multicast acknowledge: <Multicast ID, Switch, Size>, where:
[MulticastID] = The multicast ID associated with the multicast packet.
[Switch] = The virtual switch device across which the multicast packet shall be switched.
[Size] = A specific size short packet (RequestMin < Size < LongMin), or an unknown size long packet (≥ LongMin).
[Timer] = The value of ingress serial link allocation state variable Timer, just before the acknowledge was generated. This field is used by the ingress serial link flow control function.
[00131] After receipt of the acknowledge, forwarding of the HOL multicast packet
from the VoQ to the ILQ begins immediately. Figure 11 shows an example of a
Request and Acknowledge handshake.
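The request and acknowledge message contents described above can be captured as plain records in the following sketch. The field names follow the text; the dataclass layout and the LONG_PACKET sentinel are assumptions made for this example.

from dataclasses import dataclass

LONG_PACKET = -1   # assumed sentinel meaning "unknown size, at least LongMin"

@dataclass
class UnicastRequest:
    transceiver: int   # destination transceiver
    size: int          # short-packet size excluding the tail, or LONG_PACKET
    qos: int           # QoS priority of the packets pending acknowledgement

@dataclass
class UnicastAcknowledge:
    switch: int        # virtual switch device to switch the packet across
    transceiver: int   # destination transceiver and VoQ
    size: int
    timer: int         # ingress link allocation Timer just before the acknowledge

@dataclass
class MulticastRequest:
    multicast_id: int
    size: int
    qos: int

@dataclass
class MulticastAcknowledge:
    multicast_id: int
    switch: int
    size: int
    timer: int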
[00132] The scheduler attempts to collapse multicast requests, just as it attempts to
collapse unicast requests, but only before the first acknowledge for a specific
request is transmitted.
[00133] The scheduler implements a unicast request FIFO for each VoQ in the
system. The unicast request FIFO structure is shown in Figure 12. The unicast
request FIFO is dimensioned to hold a number of requests equal to the maximum
allowed number of outstanding requests in the corresponding unicast VoQ in the
ingress transceiver (VoQReqs). The entries in the request FIFO are long
connection requests, except the last FIFO entry that can be either a short
connection request of a specific size ( < LongMin) or a long connection request
(>LongMin). This is ensured because the scheduler preferably always attempts to collapse incoming connection requests with previous requests, which has the
advantage that the number of logic gates required to implement the request FIFO
is significantly reduced, and that the required scheduler performance is reduced
since switching of larger packets in general requires less scheduler matching
efficiency than switching of smaller packets.
[00134] When a new connection request (short or long) arrives, one of the
following two actions occurs:
a) If the last entry in the connection request FIFO is a long request, the incoming connection request is added to the connection request FIFO as a new entry.
b) If the last entry in the connection request FIFO is a short request, it is collapsed with the incoming connection request. The size of the incoming request is added to the size of the last entry, and if the result of the addition is > LongMin, the last entry is converted to a long request.
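The compact request FIFO representation and the collapsing rule just listed can be sketched as follows; LONG_MIN stands in for LongMin, and the class layout is an assumption made for this example.

LONG_MIN = 160
LONG = -1                      # marker for a long (unknown size) request in the last-entry slot

class UnicastRequestFifo:
    def __init__(self):
        self.entries = 0       # number of queued requests (all but the last are long)
        self.last_size = None  # tail-less size of the last entry, or LONG
        self.qos = 0

    def push(self, size, qos):
        """size is the tail-less size of a short request, or LONG for a long request."""
        self.qos = qos                                   # QoS follows the last incoming request
        incoming = LONG_MIN if size == LONG else size    # treat a long request as "at least LongMin"
        if self.entries == 0 or self.last_size == LONG:
            # Last entry is long (or the FIFO is empty): append a new entry.
            self.entries += 1
            self.last_size = LONG if size == LONG else size
        else:
            # Last entry is short: collapse the incoming request into it.
            self.last_size += incoming
            if self.last_size >= LONG_MIN:
                self.last_size = LONG                    # collapsed entry became a long request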
[00135] Each unicast request FIFO also maintains a QoS variable. The QoS
variable is preferably set equal to the QoS priority of the last incoming request.
The request FIFO structure can be implemented with a low gate count using the
following variables:
a) The number of FIFO entries.
b) The size of the last FIFO connection request entry, which is a specific size short packet, or an unknown size long packet (≥ LongMin).
c) The QoS priority of the packet requests.
[00136] The scheduler implements one multicast request FIFO per ingress
transceiver, as well as a single multicast request queue from which multicast
requests are forwarded to the scheduler.
[00137] In a system with Lvx serial links per virtual switch, the scheduler therefore
maintains Lvx multicast request FIFOs plus a single multicast request queue. The
multicast request FIFO and multicast request queue structure is shown in Figure
13.
[00138] The multicast request FIFO is identical to the unicast request FIFO except
that:
a) Each entry in the multicast request FIFO can be either short or long, and each entry also holds the multicast ID associated with the request.
b) In addition to the unicast collapsing rules, the incoming request can only be collapsed with the last entry in the multicast request FIFO if the two multicast IDs are identical.
[00139] In this embodiment, while the multicast request FIFO collapse mechanism
does not save any significant number of logic gates required to implement the
FIFO, the system aggregate multicast pps performance is improved when bursts
of short packets sharing the same multicast ID are switched.
[00140] Requests in the multicast request FIFOs are not scheduled directly by the scheduler. Instead, there is a multicast request queue where requests from all multicast request FIFOs are stored, and from which the multicast request queue
Requests are forwarded from the request FIFOs to the request queue by a strict
priority round robin scheduler discipline according to the request FIFO's QoS
priority variables, by maintaining a round robin pointer per strict priority, as
shown in Figure 13.
[00141] When a request is inserted into the multicast request queue, a multicast
fanout bit mask where each entry corresponds to an egress transceiver is added to
the request entry. Transceivers included in the multicast ID's fanout are marked
in a fanout bit mask ("Mask" in Figure 13). The multicast fanout mask
generation is performed with a lookup in a multicast ID fanout mask table
memory. The ingress transceiver that generated the request is also added to the
request. The multicast request queue supports collapsing of requests in a manner
identical to the collapsing performed by the multicast request FIFO with the
additional rule that the packet's ingress transceiver must be identical.
[00142] The scheduler maintains allocation status per serial link in the system, and
utilizes and updates the allocation status as it processes connection requests and
schedules switch connections. The relationship between serial link allocation statuses, scheduler algorithm, and ingress and egress serial link flow control is
shown in Figure 14.
[00143] In Figure 14, short/long connection requests are received by the scheduler,
and processed according to a scheduler algorithm. The scheduler algorithm also
receives information corresponding to ingress serial link allocation status, and
egress serial link allocation status. Each ingress serial link allocation status
receives ingress serial link flow control commands (ingress ticks, long release
commands), and each egress serial link allocation status receives egress serial
link flow control commands (egress ticks and long release commands). The
scheduler algorithm uses the information it receives to generate connection
acknowledges.
[00144] An example of an egress serial link allocation status maintained by a
scheduler according to an embodiment of the invention is shown in Figure 15.
The allocation status per egress serial link tracks the allocation and de-allocation
of the egress serial link for short and long connections. The allocation status is
based on an integer variable named Timer, which is defined in the range from 0
to N. The resolution of Timer is equal to (TailMax + 1), and N = T + LongMin / (TailMax + 1). The scheduler algorithm can allocate the egress serial link for a
new short/long connection when Timer < T. [00145] When the egress serial link is allocated for a long connection, Timer is
assigned the value N. When the ingress serial link flow control releases the long
connection, Timer is assigned a value of T if no Egress Tick has arrived while
the connection was allocated, and a value of T + (ETSize / (TailMax + 1)) if one or
more Egress Ticks arrived while the long connection was allocated. The ingress
serial link flow control de-allocates long connections before their transmission is
completed, such that there is a very high probability that the scheduler can find a
match for a new short or long connection match before the transmission is
completed. Thus, a new connection is guaranteed to be allocated back-to-back
with the previous long connection on both the ingress and egress serial link even
during worst case ILQ queue fill level scenarios.
[00146] When the egress serial link is allocated for a short connection, Timer is
incremented with a value reflecting the size of the short connection. When Timer
is not equal to N, it is per default periodically decremented at a rate equivalent to
the transmission speed of the corresponding egress serial link to provide
automatic expiration of short connections. When an egress serial link flow
control Egress Tick command arrives, a number of decrements equivalent to
ETSize are skipped. [00147] The value of T is dimensioned such that if there is a very high probability
that the scheduler can find a new short or long connection match within a period
of time equal to or smaller than T, the new connection will be allocated back-to-
back with the previous connection on the egress serial link even during worst
case ILQ queue fill level scenarios. An overview of how the Timer operates is
shown in Figure 16 for the following example:
a) TailMax = 19B
b) T = 4
c) LongMin = 160B
d) N = 12
e) ETSize = 80B
These values are consistent with N = T + LongMin / (TailMax + 1) = 4 + 160 / 20 = 12.
[00148] Two variables are preferably maintained per egress serial link in the
system clock domain:
a) Timer, which tracks the allocation and de-allocation of the egress serial link for both short and long connections (implemented with Log2(N + 1) register bits).
b) EgressTick, which remembers if one or more Egress Ticks have arrived (implemented with 1 register bit).
The definition of how these variables are updated is shown in Table 3.
Table 3: Variable Updating Definition
Event: The scheduler allocates the egress serial link to a long connection.
Action: Timer = N (long connection).
Event: The scheduler allocates the egress serial link to a short connection.
Action: Timer = Timer + ShortSize (length of the short connection).
Event: The scheduler de-allocates the long connection for which the egress serial link is currently allocated (Timer = N), based upon the reception of an ingress serial link flow control Long Release command.
Action: If (EgressTick = 0): Timer = T. If (EgressTick = 1): Timer = T + ETSize / (TailMax + 1); EgressTick = 0.
Event: Periodic timer event with a time interval equivalent to the transmission period of (TailMax + 1) bytes on a serial link.
Action: If (Timer < N) and (EgressTick = 0): Timer = Timer - 1 (while Timer > 0). If (Timer < N) and (EgressTick = 1): Timer = Timer + ETSize / (TailMax + 1) - 1; EgressTick = 0.
Event: An Egress Tick has arrived.
Action: EgressTick = 1.
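The egress serial link allocation status of Table 3 can be sketched as the following state machine. The parameter values mirror the worked example above (TailMax = 19, T = 4, LongMin = 160, ETSize = 80), and the allocation criterion follows the text (the link can be allocated when Timer < T); the class itself is an illustrative model and an assumption, not the implementation.

TAIL_MAX = 19
T = 4
LONG_MIN = 160
ET_SIZE = 80
UNIT = TAIL_MAX + 1                      # Timer resolution in bytes
N = T + LONG_MIN // UNIT                 # 4 + 160 // 20 = 12

class EgressLinkStatus:
    def __init__(self):
        self.timer = 0
        self.egress_tick = 0             # remembers at most one pending Egress Tick

    def can_allocate(self):
        return self.timer < T

    def allocate_long(self):
        self.timer = N

    def allocate_short(self, short_size_units):
        self.timer += short_size_units   # short connection size expressed in Timer units

    def long_release(self):
        """Triggered by the ingress serial link flow control Long Release command."""
        if self.egress_tick:
            self.timer = T + ET_SIZE // UNIT
            self.egress_tick = 0
        else:
            self.timer = T

    def periodic_tick(self):
        """Called once per transmission period of (TailMax + 1) bytes on the link."""
        if self.timer < N:
            if self.egress_tick:
                self.timer += ET_SIZE // UNIT - 1
                self.egress_tick = 0
            elif self.timer > 0:
                self.timer -= 1

    def on_egress_tick(self):
        self.egress_tick = 1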
[00149] An example of an ingress serial link allocation status maintained by the
scheduler according to an embodiment of the invention is shown in Figure 17.
The allocation status per ingress serial link tracks the allocation and de-allocation
of the ingress serial link for short and long connections. The allocation status is
based on an integer variable named Timer, which is identical to the equivalent
Timer definition for egress serial link allocation in its range and resolution, and
also the criteria for when the scheduler algorithm can allocate the ingress serial
link for a new short/long connection.
[00150] When the ingress serial link is allocated for a long connection, Timer is
assigned the value N. When the ingress serial link flow control releases the long
connection, Timer is assigned a value of T. [00151] When the ingress serial link is allocated for a short connection, Timer is
incremented with a value reflecting the size of the short connection. When Timer
is not equal to N, it is per default periodically decremented at a rate equivalent to
the transmission speed of the corresponding ingress serial link to provide
automatic expiration of short connections. When an ingress serial link flow
control Ingress Tick command arrives, a number of decrements equivalent to
ITSize are skipped.
[00152] The value of T is preferably dimensioned such that if there is a very high
probability that the scheduler can find a new short or long connection match
within a period of time equal to or smaller than T, the new connection will be
allocated back-to-back with the previous connection on the ingress serial links.
An overview of how the Timer operates is shown in Figure 18 for the following
example:
a) TailMax = 19B
b) T = 4
c) LongMin = 160B
d) N = 12
[00153] The following variables are maintained per ingress serial link in the system clock domain:
a) Timer, which tracks the allocation and de-allocation of the ingress serial link for both short and long connections (can be implemented using Log2(N + 1) register bits).
b) Long Allocate, which detects if a long connection is never released due to a bit transmission error.
[00154] A definition of how these variables are updated is shown in Table 4.
Table 4: Definition of Variable Updating
[00155] The scheduler tracks the allocated < ingress, egress > serial link pairs for
long connections. The ingress serial link flow control Long Release command
identifies which ingress serial link must be de-allocated for long connections, but
the scheduler itself tracks which egress serial link was associated with the long connection and then de-allocate both the ingress and egress serial link. This is
performed both for imicast and multicast connections. This is done using a
monitor function per egress serial link. An overview of the monitors maintained
by the scheduler is shown in Figure 19.
[00156] Each time the scheduler allocates an <ingress serial link, egress serial
link> pair for a long connection, the monitor associated with the allocated egress
serial link is setup to point to the allocated ingress serial link. The monitor
continuously tracks if an ingress serial link flow control Long Release command
de-allocates the pointed-to ingress serial link. When this happens, the monitor
function generates an egress serial link de-allocation command for the associated
egress serial link.
[00157] Internal flow control messages operate for each ingress and egress serial
link in the system. The ingress serial link flow control mechanism uses two
signaling mechanisms:
Ingress Tick - compensates for tails accumulated during transmission of short packets.
Long Release - indicates the completion of long packet transmissions. [00158] The transmission path 1310 for these two ingress serial link flow control
signals is illustrated in Figure 20. [00159] The ingress transceiver generates a Long Release command for a specific
ILQ when the following conditions are met:
a) A long packet is currently being streamed out of the corresponding ILQ queue.
b) BILQ < TILQ with the EOP (end-of-packet) byte of the long packet being in the ILQ queue, where,
BILQ = The current number of bytes in the ILQ queue.
TILQ = An ILQ queue threshold (bytes).
The following requirements are utilized to ensure back-to-back streaming of packets across both the ingress serial link and the egress serial link:
LongMin > DLR.Data
Parameter TILQ is preferably dimensioned to be as small as possible, to reduce
worst case delay variation across the ILQ structures since this delay increases the
worst case number of colliding packets in the crosspoint CQ. The dimensioning
of TILQ is preferably made as follows: TILQ = DLR.Data.
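The Long Release trigger described above can be expressed in a short sketch; the function name and argument layout are assumptions made for this example.

def should_send_long_release(ilq_bytes, long_packet_streaming, eop_in_queue, t_ilq):
    """Return True when a Long Release command should be generated for an ILQ.
    ilq_bytes: current ILQ fill level (BILQ); t_ilq: the TILQ threshold in bytes;
    long_packet_streaming / eop_in_queue: state of the long packet being read out."""
    return long_packet_streaming and eop_in_queue and ilq_bytes < t_ilq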
[00160] The Ingress Tick flow control command is used to compensate for the
over-allocation of the ingress serial links, which occurs when packet tails are
greater than zero. When the scheduler generates a short acknowledge, it inserts
the current ingress serial link allocation status into the acknowledge message.
When the ingress transceiver processes the acknowledge, the allocation status provided by the scheduler is compared with the actual fill level of the ILQ, to
determine if an Ingress Tick command should be generated to compensate for
ILQ accumulation. This can occur when the packet tails are greater than zero.
When an Ingress Tick arrives at the scheduler, it causes the scheduler to halt
allocation of the corresponding ingress serial link for a period of time equivalent
to ITSize on the ingress serial link. The ingress transceiver generates one or more
Ingress Ticks for a specific ILQ when an acknowledge for a short packet arrives
at the ingress transceiver and the following condition is met:
BILQ + PTail > Acknowledge.Timer + PTicks, where,
BILQ = The number of bytes in the ILQ queue before the short packet is added.
PTail = The size of the short packet's tail.
Acknowledge.Timer = The ingress serial link allocation status (Timer) right before the scheduler made the scheduling decision for the short connection (Timer T), multiplied by the Timer unit, which is (TailMax + 1).
PTicks = The number of previously generated Ingress Ticks which have not yet expired, multiplied by ITSize. A previously generated Ingress Tick expires after a period of DTick-Ack, and it is the ingress transceiver which tracks the expiration of previous Ingress Ticks per ingress serial link.
The number of generated Ingress Ticks is:
#Ingress Ticks = (BILQ + PTail - Acknowledge.Timer - PTicks) div ITSize
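A minimal sketch of the Ingress Tick computation, assuming the flattened variable names used in this description (BILQ, PTail, Acknowledge.Timer, PTicks, ITSize); the helper function and its argument names are illustrative.

```python
def ingress_ticks_to_generate(b_ilq: int,
                              p_tail: int,
                              ack_timer: int,
                              tail_max: int,
                              pending_ticks: int,
                              it_size: int) -> int:
    """Illustrative count of Ingress Ticks for one ILQ.

    b_ilq:         bytes in the ILQ before the short packet is added (BILQ).
    p_tail:        size of the short packet's tail (PTail).
    ack_timer:     Timer value carried in the acknowledge; scaled to bytes by
                   the Timer unit (TailMax + 1) to obtain Acknowledge.Timer.
    pending_ticks: previously generated, not yet expired Ingress Ticks;
                   scaled to bytes by ITSize to obtain PTicks.
    it_size:       idle period represented by one Ingress Tick, in bytes.
    """
    ack_bytes = ack_timer * (tail_max + 1)      # Acknowledge.Timer in bytes
    pticks_bytes = pending_ticks * it_size      # PTicks in bytes
    excess = b_ilq + p_tail - ack_bytes - pticks_bytes
    # Ticks are only generated when the ILQ holds more than the scheduler assumed.
    return max(0, excess // it_size)


if __name__ == "__main__":
    # Example with TailMax = 19 (Timer unit of 20 bytes) and ITSize = 20 bytes.
    print(ingress_ticks_to_generate(b_ilq=120, p_tail=19, ack_timer=4,
                                    tail_max=19, pending_ticks=1, it_size=20))  # 1
```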
[00161] The egress serial link flow control mechanism uses one signaling mechanism:
Egress Tick - compensates for tails accumulated during transmission of short packets.
[00162] An egress serial link flow control message is referred to as an 'Egress Tick'. The transceiver collects the Egress Tick messages from the parallel
crosspoints, and forwards the information in control record format to the
scheduler. To minimize CQ size requirements and worst-case crosspoint packet
traversal latency, the flow control loop latency is preferably minimized. The
transmission path 1410 for the Egress Tick flow control messages is illustrated in
Figure 21.
[00163] When an Egress Tick message arrives at the scheduler, it causes the
scheduler to halt allocation of the corresponding egress serial link for a period of
time equivalent to ETSize on the crosspoint egress serial link, and a continuous
stream of Egress Tick messages will therefore halt the scheduling of any unicast
traffic to the corresponding egress serial link.
[00164] The following three (3) conditions trigger the generation of an Egress Tick flow control message:
1) BCQ ≥ TCQ, and at least MCQ bytes have been transmitted since the last Egress Tick.
2) BCQ ≥ DCQ, and at least MCQ bytes have been transmitted since the last Egress Tick.
3) BCQ ≥ 2 x ETSize, and at least 2 x DCQ bytes have been transmitted since the last Egress Tick, where,
BCQ = The current number of bytes in the CQ queue.
TCQ is an integer variable that is incremented by ETSize when condition 1 above is true, and TCQ is decremented by ETSize when BCQ < TCQ - ETSize, but TCQ can never be assigned a value less than 2 x ETSize, which is also the reset value of TCQ.
MCQ is an integer with a value slightly lower than ETSize, to ensure that back-to-back generated Egress Ticks over a long period of time will completely halt the allocation of the serial link throughout that entire period of time. This is done due to small ppm differences in clock frequency between the transceiver and crosspoint device and non-deterministic latency in the forwarding of tick signals across clock domains in the transceiver.
[00165] Conditions 2 and 3 are used because the scheduler only remembers one
Egress Tick at a time, and Egress Ticks may therefore be 'lost' upon arrival at
the scheduler if the serial link by then is allocated for a long connection.
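The three trigger conditions and the TCQ adaptation rule can be illustrated with the following Python sketch for a single CQ; the class name, the reset of the transmitted-byte counter on every generated tick, and the treatment of DCQ as an externally supplied parameter are assumptions.

```python
class EgressTickGenerator:
    """Illustrative Egress Tick trigger logic for one crosspoint CQ."""

    def __init__(self, et_size: int, m_cq: int, d_cq: int):
        self.et_size = et_size          # idle period represented by one Egress Tick
        self.m_cq = m_cq                # slightly lower than ETSize
        self.d_cq = d_cq                # DCQ parameter from the description
        self.t_cq = 2 * et_size         # TCQ reset value (and lower bound)
        self.sent_since_tick = 0        # bytes transmitted since the last Egress Tick

    def on_bytes_transmitted(self, nbytes: int) -> None:
        self.sent_since_tick += nbytes

    def check(self, b_cq: int) -> bool:
        """Return True if an Egress Tick should be generated for fill level b_cq."""
        cond1 = b_cq >= self.t_cq and self.sent_since_tick >= self.m_cq
        cond2 = b_cq >= self.d_cq and self.sent_since_tick >= self.m_cq
        cond3 = b_cq >= 2 * self.et_size and self.sent_since_tick >= 2 * self.d_cq

        # TCQ adapts to the fill level: it grows when condition 1 fires and
        # shrinks again as the CQ drains, but never below 2 x ETSize.
        if cond1:
            self.t_cq += self.et_size
        elif b_cq < self.t_cq - self.et_size:
            self.t_cq = max(2 * self.et_size, self.t_cq - self.et_size)

        if cond1 or cond2 or cond3:
            self.sent_since_tick = 0
            return True
        return False


if __name__ == "__main__":
    gen = EgressTickGenerator(et_size=80, m_cq=76, d_cq=240)
    gen.on_bytes_transmitted(100)
    print(gen.check(b_cq=200))   # True: condition 1 (200 >= 160) with enough bytes sent
```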
[00166] A connection oriented scheduler approach is utilized in a preferred
construction of the first embodiment, in which, when a packet is to be switched
across a switch device, a connection is first established. In this context, a
"connection" means a switch path between an ingress transceiver and an egress
transceiver through a crosspoint. Once the connection is established, at least one packet is streamed across the connection, and then the connection is removed.
This is different from a conventional crosspoint scheme in which packets are switched on a time-slotted basis. Ideally, if there were no physical implementation constraints, the system would only have long packets that result in long connections. However, because some packets to be switched are relatively small in size, the time it takes to establish a connection, remove it again, and communicate that handshake back and forth between the scheduler and the transceiver may take longer than the actual transmission of the relatively shorter packets. Accordingly, the first embodiment
of the invention supports short connections as well as long connections, where
these respective connections are allocated differently. The long connection is
established for as long as the connection is needed. The short connection, on the
other hand, is allocated for only a predetermined amount of time.
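As an illustration of the two connection types, the following Python sketch, under the assumption of a simple byte-time counter, keeps a table in which short connections expire after a fixed duration while long connections persist until a Long Release arrives; all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Connection:
    ingress: int
    egress: int
    long: bool
    expires_at: int | None   # byte-time at which a short connection auto-expires


class ConnectionTable:
    """Illustrative bookkeeping of short and long connections on serial links."""

    def __init__(self, short_duration: int):
        self.short_duration = short_duration      # fixed allocation for short connections
        self.active: dict[int, Connection] = {}   # keyed by egress serial link

    def allocate(self, ingress: int, egress: int, long: bool, now: int) -> Connection:
        expires = None if long else now + self.short_duration
        conn = Connection(ingress, egress, long, expires)
        self.active[egress] = conn
        return conn

    def on_long_release(self, ingress: int) -> None:
        # A long connection is torn down only when its termination request arrives.
        for egress, conn in list(self.active.items()):
            if conn.long and conn.ingress == ingress:
                del self.active[egress]

    def expire_short(self, now: int) -> None:
        # Short connections are removed by the scheduler itself when their time is up.
        for egress, conn in list(self.active.items()):
            if not conn.long and conn.expires_at is not None and now >= conn.expires_at:
                del self.active[egress]


if __name__ == "__main__":
    table = ConnectionTable(short_duration=80)
    table.allocate(ingress=1, egress=5, long=True, now=0)
    table.allocate(ingress=2, egress=6, long=False, now=0)
    table.expire_short(now=80)          # the short connection is removed by the scheduler
    table.on_long_release(ingress=1)    # the long connection waits for its Long Release
    print(table.active)                 # {} -> both links are free again
```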
[00167] Figure 22 shows a Scheduler Decision Time Line 1510 versus a Serial
Link Path Allocation Time Line 1520, whereby a scheduler decision is made, and
then the corresponding connection is allocated on a serial link path. In the
example shown in Figure 22, the scheduler makes a first decision (Decision 1)
that it is to establish a long connection to provide a switching path for a long
packet. A long connection termination request is received at the scheduler, which effectively notifies the scheduler that the packet has been switched, and
that the allocated long connection serial link path can be closed. In the next
scheduling decision, Decision 2, in which the scheduler performs a scheduling
decision for a short packet, the scheduler assigns a short connection for a fixed
amount of time on an allocated serial link. When that period of time has expired,
the connection is removed by the scheduler itself, and the allocated serial link
path is then again made available for upcoming scheduling switching decisions.
Another long connection is allocated, as Decision 3, by the scheduler, whereby
this long connection is scheduled to start right after the short connection
automatically expires, since the scheduler knows the exact time that the first short
connection expires and does not receive any connection tear-down
acknowledgement from the ingress transceiver. After that, another long or short
connection is made, as Decision 4.
[00168] In a preferred construction of the first embodiment, the long connection
termination request is output by the corresponding ingress serial link flow control
function a short amount of time before the actual time that the long connection
packet has finished being switched over the allocated ingress serial link,
essentially ignoring the packet tail, to thereby ensure that packets are switched
back-to-back against each other on the ingress and egress serial links. [00169] In one possible implementation of the first embodiment, for short packets,
the so-called predetermined short connections, the amount of time that the ingress
transceiver will request for a connection excludes any potential tail of the short
packet. For example, if the embodiment defines TailMax as 19 bytes, for a 99
byte short packet the ingress transceiver will request a short connection for a
period of time equivalent to the transmission period of only 80 bytes, whereby
the 19 byte packet tail is effectively ignored by the scheduler.
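The 99-byte example can be reproduced with a short calculation, assuming a cell size of (TailMax + 1) bytes; the helper below is illustrative only.

```python
def short_connection_request_bytes(packet_bytes: int, tail_max: int) -> tuple[int, int]:
    """Return (requested transmission bytes, ignored tail bytes) for a short packet.

    The scheduler allocates only whole cells of (tail_max + 1) bytes; the
    remainder (the packet tail) is ignored by the scheduler and absorbed by
    the ILQ/CQ structures.
    """
    cell = tail_max + 1
    tail = packet_bytes % cell
    return packet_bytes - tail, tail


if __name__ == "__main__":
    print(short_connection_request_bytes(99, 19))   # (80, 19): request 80 bytes, 19-byte tail
    print(short_connection_request_bytes(100, 19))  # (100, 0): no tail
```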
[00170] The packet tails ignored by the scheduler may accumulate at the ILQs on
the ingress serial link side and at the CQs on the egress serial link side, whereby
the structure of the ILQs and the CQs allows the packet collisions to be absorbed and the packets to be output back-to-back across the serial links, to thereby achieve
byte-level granularity even though the scheduler can only operate with a cell-
magnitude decision cycle.
[00171] An egress serial link flow control for adjusting for packet tail accumulation that may occur in a crosspoint queue, according to a first embodiment of the invention, is described below with reference to Figure 21. When the packet tails from adjacent packets start overlapping with each other at an egress serial link, such that a CQ begins to fill up, a flow control message, referred to herein as
an Egress Tick, is output by the CQ where the packet overlapping is occurring. The Egress Tick is provided to the scheduler by the transceiver of the destination
packet. The Egress Tick causes the scheduler to force the corresponding egress
serial link on the crosspoint idle for a fixed period of time. In one
implementation, each Egress Tick causes an 80 byte idle time for the particular
egress serial link that provided that Egress Tick to the scheduler. The idle time
amount can be changed to suit a particular system, whereby a 40 byte Egress
Tick or a 120 byte Egress Tick could be utilized, for example.
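The scheduler-side effect of a flow control tick can be pictured with the following sketch, which keeps a per-link 'idle until' byte-time; the class name is an assumption, and the single-remembered-tick behavior noted in paragraph [00165] is not modelled.

```python
class LinkIdleTracker:
    """Illustrative per-serial-link idle bookkeeping driven by flow control ticks."""

    def __init__(self, tick_idle_bytes: int = 80):
        self.tick_idle_bytes = tick_idle_bytes   # e.g. 80, 40 or 120 bytes per Egress Tick
        self.idle_until: dict[int, int] = {}     # link id -> byte-time until which it is idle

    def on_tick(self, link: int, now: int) -> None:
        # Each tick forces the link idle for a fixed number of byte-times,
        # extending any idle period that is already pending.
        start = max(now, self.idle_until.get(link, now))
        self.idle_until[link] = start + self.tick_idle_bytes

    def may_allocate(self, link: int, now: int) -> bool:
        return now >= self.idle_until.get(link, 0)


if __name__ == "__main__":
    tracker = LinkIdleTracker(tick_idle_bytes=80)
    tracker.on_tick(link=2, now=100)
    print(tracker.may_allocate(2, 150))   # False: link 2 is idle until byte-time 180
    print(tracker.may_allocate(2, 200))   # True
```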
[00172] The flow control Egress Tick signal is transported from the crosspoint to the scheduler by first being sent from the congested CQ of the crosspoint to an egress transceiver, then looped to the corresponding ingress transceiver, which sends it to the scheduler. In a preferred construction, there is
no direct electrical path between the crosspoint queue and the scheduler, whereby
only the transceivers directly communicate with the scheduler.
[00173] Similar to the egress side, if packet tails from two or more packets collide
at an ingress serial link, that collision is absorbed by the corresponding ILQ on
that ingress serial link. This happens because the scheduler does not consider
packet tails when it allocates ingress serial links, similar to the scheduler's
allocation of egress serial links. [00174] When a certain amount of accumulation occurs at an ILQ, it sends a flow
control signal, referred to herein as an Ingress Tick, to the scheduler to thereby
force the corresponding ingress link idle for a certain period of time, so as to
allow the ILQ to alleviate its congestion. In a preferred construction, each
Ingress Tick flow control message forces the corresponding link idle for a period
of time that is equivalent to 20 bytes. Alternatively, another amount of idle time,
such as 40 or 80 bytes, can be utilized for a single Ingress Tick, depending upon
the particular implementation of the switch device. Figure 20 shows an ingress
serial link flow control according to the first embodiment, whereby an Ingress
Tick is sent directly from a congested ILQ of an ingress transceiver to the
scheduler, whereby the scheduler reserves an idle time for that ingress serial link
to allow the congested ILQ to alleviate its congestion.
[00175] The left side portion 1710 of Figure 23 shows an example whereby worst-
case CQ contention occurs for an ILQ and a CQ, assuming a TailMax value of 19
bytes. The time line at the top of Figure 23 is broken down into 80-byte chunks.
At time TO, the scheduler schedules a first and a second 99-byte packet from two
different ingress transceivers to be sent across the crosspoint unit to the same
egress transceiver, whereby there is a delay in those packets exiting from their
respective ILQs along the path to the CQ that is coupled to the allocated egress transceiver. This ILQ delay is the worst case equivalent to a current ILQ fill
level of RILQ bytes.
[00176] At time T1, the scheduler schedules a third 99-byte packet to be sent across
the crosspoint unit to the same egress transceiver, whereby the delay or latency in
outputting that third 99-byte packet from the respective ILQ of its ingress serial
link along a path to the CQ that is coupled to the allocated egress serial
transceiver is less than the delay for the first and second 99-byte packets.
Similarly, at time T2 and at time T3, fourth and fifth 99-byte packets are
scheduled by the scheduler to be switched across the crosspoint unit to make their
way to the same CQ that is coupled to the same allocated egress transceiver,
respectively, whereby the fourth and fifth 99-byte packets have a shorter delay
time in their respective ILQs than the other packets previously scheduled by the
scheduler for that same egress transceiver.
[00177] At a time T4, a sixth 99-byte packet is scheduled by the scheduler to go to
the same egress transceiver, whereby, in the worst case, the delay in outputting
that sixth 99-byte packet from the respective ILQ of its ingress serial link along a
path to the CQ that is coupled to the allocated egress serial transceiver is virtually
non-existent. Accordingly, the CQ of the crosspoint unit receives six separate
99-byte packets overlapping each other, and consequently the CQ is capable of storing each of these six separate 99-byte packets in a respective one of its six
lanes simultaneously to avoid packet data loss at the CQ even in this worst-case
scenario. Of course, one of ordinary skill in the art will recognize that the
number of lanes for the CQ is configurable for any given implementation.
[00178] At this point, in the right side portion 1720 of Figure 23, each packet
scheduled by the scheduler to go to the same egress transceiver is output from its
respective ILQ at the same fill level, whereby only the overlap between adjacent
packets received at the CQ exists due to the 19 byte packet tails that are not
considered by the scheduler when it makes its scheduling decisions. In this
example, only two of the CQ lanes are utilized in order to cope with the
overlapping packets arriving at the CQ at the same time, and again no data is lost
at the egress serial link side.
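The six-lane worst case can be checked with a small overlap calculation; the arrival-time model below is an illustrative simplification of Figure 23 rather than the timing model of the specification.

```python
def max_overlapping_packets(arrival_times: list[int], packet_bytes: int) -> int:
    """Return the maximum number of packets present in the CQ at the same time.

    Each packet is modelled as occupying the CQ from its arrival byte-time
    until arrival + packet_bytes; the CQ needs one lane per simultaneously
    present packet.
    """
    events = []
    for t in arrival_times:
        events.append((t, +1))                  # packet starts arriving
        events.append((t + packet_bytes, -1))   # packet fully drained
    overlap = peak = 0
    for _, delta in sorted(events):
        overlap += delta
        peak = max(peak, overlap)
    return peak


if __name__ == "__main__":
    # Worst case of the left side of Figure 23: scheduling decisions every 80 bytes,
    # but worst-case ILQ delays compress the arrivals so the 99-byte packets overlap.
    arrivals = [0, 0, 20, 40, 60, 80]              # illustrative arrival byte-times at the CQ
    print(max_overlapping_packets(arrivals, 99))   # 6 lanes needed in this worst case
```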
[00179] Thus, methods and apparatuses have been described according to the
present invention. Many modifications and variations may be made to the
techniques and structures described and illustrated herein without departing from
the spirit and scope of the invention. Accordingly, it should be understood that
the methods and apparatuses described herein are illustrative only and are not
limiting upon the scope of the invention. For example, while the preferred
embodiments have been described with respect to a transceiver device corresponding to a linecard on a one-to-one basis, the present invention is also
applicable to a case whereby a single linecard can implement more than one
ingress or egress transceiver device. Furthermore, the present invention is also
applicable to a case whereby all of the transceivers and switch components of a
network switching device are implemented on a single card (e.g., a single
printed circuit board). For example, in that case, all of the components shown
in Figure 6 would be implemented on a single card. Moreover, any one or
more aspects shown and described herein may be practiced individually in any
given system, or may be practiced together with one or more other aspects,
including practicing all non-conflicting aspects in a single system. Further, the
present invention may be implemented in any type of switching system for any
type of network.
[00180] This application claims priority from United States Patent Application
No. 10/898,540, filed July 26, 2004, which is incorporated herein by reference
in its entirety.

Claims

What Is Claimed Is:
1. A network switching system, comprising: a switching device configured to communicatively connect one of a plurality of input transceivers to one of a plurality of output transceivers, the switching device including: a) a crosspoint having: i) a plurality of inputs for receiving variable-sized packets from the respective input transceiver, ii) a crosspoint matrix, iii) a plurality of outputs for outputting the variable-sized packets to the respective output transceiver, and iv) a plurality of output queues respectively provided between the outputs of the crosspoint matrix and the output transceivers; and b) a scheduler configured to schedule the variable-sized packets across the crosspoint matrix while ignoring any packet tails.
2. The network switching system according to claim 1, wherein each of the plurality of output queues is configured to absorb partially overlapping packets of at least two variable-sized packets output from the crosspoint matrix.
3. The network switching system according to claim 2, wherein each of the plurality of output queues is configured to output the variable-sized packets back-to-back to a same one of the output transceivers.
4. The network switching system according to claim 1, wherein each of the plurality of input transceivers comprises: a plurality of input FIFOs, wherein each of said plurality of input FIFOs of the input transceivers is configured to store the variable-sized packets and to output the variable-sized packets to said crosspoint matrix, wherein each of said plurality of input FIFOs of the input transceivers is configured to absorb overlapping packets of at least two variable-sized packets that are scheduled by the scheduler to be output from one of said input FIFOs and thereby through the switching device.
5. The network switching system according to claim 1, wherein the scheduler is configured to schedule each of the variable size packets across the crosspoint matrix in one of: a) a predetermined connection time length when the corresponding variable size packet to be transferred across the crosspoint matrix is less than a predetermined size, and b) a variable connection time length that is based on the size of the variable size packet when the corresponding variable size packet to be transferred across the crosspoint matrix is greater than or equal to the predetermined size.
6. The network switching system according to claim 5, wherein a packet long connection termination message indicating an end of a packet transmission is output by a corresponding one of said plurality of input FIFOs at a predetermined time before the variable size long packet from the one of said plurality of input FIFOs has been completed.
7. The network switching system according to claim 5, wherein each of the output queues is configured to output a first variable size packet that is less than the predetermined size as soon as a portion of the variable size packet is received and stored in the output queue, and wherein the output queue is configured to not output a second variable size packet that is greater than or equal to the predetermined size until at least a predetermined amount of the second variable size packet is received and stored in the output queue.
8. The network switching system according to claim 1, further comprising: a monitor unit configured to, when the scheduler allocates an ingress serial link/egress serial link pair for a long connection, continuously track whether or not an ingress serial link flow control Long Release command deallocates the ingress serial link in the pair, and when the ingress serial link flow control command for deallocating the ingress serial link is detected by the monitor unit, to generate an egress serial link deallocation command for deallocating the egress serial link in the pair.
9. The network switching system according to claim 1, further comprising: a monitor unit configured to, when the scheduler allocates an ingress serial link/egress serial link pair for a long connection, continuously track whether or not an egress serial link flow control command deallocates the egress serial link in the pair, and when the egress serial link flow control command for deallocating the egress serial link is detected by the monitor unit, to generate an ingress serial link deallocation command for deallocating the ingress serial link in the pair.
10. The network switching system according to claim 1, wherein the scheduler is configured to allocate an ingress serial link/egress serial link pair for a long connection based on a packet size being greater than a predetermined packet size, the network switching system further comprising: a monitor unit configured to continuously track if an ingress serial link flow control Long Release command de-allocates the ingress serial link in the pair that it is currently pointing to, and if so, the monitor being configured to generate an egress serial link de-allocation command for the egress serial link in the pair.
11. The network switching system according to claim 1 , wherein the switching device is capable of operating in either a crosspoint mode or a scheduler mode, in which the crosspoint mode utilizes the crosspoint and not the scheduler, and in which the scheduler mode utilizes the scheduler and not the crosspoint.
12. A network switching system, comprising: a plurality of first transceiver devices respectively provided on a plurality of input line cards; a plurality of second transceiver devices respectively provided on a plurality of output line cards; and a plurality of switch devices communicatively coupled to each of the plurality of input line cards and the plurality of output line cards, the plurality of switch devices being disposed in a parallel arrangement with respect to each other, wherein each of the plurality of switch devices comprises: a crosspoint matrix for communicatively connecting at least one of the input line cards to at least one of the output line cards; and at least one output queue for temporarily storing at least a part of a packet output from the crosspoint matrix unit prior to the packet being sent to one of the second transceiver devices; wherein a respective one of said plurality of switch devices switches the packet based on scheduling commands provided by way of a scheduler that controls at least one of the plurality of switch devices disposed in the parallel arrangement.
13. The network switching system according to claim 12, wherein the scheduler comprises unicast request queues for maintaining outstanding unicast requests for transferring packets across the crosspoint matrix, and wherein the scheduler is capable of switching multiple unicast packets in parallel across the plurality of switch devices, to thereby perform load balancing.
14. The network switching system according to claim 13, wherein the scheduler is configured to collapse the outstanding unicast requests stored in the unicast request queue, such that all short unicast requests in the unicast request queue except for a last entry in the unicast request queue are collapsed into long connection requests, and wherein the last entry in the request queue may correspond to either a long connection request or a short connection request.
15. The network switching system according to claim 12, wherein each of said plurality of first transceiver devices is configured to collapse a plurality of packets into a single equivalent packet unit, and wherein said each of said plurality of first transceiver devices is configured to output the single equivalent packet unit to a VoQ buffer.
16. The network switching system according to claim 12, wherein the scheduler comprises multiple request buffers and a single request Queue for maintaining outstanding multicast requests for transferring packets across the crosspoint matrix, and wherein the scheduler is capable of switching multiple multicast packets in parallel across the plurality of switch devices, to thereby perform load balancing.
17. The network switching system according to claim 16, wherein the scheduler is configured to collapse multicast requests stored in the multicast request FIFO when the multicast request FIFO entries have the same multicast ID; and wherein the scheduler further is configured to collapse the outstanding multicast requests stored in the multicast request Queue when the multicast request Queue entries have the same multicast ID and were received from the same transceiver device.
18. The network switching system according to claim 12, wherein each of said plurality of first transceiver devices comprises: at least one input FIFO for temporarily storing at least one packet prior to the at least one packet being sent to said crosspoint matrix; wherein a respective one of said switch devices switches the at least one packet received from the at least one input FIFO based on scheduling commands provided by way of the scheduler.
19. The network switching system according to claim 18, wherein the at least one output queue comprises a plurality of lane buffers disposed in a parallel relationship with respect to each other.
20. The network switching system according to claim 18, wherein each of the switch devices is configured to receive variable size packets, and wherein the at least one input FIFO and the at least one output queue are configured to absorb overlapping packets resulting from the scheduler operating without regard for packet tails.
21. The network switching system according to claim 18, wherein the scheduler performs load balancing by scheduling packets across each of the switch devices of the parallel arrangement of switch devices in a fairly even manner.
22. A network switching system, comprising: a crosspoint matrix configured to switch packets from one of a plurality of M ports of the crosspoint matrix to one of a plurality of N output ports of the crosspoint matrix, M and N being integers greater than one; a scheduler configured to schedule packets across the crosspoint matrix in predetermined-size units; an output queue provided for at least one of the plurality of outputs of the crosspoint matrix unit, the output queue capable of receiving up to X overlapping packets having at least a partial overlap with one another, and to store the up to the X overlapping packets without any loss in packet data, X being an integer greater than one and less than each of M and N; and a control unit configured to output a flow control signal from the output queue to the scheduler, the flow control signal causing the scheduler to stop sending packets to the at least one of the plurality of N output ports of the crosspoint matrix for at least a predetermined period of time.
23. The network switching system according to claim 22, wherein X is at least one order of magnitude less than M and N.
24. A network switching system according to claim 22, further comprising: an input transceiver communicatively coupled to the crosspoint matrix; and an output transceiver communicatively coupled to the crosspoint matrix, wherein the input transceiver includes an input FIFO communicatively coupled to one of the plurality of M input ports of the crosspoint, the input FIFO configured to store up to N partial packets, the input FIFO of the input transceiver being provided to accommodate variable size packet overlap at the input transceiver due to the scheduler scheduling of packets without regard to packet tails.
25. The network switching system according to claim 1, wherein the scheduler performs scheduling of first and second packets respectively received at a first transceiver of the plurality of input transceivers by: sending commands from the scheduler to the first transceiver to switch the first and second packets, wherein the commands direct the switching of the first and second packets without regard to any packet tails of the first and second packets; and buffering the first and second packets at the first transceiver by way of a queue, prior to outputting the first and second packets back-to-back to the switch device in accordance with the commands provided by the scheduler, wherein the buffering of the first and second packets results in no loss of data of the first and second packets as the first and second packets are sent to the crosspoint.
26. The network switching system according to claim 25, wherein, when the scheduler receives an ingress tick flow control message indicating an amount of congestion at a buffer of the first transceiver, the scheduler stops scheduling any new packets to the buffer for a predetermined amount of time, in order to relieve the congestion at the buffer of the first transceiver.
27. The network switching system according to claim 25, wherein the first packet corresponds to a long packet greater than a predetermined size, wherein the second packet corresponds to a short packet less than or equal to the predetermined size, wherein the scheduler is configured to schedule the first packet across the crosspoint based on a handshaking scheme; and wherein the scheduler is configured to schedule the second packet across the crosspoint based on a fixed time allocation scheme.
28. The network switching system according to claim 27, wherein the scheduler is configured to determine whether or not cut through switching is to be performed on the first packet, and if so, switching the first packet across the crosspoint using a cut through path that speeds up transfer of packets across the crosspoint as compared to not using the cut through path.
29. A network switching system, comprising: a plurality of virtual switching devices each configured to communicatively connect one of a plurality of input transceiver devices to one of a plurality of output transceiver devices, each of said plurality of virtual switching devices being capable of operating in a first crosspoint switching mode that includes: a) a plurality of inputs for receiving variable-sized packets from the respective input transceiver devices, b) a crosspoint matrix, c) a plurality of outputs for outputting the variable-sized packets to the respective output transceiver devices, and d) a plurality of output queues respectively provided between the outputs of the crosspoint matrix and the output transceiver devices, wherein the plurality of virtual switching devices correspond to a single physical switching device.
30. The network switching system according to claim 29, wherein each of said plurality of virtual switching devices is also capable of operating in a second scheduler mode that includes a scheduler that schedules packets across at least one crosspoint matrix without regards to packet tails.
31. The network switching system according to claim 30, further comprising: means for allocating at least one of said virtual switching devices to operate in the second scheduler mode, and for allocating at least one other of said virtual switching devices to operate in the first crosspoint switching mode.
32. The network switching system according to claim 1, wherein the scheduler is configured to determine whether or not a packet provided to the crosspoint matrix via the at least one input path is greater than a predetermined byte value, wherein the received packet is determined to be a short packet if it is less than or equal to the predetermined byte value and is determined to be a long packet if it is greater than the predetermined byte value, and wherein the scheduler schedules long packets differently than short packets across the crosspoint matrix.
33. The network switching system according to claim 32, wherein the scheduler schedules a short packet by a connection-oriented scheme which allocates a predetermined time for passing the short packet through the crosspoint matrix, and wherein the scheduler schedules a long packet by a connection-oriented scheme which allocates a path across the crosspoint matrix that is deallocated upon receipt of a packet termination message sent from the at least one input path that signifies the sending of the long packet to the crosspoint matrix.
34. The network switching system according to claim 1, wherein the switching device is capable of operating in either a crosspoint mode or a scheduler mode, wherein the network switching system further comprises: a second switching device capable of operating in either a crosspoint mode or a scheduler mode, the second switching device including: a second crosspoint having a second crosspoint matrix; and a second scheduler, wherein the network switching system is capable of operating in a first mode in which the switching device operates in the crosspoint mode and the second switching device operates in the scheduler mode in order to schedule packets across the crosspoint matrix of the switching device, and wherein the network switching system is capable of operating in a second mode in which the switching device operates in the scheduler mode and the second switching device operates in the crosspoint mode in order to schedule packets across the second crosspoint matrix of the second switching device.
35. The network switching system according to claim 34, wherein the second scheduler of the second switching device is configured to schedule packets across the crosspoint matrix of the switching device without regard to packet tails.
36. The network switching system according to claim 1, wherein the scheduler schedules packets to the plurality of outputs of the crosspoint back-to-back assuming a packet size which is less than or equal to an actual packet size due to the scheduler ignoring packet tails.
PCT/US2005/025438 2004-07-26 2005-07-18 Network interconnect crosspoint switching architecture and method WO2006020232A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP05772308A EP1779607B1 (en) 2004-07-26 2005-07-18 Network interconnect crosspoint switching architecture and method
DE602005012278T DE602005012278D1 (en) 2004-07-26 2005-07-18 NETWORK CONNECTION CROSS-POINT SWITCH ARCHITECTURE AND METHOD

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US10/898,540 2004-07-26
US10/898,540 US7742486B2 (en) 2004-07-26 2004-07-26 Network interconnect crosspoint switching architecture and method

Publications (1)

Publication Number Publication Date
WO2006020232A1 true WO2006020232A1 (en) 2006-02-23

Family

ID=33518320

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2005/025438 WO2006020232A1 (en) 2004-07-26 2005-07-18 Network interconnect crosspoint switching architecture and method

Country Status (6)

Country Link
US (1) US7742486B2 (en)
EP (1) EP1779607B1 (en)
AT (1) ATE420516T1 (en)
DE (1) DE602005012278D1 (en)
GB (1) GB2414625B (en)
WO (1) WO2006020232A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014502077A (en) * 2010-10-28 2014-01-23 コンパス・エレクトロ−オプティカル・システムズ・リミテッド Router and switch architecture
CN111614752A (en) * 2020-05-19 2020-09-01 北京百度网讯科技有限公司 Method and device for data transmission

Families Citing this family (52)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006101177A (en) * 2004-09-29 2006-04-13 Fujitsu Ltd Data transfer apparatus
US8559443B2 (en) 2005-07-22 2013-10-15 Marvell International Ltd. Efficient message switching in a switching apparatus
CA2562592A1 (en) * 2005-11-28 2007-05-28 Tundra Semiconductor Corporation Method and system for handling multicast event control symbols
CA2562634A1 (en) * 2005-11-28 2007-05-28 Tundra Semiconductor Corporation Method and switch for broadcasting packets
JP4334534B2 (en) * 2005-11-29 2009-09-30 株式会社東芝 Bridge device and bridge system
US7672236B1 (en) * 2005-12-16 2010-03-02 Nortel Networks Limited Method and architecture for a scalable application and security switch using multi-level load balancing
US7738473B2 (en) * 2006-04-20 2010-06-15 Forestay Research, Llc Multicast switching in a credit based unicast and multicast switching architecture
US8639833B1 (en) * 2006-10-06 2014-01-28 Nvidia Corporation Dynamic control of scaling in computing devices
US7852866B2 (en) * 2006-12-29 2010-12-14 Polytechnic Institute of New York Universiity Low complexity scheduling algorithm for a buffered crossbar switch with 100% throughput
US8402181B2 (en) * 2007-03-14 2013-03-19 Integrated Device Technology, Inc. Bifurcate arbiter
US7813415B2 (en) * 2007-06-11 2010-10-12 Lsi Corporation System for automatic bandwidth control of equalizer adaptation loops
WO2009029817A1 (en) * 2007-08-31 2009-03-05 New Jersey Institute Of Technology Replicating and switching multicast internet packets in routers using crosspoint memory shared by output ports
US7974278B1 (en) 2007-12-12 2011-07-05 Integrated Device Technology, Inc. Packet switch with configurable virtual channels
US20090225775A1 (en) * 2008-03-06 2009-09-10 Integrated Device Technology, Inc. Serial Buffer To Support Reliable Connection Between Rapid I/O End-Point And FPGA Lite-Weight Protocols
US20090228733A1 (en) * 2008-03-06 2009-09-10 Integrated Device Technology, Inc. Power Management On sRIO Endpoint
US8312241B2 (en) * 2008-03-06 2012-11-13 Integrated Device Technology, Inc. Serial buffer to support request packets with out of order response packets
US8625621B2 (en) * 2008-03-06 2014-01-07 Integrated Device Technology, Inc. Method to support flexible data transport on serial protocols
US8312190B2 (en) * 2008-03-06 2012-11-13 Integrated Device Technology, Inc. Protocol translation in a serial buffer
US8213448B2 (en) * 2008-03-06 2012-07-03 Integrated Device Technology, Inc. Method to support lossless real time data sampling and processing on rapid I/O end-point
US7907625B1 (en) * 2008-08-04 2011-03-15 Integrated Device Technology, Inc. Power reduction technique for buffered crossbar switch
US8385202B2 (en) * 2008-08-27 2013-02-26 Cisco Technology, Inc. Virtual switch quality of service for virtual machines
US8081569B2 (en) 2009-04-20 2011-12-20 Telefonaktiebolaget L M Ericsson (Publ) Dynamic adjustment of connection setup request parameters
US8638799B2 (en) * 2009-07-10 2014-01-28 Hewlett-Packard Development Company, L.P. Establishing network quality of service for a virtual machine
KR101025255B1 (en) 2010-01-18 2011-03-29 연세대학교 산학협력단 Apparatus for controlling signal transmition and method for controlling the same
US8693470B1 (en) * 2010-05-03 2014-04-08 Cisco Technology, Inc. Distributed routing with centralized quality of service
US8780931B2 (en) 2011-05-14 2014-07-15 International Business Machines Corporation Multi-role distributed line card
US8982905B2 (en) 2011-05-16 2015-03-17 International Business Machines Corporation Fabric interconnect for distributed fabric architecture
KR101390092B1 (en) 2011-12-01 2014-04-29 연세대학교 산학협력단 Network relay apparatus having virtual output queue and the control method thereof
CN103428817B (en) * 2012-05-23 2016-08-03 华为技术有限公司 D2D method for discovering equipment based on LTE cellular communication system and device
US8867533B2 (en) 2013-01-08 2014-10-21 Apple Inc. Multi-tier switch interface unit arbiter
US9485188B2 (en) * 2013-02-01 2016-11-01 International Business Machines Corporation Virtual switching based flow control
US9166925B2 (en) * 2013-04-05 2015-10-20 International Business Machines Corporation Virtual quantized congestion notification
US9160791B2 (en) * 2013-08-13 2015-10-13 International Business Machines Corporation Managing connection failover in a load balancer
US10116558B2 (en) 2014-01-24 2018-10-30 Fiber Mountain, Inc. Packet switch using physical layer fiber pathways
WO2015148970A1 (en) 2014-03-28 2015-10-01 Fiber Mountain, Inc. Built in alternate links within a switch
EP3138347A4 (en) * 2014-04-28 2017-12-13 Intel IP Corporation Simultaneous scheduling request transmission in dual connectivity
WO2016004340A1 (en) 2014-07-03 2016-01-07 Fiber Mountain, Inc. Data center path switch with improved path interconnection architecture
US10382845B2 (en) 2014-09-29 2019-08-13 Fiber Mountain, Inc. System for increasing fiber port density in data center applications
WO2016054028A1 (en) 2014-09-29 2016-04-07 Fiber Mountain, Inc. Data center network
US20160139806A1 (en) * 2014-11-13 2016-05-19 Cavium, Inc. Independent Ordering Of Independent Transactions
US10270713B2 (en) * 2014-12-16 2019-04-23 Oracle International Corporation Scheduling packets with multiple destinations in a virtual output queue network switch
US9813362B2 (en) * 2014-12-16 2017-11-07 Oracle International Corporation Framework for scheduling packets with multiple destinations in a virtual output queue network switch
WO2016164769A1 (en) * 2015-04-09 2016-10-13 Fiber Mountain, Inc. Data center endpoint network device with built in switch
US9871610B2 (en) * 2015-10-30 2018-01-16 Citrix Systems, Inc. Method for packet scheduling using multiple packet schedulers
WO2018053179A1 (en) 2016-09-14 2018-03-22 Fiber Mountain, Inc. Intelligent fiber port management
US11146269B1 (en) * 2018-02-05 2021-10-12 Rambus Inc. Low power cryogenic switch
CN108200218B (en) * 2018-03-09 2021-11-26 北京奇艺世纪科技有限公司 Method and device for realizing load balance and electronic equipment
US11494212B2 (en) * 2018-09-27 2022-11-08 Intel Corporation Technologies for adaptive platform resource assignment
US10902177B1 (en) * 2019-02-20 2021-01-26 Cadence Design Systems, Inc. Reconfigurable switch for a computing system
US11575609B2 (en) * 2019-07-19 2023-02-07 Intel Corporation Techniques for congestion management in a network
WO2021146964A1 (en) * 2020-01-21 2021-07-29 华为技术有限公司 Switch element and switch apparatus
CN117321977A (en) * 2021-06-15 2023-12-29 华为技术有限公司 Data exchange method and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6044061A (en) * 1998-03-10 2000-03-28 Cabletron Systems, Inc. Method and apparatus for fair and efficient scheduling of variable-size data packets in an input-buffered multipoint switch
US20030227932A1 (en) * 2002-06-10 2003-12-11 Velio Communications, Inc. Weighted fair share scheduler for large input-buffered high-speed cross-point packet/cell switches

Family Cites Families (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0522224B1 (en) 1991-07-10 1998-10-21 International Business Machines Corporation High speed buffer management
US5355370A (en) * 1992-07-02 1994-10-11 The Grass Valley Group, Inc. Crosspoint matrix
US5367520A (en) 1992-11-25 1994-11-22 Bell Communcations Research, Inc. Method and system for routing cells in an ATM switch
US5500858A (en) 1994-12-20 1996-03-19 The Regents Of The University Of California Method and apparatus for scheduling cells in an input-queued switch
US6212182B1 (en) 1996-06-27 2001-04-03 Cisco Technology, Inc. Combined unicast and multicast scheduling
US6442172B1 (en) 1996-07-11 2002-08-27 Alcatel Internetworking, Inc. Input buffering and queue status-based output control for a digital traffic switch
US6738381B1 (en) * 1997-12-19 2004-05-18 Telefonaktiebolaget Lm Ericsson (Publ) ATM time stamped queuing
US6934253B2 (en) 1998-01-14 2005-08-23 Alcatel ATM switch with rate-limiting congestion control
US6351466B1 (en) 1998-05-01 2002-02-26 Hewlett-Packard Company Switching systems and methods of operation of switching systems
JP3711752B2 (en) * 1998-07-09 2005-11-02 株式会社日立製作所 Packet communication device
EP1053610B1 (en) * 1998-12-01 2007-08-15 Samsung Electronics Co., Ltd. Mobile communication system having atm-based connecting scheme
GB9828144D0 (en) * 1998-12-22 1999-02-17 Power X Limited Data switching apparatus
US6956818B1 (en) * 2000-02-23 2005-10-18 Sun Microsystems, Inc. Method and apparatus for dynamic class-based packet scheduling
ATE331369T1 (en) 2000-03-06 2006-07-15 Ibm SWITCHING DEVICE AND METHOD
US7039058B2 (en) * 2000-09-21 2006-05-02 Avici Systems, Inc. Switched interconnection network with increased bandwidth and port count
DE60119866T2 (en) 2000-09-27 2007-05-10 International Business Machines Corp. Switching device and method with separate output buffers
US7310353B1 (en) * 2000-10-30 2007-12-18 Yair Bourlas Compression of overhead in layered data communication links
US6996116B2 (en) 2000-11-22 2006-02-07 International Business Machines Corporation Switching nodes and interface modules for data networks
US7197540B2 (en) 2001-03-09 2007-03-27 International Business Machines Corporation Control logic implementation for a non-blocking switch network
US20030021230A1 (en) 2001-03-09 2003-01-30 Petaswitch Solutions, Inc. Switch fabric with bandwidth efficient flow control
US7106738B2 (en) 2001-04-06 2006-09-12 Erlang Technologies, Inc. Method and apparatus for high speed packet switching using train packet queuing and providing high scalability
US20030056073A1 (en) 2001-09-18 2003-03-20 Terachip, Inc. Queue management method and system for a shared memory switch
US8213322B2 (en) 2001-09-24 2012-07-03 Topside Research, Llc Dynamically distributed weighted fair queuing
US7046660B2 (en) 2001-10-03 2006-05-16 Internet Machines Corp. Switching apparatus for high speed channels using multiple parallel lower speed channels while maintaining data rate
US7362751B2 (en) 2001-10-03 2008-04-22 Topside Research, Llc Variable length switch fabric
US20030088694A1 (en) 2001-11-02 2003-05-08 Internet Machines Corporation Multicasting method and switch
US7203203B2 (en) 2001-12-05 2007-04-10 Internet Machines Corp. Message ring in a switching network
US7206308B2 (en) 2001-12-22 2007-04-17 International Business Machines Corporation Method of providing a non-blocking routing network
KR100440574B1 (en) 2001-12-26 2004-07-21 한국전자통신연구원 Variable Length Packet Switch
US8432927B2 (en) 2001-12-31 2013-04-30 Stmicroelectronics Ltd. Scalable two-stage virtual output queuing switch and method of operation
US7277425B1 (en) * 2002-10-21 2007-10-02 Force10 Networks, Inc. High-speed router switching architecture
KR100488478B1 (en) 2002-10-31 2005-05-11 서승우 Multiple Input/Output-Queued Switch
US7483980B2 (en) * 2002-11-07 2009-01-27 Hewlett-Packard Development Company, L.P. Method and system for managing connections in a computer network
US7349416B2 (en) 2002-11-26 2008-03-25 Cisco Technology, Inc. Apparatus and method for distributing buffer status information in a switching fabric
US7003597B2 (en) * 2003-07-09 2006-02-21 International Business Machines Corporation Dynamic reallocation of data stored in buffers based on packet size
JP3757286B2 (en) * 2003-07-09 2006-03-22 独立行政法人情報通信研究機構 Optical packet buffering apparatus and buffering method thereof
US20060031506A1 (en) * 2004-04-30 2006-02-09 Sun Microsystems, Inc. System and method for evaluating policies for network load balancing

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6044061A (en) * 1998-03-10 2000-03-28 Cabletron Systems, Inc. Method and apparatus for fair and efficient scheduling of variable-size data packets in an input-buffered multipoint switch
US20030227932A1 (en) * 2002-06-10 2003-12-11 Velio Communications, Inc. Weighted fair share scheduler for large input-buffered high-speed cross-point packet/cell switches

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014502077A (en) * 2010-10-28 2014-01-23 コンパス・エレクトロ−オプティカル・システムズ・リミテッド Router and switch architecture
US9363173B2 (en) 2010-10-28 2016-06-07 Compass Electro Optical Systems Ltd. Router and switch architecture
CN111614752A (en) * 2020-05-19 2020-09-01 北京百度网讯科技有限公司 Method and device for data transmission

Also Published As

Publication number Publication date
US20060018329A1 (en) 2006-01-26
ATE420516T1 (en) 2009-01-15
EP1779607B1 (en) 2009-01-07
GB0424265D0 (en) 2004-12-01
EP1779607A1 (en) 2007-05-02
GB2414625A (en) 2005-11-30
DE602005012278D1 (en) 2009-02-26
GB2414625B (en) 2006-06-28
US7742486B2 (en) 2010-06-22

Similar Documents

Publication Publication Date Title
EP1779607B1 (en) Network interconnect crosspoint switching architecture and method
US7023841B2 (en) Three-stage switch fabric with buffered crossbar devices
US7161906B2 (en) Three-stage switch fabric with input device features
US6510138B1 (en) Network switch with head of line input buffer queue clearing
US5790545A (en) Efficient output-request packet switch and method
EP1949622B1 (en) Method and system to reduce interconnect latency
US7009985B2 (en) Fibre channel arbitrated loop bufferless switch circuitry to increase bandwidth without significant increase in cost
US9094327B2 (en) Prioritization and preemption of data frames over a switching fabric
EP0981878B1 (en) Fair and efficient scheduling of variable-size data packets in an input-buffered multipoint switch
US6487171B1 (en) Crossbar switching matrix with broadcast buffering
US6754222B1 (en) Packet switching apparatus and method in data network
EP1854254B1 (en) A method of and a system for controlling access to a shared resource
US20030107996A1 (en) Fibre channel arbitrated loop bufferless switch circuitry to increase bandwidth without significant increase in cost
US20070121499A1 (en) Method of and system for physically distributed, logically shared, and data slice-synchronized shared memory switching
US6574232B1 (en) Crossbar switch utilizing broadcast buffer and associated broadcast buffer management unit
KR20010082335A (en) Data switching method and apparatus
WO2001067691A1 (en) NxN CROSSBAR PACKET SWITCH
US6345040B1 (en) Scalable scheduled cell switch and method for switching
US5742597A (en) Method and device for multipoint switching and arbitration in output-request packet switch
EP1521411B1 (en) Method and apparatus for request/grant priority scheduling
Chrysos Design issues of variable-packet-size, multiple-priority buffered crossbars
JP3880890B2 (en) Cell switch and cell replacement method

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU LV MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

DPEN Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed from 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2005772308

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 2005772308

Country of ref document: EP