WO1995030315A1

WO1995030315A1 - Design of a fault-tolerant self-routing crossbar

Info

Publication number: WO1995030315A1
Application number: PCT/US1995/005295
Authority: WO
Inventors: Aloke Guha
Original assignee: Honeywell Inc.
Priority date: 1994-04-29
Filing date: 1995-04-28
Publication date: 1995-11-09

Abstract

A fault-tolerant self-routing crossbar router that can be used to provide parallel non-blocking connections in a fault-tolerant manner while requiring non-centralized control. The fault-tolerant design uses a regular switch array structure and uses augmented link connections between switches to avoid fault locations when routing connections. Switch failures are predicted and a parallel fault isolation mode is invoked to isolate locations of faulty switches. A parallel fault location algorithm is used by the connected processors to detect faulty switches and then to reconfigure the crossbar for continued use, so that over the long term and under multiple faults the crossbar degrades gracefully. A routing algorithm is presented so that even with modifications to the switching elements, self-routing can be achieved on reconfiguration of the router. An alternative embodiment is described in which a concurrent reconfiguration and routing strategy can be used, eliminating the need for a separate reconfiguration phase but increasing the complexity of the switches. Successful reconfigurability in the preferred embodiment merely requires that less than 20% of the switches are faulty and that no two faulty switches are adjacent.

Description

DESIGN OF A FAULT-TOLERANT SELF-ROUTING CROSSBAR

BACKGROUND OF THE INVENTION

Field of the Invention

The present invention relates to self-routing switching matrices and fault tolerant switching systems particularly suited for communications and computer systems.

Prior Art

With the increased need for high-bandwidth communications for distributed computing and multimedia communications, high-speed switching has gained renewed significance. Such standards as HIPPI, Fiber Channel and ATM switching (M. Friedman, On the Road to Gbit Nets, Electronic Engineering Times, Nov. 4, 1991, at 79) have promoted the use of non-blocking switch fabrics. While many switch designs have been proposed or are in evaluation, there are few known efforts for fault-tolerant non-blocking switch designs.

Of special interest is the class of self-routing switches. Self-routing switch designs are inherently scaleable and do not rely on centralized controllers that fail to scale in performance as one attempts to increase the total bandwidth of the switch (either by increasing the switch size or the line speeds of data in each switch channel). The crossbar switching fabric is a well known example of these. Charles Clos, A study of Non-blocking Switching Networks. Bell Systems Technical Journal, March 1953 at 406.

The crossbar network uses a simple dual input/dual output switch wherein either input may be switched to either output. While simple and scaleable, it has suffered from an inability to deal well with faults in its switches, usually requiring the replacement of the entire fabric in the event of a single faulty switch. The proposed connection scheme uses diagonal connections between switches and is different from other reconfigurable array research, where only orthogonal non-adjacent connections are added. However, these previous works are also in a different category since they are not directed towards self-routing. J.A.B. Fortes and C.S. Raghavendra, Gracefully Degradable Processor Arrays. IEEE Trans. Computers, Vol. C-34, Nov. 1985, pp. 1033-1044. M.G. Sami and R. Stefanelli, Reconfigurable Architectures for VLSI Processing Arrays. Proc. IEEE, Vol. 74, May 1986, pp. 712-722. The proposed fault-tolerant self-routing crossbar design is based on past work on high-speed routing utilizing crossbar networks and disclosed in U.S. Patent 5,218,198, international application PCT/US92/06651, international application PCT/US92/06653 and U.S. Patent 5,319,639.

SUMMARY OF THE INVENTION

The crossbar switch is intended to be used as a key component of a fault-tolerant network router that incorporates multiple levels of redundancy and fault tolerance features. The router can be used in a dual redundant fashion where two crossbar switches are used in parallel (Figure 1). Any failures detected in a single crossbar will allow a "switch-over" to the secondary crossbar.

The switches within the crossbar are designed with four inputs and an equal number of outputs. Two of each of the throughputs (an input / output pair) are used as the primary routes for data in the manner of well known crossbars and the extra throughputs are used as detours in the event that an adjacent switch has been detected to be faulty. Thus, each crossbar is built such that it can tolerate multiple faults after fault detection and isolation has occurred. Because of the presence of the dual routing crossbars in the fault-tolerant router, on failure detection, the router can be reconfigured to use the second crossbar switch and continue operation, while the fault detection and isolation is initiated in the first. This crossbar design can be used to construct larger scaleable non-blocking networks whose designs are well known such as Clos, Delta, and Omega Networks. Unlike previous inventions this design provides for a self-routing non-blocking fault-tolerant crossbar switch which can isolate individual faulty switches within the crossbar, and one which is able to route around faulty switches in the crossbar switch. The crossbar itself can withstand a faulty switch without interruption to the network, and increase the overall lifespan of the crossbar, which will not fail en masse but rather will decay gracefully, i.e. withstand multiple faults before becoming useless. Multiple crossbars can also be employed redundantly adding further tolerance and transmission benefits.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 illustrates a Dual Redundant Router Scheme.

Figure 2 illustrates two representations of an NxN crossbar using 2x2 switches.

Figure 3 illustrates a scheme for labeling an NxM crossbar.

Figure 4 illustrates the basic pass and exchange modes of a 2x2 switch.

Figure 5 illustrates a Fault tolerant Crossbar Switch Design.

Figure 6 illustrates fault tolerant routing.

Figure 7 illustrates reconfiguration of connections for separate reconfiguration and rerouting stages.

Figure 8 illustrates a 4x4 switch with primary and secondary I/O pairs.

Figure 9 illustrates the 6 possible routing configurations of a 4x4 switch.

Figure 10 illustrates the addressing of the 6 necessary states of the crossbar switch.

Figure 11 illustrates one embodiment of the 4 x 4 switch logic.

DETAILED DESCRIPTION

The structure of the self-routing crossbar that we propose as the basis of our fault-tolerant design and its equivalent mesh-like representation is shown in Figure 2(i) and (ii). The crossbar is composed of a regular array of 2 x 2 switching elements that can be in one of two states, exchange or pass (see Figs. 5(i) and 5(ii)).

A number of different self-routing mechanisms, especially for scaleable routing designs, have been derived, and need not be covered here. See for example U.S. Patent 5,319,639 and international application PCT/US92/06651.

Fault Detection and Location

Switch Fault Model

The faults expected in the crossbar are those that can occur in the switches or their links. Each switch has two modes of operation: pass and exchange (Figure 4).

Functionally, the failures that can occur in the switch, whether internal to the switch or in its input or output links, are incorrect modes of operation or failure to forward a message. Thus, functionally, failures concerning the switches and their connections can be classified as those that cause i) the switch to select the wrong mode of operation, i.e., pass instead of exchange or exchange instead of pass, ii) unintended broadcast (bridging fault), and iii) no message to be forwarded. Such a functional fault model avoids detailed modeling of the cause of faults within the switching element. Note that we are not concerned with data corruption in the message packet, due to hard faults at the input or output links, which can be easily detected by standard encoding techniques, such as block coding techniques, within the message.

Detecting Switch Failures

There are two methods of detecting failures in the fault-tolerant crossbar design. Both are based on a reliable end-to-end data transfer verified by acknowledgment. Thus, correct transfer of a packet or message, which includes header information, can be detected either by ensuring that every packet (or group of packets in a serial packet message) is verified by acknowledgment from the destination node, or by requiring the destination node to check that the header information, i.e., header address data, matches the address of the destination.

Examination of Figure 2 will show that any switch failure or failures that result in incorrect routing, based on the header-based self-routing algorithm described earlier, can have two effects; either the message is never routed to any destination node, or the message reaches the wrong destination address.

In the first case, no acknowledgment will be received by the source node despite repeated transmissions. Thus, after a predetermined period of time, following repeated failures in receipt of acknowledgment, the source nodes will initiate fault detection and location procedures. Note that a very quick transmit-acknowledgment sequence can be used in this crossbar switch since all non-blocking connections can occur in deterministic time (O(N) switch delays, where N is the number of inputs to the crossbar). The second case of an incorrectly transmitted message can be detected by a simple check at the receiving node. Each destination node compares the header information bits with its own address. A mismatch would indicate that the header was routed incorrectly due to failures within the crossbar.
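The two detection cases can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of the invention: the names `DeliveryReport`, `classify_delivery` and `MAX_RETRIES` are hypothetical, and the retry limit of 3 is an assumed value, since the text says only that the period is "predetermined".

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_RETRIES = 3  # assumed value; the text only says "predetermined"


@dataclass
class DeliveryReport:
    """Outcome of one transmit attempt, as seen by the end nodes."""
    acknowledged: bool               # did the source receive an acknowledgment?
    header_address: Optional[int]    # address carried in the received header
    receiver_address: Optional[int]  # address of the node that received it


def classify_delivery(attempts: List[DeliveryReport]) -> str:
    """Return 'ok', 'misrouted', 'no_delivery', or 'retry'."""
    for report in attempts:
        # Second case: the header arrived, but at the wrong destination node.
        if (report.header_address is not None
                and report.header_address != report.receiver_address):
            return "misrouted"
        if report.acknowledged:
            return "ok"
    # First case: repeated transmissions were never acknowledged.
    if len(attempts) >= MAX_RETRIES:
        return "no_delivery"
    return "retry"
```

Either a "misrouted" or a "no_delivery" result would be the trigger for the fault location procedure; no logic is added inside the switches themselves.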

This scheme of switch fault detection can be accomplished in either case without added fault detection hardware. Rather, the routing properties and the protocol for message acknowledgment used in the crossbar handle it. This is preferable to adding self-testing logic to the switch: because the switch is relatively small, such logic would prohibitively increase the hardware overhead. Also, additional fault-detecting hardware would increase the probability of failures in the larger switch (unless separate highly reliable circuitry is used), compounding the problem of fault detection and isolation.

Fault Location

The proposed scheme for fault-tolerant operation of the crossbar relies on the following sequence of steps:

i) fault detection: using repeated transmissions by the source and address matches by the destination nodes;

ii) switch-over to redundant crossbar: the source node that detects failures transmits "switch-over" requests over a secondary (redundant) crossbar, and all other nodes either curtail current transmission or wait until completion of the current transmissions to switch over to the secondary crossbar;

iii) parallel tests for fault isolation: while the secondary crossbar is used, all sources run tests in parallel on the primary faulty crossbar to determine the location of the faulty switch;

iv) reconfiguration of the primary crossbar: the primary crossbar is reconfigured (Section 4) through use of its redundant connections (Figure 5).

The secondary crossbar is therefore used until a failure is detected, at which time a "switch-over" to the reconfigured primary is done. Thus, each crossbar acts as a reliable backup for the other. While faults are detected in one, detection and reconfiguration is concurrent with the operation of the other.

A Parallel Fault Location Scheme

The testing algorithm, referred to herein as the FTX algorithm, can be described as follows.

Algorithm FTX (Failure Location Algorithm):

Label the N² switches of the crossbar, as shown in the N×N crossbar of Figure 1, according to the input source (row) number and the output destination (column) number. Each switch that is a crosspoint of source i and destination j will be referred to as the (i, j) switch.

Each source node S_i, 1 ≤ i ≤ N, sends N test message headers in parallel to the destination nodes using the following addressing scheme.

during test 1: source S_i sends a message to destination D_i, 1 ≤ i ≤ N,

during test 2: source S_i sends a message to destination D_j, j = [i mod N] + 1,

during test n: source S_i sends a message to destination D_j, j = [(i + n - 2) mod N] + 1, 1 ≤ n ≤ N,

during test N: source S_i sends a message to destination D_j, j = [(i + N - 2) mod N] + 1.

Any non-overlapping test sequence can be used by the source nodes in parallel, i.e., node i can send message headers to destination [(i + k + j) mod N] + 1 in the j-th test, where 1 ≤ k ≤ N.
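The addressing scheme of the tests can be checked with a short Python sketch. It assumes the destination formula j = [(i + n - 2) mod N] + 1 as read from the text; the function names `ftx_destination`, `ftx_schedule` and `is_non_overlapping` are hypothetical and chosen for illustration only.

```python
def ftx_destination(i: int, n: int, size: int) -> int:
    """Destination column for source i during test n (both 1-based)."""
    return ((i + n - 2) % size) + 1


def ftx_schedule(size: int) -> list:
    """schedule[n-1][i-1] is the destination of source i in test n."""
    return [[ftx_destination(i, n, size) for i in range(1, size + 1)]
            for n in range(1, size + 1)]


def is_non_overlapping(schedule: list) -> bool:
    """True if every test maps the sources onto distinct destinations."""
    size = len(schedule)
    return all(sorted(test) == list(range(1, size + 1)) for test in schedule)


schedule = ftx_schedule(4)
# Test 1 sends each source to its own column; test 2 shifts by one column.
print(schedule[0])                   # [1, 2, 3, 4]
print(schedule[1])                   # [2, 3, 4, 1]
print(is_non_overlapping(schedule))  # True
```

Because each test is a permutation of the destinations, all N sources can transmit at once without contending for an output, and over the N tests every source reaches every destination exactly once.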

Each node i creates a linear failure vector F_i[1:N], where F_i[j] ∈ {0, 1}, 1 ≤ j ≤ N, is determined as follows.

F_i[j] = 0 if the j-th test is passed, i.e., the message header was received by the j-th destination, or,

F_i[j] = 1 if the j-th test fails or the message is not acknowledged.

If a failure vector F_i has multiple 1's, for example, [00...01111000...0], then the location of a faulty switch is (i, l), where l = Min { j | F_i[j] = 1 }.
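The multiple-1's rule just stated can be written as a small Python function. This is a sketch under the assumption that position j of the vector holds F_i[j] as defined above; the name `locate_from_multiple_ones` is hypothetical, and the single-1 case is deliberately not handled here.

```python
from typing import List, Optional, Tuple


def locate_from_multiple_ones(i: int,
                              failure_vector: List[int]) -> Optional[Tuple[int, int]]:
    """Apply the multiple-1's rule to the failure vector of source i.

    failure_vector[j-1] holds F_i[j]. Returns (i, l), with l the smallest
    index j such that F_i[j] = 1, or None when fewer than two tests failed
    and the rule therefore does not apply.
    """
    failed = [j for j, bit in enumerate(failure_vector, start=1) if bit == 1]
    if len(failed) < 2:
        return None
    return (i, min(failed))


print(locate_from_multiple_ones(3, [0, 0, 1, 1, 1, 0]))  # (3, 3)
```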

If a failure vector F_i has only a single 1 at j, i.e., F_i[j] = 1 and F_i[k] = 0 for k < j, then examine F_k for k < i. The location of the faulty switch is (i, l), where l = Min { j | F_k[j] = 1, k < i }.

Reconfiguration and Fault-Tolerant Routing

The parallel nature of the crossbar style routing switch allows for inherent fault tolerance. The presence of multiple paths from any input link (source) to any output link (destination) implies inherently fault-tolerant routing. To create fault-tolerant routing using the self-routing algorithm, the normal crossbar array topology is augmented with extra links: each switch has four input and output connections as opposed to two (Figure 5). These augmented connections are necessary for providing self-routing under conditions of switch failures.

Two augmented diagonal links (shown by bold lines in Figure 5) are used to avoid a faulty switch during routing. The routing through the two links is shown in Figure 6. In the following description, switch (i, j) refers to the blackened switch element. When a horizontal path is required from switch (i, j-1) to (i, j+1) to reroute to destination k, k > j, then the diagonal link from switch (i, j-1) to switch (i+1, j) is used (Figure 6(i)) to proceed to switch (i+1, j+1) or to switch (i+2, j). When a vertical path is required from switch (i-1, j) to switch (i+1, j), then the diagonal link from switch (i-1, j) to switch (i, j+1) is used to proceed to switch (i+1, j), also using the diagonal connection (Figure 6(ii)).

The fault-tolerant routing algorithm is based on the above rerouting principle. There are two approaches to reconfigured fault-tolerant routing. In one embodiment, the reconfiguration is done in a separate phase where the faulty switch is avoided by using augmented links to bypass the switch. After the reconfiguration, all switches adjacent to the faulty switch use their new connections to automatically self-route their messages. In an alternate embodiment (Concurrent Reconfiguration and Rerouting), there is no separate reconfiguration phase; instead, the headers of the self-routing messages are modified to reroute the messages through the network. The preferred embodiment utilizes the separate reconfiguration and routing approach since it is expected to be simpler in design.

Reconfiguration Strategy

As evident from Figure 5, the fault-tolerant routing of messages traveling from west to east (left to right), the horizontal path, is achieved by using the southeast diagonal link of the switch preceding the faulty switch in the row (Figure 6(i)). When messages are traveling from north to south (top to bottom), the vertical path, rerouting is achieved by using the southeast diagonal link of the switch preceding the faulty switch in the column and the southwest link of the switch immediately to the right of the faulty switch (Figure 6(ii)). Since all messages are always normally routed left to right and top to bottom, this basic reconfigured routing scheme is the minimum required to avoid the faulty switch. The switches whose connections must be reconfigured are shown in Figure 7. If switch (i, j) is faulty, then switches (i-1, j) and (i, j-1) must reconfigure their output connections. In the case of switch (i, j-1), the horizontal output is shorted (connected) to the southeast output link to the corresponding northwest input of switch (i+1, j), while for switch (i-1, j), the vertical output is shorted to the southeast output link to the corresponding northwest input of switch (i, j+1), which in turn is shorted to the northwest input of switch (i+1, j). Both cases are shown by bold lines in Figure 7.

Routing after Reconfiguration

After the switches are reconfigured as in Figure 7, the individual switch elements must be reprogrammed to route all connections. Because each switch can have potentially four inputs in parallel, the switch can have multiple states. In theory, the switch can be in any of 24 (= 4!, where "!" represents the factorial function) states corresponding to any of the four inputs or ports connected to any combination of the four outputs. For the purpose of description, we have labeled the four inputs as I1, ..., I4 and the corresponding four outputs O1, ..., O4 (see Figure 8).

Any state of the switch will be described by the 4-tuple {s1, s2, s3, s4}, where s_i ∈ {1, 2, 3, 4} and s_i ≠ s_j when i ≠ j. Thus, the 4-tuple 2134 represents the switch state where I1 is connected to O2, I2 is connected to O1, while I3 and I4 are connected to O3 and O4, respectively.
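The 4-tuple notation maps directly onto permutations of the digits 1 to 4. The following Python sketch, with the hypothetical helper names `is_valid_state` and `connections`, enumerates the 24 theoretical states and decodes the example state 2134.

```python
from itertools import permutations

# Every ordering of the four output labels is one theoretical switch state.
ALL_STATES = ["".join(map(str, p)) for p in permutations((1, 2, 3, 4))]


def is_valid_state(state: str) -> bool:
    """A state is valid when it is a permutation of the digits 1..4."""
    return sorted(state) == ["1", "2", "3", "4"]


def connections(state: str) -> dict:
    """Map each input label to the output label it is connected to."""
    if not is_valid_state(state):
        raise ValueError(f"not a permutation of 1234: {state!r}")
    return {f"I{k}": f"O{s}" for k, s in enumerate(state, start=1)}


print(len(ALL_STATES))      # 24
print(connections("2134"))  # {'I1': 'O2', 'I2': 'O1', 'I3': 'O3', 'I4': 'O4'}
```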

Although any switch can have 24 states, only a few are valid configurations for satisfying the self-routing property. For the desired routing scheme, based on the original routing scheme of routing by destination and source address headers, a minimum of six switch port connections will be sufficient. The two modes of normal fault-free operation are shown in Figure 9(a). In these modes, the only possible sources of messages to the switch are the normal orthogonal links, i.e., the inputs I1 and I2 only (see Figure 8). The four remaining modes are those necessary to handle routing when a switch failure is detected, and when messages are possible from the augmented inputs I3 and I4 as well. Note that these port connections are not unique; many sets of connection modes (states) can support self-routing under both faulty and fault-free switch conditions.
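One way to read the six port connections is as the pass-through state 1234 plus one state per exchanged input pair. The Python sketch below takes the five pairs named in the discussion of the truth table of Figure 10 (1 and 2, 1 and 3, 2 and 3, 2 and 4, 3 and 4) and assumes that exactly one pair is exchanged at a time. Since Figure 9 is not reproduced here, the resulting list is an interpretation rather than a transcription of the figure, and the names `EXCHANGE_PAIRS`, `exchange` and `candidate_states` are hypothetical.

```python
# Input pairs that may exchange their outputs, as listed in the description.
EXCHANGE_PAIRS = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]


def exchange(state: str, a: int, b: int) -> str:
    """Swap the outputs assigned to inputs a and b (1-based)."""
    s = list(state)
    s[a - 1], s[b - 1] = s[b - 1], s[a - 1]
    return "".join(s)


def candidate_states() -> list:
    """Pass-through state plus one state per single exchanged pair."""
    return ["1234"] + [exchange("1234", a, b) for a, b in EXCHANGE_PAIRS]


print(candidate_states())
# ['1234', '2134', '3214', '1324', '1432', '1243']
```

Under this reading, the first two entries are the ordinary pass and exchange modes of a 2x2 switch, and the remaining four involve at least one augmented port.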

To distinguish the six states of the switch, a simple logic table indicates the conditions under which each switch state will be set (Figure 10). The condition D_i ?= L refers to the check made when the message is first detected as having arrived at input I_i. The result of the comparison D_i ?= L is required to test whether D_i = L or D_i ≠ L, i.e., whether the destination address of the header matches the internal label L of the switch element. Note that it is assumed that the first D_i ?= L comparison (i = 1, 2, 3, 4) sets the switch state so that the later arrival of messages on the other ports does not reset the switch. Additionally, the Turn bits (refer to the self-routing algorithm in Section 2) in the message headers from inputs I1 and I3 are also checked. These 6 inputs are sufficient but not necessary to determine the switch port connections.

Based on the switch logic table (Figure 10) and the switch port connections for each desired state of the switch, the construction of the switch becomes clear to the skilled mechanic. Figure 11 illustrates one possible embodiment of the switch. Essentially, detection of the conditions that determine the primary port connections, for example, I1 → O2 and I2 → O1, will, under fault-free conditions, imply switching inputs 1 and 2 to outputs 2 and 1. As the truth table of Figure 10 shows, the only pairs of inputs that exchange their outputs are: 1 and 2, 1 and 3, 2 and 3, 2 and 4, and 3 and 4. Thus, detecting the correct input conditions will decide whether these inputs must be exchanged or not. Thus, the implementation shown requires five 2x2 pass/exchange switches, whose outputs are merged at the output of the complete switch. Note this is only a functional implementation, where we have assumed that normally the 2x2 switches are in the pass mode but when any switch is set into the exchange mode, all other switches are latched into the pass mode. Therefore, only one switch is set into the exchange mode at any time. In case of simultaneous message arrivals with conflicting port connection requests, a random tie-break will resolve the conflicts. We have not addressed the race problem issues that arise for setting the switch on arrival of the first message on any of D1, D2, D3, or D4 but assume that they can be handled by good engineering design. However, without loss of generality, the conflict resolution principle has been to give messages on orthogonal inputs (I1 and I2) higher priority than those on the diagonal inputs (I3 and I4). This priority scheme can be easily reversed. We note that at the last row, the output row, all messages arriving at outputs 3 and 4 are recognized as not "routable" due to conflict.

Fault Coverage and Conditions for Reconfigurability

The parallel fault location algorithm FTX in Section 3.4 can detect any number of faults less than N² for an N×N crossbar. However, successful reconfiguration and routing on detection of a fault or multiple faults within the crossbar requires that the switches immediately encompassing the faulty switch are fault-free. By examining the fault-tolerant routing paths shown in Figure 5, we can conclude that for any faulty switch (i, j), the set of switches S(i, j) that must be fault-free are the ones that are adjacent to (i, j), or S(i, j) = {(i-1, j), (i, j-1), (i, j+1), (i+1, j)}.

Thus, for every faulty switch in the array, its adjacent 4 switches must be fault-free, though corner switches can be circumvented because of connections available from the inputs (see Figure 5). Thus, on average, reconfigurability demands that the ratio R of the faulty switches to the total number of switches must be less than 20%.
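The adjacency condition lends itself to a direct check. The following Python sketch tests whether every faulty switch has fault-free orthogonal neighbours S(i, j) and computes the ratio R. It is an illustration only: the names `neighbours`, `is_reconfigurable` and `fault_ratio` are hypothetical, and the relaxation for corner switches mentioned above is not modeled. The 20% figure can be read as one faulty switch per group of five (the faulty switch plus its four fault-free neighbours) when no neighbours are shared; that reading is an assumption.

```python
from typing import Iterable, Set, Tuple

Switch = Tuple[int, int]  # (row i, column j), both 1-based


def neighbours(switch: Switch, rows: int, cols: int) -> Set[Switch]:
    """Orthogonally adjacent switches S(i, j) that lie inside the array."""
    i, j = switch
    candidates = {(i - 1, j), (i, j - 1), (i, j + 1), (i + 1, j)}
    return {(r, c) for r, c in candidates if 1 <= r <= rows and 1 <= c <= cols}


def is_reconfigurable(faulty: Iterable[Switch], rows: int, cols: int) -> bool:
    """True if no faulty switch has a faulty orthogonal neighbour."""
    faulty_set = set(faulty)
    return all(not (neighbours(s, rows, cols) & faulty_set) for s in faulty_set)


def fault_ratio(faulty: Iterable[Switch], rows: int, cols: int) -> float:
    """Ratio R of faulty switches to all switches in the array."""
    return len(set(faulty)) / (rows * cols)


print(is_reconfigurable([(2, 2), (4, 4)], 5, 5))  # True
print(is_reconfigurable([(2, 2), (2, 3)], 5, 5))  # False
```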

Increasing R will require connectivity between non-adjacent switches.

Concurrent Reconfiguration and Routing

It is possible that the router can be designed to not have a separate reconfiguration phase, but rather have concurrent reconfiguration and routing. In such a case, all headers must always have information on the diagnostic status of the array.

The following algorithm RTX can achieve simultaneous reconfiguration and routing.

Algorithm RTX (Fault-Tolerant Routing Algorithm):

Label the N² switches of the crossbar by the column numbers, 1 through N, as shown in Figure 2. Every switch in column i, i ≤ N, contains a label L = i. Before any connections are made, all switches are latched in the state 1234.

Any message to be sent from source i to destination j is preceded by a routing header composed of three parts: a 2N-bit binary status vector S_i, the destination address D (= j), and a Turn signal (single bit) which is initialized to 1, where:

S_i = S_H | S_V, where S_H and S_V are each N-bit binary vectors and | indicates concatenation;

S_H[k] = 1 if switch (i, k) has been determined to be faulty, else S_H[k] = 0; S_V[l] = 1 if switch (j, l) has been determined to be faulty, else S_V[l] = 0, where 1 ≤ k, l ≤ N.

When a header traveling in the horizontal path arrives at a switch, i.e., when the message header enters a switch at input I1, the following sequence occurs:

In the horizontal path, from switch (i, 1) through (i, j) (at worst, the message will go up to (i, j)), every switch (i, k), k ≤ j, compares S_H with its mask vector M_k, an N-bit binary vector with M_k[l] = 1 if l = k+1 is a faulty switch, and M_k[l] = 0 otherwise. If M_k = S_H, then the header is routed out to the right diagonal output O3 to switch (i+1, k+1) (see Figure 6(i)); else the header is routed via output O1 to switch (i, k+1). The address part of the header, D (= j), is compared with the label L (= k) in switch (i, k).

If D = L and Turn = 1, the switch mode and the Turn signal are reset (low), and the header is routed to the immediate lower switch (i+1, k). Else, if D ≠ L and Turn = 1 (high), the header is routed to the right switch (i, k+1).
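The address comparison and Turn rule can be traced in Python for the fault-free case. The sketch below ignores the mask-vector comparison and the diagonal outputs entirely, and it assumes that a header enters row i at column 1 and, after turning, continues straight down the destination column to row N. The names `horizontal_step` and `fault_free_path` are hypothetical.

```python
from typing import List, Tuple


def horizontal_step(dest: int, label: int, turn: int) -> Tuple[str, int]:
    """Decide the next hop for a header on the horizontal path.

    Returns (direction, new_turn): the header turns down and clears Turn
    when the destination matches the column label; otherwise it moves right.
    """
    if dest == label and turn == 1:
        return ("down", 0)
    return ("right", turn)


def fault_free_path(source: int, dest: int, size: int) -> List[Tuple[int, int]]:
    """Switches visited from source row to destination column, no faults."""
    if not (1 <= source <= size and 1 <= dest <= size):
        raise ValueError("source and dest must lie in 1..size")
    row, col, turn = source, 1, 1
    path = [(row, col)]
    while turn == 1:
        direction, turn = horizontal_step(dest, col, turn)
        if direction == "right":
            col += 1
            path.append((row, col))
    # After the turn the header continues down the destination column.
    for r in range(row + 1, size + 1):
        path.append((r, col))
    return path


print(fault_free_path(2, 3, 4))
# [(2, 1), (2, 2), (2, 3), (3, 3), (4, 3)]
```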

When a header traveling in the vertical path arrives at a switch, i.e., when the message header enters a switch from input I2, the following sequence occurs:

In the vertical path, from switch (i, j) through (N, j), every switch (l, j), l < i, compares S_V with its mask vector M_l, an N-bit binary vector with M_l[p] = 1 if p = l+1, and M_l[p] = 0 otherwise.

If M_l = S_V, then the header is routed out to the right diagonal output O4 to switch (l+1, j+1) (see Figure 6(i)); else the header is routed via output O1 to switch (l+1, j).

The address part of the header, D (= j), is compared with the label L (= k) in switch (i, k).

If D < L, then the header is routed out to the left diagonal output O3 to switch (l-1, j-1) (see Figure 6(ii)); else the header is routed to the immediate lower switch (l, j+1) via output O2.

Multiple Module Redundancy

The crossbar switch can be employed in a variety of well known fashions such that a single set of inputs and outputs may transmit and receive data via a multiplicity of crossbar switches (Figure 1). These methods include both "cold" and "hot" backup systems. When a fault occurs in a "cold" backup system, the network ceases to use the current crossbar and begins to utilize the next redundant crossbar while isolating the faulty switch on the former crossbar. In a "hot" backup system, data is constantly being transmitted over all crossbars simultaneously so that if an error occurs in one, the network can use the other switches to perform error correction, avoid the necessity of retransmission, and isolate the faulty switch on the faulty crossbar.

The foregoing description is representative of one embodiment and many other embodiments will be obvious to one skilled in the art. This description in no way limits the scope of our invention. The only limitations are expressed in the following claims.

Claims

1. A non-blocking, self-routing, fault-tolerant crossbar matrix comprising a set of identical n x n switches, wherein n is greater than two and represents the number of inputs and outputs of the switch, said matrix comprising: an array of switches arranged into an orthogonal M x N, logically planar matrix said matrix having M inputs and N outputs, each switch connected to each horizontally and vertically adjacent switch by a primary I/O pair; each switch also connected to at least one other switch in said matrix; and said array accessing data from a plurality of inputs and transmitting data to a plurality of outputs, any output being accessible to any input.

2. An apparatus according to claim 1 wherein n is 4 and each switch is connected to each diagonally adjacent switch by a secondary I/O pair such that there are two sets of secondary I/O pairs for each switch, as well as two sets of primary I/O pairs for each switch.

3. An apparatus according to claim 2 wherein the secondary I/O pairs are employed by the array to route data around faulty switches.

4. A self-routing 4x4 switch comprised of: two primary input gates, two primary output gates, two secondary input gates and two secondary output gates so arranged and disposed that; each primary input is routable to any of the four outputs; and each secondary input is also routable to any of the four outputs.

5. An apparatus according to claim 4 wherein: each input is routable to an explicit set of outputs under an explicit set of conditions.

6. An apparatus according to claim 5 wherein: a first primary input is routable to either primary output or to a first secondary output; a second primary input is routable to a second primary output or to said first secondary output; a first secondary input is routable to any primary or secondary output; and a second secondary input is routable to said second primary output.

7. A self-routing, non-blocking, fault tolerant crossbar matrix utilizing 4x4 switches in a logical plane wherein each switch is connected to every other adjacent switch by one-way I/O pairs comprising: primary input gates capable of passing data to either primary output gate under normal conditions; said primary gates being capable of passing data to a secondary gate in the event that an adjacent destination switch, connected to the primary output line, is faulty; said secondary input gates being able to receive data from the secondary output of a diagonally adjacent switch in the event of a faulty adjacent switch, from which data would normally be received at a primary input gate; and said secondary gates being capable of passing data to either primary or secondary outputs in accordance with predetermined logical rules for determining the chosen output.

8. A method of identifying a faulty switch in a self-routing, non-blocking, fault tolerant crossbar matrix utilizing 4x4 switches in a logical plane wherein each switch is connected to every other adjacent switch by one-way I/O pairs comprising the steps of: transmitting a message through each input of said matrix to each output of said matrix; determining if said message inerrantly reaches said output; recording the set of messages not correctly received as a data set; and using said data set to locate said faulty switch.

9. A method according to claim 8 wherein the method is performed by transmitting data to all inputs simultaneously.

10. A method according to claim 8 wherein the method is performed by broadcasting data from a single input to multiple outputs simultaneously.

11. A method of rerouting data in a self-routing, non-blocking, fault tolerant crossbar matrix utilizing 4x4 switches in a logical plane wherein each switch is connected to every other adjacent switch by one-way I/O pairs comprising the steps of: identifying each faulty switch; in the switch immediately preceding a faulty switch in an original data path, rerouting said data from said switch's primary output to a secondary output thereby re-routing data along a substitute path; and transmitting said data along said substitute path until said data may be routed back onto said original data path.

12. A method according to claim 11 wherein rerouting may occur multiple times due to multiple faulty switches.

13. A method according to claim 12 wherein said matrix may remain operational until a substantial percentage of switches have failed.

14. A method according to claim 13 wherein the functionality of said matrix will degrade gracefully.

15. An n-way redundant set of matrices employing a multiplicity of matrices as set forth in claim 1.

16. An apparatus according to claim 15 wherein redundant matrices are cold backups.

17. An apparatus according to claim 15 wherein redundant matrices are hot backups.