WO2011097385A2 - Duo-dual tcam architecture for routing tables with incremental update


Info

Publication number
WO2011097385A2
Authority
WO
WIPO (PCT)
Prior art keywords
content addressable
prefixes
ternary content
addressable memory
prefix
Prior art date
Application number
PCT/US2011/023611
Other languages
French (fr)
Other versions
WO2011097385A3 (en)
Inventor
Sartaj Kumer Sahni
Tania Mishra
Original Assignee
University Of Florida Research Foundation, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University Of Florida Research Foundation, Inc. filed Critical University Of Florida Research Foundation, Inc.
Publication of WO2011097385A2 publication Critical patent/WO2011097385A2/en
Publication of WO2011097385A3 publication Critical patent/WO2011097385A3/en


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00: Routing or path finding of packets in data switching networks
    • H04L45/74: Address processing for routing
    • H04L45/745: Address table lookup; Address filtering
    • H04L45/7453: Address table lookup; Address filtering using hashing
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00: Routing or path finding of packets in data switching networks
    • H04L45/60: Router architectures

Definitions

  • the present invention generally relates to the field of content addressable memories, and more particularly relates to ternary content addressable memories (TCAMs).
  • Some conventional implementations of TCAMs generally do not use any memory management mechanism to keep track of the free slots in the TCAM. They instead rely either on a TCAM search operation to find an empty slot when one is needed, or on keeping all unused slots in a contiguous block and allocating from either end of this block when a free slot is needed. Since a TCAM usually cannot perform a data plane search concurrent with a control plane search, update operations that perform TCAM searches delay data plane lookups.
  • Because TCAM lookups consume a significant amount of energy relative to TCAM read/write operations, using lookups to locate free TCAM slots significantly increases the total energy consumption of updates.
  • Other conventional implementations of TCAMs that are directed at reducing the total TCAM power used to search routing tables of a given size unfortunately increase the total TCAM size needed relative to non-indexed TCAMs.
  • a method for managing router tables comprises classifying a set of prefixes in a plurality of router table prefixes as a set of leaf prefixes and the remaining set of prefixes in the plurality of router table prefixes as a set of internal prefixes.
  • a leaf prefix is not a prefix of another prefix in a router table.
  • the set of internal prefixes is stored in a first ternary content addressable memory.
  • the set of leaf prefixes is stored in a second ternary content addressable memory.
  • a corresponding destination hop is stored in a first random access memory for each internal prefix stored in the first ternary content addressable memory.
  • a corresponding destination hop is stored in a second random access memory for each leaf prefix stored in the second ternary content addressable memory.
  • a packet with at least one destination address is received.
  • a simultaneous lookup is performed in the first ternary content addressable memory and the second ternary content addressable memory to retrieve up to two index values using the destination address.
  • a next hop is retrieved from the second random access memory in response to the second ternary content addressable memory returning an index. The packet is then routed to the next hop.
  • an information processing system for managing router tables comprises a processor and a first ternary content addressable memory that is coupled to the processor.
  • a second ternary content addressable memory is coupled to the processor.
  • a first random access memory and a second random access memory are also coupled to the processor.
  • the processor is configured to perform a method comprising classifying a set of prefixes in a plurality of router table prefixes as a set of leaf prefixes and the remaining set of prefixes in the plurality of router table prefixes as a set of internal prefixes.
  • a leaf prefix is not a prefix of another prefix in a router table.
  • the set of internal prefixes is stored in a first ternary content addressable memory.
  • the set of leaf prefixes is stored in a second ternary content addressable memory.
  • a corresponding destination hop is stored in a first random access memory for each internal prefix stored in the first ternary content addressable memory.
  • a corresponding destination hop is stored in a second random access memory for each leaf prefix stored in the second ternary content addressable memory.
  • a packet with at least one destination address is received.
  • a simultaneous lookup is performed in the first ternary content addressable memory and the second ternary content addressable memory to retrieve up to two index values using the destination address. If there is a match in the second ternary content addressable memory, then the next hop is retrieved from the second random access memory.
  • Lookup in the first ternary content addressable memory takes more time due to the presence of a priority encoder, used to select the best match among multiple matches. Hence, if a match is found in the second ternary content addressable memory, the yet-to-complete lookup in the first ternary content addressable memory is aborted. Otherwise, if there is no match in the second ternary content addressable memory, the next hop is retrieved from the first random access memory corresponding to the best matching entry in the first ternary content addressable memory. The packet is then routed to the next hop.
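  • As a behavioral illustration of this parallel lookup (a minimal Python sketch consistent with the description above, not the patented hardware interface; the names search, search_async, abort, and result are illustrative assumptions):

        # Because the leaf prefixes in the second (leaf) TCAM are disjoint,
        # a hit there is already the longest matching prefix, so the slower,
        # priority-encoded lookup in the first (interior) TCAM can be aborted.
        def duo_lookup(dst, itcam, isram, ltcam, lsram):
            pending = itcam.search_async(dst)   # slower: priority encoder
            leaf_index = ltcam.search(dst)      # faster: no priority encoder
            if leaf_index is not None:
                pending.abort()                 # leaf hit wins
                return lsram[leaf_index]        # next hop of the leaf prefix
            best = pending.result()             # best match among ITCAM hits
            return isram[best] if best is not None else None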
  • a computer program product for managing router tables comprises a storage medium that is readable by a processing circuit and stores instructions for execution by the processing circuit for performing a method.
  • the method comprises classifying a set of prefixes in a plurality of router table prefixes as a set of leaf prefixes and the remaining set of prefixes in the plurality of router table prefixes as a set of internal prefixes.
  • a leaf prefix is not a prefix of another prefix in a router table.
  • the set of internal prefixes is stored in a first ternary content addressable memory.
  • the set of leaf prefixes is stored in a second ternary content addressable memory.
  • a corresponding destination hop is stored in a first random access memory for each internal prefix stored in the first ternary content addressable memory.
  • a corresponding destination hop is stored in a second random access memory for each leaf prefix stored in the second ternary content addressable memory.
  • a packet with at least one destination address is received.
  • a simultaneous lookup is performed in the first ternary content addressable memory and the second ternary content addressable memory to retrieve up to two index values using the destination address. If there is a match in the second ternary content addressable memory, then the next hop is retrieved from the second random access memory. Lookup in the first ternary content addressable memory takes more time due to the presence of a priority encoder, used to select the best match among multiple matches.
  • Hence, if a match is found in the second ternary content addressable memory, the yet-to-complete lookup in the first ternary content addressable memory is aborted. Otherwise, if there is no match in the second ternary content addressable memory, the next hop is retrieved from the first random access memory corresponding to the best matching entry in the first ternary content addressable memory. The packet is then routed to the next hop.
  • Figure 1 illustrates the high level functions of control and data planes
  • Figure 2 illustrates an insertion and deletion into a TCAM implementing leaf pushing in MIPS
  • Figure 3 illustrates a dual TCAM with a SRAM according to one embodiment of the present invention
  • Figures 4A to 4C illustrate DUOS for an example 5-prefix forwarding table according to one embodiment of the present invention
  • Figure 5 shows a table of control-plane trie functions according to one embodiment of the present invention
  • Figure 6 shows a table of functions used for incremental update according to one embodiment of the present invention
  • Figure 7 shows one algorithm for inserting into DUOS according to one embodiment of the present invention
  • Figure 8 shows one algorithm to delete from DUOS according to one embodiment of the present invention
  • Figure 9 shows one algorithm to change a next hop in DUOS according to one embodiment of the present invention.
  • Figure 10 shows algorithms to insert, delete or change a prefix in the ITCAM of DUOS according to one embodiment of the present invention
  • Figure 11 shows algorithms to insert, delete or change a prefix in the LTCAM of DUOS according to one embodiment of the present invention
  • Figure 12 shows one algorithm for a move function from an ITCAM[src] to an ITCAM[dest] according to one embodiment of the present invention
  • Figures 13A to 13E illustrate a prefix arrangement in ITCAM for a first memory management mechanism for Internet Protocol version 4 (IPv4) according to one embodiment of the present invention
  • Figure 14 illustrates an algorithm associated with the first memory management mechanism for getting a free slot to insert a prefix whose length is len according to one embodiment of the present invention
  • Figure 15 illustrates an algorithm associated with the first memory management mechanism to free a slot previously occupied by a prefix of length len according to one embodiment of the present invention
  • Figures 16A to 16E illustrate an ITCAM layout for a second memory management mechanism (also known as DFS_PLO (Distributed Free Space with Prefix Length Ordering Constraint)) according to one embodiment of the present invention
  • Figure 17 illustrates an algorithm associated with the second memory management mechanism for getting a free slot to insert a prefix whose length is len according to one embodiment of the present invention
  • Figure 18 illustrates supporting algorithms for the algorithm shown in Figure 17 according to one embodiment of the present invention
  • Figure 19 illustrates an algorithm associated with the second memory management mechanism to free a slot according to one embodiment of the present invention
  • Figures 20A to 20G illustrate an ITCAM layout for a third memory management mechanism (also known as DLFS_PLO (Distributed and Linked Free Space with Prefix Length Ordering Constraint)) with moves for insert and delete according to one embodiment of the present invention
  • Figure 21 illustrates an algorithm associated with the third memory management mechanism for getting a free slot to insert a prefix whose length is len according to one embodiment of the present invention
  • Figure 22 illustrates supporting algorithms for the algorithm shown in Figure 21 according to one embodiment of the present invention
  • Figure 23 illustrates an algorithm associated with the third memory management mechanism to free a slot according to one embodiment of the present invention
  • Figure 24 illustrates a getSlot algorithm associated with the fourth memory management mechanism according to one embodiment of the present invention
  • Figure 25 illustrates a freeSlot algorithm associated with the fourth memory management mechanism according to one embodiment of the present invention
  • Figure 26 illustrates supporting control plane trie algorithms used by the getSlot and freeSlot algorithms associated with the fourth memory management mechanism according to one embodiment of the present invention
  • Figures 27A and 27B illustrate carving using a conventional method and carving according to one embodiment of the present invention
  • Figure 28 illustrates one algorithm to carve a leaf trie to obtain disjoint Q(N)s according to one embodiment of the present invention
  • Figure 29 illustrates a DUOW algorithm to insert a prefix into the LTCAM according to one embodiment of the present invention
  • Figure 30 illustrates one algorithm to add a suffix to a wide LSRAM word according to one embodiment of the present invention
  • Figure 31 illustrates one algorithm to split a wide LSRAM word into two according to one embodiment of the present invention
  • Figure 32 illustrates a DUOW algorithm to delete a leaf prefix according to one embodiment of the present invention
  • Figure 33 illustrates a DUOW algorithm to change the next hop of a leaf prefix according to one embodiment of the present invention
  • Figure 34 illustrates an assignment of the prefixes shown in Figures 4A to 4C to the two TCAMs in the dual TCAM architecture according to one embodiment of the present invention
  • Figure 35 illustrates a DLTCAM insert algorithm for IDUOW according to one embodiment of the present invention
  • Figure 36 illustrates an algorithm for adding a suffix to a DLSRAM word in IDUOW according to one embodiment of the present invention
  • Figure 37 illustrates one algorithm for splitting a DLSRAM word in IDUOW according to one embodiment of the present invention
  • Figure 38 illustrates one algorithm for deleting a leaf prefix in IDUOW according to one embodiment of the present invention
  • Figure 39 illustrates one algorithm for changing the next hop of a leaf prefix in IDUOW according to one embodiment of the present invention
  • Figure 40 illustrates one algorithm for deleting prefixes in IDUOW according to one embodiment of the present invention
  • Figure 41 illustrates a conventional 1-12Wc configuration
  • Figure 42 illustrates a 1-12Wc configuration according to one embodiment of the present invention
  • Figure 43 illustrates an algorithm for assigning a new bucket in a modified 1-12Wc configuration according to one embodiment of the present invention
  • Figure 44 illustrates a conventional M-12Wb configuration
  • Figure 45 illustrates a Visit algorithm used in a modified M-12Wb style of prefix assignment according to one embodiment of the present invention
  • Figure 46 illustrates a Split-a-node algorithm used in a M-12Wb style of prefix assignment according to one embodiment of the present invention
  • Figure 47 illustrates an Assign-a-new-bucket algorithm used in a M-12Wb style of prefix assignment according to one embodiment of the present invention
  • Figure 48 illustrates an incrementRoom and decrementRoom algorithm used in a M-12Wb style of prefix assignment according to one embodiment of the present invention
  • Figure 49 shows the characteristics of datasets stored in a simple TCAM;
  • Figure 50 shows the total number of prefix moves (i.e., number of invocations of move()) required for inserts (includes raw inserts and change next hop inserts) and deletes in test update sequences when the prefixes are stored in a simple TCAM;
  • Figure 51 shows the average number of prefix moves (i.e., number of invocations of move()) required for inserts and deletes in test update sequences when the prefixes are stored in a simple TCAM;
  • Figure 52 shows the number of waitWrites (sum of invocations of waitWriteValidate() and invalidateWaitWrite()), which is equal to the sum of inserts, deletes and moves for the simple TCAM and reflects the update performance for the four memory management mechanisms of various embodiments of the present invention;
  • Figure 53A shows the normalized average number of moves for each memory management mechanism of one or more embodiments of the present invention on a logarithmic scale;
  • Figure 53B shows the normalized average waitWrites invoked by the memory management mechanisms of various embodiments of the present invention;
  • Figure 54 shows the number of moves for inserts based on TCAM occupancy using the third memory management mechanism according to one embodiment of the present invention
  • Figure 55 shows a distribution of prefixes, inserts, and deletes for DUOS according to one embodiment of the present invention
  • Figure 56 shows the number of moves for inserts and deletes in the ITCAM of DUOS according to one embodiment of the present invention
  • Figure 57 shows the number of waitWrites in the ITCAM of DUOS according to one embodiment of the present invention.
  • Figure 58 shows the number of LTCAM moves and waitWrites for DUOS according to one embodiment of the present invention
  • Figure 59 shows the number of prefixes to be stored in the LTCAM and associated wide SRAM according to one embodiment of the present invention
  • Figure 60 shows the number of waitWrites in the LTCAM of DUOW according to one embodiment of the present invention
  • Figure 61 shows the number of waitWrites for the ILTCAM of IDUOW according to one embodiment of the present invention
  • Figure 62 shows statistics for the DLTCAM of IDUOW using a 1-12Wc according to one embodiment of the present invention
  • Figure 63 shows statistics for the ILTCAM of IDUOW using a M-12Wb according to one embodiment of the present invention
  • Figure 64 shows statistics for the DLTCAM of IDUOW using a M-12Wb according to one embodiment of the present invention
  • Figure 65 shows the total number of waitWrites required to perform the test update sequences using different architectures
  • Figure 66A shows the normalized average waitWrites for the different architectures
  • Figure 66B shows the normalized average power for the different architectures
  • Figure 67 shows the maximum number of write operations required by an insert or delete in the test update sequences
  • Figure 68 shows the power consumption characteristics of MIPS, CAO_OPT and DUO in terms of the number of entries enabled during a search operation
  • Figure 69 illustrates one example of the extra level of TCAM (index).
  • Figure 70 illustrates one example of an operating environment according to one embodiment of the present invention.

Detailed Description
  • the term another, as used herein, is defined as at least a second or more.
  • the terms including and/or having, as used herein, are defined as comprising (i.e., open language).
  • the term coupled, as used herein, is defined as connected, although not necessarily directly, and not necessarily mechanically.
  • the terms program, software application, and other similar terms as used herein, are defined as a sequence of instructions designed for execution on a computer system.
  • a program, computer program, or software application may include a subroutine, a function, a procedure, an object method, an object implementation, an executable application, an applet, a servlet, a source code, an object code, a shared library/dynamic load library and/or other sequence of instructions designed for execution on a computer system.
  • Research on TCAM routers has focused on lowering the power consumption [See 11, 8, 4, 9, 10, 2, 22, 21, 15, 14, 16, 17, which are hereby incorporated by reference in their entireties], creating new router architectures involving multiple TCAMs that achieve even faster lookup [See 27, 26, which are hereby incorporated by reference in their entireties], and developing efficient strategies for incremental updates [See 23, 18, 19, which are hereby incorporated by reference in their entireties].
  • Various embodiments of the present invention provide router architectures that have efficient support for incremental updates. The following is an overview of TCAM incremental updates.
  • CAO_OPT: chain-ancestor ordering constraint
  • the prefixes are placed in the TCAM so that if two prefixes are nested, the longer prefix precedes the shorter one. Starting from the binary trie representation of the routing table's prefixes, the prefixes along any path from the trie root to a trie leaf are nested. So, every root-to-leaf path in the trie defines a chain of nested prefixes.
  • In CAO_OPT, the prefixes on every chain appear in reverse order in the TCAM. This placement ensures that the first prefix in the TCAM that matches a destination address is the longest matching prefix.
  • the TCAM free slots are in the middle of the TCAM.
  • CAO_OPT gives a performance improvement over PLO_OPT in practice (though the worst-case performance of both is the same).
  • Wang et al. [18] define a consistent rule table to be a rule table in which the rule matched (including the action associated with the rule) by a look up operation performed in the data plane is either the rule (including action) that would be matched just before or just after any ongoing update operation in the control plane.
  • Wang et al. [18] develop a scheme for consistent table update without locking the TCAM at any time, essentially allowing a search to proceed while the table is being updated. Consistency is ensured by avoiding overwriting of a TCAM entry.
  • Their CoPTUA algorithm can be applied to the PLO_OPT and CAO_OPT mechanisms of [23] so that rule updates can be carried out without locking the table for data plane lookups under suitable assumptions for TCAM operation [18].
  • Wang and Tzeng [19] also propose a consistent TCAM mechanism. Their mechanism, MIPS, however, delays data plane lookups that match TCAM slots whose next hop information is being updated.
  • In MIPS, the TCAM stores a set of independent (i.e., disjoint) prefixes. This set of independent prefixes is obtained from the original set of prefixes by using the leaf pushing technique [See 25, which is hereby incorporated by reference in its entirety] followed by a compression step. Since the prefixes in the TCAM are independent, at most one prefix matches any given destination address.
  • the independent prefixes may be placed in the TCAM in any order and the priority encoder logic of the TCAM can be dispensed with, which results in a reduction in TCAM lookup latency by about 50% [See 24, which is hereby incorporated by reference in its entirety].
  • a new prefix may be inserted into any free slot of the TCAM and an old prefix deleted by simply setting the associated slot's valid bit to 0. While the use of an independent prefix set simplifies table management, leaf pushing replicates a prefix many times. In the worst case, an insert or delete requires changes to Θ(n) TCAM entries, where n is the number of independent prefixes in the TCAM (Figures 2A and 2B).
  • Lu and Sahni [4] couple indexed TCAMs with wide SRAMs to reduce both power and TCAM memory by a significant amount.
  • the strategies of [4] are power and memory efficient, they are not well suited to incremental update.
  • the prefix compaction methods of [11, 8, 22] while resulting in power and memory reduction, do not lend themselves well to incremental update.
  • Chang [2] proposes a TCAM partitioning and indexing scheme in which the TCAM index is stored in a pivot prefix SRAM and an index SRAM.
  • the TCAM index is searched using a binary search that makes O(log K) SRAM accesses to determine the TCAM bucket that is to be searched.
  • [21] stores its index in a TCAM enabling the determination of the bucket for further search by a query on the index TCAM.
  • a lookup takes 2 TCAM searches when the mechanism of [21] is used and 1 TCAM search plus O(log K) SRAM accesses when the scheme of [2] is used.
  • a packet forwarding rule (P, H) comprises a prefix P and a next hop H.
  • a packet with destination address d is forwarded to H where H is the next hop associated with the rule that has the longest prefix that matches d.
  • the set of rules is referred to as the rule table or forwarding table.
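  • For example, if the rule table contains the rules (10*, H1) and (1011*, H2), a packet whose destination address begins 1011... matches both prefixes and is forwarded to H2, the next hop of the longer matching prefix; a packet whose address begins 100... matches only 10* and is forwarded to H1.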
  • Packet forwarding is performed in the data plane while route updates are done in the control plane. Whereas the data plane receives tens or even hundreds of millions of packets per second, the control plane receives only thousands of update requests per second.
  • Figure 1 illustrates the high level functions 100 of control and data planes.
  • the forwarding table size at each router is growing fast as is the number of route updates that are received by a router due to extensive interconnections.
  • the largest forwarding tables have about one million rules and the number of updates peaks at about 10,000 updates per second.
  • the number of data plane lookups per second exceeds 30 million.
  • TCAM is a special type of content addressable memory (CAM) that allows each memory bit to store one of the three values: 0, 1, x (don't care).
  • the prefix of a rule is stored in a word of TCAM and the next hop is stored in the corresponding word of an associated static random access memory (SRAM).
  • the entries of a TCAM may be searched in parallel or simultaneously for a prefix that matches a given destination address. If multiple matching entries are found then the best match is selected by a priority encoder.
  • This priority encoder can select the best match by a first found algorithm, a last found algorithm, or any other algorithm. The best match is quite frequently identified as the first entry that matches. Using the index of the best matched TCAM entry, the corresponding SRAM word is accessed to determine the next hop.
  • If the prefixes are stored in decreasing order of prefix length, it is possible to determine the TCAM index of the longest matching prefix for any destination address in one TCAM cycle. It should be noted that a TCAM word is 32 bits for IPv4 applications.
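  • A toy software model may clarify the role of this ordering (illustrative Python only; a real TCAM compares all entries in parallel in a single cycle, with the priority encoder picking the first match):

        # Prefixes sorted by decreasing length; "first match wins" then
        # automatically yields the longest matching prefix.
        table = [("1011", "H2"), ("10", "H1")]   # length 4 before length 2

        def simple_tcam_lookup(dst_bits):
            for prefix, next_hop in table:       # models the priority encoder
                if dst_bits.startswith(prefix):
                    return next_hop
            return None

        # simple_tcam_lookup("10110000") returns "H2"
        # simple_tcam_lookup("10000000") returns "H1"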
  • the discussed TCAM mechanism is referred to as the simple TCAM mechanism [4].
  • In a router's forwarding engine, a TCAM consumes a high amount of power for each lookup operation since every TCAM cell in the array is activated for each lookup.
  • Lu and Sahni in [4] propose a technique that utilizes wide SRAMs to store portions of prefixes along with their next hops in each SRAM word. This mechanism reduces the TCAM size and power requirement drastically.
  • the Simple TCAM with Wide SRAM (STW) organization is the basic mechanism in [4] that demonstrates the potential of saving TCAM space and power by utilizing wide SRAM words.
  • One drawback of the STW mechanism is that incremental update algorithms are complex because of the need to handle covering prefixes that may be replicated many times. On the other hand, batch update algorithms require twice the memory footprint so that forwarding and updating can be applied to two separate copies of the forwarding table [22].
  • Wang et al. [18] propose a consistent table update mechanism that eliminates the need to lock the forwarding table during an update, preserving the correctness of rule matching at all times. Since lookups can proceed at their usual speed even as updates are being carried out, there is no need to minimize the number of rule moves required to incorporate an update as long as the rate of processing keeps up with the arrival rate for updates. However, this does not undermine the advantage of a fast update process requiring a smaller number of rule moves since with a faster process fewer packets will be forwarded to non-optimal next hops.
  • Wang and Tzeng [19] use leaf pushing to transform the prefixes in the routing table into a set of independent prefixes, which are then stored in a TCAM (in any order).
  • various embodiments of the present invention provide a novel dual TCAM architecture herein referred to as "DUO".
  • the following discusses three embodiments of DUO, along with advanced memory management mechanisms for performing efficient and consistent incremental updates without degrading lookup speed. For example, router table updates can be performed without interrupting lookup operations in the LTCAM and ITCAM.
  • a first embodiment of the architecture is DUOS, a dual TCAM with simple SRAM, where both the TCAMs have a simple associated SRAM that is used for storing next hops.
  • a second embodiment of the architecture is DUOW, a dual TCAM with wide SRAM, where one or both the TCAMs have wide associated SRAMs that are used to store suffixes as well as next hops.
  • a third embodiment is IDUOW, an indexed dual TCAM with wide SRAM, in which either or both TCAMs have an associated index TCAM.
  • DUOS uses any reasonably efficient data structure to store the routing-table rules in the control plane.
  • a simple data structure, such as a binary trie (1-bit trie) stored in a 100ns DRAM, permits about 300K IPv4 lookups, inserts, and deletes per second. This performance is quite adequate for the anticipated tens of thousands of control plane operations per second. For concreteness, it is assumed that a binary trie is used, in the control plane, to store the routing-table rules.
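  • This throughput figure follows from the trie height: an IPv4 lookup, insert, or delete follows at most one 32-node root-to-leaf path, so at 100ns per DRAM access an operation takes on the order of 32 × 100ns = 3.2 microseconds, or roughly 312K operations per second.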
  • DUOS uses two TCAMs each with an associated SRAM.
  • the TCAMs are labeled ITCAM 302 (Interior TCAM) and LTCAM 304 (Leaf TCAM) in Figure 3.
  • the associated SRAMs are similarly labeled ISRAM 306 and LSRAM 308.
  • Prefixes stored in leaf (respectively, non-leaf or interior) nodes of the control plane trie are also stored in the LTCAM (respectively, ITCAM) and their associated next hops are stored in the LSRAM (respectively, ISRAM). Since the LTCAM stores only leaf prefixes, the prefixes in the LTCAM are disjoint and at most one may match a given destination address, whereas the ITCAM can have multiple matches. Hence, the LTCAM does not require a priority encoder 310, as the ITCAM 302 does. Therefore, the LTCAM runs much faster, approximately 50% faster, than the ITCAM with the priority encoder. A data plane lookup is performed by doing a search for the packet's destination address in both the ITCAM and the LTCAM.
  • Figures 4A to 4C show a 5-prefix forwarding table 400 (Figure 4A) together with its corresponding binary trie 402 (Figure 4B) that is stored in the control plane, as well as the content 404 (Figure 4C) of the two TCAMs and the two SRAMs of DUOS.
  • Each node of the control plane trie has fields such as prefix, slot, nexthop and length in which the prefix (if any) stored at this node is recorded along with the ITCAM or LTCAM slot in which the prefix is stored and the nexthop and length of the prefix.
  • Functions 500 for basic operations on the control plane trie (hereinafter simply referred to as trie) are assumed and shown in FIG. 5.
  • Each TCAM has two ports, which can be used to simultaneously access the TCAM from the control plane and the data plane.
  • Each TCAM entry/slot is tagged with a valid bit that is set to 1 if the content of the entry is valid, and to 0 otherwise.
  • a TCAM lookup engages only those slots whose valid bit is 1.
  • the TCAM slots engaged in a lookup are determined at the start of a lookup to be those slots whose valid bits are 1 at that time. Changing a valid bit from 1 to 0 during a data plane lookup does not disengage that slot from the ongoing lookup. Similarly, changing a valid bit from 0 to 1 during a data plane lookup does not engage that slot until the next lookup.
  • The function waitWriteValidate, which writes to a TCAM slot and sets the valid bit to 1, is assumed to be available.
  • In case there is an ongoing data plane lookup, the write is delayed till this lookup completes.
  • While the write is in progress, the TCAM slot being written to is excluded from data plane lookups.
  • a possible mechanism to accomplish this exclusion is to set the valid bit to 0 before commencing the write and to change this bit to 1 when the write completes. This exclusion is equivalent to the requirement that "after a rule is matched, resetting the valid bit has no effect on the action return process" [18], and to setting the valid entry to "hit" [19].
  • The function invalidateWaitWrite, which sets the valid bit of a TCAM slot to 0 and then writes an address to the associated SRAM word in such a way that the outcome of the ongoing lookup is unaffected, is also assumed to be available.
  • waitWriteValidate may, at times, write the prefix and nexthop information to the TCAM and associated SRAM slot and validate it without any wait. This happens, for example, when the writing is to be done to a TCAM slot that is not the subject of the ongoing data plane lookup.
  • the wait component of the function waitWriteValidate is said to be null in this case.
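  • The two primitives can be modeled as follows (a behavioral Python sketch of the semantics described above; lookup_in_progress, slot_engaged, and wait_for_lookup are assumed hooks into the TCAM controller, not a documented API):

        # waitWriteValidate: write (prefix, next hop) into a slot and set its
        # valid bit to 1, waiting only if the slot could interact with the
        # data plane lookup currently in flight.
        def waitWriteValidate(slot, prefix, next_hop, tcam, sram, valid):
            if lookup_in_progress() and slot_engaged(slot):
                wait_for_lookup()      # delay the write until the lookup ends
            valid[slot] = 0            # exclude the slot while it is changing
            tcam[slot] = prefix
            sram[slot] = next_hop
            valid[slot] = 1            # engaged again from the next lookup on

        # invalidateWaitWrite: retire a slot, then reuse its SRAM word (for
        # example, as a free-list link) without affecting the ongoing lookup.
        def invalidateWaitWrite(slot, word, sram, valid):
            valid[slot] = 0            # no future lookup engages this slot
            wait_for_lookup()          # let a lookup already in flight finish
            sram[slot] = word          # safe: the slot can no longer be matched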
  • Figure 6 lists the various update algorithms 600 that are defined later in this section for DUOS and its associated ITCAM and LTCAM.
  • the indentation represents the hierarchy of function calls. A function at one level of indentation calls one or more functions below it at the next level of indentation or at the same level of indentation.
  • Figure 7 shows an algorithm 700 to insert a new prefix p of length 1 and nexthop h for DUOS.
  • p is, in fact, new (i.e., p is not already in the rule table).
  • p is inserted into the trie using the trie insertion algorithm, which returns nodes m and n, where m is the trie node storing p and n is the nearest ancestor (if any) of m that has a prefix.
  • when m is a leaf of the trie, there is a possibility that the insertion of p transformed a prefix that was previously a leaf prefix into a non-leaf prefix. If so, this prefix is moved from the LTCAM to the ITCAM. Then, p is inserted into the LTCAM.
  • when m is not a leaf of the trie, p is inserted into the ITCAM.
  • Figure 8 shows an algorithm 800 to delete the prefix p from DUOS.
  • p is, in fact, present in the rule table and so may be deleted.
  • the trie deletion function returns nodes m and n, where m is the trie node where p was stored and n is the nearest ancestor (if any) of m that has a prefix. If m was a leaf, then p is to be deleted from the LTCAM. In this case, the prefix (if any) in n may become a leaf prefix. If so, the prefix in n is to be moved from the ITCAM to the LTCAM. When m is not a leaf, p is deleted from the ITCAM.
  • the prefixes in the ITCAM are stored in such a manner as to support determining the longest matching prefix (i.e., in any topological order that conforms to the precedence constraints defined by the binary trie: p1 must come before p2 whenever p1 is a descendant of p2 [23]). Decreasing order of length is a commonly used ordering.
  • the function getSlot(length) returns an ITCAM slot such that insertion of the new prefix into this slot satisfies the ordering constraint in use provided the new prefix has the specified length; the function freeSlot(slot, length) frees a slot previously occupied by a prefix of the specified length and makes this slot available for reuse later.
  • the prefixes in the LTCAM are disjoint and so may be stored in any order.
  • the unused (or free) slots of the LTCAM/LSRAM are linked together into a chain using the words of the LSRAM to build this chain.
  • a computer variable AV is used to store the index of the first available LSRAM word on the chain. Stated differently, AV is an integer variable used to store the address of the first available or free slot in the LTCAM-LSRAM system.
  • the non-available slots store valid prefixes and corresponding nexthops in the LTCAM and the LSRAM, respectively. So, the free slots are AV, LSRAM[AV ], LSRAM[LSRAM[AV ]], and so on.
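  • In effect, the free slots form a stack threaded through the LSRAM itself, so getting and freeing an LTCAM slot are constant-time operations (an illustrative Python sketch; in the actual mechanism the chain links are written with invalidateWaitWrite to preserve lookup consistency):

        AV = -1                        # first free slot; -1 means no free slot

        def lt_get_slot(lsram):
            global AV
            slot = AV
            AV = lsram[slot]           # the LSRAM word held the next free slot
            return slot

        def lt_free_slot(lsram, slot):
            global AV
            lsram[slot] = AV           # push the freed slot onto the chain
            AV = slot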
  • the LTCAM insert algorithm 1100, delete algorithm 1102, and change algorithm 1104 are shown in Figure 11.
  • each memory management mechanism includes an implementation of the getSlot and freeSlot functions discussed above in Section 3.2 to get and free ITCAM slots.
  • the implementations employ the function move ( Figure 12) that moves the content of an in-use ITCAM slot to a free ITCAM slot in such a way as to maintain data plane lookup consistency.
  • the memory management algorithms of various embodiments of the present invention maintain the invariant that an ITCAM slot has its valid bit set to 0 iff (if and only if) that slot was not matched by the ongoing data plane lookup (if any); that is, iff the slot is not involved in the ongoing data plane lookup.

3.4.1 Memory Management Mechanism 1
  • Figure 13A shows the initial arrangement.
  • Figure 13B shows an insert p/30 operation.
  • Figure 13C shows free space available in block 30 for insert.
  • Figure 13D shows a delete p/24 operation.
  • Figure 13E shows that the free space has been returned to the pool.
  • the prefixes are stored in decreasing order of length in the TCAM, which ensures that the longest matching prefix is returned as the first matching prefix.
  • the pool of free slots is kept at the logical center of the TCAM, that is, the first free slot in the pool appears after all blocks of prefixes of length W/2 + 1 or more and the last free slot appears before all blocks of prefixes of length W/2 or less, where W is the width of the IP address (32 in the case of IPv4).
  • this architecture requires at most W/2 moves for each getSlot and freeSlot request.
  • This first embodiment of the memory management mechanism provides an implementation that maintains consistency of data plane lookups.
  • top[i] = first slot used by block i, 1 ≤ i ≤ W/2;
  • bot[i] = last slot used by block i, W/2 + 1 ≤ i ≤ W;
  • top[i] = top[i-1] iff block i is empty, 1 ≤ i ≤ W/2;
  • bot[i] = bot[i+1] iff block i is empty, W/2 + 1 ≤ i ≤ W.
  • Figures 14 and 15 show the getSlot and freeSlot algorithms 1400, 1500, respectively, for Mechanism 1.
  • the getSlot algorithm 1400 gets a free slot to insert a prefix whose length is len.
  • the freeSlot algorithm 1500 frees a slot previously occupied by a prefix of length len. Their correctness and the fact that data plane lookup consistency is preserved are easily established.
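  • For a prefix length above the center pool, the cascade of moves can be sketched as follows (Python-style pseudocode under the block layout above, ignoring empty-block corner cases; move() is the consistency-preserving move of Figure 12, and lengths at or below W/2 are handled symmetrically from the other end of the pool using top[]):

        W = 32   # IPv4

        def get_slot_mechanism1(length, bot, move):
            hole = bot[W // 2 + 1] + 1          # first free slot of the pool
            for i in range(W // 2 + 1, length): # one move per intervening block
                first = bot[i + 1] + 1          # first slot of block i
                move(first, hole)               # order within a block is free
                bot[i] += 1                     # block i absorbed the old hole
                hole = first                    # the hole cascaded one block up
            bot[length] += 1                    # the new prefix takes the hole
            return hole

  There are at most W/2 - 1 blocks between the pool and the farthest block, consistent with the at-most-W/2-moves bound noted above.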
3.4.2 Memory Management Mechanism 2

  • This memory management embodiment is a variation of the first memory management embodiment discussed above in which the free slots are at the boundaries between prefix blocks, as shown in Figures 16A to 16E.
  • This memory management embodiment is also called DFS_PLO (Distributed Free Space with Prefix Length Ordering Constraint).
  • Figures 16A to 16E show the prefix arrangement 1600 in ITCAM for the second memory management mechanism.
  • the blocks with a diagonal pattern indicate the free space pool.
  • the curved arrows correspond to a move.
  • Figure 16A shows the initial arrangement.
  • Figure 16B shows an insert p/30 operation.
  • Figure 16C shows free space available in block 30 for insert.
  • Figure 16D shows a delete p/24 operation.
  • Figure 16E shows that the free space has been returned to the adjacent pool.
  • the available free slots are distributed in proportion to the number of prefixes in a block with the caveat that an empty block gets 1 free slot at its boundary.
  • top[i] is the slot where the first prefix of length i is stored
  • bot[i] is the slot where the last prefix of length i is stored, 0 ≤ i ≤ W (i.e., these variables define the start and end of block i). Note that top[i] ≤ bot[i] for a non-empty block i and top[i] > bot[i] for an empty block.
  • the getSlot algorithm 1700 for the second memory management mechanism provides a free slot from either block boundary when there is a free slot on the block boundary. Otherwise, the algorithm moves a free slot from the nearest block boundary that has a free slot.
  • This algorithm utilizes several supporting algorithms 1800, 1802, 1804, 1806 that are shown in Figure 18.
  • the algorithms movesFromAbove 1800 and movesFromBelow 1802 return the number of prefix moves that are required to get the nearest free slot from above and below, respectively, the block where it is needed.
  • the algorithms getFromAbove 1804 and getFromBelow 1806 get the nearest free slot above or below the block where the free slot is needed, respectively.
  • the algorithm 1900 to free a slot for the second memory management mechanism simply moves the slot to be freed to the block boundary unless this slot is at the boundary to begin with. Again, correctness and consistency are established easily. Although the worst-case performance of the algorithms of this second memory management embodiment is the same as that of the algorithms of the first memory management embodiment, it is expected that the second memory management embodiment's algorithms have better performance on average.

3.4.3 Memory Management Mechanism 3
  • This memory management embodiment is an enhancement of the second memory management embodiment in which a doubly-linked list of free slots is maintained within each block in addition to contiguous free slots at the block boundaries.
  • This memory management embodiment is also called DLFS_PLO (Distributed and Linked Free Space with Prefix Length Ordering Constraint).
  • Figures 20A to 20G show the ITCAM layout 2000 for this third memory management mechanism with moves for insert and delete. The curved arrows on the right show the forward links in the list of free spaces.
  • Figure 20A shows the initial arrangement.
  • Figure 20B shows an insert p/30 operation.
  • Figure 20C shows free space available.
  • Figure 20D shows a delete p/24 operation.
  • Figure 20E shows a delete p2/24 operation.
  • Figure 20F shows a delete p3/24 operation.
  • Figure 20G shows an insert p/24 operation.
  • the lists of free slots within a block enable one or more embodiments to avoid the move that is done by the second memory management embodiment freeSlot algorithm 1900 of Figure 19.
  • the forward links, called next[], of the doubly-linked list are maintained using the ISRAM words corresponding to the free ITCAM slots with AV[i] recording the first slot on the list for the ith block.
  • the backward links, called prev[] are maintained in these ISRAM words in case an ISRAM word is large enough to accommodate two links and in the control plane memory otherwise. All variables, including the array AV[], are, of course, stored in the control plane memory.
  • the getSlot algorithm 2100 for the third memory management embodiment first attempts to make available a slot from the doubly-linked list for the desired block. When this list is empty, the algorithm behaves like the getSlot algorithm for the second memory management embodiment, and the supporting algorithms 2200, 2202, 2204 of Figure 22 are similar to the corresponding supporting algorithms for the second memory management embodiment.
  • the freeSlot algorithm 2300 differs from that for the second memory management embodiment in that when the slot being freed is inside a block, it is added to the doubly-linked list of free slots. Again, correctness and consistency are established easily. Although the worst-case performance of the third memory management embodiment's algorithms is the same as that of the algorithms for the first two memory management embodiments, it is expected that the third memory management embodiment's algorithms have better performance on average.
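  • The per-block free lists make getSlot and freeSlot constant-time whenever a block has an interior free slot (an illustrative Python sketch; next[] and prev[] model the links kept in the ISRAM words of free slots, and the fallback when a block's list is empty is the boundary search of the second mechanism):

        def get_slot_mechanism3(block, AV, nxt, prv):
            slot = AV[block]
            if slot == -1:                        # no free slot inside the block
                return get_slot_mechanism2(block) # fall back to boundary moves
            AV[block] = nxt[slot]                 # unlink the list head
            if nxt[slot] != -1:
                prv[nxt[slot]] = -1
            return slot

        def free_slot_mechanism3(block, slot, AV, nxt, prv):
            # for a slot inside the block; boundary slots instead grow the
            # contiguous free space at the block boundary
            nxt[slot] = AV[block]
            prv[slot] = -1
            if AV[block] != -1:
                prv[AV[block]] = slot
            AV[block] = slot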
3.4.4 Memory Management Mechanism 4

  • This memory management embodiment is the CAO_OPT mechanism presented in [23].
  • prefixes are arranged in chain order, with the free space pool in the middle of the ITCAM.
  • Figures 24-26 show the necessary algorithms.
  • Figure 24 shows a getSlot algorithm 2400 for the fourth memory management mechanism.
  • Figure 25 shows a freeSlot algorithm 2500 for the fourth memory management mechanism.
  • Figure 26 shows an isTopHeavy algorithm 2600, a parent algorithm 2602, and a child algorithm 2604.
  • the interfaces are different from those used by the first three memory management embodiments.
  • the input to getSlot 2400 is p, which is the node in the trie where the prefix being inserted is stored.
  • Each trie node stores wt, wt ptr, held ptr, lchild, rchild, which are explained in greater detail in [23].
  • slot is the address of the ITCAM slot in which a prefix is entered; if the prefix has not yet been entered, then this variable is set to -1.
  • an array nodeMap[0:N] is maintained for ITCAM[0:N]; it comprises the trie node address of each valid prefix in the ITCAM, so that the prefixes can be located in the trie.
  • the DUOS embodiment is extended to the case when wide SRAMs (such as 32-bit words or larger) are in use.
  • the TCAM and SRAM configuration is similar to that shown in Figure 3, but with the SRAMs being wide SRAMs.
  • the extension is discussed only for the case when the LSRAM is wide.
  • the case when the ISRAM is wide uses techniques almost identical to those used in [4] while for a wide LSRAM, these techniques are modified by one or more embodiments of the present invention.
  • a wide LSRAM word is used to store a subtree of the binary trie of a forwarding table.
  • this embodiment begins with the binary trie, called the leaf trie, that contains only the leaf prefixes.
  • When a subtree of the leaf trie is stored in an LSRAM word, that subtree is removed from (or carved out of) the leaf trie before another subtree is identified for carving.
  • Let N be the root of the subtree being carved and Q(N) the prefix defined by the path from the root of the trie to N.
  • Q(N) is stored in the LTCAM.
  • the last |Pi| - |Q(N)| suffix bits of each prefix Pi in the carved subtree rooted at N are stored in the LSRAM word.
  • each suffix stored in the LSRAM word is a suffix of a leaf prefix that begins with Q(N).
  • this embodiment uses a carving algorithm that ensures that the Q(N)s stored in the LTCAM are disjoint. Since the carving algorithm of [4] does not ensure disjointedness, a new carving algorithm is provided in this embodiment.
  • Consider the binary trie 2700 of Figure 27A, which has been carved using a carving algorithm that ensures that each carved subtree 2702, 2704, 2706 has at most 2 leaf prefixes.
  • the LTCAM will need to store Q(N1), Q(N2) and Q(N3). Even though the prefixes in the binary trie are disjoint, the Q(N)s in the LTCAM are not disjoint (e.g., Q(N1) is a descendant of Q(N2) and so Q(N2) matches all IP addresses matched by Q(N1)).
  • the leaf trie is carved in such a way that all Q(N)s in the LTCAM are disjoint.
  • carving is performed via a postorder traversal of the binary trie.
  • the current embodiment uses the visit algorithm 2800 of Figure 28 to do the carving.
  • the algorithm 2800 of Figure 28 carves a leaf trie to obtain disjoint Q(N)s.
  • w is the number of bits in an LSRAM word and x→size is the number of bits needed to store (1) the suffix bits corresponding to prefixes in the subtrie rooted at x, (2) the length of each suffix, (3) the next hop for each suffix, (4) the number of suffixes in the word, and (5) the length of Q(x), which is the corresponding prefix stored in the LTCAM.
  • Algorithm splitNode(q) does the actual carving of the subtree rooted at node q.
  • the algorithm splitNode(q) is known to those skilled in TCAM research.
  • the basic idea in the current embodiment's carving algorithm is to forbid carving at two nodes that have an ancestor-descendent relationship.
  • Figure 27B shows the subtrees 2708, 2710, 2712 carved by the current embodiment's algorithm. As can be seen, Q(N1), Q(N2), and Q(N3) are disjoint. Although this carving algorithm generally results in more Q(N)s than when the carving algorithm of [4] is used, this carving algorithm retains the flexibility to store the Q(N)s in any order in the LTCAM as the Q(N)s are independent.
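  • The flavor of this disjointness-preserving carving can be sketched as follows (a simplified Python rendition of the postorder idea, not the exact visit algorithm of Figure 28; the node fields left, right, size and the helper carve() are illustrative assumptions):

        def visit(x, w, carve):
            # Returns True iff a node has already been carved inside the
            # subtree rooted at x; in that case neither x nor any ancestor
            # of x may be carved, which keeps all carved Q(N)s disjoint.
            if x is None:
                return False
            carved_left = visit(x.left, w, carve)
            carved_right = visit(x.right, w, carve)
            if carved_left or carved_right:
                # carving x would nest Q(x) above an existing Q(N); carve any
                # still-uncarved, non-empty child subtree separately instead
                if x.left and not carved_left and x.left.size > 0:
                    carve(x.left)
                if x.right and not carved_right and x.right.size > 0:
                    carve(x.right)
                return True
            if x.size > w:                # suffixes no longer fit in one word
                if x.left and x.left.size > 0:
                    carve(x.left)
                if x.right and x.right.size > 0:
                    carve(x.right)
                return True
            return False

        # After the traversal, any remaining uncarved trie is carved at its root.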
  • the LTCAM algorithms to insert, delete, change, and necessary support algorithms are shown in Figures 29-33. For example, Figure 29 shows an insert algorithm 2900 for inserting a prefix into the LTCAM.
  • Figure 30 shows an addSuffix algorithm 3000 for adding a suffix to a wide LSRAM word.
  • Figure 31 shows a split algorithm 3100 for splitting a wide LSRAM word into two words.
  • Figure 32 shows a delete algorithm 3200 and a carve algorithm 3202 for deleting a leaf prefix.
  • Figure 33 shows a change algorithm 3300 for changing the next hop of a leaf prefix.
  • the function carve is invoked by both the insert and delete algorithms under different contexts that are analyzed below.
  • When a prefix is deleted, the LSRAM word storing its suffix (corresponding to the LTCAM word for Q(cNode)) may have remaining suffixes that can be merged with another LSRAM word.
  • In this case, carve helps to reduce the number of LTCAM entries by one.
  • When a prefix is inserted, it may be possible to add the suffix bits of the new prefix to the LSRAM word that corresponds to the LTCAM slot for Q(cNode). If there is no cNode on the path between the new prefix node and the root, then carving at tNode is attempted; tNode is the nearest degree-2 ancestor of the new prefix node, and its subtree therefore includes the new prefix along with other existing prefixes. So, in this case, using carve, one or more embodiments of the present invention avoid adding a new LTCAM entry for the new prefix.
  • tNode is indeed an appropriate node to carve and the algorithm preserves the property of carving at only one node along any path from the root.
  • tNode is carved only if the number of bits needed to store all suffixes in the subtree rooted at tNode is less than the size of an LSRAM word.
  • All of these q nodes must be in the subtree of tNode that does not contain the target node, which is cNode for a delete and the new prefix node for an insert. This is because, if there was one carved node t among the q nodes in the subtree of cNode, for a delete, then t must occur either in the path between cNode and tNode, or as a descendant of cNode, given that tNode is the nearest ancestor of cNode with two children. In either case, t violates the property of a single carving along any path from the root.
  • Figure 34 shows a possible assignment 3400 of the 5-prefix example in Figure 4.
  • the intermediate prefixes P1 and P2 are stored in the ITCAM 3402, while the leaf prefixes P3, P4 and P5 are stored in the LTCAM 3404 using a wide LSRAM 3408.
  • the suffix nodes begin with the prefix length field of 2 bits in this example, followed by the suffix count field of 2 bits. Next comes the (length, suffix, nexthop) triplet for each prefix encoded in the suffix node, the numbers of allocated bits being (2 bits, 4 bits, 6 bits) respectively for the three fields in the triplet.
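  • As a bit-count check on this format: if a single wide LSRAM word held all three leaf suffixes, it would occupy 2 (length of Q(N)) + 2 (suffix count) + 3 × (2 + 4 + 6) = 40 bits.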
  • Zane et al. [21] introduced the concept of an indexed TCAM that reduces significantly the power consumed by a TCAM lookup. This concept was refined by Lu and Sahni [4] to reduce both the TCAM power and space requirements substantially.
  • One or more embodiments of the present invention incorporate an index TCAM in conjunction with an LTCAM that uses a wide LSRAM (i.e., an index for the LTCAM of DUOW).
  • the current embodiment has two TCAMs replacing the LTCAM: a data TCAM referred to as DLTCAM 6904 and an index TCAM referred to as ILTCAM 6910, as shown in Figure 69.
  • the associated SRAMs are data SRAM (DLSRAM) 6908 and ILSRAM 6912, as shown in Figure 69.
  • the IDUOW architecture 6900 of Figure 69 further shows the ITCAM 6902 and the ISRAM 6906 similar to that shown in Figure 3.
  • an indexed ITCAM and an indexed ISRAM can be placed before the ITCAM 6902, similar to that shown for the DLTCAM 6904.
  • The two most effective index TCAM strategies of [4], 1-12Wc and M-12Wb, are considered. The former is best for power, whereas the latter is the best overall mechanism, consuming the least TCAM space and low power for lookups [4].
  • Both 1-12Wc and M-12Wb organize the DLTCAM into fixed-size buckets that are indexed using the ILTCAM and the ILSRAM, which is also a wide SRAM that stores suffixes and associated information.
  • each DLTCAM bucket is assigned a unique number between 0 and totalSlots/bucketSize, where totalSlots is the total number of DLTCAM slots.
  • the unique number so assigned to a bucket is called its index.
  • a bucket index is stored in the trie node (in field bIndex) that is carved and represents an index prefix enclosing the DLTCAM prefixes in the bucket.
  • the free slots in a bucket are linked through the associated DLSRAM.
  • the first several bits (32 should be enough) of a DLSRAM word store the address of the next free DLTCAM slot in the same bucket.
  • the last free slot in a bucket stores -1 in bits 0-31 of the corresponding DLSRAM word. For each bucket one free slot is kept at all times.
  • This free slot is used for consistent updates, to copy the new prefix before deleting the old one.
  • the first free slot in a bucket is stored in an array AV indexed by the bucket index.
  • the array AV is initialized and maintained in the control plane.
  • a list of free buckets is maintained in the DLSRAM using additional bits of each DLSRAM word (12 bits are sufficient when the number of buckets is at most 4096).
  • the first available slot in a free bucket stores the bucket index of the next free bucket in the DLSRAM bits, and so on.
  • the free bucket chain is terminated by a -1 in the bits used to store the index of the next free bucket.
  • the variable bucketAV keeps track of the first bucket on the free bucket chain.
  • the array nextBucket is used to represent the forward links in the bucket list.
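  • Both chains admit constant-time allocation (an illustrative Python sketch mirroring the description; in hardware the links live in DLSRAM bits, while AV, bucketAV, and nextBucket are control plane state):

        def get_slot_in_bucket(b, AV, slot_link):
            slot = AV[b]                    # first free slot in bucket b
            AV[b] = slot_link[slot]         # bits 0-31 of the DLSRAM word; -1 ends the chain
            return slot

        def free_slot_in_bucket(b, slot, AV, slot_link):
            slot_link[slot] = AV[b]         # push the slot back onto bucket b's chain
            AV[b] = slot

        def allocate_bucket(state, nextBucket):
            b = state.bucketAV              # first bucket on the free-bucket chain
            state.bucketAV = nextBucket[b]  # unlink it; -1 terminates the chain
            return b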
  • When the ILTCAM prefixes are disjoint, one or more embodiments may use the simple memory management mechanism used for the LTCAM of DUOS; when these prefixes are not disjoint, they must be ordered, and any of the memory management mechanisms for the ITCAM of DUOS in Section 3.4 above may be used.
  • The update algorithms (insert algorithm 3500, addSuffix algorithm 3600, split algorithm 3700, delete algorithm 3800, carve algorithm 3802, change algorithm 3900, deleteBucket algorithm 3902, splitBucket algorithm 3904, and deletePrefixes algorithm 4000) shown in Figures 35-40 are almost identical for the 1-12Wc and M-12Wb organizations. The differences are explained in the next two subsections.
5.2 1-12Wc

  • This two-level TCAM organization in [4] employs wide SRAMs 4108, 4112 in association with both the data and index TCAMs 4102, 4110, as shown in Figure 41.
  • the strategy adopted in [4] to fill up the TCAMs and the SRAMs is summarized as follows. Firstly, suffix nodes are created for prefixes in the 1-bit trie, as described in Section 4, using Lu's carving heuristic. Secondly, every Q(N) to be entered in the data TCAM is treated as a prefix and the subtree split algorithm [4] is applied to carve index nodes in the trie.
  • the carving is done so that the number of data TCAM prefixes enclosed by the node being carved is less than or equal to the size b of a data TCAM bucket.
  • a new bucket is assigned to every index node.
  • An enclosed data TCAM prefix and the corresponding suffix node are entered in a new entry in the bucket.
  • the remaining entries in the bucket are padded with null prefixes.
  • The index nodes are treated as prefixes, and the algorithm to create suffix nodes is run on the trie containing only index prefixes.
  • the newly carved index Q(N) prefixes and the corresponding suffix nodes are entered in the index TCAM and the associated wide SRAM respectively.
  • the bucket numbers corresponding to the suffixes in an index SRAM suffix node happen to be consecutive.
  • the index SRAM omits the bucket number for all suffixes except the starting suffix, as shown in Figure 41.
  • the suffix node format in IDUOW stores the bucket number for each suffix, which makes it possible to assign any empty bucket in case of an overflow.
  • the suffix node format for the ILSRAM for 1-12Wc is shown in Figure 42. Similar to Figure 41, Figure 42 shows wide SRAMs 4208, 4212 in association with both the data and index TCAMs 4202, 4210. Also, in keeping with the main idea of storing independent prefixes in the LTCAM, the visit postorder algorithm is used instead of the subtree split algorithm of [4] while filling out the TCAMs.
  • the prefix assignment algorithm for 1-12Wc is given below. 1. Suffix nodes corresponding to prefixes in the forwarding table are created using the visit postorder algorithm on the 1-bit leaf prefix trie, as shown in Section 4.
  • Each Q(N) prefix resulting from Step 1 is to be entered into DLTCAM and is marked as a DLTCAMprefix in the trie.
  • Suffix nodes are created for the index prefixes using the visit postorder algorithm on the 1-bit trie containing the index prefixes.
  • the Q(N) prefixes corresponding to the carved nodes are entered in the ILTCAM.
  • Suffixes for the index prefixes are entered in the ILSRAM along with their bucket indexes, in the ILSRAM suffix node format shown in Figure 42.
  • the functions incrementRoom and decrementRoom are not relevant for 1-12Wc and are null functions.
  • the assignNewBucket function 4300 is outlined in Figure 43.
  • the 1-12Wc mechanism loses space efficiency because independent index prefix nodes are carved out and a single bucket is used to store only the DLTCAM prefixes enclosed by a single index prefix.
  • the M-12Wb mechanism does not have this deficiency, as DLTCAM prefixes from different index prefixes may be stored in the same bucket.
5.3 M-12Wb
  • Step 1 [Seed the DLTCAM buckets] Run feasibleST2(T, b - 1) ⌈n/(b - 1)⌉ times. // b - 1, since one free slot is needed in a bucket for consistent updates. Each time, call splitNode to carve the found bestST from T (thereby updating T) and pack bestST into a new DLTCAM bucket. The function splitNode adds one or more prefixes to the ILTCAM. (A sketch of Steps 1 to 4 is given after Step 4 below.)
  • Step 2 [Fill the buckets] While there is a DLTCAM bucket that is not full and T is not empty, repeat Step 3.
  • Step 3 [Add to a bucket]
  • Let B be the DLTCAM bucket with the fewest prefixes.
  • Let s be the number of prefixes in B.
  • Run feasibleST2 on T with the remaining capacity of B, then use splitNode to carve the found bestST from T (thereby updating T) and pack bestST into B.
  • the function splitNode adds one or more prefixes to the ILTCAM.
  • Step 4 [Use additional buckets as needed] While T is not empty, fill a new DLTCAM bucket by making repeated invocations of feasibleST2(T, q), where q is the remaining capacity of the bucket. Add ILTCAM prefixes as needed.
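A sketch of Steps 1 to 4 follows; it assumes, rather than reimplements, the helpers feasibleST2 and splitNode of Figures 45 and 46, and a hypothetical bucket abstraction with a room() method (one slot per bucket is taken to be reserved for consistent updates), so it illustrates the packing order only.

    import math

    def pack_m12wb(T, b, n, feasibleST2, splitNode, new_bucket):
        buckets = []
        # Step 1: seed ceil(n/(b-1)) buckets, each with one carved subtree.
        for _ in range(math.ceil(n / (b - 1))):
            bestST = feasibleST2(T, b - 1)      # carve at most b-1 prefixes
            B = new_bucket()
            splitNode(T, bestST, B)             # carve bestST from T, pack into B
            buckets.append(B)
        # Steps 2 and 3: while some bucket is not full, top up the bucket
        # with the fewest prefixes (i.e., the most room).
        while not T.empty() and any(B.room() > 0 for B in buckets):
            B = max(buckets, key=lambda x: x.room())
            bestST = feasibleST2(T, B.room())
            if bestST is None:                  # nothing fits in B
                break
            splitNode(T, bestST, B)
        # Step 4: use additional buckets as needed.
        while not T.empty():
            B = new_bucket()
            while not T.empty() and B.room() > 0:
                bestST = feasibleST2(T, B.room())
                if bestST is None:
                    break
                splitNode(T, bestST, B)
            buckets.append(B)
        return buckets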
  • the first difference is reflected in the visit2 algorithm 4500 (invoked by feasibleST2), shown in Figure 45, in that covering prefixes are not stored in the TCAMs.
  • the second difference is in supplying b - 1 as available space in an empty bucket of size b, reserving one free slot for consistent updates.
  • the third difference is in the use of carving function splitNode 4600, shown in Figure 46, which helps to create independent prefixes for IDUOW.
  • M-12Wb requires a doubly-linked list of used buckets to keep track of the buckets and the available space in them.
  • An instance of a class BList, which includes the doubly linked list of buckets as well as an array to get to the right bucket quickly using a bucket index, is maintained in the control plane.
  • Each bucket in the list has fields room to indicate available bucket slots and index to indicate the index of the bucket. The room in a bucket decreases from head to tail of the list.
  • BList uses function add to add a new bucket to the list and the array and getBucket to get the appropriate bucket based on bucket index.
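The following sketch models the BList bookkeeping just described; the field and method names (room, index, add, getBucket) follow the description, while a dictionary stands in for the index array.

    class Bucket:
        def __init__(self, index, size):
            self.index = index      # index of the bucket
            self.room = size        # available bucket slots
            self.prev = None
            self.next = None

    class BList:
        def __init__(self):
            self.head = None        # most room (empty buckets enter here)
            self.tail = None        # least room
            self.byIndex = {}       # bucket index -> bucket

        def add(self, bucket):
            # A new bucket is empty and so has maximal room; inserting it
            # at the head keeps room non-increasing from head to tail.
            self.byIndex[bucket.index] = bucket
            bucket.next = self.head
            if self.head is not None:
                self.head.prev = bucket
            else:
                self.tail = bucket
            self.head = bucket

        def getBucket(self, index):
            return self.byIndex.get(index)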
  • Figure 49 shows the characteristics 4900 of these datasets.
  • the update sequences for the first 20 routing tables were captured from files storing update announcements from 12am on February 1, 2009 for the stated number of hours; the update sequence for the last routing table rrc00 May20 was captured from files storing eight hours of activity starting from 12am on May 20, 2008.
  • a next hop change request is implemented (see Figure 10 for example) as an insert (of the prefix with the new next hop) followed by a delete (of the prefix with the old next hop). Therefore, all results henceforth are in terms of the effective inserts and deletes: the number of effective inserts (#Inserts) is the number of raw inserts plus the number of next hop changes, and the number of effective deletes (#Deletes) is the number of raw deletes plus the number of next hop changes.
  • FIG. 50 and 51 show a table 5000, 5100 of the total and average number of prefix moves (i.e., number of invocations of move()) required for an insert (including raw inserts and change-next-hop inserts) and a delete in the test update sequences (the data in Figure 51 is obtained from that in Figure 50 by dividing by #Inserts or #Deletes). Note that the theoretical worst-case number of moves for an insert/delete in IPv4 for the four memory management mechanisms is, respectively, 16, 32, 32 and 16.
  • the first memory management mechanism (PLO_OPT) required the maximum number of moves (sum of moves for inserts and deletes) for all the test sets and Mechanism 3 required the least.
  • the disparity among the four memory management mechanisms is very significant, with the third memory management mechanism, also referred to as Distributed and Linked Free Space with Prefix Length Ordering Constraint (DLFS_PLO), requiring a total number of moves that is orders of magnitude less than that required by the remaining mechanisms.
  • Memory management mechanisms 2 (DFS_PLO, Distributed Free Space with Prefix Length Ordering Constraint) and 4 (CAO_OPT) have similar performance, and the first memory management mechanism requires 10 times (or more) as many moves as required by the second and fourth memory management mechanisms.
  • the number of moves due to inserts in the third memory management mechanism (DLFS_PLO) is lower than that in the fourth memory management mechanism (CAO_OPT) by orders of magnitude.
  • the number of moves due to deletes is 0 in the third memory management mechanism (DLFS_PLO); in this mechanism, the slot within a block freed by a delete is simply appended to the free space list for the block, so deletes require no moves at all.
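The zero-move behavior of deletes in DLFS_PLO can be seen in the following minimal sketch (hypothetical structures; the actual getSlot and freeSlot algorithms are those of Figures 21 to 23).

    class Block:
        def __init__(self, slots):
            self.free_slots = list(slots)     # free-slot list for this block

    def free_slot(block, slot):
        block.free_slots.append(slot)         # a delete: no prefix moves

    def get_slot(block):
        # An insert into a block with free space also needs no moves.
        return block.free_slots.pop() if block.free_slots else None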
  • Figure 52 shows a table 5200 of the number of waitWrites (sum of invocations of waitWriteValidate() and invalidateWaitWrite()), which is equal to the sum of inserts, deletes and moves for the simple TCAM and reflects the update performance for the four memory management mechanisms.
  • the third memory management mechanism requires the least number of operations, due to the small number of moves.
  • the average number of waitWrites per insert and delete (number of waitWrites/(#Inserts + #Deletes)) ranged from a low of 1 for rrc01, rrc07, rrc16, and route-views.wide to a high of 1.0053 for rrc15.
  • Figure 53(a) illustrates a graph 5300 showing the normalized average number of moves for each mechanism on a logarithmic scale. For this figure, the average number of moves per insert/delete for each data set was computed. Then the average of these averages was computed and normalized by the average of averages for the third memory management mechanism.
  • Figure 53(b) illustrates a graph 5302 showing the normalized average waitWrites invoked by the different mechanisms. For this figure, the average number of waitWrites per insert/delete for each data set was computed, then the average of these averages was computed for each memory management mechanism, and finally this was normalized by the average of the averages for the third memory management mechanism.
  • the column labeled #Prefixes gives the initial number of prefixes in the routing table while that labeled #MaxPrefixes gives the maximum size attained by the routing table during the course of the update sequence.
  • the TCAM occupancy is defined to be #MaxPrefixes/(TCAM size)*100%.
  • the TCAM size was selected so as to have occupancies of 80%, 90%, 95%, 97%, and 99%.
  • the third memory management mechanism does very well.
  • the fourth memory management mechanism (CAO_OPT) requires between 93 and 74000 times as many moves (for inserts and deletes combined) as required by the third memory management mechanism (see Figure 50 for the number of moves required by Mechanism 4).
  • each prefix in the forwarding table occupies a slot in either the ITCAM or the LTCAM.
  • Columns 2 and 5 of the table 5500 in Figure 55 show the initial prefix distribution between the 2 TCAMs of DUOS.
  • Columns 3 and 6 give the distribution of the inserts (i.e., number of non-leaf inserts and number of leaf inserts) while columns 4 and 7 give the distribution of the deletes.
  • a leaf insert/delete may trigger additional insert and/or delete operations on the TCAMs of DUOS.
  • the number of moves shown in Figure 56 includes the ITCAM moves resulting from ITCAM operations triggered by LTCAM inserts and deletes as well (for example, when a leaf prefix is inserted, an insert into the LTCAM is performed, and the new leaf's parent prefix (if any) is deleted from the LTCAM and reinserted into the ITCAM).
  • the relative performance of the 4 memory management mechanisms for ITCAM is quite similar to that observed for a simple TCAM organization and the third memory management mechanism outperforms the remaining memory management mechanisms handily.
  • Figure 57 shows a table 5700 of the number of waitWrites generated in the ITCAM and it is found that the third memory management mechanism is the best for this metric as expected from the smaller number of moves required by the third memory management mechanism.
  • Figure 58 shows a table 5800 of the number of LTCAM moves required by the test update sequences.
  • the number of LTCAM moves is zero (recall that, in an LTCAM, an insert may be done in any free slot and a slot freed by a delete is simply linked to the free space list).
  • the total number of moves for the simple TCAM is between 17-24 times that for DUOS using the first memory management mechanism (PLO_OPT), between 9-14 times using the second memory management mechanism, 7-227 times using the third memory management mechanism, and 8-13 times using the fourth memory management mechanism (CAO_OPT).
  • the number of waitWrites in an LTCAM equals the number of inserts and deletes on the LTCAM, and waitWriteValidates in an LTCAM have a null wait, as no invalid slot is involved in an ongoing lookup. This is ensured by using invalidateWaitWrite to free a slot. Note that invalidateWaitWrite waits until an ongoing lookup is complete and then invalidates the slot. Since updates are done serially in the control plane, invalidateWaitWrites from an LTCAM delete must complete before the next update operation begins.
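The two primitives can be modeled as below; the tcam object and its wait_for_ongoing_lookup, write, and set_valid methods are assumed interfaces introduced for illustration only.

    def waitWriteValidate(tcam, slot, prefix, next_hop):
        # The target slot is invalid; in the LTCAM this wait is null,
        # since an invalid slot cannot be involved in an ongoing lookup.
        tcam.wait_for_ongoing_lookup()
        tcam.write(slot, prefix, next_hop)
        tcam.set_valid(slot, True)            # entry becomes visible last

    def invalidateWaitWrite(tcam, slot):
        # Wait until any ongoing lookup (which may match this slot) is
        # complete, then invalidate; updates are serial, so the freed
        # slot is safe to reuse by the time the next update begins.
        tcam.wait_for_ongoing_lookup()
        tcam.set_valid(slot, False)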
  • FIG. 59 shows a table 5900 of the number of LTCAM prefixes carved by Lu's carving heuristic [4] and the carving heuristic of Section 4 discussed above. The carving by both methods is done only on the trie of leaf prefixes as only leaf prefixes are stored in the LTCAM and its associated wide SRAM.
  • the number of prefixes that result when the method of one or more embodiments of the present invention is used is fewer than when the method of [4] is used. This is surprising because the method of one or more embodiments carves out independent prefixes while the method of [4] may carve any set of prefixes.
  • the approximately 1% drop in the number of prefixes when the embodied carving method is used results from the observation that, when the embodied method is used, the carving prefixes need not be supplemented with covering prefixes, whereas covering prefixes need to be added to the set of carving prefixes generated by the method of [4].
  • Figure 60 shows a table 6000 of the number of inserts and deletes applied on the LTCAM of DUOW as well as the number of waitWrites. It was observed that the number of waitWrites for the LTCAM of DUOW is more than the number of inserts and deletes done in the LTCAM. This is in contrast to DUOS, where the number of waitWrites is the same as the number of inserts and deletes. This is because additional writes are needed in DUOW to maintain lookup consistency when the contents of an SRAM word are split or merged or when a suffix is added to or deleted from an existing SRAM word.
  • the 1-12Wc architecture required between 209 and 227 buckets, thereby using up between 107008 and 116224 DLTCAM slots.
  • the number of moves resulting from bucket splits varied from 0 to 1085.
  • the M-12Wb architecture is more space efficient, requiring between 128 and 153 buckets, thereby using up between 65536 and 78336 DLTCAM slots.
  • the number of moves is between 800 and 15753 when M-12Wb is used. (It is shown below that the worst-case number of moves for these two architectures is comparable.)
  • the number of waitWrites is more than the number of inserts and deletes, and for the DLTCAM there is an additional source of writes: prefix moves resulting from bucket overflows.
  • MIPS [19] and an update-consistent version of CAO_OPT [23], obtained using the method of [18], are the competitors of DUO.
  • the consistent update TCAM architectures MIPS, CAO_OPT, and DUO are compared.
  • a data plane lookup is delayed if the lookup matches a TCAM slot whose next hop information is being updated. To avoid this delay while changing the next hop of a prefix, a new entry with the latest next hop is first inserted, and then the existing entry is deleted, in the experiments for MIPS. This ensures that data plane lookups are consistent and correct and are not delayed by control plane operations.
  • the MIPS architecture as described in [19] uses no memory management architecture and free slots are determined using TCAM lookups that delay data plane lookups.
  • the MIPS mechanism of [19] was augmented with the memory management architecture employed by the embodiment for the LTCAM (Section 4 above).
  • for DUO, memory management is done using Mechanism 3. Since the performance of the 3 TCAM mechanisms is characterized by the total number of waitWrite operations required by an update sequence as well as the maximum number of operations for an individual update request, the experiments measured these quantities.
  • Figure 65 shows a table 6500 of the total number of waitWrites required to perform the test update sequences. It can be seen that the DUO architecture of one or more embodiments of the present invention requires fewer write operations than MIPS and CAO_OPT.
  • the average number of waitWrites per operation ranged from a low of 1.5729 to a high of 3.1848 for MIPS, from 1.4908 to 1.6378 for CAO_OPT, from 1 to 1.0639 for DUOS, from 1.0008 to 1.3305 for DUOW, from 1.0008 to 1.3635 for IDUOW with 1-12Wc, and from 1.0053 to 1.4714 for IDUOW with M-12Wb.
  • Figure 66A illustrates a graph 6600 of the normalized average waitWrites for the different architectures. For this figure, the average number of waitWrites per insert/delete was first computed for each dataset. Similarly, Figure 66B illustrates a graph 6602 of the normalized power for the different architectures.
  • Figure 67 shows a table 6700 of the maximum number of write operations required by an insert or delete in the test update sequences.
  • MIPS uses a larger number of writes in the worst case than any of the remaining mechanisms. It was noticed that the worst-case number of writes for rrc00 May20 is particularly large for MIPS. This is because the update sequence for rrc00 May20 contains announcements and withdrawals of routes for prefixes of small lengths, such as 2 and 4. Each of these translates into a very large number of inserts/deletes of independent prefixes.
  • the DUOS and DUOW architectures of one or more embodiments of the present invention have better worst-case performance (on a per update basis) than MIPS.
  • DUOS is generally better than CAO_OPT.
  • although the worst-case number of writes with IDUOW is more than that for CAO_OPT, the number of writes is bounded by the size of a bucket.
  • the worst-case writes may be reduced by using a smaller bucket size than the 512 size used in the experiments.
  • with the bucket size set to 32, the maximum number of write operations in the DLTCAM of IDUOW is also 32. This is because when an index node is split, the split node that has the smaller number of DLTCAM prefixes is relocated in one or more embodiments. Thus, at most 16 prefixes are moved, and hence there are at most 32 write operations.
  • it is possible for each update in MIPS to require a number of TCAM writes equal to the number of prefixes in the table. This happens, for example, when there is a trie in which no leaf prefix has a sibling after the leaf pushing and prefix compression steps, and a default prefix of length 0 is inserted into or deleted from that trie (see Figure 2).
  • with W = 32 for IPv4, DUOS requires (W + 2) writes in the worst case.
  • for DUOW, the worst case scenario is the same as that for DUOS, except that an LTCAM insert can require 3 writes when an SRAM word is split (1 delete to remove the split word and 2 inserts for the new words).
  • an LTCAM delete can also require 3 writes when an SRAM word is merged (2 deletes for the two words merged and 1 insert for the new word).
  • DUOW requires (W + 6) writes in the worst case.
  • for IDUOW, the worst-case combination involves the ITCAM, ILTCAM and DLTCAM. IDUOW requires at most W writes for the ITCAM, 6 writes for the ILTCAM, and bucketSize writes for the DLTCAM, for a maximum of (W + bucketSize + 6) writes for a single update.
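  • For example, with W = 32 for IPv4, the 512-slot buckets used in the experiments give a worst-case bound of 32 + 512 + 6 = 550 writes for a single IDUOW update, while a bucket size of 32 (as noted above) lowers this bound to 32 + 32 + 6 = 70 writes.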
  • Figure 68 shows a table 6800 of the power consumption characteristics of MIPS, CAO_OPT and DUO in terms of the number of entries enabled during a search operation.
  • the TCAM entries are counted based on the initial layout of prefixes for the input routing table.
  • MIPS, CAO_OPT, DUOS and DUOW enable all valid TCAM entries during a search operation.
  • IDUOW enables all valid TCAM entries for ITCAM and ILTCAM, and only a bucket of entries for DLTCAM.
  • Column 2 shows the number of enabled entries for MIPS, and column 3 shows the number of enabled entries for CAO_OPT on the simple TCAM and also for DUOS, which is obtained by summing up the number of ITCAM and LTCAM entries.
  • Both CAO_OPT and DUOS have the same number of entries in TCAM since they store each prefix in a single TCAM entry.
  • Column 4 shows the number of enabled entries for DUOW, which is obtained as the sum of valid ITCAM and LTCAM entries.
  • Columns 5 and 6 show the number of enabled entries for IDUOW with 1-12Wc and M-12Wb, respectively. This number is obtained as the sum of valid entries in the ITCAM and ILTCAM plus the number of entries in a bucket in the DLTCAM (fixed to 512 for the experiments). It is observed that for MIPS, the leaf pushing and prefix compression steps have reduced the number of TCAM entries, and hence the power, compared to CAO_OPT and DUOS.
  • MIPS requires about 1.5 to 2 times the power required by DUOW for all the tests, except rrc06 and rrc15. In the case of rrc06, MIPS requires about 7% more power than DUOW, while it requires about 7% less power on rrc15. MIPS consumes between 3 and 10 times the power consumed by IDUOW.
  • Figure 66(b) shows the normalized average power for the different mechanisms. For this figure, the average number of enabled entries is first computed for every TCAM search for each architecture. Then, the average was normalized by the average number of enabled entries for IDUOW with 1-12Wc. Note that the power requirement for DUOW can be reduced further by using a wider SRAM than the 144-bit wide SRAM used for the experiments.
  • the power requirements for IDUOW may be reduced by increasing SRAM width and by adding an index TCAM and a wide SRAM to the ITCAM.
  • the power consumed by the DLTCAM and ILTCAM of IDUOW was less than 560 for the 1-12Wc mechanism and less than 630 for the M-12Wb mechanism.
  • the power requirement for the ITCAM is expected to approximate that for the LTCAM (assuming the same bucket size is used). So, the IDUOW power requirement would drop to about 1120 for 1-12Wc and about 1260 for M-12Wb. So, with the addition of an index TCAM and a wide SRAM to the ITCAM of IDUOW, the power required by MIPS is between 68 and 248 times that required by IDUOW.
  • a dual TCAM architecture, DUO, for routing tables is provided by various embodiments of the present invention.
  • Four memory management mechanisms are also provided for the ITCAM of DUO.
  • memory management mechanism 3 (DLFS_PLO), which maintains free slots at TCAM block boundaries as well as free slot lists within each block, was found to perform best on the test data, requiring between 1/74000 and 1/93 times the number of moves required by its nearest competitor, memory management mechanism 4, which is based on CAO_OPT [23].
  • the DUO architectures of one or more embodiments, like those based on CoPTUA [18], provide for consistent data-plane lookups and incremental control-plane updates that do not delay data-plane lookups.
  • MIPS reduced power consumption by between 4% and 69% relative to CAO_OPT and DUOS (which consume the same amount of power as each other).
  • MIPS generally required between 1.5 and 2 times the power required by DUOW and between 3 and 10 times that required by one embodiment of IDUOW.
  • when the ITCAM of IDUOW is enhanced with a TCAM index and a wide SRAM, the power required by MIPS is between 68 and 248 times that required by the enhanced IDUOW. Further reductions in the power required by DUOW and IDUOW result from using a wider SRAM than the 144-bit wide SRAM used in the experiments.
  • the DUO architectures outperform MIPS and CAO_OPT in terms of the total number of writes needed to perform an update. Additionally, the DUOW and IDUOW architectures use significantly less power than used by MIPS, CAO_OPT, and DUOS.
  • Referring now to FIG. 70, an information processing system 7000 is illustrated.
  • FIG. 70 only shows one environment in which a TCAM is applicable.
  • the various embodiments of the present invention are not limited to a single information processing system or an information processing system in general.
  • TCAMs can be utilized within a wide variety of electronic devices.
  • FIG. 70 is a block diagram illustrating a detailed view of an information processing system 7000 according to one embodiment of the present invention.
  • the information processing system is based upon a suitably configured processing system adapted to implement one or more embodiments of the present invention. Any suitably configured processing system, such as a personal computer, workstation, or the like, is similarly able to be used as the information processing system 7000 by embodiments of the present invention.
  • the information processing system 7000 includes a computer 7002.
  • the computer 7002 has one or more processors 7004 that are connected to one or more memories 7008 that can implement the TCAM architectures shown in Figure 3 and Figure 70 comprising the DUOS, DUOW, and IDUOW embodiments discussed above.
  • the one or more processors 7004 are also coupled to a mass storage interface 7010 and network adapter hardware 7012.
  • a system bus 7014 interconnects these system components.
  • the mass storage interface 7010 is used to connect mass storage devices, such as data storage device 7016, to the information processing system 7000.
  • One specific type of data storage device is an optical drive such as a CD/DVD drive, which may be used to store data to and read data from a computer readable medium or storage product such as (but not limited to) a CD/DVD 7018.
  • Another type of data storage device is a data storage device configured to support, for example, NTFS type file system operations.
  • the information processing system 7000 utilizes conventional virtual addressing mechanisms to allow programs to behave as if they have access to a large, single storage entity, referred to herein as a computer system memory, instead of access to multiple, smaller storage entities such as the memories 7008 and the data storage device 7016.
  • the term computer system memory is used herein to generically refer to the entire virtual memory of the information processing system 7000.
  • Embodiments of the present invention further incorporate interfaces that each include separate, fully programmed microprocessors that are used to off-load processing from the CPU 7004.
  • An operating system (not shown) included in the main memory is a suitable multitasking operating system such as the Linux, UNIX, Windows XP, or Windows Server operating system.
  • Embodiments of the present invention are able to use any other suitable operating system.
  • Some embodiments of the present invention utilize architectures, such as an object oriented framework mechanism, that allow instructions of the components of the operating system (not shown) to be executed on any processor located within the information processing system 7000.
  • the network adapter hardware 7012 is used to provide an interface to a network 7020.
  • Embodiments of the present invention are able to be adapted to work with any data communications connections including present day analog and/or digital techniques or via a future networking mechanism.
  • the present invention can be realized in hardware, software, or a combination of hardware and software.
  • a system according to one embodiment of the present invention can be realized in a centralized fashion in one computer system or in a distributed fashion where different elements are spread across several interconnected computer systems. Any kind of computer system - or other apparatus adapted for carrying out the methods described herein - is suited.
  • a typical combination of hardware and software could be a general purpose computer system with a computer program that, when being loaded and executed, controls the computer system such that it carries out the methods described herein.
  • PETCAM: A Power Efficient TCAM for Forwarding Tables, IEEE Symposium on Computers and Communications, 2009.

Abstract

Various embodiments manage router tables by classifying a set of prefixes in a plurality of router table prefixes as a set of leaf prefixes and a remaining set of prefixes in the plurality of router table prefixes as a set of internal prefixes. The set of internal prefixes is stored in a first ternary content addressable memory (TCAM). The set of leaf prefixes is stored in a second TCAM. A corresponding destination hop is stored in a first random access memory (RAM). A corresponding destination hop is stored in a second RAM. A packet with at least one destination address is received. A simultaneous lookup is performed in the first and second TCAMs to retrieve up to two index values using the destination address. A next hop is retrieved from the second RAM in response to the second TCAM returning an index. The packet is then routed to the next hop.

Description

DUO-DUAL TCAM ARCHITECTURE FOR ROUTING TABLES
WITH INCREMENTAL UPDATE
Cross-Reference To Related Applications
This application is based upon and claims priority to U.S. Provisional Patent Application Serial No. 61/300,945, filed February 3, 2010, the disclosure of which is hereby incorporated by reference in its entirety.
Statement Regarding Federally Sponsored Research
This invention was made with Government support under Contract No.: 0829916. The Government has certain rights in this invention.
Field of the Invention The present invention generally relates to the field of content addressable memories, and more particularly relates to ternary content addressable memories (TCAMs).
Background of the Invention
The high-speed table lookup property of ternary content addressable memories (TCAMs) is a key feature for implementation of fast engines to be used in packet forwarding. Some conventional implementations of TCAMs generally do not use any memory management mechanisms to keep track of the free slots in the TCAM and instead rely on either a TCAM search operation to find an empty slot when such a slot is needed or keep all unused slots in a contiguous block and allocate from either end of this block when a free slot is needed. Since a TCAM usually cannot perform a data plane search concurrent with a control plane search, update operations that perform TCAM searches delay data plane lookups. As TCAM lookups consume a significant amount of energy relative to that consumed by TCAM read/write operations, using lookups to locate free TCAM slots increases total energy consumption for updates significantly. Other conventional implementations of TCAMs that are directed at reducing the total TCAM power used to search routing tables of a given size unfortunately increase the total TCAM size needed relative to non- indexed TCAMs.
Brief Summary In one embodiment, a method for managing router tables is disclosed. The method comprises classifying a set of prefixes in a plurality of router table prefixes as a set of leaf prefixes and a remaining set of prefixes in the plurality of router table prefixes as a set of internal prefixes. A leaf prefix is not a prefix of another prefix in a router table. The set of internal prefixes is stored in a first ternary content addressable memory. The set of leaf prefixes is stored in a second ternary content addressable memory. A corresponding destination hop is stored in a first random access memory for each internal prefix stored in the first ternary content addressable memory. A corresponding destination hop is stored in a second random access memory for each leaf prefix stored in the second ternary content addressable memory. A packet with at least one destination address is received. A simultaneous lookup is performed in the first ternary content addressable memory and the second ternary content addressable memory to retrieve up to two index values using the destination address. A next hop is retrieved from the second random access memory in response to the second ternary content addressable memory returning an index. The packet is then routed to the next hop.
In another embodiment, an information processing system for managing router tables is disclosed. The information processing system comprises a processor and a first ternary content addressable memory that is coupled to the processor. A second ternary content addressable memory is coupled to the processor. A first random access memory and a second random access memory are also coupled to the processor. The processor is configured to perform a method comprising classifying a set of prefixes in a plurality of router table prefixes as a set of leaf prefixes and a remaining set of prefixes in the plurality of router table prefixes as a set of internal prefixes. A leaf prefix is not a prefix of another prefix in a router table. The set of internal prefixes is stored in a first ternary content addressable memory. The set of leaf prefixes is stored in a second ternary content addressable memory. A corresponding destination hop is stored in a first random access memory for each internal prefix stored in the first ternary content addressable memory. A corresponding destination hop is stored in a second random access memory for each leaf prefix stored in the second ternary content addressable memory. A packet with at least one destination address is received. A simultaneous lookup is performed in the first ternary content addressable memory and the second ternary content addressable memory to retrieve up to two index values using the destination address. If there is a match in the second ternary content addressable memory, then the next hop is retrieved from the second random access memory. Lookup in the first ternary content addressable memory takes more time due to the presence of a priority encoder, used to select the best match among multiple matches. Hence, if a match is found in the second ternary content addressable memory then the yet-to-complete lookup in the first ternary content addressable memory is aborted. Otherwise, if there is no match in the second ternary content addressable memory, then the next hop is retrieved from the first random access memory corresponding to the best matching entry in the first ternary content addressable memory. The packet is then routed to the next hop.
In yet another embodiment, a computer program product for managing router tables is disclosed. The computer program product comprises a storage medium that is readable by a processing circuit and stores instructions for execution by the processing circuit for performing a method. The method comprises classifying a set of prefixes in a plurality of router table prefixes as a set of leaf prefixes and a remaining set of prefixes in the plurality of router table prefixes as a set of internal prefixes. A leaf prefix is not a prefix of another prefix in a router table. The set of internal prefixes is stored in a first ternary content addressable memory. The set of leaf prefixes is stored in a second ternary content addressable memory. A corresponding destination hop is stored in a first random access memory for each internal prefix stored in the first ternary content addressable memory. A corresponding destination hop is stored in a second random access memory for each leaf prefix stored in the second ternary content addressable memory. A packet with at least one destination address is received. A simultaneous lookup is performed in the first ternary content addressable memory and the second ternary content addressable memory to retrieve up to two index values using the destination address. If there is a match in the second ternary content addressable memory, then the next hop is retrieved from the second random access memory. Lookup in the first ternary content addressable memory takes more time due to the presence of a priority encoder, used to select the best match among multiple matches. Hence, if a match is found in the second ternary content addressable memory then the yet-to-complete lookup in the first ternary content addressable memory is aborted. Otherwise, if there is no match in the second ternary content addressable memory, then the next hop is retrieved from the first random access memory corresponding to the best matching entry in the first ternary content addressable memory. The packet is then routed to the next hop.
Brief Description of the Drawings
The accompanying figures where like reference numerals refer to identical or functionally similar elements throughout the separate views, and which together with the detailed description below are incorporated in and form part of the specification, serve to further illustrate various embodiments and to explain various principles and advantages all in accordance with the present invention, in which:
Figure 1 illustrates the high level functions of control and data planes;
Figure 2 illustrates an insertion and deletion into a TCAM implementing leaf pushing in MIPS;
Figure 3 illustrates a dual TCAM with a SRAM according to one embodiment of the present invention;
Figures 4A to 4C illustrate DUOS for an example 5-prefix forwarding table according to one embodiment of the present invention;
Figure 5 shows a table of control-plane trie functions according to one embodiment of the present invention;
Figure 6 shows a table of functions used for incremental update according to one embodiment of the present invention; Figure 7 shows one algorithm for inserting into DUOS according to one embodiment of the present invention;
Figure 8 shows one algorithm to delete from DUOS according to one embodiment of the present invention;
Figure 9 shows one algorithm to change a next hop in DUOS according to one embodiment of the present invention;
Figure 10 shows algorithms to insert, delete or change a prefix in the ITCAM of DUOS according to one embodiment of the present invention;
Figure 11 shows algorithms to insert, delete or change a prefix in the LTCAM of DUOS according to one embodiment of the present invention; Figure 12 shows one algorithm for a move function from an ITCAM[src] to an ITCAM[dest] according to one embodiment of the present invention;
Figures 13A to 13E illustrate a prefix arrangement in ITCAM for a first memory management mechanism for Internet Protocol version 4 (IPv4) according to one embodiment of the present invention;
Figure 14 illustrates an algorithm associated with the first memory management mechanism for getting a free slot to insert a prefix whose length is len according to one embodiment of the present invention;
Figure 15 illustrates an algorithm associated with the first memory management mechanism to free a slot previously occupied by a prefix of length len according to one embodiment of the present invention;
Figures 16A to 16E illustrate an ITCAM layout for a second memory management mechanism (also known as DFS_PLO (Distributed Free Space with Prefix Length Ordering Constraint)) according to one embodiment of the present invention;
Figure 17 illustrates an algorithm associated with the second memory management mechanism for getting a free slot to insert a prefix whose length is len according to one embodiment of the present invention;
Figure 18 illustrates supporting algorithms for the algorithm shown in Figure 17 according to one embodiment of the present invention; Figure 19 illustrates an algorithm associated with the second memory management mechanism to free a slot according to one embodiment of the present invention; Figures 20A to 20G illustrate an ITCAM layout for a third memory management mechanism (also known as DLFS_PLO (Distributed and Linked Free Space with Prefix Length Ordering Constraint))with move for insert and delete according to one embodiment of the present invention;
Figure 21 illustrates an algorithm associated with the third memory management mechanism for getting a free slot to insert a prefix whose length is len according to one embodiment of the present invention;
Figure 22 illustrates supporting algorithms for the algorithm shown in Figure 21 according to one embodiment of the present invention;
Figure 23 illustrates an algorithm associated with the third memory management mechanism to free a slot according to one embodiment of the present invention; Figure 24 illustrates a getSlot algorithm associated with the fourth memory management mechanism according to one embodiment of the present invention;
Figure 25 illustrates a freeSlot algorithm associated with the fourth memory management mechanism according to one embodiment of the present invention;
Figure 26 illustrates supporting control plane trie algorithms used by the getSlot and freeSlot algorithms associated with the fourth memory management mechanism according to one embodiment of the present invention;
Figures 27A and 27B illustrate carving using a conventional method and carving according to one embodiment of the present invention;
Figure 28 illustrates one algorithm to carve a leaf trie to obtain disjoint Q(N)s according to one embodiment of the present invention;
Figure 29 illustrates a DUOW algorithm to insert a prefix into the LTCAM according to one embodiment of the present invention;
Figure 30 illustrates one algorithm to add a suffix to a wide LSRAM word according to one embodiment of the present invention; Figure 31 illustrates one algorithm to split a wide LSRAM word into two according to one embodiment of the present invention;
Figure 32 illustrates a DUOW algorithm to delete a leaf prefix according to one embodiment of the present invention; Figure 33 illustrates a DUOW algorithm to change the next hop of a leaf prefix according to one embodiment of the present invention;
Figure 34 illustrates an assignment of the prefixes shown in Figures 4A to 4C to the two TCAMs in the dual TCAM architecture according to one embodiment of the present invention; Figure 35 illustrates a DLTCAM insert algorithm for IDUOW according to one embodiment of the present invention;
Figure 36 illustrates algorithm for adding a suffix to a DLSRAM word in IDUOW according to one embodiment of the present invention;
Figure 37 illustrates one algorithm for splitting a DLSRAM word in IDUOW according to one embodiment of the present invention;
Figure 38 illustrates one algorithm for deleting a leaf prefix in IDUOW according to one embodiment of the present invention;
Figure 39 illustrates one algorithm for changing the next hop of a leaf prefix in IDUOW according to one embodiment of the present invention; Figure 40 illustrates one algorithm for deleting prefixes in IDUOW according to one embodiment of the present invention;
Figure 41 illustrates a conventional 1-12Wc configuration;
Figure 42 illustrates a 1-12Wc configuration according to one embodiment of the present invention;
Figure 43 illustrates an algorithm for assigning a new bucket in a modified 1-12Wc configuration according to one embodiment of the present invention;
Figure 44 illustrates a conventional M-12Wb configuration;
Figure 45 illustrates a Visit algorithm used in a modified M-12Wb style of prefix assignment according to one embodiment of the present invention;
Figure 46 illustrates a Split-a-node algorithm used in a M-12Wb style of prefix assignment according to one embodiment of the present invention;
Figure 47 illustrates an Assign-a-new-bucket algorithm used in a M-12Wb style of prefix assignment according to one embodiment of the present invention; Figure 48 illustrates an Increment and Decrement room algorithm used in a M-12Wb style of prefix assignment according to one embodiment of the present invention;
Figure 49 shows the characteristics of datasets stored in a simple TCAM;
Figure 50 shows the total number of prefix moves (i.e., number of invocations of move()) required for inserts (includes raw inserts and change next hop inserts) and deletes in test update sequences when the prefixes are stored in a simple TCAM;
Figure 51 shows the average number of prefix moves (i.e., number of invocations of move()) required for inserts and deletes in test update sequences when the prefixes are stored in a simple TCAM;
Figure 52 shows the number of waitWrites (sum of invocations of waitWriteValidate() and invalidateWaitWrite()), which is equal to the sum of inserts, deletes and moves for the simple TCAM and reflects the update performance for the four memory management mechanisms of various embodiments of the present invention;
Figure 53A shows the normalized average number of moves for each memory management mechanism of one or more embodiments of the present invention on a logarithmic scale;
Figure 53B shows the normalized average waitWrites invoked by the memory management mechanisms of various embodiments of the present invention;
Figure 54 shows the number of moves for inserts based on TCAM occupancy using the third memory management mechanism according to one embodiment of the present invention;
Figure 55 shows a distribution of prefixes, inserts, and deletes for DUOS according to one embodiment of the present invention;
Figure 56 shows a number of moves for inserts and deletes in the ITCAM of DUOS according to one embodiment of the present invention;
Figure 57 shows the number of waitWrites in the ITCAM of DUOS according to one embodiment of the present invention;
Figure 58 shows the number of LTCAM moves and waitWrites for DUOS according to one embodiment of the present invention;
Figure 59 shows the number of prefixes to be stored in the LTCAM and associated wide SRAM according to one embodiment of the present invention;
Figure 60 shows the number of waitWrites in the LTCAM of DUOW according to one embodiment of the present invention; Figure 61 shows the number of waitWrites for the ILTCAM of IDUOW according to one embodiment of the present invention;
Figure 62 shows statistics for the DLTCAM of IDUOW using a 1-12Wc according to one embodiment of the present invention; Figure 63 shows statistics for the ILTCAM of IDUOW using a M-12Wb according to one embodiment of the present invention;
Figure 64 shows statistics for the DLTCAM of IDUOW using a M-12Wb according to one embodiment of the present invention;
Figure 65 shows the total number of waitWrites required to perform the test update sequences using different architectures;
Figure 66 A shows the normalized average waitWrites for the different architectures;
Figure 66B shows the normalized average power for the different architectures;
Figure 67 shows the maximum number of write operations required by an insert or delete in the test update sequences; Figure 68 shows the power consumption characteristics of MIPS, CAO_OPT and DUO in terms of the number of entries enabled during a search operation;
Figure 69 illustrates one example of the extra level of TCAM (index); and
Figure 70 illustrates one example of an operating environment according to one embodiment of the present invention. Detailed Description
As required, detailed embodiments of the present invention are disclosed herein; however, it is to be understood that the disclosed embodiments are merely examples of the invention, which can be embodied in various forms. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a basis for the claims and as a representative basis for teaching one skilled in the art to variously employ the present invention in virtually any appropriately detailed structure and function. Further, the terms and phrases used herein are not intended to be limiting; but rather, to provide an understandable description of the invention. The terms "a" or "an", as used herein, are defined as one or more than one. The term plurality, as used herein, is defined as two or more than two. The term another, as used herein, is defined as at least a second or more. The terms including and/or having, as used herein, are defined as comprising (i.e., open language). The term coupled, as used herein, is defined as connected, although not necessarily directly, and not necessarily mechanically. The terms program, software application, and other similar terms as used herein, are defined as a sequence of instructions designed for execution on a computer system. A program, computer program, or software application may include a subroutine, a function, a procedure, an object method, an object implementation, an executable application, an applet, a servlet, a source code, an object code, a shared library/dynamic load library and/or other sequence of instructions designed for execution on a computer system.
1 Identified Problems
Research on TCAM routers has focused on lowering the power consumption [See 11, 8, 4, 9, 10, 2, 22, 21, 15, 14, 16, 17, which are hereby incorporated by reference in their entireties], creating new router architectures involving multiple TCAMs that achieve even faster lookup [See 27, 26, which are hereby incorporated by reference in their entireties], and developing efficient strategies for incremental updates [See 23, 18, 19, which are hereby incorporated by reference in their entireties]. Various embodiments of the present invention provide router architectures that have efficient support for incremental updates. The following is an overview of TCAM incremental updates.
Shah and Gupta [23] describe incremental update algorithms for TCAMs using two different strategies to place prefixes in the TCAM. In PLO_OPT (prefix length ordering constraint), the prefixes are placed in the TCAM in decreasing order of length. Unused TCAM slots/words are in the middle of the TCAM. So, prefixes of length W, ..., W/2 + 1 are above the free slots and the remaining prefixes are below the free slots, where W = 32 for IPv4. An insert or delete requires at most W/2 prefix moves in PLO_OPT. In CAO_OPT (chain-ancestor ordering constraint), the prefixes are placed in the TCAM so that if two prefixes are nested, the longer prefix precedes the shorter one. Starting with the binary trie representation of the prefixes of the routing table, the prefixes along any path from the trie root to a trie leaf are nested. So, every root to leaf path in the trie defines a chain of nested prefixes. In CAO_OPT, the prefixes on every chain appear in reverse order in the TCAM. This placement ensures that the first prefix in the TCAM that matches a destination address is the longest matching prefix. The TCAM free slots are in the middle of the TCAM. If the maximum number of prefixes in a nested chain is q, then at most ⌈q/2⌉ prefixes of a chain are above the free slots. An insert or delete in CAO_OPT requires at most ⌈q/2⌉ ≤ W/2 moves. Since q is about 6 in practical routing tables, CAO_OPT gives a performance improvement over PLO_OPT in practice (though the worst-case performance of both is the same).
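To illustrate the PLO_OPT move chain, the following sketch (with hypothetical bookkeeping arrays top[] and bot[] bounding the contiguous block of each prefix length) inserts into the upper half by moving one entry per intervening non-empty length block, which is the source of the at-most-W/2 bound; it models the idea of [23] and is not the authors' implementation.

    W = 32  # IPv4

    def plo_insert_upper(tcam, top, bot, free_slot, length, entry):
        """Insert `entry` (W/2 < length <= W) and return the number of
        prefix moves; the caller advances its free-pool pointer by one.
        top[l]/bot[l] bound the contiguous block of length-l prefixes
        (None when empty); slot 0 holds the longest prefixes and the
        free pool sits just below the length W/2+1 block."""
        moves = 0
        slot = free_slot                      # topmost free slot
        for l in range(W // 2 + 1, length):   # blocks between pool and target
            if top[l] is None:
                continue                      # empty block occupies no slots
            tcam[slot] = tcam[top[l]]         # block's top entry -> new bottom
            slot, top[l], bot[l] = top[l], top[l] + 1, slot
            moves += 1
        tcam[slot] = entry                    # slot now adjoins block `length`
        bot[length] = slot
        if top[length] is None:
            top[length] = slot
        return moves

Deletes are symmetric: the freed slot is migrated to the free-pool boundary by the same chain of single moves, and the lower half (lengths W/2 down to 0, below the pool) mirrors this logic on the other side of the pool.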
Wang et al. [18] define a consistent rule table to be a rule table in which the rule matched (including the action associated with the rule) by a lookup operation performed in the data plane is either the rule (including action) that would be matched just before or just after any ongoing update operation in the control plane. Wang et al. [18] develop a scheme for consistent table update without locking the TCAM at any time, essentially allowing a search to proceed while the table is being updated. Consistency is ensured by avoiding overwriting of a TCAM entry. Their CoPTUA algorithm can be applied to the PLO_OPT and CAO_OPT mechanisms of [23] so that rule updates can be carried out without locking the table for data plane lookups under suitable assumptions for TCAM operation [18].
Wang and Tzeng [19] also propose a consistent TCAM mechanism. Their mechanism, MIPS, however, delays data plane lookups that match TCAM slots whose next hop information is being updated. In MIPS, the TCAM stores a set of independent (i.e., disjoint) prefixes. This set of independent prefixes is obtained from the original set of prefixes by using the leaf pushing technique [See 25, which is hereby incorporated by reference in its entirety] followed by a compression step. Since the prefixes in the TCAM are independent, at most one prefix matches any given destination address. Hence, the independent prefixes may be placed in the TCAM in any order and the priority encoder logic of the TCAM can be dispensed with, which results in a reduction in TCAM lookup latency by about 50% [See 24, which is hereby incorporated by reference in its entirety]. Further, a new prefix may be inserted into any free slot of the TCAM and an old prefix deleted by simply setting the associated slot's valid bit to 0. While the use of an independent prefix set simplifies table management, leaf pushing replicates a prefix many times. In the worst case, an insert or delete requires changes to Θ(n) TCAM entries, where n is the number of independent prefixes in the TCAM (Figures 2A and 2B). With respect to Figure 2A (which shows a 4-prefix trie 200), insertion of the root prefix requires the insertion of 4 new independent prefixes into the TCAM. Similarly, with respect to Figure 2B, which illustrates another trie 202 (where Insert <*/0, H0> is performed), the deletion of the root prefix requires withdrawal of these 4 prefixes from the TCAM.
Furthermore, the number of independent prefixes that result from leaf pushing and compression can be quite large as, in the worst case, the compression step may fail to do any reduction in the prefix set following leaf pushing. Experimental results presented in [19] suggest, however, that, on practical rule sets, leaf expansion and compression actually reduce the number of prefixes by 20% to 68% because of the prevalence of a large number of redundant prefixes in practical rule sets. Further, each update operation results in between one and two accesses to the TCAM on average. Wang and Tzeng [19] do not use any memory management mechanisms to keep track of the free slots in the TCAM and instead rely on a TCAM search operation to find an empty slot when such a slot is needed. Since a TCAM cannot perform a data plane search concurrent with a control plane search, update operations delay data plane lookups. In practice, since the number of updates per second is quite small and since each routing table update results in only one or two TCAM update operations (on average), the delay caused by control plane lookups on data plane lookups is quite small. As TCAM lookups consume a significant amount of energy relative to that consumed by TCAM read/write operations, using lookups to locate free TCAM slots increases total energy consumption for updates significantly. Zane et al. [21] propose an indexed TCAM scheme to reduce the total TCAM power used to search routing tables of a given size. The indexed TCAM schemes of [21], however, increase the total TCAM size needed relative to non-indexed TCAMs. Lu and Sahni [4] couple indexed TCAMs with wide SRAMs to reduce both power and TCAM memory by a significant amount. Although the strategies of [4] are power and memory efficient, they are not well suited to incremental update. Similarly, the prefix compaction methods of [11, 8, 22], while resulting in power and memory reduction, do not lend themselves well to incremental update. Chang [2] proposes a TCAM partitioning and indexing scheme in which the TCAM index is stored in a pivot prefix SRAM and an index SRAM. In Chang's scheme [2], the TCAM index is searched using a binary search that makes O(log K) SRAM accesses to determine the TCAM bucket that is to be searched. On the other hand, the scheme of Zane et al. [21] stores its index in a TCAM, enabling the determination of the bucket for further search by a query on the index TCAM. As a result, a lookup takes 2 TCAM searches when the mechanism of [21] is used and takes 1 TCAM search plus O(log K) SRAM accesses when the scheme of [2] is used.
2 Overview The primary function of an Internet router is to forward packets using a table of rules. A packet forwarding rule (P, H) comprises a prefix P and a next hop H. A packet with destination address d is forwarded to H where H is the next hop associated with the rule that has the longest prefix that matches d. The set of rules is referred to as the rule table or forwarding table. Packet forwarding is performed in the data plane while route updates are done in the control plane. Whereas the data plane receives tens or even hundreds of millions of packets per second, the control plane receives only thousands of update requests per second. Figure 1 illustrates the high level functions 100 of control and data planes.
With the rapid global spread of the Internet, the forwarding table size at each router is growing fast, as is the number of route updates that are received by a router due to extensive interconnections. Presently, the largest forwarding tables have about one million rules and the number of updates peaks at about 10,000 updates per second. At a line rate of 10Gbps and a minimum packet size of 40 bytes, the number of data plane lookups per second exceeds 30 million.
A number of fast lookup mechanisms are described in the literature that use TCAMs as the main hardware component since TCAMs are simple to use and provide high-speed table lookup. See 12, 13, which also survey non-TCAM approaches to routing table management and are hereby incorporated by reference in their entireties. A TCAM is a special type of content addressable memory (CAM) that allows each memory bit to store one of the three values: 0, 1, x (don't care). The prefix of a rule is stored in a word of TCAM and the next hop is stored in the corresponding word of an associated static random access memory (SRAM). The entries of a TCAM may be searched in parallel or simultaneously for a prefix that matches a given destination address. If multiple matching entries are found then the best match is selected by a priority encoder. This priority encoder can select the best match by a first found algorithm, a last found algorithm, or any other algorithm. The best match is quite frequently identified as the first entry that matches. Using the index of the best matched TCAM entry, the corresponding SRAM word is accessed to determine the next hop. When the prefixes are stored in decreasing order of prefix length, it is possible to determine the TCAM index of the longest matching prefix for any destination address in one TCAM cycle. It should be noted that a TCAM word is 32 bits for IPv4 applications. The discussed TCAM mechanism is referred to as the simple TCAM mechanism [4].
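The lookup just described can be modeled in a few lines of Python; the entries here are hypothetical (value, mask) pairs in which masked-out bits act as don't-cares, and the sequential loop merely mimics what a real TCAM does in parallel for all entries at once.

    def tcam_lookup(tcam, sram, addr):
        for index, (value, mask) in enumerate(tcam):
            if (addr & mask) == value:        # first match = best match
                return sram[index]            # next hop from associated SRAM
        return None

    # Example with 10.1.0.0/16 stored above 10.0.0.0/8 (decreasing length):
    tcam = [((10 << 24) | (1 << 16), 0xFFFF0000), (10 << 24, 0xFF000000)]
    sram = ["hop B", "hop A"]
    assert tcam_lookup(tcam, sram, (10 << 24) | (1 << 16) | 5) == "hop B"
    assert tcam_lookup(tcam, sram, (10 << 24) | (2 << 16)) == "hop A"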
The main drawback of using TCAMs in a router's forwarding engine is that a TCAM consumes a high amount of power for each lookup operation since every TCAM cell in the array is activated for each lookup. There has been a significant amount of research in trying to reduce the power consumption in TCAMs [4, 18, 19, 23, 21, 22]. Lu and Sahni in [4] propose a technique that utilizes wide SRAMs to store portions of prefixes along with their next hops in each SRAM word. This mechanism reduces the TCAM size and power requirement drastically. The Simple TCAM with Wide SRAM (STW) organization is the basic mechanism in [4] that demonstrates the potential of saving TCAM space and power by utilizing wide SRAM words. One drawback of the STW mechanism is that incremental update algorithms are complex because of the need to handle covering prefixes that may be replicated many times. On the other hand, batch update algorithms require twice the memory footprint so that forwarding and updating can be applied on two separate copies of the forwarding table [22].
Wang et al. [18] propose a consistent table update mechanism that eliminates the need to lock the forwarding table during an update, preserving the correctness of rule matching at all times. Since lookups can proceed at their usual speed even as updates are being carried out, there is no need to minimize the number of rule moves required to incorporate an update as long as the rate of processing keeps up with the arrival rate for updates. However, this does not undermine the advantage of a fast update process requiring a smaller number of rule moves since, with a faster process, fewer packets will be forwarded to non-optimal next hops. Wang and Tzeng [19] use leaf pushing to transform the prefixes in the routing table into a set of independent prefixes, which are then stored in a TCAM (in any order). Their consistent update mechanism, however, delays data plane lookups that match TCAM slots whose next hop information is being updated. Although, on average, each insert or delete request results in a very small number of insert/delete operations on the set of independent prefixes stored in the TCAM, worst-case inserts and deletes require Θ(n) insert/delete operations on the set of independent prefixes, where n is the number of independent prefixes. Hence, an adversary can significantly compromise the router by maliciously injecting a sequence of worst-case updates. Further, the method of [19] uses a TCAM search to find a free TCAM slot for an insert, and this search interrupts the lookups taking place in the data plane.
Therefore, various embodiments of the present invention provide a novel dual TCAM architecture herein referred to as "DUO". The following discusses three embodiments of DUO, along with advanced memory management mechanisms for performing efficient and consistent incremental updates without degrading lookup speed. For example, router table updates can be performed without interrupting lookup operations in the LTCAM and ITCAM. A first embodiment of the architecture is DUOS, a dual TCAM with simple SRAM, where both the TCAMs have a simple associated SRAM that is used for storing next hops. A second embodiment of the architecture is DUOW, a dual TCAM with wide SRAM, where one or both the TCAMs have wide associated SRAMs that are used to store suffixes as well as next hops. A third embodiment is IDUOW, an indexed dual TCAM with wide SRAM, in which either or both TCAMs have an associated index TCAM.
Advantages of the dual TCAM architecture and the memory management mechanisms are: (1) Incremental updates in DUOS require far fewer rule moves than required by the simple TCAM mechanism, while the total TCAM and SRAM space used by DUOS is the same as that used by the simple TCAM mechanism. (2) The wide SRAM mechanism of [4] may be coupled with DUOS to arrive at DUOW and IDUOW, which provide a considerable reduction in TCAM memory and power while preserving the efficient incremental update capability of DUOS. (3) Employing the third memory management mechanism (discussed below) to manage the memory of a simple TCAM enables the simple TCAM to outperform the CAO_OPT mechanism (Mechanism 4) of [23] with respect to the time required to complete update sequences that arise in practice. Compared to the PLO_OPT memory management mechanism (Mechanism 1) described in [23], CAO_OPT is superior (as expected from the analysis of [23]).
The following discussion is organized as follows. The DUOS architecture and various memory management embodiments/mechanisms are discussed in Section 3, DUOW is discussed in Section 4, and IDUOW is discussed in Section 5. An experimental evaluation of DUO is presented in Section 6 and a conclusion is provided in Section 7.
3 Simple Dual TCAM - DUOS
DUOS uses any reasonably efficient data structure to store the routing-table rules in the control plane. For example, a simple data structure such as a binary trie or 1-bit trie stored in a 100ns DRAM permits about 300K IPv4 lookups, inserts, and deletes per second. This performance is quite adequate for the anticipated tens of thousands of control plane operations. For concreteness, it is assumed that a binary trie is used, in the control plane, to store the routing-table rules. Additionally, DUOS uses two TCAMs, each with an associated SRAM. The TCAMs are labeled ITCAM 302 (Interior TCAM) and LTCAM 304 (Leaf TCAM) in Figure 3. The associated SRAMs are similarly labeled ISRAM 306 and LSRAM 308. Prefixes stored in leaf (or non-leaf or interior) nodes of the control plane trie are also stored in the LTCAM (or ITCAM) and their associated next hops are stored in the LSRAM (or ISRAM). Since the LTCAM stores only leaf prefixes, the prefixes in the LTCAM are disjoint and at most one may match a given destination address, whereas the ITCAM can have multiple matches. Hence, unlike the ITCAM 302, the LTCAM 304 does not require a priority encoder 310. Therefore, the LTCAM runs approximately 50% faster than the ITCAM with its priority encoder. A data plane lookup is performed by searching for the packet's destination address in both the ITCAM and the LTCAM. The ITCAM search yields the next hop associated with the longest matching non-leaf prefix while the LTCAM search yields the next hop associated with the at most one leaf prefix that matches the destination address. Additional logic shown in Figure 3 returns the next hop (if any) from the LTCAM search; the next hop from the ITCAM search is returned only if the LTCAM search found no match. The correctness of the lookup is readily established. Figures 4A to 4C show a 5-prefix forwarding table 400 (Figure 4A) together with its corresponding binary trie 402 (Figure 4B) that is stored in the control plane, as well as the content 404 (Figure 4C) of the two TCAMs and the two SRAMs of DUOS. Each node of the control plane trie has fields such as prefix, slot, nexthop and length, in which the prefix (if any) stored at this node is recorded along with the ITCAM or LTCAM slot in which the prefix is stored and the nexthop and length of the prefix. Functions 500 for basic operations on the control plane trie (hereinafter simply referred to as the trie) are assumed and shown in Figure 5.
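For illustration only, the following is a minimal software sketch of this lookup path, not the hardware itself: each TCAM is modeled as a Python list of (value, mask, valid) entries searched by linear scan, and all names, sizes, and the 4-bit example addresses are assumptions made for readability.

```python
ltcam, lsram = [], []   # disjoint leaf prefixes: at most one match
itcam, isram = [], []   # non-leaf prefixes, kept in precedence order

def tcam_search(entries, addr):
    """Return the index of the first valid matching entry, or None."""
    for i, (value, mask, valid) in enumerate(entries):
        if valid and (addr & mask) == value:
            return i
    return None

def duos_lookup(addr):
    # A leaf match, when present, is longer than every matching ancestor
    # held in the ITCAM, so the LTCAM answer takes priority.
    i = tcam_search(ltcam, addr)
    if i is not None:
        return lsram[i]
    i = tcam_search(itcam, addr)
    return isram[i] if i is not None else None  # None = no matching prefix

# Example with 4-bit addresses: leaf prefix 10* -> hop 1, interior 1* -> hop 2.
ltcam.append((0b1000, 0b1100, True)); lsram.append(1)
itcam.append((0b1000, 0b1000, True)); isram.append(2)
assert duos_lookup(0b1011) == 1   # matches the leaf prefix 10*
assert duos_lookup(0b1100) == 2   # falls through to the interior prefix 1*
```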
As the control plane will modify the ITCAM, LTCAM, ISRAM, and LSRAM while the data plane performs lookups, the TCAMs are dual ported. Specifically, the following assumptions are made: (1) Each TCAM has two ports, which can be used to simultaneously access the TCAM from the control plane and the data plane. (2) Each TCAM entry/slot is tagged with a valid bit that is set to 1 if the content of the entry is valid, and to 0 otherwise. A TCAM lookup engages only those slots whose valid bit is 1. The TCAM slots engaged in a lookup are determined at the start of a lookup to be those slots whose valid bits are 1 at that time. Changing a valid bit from 1 to 0 during a data plane lookup does not disengage that slot from the ongoing lookup. Similarly, changing a valid bit from 0 to 1 during a data plane lookup does not engage that slot until the next lookup.
In this embodiment, the function waitWriteValidate, which writes to a TCAM slot and sets the valid bit to 1, is assumed to be available. In case the TCAM slot being written to is the subject of an ongoing data plane lookup, the write is delayed until this lookup completes. During the write, the TCAM slot being written to is excluded from data plane lookups. In one embodiment, a possible mechanism to accomplish this exclusion is to set the valid bit to 0 before commencing the write and to change this bit to 1 when the write completes. This exclusion is equivalent to the requirement that "After a rule is matched, resetting the valid bit has no effect on the action return process" [18], and to setting the valid entry to "hit" [19]. Similarly, the function invalidateWaitWrite, which sets the valid bit of a TCAM slot to 0 and then writes an address to the associated SRAM word in such a way that the outcome of the ongoing lookup is unaffected, is assumed to be available.
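As an illustrative sketch only, the two functions can be modeled as follows; wait_for_ongoing_lookup is an assumed stand-in for the hardware behavior that delays the write while a lookup may still read the slot, and is a no-op in this single-threaded model.

```python
NSLOTS = 4
tcam = [(0, 0, False)] * NSLOTS   # (value, mask, valid) per slot
sram = [0] * NSLOTS               # associated SRAM words

def wait_for_ongoing_lookup(slot):
    # In hardware this blocks until any lookup that began before the call
    # has completed; nothing to wait for in this single-threaded model.
    pass

def waitWriteValidate(slot, value, mask, nexthop):
    """Write a prefix and its next hop, then make the slot visible."""
    v, m, _ = tcam[slot]
    tcam[slot] = (v, m, False)        # exclude the slot from new lookups
    wait_for_ongoing_lookup(slot)     # null wait if no lookup reads the slot
    sram[slot] = nexthop
    tcam[slot] = (value, mask, True)  # engaged from the next lookup onward

def invalidateWaitWrite(slot, sram_word):
    """Invalidate a slot, then reuse its SRAM word (e.g., as a free-list
    link) without affecting a lookup already in progress."""
    v, m, _ = tcam[slot]
    tcam[slot] = (v, m, False)
    wait_for_ongoing_lookup(slot)
    sram[slot] = sram_word

waitWriteValidate(0, 0b1000, 0b1100, nexthop=7)
invalidateWaitWrite(0, sram_word=-1)
assert tcam[0] == (0b1000, 0b1100, False) and sram[0] == -1
```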
It is noted that waitWriteValidate may, at times, write the prefix and nexthop information to the TCAM and associated SRAM slot and validate it without any wait. This happens, for example, when the writing is to be done to a TCAM slot that is not the subject of the ongoing data plane lookup. The wait component of the function waitWriteValidate is said to be null in this case.
Figure 6 lists the various update algorithms 600 that are defined later in this section for DUOS and its associated ITCAM and LTCAM. The indentation represents the hierarchy of function calls. A function at one level of indentation calls one or more functions below it at the next level of indentation or at the same level of indentation.
3.1 DUOS Incremental Update Algorithms
3.1.1 Insert
Figure 7 shows an algorithm 700 to insert a new prefix p of length l and nexthop h for DUOS. For simplicity, it is assumed that p is, in fact, new (i.e., p is not already in the rule table). First, p is inserted into the trie using the trie insertion algorithm, which returns nodes m and n, where m is the trie node storing p and n is the nearest ancestor (if any) of m that has a prefix. When m is a leaf of the trie, there is a possibility that the insertion of p transformed a prefix that was previously a leaf prefix into a non-leaf prefix. If so, this prefix is moved from the LTCAM to the ITCAM. Regardless, p is inserted into the LTCAM. When m is not a leaf, p is inserted into the ITCAM.
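The following is a hedged sketch of this insert flow: a set of bit-strings stands in for the control plane trie, and moves between two Python sets stand in for the TCAM operations of Figures 10 and 11. The helper names are illustrative assumptions, not the patent's literal algorithms.

```python
prefixes = set()                     # all prefixes, as bit-strings like "10"
ltcam_set, itcam_set = set(), set()  # which table currently holds each prefix

def is_leaf(p):
    """A prefix is a leaf iff no other stored prefix extends it."""
    return not any(q != p and q.startswith(p) for q in prefixes)

def nearest_prefixed_ancestor(p):
    for i in range(len(p) - 1, -1, -1):
        if p[:i] in prefixes:
            return p[:i]
    return None

def duos_insert(p):
    prefixes.add(p)
    n = nearest_prefixed_ancestor(p)
    if is_leaf(p):
        # inserting p may have turned n from a leaf into a non-leaf prefix
        if n is not None and n in ltcam_set:
            ltcam_set.discard(n); itcam_set.add(n)   # LTCAM -> ITCAM move
        ltcam_set.add(p)
    else:
        itcam_set.add(p)

duos_insert("1")    # a leaf for now, so it goes to the LTCAM
duos_insert("10")   # "1" becomes an internal prefix and moves to the ITCAM
assert ltcam_set == {"10"} and itcam_set == {"1"}
```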
3.1.2 Delete
Figure 8 shows an algorithm 800 to delete the prefix p from DUOS. For simplicity, it is assumed that p is, in fact, present in the rule table and so may be deleted. First, p is deleted from the trie. The trie deletion function returns nodes m and n, where m is the trie node where p was stored and n is the nearest ancestor (if any) of m that has a prefix. If m was a leaf, then p is to be deleted from the LTCAM. In this case, the prefix (if any) in n may become a leaf prefix. If so, the prefix in n is to be moved from the ITCAM to the LTCAM. When m is not a leaf, p is deleted from the ITCAM.
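A matching sketch for the delete flow is given below, under the same illustrative set-based model as the insert sketch above: when removing a leaf prefix turns its nearest prefixed ancestor back into a leaf, that ancestor moves from the ITCAM back to the LTCAM.

```python
prefixes = {"1", "10"}                   # state left by the insert example
ltcam_set, itcam_set = {"10"}, {"1"}

def is_leaf(p):
    return not any(q != p and q.startswith(p) for q in prefixes)

def nearest_prefixed_ancestor(p):
    for i in range(len(p) - 1, -1, -1):
        if p[:i] in prefixes:
            return p[:i]
    return None

def duos_delete(p):
    was_leaf = p in ltcam_set
    n = nearest_prefixed_ancestor(p)
    prefixes.discard(p)
    if was_leaf:
        ltcam_set.discard(p)
        # deleting p may have turned n's prefix back into a leaf
        if n is not None and is_leaf(n):
            itcam_set.discard(n); ltcam_set.add(n)   # ITCAM -> LTCAM move
    else:
        itcam_set.discard(p)

duos_delete("10")
assert ltcam_set == {"1"} and itcam_set == set()
```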
3.1.3 Change
To change the nexthop of an existing prefix p to newH, first the next hop of the prefix in the trie is changed and the node m that contains p is returned. Then, depending on whether m is a leaf or non-leaf, the change function for the corresponding TCAM is invoked. Figure 9 shows an algorithm 900 for this process.
3.2 ITCAM Algorithms
The prefixes in the ITCAM are stored in such a manner as to support determining the longest matching prefix (i.e., in any topological order that conforms to the precedence constraints defined by the binary trie: p1 must come before p2 whenever p1 is a descendant of p2 [23]). Decreasing order of length is a commonly used ordering. The function getSlot(length) returns an ITCAM slot such that insertion of a new prefix of the specified length into this slot satisfies the ordering constraint in use; the function freeSlot(slot, length) frees a slot previously occupied by a prefix of the specified length and makes this slot available for reuse later. These functions, which are discussed in Section 3.4 below, are used in the ITCAM insert algorithm 1000, delete algorithm 1002, and change algorithm 1004 of various embodiments of the present invention, shown in Figure 10, which are self-explanatory.
Notice that following the first step of the change algorithm, the prefix whose next hop is being changed is in two valid slots of the ITCAM: oldSlot and slot. This duplication does not affect the correctness of data plane lookups since, whichever of the two is matched by the ITCAM, the next hop that is returned is valid either before or after the change operation. On the other hand, if an attempt is made to change the next hop in ISRAM[oldSlot] directly, an ongoing lookup may return a garbled next hop. Similarly, if the delete is performed first and then the insert, lookups that take place between the delete and the insert may return a next hop that corresponds to the routing table state neither before nor after the change. If a waitWriteValidate is used to change ISRAM[oldSlot] to the new next hop, oldSlot becomes unavailable for data plane lookups during the write operation and inconsistent results are returned in case the prefix in ITCAM[oldSlot] is the longest matching prefix.
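For illustration, the insert-before-delete ordering can be sketched as below; getSlot and freeSlot follow the Section 3.4 interfaces, while the stub bodies and the dict-based ITCAM model are assumptions made only to keep the sketch self-contained.

```python
itcam = {}                           # slot -> (prefix, nexthop, valid)

def getSlot(length):  return 1       # stub: pretend slot 1 satisfies ordering
def freeSlot(slot, length):  pass    # stub: return the slot to the pool

def waitWriteValidate(slot, prefix, nexthop):
    itcam[slot] = (prefix, nexthop, True)

def invalidateWaitWrite(slot):
    p, h, _ = itcam[slot]
    itcam[slot] = (p, h, False)

def itcam_change(prefix, length, oldSlot, newHop):
    slot = getSlot(length)                  # slot for the duplicate entry
    waitWriteValidate(slot, prefix, newHop)
    # Both copies are briefly valid here; a lookup matches one of them and
    # returns a next hop valid either before or after the change.
    invalidateWaitWrite(oldSlot)
    freeSlot(oldSlot, length)
    return slot

itcam[0] = ("10*", 5, True)
newSlot = itcam_change("10*", 2, oldSlot=0, newHop=7)
assert itcam[newSlot] == ("10*", 7, True) and itcam[0][2] is False
```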
3.3 LTCAM Algorithms
The prefixes in the LTCAM are disjoint and so may be stored in any order. The unused (or free) slots of the LTCAM/LSRAM are linked together into a chain, using the words of the LSRAM to build this chain. A variable AV is used to store the index of the first available LSRAM word on the chain. Stated differently, AV is an integer variable used to store the address of the first available or free slot in the LTCAM-LSRAM system. The non-available slots store valid prefixes and corresponding nexthops in the LTCAM and the LSRAM, respectively. So, the free slots are AV, LSRAM[AV], LSRAM[LSRAM[AV]], and so on. The last free slot on the AV chain has LSRAM[last] = -1. The LTCAM insert algorithm 1100, delete algorithm 1102, and change algorithm 1104 are shown in Figure 11.
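The chain can be modeled directly, as in the runnable sketch below: free LSRAM words link to the next free slot, AV holds the head of the chain, and -1 terminates it. The list size is an arbitrary assumption.

```python
lsram = [1, 2, 3, -1]   # 4 slots, all free: AV -> 0 -> 1 -> 2 -> 3 -> end
AV = 0                  # first free slot of the LTCAM-LSRAM system

def alloc_slot():
    """Pop the first free slot off the AV chain (one is assumed to exist)."""
    global AV
    slot, AV = AV, lsram[AV]
    return slot

def release_slot(slot):
    """Push a freed slot onto the front of the AV chain. In hardware the
    link is written with invalidateWaitWrite so lookups stay consistent."""
    global AV
    lsram[slot] = AV
    AV = slot

a, b = alloc_slot(), alloc_slot()   # a == 0, b == 1, AV now 2
release_slot(a)                     # chain is now 0 -> 2 -> 3 -> end
assert (a, b, AV, lsram[0]) == (0, 1, 0, 2)
```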
3.4 ITCAM Memory Management
In this section, four embodiments of memory management mechanisms for an ITCAM are discussed. The discussion of each memory management mechanism includes an implementation of the getSlot and freeSlot functions discussed above in Section 3.2 to get and free ITCAM slots. The implementations employ the function move (Figure 12) that moves the content of an in-use ITCAM slot to a free ITCAM slot in such a way as to maintain data plane lookup consistency. The memory management algorithms of various embodiments of the present invention maintain the invariant that an ITCAM slot has its valid bit set to 0 iff (if and only if) that slot was not matched by the ongoing data plane lookup (if any); that is, iff the slot is not involved in the ongoing data plane lookup.
3.4.1 Memory Management Mechanism 1
This memory management embodiment, which is the PLO_OPT mechanism of [23], has the ITCAM slots indexed 0 through N, and is shown in Figures 13A to 13E. In particular, Figures 13A to 13E show the prefix arrangement 1300 in the ITCAM for the first memory management mechanism. The blocks with a diagonal pattern indicate the free space pool. The numbers "1" and "2" by the curved arrows correspond to the first and second move, respectively. Figure 13A shows the initial arrangement. Figure 13B shows an insert p/30 operation. Figure 13C shows free space available in block 30 for the insert. Figure 13D shows a delete p/24 operation. Figure 13E shows that the free space has been returned to the pool. The prefixes are stored in decreasing order of length in the TCAM, which ensures that the longest matching prefix is returned as the first matching prefix. The pool of free slots is kept at the logical center of the TCAM; that is, the first free slot in the pool appears after all blocks of prefixes of length W/2 + 1 or more and the last free slot appears before all blocks of prefixes of length W/2 or less, where W is the width of an IP address (32 in the case of IPv4). As noted in [23], this architecture requires at most W/2 moves for each getSlot and freeSlot request. This first embodiment of the memory management mechanism provides an implementation that maintains consistency of data plane lookups.
This lookup-consistent implementation of getSlot and freeSlot employs the following variables:
W = prefix length (32 for IPv4);
top[i] = first slot used by block i, 1 ≤ i ≤ W/2;
bot[i] = last slot used by block i, W/2 + 1 ≤ i ≤ W.
The following invariants are maintained:
top[i] = top[i-1] iff block i is empty, 1 ≤ i ≤ W/2;
bot[i] = bot[i+1] iff block i is empty, W/2 + 1 ≤ i ≤ W.
Initially, all blocks are empty and top[0 : W/2] = N+1 and bot[W/2 + 1 : W + 1] = -1 (recall that the ITCAM slots are indexed 0:N). Figures 14 and 15, respectively, show the getSlot and freeSlot algorithms 1400 and 1500 for Mechanism 1. The getSlot algorithm 1400 obtains a free slot to insert a prefix whose length is len. The freeSlot algorithm 1500 frees a slot previously occupied by a prefix of length len. Their correctness and the fact that data plane lookup consistency is preserved are easily established.
3.4.2 Memory Management Mechanism 2
This memory management embodiment is a variation of the first memory management embodiment discussed above in which the free slots are kept at the boundaries between prefix blocks, and is shown in Figures 16A to 16E. This memory management embodiment is also called DFS_PLO (Distributed Free Space with Prefix Length Ordering constraint). In particular, Figures 16A to 16E show the prefix arrangement 1600 in the ITCAM for the second memory management mechanism. The blocks with a diagonal pattern indicate the free space pool. The curved arrows correspond to a move. Figure 16A shows the initial arrangement. Figure 16B shows an insert p/30 operation. Figure 16C shows free space available in block 30 for the insert. Figure 16D shows a delete p/24 operation. Figure 16E shows that the free space has been returned to the adjacent pool. At the time the ITCAM is initialized, the available free slots are distributed in proportion to the number of prefixes in a block, with the caveat that an empty block gets 1 free slot at its boundary. In this embodiment, top[i] is the slot where the first prefix of length i is stored and bot[i] is the slot where the last prefix of length i is stored, 1 ≤ i ≤ W (i.e., these variables define the start and end of block i). Note that top[i] ≤ bot[i] for a non-empty block i and top[i] > bot[i] for an empty block. For convenience, the following is defined: top[0] = bot[0] = N + 1 and top[W+1] = bot[W+1] = -1. For an empty ITCAM, top[i] = N + 1 and bot[i] = -1 for 1 ≤ i ≤ W.
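Before turning to the Mechanism 2 algorithms, the following runnable sketch illustrates the boundary walk that both mechanisms rely on, using the Mechanism 1 layout for the long half of the table (lengths above W/2): the first free slot of the central pool is walked toward the target block, one boundary-prefix move per intervening non-empty block, so at most W/2 moves occur. W, the initial layout, and all names are illustrative assumptions, not the patent's literal algorithm.

```python
W = 4
# tcam[i] = (prefix, length) or None (free). Long-prefix blocks sit above
# the central pool in decreasing length order; the pool is slots 4-5.
tcam = [("p4a", 4), ("p4b", 4), ("p3a", 3), ("p3b", 3), None, None]
bot = {5: -1, 4: 1, 3: 3}   # bot[l] = last slot of block l; bot[W+1] sentinel

def get_slot_long(length):
    """Open a free slot at the bottom of block `length`, W/2 < length <= W."""
    hole = bot[W // 2 + 1] + 1              # first free slot of the pool
    for l in range(W // 2 + 1, length):     # blocks between pool and target
        if bot[l] > bot[l + 1]:             # non-empty: move its top prefix
            src = bot[l + 1] + 1            # top slot of block l
            tcam[hole], tcam[src] = tcam[src], None
            bot[l] += 1
            hole = src
        # an empty block needs no move; the hole is already past it
    bot[length] += 1                        # hole joins the bottom of the block
    return hole

slot = get_slot_long(4)                     # one move: "p3a" shifts down
tcam[slot] = ("p4c", 4)
assert slot == 2 and tcam == [("p4a", 4), ("p4b", 4), ("p4c", 4),
                              ("p3b", 3), ("p3a", 3), None]
```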
The getSlot algorithm 1700 for the second memory management mechanism, as shown in Figure 17, provides a free slot from either block boundary when there is a free slot on the block boundary. Otherwise, the algorithm moves a free slot from the nearest block boundary that has a free slot. This algorithm utilizes several supporting algorithms 1800, 1802, 1804, 1806 that are shown in Figure 18. The algorithms movesFromAbove 1800 and movesFromBelow 1802 return the number of prefix moves that are required to get the nearest free slot from above and below, respectively, the block where it is needed. The algorithms getFromAbove 1804 and getFromBelow 1806 get the nearest free slot above or below the block where the free slot is needed, respectively. The algorithm 1900 to free a slot for the second memory management mechanism, as shown in Figure 19, simply moves the slot to be freed to the block boundary unless this slot is at the boundary to begin with. Again, correctness and consistency are established easily. Although the worst-case performance of the algorithms of this second memory management embodiment is the same as that of the algorithms of the first memory management embodiment, it is expected that the second memory management embodiment's algorithms have better performance on average.
3.4.3 Memory Management Mechanism 3
This memory management embodiment is an enhancement of the second memory management embodiment in which a doubly-linked list of free slots is maintained within each block, in addition to the contiguous free slots at the block boundaries. This memory management embodiment is also called DLFS_PLO (Distributed and Linked Free Space with Prefix Length Ordering constraint). Figures 20A to 20G show the ITCAM layout 2000 for this third memory management mechanism with the moves for insert and delete. The curved arrows on the right show the forward links in the list of free spaces. In particular, Figure 20A shows the initial arrangement. Figure 20B shows an insert p/30 operation. Figure 20C shows free space available. Figure 20D shows a delete p/24 operation. Figure 20E shows a delete p2/24 operation. Figure 20F shows a delete p3/24 operation. Figure 20G shows an insert p/24 operation. The lists of free slots within a block enable one or more embodiments to avoid the move that is done by the freeSlot algorithm 1900 (Figure 19) of the second memory management embodiment. The forward links, called next[], of the doubly-linked list are maintained using the ISRAM words corresponding to the free ITCAM slots, with AV[i] recording the first slot on the list for the ith block. The backward links, called prev[], are maintained in these ISRAM words in case an ISRAM word is large enough to accommodate two links, and in the control plane memory otherwise. All variables, including the array AV[], are, of course, stored in the control plane memory.
The getSlot algorithm 2100, as shown in Figure 21, for the third memory management embodiment first attempts to make available a slot from the doubly-linked list for the desired block. When this list is empty, the algorithm behaves like the getSlot algorithm for the second memory management embodiment, and the supporting algorithms 2200, 2202, 2204 of Figure 22 are similar to the corresponding supporting algorithms for the second memory management embodiment.
The freeSlot algorithm 2300, as shown in Figure 23, differs from that for the second memory management embodiment in that when the slot being freed is inside a block, it is added to the doubly-linked list of free slots. Again, correctness and consistency are established easily. Although the worst-case performance of the third memory management embodiment's algorithms is the same as that of the algorithms for the first two memory management embodiments, it is expected that the third memory management embodiment's algorithms have better performance on average.
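As an illustrative, runnable model of the per-block free lists: the forward links next[] live in the ISRAM words of free slots, AV[i] heads block i's list, and prev[] makes unlinking O(1). Freed in-block slots are simply pushed onto the list, which is why deletes need no prefix moves; the sizes below are arbitrary assumptions.

```python
NSLOTS, NBLOCKS = 8, 4
nxt  = [-1] * NSLOTS    # forward links (kept in ISRAM words in hardware)
prev = [-1] * NSLOTS    # backward links
AV   = [-1] * NBLOCKS   # AV[i] = first free slot of block i, or -1

def push_free(block, slot):
    """freeSlot fast path: link a freed in-block slot; no prefix move."""
    nxt[slot], prev[slot] = AV[block], -1
    if AV[block] != -1:
        prev[AV[block]] = slot
    AV[block] = slot

def pop_free(block):
    """getSlot fast path: take a slot from the block's own free list.
    Returns None when the list is empty, in which case the caller falls
    back to the Mechanism 2 boundary walk."""
    slot = AV[block]
    if slot == -1:
        return None
    AV[block] = nxt[slot]
    if AV[block] != -1:
        prev[AV[block]] = -1
    return slot

push_free(2, 5); push_free(2, 6)
assert pop_free(2) == 6 and pop_free(2) == 5 and pop_free(2) is None
```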
3.4.4 Memory Management Mechanism 4
This memory management embodiment is the CAO_OPT mechanism presented in [23]. Here, prefixes are arranged in chain order, with the free space pool in the middle of the ITCAM. Figures 24-26 show the necessary algorithms. For example, Figure 24 shows a getSlot algorithm 2400 for the fourth memory management mechanism. Figure 25 shows a freeSlot algorithm 2500 for the fourth memory management mechanism. Figure 26 shows an isTopHeavy algorithm 2600, a parent algorithm 2602, and a child algorithm 2604. The interfaces are different from those used by the first three memory management embodiments. The input to getSlot 2400 is p, which is the node in the trie where the prefix being inserted is stored. Each trie node stores wt, wt ptr, held ptr, lchild, rchild, which are explained in greater detail in [23]. In addition to these, the following variables are also used:
slot: address of the ITCAM slot in which the prefix is entered; if the prefix has not yet been entered, then this variable is set to -1
firstFree: first free space
lastFree: last free space
shift[0:W/2]: temporary array of nodes
Also used in this embodiment is an array of nodes, such as nodeMap[0:N] for ITCAM[0:N], that holds the trie node address of each valid prefix in the ITCAM, so that these prefixes can be located in the trie.
4 Wide Dual TCAM - DUOW
In this section, the DUOS embodiment is extended to the case when wide SRAMs (such as 32-bit words or larger) are in use. In this embodiment, the TCAM and SRAM configuration is similar to that shown in Figure 3, but with the SRAMs being wide SRAMs. The extension is discussed only for the case when the LSRAM is wide. The case when the ISRAM is wide uses techniques almost identical to those used in [4], while for a wide LSRAM these techniques are modified by one or more embodiments of the present invention. As in [4], a wide LSRAM word is used to store a subtree of the binary trie of a forwarding table. However, instead of beginning with the binary trie for all prefixes as is done in [4], this embodiment begins with the binary trie, the leaf trie, for only the leaf prefixes. When a subtree of the leaf trie is stored in an LSRAM word, that subtree is removed from (or carved out of) the leaf trie before another subtree is identified for carving. Let N be the root of the subtree being carved and let Q(N) be the prefix defined by the path from the root of the trie to N. Q(N) is stored in the LTCAM, and the |Pi| - |Q(N)| suffix bits of each prefix Pi in the carved subtree rooted at N are stored in the LSRAM word. Note that each suffix stored in the LSRAM word is a suffix of a leaf prefix that begins with Q(N). By repeating this carving process, all leaf prefixes are allocated to the LTCAM and LSRAM. To obtain the mapping of leaf prefixes to the LTCAM and LSRAM, this embodiment uses a carving algorithm that ensures that the Q(N)s stored in the LTCAM are disjoint. Since the carving algorithm of [4] does not ensure disjointness, a new carving algorithm is provided in this embodiment.
As an example, consider the binary trie 2700 of Figure 27(a), which has been carved using a carving algorithm that ensures that each carved subtree 2702, 2704, 2706 has at most 2 leaf prefixes. The LTCAM will need to store Q(N1), Q(N2) and Q(N3). Even though the prefixes in the binary trie are disjoint, the Q(N)s in the LTCAM are not disjoint (e.g., Q(N1) is a descendant of Q(N2) and so Q(N2) matches all IP addresses matched by Q(N1)). To retain much of the simplicity of the LTCAM management mechanism of DUOS, the leaf trie is carved in such a way that all Q(N)s in the LTCAM are disjoint. As in [4], carving is performed via a postorder traversal of the binary trie. However, the current embodiment uses the visit algorithm 2800 of Figure 28 to do the carving. In particular, the algorithm 2800 of Figure 28 carves a leaf trie to obtain disjoint Q(N)s. In this algorithm 2800, w is the number of bits in an LSRAM word and x→size is the number of bits needed to store (1) the suffix bits corresponding to prefixes in the subtrie rooted at x, (2) the length of each suffix, (3) the next hop for each suffix, (4) the number of suffixes in the word, and (5) the length of Q(x), which is the corresponding prefix stored in the LTCAM. Algorithm splitNode(q) does the actual carving of the subtree rooted at node q; splitNode(q) is known to those skilled in TCAM research. The basic idea in the current embodiment's carving algorithm is to forbid carving at two nodes that have an ancestor-descendant relationship. This ensures that the Q(N)s are disjoint. Figure 27(b) shows the subtrees 2708, 2710, 2712 carved by the current embodiment's algorithm. As can be seen, Q(N1), Q(N2), Q(N3) are disjoint. Although this carving algorithm generally results in more Q(N)s than when the carving algorithm of [4] is used, this carving algorithm retains the flexibility to store the Q(N)s in any order in the LTCAM, as the Q(N)s are independent.

The LTCAM algorithms to insert, delete, and change, and the necessary support algorithms, are shown in Figures 29-33. For example, Figure 29 shows an insert algorithm 2900 for inserting a prefix into the LTCAM. Figure 30 shows an addSuffix algorithm 3000 for adding a suffix to a wide LSRAM word. Figure 31 shows a split algorithm 3100 for splitting a wide LSRAM word into two words. Figure 32 shows a delete algorithm 3200 and a carve algorithm 3202 for deleting a leaf prefix. Figure 33 shows a change algorithm 3300 for changing the next hop of a leaf prefix. The function carve is invoked by both the insert and delete algorithms under different contexts that are analyzed below. When a prefix is deleted, the LSRAM word storing its suffix (corresponding to the LTCAM word for Q(cNode)) may have remaining suffixes that can be merged with another LSRAM word. This merge is accomplished by the carve function, by carving the trie at tNode, which is the nearest ancestor of cNode with two children. Thus carve helps to reduce the LTCAM entries by one. When a prefix is inserted, it may be possible to add the suffix bits of the new prefix to the LSRAM word that corresponds to the LTCAM slot for Q(cNode). If there is no cNode in the path between the new prefix node and the root, then carving at tNode is attempted, where tNode is the nearest degree-2 ancestor of the new prefix node and therefore encloses the new prefix along with other existing prefixes. So, in this case, using carve, one or more embodiments of the present invention prevent the addition of a new LTCAM entry for the new prefix.
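The single-carve-per-path idea can be illustrated by the hedged sketch below: a postorder pass that, once a subtree has been carved, refuses to merge it upward and instead carves any sibling subtree with pending suffixes on its own. The Node class, the scalar size accounting, and the budget w are simplifying assumptions; the actual visit algorithm 2800 tracks the richer size of items (1)-(5) above.

```python
class Node:
    def __init__(self, left=None, right=None, size=0):
        # size: suffix bits contributed by a leaf prefix (0 for internal nodes)
        self.left, self.right, self.size = left, right, size

carved = []        # nodes whose Q(N) would be stored in the LTCAM

def visit(x, w):
    """Return the pending (not yet carved) bits under x, or None if x's
    subtree already contains a carved node."""
    if x is None:
        return 0
    if x.left is None and x.right is None:
        return x.size
    pend = [visit(c, w) for c in (x.left, x.right)]
    if None in pend:       # a descendant was carved: x may not be carved,
        for c, p in zip((x.left, x.right), pend):
            if p:          # so carve any sibling with pending suffixes
                carved.append(c)
        return None
    if sum(pend) > w:      # too big for one word: carve each child alone
        for c, p in zip((x.left, x.right), pend):
            if p:
                carved.append(c)
        return None
    return sum(pend)       # still fits; defer carving to an ancestor

# Tiny example: a heavy left subtree forces carves; the right leaf follows.
root = Node(Node(Node(size=60), Node(size=60)), Node(size=40))
if visit(root, w=100):     # leftover pending bits would be carved at the root
    carved.append(root)
assert len(carved) == 3    # three disjoint Q(N)s, none an ancestor of another
```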
Next, it is shown that tNode is indeed an appropriate node to carve and that the algorithm preserves the property of carving at only one node along any path from the root. tNode is carved only if the number of bits needed to store all suffixes in the subtree rooted at tNode is less than the size of an LSRAM word. In this case, there is a single otherNode that is a descendant of tNode and for which Q(otherNode) is in the LTCAM. To see that there cannot be more than one otherNode, suppose there are q such carved nodes whose Q values are in the LTCAM. All of these q nodes must be in the subtree of tNode that does not contain the target node, which is cNode for a delete and the new prefix node for an insert. This is because, if there were one carved node t among the q nodes in the subtree of cNode, for a delete, then t must occur either on the path between cNode and tNode, or as a descendant of cNode, given that tNode is the nearest ancestor of cNode with two children. In either case, t violates the property of a single carving along any path from the root. Similarly, for an insert, if there were a carved node t in the same subtree that contained the newly added prefix, then t would have served as the cNode and the carve algorithm would not have been started in the first place. Since all q nodes must appear in the same subtree rooted at either the left or right child of tNode, and the sum of their sizes is small enough to fit in an LSRAM word, the carving algorithm of the current embodiment would have carved that child of tNode. Thus, there is only one otherNode. Since one or more embodiments delete Q(cNode) and Q(otherNode) right after adding Q(tNode), the property of carving only once along any path is maintained.
Figure 34 shows a possible assignment 3400 of the 5-prefix example of Figure 4. The intermediate prefixes P1 and P2 are stored in the ITCAM 3402, while the leaf prefixes P3, P4 and P5 are stored in the LTCAM 3404 using a wide LSRAM 3408. The suffix nodes begin with the prefix length field of 2 bits in this example, followed by the suffix count field of 2 bits. Next comes the (length, suffix, nexthop) triplet for each prefix encoded in the suffix node, the numbers of allocated bits being (2 bits, 4 bits, 6 bits), respectively, for the three fields of the triplet.
5 Indexed DUOW - IDUOW
Zane et al. [21] introduced the concept of an indexed TCAM, which significantly reduces the power consumed by a TCAM lookup. This concept was refined by Lu and Sahni [4] to reduce both the TCAM power and space requirements substantially. One or more embodiments of the present invention incorporate an index TCAM in conjunction with an LTCAM that uses a wide LSRAM (i.e., an index for the LTCAM of DUOW). When the LTCAM is indexed, the current embodiment has two TCAMs replacing the LTCAM: a data TCAM referred to as DLTCAM 6904 and an index TCAM referred to as ILTCAM 6910, as shown in Figure 69. The associated SRAMs are the data SRAM (DLSRAM) 6908 and the ILSRAM 6912, as shown in Figure 69. The IDUOW architecture 6900 of Figure 69 further shows the ITCAM 6902 and the ISRAM 6906, similar to those shown in Figure 3. It should be noted that, in another embodiment, an indexed ITCAM and an indexed ISRAM can be placed before the ITCAM 6902, similar to what is shown for the DLTCAM 6904. The two most effective index TCAM strategies of [4], 1-12Wc and M-12Wb, are considered. The former is best for power whereas the latter is the best overall mechanism, consuming the least TCAM space and low power for lookups [4]. Both 1-12Wc and M-12Wb organize the DLTCAM into fixed-size buckets that are indexed using the ILTCAM and ILSRAM, the latter of which also is a wide SRAM that stores suffixes and associated information.
5.1 Memory Management for DLTCAM and ILTCAM
In this embodiment, each DLTCAM bucket is assigned a unique number between 0 and totalSlots/bucketSize, where totalSlots is the total number of DLTCAM slots. The unique number so assigned to a bucket is called its index. A bucket index is stored in the trie node (in field bIndex) that is carved and represents an index prefix enclosing the DLTCAM prefixes in the bucket. The free slots in a bucket are linked through the associated DLSRAM. The first several bits (32 should be enough) of a DLSRAM word store the address of the next free DLTCAM slot in the same bucket. The last free slot in a bucket stores -1 in bits 0-31 of the corresponding DLSRAM word. For each bucket, one free slot is kept at all times. This free slot is used for consistent updates, to copy the new prefix before deleting the old one. The first free slot in a bucket is stored in an array AV indexed by the bucket index. The array AV is initialized and maintained in the control plane. A list of free buckets is maintained in the DLSRAM using additional bits of each DLSRAM word (12 bits are sufficient when the number of buckets is at most 4096). The first available slot in a free bucket stores, in these DLSRAM bits, the bucket index of the next free bucket, and so on. The free bucket chain is terminated by a -1 in the bits used to store the index of the next free bucket. The variable bucketAV keeps track of the first bucket on the free bucket chain. In the algorithms of one or more embodiments of the present invention, the array nextBucket is used to represent the forward links in the bucket list.
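A small runnable model of this bookkeeping is given below; the bucket and slot counts are arbitrary assumptions, and for brevity the per-bucket spare slot kept for consistent updates is not modeled.

```python
BUCKETS, BSIZE = 3, 4
next_slot   = [[-1] * BSIZE for _ in range(BUCKETS)]  # bits 0-31 of DLSRAM
next_bucket = [-1] * BUCKETS                          # extra DLSRAM bits
AV          = [0] * BUCKETS                           # first free slot per bucket
bucketAV    = 0                                       # head of free bucket chain

# Initially every slot of every bucket is free and every bucket is free.
for b in range(BUCKETS):
    next_slot[b] = list(range(1, BSIZE)) + [-1]
    next_bucket[b] = b + 1 if b + 1 < BUCKETS else -1

def assignNewBucket():
    """Take the first bucket off the free bucket chain."""
    global bucketAV
    b, bucketAV = bucketAV, next_bucket[bucketAV]
    return b

def take_slot(b):
    """Take the first free slot of bucket b off its chain."""
    s, AV[b] = AV[b], next_slot[b][AV[b]]
    return s

b = assignNewBucket()
assert b == 0 and bucketAV == 1
assert take_slot(b) == 0 and take_slot(b) == 1 and AV[b] == 2
```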
When the prefixes in an ILTCAM are disjoint, one or more embodiments may use the simple memory management mechanism used for the LTCAM of DUOS; when these prefixes are not disjoint, they must be ordered and any of the memory management mechanisms for the ITCAM of DUOS in Section 3.4 above may be used. The update algorithms (insert algorithm 3500, addSuffix algorithm 3600, split algorithm 3700, delete algorithm 3800, carve algorithm 3802, change algorithm 3900, deleteBucket algorithm 3902, splitBucket algorithm 3904, deletePrefixes algorithm 4000) shown in Figures 35-40 are almost identical for the 1-12Wc and M-12Wb organizations. The differences are explained in the next two subsections.
5.2 1-12Wc
This two-level TCAM organization in [4] employs wide SRAMs 4108, 4112 in association with both the data and index TCAMs 4102, 4110, as shown in Figure 41. The strategy adopted in [4] to fill up the TCAMs and the SRAMs is summarized as follows. First, suffix nodes are created for prefixes in the 1-bit trie, as described in Section 4, using Lu's carving heuristic. Second, every Q(N) to be entered in the data TCAM is treated as a prefix and the subtree split algorithm [4] is applied to carve index nodes in the trie. The carving is done so that the number of data TCAM prefixes enclosed by the node being carved is less than or equal to the size b of a data TCAM bucket. A new bucket is assigned to every index node. An enclosed data TCAM prefix and the corresponding suffix node are entered in a new entry in the bucket. When an index node encloses fewer than b prefixes, the remaining entries in the bucket are padded with null prefixes. Finally, the index nodes are treated as prefixes and the algorithm to create suffix nodes is run on the trie containing only the index prefixes. The newly carved index Q(N) prefixes and the corresponding suffix nodes are entered in the index TCAM and the associated wide SRAM, respectively. Using this strategy, the bucket numbers corresponding to the suffixes in an index SRAM suffix node happen to be consecutive. Hence, the index SRAM omits the bucket number for all suffixes except the starting suffix, as shown in Figure 41.
During incremental updates, if a bucket overflows, then assigning a new bucket immediately next to the overflowing bucket may require a large number of moves. Hence, the suffix node format in IDUOW stores the bucket number for each suffix, which makes it possible to assign any empty bucket in case of an overflow. The suffix node format for the ILSRAM for 1-12Wc is shown in Figure 42. Similar to Figure 41, Figure 42 shows wide SRAMs 4208, 4212 in association with both the data and index TCAMs 4202, 4210. Also, in keeping with the main idea of storing independent prefixes in the LTCAM, the visit postorder algorithm is used instead of the subtree split algorithm of [4] while filling out the TCAMs.
The prefix assignment algorithm for 1-12Wc is given below.
1. Suffix nodes corresponding to prefixes in the forwarding table are created using the visit postorder algorithm on the 1-bit leaf prefix trie, as shown in Section 4.
2. Each Q(N) prefix resulting from Step 1 is to be entered into the DLTCAM and is marked as a DLTCAM prefix in the trie.
3. The visit postorder algorithm is applied to carve the index prefix nodes. The symbols used in the visit postorder algorithm have slightly different meanings now: x→size represents the number of DLTCAM prefixes enclosed by node x, and w is b - 1, where b is the size of a DLTCAM bucket with one free slot reserved for consistent updates. As an index node is carved, the enclosed DLTCAM prefixes are entered in a new DLTCAM bucket, and the bucket index is stored in field bIndex of the trie node corresponding to the index.
4. Each Q(N), for the index nodes carved in Step 3, is marked as an index prefix in the trie.
5. Suffix nodes are created for the index prefixes using the visit postorder algorithm on the 1-bit trie containing the index prefixes. The Q(N) prefixes corresponding to the carved nodes are entered in the ILTCAM. Suffixes for the index prefixes are entered in the ILSRAM along with their bucket indexes, in the ILSRAM suffix node format shown in Figure 42.
The functions incrementRoom and decrementRoom are not relevant for 1-12Wc and are null functions. The assignNewBucket function 4300 is outlined in Figure 43. The 1-12Wc mechanism loses space efficiency as independent index prefix nodes are carved out and a single bucket is used to store the DLTCAM prefixes enclosed by a single index prefix. The M-12Wb mechanism does not have this deficiency, as DLTCAM prefixes from multiple index prefixes are stored in the same bucket.
5.3 M-12Wb
The characteristic of the many-1 architectures in [4] is that all DLTCAM buckets, except the last one, can be completely filled. Thus, multiple index nodes use the same bucket to store their enclosed data TCAM prefixes. The configuration for M-12Wb is shown in Figure 44. Similar to Figures 41 and 42, Figure 44 shows wide SRAMs 4408, 4412 in association with both the data and index TCAMs 4402, 4410. The algorithm for carving and prefix assignment is as follows and is supported by the visit2 algorithm 4500 of Figure 45, the splitNode algorithm 4600 of Figure 46, the assignNewBucket algorithm of Figure 47, and the incrementRoom and decrementRoom algorithms 4800, 4802 of Figure 48:
Step 1: [Seed the DLTCAM buckets] Run feasibleST2(T, b - 1) ⌈n/(b - 1)⌉ times. // b - 1, since one free slot is needed in a bucket for consistent updates. Each time, call splitNode to carve the found bestST from T (thereby updating T) and pack bestST into a new DLTCAM bucket. The function splitNode adds one or more prefixes to the ILTCAM.
Step 2: [Fill the buckets] While there is a DLTCAM bucket that is not full and T is not empty, repeat Step 3.
Step 3: [Add to a bucket] Let B be the DLTCAM bucket with the fewest number of prefixes. Let s be the number of prefixes in B. Run feasibleST2(T, b - s). Using splitNode, carve the found bestST from T (thereby updating T) and pack bestST into B. The function splitNode adds one or more prefixes to the ILTCAM.
Step 4: [Use additional buckets as needed] While T is not empty, fill a new DLTCAM bucket by making repeated invocations of feasibleST2(T, q), where q is the remaining capacity of the bucket. Add ILTCAM prefixes as needed.
There are three main differences between this algorithm and the PS2 algorithm in [4]. The first difference is reflected in the visit2 algorithm 4500 (invoked by feasibleST2), shown in Figure 45, in that covering prefixes are not stored in the TCAMs. The second difference is in supplying b - 1 as the available space in an empty bucket of size b, reserving one free slot for consistent updates. The third difference is in the use of the carving function splitNode 4600, shown in Figure 46, which helps to create independent prefixes for IDUOW.
Apart from the data structures already defined for the two-level indexing mechanisms, M-12Wb requires a doubly-linked list of used buckets to keep track of the buckets and the available space in them. An instance of a class BList is maintained in the control plane, which includes the doubly-linked list of buckets as well as an array for getting to the right bucket quickly using a bucket index. Each bucket in the list has a field room to indicate the number of available bucket slots and a field index to indicate the index of the bucket. The room in a bucket decreases from the head to the tail of the list. BList uses the function add to add a new bucket to the list and the array, and the function getBucket to get the appropriate bucket based on a bucket index.
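As a hedged sketch, the BList bookkeeping can be modeled as below; a sorted Python list stands in for the doubly-linked list of the text (a real implementation would relink buckets as their room changes), and the field and method names follow the description above.

```python
class BList:
    """Used-bucket list ordered by decreasing room, with an index-keyed
    map for O(1) access to a bucket given its bucket index."""
    def __init__(self):
        self.buckets = []        # entries kept in decreasing order of room
        self.by_index = {}

    def add(self, index, room):
        entry = {"index": index, "room": room}
        self.by_index[index] = entry
        self.buckets.append(entry)
        self.buckets.sort(key=lambda e: -e["room"])   # head has the most room

    def getBucket(self, index):
        return self.by_index[index]

    def head(self):
        # The bucket with the most room (fewest prefixes): the bucket that
        # Step 3 of the carving algorithm fills next.
        return self.buckets[0] if self.buckets else None

bl = BList()
bl.add(0, room=1); bl.add(1, room=3); bl.add(2, room=2)
assert bl.head()["index"] == 1 and bl.getBucket(2)["room"] == 2
```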
6 Experimental Results
The performance of the different versions of DUO was evaluated using 21 IPv4 routing tables and update sequences downloaded from [6] and [7], which are hereby incorporated by reference in their entireties. Figure 49 shows the characteristics 4900 of these datasets. The update sequences for the first 20 routing tables were captured from files storing update announcements from 12am on February 1, 2009 for the stated number of hours; the update sequence for the last routing table, rrc00May20, was captured from files storing eight hours of activity starting from 12am on May 20, 2008. The columns labeled #RawInserts, #RawDeletes and #RawChanges, respectively, give the number of insert, delete, and change next hop requests in the update sequences. Using consistent updates, a next hop change request is implemented (see Figure 10, for example) as an insert (of the prefix with the new next hop) followed by a delete (of the prefix with the old next hop). Therefore, all results henceforth are in terms of the effective inserts and deletes. Note that the number of effective inserts (#Inserts) and deletes (#Deletes) is given by the following equations.
#Inserts = #RawInserts + #RawChanges (1)
#Deletes = #RawDeletes + #RawChanges (2)
6.1 Evaluation of Memory Management Mechanisms
A first set of experiments was run on the simple TCAM [4] to compare the four memory management mechanisms 1-4 discussed above. The simple TCAM that was instantiated for the experiments has 300,000 slots. Figures 50 and 51, respectively, show tables 5000 and 5100 of the total and average number of prefix moves (i.e., the number of invocations of move()) required for an insert (which includes raw inserts and change-next-hop inserts) and a delete in the test update sequences (the data in Figure 51 is obtained from that in Figure 50 by dividing by #Inserts or #Deletes). Note that the theoretical worst-case number of moves for an insert/delete in IPv4 for the four memory management mechanisms is, respectively, 16, 32, 32 and 16. From Figures 50 and 51, the following observations were made:
1. The first memory management mechanism (PLO_OPT) required the maximum number of moves (sum of moves for inserts and deletes) for all the test sets and Mechanism 3 required the least. In fact, the disparity among the four memory management mechanisms is very significant, with the third memory management mechanism, also referred to as Distributed and Linked Free Space with Prefix Length Ordering constraint (DLFS_PLO), requiring a total number of moves that is orders of magnitude less than that required by the remaining mechanisms. Memory management mechanisms 2 and 4 have similar performance, and the first memory management mechanism requires 10 times (or more) as many moves as required by the second and fourth memory management mechanisms.
2. The number of moves due to inserts in the second memory management mechanism, also referred to as Distributed Free Space with Prefix Length Ordering constraint (DFS_PLO), is lower than that in the fourth memory management mechanism (CAO_OPT) by orders of magnitude. For some of the test sets, inserts required no moves when the second memory management mechanism was used.
3. The number of moves due to deletes in the second memory management mechanism (DFS_PLO) is comparable to that in the fourth memory management mechanism (CAO_OPT).
4. The number of moves due to inserts in the third memory management mechanism (DLFS_PLO) is lower than that in the fourth memory management mechanism (CAO_OPT) by orders of magnitude. For the inserts in some of the test sets, the third memory management mechanism required no moves at all.
5. The number of moves due to deletes is 0 in the third memory management mechanism (DLFS_PLO) because, in this mechanism, the slot within a block that is freed by a delete is simply appended to the free space list for the block.
It is also noted that for memory management mechanisms 2 and 4, the number of moves due to deletes is much greater than that due to inserts. For the fourth memory management mechanism (CAO_OPT) this is because a delete rarely occurs adjacent to either of the two boundaries of the free space pool, and non-boundary deletes require at least one move to shift the empty slot to the free space pool. However, since the prefix trie is shallow and the free space pool cuts each root-to-leaf path in the middle, many of the inserts in an update sequence are expected to occur at a boundary of the free space pool. So, inserts take much less than 1 move, on average, when the fourth memory management mechanism (CAO_OPT) is used. Similarly, when the second memory management mechanism (DFS_PLO) is used, most deletes are from within a block rather than at a block boundary. These non-boundary deletes require 1 move each. However, an insert requires no moves if there is a free slot at the top or bottom of its block, a likely occurrence.
Figure 52 shows a table 5200 of the number of waitWrites (the sum of invocations of waitWriteValidate() and invalidateWaitWrite()), which is equal to the sum of inserts, deletes and moves for the simple TCAM and reflects the update performance of the four memory management mechanisms. As expected, the third memory management mechanism requires the least number of operations, due to its small number of moves. For the third memory management mechanism, the average number of waitWrites per insert and delete (number of waitWrites/(#Inserts + #Deletes)) ranged from a low of 1 for rrc01, rrc07, rrc16, and route-views.wide to a high of 1.0053 for rrc15. Figure 53(a) illustrates a graph 5300 showing the normalized average number of moves for each mechanism on a logarithmic scale. For this figure, the average number of moves per insert/delete for each data set was computed. Then the average of these averages was computed and normalized by the average of averages for the third memory management mechanism. Figure 53(b) illustrates a graph 5302 showing the normalized average waitWrites invoked by the different mechanisms. For this figure, the average number of waitWrites per insert/delete for each data set was computed, then the average of these averages was computed for each memory management mechanism and finally normalized by the average of the averages for the third memory management mechanism.
Effect of TCAM Size on Memory Management Mechanisms
The number of moves required by an update sequence is independent of the size of the TCAM (provided there are enough slots to accommodate all prefixes) when memory management mechanisms 1 and 4 are used. This, however, is not the case for memory management mechanisms 2 and 3. Because of the relatively poor performance of the second memory management mechanism in the earlier test (Figure 50), the impact of TCAM size on the number of moves using this mechanism was not studied. Figure 54 shows a table 5400 of the number of moves required by the (effective) inserts in each of the test update sequences for varying TCAM sizes. The column labeled #Prefixes gives the initial number of prefixes in the routing table while that labeled #MaxPrefixes gives the maximum size attained by the routing table during the course of the update sequence. The TCAM occupancy is defined to be #MaxPrefixes/(TCAM size)*100%. For the experiment, the TCAM size was selected so as to have occupancies of 80%, 90%, 95%, 97%, and 99%. As can be seen, even with an occupancy of 99%, the third memory management mechanism does very well. In fact, its nearest competitor, the fourth memory management mechanism (CAO_OPT), requires between 93 and 74000 times as many moves (for inserts and deletes combined) as required by the third memory management mechanism (see Figure 50 for the number of moves required by Mechanism 4).
6.2 Evaluation of DUOS
In DUOS, each prefix in the forwarding table occupies a slot in either the ITCAM or the LTCAM. Columns 2 and 5 of the table 5500 in Figure 55 show the initial prefix distribution between the two TCAMs of DUOS. Columns 3 and 6 give the distribution of the inserts (i.e., the number of non-leaf inserts and the number of leaf inserts) while columns 4 and 7 give the distribution of the deletes. It is noted that a leaf insert/delete may trigger additional insert and/or delete operations on the TCAMs of DUOS. These additional inserts/deletes are accounted for in Figure 55. As a result,
ITCAM.#inserts + LTCAM.#inserts ≥ #Inserts (3)
It is interesting to note that more than 90% of the prefixes in each data set are leaf prefixes and that more than 90% of the inserts and deletes in each update sequence are directed at the LTCAM. Given the distribution of the prefixes and of the insert and delete operations, an LTCAM with 300,000 slots and an ITCAM with 28,000 slots were instantiated for the DUOS experiments. Since the performance of DUOS is determined by the number of waitWrite operations, this quantity was measured for the datasets. In addition, since the number of moves directly impacts the number of waitWrite operations, the number of moves was measured separately so as to compare the effect of the four memory management mechanisms for the ITCAM. Figure 56 shows a table 5600 of the number of ITCAM moves for inserts and deletes. The number of moves shown in Figure 56 also includes the ITCAM moves resulting from ITCAM operations triggered by LTCAM inserts and deletes (for example, when a leaf prefix is inserted, an insert into the LTCAM is performed and its parent prefix (if any) is deleted from the LTCAM and reinserted into the ITCAM). The relative performance of the four memory management mechanisms for the ITCAM is quite similar to that observed for a simple TCAM organization, and the third memory management mechanism outperforms the remaining memory management mechanisms handily. Figure 57 shows a table 5700 of the number of waitWrites generated in the ITCAM, and it is found that the third memory management mechanism is the best for this metric, as expected from the smaller number of moves required by the third memory management mechanism. Figure 58 shows a table 5800 of the number of LTCAM moves required by the test update sequences. As expected, the number of LTCAM moves is zero (recall that, in an LTCAM, an insert may be done in any free slot and a slot freed by a delete is simply linked to the free space list). The total number of moves for the simple TCAM is between 17-24 times that for DUOS using the first memory management mechanism (PLO_OPT), between 9-14 times using the second memory management mechanism, 7-227 times using the third memory management mechanism, and 8-13 times using the fourth memory management mechanism (CAO_OPT). Note that the number of waitWrites in an LTCAM equals the number of inserts and deletes on the LTCAM, and waitWriteValidates in an LTCAM have a null wait, as no invalid slot is involved in an ongoing lookup. This is ensured by using invalidateWaitWrite to free a slot. Note that invalidateWaitWrite waits until an ongoing lookup is complete and then invalidates the slot. Since updates are done serially in the control plane, the invalidateWaitWrites from an LTCAM delete must complete before the next update operation begins.
6.3 Evaluation of DUOW
In evaluating DUOW, a wide SRAM was used in conjunction with the LTCAM only, as the ITCAM has relatively few (about 10%) of the prefixes. An LTCAM with 100,000 slots was instantiated, and the same configuration was used for the ITCAM as in the evaluation of DUOS. For the DUOW evaluation, only the third memory management mechanism was used for memory management in the ITCAM. Figure 59 shows a table 5900 of the number of LTCAM prefixes carved by Lu's carving heuristic [4] and by the carving heuristic of Section 4 discussed above. The carving by both methods is done only on the trie of leaf prefixes, as only leaf prefixes are stored in the LTCAM and its associated wide SRAM. Surprisingly, the number of prefixes that results when the method of one or more embodiments of the present invention is used is smaller than when the method of [4] is used. This is surprising because the method of one or more embodiments carves out independent prefixes while the method of [4] may carve any set of prefixes. The approximately 1% drop in the number of prefixes when the embodied carving method is used results from the observation that, when the embodied method is used, there is no need to supplement the carving prefixes with covering prefixes, while covering prefixes need to be added to the set of carving prefixes generated by the method of [4]. Since covering prefixes account for approximately 8% of the prefixes generated by the method of [4], a 1% drop in the total number of prefixes when the embodied method is used implies a roughly 7% increase in carving prefixes before accounting for covering prefixes.
Figure 60 shows a table 6000 of the number of inserts and deletes applied to the LTCAM of DUOW, as well as the number of waitWrites. It was observed that the number of waitWrites for the LTCAM of DUOW is more than the number of inserts and deletes done in the LTCAM. This is in contrast to DUOS, where the number of waitWrites is the same as the number of inserts and deletes. The difference arises because additional writes are needed in DUOW to maintain lookup consistency when the contents of an SRAM word are split or merged, or when a suffix is added to or deleted from an existing SRAM word.
It is noted that the number of ITCAM inserts and deletes as well as the number of ITCAM waitWrites are unaffected by the coupling of a wide SRAM to the LTCAM. So, the numbers shown in Figure 56 are valid for the DUOW ITCAM as well as for the DUOS ITCAM.
6.4 Evaluation of IDUOW
As was the case for the DUOW evaluation discussed above, for IDUOW as well, a wide SRAM was used only in conjunction with the LTCAM. Further, an index TCAM (ILTCAM) with an associated wide SRAM was added only to the LTCAM. The instantiated DLTCAM and ILTCAM had 200,000 and 20,000 slots, respectively. The DLTCAM bucket size was set to 512 slots for both mechanisms discussed above in Section 5. Figures 61 and 62 show tables 6100 and 6200 of the number of inserts and deletes as well as the number of waitWrites for the ILTCAM and DLTCAM using 1-12Wc, while Figures 63 and 64 show tables 6300 and 6400 of these numbers for the M-12Wb indexing mechanism. As can be seen, the 1-12Wc architecture required between 209 and 227 buckets, thereby using up between 107008 and 116224 DLTCAM slots. The number of moves resulting from bucket splits varied from 0 to 1085. The M-12Wb architecture is more space efficient, requiring between 128 and 153 buckets and thereby using up between 65536 and 78336 DLTCAM slots. However, the number of moves is between 800 and 15753 when M-12Wb is used. (It is shown below that the worst-case number of moves for these two architectures is comparable.) Just as in DUOW, the number of waitWrites is more than the number of inserts and deletes, and for the DLTCAM there is an additional source of writes: prefix moves resulting from bucket overflows.
6.5 Comparison with MIPS [19] and CAO_OPT [23]
MIPS [19] and an update-consistent version of CAO_OPT [23], obtained using the method of [18], are the competitors of DUO. In this section, the consistent update TCAM architectures MIPS, CAO_OPT, and DUO are compared. In MIPS, a data plane lookup is delayed if the lookup matches a TCAM slot whose next hop information is being updated. To avoid this delay while changing the nexthop of a prefix, in the experiments for MIPS, a new entry with the latest nexthop is first inserted and then the existing entry is deleted. This ensures that data plane lookups are consistent and correct and are not delayed by control plane operations. Also, as noted earlier, the MIPS architecture as described in [19] uses no memory management architecture, and free slots are determined using TCAM lookups that delay data plane lookups. To avoid these data plane lookup delays, for the experiments, the MIPS mechanism of [19] was augmented with the memory management architecture employed by the embodiment for the LTCAM (Section 4 above). For the ITCAM of DUO, memory management is done using Mechanism 3. Since the performance of the three TCAM mechanisms is characterized by the total number of waitWrite operations required by an update sequence as well as the maximum number of operations for an individual update request, the experiments measured these quantities.
Figure 65 shows a table 6500 of the total number of waitWrites required to perform the test update sequences. It can be seen that the DUO architecture of one or more embodiments of the present invention requires fewer write operations than MIPS and CAO_OPT. The average number of waitWrites per operation (insert or delete) ranged from a low of 1.5729 to a high of 3.1848 for MIPS, from 1.4908 to 1.6378 for CAO_OPT, from 1 to 1.0639 for DUOS, from 1.0008 to 1.3305 for DUOW, from 1.0008 to 1.3635 for IDUOW with 1-12Wc, and from 1.0053 to 1.4714 for IDUOW with M-12Wb. Since the various DUO mechanisms require a similar number of writes, M-12Wb is to be preferred because of its lower TCAM memory and power requirements. Figure 66A illustrates a graph 6600 of the normalized average waitWrites for the different architectures. For this figure, the average number of waitWrites per insert/delete was first computed for each dataset. Similarly, Figure 66B illustrates a graph 6602 of the normalized power for the different architectures.
Figure 67 shows a table 6700 of the maximum number of write operations required by an insert or delete in the test update sequences. As can be seen, MIPS uses a larger number of writes in the worst case than any of the remaining mechanisms. It was noticed that the worst-case number of writes for rrc00May20 is particularly large for MIPS. This is because the update sequence for rrc00May20 contains announcements and withdrawals of routes for prefixes of small lengths, such as 2 and 4. Each of these translates into a very large number of inserts/deletes of independent prefixes.
The DUOS and DUOW architectures of one or more embodiments of the present invention have better worst-case performance (on a per-update basis) than MIPS. DUOS is generally better than CAO_OPT. Even though the worst-case number of writes with IDUOW is more than that for CAO_OPT, the number of writes is bounded by the size of a bucket. Thus, the worst-case writes may be reduced by using a smaller bucket size than the 512 size used in the experiments. For example, when the bucket size is 32, the maximum number of write operations in the DLTCAM of IDUOW is also 32. This is because when an index node is split, the split node that has the smaller number of DLTCAM prefixes is relocated in one or more embodiments. Thus, at most 16 prefixes are moved, and hence there are at most 32 write operations.
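A minimal sketch of this relocate-the-smaller-half rule follows (illustrative only; the actual split is driven by the index node in the trie rather than by slot order, and the helper below is hypothetical):

def split_bucket(bucket, bucket_size):
    # Partition the overflowing bucket's prefixes into the two halves that
    # result from splitting its index node (modeled here as a simple cut).
    half = len(bucket) // 2
    left, right = bucket[:half], bucket[half:]
    # Relocate whichever half holds fewer DLTCAM prefixes: at most
    # bucket_size/2 prefixes move, and each move costs one invalidate plus
    # one write, so at most bucket_size writes result.
    moved = left if len(left) <= len(right) else right
    writes = 2 * len(moved)
    assert writes <= bucket_size
    return moved, writes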
Theoretically, it is possible for each update in MIPS to require a number of TCAM writes equal to the number of prefixes in the table. This happens, for example, when no leaf prefix in the trie has a sibling after the leaf-pushing and prefix-compression steps and a default prefix of length 0 is inserted into or deleted from that trie (see Figure 2). On the other hand, CAO_OPT requires at most W/2 moves per update (W = 32 for IPv4). Hence, CAO_OPT requires W/2 writes per update in the worst case. For DUOS, the worst-case writes occur when a prefix is to be inserted into the LTCAM and this requires a prefix deletion from the LTCAM and a prefix insertion into the ITCAM. The two LTCAM operations require 2 writes, whereas the ITCAM operation requires W writes in the worst case using Mechanism 3. Thus, DUOS requires (W + 2) writes in the worst case. For DUOW, the worst-case scenario is the same as that for DUOS, except that an LTCAM insert can require 3 writes when an SRAM word is split (1 delete to remove the split word and 2 inserts for the new words). Similarly, an LTCAM delete can also require 3 writes when SRAM words are merged (2 deletes for the two words merged and 1 insert for the new word). Thus, DUOW requires (W + 6) writes in the worst case. For IDUOW, the worst-case combination involves the ITCAM, ILTCAM and DLTCAM. IDUOW requires at most W writes for the ITCAM, 6 writes for the ILTCAM, and bucketSize writes for the DLTCAM, for a maximum of (W + bucketSize + 6) writes for a single update.
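These bounds can be summarized in a short sketch (W and bucketSize as above; for MIPS the bound is the table size itself, passed in as num_prefixes):

def worst_case_writes(arch, W=32, bucket_size=512, num_prefixes=None):
    # Worst-case waitWrites per single insert/delete, per the bounds above.
    bounds = {
        "MIPS": num_prefixes,          # can equal the number of prefixes in the table
        "CAO_OPT": W // 2,             # at most W/2 moves per update
        "DUOS": W + 2,                 # W ITCAM writes + 2 LTCAM writes
        "DUOW": W + 6,                 # SRAM word split/merge adds up to 3 writes each way
        "IDUOW": W + bucket_size + 6,  # ITCAM + ILTCAM + one DLTCAM bucket
    }
    return bounds[arch]

For IPv4 (W = 32), these bounds evaluate to 16, 34, 38, and 38 + bucketSize writes for CAO_OPT, DUOS, DUOW, and IDUOW, respectively, matching the figures quoted in the conclusion below.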
Figure 68 shows a table 6800 of the power consumption characteristics of MIPS, CAO_OPT and DUO in terms of the number of entries enabled during a search operation. The TCAM entries are counted based on the initial layout of prefixes for the input routing table. MIPS, CAO_OPT, DUOS and DUOW enable all valid TCAM entries during a search operation. IDUOW, on the other hand, enables all valid TCAM entries for the ITCAM and ILTCAM, and only a bucket of entries for the DLTCAM. Column 2 shows the number of enabled entries for MIPS, while column 3 shows the number of enabled entries for CAO_OPT on the simple TCAM and also for DUOS, which is obtained by summing the number of ITCAM and LTCAM entries. Both CAO_OPT and DUOS have the same number of TCAM entries since they store each prefix in a single TCAM entry. Column 4 shows the number of enabled entries for DUOW, which is obtained as the sum of valid ITCAM and LTCAM entries. Columns 5 and 6 show the number of enabled entries for IDUOW with 1-12Wc and M-12Wb, respectively. This number is obtained as the sum of valid entries in the ITCAM and ILTCAM and the number of entries in a DLTCAM bucket (fixed to 512 for the experiments). It is observed that for MIPS, the leaf-pushing and prefix-compression steps have reduced the number of TCAM entries, and hence the power, compared to CAO_OPT and DUOS. MIPS requires about 1.5 to 2 times the power required by DUOW for all the tests except rrc06 and rrc15. In the case of rrc06, MIPS requires about 7% more power than DUOW, while it requires about 7% less power on rrc15. MIPS consumes between 3 and 10 times the power consumed by IDUOW. Figure 66B shows the normalized average power for the different mechanisms. For this figure, the average number of enabled entries is first computed for every TCAM search for each architecture. Then, the average was normalized by the average number of enabled entries for IDUOW with 1-12Wc. Note that the power requirement for DUOW can be reduced further by using a wider SRAM than the 144-bit-wide SRAM used for the experiments. The power requirements for IDUOW may be reduced by increasing SRAM width and by adding an index TCAM and a wide SRAM to the ITCAM. For example, the power consumed by the DLTCAM and ILTCAM of IDUOW was less than 560 for the 1-12Wc mechanism and less than 630 for the M-12Wb mechanism. When an index TCAM and wide SRAM are added to the ITCAM of IDUOW, the power requirement for the ITCAM is expected to approximate that for the LTCAM (assuming the same bucket size is used). So, the IDUOW power requirement would drop to about 1120 for 1-12Wc and about 1260 for M-12Wb. So, with the addition of an index TCAM and a wide SRAM to the ITCAM of IDUOW, the power required by MIPS is between 68 and 248 times that required by IDUOW.
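The enabled-entry count used as the power proxy above can be sketched as follows (a simplified model, assuming the per-TCAM valid-entry totals are already known; the names are illustrative):

def enabled_entries(arch, counts, bucket_size=512):
    # counts holds the valid-entry totals of each constituent TCAM.
    if arch in ("MIPS", "CAO_OPT"):
        # One simple TCAM: every valid entry is enabled per search.
        return counts["tcam"]
    if arch in ("DUOS", "DUOW"):
        # Both TCAMs are searched in full (for DUOS this total equals the
        # CAO_OPT count, since each prefix occupies a single entry).
        return counts["itcam"] + counts["ltcam"]
    if arch == "IDUOW":
        # Full ITCAM and ILTCAM, but only one DLTCAM bucket is enabled.
        return counts["itcam"] + counts["iltcam"] + bucket_size
    raise ValueError("unknown architecture: " + arch)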
7 Conclusion
As discussed above, a dual TCAM architecture, DUO, for routing tables is provided by various embodiments of the present invention. Four memory management mechanisms are also provided for the ITCAM of DUO. Of these mechanisms, memory management mechanism 3, which maintains free slots at TCAM block boundaries as well as free-slot lists within each block, was found to perform best on the test data, requiring between 1/74000 and 1/93 times the number of moves required by its nearest competitor, memory management mechanism 4, which is based on CAO_OPT [23]. The DUO architectures of one or more embodiments, like those based on CoPTUA [18], provide for consistent data-plane lookups and incremental control-plane updates that do not delay data-plane lookups. While the MIPS architecture of [19] provides consistent data-plane lookups, these lookups may be delayed by ongoing control-plane operations that, for example, change the next hop associated with a prefix. These delays may be eliminated by implementing a next hop change as an insert followed by a delete, as suggested in [19]. Delays caused by control-plane operations that require a free slot to be found may be eliminated using one of the various memory management embodiments, such as the third memory management embodiment. Making these two modifications to MIPS results in a delay-free MIPS.
Experiments with delay-free MIPS and a consistent-lookup version of CAO_OPT indicate that these two architectures make, on average, between 1.5 and 2 times as many TCAM writes as any of the DUO architectures of one or more embodiments to perform control-plane updates. In terms of the worst-case number of writes needed for an insert or delete, MIPS requires as many writes as there are prefixes in the table, while CAO_OPT requires 16 for IPv4, DUOS requires 34, DUOW requires 38, and IDUOW requires 38 + bucketSize. On the test data, MIPS required up to 98,867 writes for a single insert/delete, while CAO_OPT required at most 8 writes, DUOS required at most 10 writes, DUOW required at most 11 writes, and IDUOW required at most 513 writes. The maximum number of writes for IDUOW may be reduced by reducing the bucket size. The very large number of worst-case writes for MIPS is a serious problem, as it makes the router very susceptible to malicious users who inject a stream of worst-case inserts/deletes into the update stream. While this is also an issue, though to a lesser extent, for IDUOW, IDUOW offers power advantages over the remaining DUO mechanisms. On the test data, MIPS reduced power consumption by between 4% and 69% relative to CAO_OPT and DUOS, which consume the same amount of power. However, MIPS generally required between 1.5 and 2 times the power required by DUOW and between 3 and 10 times that required by one embodiment of IDUOW. Further, by adding an index TCAM and a wide SRAM to the ITCAM of IDUOW, the power required by MIPS rises to between 68 and 248 times that required by the enhanced IDUOW. Further reduction in the power required by DUOW and IDUOW results from using a wider SRAM than the 144-bit-wide SRAM used in the experiments.
The DUO architectures outperform MIPS and CAO_OPT in terms of the total number of writes needed to perform an update sequence. Additionally, the DUOW and IDUOW architectures use significantly less power than MIPS, CAO_OPT, and DUOS.
Operating Environment
According to one embodiment of the present invention, as shown in FIG. 70, an information processing system 7000 is illustrated.
It should be noted that FIG. 70 only shows one environment in which a TCAM is applicable. The various embodiments of the present invention are not limited to a single information processing system or an information processing system in general. For example, TCAMs can be utilized within a wide variety of electronic devices.
In particular, FIG. 70 is a block diagram illustrating a detailed view of an information processing system 7000 according to one embodiment of the present invention. The information processing system is based upon a suitably configured processing system adapted to implement one or more embodiments of the present invention. Any suitably configured processing system is similarly able to be used as the information processing system 7000 by embodiments of the present invention, such as a personal computer, workstation, or the like. The information processing system 7000 includes a computer 7002. The computer 7002 has one or more processors 7004 that are connected to one or more memories 7008 that can implement the TCAM architectures shown in Figure 3 and Figure 70, comprising the DUOS, DUOW, and IDUOW embodiments discussed above. The one or more processors 7004 are also coupled to a mass storage interface 7010 and network adapter hardware 7012. A system bus 7014 interconnects these system components. The mass storage interface 7010 is used to connect mass storage devices, such as data storage device 7016, to the information processing system 7000. One specific type of data storage device is an optical drive such as a CD/DVD drive, which may be used to store data to and read data from a computer readable medium or storage product such as (but not limited to) a CD/DVD 7018. Another type of data storage device is a data storage device configured to support, for example, NTFS type file system operations.
In one embodiment, the information processing system 7000 utilizes conventional virtual addressing mechanisms to allow programs to behave as if they have access to a large, single storage entity, referred to herein as a computer system memory, instead of access to multiple, smaller storage entities, other memories 7008, and data storage device 7016. Note that the term "computer system memory" is used herein to generically refer to the entire virtual memory of the information processing system 7000.
Although only one CPU 7004 is illustrated for computer 7002, computer systems with multiple CPUs can be used equally effectively. Embodiments of the present invention further incorporate interfaces that each include separate, fully programmed microprocessors that are used to off-load processing from the CPU 7004. An operating system (not shown) included in the main memory is a suitable multitasking operating system such as the Linux, UNIX, Windows XP, or Windows Server operating system. Embodiments of the present invention are able to use any other suitable operating system. Some embodiments of the present invention utilize architectures, such as an object oriented framework mechanism, that allow instructions of the components of the operating system (not shown) to be executed on any processor located within the information processing system 7000. The network adapter hardware 7012 is used to provide an interface to a network 7020. Embodiments of the present invention are able to be adapted to work with any data communications connections including present day analog and/or digital techniques or via a future networking mechanism.
Non-Limiting Examples
The present invention can be realized in hardware, software, or a combination of hardware and software. A system according to one embodiment of the present invention can be realized in a centralized fashion in one computer system or in a distributed fashion where different elements are spread across several interconnected computer systems. Any kind of computer system, or other apparatus adapted for carrying out the methods described herein, is suitable. A typical combination of hardware and software could be a general purpose computer system with a computer program that, when loaded and executed, controls the computer system such that it carries out the methods described herein. Although specific embodiments of the invention have been disclosed, those having ordinary skill in the art will understand that changes can be made to the specific embodiments without departing from the spirit and scope of the invention. The scope of the invention is not to be restricted, therefore, to the specific embodiments, and it is intended that the appended claims cover any and all such applications, modifications, and embodiments within the scope of the present invention.
References
Each of the following twenty-eight references is hereby incorporated by reference in its entirety.
[1] M. Akhbarizadeh, M. Nourani, R. Panigrahy and S. Sharma, A TCAM-based parallel architecture for high-speed packet forwarding, IEEE Trans. on Computers, 56, 1, 2007, 58-72.
[2] Y. Chang, Power-efficient TCAM partitioning for IP lookups with incremental updates, ICOIN Proceedings, Lecture Notes in Computer Science, Springer Verlag, 3391, 2005, 531-540.
[3] H. Lu, Improved Trie Partitioning for Cooler TCAMs, ACST, 2004.
[4] W. Lu and S. Sahni, Low Power TCAMs For Very Large Forwarding Tables, Proceedings of INFOCOM, 2008.
[5] W. Lu and S. Sahni, Succinct representation of static packet classifiers, International Conference on Computer Networking, 2007.
[6] http://bgp.potaroo.net, 2007.
[7] http://www.ripe.net/projects/ris/rawdata.html, 2008.
[8] H. Liu, Routing Table Compaction in Ternary-CAM, IEEE Micro, 22, 3, 2002.
[9] V.C. Ravikumar, R. N. Mahapatra, and L. N. Bhuyan, EaseCAM: An Energy And Storage Efficient TCAM-Based Router Architecture for IP Lookup, IEEE Transactions on Computers, 54, 5, May 2005, 521-533.
[10] V.C. Ravikumar, R. N. Mahapatra, and L. N. Bhuyan, TCAM architecture for IP lookup using prefix properties, IEEE Micro, 24, 2, March 2004, 60-69.
[11] R. Draves, C. King, S. Venkatachary, and B. Zill, Constructing Optimal IP Routing Tables, Proceedings of INFOCOM, 1999.
[12] M. Ruiz-Sanchez, E. Biersack, and W. Dabbous, Survey and taxonomy of IP address lookup algorithms, IEEE Network, 2001, 8-23.
[13] S. Sahni, K. Kim, and H. Lu, Data structures for one-dimensional packet classification using most-specific-rule matching, International Journal on Foundations of Computer Science, 14, 3, 2003, 337-358.
[14] C. A. Zukowski and S. Wang, Use of Selective Precharge for Low-Power Content-Addressable Memories, IEEE International Symposium on Circuits and Systems, 1997.
[15] N. Mohan and M. Sachdev, Low Power Dual Matchline Ternary Content Addressable Memory, IEEE International Symposium on Circuits and Systems, 2004.
[16] H. Miyatake, M. Tanaka, and Y. Mori, A design for high-speed low-power CMOS fully parallel content addressable memory macros, IEEE Journal of Solid-State Circuits, 36, 6, June 2001, 956-968.
[17] C.-S. Lin, J.-C. Chang, and B.-D. Liu, A low-power pre-computation based fully parallel content addressable memory, IEEE Journal of Solid State Circuits, 38, 4, April 2003, 654-662.
[18] Z. Wang, H. Che, M. Kumar, and S.K. Das, CoPTUA: Consistent Policy Table Update Algorithm for TCAM without Locking, IEEE Transactions on Computers, 53, 12, December 2004, 1602-1614.
[19] G. Wang and N. Tzeng, TCAM-Based Forwarding Engine with Minimum Independent Prefix Set (MIPS) for Fast Updating, IEEE International Conference on Communications, Volume 1, June 2006, 103-109.
[20] M. Wang, S. Deering, T. Hain, and L. Dunn, Non-random Generator for IPv6 Tables, 12th Annual IEEE Symposium on High Performance Interconnects, 2004.
[21] F. Zane, G. Narlikar and A. Basu, CoolCAMs: Power-Efficient TCAMs for Forwarding Engines, INFOCOM, 2003.
[22] T. Mishra and S. Sahni, PETCAM: A Power Efficient TCAM For Forwarding Tables, IEEE Symposium on Computers and Communications, 2009.
[23] D. Shah and P. Gupta, Fast Updating Algorithms on TCAMs, IEEE Micro, Volume 21, Issue 1, Jan-Feb 2001, 36-47.
[24] M. Akhbarizadeh and M. Nourani, Efficient Prefix Cache For Network Processors, IEEE Symp. on High Performance Interconnects, 41-46, 2004.
[25] V. Srinivasan and G. Varghese, Faster IP lookups using controlled prefix expansion, SIGMETRICS, 1998.
[26] K. Zheng, C. Hu, H. Lu and B. Liu, An Ultra High Throughput and Power Efficient TCAM-Based IP Lookup Engine, Proceedings of INFOCOM, 2004.
[27] M. Akhbarizadeh, M. Nourani, R. Panigrahy and S. Sharma, A TCAM-based parallel architecture for high-speed packet forwarding, IEEE Trans. on Computers, 56, 1, 2007, 58-72.
[28] T. Mishra and S. Sahni, DUO-Dual TCAM architecture for routing tables with incremental update, IEEE Symposium on Computers and Communications, June 2010.
Claims

What is claimed is:
1. A method for managing router tables, the method comprising:
classifying a set of prefixes in a plurality of router table prefixes as a set of leaf prefixes and a remaining set of prefixes in the plurality of router table prefixes as a set of internal prefixes, wherein a leaf prefix is not a prefix of another prefix in a router table;
storing the set of internal prefixes in a first ternary content addressable memory;
storing the set of leaf prefixes in a second ternary content addressable memory;
storing for each internal prefix stored in the first ternary content addressable memory, a corresponding destination hop in a first random access memory;
storing for each leaf prefix stored in the second ternary content addressable memory, a corresponding destination hop in a second random access memory;
receiving a packet with at least one destination address;
performing, using the destination address, a simultaneous lookup in the first ternary content addressable memory and the second ternary content addressable memory to retrieve up to two index values;
in response to the second ternary content addressable memory returning an index, retrieving a next hop from the second random access memory; and
routing the packet to the next hop.
2. The method of claim 1, further comprising:
in response to the second ternary content addressable memory failing to return an index, retrieving a next hop from the first random access memory; and
routing the packet to the next hop.
3. The method of claim 1, wherein the first ternary content addressable memory comprises a priority encoder, and wherein the second ternary content addressable memory does not comprise a priority encoder.
4. The method of claim 1, wherein performing the simultaneous lookup further comprises:
determining that a match was found in the second ternary content addressable memory; and
aborting the lookup in the first ternary content addressable memory in response to determining that a match was found in the second ternary content addressable memory.
5. The method of claim 1, further comprising:
performing router-table updates one at a time.
6. The method of claim 5, wherein each update is performed without interrupting lookup operations in the first ternary content addressable memory and the second ternary content addressable memory.
7. The method of claim 1, wherein at least one of the first random access memory and the second random access memory is a wide static random access memory comprising at least 32 bits.
8. The method of claim 7, further comprising:
storing, using a suffix node format, a subtree of a binary trie data structure into a given word size of the wide static random access memory.
9. The method of claim 8, further comprising:
storing internal indices to the subtree in a word of the wide static random access memory into one of the first ternary content addressable memory and the second ternary content addressable memory.
10. The method of claim 9, wherein the suffix node format includes a suffix count, a suffix length, and a next hop for packet routing.
11. The method of claim 8, wherein the subtree is a partition including one or more nodes of a trie representing prefixes for destination addresses.
12. The method of claim 1, further comprising:
managing unused memory in the first ternary content addressable memory by distributing free space between ternary content addressable memory blocks in the first ternary content addressable memory.
13. The method of claim 1, further comprising:
managing unused memory in the first ternary content addressable memory by having contiguous free space between ternary content addressable memory blocks as well as free space within a block in the first ternary content addressable memory.
14. The method of claim 1, further comprising:
managing unused memory in the second ternary content addressable memory by linking a set of free slots through the second random access memory and having a pointer to a first free slot in the set of free slots in a control plane memory of a router.
15. The method of claim 1, wherein the second ternary content addressable memory is a data ternary content addressable memory, and wherein the receiving further comprises:
receiving the packet at an indexed ternary content addressable memory, wherein the indexed ternary content addressable memory indexes an indexed static random access memory, and wherein, in response to indexing the indexed static random access memory, an indication of a set of addresses in the data ternary content addressable memory to be searched is obtained.
16. The method of claim 1, wherein the receiving further comprises:
receiving the packet at an indexed ternary content addressable memory, wherein the indexed ternary content addressable memory indexes an indexed static random access memory, and wherein, in response to indexing the indexed static random access memory, an indication of a set of addresses in the first ternary content addressable memory to be searched is obtained.
17. An information processing system for managing router tables, the information processing system comprising:
a processor;
a first ternary content addressable memory coupled to the processor;
a second ternary content addressable memory coupled to the processor;
a first random access memory coupled to the processor; and
a second random access memory coupled to the processor,
wherein the processor is configured to perform a method comprising:
classifying a set of prefixes in a plurality of router table prefixes as a set of leaf prefixes and a remaining set of prefixes in the plurality of router table prefixes as a set of internal prefixes, wherein a leaf prefix is not a prefix of another prefix in a router table;
storing the set of internal prefixes in the first ternary content addressable memory;
storing the set of leaf prefixes in the second ternary content addressable memory;
storing for each internal prefix stored in the first ternary content addressable memory, a corresponding destination hop in the first random access memory;
storing for each leaf prefix stored in the second ternary content addressable memory, a corresponding destination hop in the second random access memory;
receiving a packet with at least one destination address;
performing, using the destination address, a simultaneous lookup in the first ternary content addressable memory and the second ternary content addressable memory to retrieve up to two index values;
in response to the second ternary content addressable memory returning an index, retrieving a next hop from the second random access memory; and
routing the packet to the next hop.
18. The information processing system of claim 17, wherein performing the simultaneous lookup further comprises:
determining that a match was found in the second ternary content addressable memory; and
aborting the lookup in the first ternary content addressable memory in response to determining that a match was found in the second ternary content addressable memory.
19. A computer program product for managing router tables, the computer program product comprising:
a storage medium readable by a processing circuit and storing instructions for execution by the processing circuit for performing a method comprising:
classifying a set of prefixes in a plurality of router table prefixes as a set of leaf prefixes and a remaining set of prefixes in the plurality of router table prefixes as a set of internal prefixes, wherein a leaf prefix is not a prefix of another prefix in a router table;
storing the set of internal prefixes in a first ternary content addressable memory;
storing the set of leaf prefixes in a second ternary content addressable memory;
storing for each internal prefix stored in the first ternary content addressable memory, a corresponding destination hop in a first random access memory;
storing for each leaf prefix stored in the second ternary content addressable memory, a corresponding destination hop in a second random access memory;
receiving a packet with at least one destination address;
performing, using the destination address, a simultaneous lookup in the first ternary content addressable memory and the second ternary content addressable memory to retrieve up to two index values;
in response to the second ternary content addressable memory returning an index, retrieving a next hop from the second random access memory; and
routing the packet to the next hop.
20. The computer program product of claim 19, wherein performing the simultaneous lookup further comprises:
determining that a match was found in the second ternary content addressable memory; and
aborting the lookup in the first ternary content addressable memory in response to determining that a match was found in the second ternary content addressable memory.
PCT/US2011/023611 2010-02-03 2011-02-03 Duo-dual tcam architecture for routing tables with incremental update WO2011097385A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US30094510P 2010-02-03 2010-02-03
US61/300,945 2010-02-03

Publications (2)

Publication Number Publication Date
WO2011097385A2 true WO2011097385A2 (en) 2011-08-11
WO2011097385A3 WO2011097385A3 (en) 2011-12-15

Family

ID=44356071

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2011/023611 WO2011097385A2 (en) 2010-02-03 2011-02-03 Duo-dual tcam architecture for routing tables with incremental update

Country Status (1)

Country Link
WO (1) WO2011097385A2 (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020080755A1 (en) * 2000-12-22 2002-06-27 Tasman Mitchell Paul Architecture and mechanism for forwarding layer interfacing for networks
US20040196854A1 (en) * 2003-04-02 2004-10-07 Pascal Thubert Arrangement in a router for generating a route based on a pattern of a received packet
US20060164995A1 (en) * 2005-01-27 2006-07-27 Martin Djernaes Method and apparatus for context-based prefix updates in border gateway protocol
US20070133560A1 (en) * 2005-12-07 2007-06-14 Nam Kook J Method and apparatus for processing packet in high speed router

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102307149A (en) * 2011-09-23 2012-01-04 中国科学院计算技术研究所 IP (internet protocol) lookup method and device and route updating method and device
CN102307149B (en) * 2011-09-23 2014-05-07 中国科学院计算技术研究所 IP (internet protocol) lookup method and device and route updating method and device
CN102427414A (en) * 2011-11-25 2012-04-25 盛科网络(苏州)有限公司 Method and device for automatically testing table entry volume
CN108494687A (en) * 2018-03-08 2018-09-04 全球能源互联网研究院有限公司 Power information physical system multipath networking communication means based on wifi and system
CN108494687B (en) * 2018-03-08 2022-05-31 全球能源互联网研究院有限公司 Wifi-based multi-path networking communication method and system for electric power information physical system

Also Published As

Publication number Publication date
WO2011097385A3 (en) 2011-12-15

Similar Documents

Publication Publication Date Title
JP5529976B2 (en) Systolic array architecture for high-speed IP lookup
EP1551141B1 (en) Apparatus and method using hashing for efficiently implementing an IP lookup solution in hardware
US8868926B2 (en) Cryptographic hash database
Waldvogel et al. Scalable high-speed prefix matching
Ruiz-Sánchez et al. Survey and taxonomy of IP address lookup algorithms
US6434144B1 (en) Multi-level table lookup
US7571156B1 (en) Network device, storage medium and methods for incrementally updating a forwarding database
US7356663B2 (en) Layered memory architecture for deterministic finite automaton based string matching useful in network intrusion detection and prevention systems and apparatuses
US8089961B2 (en) Low power ternary content-addressable memory (TCAMs) for very large forwarding tables
US20070168377A1 (en) Method and apparatus for classifying Internet Protocol data packets
US7797348B2 (en) Data structure and system for IP address lookup and IP address lookup system
KR100612256B1 (en) Apparatus and Method for Managing Ternary Content Addressable Memory
CN107528783B (en) IP route caching with two search phases for prefix length
Mishra et al. Duos-simple dual tcam architecture for routing tables with incremental update
US20080192754A1 (en) Routing system and method for managing rule entries of ternary content addressable memory in the same
US20060271540A1 (en) Method and apparatus for indexing in a reduced-redundancy storage system
JP3570323B2 (en) How to store prefixes for addresses
CN108134739B (en) Route searching method and device based on index trie
Luo et al. A hybrid hardware architecture for high-speed IP lookups and fast route updates
US7478109B1 (en) Identification of a longest matching prefix based on a search of intervals corresponding to the prefixes
Luo et al. A hybrid IP lookup architecture with fast updates
US6925503B2 (en) Method and system for performing a longest prefix match search
WO2011097385A2 (en) Duo-dual tcam architecture for routing tables with incremental update
Kuo et al. A memory-efficient TCAM coprocessor for IPv4/IPv6 routing table update
Chang A 2-level TCAM architecture for ranges

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11740359

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase in:

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11740359

Country of ref document: EP

Kind code of ref document: A2