US20080098208A1 - Analyzing and transforming a computer program for executing on asymmetric multiprocessing systems

Analyzing and transforming a computer program for executing on asymmetric multiprocessing systems

Info

Publication number
US20080098208A1
US20080098208A1 (application US11/898,360)
Authority
US
United States
Prior art keywords: program, data, sections, computer program, code
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US11/898,360
Inventor
Alastair David Reid
Simon Andrew Ford
Yuan Lin
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ARM Ltd
University of Michigan
Original Assignee
ARM Ltd
University of Michigan
Application filed by ARM Ltd and University of Michigan
Priority to US11/898,360
Assigned to THE REGENTS OF THE UNIVERSITY OF MICHIGAN. Assignors: LIN, YUAN
Assigned to ARM LIMITED. Assignors: FORD, SIMON ANDREW; REID, ALASTAIR DAVID
Publication of US20080098208A1
Current legal status: Abandoned

Classifications

    • G06F11/362 Software debugging
    • G06F11/28 Error detection; Error correction; Monitoring by checking the correct order of processing
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3636 Software debugging by tracing the execution of the program
    • G06F11/3466 Performance evaluation by tracing or monitoring

Definitions

  • the field of the invention relates to data processing and in particular to improving the performance of program execution.
  • Such complex systems may comprise a number of different processing or execution units and they may be heterogeneous or asymmetric, with specialised processing units being used to increase energy efficiency and lower gate count.
  • the programming of embedded systems, with their hardware restrictions, demand for efficiency and ever-decreasing time to market, is becoming a real problem.
  • Decoupling programs to produce a number of threads communicating via FIFO pipelines has been used many times before: Smith (James E. Smith, “Decoupled access/execute computer architectures”, ACM Transactions Computing Systems, 2(4), 289-308, 1984) applies the technique manually to Cray assembly code; Palacharla and Smith (S. Palacharla and J. E. Smith, “Decoupling integer execution in superscalar processors”, in MICRO 28: Proc. of International Symposium on Microarchitecture, 285-290, 1995) describe the use of program slicing to automate the separation. These uses of decoupling were targeted at hiding memory latency by having one thread perform all load-store operations while the other thread performs all arithmetic operations.
  • SoC: system on chip
  • a first aspect of the present invention provides a method of transforming a portion of a computer program comprising a list of sequential instructions comprising control code and data processing code and a program separation indicator indicating a point where said sequential instructions may be divided to form separate sections that are capable of being separately executed and that each comprise different data processing code, said method comprising the steps of: (i) analysing said portion of said program to determine if said sequential instructions can be divided at said point indicated by said program separation indicator and in response to determining that it can: (iia) providing data communication between said separate sections indicated by said program separation indicator, such that said separate sections can be decoupled from each other, such that at least one of said sections is capable of being separately executed by an execution mechanism that is separate from an execution mechanism executing another of said separate sections, said at least one of said sections being capable of generating data and communicating said data to at least one other of said separate sections; and in response to determining it can not: (iib) not performing step (iia).
  • the present method provides a tool for analysing a portion of the program to determine if the instructions can be divided at a point indicated by a separation indicator.
  • separation indicators are provided within at least a section of the program and indicate where it is desirable to divide the program.
  • the division of the program is determined to some degree by these separation indicators and can thus be controlled by a programmer.
  • the method of the present invention performs an analysis of a program that actually includes the separation indicators and decides if the program can indeed be separated at these points. If it is decided that it can be, it provides data communication between the two sections to allow them to be decoupled from each other.
  • the program can be split into sections suitable for separate execution allowing a program to be efficiently processed by a variety of different, often complex devices. If it decides it cannot be divided at this point then it does not perform the data communication step.
  • a warning indicating an error in the computer program is output. Providing the programmer with a warning may be the most appropriate thing to do if the separation indicators are not in the correct position.
  • said step (iib) comprises amending said computer program such that said sequential instructions can be divided at said point and then performing step (iia).
  • the method can amend the computer program so that the sequential instructions can be divided at this point, and the data communication can then be provided between the different sections. It may be a relatively simple matter to amend the computer program so that it can be divided at the point indicated, and if this is the case the method can perform this step rather than outputting a warning.
  • said step of amending said computer program comprises inserting data transfer instructions at said point indicated by said program separation indicator.
  • the step required to amend the computer program may be one of inserting data transfer instructions at the point indicated by the program separation indicator.
  • said step (iib) comprises merging said two sections together and removing said program separation indicator.
  • said program separation indicator comprises at least one data transfer instruction, said data communication between said separate sections being provided in dependence upon said at least one data transfer instruction.
  • While program separation indicators can take a number of forms, it is quite efficient if they take the form of data transfer instructions.
  • providing program separation indicators in the form of data transfer instructions may facilitate their separation by providing the data communication required.
  • said step (iia) of providing data communication comprises inserting at least one “put data into a data store” instruction and at least one “get data from said data store” instruction into said instruction stream, and dividing said computer program between said put and get instructions to form said at least one separate section.
  • the step of providing data communication can be performed by inserting a "put data into a data store" instruction and a "get data from said data store" instruction into the instruction stream. This allows data to be output from one section of the program and then input into the other section via a data store. Thus, the two sections are in effect decoupled from each other, but data can travel between the two via this data store.
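  • As an illustration, a minimal sketch of the two resulting sections, assuming simple blocking put/get channel operations (the channel fifo and the functions f and g are placeholders, not taken from the original):

        /* section 1 (producer): executed on one execution mechanism */
        for (int i = 0; i < n; i++) {
            put(fifo, f(i));        /* put data into the data store */
        }

        /* section 2 (consumer): executed on another execution mechanism */
        for (int i = 0; i < n; i++) {
            g(get(fifo));           /* get data from the data store */
        }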
  • said data store comprises a FIFO buffer.
  • the data store may comprise a FIFO buffer, as this is clearly the simplest arrangement, where the first data to exit from a section of the program is the first data to enter the next section. However, it may be that the data is not required in a particular order, or indeed that not all the data generated by one section is required by the other.
  • a variety of different data stores and different arrangements can be used in some embodiments. For example, a stack, which has last-in first-out semantics, could be used; one advantage of this is that a stack is simple to implement.
  • said step (iia) comprises providing cyclic data communication between said separate sections.
  • the decoupling of threads can be further extended to cases where communication between threads is cyclic. Cyclic thread dependencies can lead to deadlock; that is, two threads may not run in parallel because of data dependencies between them, and thus in devices of the prior art decoupling is limited to acyclic thread dependencies. Embodiments of the present invention address this problem and support cyclic dependencies. This may be done, for example, by using put and get instructions and not requiring the number of puts to be equal to the number of gets. This is in contrast to the prior art, where put and get operations are always inserted in corresponding places in each thread. Allowing put operations to be inserted in places that do not correspond to get operations in other threads means that code such as is illustrated in FIG. 4 can be produced; a minimal sketch of the idea appears below.
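  • A minimal sketch of such cyclic communication, assuming blocking put/get operations on two channels c1 and c2 (all names are placeholders); the single initial put, which has no corresponding extra get, is what prevents deadlock:

        /* thread A: performs one extra put during initialization */
        put(c1, seed);                  /* breaks the cycle: no matching get */
        for (int i = 0; i < n; i++) {
            int x = get(c2);
            put(c1, f(x));
        }

        /* thread B */
        for (int i = 0; i < n; i++) {
            int y = get(c1);
            put(c2, g(y));
        }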
  • said separate sections comprise the same control code.
  • control code is the same in the two sections as the computer program is divided such that different data processing steps are performed under the same control in each divided section. Duplicating control code in this way enables the program to be divided.
  • in some embodiments, the control code will be different. This is because it may occasionally be advantageous to slightly modify the control code in one of the sections such that, for example, a conditional check that is no longer required is not present.
  • said portion of said computer program comprises a plurality of program separation indicators each indicating a point where said sequential instructions may be divided to form separate sections, each of said separate sections being capable of being separately executed and comprising different data processing code, said method providing data communication between said separate sections indicated by said plurality of program separation indicators.
  • An instruction loop having several data processing steps for example can be divided by embodiments of the present invention into two sections by allowing the different sections to have different data processing codes. This can increase the performance of a system significantly. Generally this is done by duplicating the control code and in effect performing two loops, one performing one or more of the data processing steps of the original loop and the other performing the rest of the steps.
  • said transformed computer program is suitable for execution upon respective execution mechanisms of a heterogeneous system having a complex asymmetric memory hierarchy and a plurality of execution mechanisms.
  • Embodiments of the present invention, although applicable to symmetric systems, are particularly valuable in asymmetric heterogeneous systems, wherein it is often difficult to separately execute sections of a program, particularly where at least a portion of the program is written sequentially.
  • a section of code is executed by a single execution mechanism
  • said control code of at least one of said sections is operable to be processed by a processor of said heterogeneous system and said data processing code of said section is operable to be processed by an execution mechanism under control of said control code processed by said processor.
  • An execution mechanism may be a simple mechanism designed for a particular function, such as a memory transfer unit (colloquially known as a “DMA engine”) and in such cases it may be that the control code is performed on a separate processor, while the data processing operations are performed on the simpler mechanism.
  • DMA engine: memory transfer unit
  • the plurality of execution mechanisms can take a number of forms, including a general purpose processor; a direct memory access unit; a coprocessor; a VLIW processor; a digital signal processor; and a hardware accelerator unit.
  • said method comprises an initial step, performed before step (i), of: defining said portion of said computer program by marking said computer program with indications delimiting said portion of said sequential instructions within which said at least two sections are to be located.
  • said computer program comprises said portion having a number of instructions to be executed sequentially and at least one further portion having instructions to be performed in parallel with each other.
  • a computer program may have different portions, some for execution sequentially and some already written for parallel processing. In such a case, it is the portion that has the instructions for sequential execution that is analysed to see if it can be divided into sections for separate execution. It should be noted that a portion to be analysed may be within a section that is to be executed in parallel. Furthermore, a portion to be analysed may also contain two or more sections that are to be executed in parallel.
  • said portion of said computer program comprises an instruction loop comprising at least two data processing instructions, and said at least two sections each comprise said instruction loop, each section comprising at least one of said at least two data processing instructions, said at least two sections comprising different data processing instructions.
  • An instruction loop having several data processing steps can be divided into two sections, and thereby increase the performance of a system significantly.
  • the present method is able to duplicate the control code and perform in effect two loops, one performing one or more of the data processing steps of the initial loop and the other performing the rest of the steps.
  • said portion of said computer program comprises a whole computer program.
  • a second aspect of the present invention provides a computer-readable storage medium comprising a computer program for controlling a computer to perform the method of the first aspect of the present invention.
  • a third aspect of the invention provides a computer executing a computer program to perform the method of a first aspect of the present invention.
  • a further aspect of the present invention provides a method of transforming a portion of a computer program comprising a list of sequential instructions and a program separation indicator indicating a point where said sequential instructions may be divided to form separate sections that are capable of being separately executed and that each comprise different data processing code, said list of instructions comprising control code and data processing code, said method comprising the step of:
  • the program comprises program separation indicators indicating points where it may be divided; the program can then be transformed by providing data communication between the separate sections at the points indicated, so that they can be decoupled from each other. This allows the program to be split into sections suitable for separate execution and allows the program to be efficiently processed by a variety of different, often complex devices. This enables future analysis of the program by a programmer to remain relatively straightforward while still enabling it to execute efficiently on a parallel system.
  • said method comprises a further initial step (0) performed before step (i) of in response to said program separation indicator, analysing said portion of said computer program and determining which of said sequential instructions should be in which of said separate sections prior to providing said data communication.
  • FIGS. 1 a to 1 c show flow diagrams of methods according to embodiments of the present invention
  • FIGS. 2 a to 2 d schematically show the splitting into separately executable sections of a computer program according to an embodiment of the present invention
  • FIGS. 3 a to 3 b schematically show a method of splitting and then merging sections of a computer program
  • FIG. 4 schematically shows data communication between two sections of a program
  • FIG. 5 a shows a simple computer program annotated according to an embodiment of the present invention
  • FIG. 5 b shows the maximal set of threads for the program of FIG. 5 a.
  • FIG. 6 schematically illustrates an asymmetric multiprocessing apparatus with an asymmetric memory hierarchy
  • FIG. 7 illustrates an architectural description
  • FIG. 8 illustrates a communication requirement
  • FIG. 9 illustrates communication support
  • FIG. 1 a shows a flow diagram illustrating a method according to an embodiment of the present invention.
  • in a first step, a portion of a computer program is analysed, the portion comprising a list of sequential instructions and a program separation indicator indicating a point where the sequential instructions may be divided to form separate sections that are capable of being separately executed. The analysis determines whether the sequential instructions can be split at the point indicated by the separation indicator into separate sections that can be processed on different execution mechanisms. If it determines that they can, the sequential instructions are divided into the separate sections at the point indicated by the program separation indicator. If it determines that they cannot be separated at this point, a warning is output to the programmer to indicate an error in the program.
  • FIG. 1 b illustrates an alternative embodiment in which, rather than outputting a warning if the program cannot be decoupled and separated at the indicated point, the program is amended by inserting data communication instructions into the list of sequential instructions, these data communication instructions enabling the different sections to be decoupled and thus separated. The separation can then be performed.
  • FIG. 1 c provides an alternative embodiment in which in response to determining that the program cannot be separated at the indicated point the two sections are merged together and the program separation indicator removed.
  • the three embodiments provide different solutions when an analysis determines that it is not possible to separate a program into sections at the point indicated. Different ones of these embodiments may be used in the same analysis of a program for different separation indicators, depending on circumstances. Thus, the preferred course of action may be to amend the program to make it separable at the indicated point; if this cannot be done, it may be chosen to merge the two portions; or, if this would result in an unacceptably large portion, a warning may be output.
  • FIG. 2 a shows a portion of a computer program comprising a loop in which data items are processed: function f operates on the data items, function g operates on the data items output by function f, and function h then operates on these items. These functions are performed n times in a row, for values of i from 1 to n.
  • control flow can be seen as following the solid arrows while data flow follows the dotted arrows.
  • decouple or program separation indications are inserted into the data flow where it is seen as desirable to split the portion into sections that are decoupled from each other and can thus be executed on separate execution mechanisms.
  • a decouple indication is provided between the data processing operations f and g. This can be seen as equivalent to inserting a buffer in the data flow, as the two sections are decoupled by providing a data store between them, so that function f can produce its results, which can then be accessed at a different time by function g.
  • the program is then analysed to see if it can indeed be decoupled at the point indicated by the separation indicators. If it can then the method proceeds to FIG. 2 c . If it cannot then a warning may be output to the programmer, the program may be amended to enable it to be decoupled at this point, or the decouple indication may be removed from the program and the two sections merged.
  • FIG. 2 c shows how the separate sections of the program are decoupled by the insertion of “put” and “get” instructions into the data stream. These result in the data being generated by the f function being put into a data store, from which it is retrieved by the get instruction to be processed by function g.
  • the two sections of the program are in effect decoupled from each other and can be executed on separate execution mechanisms.
  • one of the functions may be suitable for processing by an accelerator, in which case it can be directed to an accelerator, while the other portion is processed by, say, the CPU of the apparatus.
  • the splitting of the program results in the control code of the program being duplicated in both sections, while the data processing code is different in each section.
  • the put and get operations used in FIG. 2 c can be used in programs both for scalar and non-scalar values but they are inefficient for large (non-scalar) values as they require a memory copy.
  • a different embodiment of the invention applies this idea to the channel interface by replacing the simple 'put' operation with two functions: put_begin obtains the address of the next free buffer in the channel and put_end makes this buffer available to readers of the channel:
  • sequences of code such as:
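  • For example, a minimal sketch of the intended rewrite, assuming the put_begin/put_end interface just described (the channel ch, the buffer size and the compute function are placeholders):

        /* with a simple put: data is produced in a local buffer
           and then copied into the channel */
        int buf[256];
        compute(buf);
        put(ch, buf, sizeof(buf));      /* requires a memory copy */

        /* with put_begin/put_end: data is produced directly in the
           channel's buffer, avoiding the copy */
        int *p = put_begin(ch);         /* address of the next free buffer  */
        compute(p);
        put_end(ch, p);                 /* make buffer available to readers */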
  • Cyclic thread dependencies can lead to deadlock; that is, two threads may not run in parallel because of data dependencies between them, and thus in devices of the prior art decoupling is generally limited to acyclic thread dependencies.
  • FIGS. 3 a and 3 b schematically illustrate the program code shown in FIG. 2 .
  • a data store is provided to decouple functions f and g, but one is not provided between g and h.
  • analysis of the program to decouple it is performed automatically, and several potential sections are produced; in this case these are loops having functions f, g and h in them. The automatic analysis then checks that each loop can be executed separately and in this case identifies a missing data path between functions g and h. Thus, these two functions are remerged to provide two sections with a data path between them.
  • a further example of a code fragment that can be split by an embodiment of the present invention is provided below. Since communication lies at the boundaries between threads, the compiler's job is to “color in” the code that lies between the boundaries. This is done through a dependency analysis to decide the set of operations that are on the “producer” side of a channel and the set of operations on the “consumer” side of a channel. The compiler then partitions the operations according to that analysis; and generates a separate thread for each equivalence class.
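  • A plausible reconstruction of such a fragment, consistent with the line numbers referenced in the following paragraphs (the loop bounds, variable names and if-condition are assumptions):

        PIPELINE {                          /* line 1: region to split      */
            for (i = 0; i < N; i++) {       /* line 2                       */
                x = f(i);                   /* line 3                       */
                if (x > 0) {                /* line 4                       */
                    FIFO(2, x);             /* line 5: communication point  */
                    g(x);                   /* line 6                       */
                }
            }
        }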
  • the PIPELINE annotation on line 1 identifies the region of code to be split into threads.
  • the FIFO annotation on line 5 identifies that the communication between threads is to be performed between f and g.
  • the compiler performs a data and control flow analysis to determine that the call to g has a data dependency on the FIFO operation and also has control dependencies on the if statement (line 4) and the for loop (line 2). This results in the following thread:
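  • A plausible sketch of that thread, assuming the fragment reconstructed above (the channel ch is a placeholder); note that the circular use of x, consumed by the if-condition before the fifo_get produces it, illustrates the problem addressed later in this description:

        /* consumer thread: the call to g plus its duplicated control
           (the for loop of line 2 and the if of line 4) */
        for (i = 0; i < N; i++) {
            if (x > 0) {                /* control duplicated from line 4   */
                x = fifo_get(ch);       /* stands in for the FIFO operation */
                g(x);
            }
        }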
  • the data and control flow analysis also determines that the FIFO operation (line 5) has a data dependency on the call to f (line 3) and also has control dependencies on the if statement (line 4) and the for loop (line 2). This results in the following thread:
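  • A plausible sketch of that thread, following the stated dependencies (names as above):

        /* producer thread: the call to f, the duplicated control, and
           the FIFO operation replaced by a fifo_put on the channel */
        for (i = 0; i < N; i++) {
            x = f(i);                   /* line 3            */
            if (x > 0) {                /* line 4, duplicated */
                fifo_put(ch, x);        /* line 5            */
            }
        }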
  • Decoupling must make two essential decisions: “What variables and operations to replicate?” and “What operations to place in the same thread?”.
  • the task of decoupling is to split the region of code into as many threads as possible, without introducing timing-dependent behaviour, using channels to communicate between threads.
  • the generated threads do not strictly partition the statements in the original code: some variables and operations (principally those used for control) are duplicated between threads.
  • the choice of what to duplicate is an essential part of the transformation: if too much code or data is duplicated, the transformed program can run slower and use much more memory than the original program. While these decisions could be controlled using annotations on every variable and statement, some simple default rules can be used that give the programmer control without requiring excessive annotation.
  • variables to be duplicated are determined by the location of their declaration (variables declared inside the PIPELINE annotation may be duplicated) and their size (scalar variables may be duplicated). Operations other than function calls may be duplicated unless they have side-effects or modify a non-duplicable variable.
  • the dependency analysis stage forms a large number of "prethreads" by computing a backward data and control slice (see Mark Weiser, "Program slicing", in ICSE '81: Proc. of International Conference on Software Engineering, 439-449, 1981) from each unduplicated operation, ignoring data dependencies on FIFO operations but including all duplicable and unduplicable operations in the slice. That is, we repeatedly apply rules (1) and (2) to form prethreads. In our running example, there are three prethreads: one each for f( ), FIFO(2,x), and g(x).
  • the prethread for f( ) is:
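  • A plausible sketch, assuming the fragment reconstructed above: the backward slice from the call to f contains just the call and its loop control.

        for (i = 0; i < N; i++) {       /* duplicated control */
            x = f(i);
        }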
  • the merging stage combines “prethreads” by merging threads that contain the same non-duplicable operation or variable. For example, the prethread for f( ) is merged with the prethread for FIFO(2,x) because they both contain the operation f( ), resulting in the prethread:
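  • A plausible sketch of the merged prethread, combining the slices for f( ) and FIFO(2,x):

        for (i = 0; i < N; i++) {
            x = f(i);                   /* the shared non-duplicable operation */
            if (x > 0) {
                FIFO(2, x);
            }
        }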
  • the thread production stage converts prethreads to threads by inserting channel declarations and initialization, privatizing duplicable variables, replacing FIFO operations with fifo_put operations and inserting a fifo_get operation in every thread that contains an operation dependent on a FIFO operation. If multiple threads contain operations dependent on the same FIFO operation, a separate channel has to be introduced for each fifo_get operation introduced and the FIFO operation is replaced with a fifo_put operation on each channel.
  • the problem can be fixed by moving the FIFO operation to before the if-statement or by arranging to pass the if-condition through a channel by changing the if-statement to if (FIFO(2, x>0)) { . . . }.
  • Decoupling can be used with all channel types except non-blocking FIFOs.
  • FIG. 4 shows a further example of how an original piece of code can be split into two threads to be executed in parallel using of put and get instructions.
  • Parallelizing at a coarse granularity allows the duplication of more control code between threads which reduces and simplifies inter-thread communication allowing the generation of distributed schedules. That is, we can distribute the control code across multiple processors both by putting each control thread on a different processor and by putting different parts of a single control thread onto different processors.
  • the transfer of data may be done by writing the data to a particular buffer, such as a FIFO. Alternatively, it may simply be done by providing the other section of the program with information as to where the data has been stored.
  • the way of transferring the data depends on the system the program is executing on.
  • where the architecture does not have shared memory, it is necessary to insert DMA copies from a buffer in one memory to a buffer in a different memory. This can lead to a lot of changes in the code: declaring both buffers, performing the copy, etc.
  • an analysis is performed to determine which buffers need to be replicated in multiple memory regions and to determine exactly which form of copy should be used.
  • DMA copies are also inserted automatically subject to some heuristics when the benefit from having the programmer make the decision themselves is too small.
  • Atomic channels provide atomic access to an element: an atomic_get operation acquires a copy of the element and makes the element unavailable to other threads (i.e., it “locks” the variable) and an atomic_put operation makes the variable available for use by other threads (i.e., it “unlocks” the variable).
  • atomic channels are equivalent to a fifo channel of maximum length 1.
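  • A minimal usage sketch (the element type, channel name and function signatures are assumptions, not taken from the original):

        elem_t *e = atomic_get(ch);     /* "locks": element unavailable to other threads */
        update(e);                      /* exclusive access to the element               */
        atomic_put(ch, e);              /* "unlocks": element available to others again  */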
  • Nonblocking put and get channels are a variant on fifo channels where the nbpfifo_put operation returns an error code if the channel is full instead of blocking as fifo channels do. These channels are for use in interrupt handlers since it is possible to block a thread but not an interrupt. We also provide channels that provide a non-blocking nbgfifo_get operation.
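  • A minimal sketch of use in an interrupt handler, where blocking is not an option (the device read, channel name and error-code convention are assumptions):

        static int overflow_count;

        void irq_handler(void) {
            sample_t s = read_device();
            if (nbpfifo_put(ch, s) != 0) {
                overflow_count++;   /* channel full: drop the sample rather than block */
            }
        }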
  • Timed channels provide time-indexed access to data.
  • a timestamp is specified:
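  • A minimal sketch (the timestamp type and function signatures are assumptions):

        ts_put(ch, t, v);       /* insert v at timestamp t; successive ts_puts
                                   must use strictly increasing times          */
        v = ts_get(ch, t2);     /* return the entry whose timestamp is closest
                                   to t2; successive ts_gets must likewise use
                                   strictly increasing times                   */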
  • the ts_get operation returns the entry with the closest timestamp to that specified. All ts_put operations must use strictly increasing times and all ts_get operations must use strictly increasing times. This restriction allows entries to be discarded when they can no longer be accessed.
  • Timed channels allow for more parallelism between threads since, after the first ts_put is performed, ts_get operations never block because there is always an entry with a closest timestamp.
  • the cost of this increased performance is less precise synchronization between threads than with FIFO channels: applications that use timed channels are unlikely to give deterministic results.
  • Timed channels are useful for implementing time-sensitive information where it is important to use current data.
  • mobile telephones implementing the “3rd generation” W-CDMA protocol use rake receivers to increase the bandwidth of links subject to multipath fading (i.e., where the signal contains delayed echoes of the transmitted signal typically due to reflections off buildings).
  • Rake receivers estimate the strength and delay of these reflections and use these estimates to combine delayed copies of the received signal to maximize the amount of received signal energy.
  • the best estimate to use is the one closest in time to when the data arrived, which may be different from the next estimate generated.
  • Timed channels are an example of a channel type which makes sense in some domains or applications but not in others. Rather than fix the set of channel types in the language, our compiler allows new channel types to be added using annotations to identify channel types and put and get operations.
  • the only properties on which SoC-C relies are that operations are atomic, directional and obey a copy semantics. That is, put operations atomically copy data into a channel and get operations atomically copy data out of a channel.
  • FIG. 5 a shows a simple computer program annotated according to an embodiment of the present invention. An analysis of this program is performed initially and parts of the program are identified by programmer annotation in this embodiment although it could be identified by some other analysis including static analysis, profile driven feedback, etc. The parts identified are as follows:
  • the "replicatable objects", that is, variables and operations which it is acceptable to replicate.
  • a simple rule of thumb is that scalar variables (i.e., not arrays) which are not used outside the scope, scalar operations which only depend on and only modify replicatable variables, and control flow operations should be replicated but more sophisticated policies are possible.
  • the algorithm splits the operations in the scope into a number of threads whose execution will produce the same result as the original program under any scheduling policy that respects the FIFO access ordering of the channels used to communicate between threads.
  • the particular decoupling algorithm used generates a maximal set of threads such that the following properties hold:
  • FIG. 5 b shows the maximal set of threads for the program of FIG. 5 a .
  • One way to generate the set of threads shown in FIG. 5 b is as follows:
  • Each replicatable variable must be initialized at the start of each thread with the value of the original variable before entering the scope and one of the copies of each replicatable variable should be copied back into the master copy on leaving the scope. (Executing all these protothreads is highly unlikely to give the same answer as the original program, because it lacks the necessary synchronization between threads. This is fixed by the next steps.)
  • Another way is to pick an operation and identify all the operations which must be in the same thread as that operation by repeatedly adding operations which would be merged (in step 2 above). Then pick the next operation not yet assigned to a thread and add all operations which must be in the same thread as that operation. Repeat until there are no more non-replicatable operations. It should be noted that this is just one possible way of tackling this problem: basically, we are forming equivalence classes based on a partial order, and there are many other known ways to do this; one is sketched below.
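  • One standard way to form such equivalence classes is union-find; a minimal sketch (the operation count and the must_share_thread predicate are placeholders, not from the original):

        #define MAX_OPS 1024

        extern int must_share_thread(int a, int b);  /* placeholder predicate: e.g. both
                                                        operations touch the same
                                                        non-replicatable variable       */

        int parent[MAX_OPS];    /* union-find forest over operation indices */

        int find(int x) {
            return parent[x] == x ? x : (parent[x] = find(parent[x]));
        }

        void merge(int a, int b) {
            parent[find(a)] = find(b);
        }

        /* after this runs, operations with the same root belong to one thread */
        void form_threads(int n_ops) {
            for (int i = 0; i < n_ops; i++) parent[i] = i;
            for (int i = 0; i < n_ops; i++)
                for (int j = i + 1; j < n_ops; j++)
                    if (must_share_thread(i, j)) merge(i, j);
        }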
  • the above method splits a program into a number of sections which can be executed in parallel. There are many possible mechanisms that can be used to accomplish this task.
  • FIG. 6 shows a flow diagram of a method according to an embodiment of the present invention.
  • An initial step of the method comprises analysing the computer program, which contains sequential code and program separation indicators. For each program separation indicator the program is analysed to determine how it can be divided into separate sections around this separation indicator. First of all, it is checked that it is reasonable to divide it there. If it can be divided at this point, then data communication between the two sections is provided, which may be done in a number of ways including the insertion of put and get instructions as discussed earlier; the program is then analysed to determine if there are further program separation indicators. If there are, the program is analysed at each such point to determine if it can be divided. If it cannot in its present state, then it is checked whether further data transfer instructions are required to divide it.
  • Program separation indicators can take a number of forms. They may simply be some sort of indicator, such as a split marker indicating that the program needs to be divided there, or they may in fact be data transfer functions which by themselves provide data communication between the two sections. Thus, they may be FIFO instructions indicating that data should be sent via a FIFO to transfer it between the two sections, as discussed earlier, or they may be put and get instructions. It may be that several variables need to be transferred between the two sections and that there are not sufficient data transfer instructions within the program. In such a case, further data transfer instructions for the variables which have not been addressed can be inserted into the program; the data communication between the two sections is then provided, and the sections can be separated before separate execution on separate execution mechanisms.
  • the compilation tools can take a program that is either sequential or contains few threads and map it onto the available hardware, introducing parallelism in the process.
  • the task of programming a SoC is to map different parts of an application onto different parts of the hardware.
  • blocks of code must be mapped onto processors, data engines, accelerators, etc. and data must be mapped onto various memories.
  • the mapping process is both tedious and error-prone because the mappings must be consistent with each other and with the capabilities of the hardware. We reduce these problems using program analysis which:
  • the number of legal mappings is usually large but once the programmer has made a few choices, the number of legal options usually drops significantly so it is feasible to ask the programmer to make a few key choices and then have the tool fill in the less obvious choices automatically.
  • Programmable accelerators may have limited program memory so it is desirable to upload new code while old code is running. For correctness, we must guarantee that the new code is uploaded (and I-caches made consistent) before we start running it.
  • Our compiler uses program analysis to check this and/or to schedule uploading of code at appropriate places.
  • for applications with highly variable load, it is desirable to have multiple mappings of an application and to switch dynamically between different mappings.
  • annotations which provide the semantics we want.
  • the primary annotations are on data and on code. If a tag is repeated, it indicates alternative mappings.
  • the tags associated with data include:
  • the tags associated with code include:
  • processor P1 is to execute fft followed by P1.
  • the semantics is similar to that of a synchronous remote procedure call: when control reaches this code, free variables are marshalled and sent to processor P1, processor P1 starts executing the code and the program continues when the code finishes executing.
  • the tags associated with functions are:
  • An error such as mapping a piece of code to a fixed-function accelerator that does not support that function should probably just be reported as an error that the programmer must fix.
  • Errors such as omitting synchronization can sometimes be fixed by automatically inserting synchronization. Errors such as requiring more variables to a memory bank than will fit can be solved, to some extent, using overlay techniques. Errors such as mapping an overly large variable to a memory can be resolved using software managed paging though this may need hardware support or require that the kernel be compiled with software paging turned on (note: software paging is fairly unusual so we have to implement it before we can turn it on!). Errors such as omitting memory barriers, cache flush/invalidate operations or DMA transfers can always be fixed automatically though it can require heuristics to insert them efficiently and, in some cases, it is more appropriate to request that the programmer fix the problem themselves.
  • Our compiler uses information about the SoC architecture, extracted from the architecture description, to determine how to implement the communication requirements specified within the program. This enables it to generate the glue code necessary for communication to occur efficiently and correctly. This can include generation of memory barriers, cache maintenance operations, DMA transfers and synchronisation on different processing elements.
  • This automation reduces programming complexity, increases reliability and flexibility, and provides a useful mechanism for extended debugging options.
  • RPCs: Remote Procedure Calls
  • RPC abstraction can be expressed as functions mapped to particular execution mechanisms:
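  • A minimal sketch of such a mapping, reusing the @ annotation and the processor name P1 that appear elsewhere in this description (the function signature is a placeholder):

        /* declare that fft executes on processing element P1 */
        void fft(complex_t *data, int n) @ P1;

        void process(complex_t *buf, int n) {
            fft(buf, n);    /* behaves as a synchronous RPC: free variables
                               are marshalled and sent to P1, and execution
                               continues here when fft completes            */
        }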
  • This provides a simple mechanism to express invocation of functions, and the associated resourcing, communication and synchronisation requirements.
  • Code can be translated to target the selected processing elements, providing the associated synchronisation and communication. For example, this could include checking the resource is free, configuring it, starting it and copying the results on completion.
  • the compiler can select appropriate glue mechanisms based on the source and target of the function call. For example, an accelerator is likely to be invoked primarily by glue on a processor using a mechanism specific to the accelerator.
  • the glue code may be generated automatically based on a high level description of the accelerator or the programmer may write one or more pieces of glue by hand.
  • the processor on which the operation runs can be determined statically or dynamically. For example, if there are two identical DMA engines, one might indicate that the operation can be mapped onto either engine, depending on which is available first.
  • the compiler optimisations based on the desired RPC interface can range from a dynamically linked interface to inter-procedural specialisation of the particular RPC interface.
  • RPC calls may be synchronous or asynchronous.
  • Asynchronous calls naturally introduce parallelism, while synchronous calls are useful as a simpler function call model, and may be used in conjunction with fork-join parallelism. In fact, parallelism is not necessary for efficiency; a synchronous call alone can get the majority of the gain when targeting accelerators. Manually and automatically selecting between asynchronous and synchronous options can benefit debugging, tracing and optimisation.
  • RPC calls may be re-entrant or non-reentrant, and these decisions can be made implicitly, explicitly or through program analysis to provide benefit such as optimisation where appropriate.
  • This mechanism enables a particular function to have a number of different execution targets within a program, but each of those targets can be associated back to the original function; debugging and trace can exploit this information.
  • This enables a user to set a breakpoint on a particular function, and the debug and trace mechanisms can be arranged such that it can be caught wherever it executes, or on a restricted subset (e.g. a particular processing element).
  • RPC interface implementation can be abstracted away in some debugging views.
  • the datatypes are often bulk datastructures such as arrays of data, multimedia data, signal processing data, network packets, etc. and the operations may be executed with some degree of parallelism on a coprocessor, DSP processor, accelerator, etc. It is therefore possible to view programs as a series of often quite coarse-grained operations applied to quite large data structures instead of the conventional view of a program as a sequence of ‘scalar’ operations (like ‘32 bit add’) applied to ‘scalar’ values like 32-bit integers or the small sets of values found in SIMD within a register (SWAR) processing such as that found in NEON. It is also advantageous to do so because this coarse-grained view can be a good match for accelerators found in modern SoCs.
  • SWAR: SIMD within a register
  • optimization techniques known to work on fine-grained operations and data can be adapted to operate on coarse-grained operations and data.
  • Our compiler understands the semantics associated with the data structures and their use within the system, and can manipulate them and the program to perform transformations and optimisations to enable and optimise execution of the program.
  • this might be generated automatically from a precise description of the operation (including the implementation of the operation) or it might be generated from an approximate description of the main effects of the operation or it might be provided as a direct annotation.
  • compilers do something similar for scalar variables: the value of a scalar variable ‘x’ might sometimes live on the stack, sometimes in register 3, sometimes in register 6, etc. and the compiler keeps track of which copies currently contain the live value.
  • the compiler can provide improved memory allocation through memory reuse because it can identify opportunities to place two different variables in the same memory location. Indeed, one can use many algorithms normally used for register allocation (where the registers contain scalar values) to perform allocation of data structures. One modification required is that one must handle the varying size of buffers whereas, typically, all scalar registers are the same size.
  • Compiler books list many other standard transformations that can be performed to scalar code. Some of the mapping and optimisation techniques that can be applied at the coarse-grain we discuss include value splitting, spilling, coalescing, dead variable removal, recomputation, loop hoisting and CSE.
  • Data structures will be passed as arguments, possibly as part of an ABI. Optimisations such as specialisation and not conforming to the ABI when it is not exposed can be applied.
  • Our compiler supports a variety of code generation strategies which allow the parallelized control code to run on a control processor in a real-time operating system, in interrupt handlers or in a polling loop (using 'wait for event' if available to reduce power). It also supports distributed scheduling, where some control code runs on one or more control processors, some control code runs on programmable accelerators, and some simple parts of the code are implemented using conventional task-chaining hardware mechanisms. It is also possible to design special 'scheduler devices' which could execute some parts of the control code. The advantage of not running all the control code on the control processor is that it can greatly decrease the load on the control processor.
  • the basic decoupling algorithm splits a block of code into a number of threads that pass data between each other via FIFO channels.
  • the algorithm requires us to identify (by programmer annotation or by some other analysis including static analysis, profile driven feedback, etc.) the following parts of the program:
  • the algorithm splits the operations in the scope into a number of threads whose execution will produce the same result as the original program under any scheduling policy that respects the FIFO access ordering of the channels used to communicate between threads.
  • the put and get operations used when decoupling can be used both for scalar and non-scalar values (i.e., both for individual values (scalars) and arrays of values (non-scalars)), but they are inefficient for large (non-scalar) values because they require a memory copy. Therefore, for coarse-grained decoupling, it is desirable to use an optimized mechanism to pass data between threads.
  • the get operation is similarly split into a get_begin and get_end pair
  • the modified decoupling algorithm treats the puts and gets in much the same way that the standard algorithm treats data boundaries. Specifically, it constructs the maximal set of threads such that:
  • the modified decoupling algorithm will produce:
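  • A plausible sketch of the resulting threads (channel and function names are placeholders, and the begin/end signatures follow the sketch given earlier):

        /* producer thread: writes directly into the channel's buffer */
        for (i = 0; i < n; i++) {
            int *p = put_begin(ch);
            produce(p, i);
            put_end(ch, p);
        }

        /* consumer thread: reads directly from the channel's buffer */
        for (i = 0; i < n; i++) {
            int *q = get_begin(ch);
            consume(q);
            get_end(ch, q);
        }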
  • Writing code using explicit puts can also be performed as a preprocessing step. For example, we could transform:
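  • For instance, a plausible sketch of one such preprocessing rewrite (all names are placeholders):

        /* before: the value flows implicitly through the variable x */
        x = f(i);
        g(x);

        /* after: the flow is made explicit with a put/get pair,
           ready for the decoupling algorithm to split */
        put(ch, f(i));
        g(get(ch));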
  • a First-In First-Out (FIFO) channel preserves the order of values that pass through it: the first value inserted is the first value extracted, the second value inserted is the second value extracted, etc.
  • Other kinds of channel are possible including:
  • Exclusive access can be arranged in several ways. For example, one may 'acquire' (aka 'lock') a 'lock' (aka 'mutex') before starting to access the resource and 'release' (aka 'unlock') the lock after using the resource. Exclusive access may also be arranged by disabling pre-emption (such as interrupts) while in a critical section (i.e., a section in which exclusive access is required). In some circumstances, one might also use a 'lock-free' mechanism where multiple users may use a resource, but at some point during use (in particular, at the end) they will detect the conflict, clean up and retry.
  • Some examples of wanting exclusive access include having exclusive access to a hardware accelerator, exclusive access to a block of memory or exclusive access to an input/output device. Note that in these cases, it is usually not necessary to preserve the order of accesses to the resource.
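  • A minimal sketch of the lock-based arrangement (the lock API and job names are placeholders):

        acquire(&accel_lock);       /* enter critical section              */
        run_accelerator(job);       /* exclusive access to the accelerator */
        release(&accel_lock);       /* leave critical section              */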
  • the basic decoupling algorithm avoids introducing race conditions by preserving all ordering dependencies on statements that access non-replicated resources. Where locks have been inserted into the program, the basic decoupling algorithm is modified as follows:
  • Decoupling can be applied to any sequential section of a parallel program. If the section communicates with the parallel program, we must determine any ordering dependencies that apply to operations within the section (a safe default is that the order of such operations should be preserved). In other words, one of the nice properties of decoupling is that it interacts well with other forms of parallelization, including manual parallelization.
  • the decoupling algorithm generates sections of code that are suitable for execution on separate processors but can be executed on a variety of different execution engines by modifying the “back end” of the compiler. That is, by applying a further transformation to the code after decoupling to better match the hardware or the context we wish it to run in.
  • the most straightforward execution model is to execute each separate section in the decoupled program on a separate processor or, on a processor that supports multiple hardware contexts (i.e., threads), to execute each separate section on a separate thread.
  • SoC: System on Chip
  • DSPs: digital signal processors
  • GPUs: graphics processing units
  • DMA: direct memory access
  • data engines: programmable accelerators or fixed-function accelerators
  • This data processing can be modelled as a synchronous remote procedure call.
  • a memory copy operation on a DMA engine can be modelled as a function call to perform a memory copy.
  • the thread will typically:
  • This mode of execution can be especially effective because one 'control processor' can keep a number of accelerators busy, with the control processor possibly doing little more than deciding which accelerator to start next and on what data. This mode of execution can be usefully combined with all of the following forms of execution.
  • a thread library, such as an operating system (OS) or real-time operating system (RTOS) running on one or more processors, can be used to execute the threads introduced by decoupling.
  • OS: operating system
  • RTOS: real-time operating system
  • transformations can be viewed as a way of transforming a thread into a state machine with each context switch point representing a state and the code that continues execution from each context switch point viewed as a transition function to determine the next state.
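  • A minimal sketch of this transformation (the accelerator API is a placeholder):

        /* a thread with one context-switch point, rewritten as a state
           machine: each context-switch point becomes a state, and the code
           that resumes from it becomes the transition function */
        enum state { START, WAIT_ACCEL, DONE };

        enum state step(enum state s) {
            switch (s) {
            case START:
                start_accelerator();    /* was: start work, then block  */
                return WAIT_ACCEL;      /* context-switch point = state */
            case WAIT_ACCEL:
                consume_results();      /* continuation after the event */
                return DONE;
            default:
                return DONE;
            }
        }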
  • Execution of transformed threads can be viewed as having been transformed to an event-based model where all execution occurs in response to external events such as responses from input/output devices or from accelerators. It is not necessary to transform all threads: event-based execution can coexist with threaded execution.
  • Transforming threads as described above is also a good match for polling-based execution where the control processor tests for completion of tasks on a set of accelerators by reading a status register associated with each accelerator. This is essentially the same as interrupt-driven execution except that the state of the accelerators is updated by polling and the polling loop executes until all threads complete execution.
  • Distributed scheduling can be done in various ways. Some part of a program may be simple enough that it can be implemented using a simple state machine which schedules one invocation of an accelerator after completion of another accelerator. Or, a control processor can hand over execution of a section within a thread to another processor. In both cases, this can be viewed as an RPC-like mechanism ("{ foo( ); bar( )@P0; } @P1"). In the first case, one way to implement it is to first transform the thread to event-based form and then opportunistically spot that a sequence of system states can be mapped onto a simple state machine, and/or perform transformations to make it map better.
  • a system has to meet a set of deadlines and the threads within the system share resources such as processors, it is common to use a priority mechanism to select which thread to run next.
  • priorities might be static or they may depend on dynamic properties such as the time until the next deadline or how full/empty input and output queues are.
  • a long-standing problem of parallelizing compilers is that it is hard to relate the view of execution seen by debug mechanisms to the view of execution the programmer expects from the original sequential program.
  • Our tools can take an execution trace obtained from running a program on parallel hardware and reorder it to obtain a sequential trace that matches the original program. This is especially applicable to but not limited to the coarse-grained nature of our parallelization method.
  • partial reconstruction can be achieved by using points in the program that synchronize with each other to guide the matching process.
  • the resulting trace will not be sequential but will be easier to understand.
  • a useful application is to make it simpler to understand a trace of a program written using an event-based programming style (e.g., a GUI, interrupt handlers, device drivers, etc.)
  • Partial reconstruction could also be used to simplify parallel programs running on systems that use release consistency. Such programs must use explicit memory barriers at all synchronization points so it will be possible to simplify traces to reduce the degree of parallelism the programmer must consider.
  • HP has been looking at using trace to enable performance debugging of distributed protocols. Their focus is on data mining and performance not reconstructing a sequential trace.
  • each section is a node in a directed graph and there is an edge from a node M to a node N if the section corresponding to M writes to an address x and the section corresponding to N reads from address x and, in the original trace, no write to x happens between M's write to x and N's read from x.
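  • A minimal sketch of building this graph from a trace of reads and writes (the trace record layout and table sizes are assumptions):

        #define MAX_ADDR     4096
        #define MAX_SECTIONS 128

        int last_writer[MAX_ADDR];              /* -1 if no write to x seen yet  */
        int edge[MAX_SECTIONS][MAX_SECTIONS];   /* edge[m][n]: N read data
                                                   written by M                  */

        void init(void) {
            for (int a = 0; a < MAX_ADDR; a++) last_writer[a] = -1;
        }

        void record(int section, char op, int addr) {
            if (op == 'W') {
                last_writer[addr] = section;            /* M writes x       */
            } else if (op == 'R' && last_writer[addr] >= 0) {
                edge[last_writer[addr]][section] = 1;   /* add edge M -> N  */
            }
        }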
  • This directed dataflow graph shows how different sections communicate with each other and can be used for a variety of purposes:
  • The first part below describes what is needed in the general case of a program that has been parallelized, where one would like to serialize a trace from a run of the parallel program based on some understanding of what transformations were done during parallelization (i.e., knowing how different parts of the program relate to the original program).
  • The second part describes how to do this specifically when the parallelization process included decoupling.
  • The sketch describes the simplest case in which it can work, but it is possible to relax these restrictions significantly.
  • Conditions 10 onwards relate mainly to what decoupling aims to achieve, but some earlier conditions, such as conditions 5 and 6, are also relevant because, in practice, it is useful to be able to relax them slightly.
  • Condition (5) says that kernels have exclusive access to buffers, but it is obviously acceptable to have multiple readers of the same buffer. It would also be acceptable (in most real programs) for two kernels to (atomically) invoke ‘malloc’ and ‘free’ in the middle of the kernels, even though the particular heap areas returned will depend on the precise interleaving of those calls, and it may even be acceptable for the debugging printfs from each kernel to be interleaved in an arbitrary order.
  • Consequences of (1)-(4): we can identify each transaction with a kernel instance and we can see all transactions a kernel performs.
  • Consequences of (1)-(6): given a trace consisting of the interleaved transactions of a set of kernel instances, we can reorder the transactions such that all transactions of a kernel are contiguous and the resulting trace satisfies all read-after-write data dependencies. That is, we can construct a sequentially consistent view of the transactions as though the kernels executed atomically and sequentially.
  • Consequences of (7)-(9): given a trace of the state transitions and synchronizations, we can reorder them into any of the set of legal transitions those state machines could have made, where a transition is legal if it respects synchronization dependencies.
  • Consequences of (1)-(9): given a trace of all kernel transactions and all state transitions and synchronizations, we can reorder them into any legal trace which respects the same synchronization dependencies and data dependencies.
  • Consequences of (1)-(10): we can reorder any trace to match a sequential version of the same program.
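  • Building on the dataflow-graph sketch above, the reordering implied by consequences (1)-(6) can be realised as a simple topological sort of the kernels; this is a sketch only, and the dependency graph is assumed acyclic:
    #define MAX_KERNELS 64

    /* edge[m][n] != 0 means kernel m produced data that kernel n consumed.
       order_out receives a kernel order in which each kernel's transactions
       can be emitted contiguously while respecting read-after-write
       dependencies. */
    void sequentialize(const int edge[MAX_KERNELS][MAX_KERNELS],
                       int n_kernels, int order_out[])
    {
        int emitted[MAX_KERNELS] = {0};
        int count = 0;

        while (count < n_kernels) {
            for (int n = 0; n < n_kernels; n++) {
                if (emitted[n])
                    continue;
                int ready = 1;
                for (int m = 0; m < n_kernels; m++)
                    if (edge[m][n] && !emitted[m])
                        ready = 0;          /* a producer not yet emitted */
                if (ready) {
                    order_out[count++] = n; /* kernel n may run next */
                    emitted[n] = 1;
                }
            }
        }
    }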
  • To show that decoupling gives us property (10) (i.e., that any trace of the decoupled program can be reordered to give a trace of the original program, and to show how to do that reordering), we need to establish a relationship between the parallel state machine and the master state machine (i.e., the original program). This relationship is an “embedding” (i.e., a mapping between states in the parallel and the master machines such that the transitions map to each other in the obvious way). It is probably easiest to prove this by considering what happens when we decouple a single state machine (i.e., a program) into two parallel state machines.
  • Extensions of decoupling allow the programmer to indicate that two operations can be executed in either order even though there is a data dependency between them (e.g., both increment a variable atomically). This mostly requires relaxing the definition of what trace reconstruction is meant to do. One major requirement is that the choice of order does not have any knock-on effects on control flow.
  • A sufficient (and almost necessary) condition is that a put and a get on the same channel must not be inside corresponding critical sections (in different threads).
  • A useful and safe special case is that all initialization code performs N puts, a loop then contains only put/get pairs, and finalization code performs at most N gets. It should be possible to prove that this special case is sound.
  • A task may have a deadline, or it may require that it receive 2 seconds of CPU in every 10-second interval, but tasks rarely require that they receive a particular pattern of scheduling.
  • The idea is to use the flexibility that the system provides to explore different sequences from those that a traditional scheduler would provide.
  • Schedulers in common use are ‘work-conserving’ schedulers: if the resources needed to run a task are available and the task is due to execute, the task is started. In contrast, a non-work-conserving scheduler might choose to leave a resource idle for a short time even though it could be used. Non-work-conserving schedulers are normally used to improve efficiency where there is a possibility that a better choice of task will become available if the scheduler delays for a short time.
  • A non-work-conserving scheduler is useful for testing concurrent systems because it provides more flexibility over the precise timing of different tasks than a work-conserving scheduler does.
  • The modification of the schedule is preferably done within the constraints of the real-time requirements of the tasks. For example, when a task becomes runnable, one might establish how much ‘slack’ there is in the schedule and then choose to delay the task by at most that amount. In particular, when exploring different phases, if the second event does not happen within that period of slack, then the first event must be sent to the system, and we can hope to explore that phase the next time the event triggers. A sketch of such a slack-bounded delay follows.
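  • A minimal sketch of the slack-bounded delay, with the event-queue and dispatch hooks left as assumptions standing in for a real scheduler:
    #include <stdint.h>
    #include <time.h>

    static uint64_t now_us(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
    }

    /* Assumed hooks into the system under test. */
    extern int  another_event_pending(void);
    extern void dispatch(int task_id);

    /* Hold a newly runnable task back for up to its slack, so that if a
       second event arrives during that window a different interleaving
       (phase) of the two events can be exercised. */
    void delay_within_slack(int task_id, uint64_t slack_us)
    {
        uint64_t start = now_us();

        /* Non-work-conserving: deliberately leave the task waiting even
           though it could run, but never beyond its slack. */
        while (now_us() - start < slack_us && !another_event_pending())
            ;
        dispatch(task_id); /* run now so real-time requirements still hold */
    }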
  • The scheduler can choose to execute the thread that would run a given piece of code. (Again, it may be necessary to insert instrumentation into the code to help the scheduler determine the status of each thread so that it can execute the threads in the correct order.)

Abstract

A method is disclosed for transforming a portion of a computer program comprising a list of sequential instructions comprising control code and data processing code and a program separation indicator indicating a point where said sequential instructions may be divided to form separate sections that are capable of being separately executed and that each comprise different data processing code. The method comprises the steps of: (i) analysing said portion of said program to determine if said sequential instructions can be divided at said point indicated by said program separation indicator and in response to determining that it can: (iia) providing data communication between said separate sections indicated by said program separation indicator, such that said separate sections can be decoupled from each other, such that at least one of said sections is capable of being separately executed by an execution mechanism that is separate from an execution mechanism executing another of said separate sections, said at least one of said sections being capable of generating data and communicating said data to at least one other of said separate sections; and in response to determining it cannot: (iib) not performing step (iia). If step (iia) is not performed then a warning may be output, or the program may be amended so it can be separated at that point, or the program separation indicator may be removed and the sections that were to be separated merged.

Description

  • This application claims the benefit of U.S. Provisional Application No. 60/853,756, filed Oct. 24, 2006, the entire content of which is hereby incorporated by reference in this application.
  • BACKGROUND OF THE INVENTION
  • 1. Field of the Invention
  • The field of the invention relates to data processing and in particular to improving the performance of program execution.
  • 2. Description of the Prior Art
  • It is becoming increasingly difficult to provide programs that are simple to write and yet perform efficiently on complex systems. Such complex systems may comprise a number of different processing or execution units, and they may be heterogeneous or asymmetric, with specialised processing units being used to increase energy efficiency and lower gate count. In particular, the programming of embedded systems, with their hardware restrictions, demand for efficiency and ever-decreasing time to market, is becoming a real problem.
  • It is known in high-level computing systems to provide multiple identical processors and to split processing between them to increase efficiency. In such systems a loop that executes, say, 100 times may be split so that a first processor processes the loop for the first fifty iterations and a second processor for the second fifty.
  • It is also known to address the problem of programming multiprocessing systems by sending individual instructions to separate execution units, where the processors communicate with each other via FIFO queues. This is effective in allowing a sequential program to be processed using many different execution units on a complex system; however, a low-latency, high-throughput data communication mechanism is needed for this to function efficiently. Clearly such mechanisms have an overhead in area and power.
  • Decoupling programs to produce a number of threads communicating via FIFO pipelines has been used many times before: Smith (James E. Smith, “Decoupled access/execute computer architectures”, ACM Transactions Computing Systems, 2(4), 289-308, 1984) applies the technique manually to Cray assembly code; Palacharla and Smith (S. Palacharla and J. E. Smith, “Decoupling integer execution in superscalar processors”, in MICRO 28: Proc. of International Symposium on Microarchitecture, 285-290, 1995) describe the use of program slicing to automate the separation. These uses of decoupling were targeted at hiding memory latency by having one thread perform all load-store operations while the other thread performs all arithmetic operations. Decoupling has experienced a revival in some very interesting recent work: Ottoni et al. (G. Ottoni, R. Rangan, A. Stoler and D. August, “Automatic Thread Extraction with Decoupled Software Pipelining”, in MICRO '05: Proc. Int. Symposium on Microarchitecture, 2005), use decoupling to parallelize inner loops for multiprocessors, and Dai et al. (J. Dai, B. Huang, L. Li and L. Harrison, “Automatically partitioning packet processing applications for pipelined architectures”, in Proc. 2005 Conf. on Programming language design and implementation, 237-248, ACM Press, 2005), use decoupling to parallelize packet processing code for multithreaded, multiprocessor packet processing processors. In both papers, the boundary between threads is automatically chosen to optimize performance through load balancing and reducing communication cost.
  • The majority of prior work on decoupling uses what we call “fine-grained decoupling”: the operations being split between threads are individual machine instructions and the data passed between threads consists of scalar values. In contrast, Ziegler et al. (H. Ziegler, B. So, M. Hall and P. Diniz, “Coarse-Grain Pipelining on Multiple {FPGA} Architectures”, in FCCM '02: Proc. of Symposium on Field-Programmable Custom Computing Machines, 2002) and Du et al. (W. Du, R. Ferreira and G. Agrawal, “Compiler Support for Exploiting Coarse-Grained Pipelined Parallelism”, in Proceedings of the ACM/IEEE SC2003 Conference on High Performance Networking and Computing, 2003) perform what we call “coarse-grained decoupling”: the operations being split between threads are procedure calls and the data passed between threads consists of arrays. An example of the hardware cost of fine-grained decoupling is Ottoni et al.'s work, where a special queue is needed to support access by a speculative, out-of-order processor.
  • The architectures of system on chip (SoC) platforms found in high end consumer devices are getting more and more complex as designers strive to deliver compute intensive applications on ever shrinking power budgets. Workloads running on these platforms require the exploitation of large amounts of heterogeneous parallelism and increasingly irregular memory hierarchies. The conventional approach to programming such hardware is very low level. The consequence of this is that when the core functionality of an application is mapped to the platform, its logic becomes obscured by the transformations and the glue code used during the mapping process. This yields software which is intimately and inseparably tied to the details of the platform it was originally designed for, limiting the software's portability and ultimately the architectural choices available to future generations of platform architects.
  • It would be desirable to provide a sequential program which could be automatically or at least semi-automatically transformed such that sections of it could be processed independently by different execution mechanisms, thereby improving performance. Many of the problems experienced in mapping applications onto SoC platforms come not from deciding how to map a program onto the hardware but from the fragmentation of the application that results from implementing these decisions. This is particularly so when the system itself decides how to parallelise the program as this becomes extremely complex and means that future analysis of the program by a programmer is very difficult. In particular, although sequential programs can be fairly readily understood by a human being, once they are parallelised it becomes increasingly difficult for the human mind to understand them. Thus, it would be desirable if some control could be provided as to where to divide such programs in order to generate a system that is more readily understandable by a human and yet can execute efficiently on a parallel system.
  • SUMMARY OF THE INVENTION
  • A first aspect of the present invention provides a method of transforming a portion of a computer program comprising a list of sequential instructions comprising control code and data processing code and a program separation indicator indicating a point where said sequential instructions may be divided to form separate sections that are capable of being separately executed and that each comprise different data processing code, said method comprising the steps of: (i) analysing said portion of said program to determine if said sequential instructions can be divided at said point indicated by said program separation indicator and in response to determining that it can: (iia) providing data communication between said separate sections indicated by said program separation indicator, such that said separate sections can be decoupled from each other, such that at least one of said sections is capable of being separately executed by an execution mechanism that is separate from an execution mechanism executing another of said separate sections, said at least one of said sections being capable of generating data and communicating said data to at least one other of said separate sections; and in response to determining it can not: (iib) not performing step (iia).
  • As noted above, it would be desirable to be able to split a program having a list of instructions to be processed sequentially into sections that can be processed independently. The list of sequential instructions that is to be split comprises a single thread and does not provide parallel/concurrent operations. There may, however, be branches or loops within this portion of the program. Thus, if the program is to be run on a system that can handle parallel execution of instructions, either by virtue of it having several execution mechanisms or a single processing device that can handle parallel execution, these sections could be split up to be executed separately.
  • The present method provides a tool for analysing a portion of the program to determine if the instructions can be divided at a point indicated by a separation indicator. These separation indicators are provided within at least a section of the program and indicate where it is desirable to divide the program. Thus, the division of the program is determined to some degree by these separation indicators and can thus be controlled by a programmer. The method of the present invention performs an analysis of a program that actually includes the separation indicators and decides if the program can indeed be separated at these points. If it is decided that it can be, data communication is provided between the two sections to allow them to be decoupled from each other. Thus, where appropriate, the program can be split into sections suitable for separate execution, allowing it to be efficiently processed by a variety of different, often complex devices. If it is decided that the program cannot be divided at this point, then the data communication step is not performed.
  • In some embodiments, if it decides it cannot be separated at a point indicated by a separation indicator then a warning indicating an error in the computer program is output. Providing the programmer with a warning may be the most appropriate thing to do if the separation indicators are not in the correct position.
  • In some embodiments, said step (iib) comprises amending said computer program such that said sequential instructions can be divided at said point and then performing step (iia).
  • Alternatively to outputting a warning, the method can amend the computer program so that the sequential instructions can be divided at this point, and then the data communication can be provided between the different sections. It may be a relatively simple matter to amend the computer program so that it can be divided at the point indicated, and if this is the case then the method can perform this step rather than outputting a warning.
  • In some embodiments, said step of amending said computer program comprises inserting data transfer instructions at said point indicated by said program separation indicator.
  • The step required to amend the computer program may be one of inserting data transfer instructions at the point indicated by the program separation indicator.
  • In some embodiments, said step (iib) comprises merging said two sections together and removing said program separation indicator.
  • One way of dealing with discovering that two sections are not suitable for being executed on separate execution mechanisms is to simply remove the program separation indicator and merge the two sections together.
  • In some embodiments, said program separation indicator comprises at least one data transfer instruction, said data communication between said separate sections being provided in dependence upon said at least one data transfer instruction.
  • Although the program separation indicators can take a number of forms it is quite efficient if they take the form of data transfer instructions.
  • In order for sections to be able to execute separately there needs to be some sort of data communication and synchronisation in place between the two sections. Thus, providing program separation indicators in the form of data transfer instructions may facilitate their separation by providing the data communication required.
  • In some embodiments, said step (iia) of providing data communication comprises inserting at least one “put data into a data store” instruction and at least one “get data from said data store” instruction into said instruction stream, and dividing said computer program between said put and get instructions to form said at least one separate section.
  • The step of providing data communication can be done by inserting a put data into data store and a get data from said data store instructions into the instruction stream. This allows data to be removed from one section of the program and then input into the other section via a data store. Thus, the two sections are in effect decoupled from each other but data can travel between the two via this data store.
  • In some embodiments, said data store comprises a FIFO buffer.
  • In some embodiments the data store may comprise a FIFO buffer, as this is clearly the simplest arrangement: the first data to exit one section of the program is the first data to enter the next section. However, the data may not be required in a particular order, or indeed not all the data generated by one section may be required by the other. Thus, a variety of different data stores and arrangements can be used in some embodiments. For example, a stack, which has last-in-first-out semantics, could be used; one advantage of this is that a stack is simple to implement.
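  • For concreteness, a minimal single-producer/single-consumer FIFO channel of the kind a put/get pair could target is sketched below. Fixed-size elements and busy-waiting are simplifications of the sketch; a real implementation would use proper synchronisation primitives and memory barriers:
    #include <stddef.h>
    #include <string.h>

    #define CAPACITY  8
    #define ELEM_SIZE 64

    typedef struct {
        char buf[CAPACITY][ELEM_SIZE];
        volatile size_t head, tail;  /* free-running read/write counters */
    } channel;

    void put(channel *ch, const void *data)
    {
        while (ch->tail - ch->head == CAPACITY)
            ;                                    /* block while full */
        memcpy(ch->buf[ch->tail % CAPACITY], data, ELEM_SIZE);
        ch->tail++;                              /* publish the element */
    }

    void get(channel *ch, void *data)
    {
        while (ch->tail == ch->head)
            ;                                    /* block while empty */
        memcpy(data, ch->buf[ch->head % CAPACITY], ELEM_SIZE);
        ch->head++;                              /* free the slot */
    }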
  • In some embodiments, said step (iia) comprises providing cyclic data communication between said separate sections.
  • The decoupling of threads can be further extended to use where communication between threads is cyclic. Cyclic thread dependencies can lead to deadlock; that is, two threads may not run in parallel because of data dependencies between them, and thus in devices of the prior art decoupling is limited to acyclic thread dependencies. Embodiments of the present invention address this problem and support cyclic dependencies. This may be done, for example, by using put and get instructions and not requiring the number of puts to be equal to the number of gets. This is in contrast to the prior art, where put and get operations are always inserted in corresponding places in each thread. Allowing put operations to be inserted in places that do not correspond to get operations in other threads means that code such as is illustrated in FIG. 4 can be produced.
  • In some embodiments, said separate sections comprise the same control code.
  • In many embodiments, the control code is the same in the two sections as the computer program is divided such that different data processing steps are performed under the same control in each divided section. Duplicating control code in this way enables the program to be divided.
  • However, in some embodiments the control code will be different. This is because it may be advantageous occasionally to slightly modify the control code in one of the sections such that, for example, a conditional code that is no longer required is not present.
  • In some embodiments, said portion of said computer program comprises a plurality of program separation indicators each indicating a point where said sequential instructions may be divided to form separate sections, each of said separate sections being capable of being separately executed and comprising different data processing code, said method providing data communication between said separate sections indicated by said plurality of program separation indicators.
  • An instruction loop having several data processing steps for example can be divided by embodiments of the present invention into two sections by allowing the different sections to have different data processing codes. This can increase the performance of a system significantly. Generally this is done by duplicating the control code and in effect performing two loops, one performing one or more of the data processing steps of the original loop and the other performing the rest of the steps.
  • In some embodiments, said transformed computer program is suitable for execution upon respective execution mechanisms of a heterogeneous system having a complex asymmetric memory hierarchy and a plurality of execution mechanisms.
  • Embodiments of the present invention, although applicable to symmetric systems are particularly valuable in asymmetric heterogeneous systems wherein it is often difficult to separately execute sections of a program, particularly where at least a portion of the program is written sequentially.
  • Although, in some embodiments, a section of code is executed by a single execution mechanism, in other embodiments said control code of at least one of said sections is operable to be processed by a processor of said heterogeneous system and said data processing code of said section is operable to be processed by an execution mechanism under control of said control code processed by said processor.
  • An execution mechanism may be a simple mechanism designed for a particular function, such as a memory transfer unit (colloquially known as a “DMA engine”) and in such cases it may be that the control code is performed on a separate processor, while the data processing operations are performed on the simpler mechanism.
  • It should be noted that the plurality of execution mechanisms can take a number of forms, including a general purpose processor; a direct memory access unit; a coprocessor; a VLIW processor; a digital signal processor; and a hardware accelerator unit.
  • In some embodiments, said method step comprises an initial step performed before step (i) of: defining said portion of said computer program by marking said computer program with indications delimiting said portion of said sequential instructions within which said at least two sections are to be located.
  • It is often easier to identify the potential to execute short sections of a program independently than to execute long sections independently. Thus, it may be appropriate to mark the portion of the program to be analysed. This marking may be done by a programmer or it may be done automatically.
  • In some embodiments, said computer program comprises said portion having a number of instructions to be executed sequentially and at least one further portion having instructions to be performed in parallel with each other.
  • A computer program may have different portions, some for execution sequentially and some already written for parallel processing. In such a case, it is the portion that has the instructions for sequential execution that is analysed to see if it can be divided into sections for separate execution. It should be noted that a portion to be analysed may be within a section that is to be executed in parallel. Furthermore, a portion to be analysed may also contain two or more sections that are to be executed in parallel.
  • In some embodiments, said portion of said computer program comprises an instruction loop comprising at least two data processing instructions, and said at least two sections each comprise said instruction loop each section comprising at least one of said at least two data processing instructions, said at least two sections comprising different data processing instructions.
  • An instruction loop having several data processing steps, can be divided into two sections, and thereby increase the performance of a system significantly. The present method is able to duplicate the control code and perform in effect two loops, one performing one or more of the data processing steps of the initial loop and the other performing the rest of the steps.
  • In some embodiments said portion of said computer program comprises a whole computer program.
  • A second aspect of the present invention provides a computer-readable storage medium comprising a computer program for controlling a computer to perform the method of the first aspect of the present invention.
  • A third aspect of the invention provides a computer executing a computer program to perform the method of a first aspect of the present invention.
  • A further aspect of the present invention provides a method of transforming a portion of a computer program comprising a list of sequential instructions and a program separation indicator indicating a point where said sequential instructions may be divided to form separate sections that are capable of being separately executed and that each comprise different data processing code, said list of instructions comprising control code and data processing code, said method comprising the step of:
  • (i) in response to said program separation indicator, providing data communication between said separate sections such that they can be decoupled from each other, such that at least one of said sections is capable of being separately executed by an execution mechanism that is separate from an execution mechanism executing another of said separate sections and said at least one of said sections can communicate data that it generates to at least one other of said separate sections via said provided data communication.
  • It is desirable to be able to split a program having a list of instructions to be processed sequentially into sections that can be processed independently. If the program comprises program separation indicators indicating points where it may be divided, then the program can be transformed by providing data communication between the separate sections at the points indicated so that they can be decoupled from each other. This allows the program to be split into sections suitable for separate execution and allows the program to be efficiently processed by a variety of different, often complex devices. This enables future analysis of the program by a programmer to be relatively straightforward while still enabling it to execute efficiently on a parallel system.
  • In some embodiments, said method comprises a further initial step (0) performed before step (i) of in response to said program separation indicator, analysing said portion of said computer program and determining which of said sequential instructions should be in which of said separate sections prior to providing said data communication.
  • An additional step of analysing the program to ensure that it can indeed be separated at these points can be helpful and allows programs that have program separation indicators in them that are potentially in an incorrect place to still be transformed.
  • The above, and other objects, features and advantages of this invention will be apparent from the following detailed description of illustrative embodiments which is to be read in connection with the accompanying drawings.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIGS. 1 a to 1 c show flow diagrams of methods according to embodiments of the present invention;
  • FIGS. 2 a to 2 d schematically show the splitting into separately executable sections of a computer program according to an embodiment of the present invention;
  • FIGS. 3 a and 3 b schematically show a method of splitting and then merging sections of a computer program;
  • FIG. 4 schematically shows data communication between two sections of a program;
  • FIG. 5 a shows a simple computer program annotated according to an embodiment of the present invention;
  • FIG. 5 b shows the maximal set of threads for the program of FIG. 5 a;
  • FIG. 6 schematically illustrates an asymmetric multiprocessing apparatus with an asymmetric memory hierarchy;
  • FIG. 7 illustrates an architectural description;
  • FIG. 8 illustrates a communication requirement; and
  • FIG. 9 illustrates communication support.
  • DESCRIPTION OF THE PREFERRED EMBODIMENTS
  • FIG. 1 a shows a flow diagram illustrating a method according to an embodiment of the present invention. A first step is performed in which a portion of a computer program comprising a list of sequential instructions and a program separation indicator indicating a point where the sequential instructions may be divided to form separate sections that are capable of being separately executed is analysed. The analysis determines if the sequential instructions can be split at the point indicated by the separation indicator into separate sections that can be processed on different execution mechanisms. If it determines it can the sequential instructions are divided into the separate sections at the point indicated by the program separation indicator. If it determines they cannot be separated at this point then a warning is output to the programmer to indicate an error in the program.
  • FIG. 1 b illustrates an alternative embodiment in which rather than outputting a warning if the program cannot be decoupled and separated at the indicated point, the program is amended by inserting data communication instructions into the list of sequential instructions, these data communication instructions enabling the different sections to be decoupled and thus, separated. The separation can then be performed.
  • FIG. 1 c provides an alternative embodiment in which in response to determining that the program cannot be separated at the indicated point the two sections are merged together and the program separation indicator removed.
  • As can be seen, the three embodiments provide different solutions when analysis determines that it is not possible to separate a program into sections at the point indicated. Different ones of these embodiments may be used in the same analysis of a program for different separation indicators, depending on circumstances. Thus, the preferred course of action may be to amend the program to make it separable at the indicated point; if this cannot be done, the two portions may be merged, or, if this would result in an unacceptably large portion, a warning may be output.
  • FIG. 2 a shows a portion of a computer program comprising a loop in which data items are processed, function f operating on the data items, function g operating on the data items output by function f, and function h operating on these items. These functions are performed n times in a row, for values of i from 1 to n.
  • Thus, the control flow can be seen as following the solid arrows while data flow follows the dotted arrows. In order to try to parallelise this portion of the computer program it is analysed, either automatically or by a programmer, and “decouple” or program separation indications are inserted into the data flow where it is seen as being desirable to split the portion into sections that are decoupled from each other and can thus be executed on separate execution mechanisms. In this case, a decouple indication is provided between the data processing operations f and g. This can be seen as being equivalent to inserting a buffer in the data flow, as the two sections are decoupled by providing a data store between them so that the function f can produce its results, which can then be accessed at a different time by function g.
  • The program is then analysed to see if it can indeed be decoupled at the point indicated by the separation indicators. If it can then the method proceeds to FIG. 2 c. If it cannot then a warning may be output to the programmer, the program may be amended to enable it to be decoupled at this point, or the decouple indication may be removed from the program and the two sections merged.
  • FIG. 2 c shows how the separate sections of the program are decoupled by the insertion of “put” and “get” instructions into the data stream. These result in the data being generated by the f function being put into a data store, from which it is retrieved by the get instruction to be processed by function g. This enables the program to be split into two sections as is shown in FIG. 2 d. One section performs function f on the data for i=1 to n and puts it into a buffer data store. The other section then retrieves this data and performs functions g and h on it. Thus, by the provision of a data store the two sections of the program are in effect decoupled from each other and can be executed on separate execution mechanisms. This decoupling, by the use of a specialised buffer and extra instructions to write and read data, is only required for systems having heterogeneous memory, whereby two execution mechanisms may not be able to access the same memory. If the memory is shared, then the data path between the two sections does not need a data copy but can simply be provided with a data store identifier.
  • Thus, if the program is being processed by a data processing apparatus having a number of different processors, the two sections can be processed in parallel, which can improve the performance of the apparatus. Alternatively, one of the functions may be suitable for processing by an accelerator, in which case it can be directed to an accelerator while the other portion is processed by, say, the CPU of the apparatus.
  • As can be seen from FIG. 2 d, the splitting of the program results in the control code of the program being duplicated in both sections, while the data processing code is different in each section.
  • It should be noted that the put and get operations used in FIG. 2 c can be used in programs both for scalar and non-scalar values, but they are inefficient for large (non-scalar) values as they require a memory copy. In operating systems, it is conventional to use “zero copy” interfaces for bulk data transfer: instead of generating data into one buffer and then copying the data to the final destination, the final destination is first determined and the data directly generated into the final destination. A different embodiment of the invention applies this idea to the channel interface by replacing the simple ‘put’ operation with two functions: put_begin obtains the address of the next free buffer in the channel and put_end makes this buffer available to readers of the channel:
  • void* put_begin(channel *ch);
  • void put_end(channel *ch, void* buf);
  • Similarly, the get operation is split into a get_begin and get_end pair:
  • void* get_begin(channel *ch);
  • void get_end(channel *ch, void* buf);
  • Using these operations, sequences of code such as:
  • int x[100];
  • generate(x);
  • put(ch,x);
  • can be rewritten to this more efficient sequence:
  • int *px=put_begin(ch);
  • generate(px);
  • put_end(ch,px);
  • And similarly, for get:
  • int x[100];
  • get(ch,x);
  • consume(x);
  • to this more efficient sequence:
  • int *px=get_begin(ch);
  • consume(px);
  • get_end(ch,px);
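  • One possible realisation of this zero-copy interface is a ring of fixed-size slots, sketched below under the same single-producer/single-consumer, busy-wait simplifications as the FIFO sketch earlier; the zc_channel layout and sizes are assumptions of the sketch:
    #include <stddef.h>

    #define SLOTS     4
    #define SLOT_SIZE 4096

    typedef struct {
        char slot[SLOTS][SLOT_SIZE];
        volatile size_t wr, rd;        /* free-running write/read indices */
    } zc_channel;

    void *put_begin(zc_channel *ch)
    {
        while (ch->wr - ch->rd == SLOTS)
            ;                              /* block while no free slot */
        return ch->slot[ch->wr % SLOTS];   /* caller generates data here */
    }

    void put_end(zc_channel *ch, void *buf)
    {
        (void)buf;                         /* buf identifies the slot */
        ch->wr++;                          /* publish slot to readers */
    }

    void *get_begin(zc_channel *ch)
    {
        while (ch->wr == ch->rd)
            ;                              /* block while empty */
        return ch->slot[ch->rd % SLOTS];   /* caller consumes in place */
    }

    void get_end(zc_channel *ch, void *buf)
    {
        (void)buf;
        ch->rd++;                          /* slot may now be reused */
    }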
  • The use of puts and gets to decouple threads can be further extended to use where communication between threads is cyclic. Cyclic thread dependencies can lead to deadlock—that is, two threads may not run in parallel because of data dependencies between them and thus, in devices of the prior art decoupling is generally limited to acyclic thread dependencies.
      • A particularly common case of cyclic thread dependencies is code such as:
  • y = 1;
    while(1) {
      x = f(y);
      y = g(x);
    }
  • Under conventional decoupling schemes, puts are always inserted after assignment to any data boundary variable. This would require both a put outside the loop and a put at the end of the loop:
  • y1 = 1;
    put(ch,y1);
    while(1) {
      y2 = get(ch);
      x = f(y2);
      y3 = g(x);
      put(ch,y3);
    }
  • Conventional decoupling schemes only generate matched pairs of puts and gets (i.e., there is only one put on each channel and only one get on each channel) so they cannot generate such code.
  • Embodiments of the present invention use an alternative way of decoupling this code and generate:
  • y1 = 1;
    while(1) {
      put(ch,y1);
      y2 = get(ch);
      x = f(y2);
      y1 = g(x);
    }
  • This does have matched pairs of puts and gets but breaks the rule of always performing a put after any assignment to a variable.
  • FIGS. 3 a and 3 b schematically illustrate the program code shown in FIG. 2. Here a data store is provided to decouple functions f and g, but one is not provided between g and h. In this embodiment, analysis of the program to decouple it is performed automatically and several potential sections are produced, in this case loops having functions f, g and h in them. The automatic analysis then checks that each loop can be executed separately and in this case identifies a missing data path between functions g and h. Thus, these two functions are remerged to provide two sections with a data path between them.
  • A further example of a code fragment that can be split by an embodiment of the present invention is provided below. Since communication lies at the boundaries between threads, the compiler's job is to “color in” the code that lies between the boundaries. This is done through a dependency analysis to decide the set of operations that are on the “producer” side of a channel and the set of operations on the “consumer” side of a channel. The compiler then partitions the operations according to that analysis and generates a separate thread for each equivalence class.
  • 1 PIPELINE{
    2  for(int i=0; i<100; ++i) {
    3   int x = f();
    4   if (i % 10 != 0) {
    5    FIFO(2,x);
    6    g(x);
    7   }
    8  }
    9 }
  • The PIPELINE annotation on line 1 identifies the region of code to be split into threads. The FIFO annotation on line 5 identifies that the communication between threads is to be performed between f and g. The compiler performs a data and control flow analysis to determine that the call to g has a data dependency on the FIFO operation and also has control dependencies on the if statement (line 4) and the for loop (line 2). This results in the following thread:
  • 2 for(int i=0; i<100; ++i) {
    3  if (i % 10 != 0) {
    5   int x = fifo_get(fifo);
    6   g(x);
    7  }
    8 }
  • Note that the dependency on the FIFO operation caused the compiler to insert a fifo get operation.
  • The data and control flow analysis also determines that the FIFO operation (line 5) has a data dependency on the call to f (line 3) and also has control dependencies on the if statement (line 4) and the for loop (line 2). This results in the following thread:
  • 2 for(int i=0; i<100; ++i) {
    3  int x = f( );
    4  if (i % 10 != 0) {
    5   fifo_put(fifo,x);
    7  }
    8 }
  • Note that because the data dependency is from the FIFO operation, the compiler inserted a fifo_put operation.
  • Because decoupling respects data and control dependencies and FIFO channels preserve the order of their elements, the resulting threads are equivalent to the unannotated code. (Obviously, this property is not preserved when timed channels are used.)
  • Decoupling must make two essential decisions: “What variables and operations to replicate?” and “What operations to place in the same thread?”.
  • The task of decoupling is to split the region of code into as many threads as possible, without introducing timing-dependent behaviour, using channels to communicate between threads. As the example above shows, the generated threads do not strictly partition the statements in the original code: some variables and operations (principally those used for control) are duplicated between threads. The choice of what to duplicate is an essential part of the transformation: if too much code or data is duplicated, the transformed program can run slower and use much more memory than the original program. While these decisions could be controlled using annotations on every variable and statement, some simple default rules can be used that give the programmer control without requiring excessive annotation. By default, the variables to be duplicated (i.e., privatized) are determined by the location of their declaration (variables declared inside the PIPELINE annotation may be duplicated) and their size (scalar variables may be duplicated). Operations other than function calls may be duplicated unless they have side-effects or modify a non-duplicable variable.
  • The other essential decision that the transformation must make is “What operations must be in the same thread as each other?”. To avoid introducing timing-dependent behaviour, we apply the following rules:
      • 1. To preserve data and control dependencies, any dependent operations must be in the same thread as each other unless the dependency is from an operation to a FIFO operation. This special treatment of dependencies on FIFO operations has the effect of cutting edges in the dataflow graph. Note that dependencies include both direct data dependencies due to function inputs and outputs and indirect dependencies due to side-effects.
      • 2. To avoid introducing race conditions, any operations which write to an unduplicated non-channel variable must be in the same thread as all operations which read from or write to that variable. Channels are excluded because all channel operations atomically modify the channel.
      • 3. To avoid introducing non-determinism, all puts to a given channel must be in the same thread as each other and all gets from a given channel must be in the same thread as each other.
  • Our implementation of decoupling finds the unique, maximal solution in three stages: dependency analysis, merging, and thread production.
  • The dependency analysis stage forms a large number of “pre-threads” by computing a backward data and control slice (see Mark Weiser, “Program slicing”, in ICSE '81: Proc. of International Conference on Software Engineering, 439-449, 1981) from each unduplicated operation ignoring data dependencies on FIFO operations but including all duplicable and unduplicable operations in the slice. That is, we repeatedly apply rules (1) and (2) to form prethreads. In our running example, there are three prethreads: one each for f( ), FIFO(2,x), and g(x).
  • For example, the prethread for f( ) is:
  • 1 PIPELINE{
    2  for(int i=0; i<100; ++i) {
    3  int x = f( );
    8  }
    9 }

    the prethread for FIFO(2,x) is:
  • 1 PIPELINE{
    2  for(int i=0; i<100; ++i) {
    3  int x = f( );
    4  if (i % 10 != 0) {
    5   FIFO(2,x);
    7  }
    8  }
    9 }

    and the prethread for g(x) is:
  • 1 PIPELINE{
    2  for(int i=0; i<100; ++i) {
    3  int x;
    4  if (i % 10 != 0) {
    6   g(x);
    7  }
    8  }
    9 }
  • Note that the prethread for g(x) omits the FIFO operation (line 5): the dependency rules ensure that FIFO operations are placed in the producer thread but not the consumer thread.
  • The merging stage combines “prethreads” by merging threads that contain the same non-duplicable operation or variable. For example, the prethread for f( ) is merged with the prethread for FIFO(2,x) because they both contain the operation f( ), resulting in the prethread:
  • 1 PIPELINE{
    2  for(int i=0; i<100; ++i) {
    3  int x = f( );
    4  if (i % 10 != 0) {
    5   FIFO(2,x);
    7  }
    8  }
    9 }
  • This is identical to the prethread for FIFO(2,x) because that prethread already contained the f( ) operation.
  • The thread production stage converts prethreads to threads by inserting channel declarations and initialization, privatizing duplicable variables, replacing FIFO operations with fifo_put operations and inserting a fifo_get operation in every thread that contains an operation dependent on a FIFO operation. If multiple threads contain operations dependent on the same FIFO operation, a separate channel has to be introduced for each fifo_get operation introduced and the FIFO operation is replaced with a fifo_put operation on each channel.
  • When adding FIFO annotations, it is easy for the programmer to overlook dependencies between operations. For example, if the if-statement were changed to x>0, then the control dependency of g(x) on the if-statement and the data dependency of the condition x>0 requires the prethread for x=f( ) to be merged with the prethread for g(x) and no pipeline parallelism is introduced. Since the use of FIFO suggests that the programmer expected parallelism, the compiler reports a warning that both sides of the channel are in the same thread. In this case, the problem can be fixed by moving the FIFO operation to before the if-statement or by arranging to pass the if-condition through a channel by changing the if-statement to if (FIFO(2,x>0)) { . . . }.
  • Decoupling can be used with all channel types except non-blocking FIFOs.
  • When optimizing inter-thread communication and introducing pipeline parallelism, our compiler needs to perform dataflow analysis on both scalar and non-scalar variables. When a function is written in C, it is often possible to perform this analysis automatically using, for example, the techniques of Tu et al. (P. Tu and D. Padua, “Automatic Array Privatization”, in Proc. Workshop on Compiler Optimizations for Scalable Parallel Systems Languages (LNCS 768), Springer Verlag, 500-521, 1993) to compute DEF, USE and KILL sets. In typical SoC architectures, this analysis is not possible because the processing element (PE) is programmed in assembly language or is a fixed-function accelerator, so the programmer must annotate the function declarations for RPC functions with whether an argument is an “in”, “out” or “inout” argument. Similarly, side-effects on global variables or on device state are annotated by recording dependencies on those variables/devices.
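  • For illustration, such an annotation might look like the following sketch; the pragma syntax and the function are invented here, not part of the original text:
    /* Hypothetical annotated declaration of an RPC function that runs on a
       fixed-function accelerator. Because the accelerator's internals
       cannot be analysed, the annotation supplies the DEF/USE information
       directly: 'src' and 'n' are only read ("in"), 'dst' is only written
       ("out"). */
    #pragma rpc_args(in: src, n; out: dst)
    void accel_fir(const short *src, short *dst, int n);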
  • FIG. 4 shows a further example of how an original piece of code can be split into two threads to be executed in parallel using put and get instructions.
  • Parallelizing at a coarse granularity allows the duplication of more control code between threads which reduces and simplifies inter-thread communication allowing the generation of distributed schedules. That is, we can distribute the control code across multiple processors both by putting each control thread on a different processor and by putting different parts of a single control thread onto different processors.
  • The transfer of data may be done by writing the data to a particular buffer such as a FIFO. Alternatively, it may simply be done by providing the other section of the program with information as to where the data has been stored.
  • The way of transferring the data depends on the system the program is executing on. In particular, if the architecture does not have shared memory, it is necessary to insert DMA copies from a buffer in one memory to a buffer in a different memory. This can lead to a lot of changes in the code: declaring both buffers, performing the copy, etc. In embodiments of the invention an analysis is performed to determine which buffers need to be replicated in multiple memory regions and to determine exactly which form of copy should be used. DMA copies are also inserted automatically subject to some heuristics when the benefit from having the programmer make the decision themselves is too small.
  • One of the advantages of inserting channel indicators into a program explicitly is that it is possible to use channels which do not provide first-in-first-out semantics. This section describes some alternative implementations which can be used.
  • Atomic channels provide atomic access to an element: an atomic_get operation acquires a copy of the element and makes the element unavailable to other threads (i.e., it “locks” the variable) and an atomic_put operation makes the variable available for use by other threads (i.e., it “unlocks” the variable). We note that atomic channels are equivalent to a fifo channel of maximum length 1.
  • Nonblocking put and get channels are a variant on fifo channels where the nbpfifo_put operation returns an error code if the channel is full instead of blocking as fifo channels do. These channels are for use in interrupt handlers since it is possible to block a thread but not an interrupt. We also provide channels that provide a non-blocking nbgfifo_get operation.
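  • A sketch of the non-blocking put, reusing the channel layout from the FIFO sketch earlier (the fixed sizes again being illustrative only):
    #include <stddef.h>
    #include <string.h>

    #define CAPACITY  8
    #define ELEM_SIZE 64

    typedef struct {
        char buf[CAPACITY][ELEM_SIZE];
        volatile size_t head, tail;
    } channel;

    /* Returns 0 on success, -1 if the channel is full. Because it never
       blocks, it is safe to call from an interrupt handler. */
    int nbpfifo_put(channel *ch, const void *data)
    {
        if (ch->tail - ch->head == CAPACITY)
            return -1;                 /* full: report rather than block */
        memcpy(ch->buf[ch->tail % CAPACITY], data, ELEM_SIZE);
        ch->tail++;
        return 0;
    }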
  • Timed channels provide time-indexed access to data. When data is inserted into or removed from the channel, a timestamp is specified:
  • void ts_put(tschan_t c, int timestamp, void* v);
  • void ts_get(tschan_t c, int timestamp, void* v);
  • The ts_get operation returns the entry with the closest timestamp to that specified. All ts_put operations must use strictly increasing times and all ts_get operations must use strictly increasing times. This restriction allows entries to be discarded when they can no longer be accessed.
  • Timed channels allow for more parallelism between threads since, after the first ts_put is performed, ts_get operations never block because there is always an entry with a closest timestamp. The cost of this increased performance is less precise synchronization between threads than with FIFO channels: applications that use timed channels are unlikely to give deterministic results.
  • Timed channels are useful for implementing time-sensitive information where it is important to use current data. For example, mobile telephones implementing the “3rd generation” W-CDMA protocol use rake receivers to increase the bandwidth of links subject to multipath fading (i.e., where the signal contains delayed echoes of the transmitted signal typically due to reflections off buildings). Rake receivers estimate the strength and delay of these reflections and use these estimates to combine delayed copies of the received signal to maximize the amount of received signal energy. In such applications, the best estimate to use is the one closest in time to the data arrived which may be different from the next estimate generated.
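  • A hypothetical usage sketch for this rake-receiver scenario, with the estimate type and the opaque channel handle invented for illustration around the ts_put/ts_get interface above:
    typedef struct tschan *tschan_t;     /* opaque handle */
    void ts_put(tschan_t c, int timestamp, void *v);
    void ts_get(tschan_t c, int timestamp, void *v);

    typedef struct { float strength; int delay; } estimate;

    /* Estimator thread: publish each new estimate tagged with its time.
       Timestamps must strictly increase across calls. */
    void publish_estimate(tschan_t ch, int t, estimate *e)
    {
        ts_put(ch, t, e);
    }

    /* Combiner thread: fetch whichever estimate is closest in time to the
       data block being processed; never blocks after the first put. */
    void fetch_estimate(tschan_t ch, int data_time, estimate *e)
    {
        ts_get(ch, data_time, e);
    }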
  • Timed channels are an example of a channel type which makes sense in some domains or applications but not in others. Rather than fix the set of channel types in the language, our compiler allows new channel types to be added using annotations to identify channel types and put and get operations. The only properties on which SoC-C relies are that operations are atomic, directional and obey a copy semantics. That is, put operations atomically copy data into a channel and get operations atomically copy data out of a channel.
  • Most alternative channel types of interest have the effect of making the application non-deterministic: the behaviour of the application depends on the precise order in which parallel operations complete. For this reason, we require the programmer to explicitly decide which type of queue to use.
  • FIG. 5 a shows a simple computer program annotated according to an embodiment of the present invention. An analysis of this program is performed initially, and parts of the program are identified by programmer annotation in this embodiment, although they could be identified by some other analysis including static analysis, profile-driven feedback, etc. The parts identified are as follows:
  • What can be regarded as the “decoupling scope”. This is a contiguous sequence of code that we wish to split into multiple threads.
  • The “replicatable objects”: that is, variables and operations which it is acceptable to replicate. A simple rule of thumb is that scalar variables (i.e., not arrays) which are not used outside the scope, scalar operations which only depend on and only modify replicatable variables, and control flow operations should be replicated, but more sophisticated policies are possible.
  • Ordering dependencies between different operations: if two function calls both modify a non-replicated variable, the order of those two function calls is preserved in the decoupled code. (Extensions to the basic algorithm allow this requirement to be relaxed in various ways.)
  • The “data boundaries” between threads: that is, the non-replicatable variables which will become FIFO channels. (The “copies” data annotation described above determines the number of entries in the FIFO.)
  • This degree of annotation is fine for examples but would be excessive in practice so most real embodiments would rely on tools to add the annotations automatically based on heuristics and/or analyses.
  • At a high level, the algorithm splits the operations in the scope into a number of threads whose execution will produce the same result as the original program under any scheduling policy that respects the FIFO access ordering of the channels used to communicate between threads.
  • The particular decoupling algorithm used generates a maximal set of threads such that the following properties hold:
      • All threads have the same control flow structure and may have copies of the replicatable variables and operations.
      • Each non-replicatable operation is included in only one of the threads.
      • Each non-replicatable variable must satisfy one of the following:
        • The only accesses to the variable in the original program are reads; or
        • All reads and writes to the variable are in a single thread; or
        • The variable was marked as a data boundary and all reads are in one thread and all writes are in another thread.
      • If two operations have an ordering dependency between them which is not due to a read after write (RAW) dependency on a variable which has been marked as a data boundary, then the operations must be in the same thread.
  • FIG. 5 b shows the maximal set of threads for the program of FIG. 5 a. One way to generate the set of threads shown in FIG. 5 b is as follows:
      • 1. For each non-replicatable operation, create a ‘protothread’ consisting of just that operation plus a copy of all the replicatable operations and variables.
  • Each replicatable variable must be initialized at the start of each thread with the value of the original variable before entering the scope and one of the copies of each replicatable variable should be copied back into the master copy on leaving the scope. (Executing all these protothreads is highly unlikely to give the same answer as the original program, because it lacks the necessary synchronization between threads. This is fixed by the next steps.)
      • 2. Repeatedly pick two threads and merge them into a single thread if any of the following problems exist:
        • a. One thread writes a non-replicatable variable which is accessed (read or written) by the other thread and the variable is not marked as a data boundary.
        • b. Two threads both write to a variable which is marked as a data boundary.
        • c. Two threads both read from a variable that is marked as a data boundary.
        • d. There is an ordering dependency between an operation in one thread and an operation in the other thread which is not a RAW dependency on a variable marked as a data boundary.
      • 3. When no more threads can be merged, quit.
  • Another way is to pick an operation, identify all the operations which must be in the same thread as that operation by repeatedly adding operations which would be merged (in step 2 above). Then pick the next operation not yet assigned to a thread and add all operations which must be in the same thread as that operation. Repeat until there are no more non-replicatable operations. It should be noted that this is just one possible way of tackling this problem: basically, we are forming equivalence classes based on a partial order and there are many other known ways to do this.
  • The above method splits a program into a number of sections which can be executed in parallel. There are many possible mechanisms that can be used to execute these sections.
  • Execution mechanism.
  • FIG. 6 shows a flow diagram of a method according to an embodiment of the present invention. An initial step of the method comprises analysing the computer program, which contains sequential code and program separation indicators. For each program separation indicator, the program is analysed to determine whether it can be divided into separate sections around that indicator; first of all it is checked that it is reasonable to divide it there. If the program can be divided at this point, then data communication between the two sections is provided, which may be done in a number of ways, including the insertion of put and get instructions as was discussed earlier, and the program is then analysed to determine whether there are further program separation indicators. If there is a further indicator, the program is analysed at that point to determine whether it can be divided there. If it cannot be divided in its present state, it is checked to determine whether further data transfer instructions would allow it to be divided; if so, these are inserted, and data communication in line with these data transfer instructions is then provided between the two sections. If it is determined that the program cannot be divided at this point, the two potential sections can be merged and the program separation indicator removed from the list. In some embodiments, rather than merging the two potential sections, a warning is output to indicate that the program is not correct, and the programmer can then amend it.
  • Program separation indicators can take a number of forms. They may simply be some sort of indicator, such as a split marker indicating that the program should be divided at that point, or they may be data transfer functions which themselves provide data communication between the two sections. Thus, they may be fifo instructions indicating that data should be sent through a FIFO to transfer it between the two sections, as was discussed earlier, or they may be put and get instructions. It may be that several variables need to be transferred between the two sections and that there are not sufficient data transfer instructions within the program. In such a case, further data transfer instructions for the variables which have not been addressed can be inserted into the program; data communication between the two sections is then provided, and the sections can be separated for execution on separate execution mechanisms.
  • 1. Introduction
  • The following describes language extensions/annotations, compilation tools, analysis tools, debug/profiling tools, runtime libraries and visualization tools to help programmers program complex multiprocessor systems. It is primarily aimed at programming complex SoCs which contain heterogeneous parallelism (CPUs, DEs, DSPs, programmable accelerators, fixed-function accelerators and DMA engines) and irregular memory hierarchies.
  • The compilation tools can take a program that is either sequential or contains few threads and map it onto the available hardware, introducing parallelism in the process.
  • When the program is executed, we can exploit the fact that we know mappings between the user's program and what is executing to efficiently present a debug and profile experience close to what the programmer expects while still giving the benefit of using the parallel hardware. We can also exploit the high level view of the overall system to test the system more thoroughly, or to abstract away details that do not matter for some views of the system.
  • This provides a way of presenting a full view of the system for SoC programming.
  • 2. Single View Compilation 2.1 Overview
  • The task of programming a SoC is to map different parts of an application onto different parts of the hardware. In particular, blocks of code must be mapped onto processors, data engines, accelerators, etc. and data must be mapped onto various memories. In a heterogeneous system, we may need to write several versions of each kernel (each optimized for a different processor) and some blocks of code may be implemented by a fixed-function accelerator with the same semantics as the code. The mapping process is both tedious and error-prone because the mappings must be consistent with each other and with the capabilities of the hardware. We reduce these problems using program analyses which:
      • detect errors in the mapping
      • infer what mappings would be legal
      • choose legal mappings automatically subject to some heuristics
  • The number of legal mappings is usually large but once the programmer has made a few choices, the number of legal options usually drops significantly so it is feasible to ask the programmer to make a few key choices and then have the tool fill in the less obvious choices automatically.
  • Often the code needs minor changes to allow some mappings. In particular, if the architecture does not have shared memory, it is necessary to insert DMA copies from a buffer in one memory to a buffer in a different memory. This leads to a lot of changes in the code: declaring both buffers, performing the copy, etc. Our compiler performs an analysis to determine which buffers need to be replicated in multiple memory regions and to determine exactly which form of copy should be used. It also inserts DMA copies automatically, subject to some heuristics, when the benefit from having the programmer make the decision themselves is too small.
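  • By way of a hedged sketch (the buffer, function and region names here are hypothetical), the transformation might turn code written against one logical buffer into code that maintains a copy per memory region:
  • /* Before: one logical buffer, produced by the CPU, consumed by a DSP. */
    int buf[256] @ {memory=>"cpu_ram"};
    fill(buf);
    filter(buf) @ {processor=>"DSP1"};

    /* After: the compiler declares a copy in DSP-visible memory and
       inserts a DMA copy between the producer and the consumer. */
    int buf_cpu[256] @ {memory=>"cpu_ram"};
    int buf_dsp[256] @ {memory=>"dsp_ram"};
    fill(buf_cpu);
    dma_copy(buf_dsp, buf_cpu, sizeof(buf_cpu)); /* inserted automatically */
    filter(buf_dsp) @ {processor=>"DSP1"};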
  • Systems with multiple local memories often have tight memory requirements which are exacerbated by allocating a copy of a buffer in multiple memories. Our compiler uses lifetime analysis and heuristics to reduce the memory requirement by overlapping buffers in a single memory when they are never simultaneously live.
  • Programmable accelerators may have limited program memory so it is desirable to upload new code while old code is running. For correctness, we must guarantee that the new code is uploaded (and I-caches made consistent) before we start running it.
  • Our compiler uses program analysis to check this and/or to schedule uploading of code at appropriate places.
  • For applications with highly variable load, it is desirable to have multiple mappings of an application and to switch dynamically between different mappings.
  • Some features of our approach are:
      • Using an architecture description to derive the ‘rules’ for what code can execute where. In particular, we use the type of each processor and the memories attached to each processor.
      • The use of program analysis together with the architecture description to detect inconsistent mappings.
      • Using our ability to detect inconsistent mappings to narrow down the list of consistent mappings to reduce the number of (redundant) decisions that the programmer has to make.
      • Selecting an appropriate copy of a buffer according to which processor is using it and inserting appropriate DMA copy operations.
      • Use of lifetime analyses and heuristics to reduce memory usage due to having multiple copies of a buffer.
      • Dynamic switching of mappings.
    2.2 Annotations to Specify Mappings
  • To describe this idea further, we need some syntax for annotations. Here we provide one embodiment of annotations which provide the semantics we want.
  • In this document, all annotations take the form:
      • ... @ {tag1=>value1, ..., tagN=>valueN}
  • Or, when there is just one tag and it is obvious,
      • ... @ value
  • The primary annotations are on data and on code. If a tag is repeated, it indicates alternative mappings.
  • The tags associated with data include:
      • {memory=>“bank3”} specifies which region of memory a variable is declared in.
      • {copies=>2} specifies that a variable is double buffered.
      • {processor=>“P1”} specifies that a variable is in a region of memory accessible by processor P1.
  • For example, the annotation:
      • int x[100] @ {memory=>“bank3”, copies=>2, memory=>“bank4”, copies=>1} indicates that there are 3 alternative mappings of the array x: two in memory bank3 and one in memory bank4.
  • The tags associated with code include:
      • {processor=>“P1”} specifies which processor the code is to run on
      • {priority=>5} specifies the priority with which that code should run relative to other code running on the same processor
      • {atomic=>true} specifies that the code is to run without pre-emption.
      • {runtime=>“<=10 ms”} specifies that the code must be able to run in less than 10 milliseconds on that processor. This is one method used to guide automatic system mapping.
  • For example, the annotation:
      • {fir(x); fft(x,y);} @ {processor=>“P1”}
  • Specifies that processor P1 is to execute fir followed by fft. The semantics is similar to that of a synchronous remote procedure call: when control reaches this code, free variables are marshalled and sent to processor P1, processor P1 starts executing the code and the program continues when the code finishes executing.
  • It is not always desirable to have synchronous RPC behaviour. It is possible to implement asynchronous RPCs using this primitive either by executing mapped code in a separate thread or by splitting each call into two parts: one which signals the start and one which signals completion.
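  • For example (a sketch with hypothetical names), the split into two parts might look like:
  • /* Synchronous RPC: the caller blocks until fir completes on P1. */
    fir(x) @ {processor=>"P1"};

    /* Asynchronous variant: one call signals the start and a second waits
       for completion, so the caller can do other work in between. */
    fir_start(x) @ {processor=>"P1"};
    /* ... other useful work on the calling processor ... */
    fir_wait() @ {processor=>"P1"};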
  • The tags associated with functions are:
      • {cpu=>“AR1DE”} specifies that this version of an algorithm can be run on a processor/accelerator of type “AR1DE”
      • {flags=>“-O3”} specifies compiler options that should be used when compiling this function
      • {implements=>“fir”} specifies that this version of an algorithm can be used as a drop in replacement for another function in the system
  • For example, the annotation:
      • void copy_DMA(void* src, void* tgt, unsigned length) @ {cpu=>“PL081”, implements=>“copy”};
  • Specifies that this function runs on a PL081 accelerator (an ARM PrimeCell DMA engine) and can be used whenever a call to “copy” is mapped to a PL081 accelerator.
  • 2.3 Extracting Architectural Rules from the Architectural Description
  • There are a variety of languages for describing hardware architectures including the SPIRIT language and ARM SoCDesigner's internal language. While the languages differ in syntax, they share the property that we can extract information such as the following:
      • The address mapping of each processor. That is, which elements of each memory region and which peripheral device registers are accessed at each address in the address and I/O space. A special case of this is being able to detect that a component cannot address a particular memory region at all.
      • The type of each component including any particular attributes such as cache size or type.
      • That a processor's load-store unit, a bus, a combination of buses operating in parallel, a memory controller or the address mapping can make it possible for accesses by one processor to two addresses (whether they map to the same component or to different components) to be seen in a different order by another processor. That is, the processors are not sequentially consistent with respect to some memory accesses.
      • That a combination of load-store units, caches, buffers in buses, memory controllers, etc. makes it possible for writes by one processor to the same memory location to suffer from coherency problems with respect to another processor for certain address ranges.
  • Thus, from the architecture, we can extract address maps, which can be used to fill in fine details of the mapping process, and we can detect problems (such as limited connectivity, lack of sequential consistency, and incoherence) that can affect the correctness of a mapping.
  • 2.4 Detecting Errors in a System Mapping
  • Based on rules detected in an architectural description and/or rules from other sources, we can analyse both sequential and parallel programs to detect errors in the mapping. Some examples:
      • If a piece of code is mapped to a processor P and that code reads or writes data mapped to a memory M and P cannot access M, then there is an error in the mapping.
      • If two pieces of code mapped to processors P1 and P2 both access the same variable x (e.g. P1 writes to x and P2 reads from x), then any write by P1 that can be observed by a read by P2 must:
        • have some synchronization between P1 and P2
        • be coherent (e.g., there may need to be a cache flush by P1 before the synchronization and a cache invalidate by P2 after the synchronization)
        • be sequentially consistent (e.g., there may need to be a memory barrier by P1 before the synchronization and a memory barrier by P2 after the synchronization)
        • share memory (e.g., it may be necessary to insert one or more copy operations (by DMA engines or by other processors/accelerators) to transfer the data from one copy of x to the other).
      • Synchronization and signalling can be checked
      • Timing and bandwidth can be checked
      • Processor capability can be checked: a DMA engine probably cannot play Pacman
      • Processor speed can be checked: a processor may not be fast enough to meet certain deadlines.
      • Etc.
  • Thus, we can check the mapping of a software system against the hardware system it is to run on based on a specification of the architecture or additional information obtained in different ways.
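  • A minimal sketch of one such check, assuming the architecture description has already been reduced to a table of which processor can address which memory (the table contents and names here are hypothetical), is:
  • #include <stdio.h>
    #include <string.h>

    /* One row per (processor, memory) pair that the architecture permits. */
    struct access { const char *processor, *memory; };
    static const struct access can_access[] = {
        {"P1", "bank3"}, {"P1", "bank4"}, {"P2", "bank4"},
    };

    /* Returns nonzero (and reports) if code mapped to 'processor' touches
       a variable mapped to 'memory' that the processor cannot address. */
    int check_mapping(const char *processor, const char *memory)
    {
        for (size_t i = 0; i < sizeof can_access / sizeof can_access[0]; ++i)
            if (strcmp(can_access[i].processor, processor) == 0 &&
                strcmp(can_access[i].memory, memory) == 0)
                return 0;
        fprintf(stderr, "mapping error: %s cannot access %s\n", processor, memory);
        return 1;
    }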
  • 2.5 Filling in Details and Correcting Errors in a System Mapping
  • Having detected errors in a system mapping, there are a variety of responses. An error such as mapping a piece of code to a fixed-function accelerator that does not support that function should probably just be reported as an error that the programmer must fix. Errors such as omitting synchronization can sometimes be fixed by automatically inserting synchronization. Errors such as mapping more variables to a memory bank than will fit can be solved, to some extent, using overlay techniques. Errors such as mapping an overly large variable to a memory can be resolved using software-managed paging, though this may need hardware support or require that the kernel be compiled with software paging turned on (note: software paging is fairly unusual so we have to implement it before we can turn it on!). Errors such as omitting memory barriers, cache flush/invalidate operations or DMA transfers can always be fixed automatically, though it can require heuristics to insert them efficiently and, in some cases, it is more appropriate to request that the programmer fix the problem themselves.
  • 3. Architecture Driven Communication 3.1 Overview
  • Given a program that has been mapped to the hardware, the precise way that the code is compiled depends on details of the hardware architecture. In particular, it depends on whether two communicating processors have a coherent and sequentially consistent view of a memory through which they are passing data.
  • 3.2 Communication Glue Code
  • Our compiler uses information about the SoC architecture, extracted from the architecture description, to determine how to implement the communication requirements specified within the program. This enables it to generate the glue code necessary for communication to occur efficiently and correctly. This can include generation of memory barriers, cache maintenance operations, DMA transfers and synchronisation on different processing elements.
  • This automation reduces programming complexity, increases reliability and flexibility, and provides a useful mechanism for extended debugging options.
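  • As an illustrative sketch (the cache, barrier and signalling primitives named here are hypothetical placeholders for whatever the target provides), the glue generated for passing a buffer from a cache-equipped CPU to a non-coherent accelerator might be:
  • /* Producer side (CPU): make the data visible to the accelerator. */
    generate(buf);
    cache_clean_range(buf, BUF_BYTES);  /* write dirty lines back to memory */
    memory_barrier();                   /* order the writes before the signal */
    signal_ready(&buf_ready);

    /* Consumer side: ensure stale cached data is not read. */
    wait_ready(&buf_ready);
    cache_invalidate_range(buf, BUF_BYTES);
    consume(buf);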
  • 3.3 Communication Error Checking
  • Other manual and automatic factors may be used to influence the communication mechanism decisions. Errors and warnings within communication mappings can be found using information derived from the architecture description.
  • 3.4 Summary
  • Some features of our approach are:
      • Detecting coherence and consistency problems of communication requirements from a hardware description.
      • Automatically inserting memory barriers, cache maintenance, DMA transfers etc. to fix coherence/consistency problems into remote procedure call stubs (i.e., the “glue code”) based on above.
    4. Accelerator RPC
  • We take the concept of Remote Procedure Calls (RPCs) which are familiar on fully programmable processors communicating over a network, and adapt and develop it for application in the context of a SoC: processors communicating over a bus with fixed function, programmable accelerators and data engines.
  • Expressing execution of code on other processing elements or invocation of accelerators as RPCs gives a function based model for programmers, separating the function from the execution mechanism. This enables greater flexibility and scope for automation and optimisation.
  • 4.1 RPC Abstraction
  • An RPC abstraction can be expressed as functions mapped to particular execution mechanisms:
  • void main() {
       foo();
       foo() @ {processor => p2};
      }
  • This provides a simple mechanism to express invocation of functions, and the associated resourcing, communication and synchronisation requirements.
  • Code can be translated to target the selected processing elements, providing the associated synchronisation and communication. For example, this could include checking that the resource is free, configuring it, starting it and copying back the results on completion. The compiler can select appropriate glue mechanisms based on the source and target of the function call. For example, an accelerator is likely to be invoked primarily by glue on a processor using a mechanism specific to the accelerator.
  • The glue code may be generated automatically based on a high level description of the accelerator or the programmer may write one or more pieces of glue by hand.
  • The choice of processor on which the operation runs can be determined statically or can be determined dynamically. For example, if there are two identical DMA engines, one might indicate that the operation can be mapped onto either engine depending on which is available first.
  • The compiler optimisations based on the desired RPC interface can range from a dynamically linked interface to inter-procedural specialisation of the particular RPC interface.
  • 4.2 RPC Semantics
  • RPC calls may be synchronous or asynchronous. Asynchronous calls naturally introduce parallelism, while synchronous calls are useful as a simpler function call model, and may be used in conjunction with fork-join parallelism. In fact, parallelism is not necessary for efficiency; a synchronous call alone can get the majority of the gain when targeting accelerators. Manually and automatically selecting between asynchronous and synchronous options can benefit debugging, tracing and optimisation.
  • RPC calls may be re-entrant or non-reentrant, and these decisions can be made implicitly, explicitly or through program analysis to provide benefit such as optimisation where appropriate.
  • 4.3 RPC Debugging
  • This mechanism enables a particular function to have a number of different execution targets within a program, but each of those targets can be associated back to the original function; debugging and trace can exploit this information. This enables a user to set a breakpoint on a particular function, and the debug and trace mechanisms can be arranged such that the breakpoint is caught wherever the function executes, or only on a restricted subset (e.g. a particular processing element).
  • The details of the RPC interface implementation can be abstracted away in some debugging views.
  • 4.4 Summary
  • Some features of our approach are:
      • Using an RPC-like approach for mapping functions on to programmable and fixed function accelerators, including multiple variants.
      • Providing mechanisms for directing mapping and generation of the marshalling and synchronisation to achieve it.
      • Optimising the RPC code based on inter-procedural and program analysis.
      • Providing debug functionality based on information from the RPC abstraction and the final function implementations.
    5. Coarse Grained Dataflow 5.1 Overview
  • Increasingly, applications are being built using libraries which define datatypes and a set of operations on those types. The datatypes are often bulk datastructures such as arrays of data, multimedia data, signal processing data, network packets, etc. and the operations may be executed with some degree of parallelism on a coprocessor, DSP processor, accelerator, etc. It is therefore possible to view programs as a series of often quite coarse-grained operations applied to quite large data structures instead of the conventional view of a program as a sequence of ‘scalar’ operations (like ‘32 bit add’) applied to ‘scalar’ values like 32-bit integers or the small sets of values found in SIMD within a register (SWAR) processing such as that found in NEON. It is also advantageous to do so because this coarse-grained view can be a good match for accelerators found in modern SoCs.
  • We observe that with some non-trivial adaptation and some additional observations, optimization techniques known to work on fine-grained operations and data can be adapted to operate on coarse-grained operations and data.
  • Our compiler understands the semantics associated with the data structures and their use within the system, and can manipulate them and the program to perform transformations and optimisations to enable and optimise execution of the program.
  • 5.2 Conventional Analyses and Their Extension
  • Most optimizing compilers perform a dataflow analysis prior to optimization. For example, section 10.5 of Aho, Sethi and Ullman's 'Compilers: Principles, Techniques and Tools' (Addison Wesley, 1986, ISBN 0-201-10194-7) describes dataflow analysis. Such dataflow analysis is traditionally restricted to scalar values: those that fit in a single CPU register. Two parts of a dataflow analysis are:
      • identifying the dataflow through individual operations
      • combining the dataflow analysis with a control-flow analysis to determine the dataflow from one program point to another.
  • In order to use dataflow analysis techniques with coarse-grained dataflow, we modify the first part so that instead of identifying the effect of a single instruction on a single element, we identify the effect of a coarse-grained operation (e.g., a function call or coprocessor invocation) on an entire data structure in terms of whether the operation is a ‘use’, a ‘def’ or a ‘kill’ of the value in a data structure. Care must be taken if an operation modifies only half of an array since the operation does not completely kill the value of the array.
  • For operations implemented in hardware or in software, this might be generated automatically from a precise description of the operation (including the implementation of the operation), generated from an approximate description of the main effects of the operation, or provided as a direct annotation.
  • In particular, for software, these coarse-grained operations often consist of a simple combination of nested loops and we can analyze the code to show that the operation writes to an entire array and therefore ‘kills’ the old value in the array. In scalar analysis, this is trivial since any write necessarily kills the entire old value.
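  • For example, an analysis can show that a function of the following shape writes every element of its output array, so a call to it 'kills' the array's old value, whereas a function that writes only some of the elements is a 'def' but not a 'kill' (a sketch):
  • /* Writes all 100 elements of dst: a call to scale() kills dst. */
    void scale(int dst[100], const int src[100], int k)
    {
        for (int i = 0; i < 100; ++i)
            dst[i] = k * src[i];
    }

    /* Writes only the even elements: a 'def' of dst but not a 'kill'. */
    void scale_even(int dst[100], const int src[100], int k)
    {
        for (int i = 0; i < 100; i += 2)
            dst[i] = k * src[i];
    }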
  • The following sections identify some of the uses of coarse-grained dataflow analysis.
  • 5.3 Multiple Versions of the Same Buffer
  • Especially when writing parallel programs or when using I/O devices and when dealing with complex memory hierarchies, it is necessary to allocate multiple identically sized buffers and copy between the different buffers (or use memory remapping hardware to achieve the effect of a copy). We propose that in many cases these multiple buffers can be viewed as alternative versions of a single, logical variable. It is possible to detect this situation in a program with multiple buffers, or the programmer can identify the situation. One way the programmer can identify the situation is to declare a single variable and then use annotations to specify that the variable lives in multiple places or the programmer could declare multiple variables and use annotations to specify that they are the same logical variable.
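  • Using the annotation style introduced earlier, either style of identification might look like the following sketch (repeated tags denote alternative mappings, as described in section 2.2; the 'aliases' tag is a hypothetical one for tying declarations together):
  • /* One declaration, two homes: the tools maintain a version in each. */
    int x[1024] @ {memory=>"sram", memory=>"dram"};

    /* Or: two declarations identified as the same logical variable
       ('aliases' is a hypothetical tag for this purpose). */
    int x_sram[1024] @ {memory=>"sram", aliases=>"x"};
    int x_dram[1024] @ {memory=>"dram", aliases=>"x"};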
  • However the different buffers are identified as being one logical variable, the advantages that can be obtained include:
      • more intelligent buffer allocation
      • detecting errors where one version is updated and that change is not propagated to other version before it is used
      • debug, trace and profile tools can treat a variable as one logical entity so that, for example, if the programmer sets a watchpoint on x then the tools watch for changes on any version of x. Likewise, if the compiler has put x and y in the same memory location (following liveness analysis), then the programmer will only be informed about a write to x when that memory location is being used to store x, not when it is being used to store y. When doing this, one might well want to omit writes to a variable which exist only to preserve the multi-version illusion. For example, if one accelerator writes to version 1, then a DMA copies version 1 to version 2, then another accelerator modifies the variable, the programmer will often not be interested in the DMA copy.
  • We note that compilers do something similar for scalar variables: the value of a scalar variable ‘x’ might sometimes live on the stack, sometimes in register 3, sometimes in register 6, etc. and the compiler keeps track of which copies currently contain the live value.
  • 5.4 Allocation
  • By performing a liveness analysis of the data structures, the compiler can provide improved memory allocation through memory reuse because it can identify opportunities to place two different variables in the same memory location. Indeed, one can use many algorithms normally used for register allocation (where the registers contain scalar values) to perform allocation of data structures. One modification required is that one must handle the varying size of buffers whereas, typically, all scalar registers are the same size.
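  • A minimal sketch of the effect (sizes and names hypothetical): if liveness analysis shows that two buffers are never simultaneously live, they can be overlaid in one region of a small memory, just as two scalars with disjoint live ranges can share a register:
  • /* tmp_a is live only during phase 1 and tmp_b only during phase 2,
       so the allocator can overlay them in the same region of bank3. */
    static union {
        int   tmp_a[1024];   /* used in phase 1 */
        short tmp_b[2048];   /* reuses the same storage in phase 2 */
    } scratch @ {memory=>"bank3"};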
  • 5.5 Scheduling
  • One thing that can increase memory use is having many variables simultaneously live. It has been known for a long time that you can reduce the number of scalar registers required by a piece of code by reordering the scalar operations so that fewer variables are simultaneously live.
  • Using a coarse-grained dataflow analysis, one can identify the lifetime of each coarse-grained data structure and then reorder code to reduce the number of simultaneously live variables. One can even choose to recalculate the value of some data structure because it is cheaper to recalculate it than to remember its value.
  • When parallelising programs, one can also deliberately choose to restrain the degree of parallelism to reduce the number of simultaneously live values. Various ways to restrain the parallelism exist: forcing two operations into the same thread, using mutexes/semaphores to block one thread if another is using a lot of resource, tweaking priorities or other scheduler parameters.
  • If a processor/accelerator has a limited amount of available memory, performing a context switch on that processor can be challenging. Context switching memory-allocated variables used by that processor solves the problem.
  • 5.6 Optimisation
  • Compiler books list many other standard transformations that can be performed on scalar code. Some of the mapping and optimisation techniques we discuss, which can be applied at this coarse grain, include value splitting, spilling, coalescing, dead variable removal, recomputation, loop hoisting and CSE.
  • Data structures will be passed as arguments, possibly as part of an ABI. Optimisations such as specialisation, and not conforming to the ABI when it is not exposed, can be applied.
  • 5.7 Multigranularity Operation
  • In some cases, one would want to view a complex datastructure at multiple granularities. For example, given a buffer of complex values, one might wish to reason about dataflow affecting all real values in the buffer, dataflow affecting all imaginary values or dataflow involving the whole buffer. (More complex examples exist)
  • 5.8 Debugging
  • When debugging, it is possible for the data structure to live in a number of different places throughout the program. We can provide a single debug view of all copies, and watch a value wherever it is throughout the lifetime of a program, optionally omitting certain accesses such as DMA copies.
  • The same is possible for tracing data structures within the system.
  • 5.9 Zero Copying
  • Using this coarse-grained view, one can achieve zero copy optimization of a sequence of code like this:
      • int x[100];
      • generate(&x); // writes to x
      • put(channel,&x);
        by inlining the definition of put to get:
      • int x[100];
      • generate(&x); // writes to x
      • int *px=put_begin(channel);
      • copy(px,&x);
      • put_end(channel,px);
        then reordering the code a little:
      • int *px=put_begin(channel);
      • int x[100];
      • generate(&x); // writes to x
      • copy(px,&x);
      • put_end(channel,px);
        and optimizing the memory allocation and copy:
      • int *px=put_begin(channel);
      • generate(px); // writes to *px
      • put_end(channel,px);
    5.10 Trace
  • Most of this section is about coarse-grained data structures, but some benefits from identifying coarse-grained operations come when we are generating trace. Instead of tracing every scalar operation used inside a coarse-grained operation, we can just trace the start and stop of the operation. This can also be used for cross-triggering the start/stop of recording other information through trace.
  • Likewise, instead of tracing the input to/output from the whole sequence of scalar operations, we can trace just the values at the start/end of the operation.
  • 5.11 Validating Programmer Assertions
  • If we rely on programmer assertions, documentation, etc. in performing our dataflow analysis, it is possible that an error in the assertions will lead to an error in the analysis or transformations performed. To guard against these we can often use hardware or software check mechanisms. For example, if we believe that a variable should be read but not written by a given function, then we can perform a compile-time analysis to verify this ahead of time, or we can program an MMU or MPU to watch for writes to that range of addresses, or we can insert instrumentation to check for such errors. We can also perform a 'lint' check which looks for things which may be wrong even if it cannot prove that they are wrong. Indeed, one kind of warning is that the program is too complex for automatic analysis to prove that it is correct.
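  • As a small example of the instrumentation option (checksum() and the asserted function are hypothetical), a 'read but not written' assertion can be checked at runtime like this:
  • #include <assert.h>
    extern unsigned checksum(const void *p, unsigned n); /* hypothetical helper */

    unsigned before = checksum(x, sizeof x);
    fir(x, y);                               /* asserted to read x but never write it */
    assert(before == checksum(x, sizeof x)); /* traps if the assertion lied */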
  • 5.12 Summary
  • Some of the features of our approach are:
      • Using a register-like (aka scalar-like) approach to data structure semantics within the system.
      • Using liveness analysis to influence memory allocation, parallelism and scheduling decisions.
      • Applying register optimisations found in compilers to data structures within a program.
      • Providing debugging and tracing of variables as a single view
    6. Decoupling 6.1 Overview
  • Given a program that uses some accelerators, it is possible to make it run faster by executing different parts in parallel with one another. Many methods for parallelizing programs exist, but many of them require homogeneous hardware to work and/or require very low cost, low latency communication mechanisms to obtain any benefit. Our compiler uses programmer annotations (many/all of which can be inserted automatically) to split the code that invokes the accelerators ('the control code') into a number of parallel "threads" which communicate infrequently. Parallelizing the control code is advantageous because it allows tasks on independent accelerators to run concurrently.
  • Our compiler supports a variety of code generation strategies which allow the parallelized control code to run on a control processor in a real time operating system, in interrupt handlers or in a polling loop (using 'wait for event' if available to reduce power). It also supports distributed scheduling, where some control code runs on one or more control processors, some control code runs on programmable accelerators, and some simple parts of the code are implemented using conventional task-chaining hardware mechanisms. It is also possible to design special 'scheduler devices' which could execute some parts of the control code. The advantage of not running all the control code on the control processor is that it can greatly decrease the load on the control processor.
  • Other parallelising methods may be used in conjunction with the other aspects of this compiler.
  • Some of the features of our approach are:
      • By applying decoupled software pipelining to the task of parallelizing the control code in a system that uses heterogeneous accelerators, we significantly extend the reach of decoupled software pipelining and, by working on coarser-grained units of parallelism, we avoid the need to add hardware to support high frequency streaming.
      • By parallelizing at a significantly coarser granularity, we avoid the need for low latency, high throughput communication mechanisms used in prior art.
      • Parallelizing at a significantly coarser granularity also allows us to duplicate more control code between threads which reduces and simplifies inter-thread communication which allows us to generate distributed schedules. That is, we can distribute the control code across multiple processors both by putting each control thread on a different processor and by putting different parts of a single control thread onto different processors.
      • By optionally allowing the programmer more control over the communication between threads, we are able to overcome the restriction of decoupled software pipelining to acyclic ‘pipelines’.
      • The wide range of backends including distributed scheduling and use of hardware support for scheduling.
      • Our decoupling algorithm is applied at the source code level whereas existing decoupling algorithms are applied at the assembly code level after instruction scheduling.
  • Some recent discussions of decoupled software pipelining are:
      • Decoupled Software Pipelining: http://liberty.princeton.edu/Research/DSWP/ http://liberty.princeton.edu/Publications/index.php?abs=1&setselect=pact04_dswp http://liberty.cs.princeton.edu/Publications/index.php?abs=1&setselect=micro38_dswp
      • Automatically partitioning packet processing applications for pipelined architectures, PLDI 2005, ACM http://portal.acm.org/citation.cfm?id=1065010.1065039
    6.2 A Basic Decoupling Algorithm
  • The basic decoupling algorithm splits a block of code into a number of threads that pass data between each other via FIFO channels. The algorithm requires us to identify (by programmer annotation or by some other analysis including static analysis, profile driven feedback, etc.) the following parts of the program:
      • The “decoupling scope”: that is, a contiguous sequence of code that we wish to split into multiple threads. Some ways this can be done are by marking a compound statement or by inserting a ‘barrier’ annotation that indicates that some parallelism ends/starts here.
      • The “replicatable objects”: that is, variables and operations which it is acceptable to replicate. A simple rule of thumb is that scalar variables (i.e., not arrays) which are not used outside the scope, scalar operations which only depend on and only modify replicatable variables, and control flow operations should be replicated, but more sophisticated policies are possible.
      • Ordering dependencies between different operations: if two function calls both modify a non-replicated variable, the order of those two function calls is preserved in the decoupled code. (Extensions to the basic algorithm allow this requirement to be relaxed in various ways.)
      • The “data boundaries” between threads: that is, the non-replicatable variables which will become FIFO channels. (The “copies” data annotation described above determines the number of entries in the FIFO.)
  • (Identifying replicatable objects and data boundaries are two of the features of our decoupling algorithm.)
  • If we use annotations on the program to identify these program parts, a simple program might look like this:
  • void main( ) {
       int i;
      for(i=0; i<10; ++i) {
    int x[100] @ {copies=2, replicatable=false, boundary=true};
      produce(x) @ {replicatable=false, writes_to=[x]};
        DECOUPLE(x);
      consume(x) @ {replicatable=false, reads_from=[x]};
      }
    }
  • This degree of annotation is fine for examples but would be excessive in practice so most real embodiments would rely on tools to add the annotations automatically based on heuristics and/or analyses.
  • At a high level, the algorithm splits the operations in the scope into a number of threads whose execution will produce the same result as the original program under any scheduling policy that respects the FIFO access ordering of the channels used to communicate between threads.
  • The particular decoupling algorithm we use generates a maximal set of threads such that the following properties hold:
      • All threads have the same control flow structure and may have copies of the replicatable variables and operations.
      • Each non-replicatable operation is included in only one of the threads.
      • Each non-replicatable variable must satisfy one of the following:
        • The only accesses to the variable in the original program are reads; or
        • All reads and writes to the variable are in a single thread; or
        • The variable was marked as a data boundary and all reads are in one thread and all writes are in another thread.
      • If two operations have an ordering dependency between them which is not due to a read after write (RAW) dependency on a variable which has been marked as a data boundary, then the operations must be in the same thread.
  • For the example program above, the maximal set of threads is:
  • void main( ) {
       int x[100] @ {copies=2};
       channel c @ {buffers=x};
       parallel sections{
       section{
           int i;
           for(i=0; i<10; ++i) {
              int x1[100];
           produce(x1);
           put(c,x1);
           }
      }
       section{
           int i;
           for(i=0; i<10; ++i) {
              int x2[100];
           get(c,x2);
           consume(x2);
           }
      }
  }
}
  • One way to generate this set of threads is as follows:
      • 1. For each non-replicatable operation, create a ‘protothread’ consisting of just that operation plus a copy of all the replicatable operations and variables. Each replicatable variable must be initialized at the start of each thread with the value of the original variable before entering the scope and one of the copies of each replicatable variable should be copied back into the master copy on leaving the scope. (Executing all these protothreads is highly unlikely to give the same answer as the original program, because it lacks the necessary synchronization between threads. This is fixed by the next steps.)
      • 2. Repeatedly pick two threads and merge them into a single thread if any of the following problems exist:
        • a. One thread writes a non-replicatable variable which is accessed (read or written) by the other thread and the variable is not marked as a data boundary.
        • b. Two threads both write to a variable which is marked as a data boundary.
        • c. Two threads both read from a variable that is marked as a data boundary.
        • d. There is an ordering dependency between an operation in one thread and an operation in the other thread which is not a RAW dependency on a variable marked as a data boundary.
      • 3. When no more threads can be merged, quit.
  • Another way is to pick an operation and identify all the operations which must be in the same thread as that operation by repeatedly adding operations which would be merged with it (in step 2 above). Then pick the next operation not yet assigned to a thread and add all operations which must be in the same thread as that operation. Repeat until there are no more non-replicatable operations. (There are lots of other ways of tackling this problem: basically, we are forming equivalence classes based on a partial order and there are many known ways to do that.)
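  • For instance, a union-find structure gives one standard embodiment of the merging step (a sketch: merge_required() stands for the tests a-d of step 2, and MAX_OPS is a hypothetical bound on the number of operations):
  • #define MAX_OPS 256
    extern int merge_required(int i, int j); /* encodes tests a-d of step 2 */

    static int parent[MAX_OPS]; /* find(i)==find(j) iff i and j share a thread */

    static int find(int i)
    {
        while (parent[i] != i)
            i = parent[i] = parent[parent[i]]; /* path halving */
        return i;
    }

    static void unite(int i, int j) { parent[find(i)] = find(j); }

    void form_threads(int n_ops)
    {
        for (int i = 0; i < n_ops; ++i) parent[i] = i;
        for (int i = 0; i < n_ops; ++i)
            for (int j = i + 1; j < n_ops; ++j)
                if (merge_required(i, j))
                    unite(i, j);
        /* each remaining root of the forest now identifies one thread */
    }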
  • Note that when doing dataflow analysis on arrays, one must distinguish defs which are also kills (i.e., where the entire value of a variable is overwritten by an operation); this requires a more advanced analysis than is normally used.
  • 6.3 Decoupling Extensions
  • There are a number of extensions to this model
  • 6.3.1 Range Splitting Preprocessing
  • It is conventional to use dataflow analysis to determine the live ranges of a scalar variable and then replace the variable with multiple copies of the variable: one for each live range. We use the same analysis techniques to determine the live range of arrays and split their live ranges in the same way. This has the benefit of increasing the precision of later analyses which can enable more threads to be generated. On some compilers it also has the undesirable effect of increasing memory usage which can be mitigated by later merging these copies if they end up in the same thread and by being selective about splitting live ranges where the additional decoupling has little overall effect on performance.
  • 6.3.2 Zero Copy Optimizations
  • The put and get operations used when decoupling can be used both for scalar and non-scalar values (i.e., both for individual values (scalars) and for arrays of values (non-scalars)), but they are inefficient for large non-scalar values because they require a memory copy. Therefore, for coarse-grained decoupling, it is desirable to use an optimized mechanism to pass data between threads.
  • In operating systems, it is conventional to use “zero copy” interfaces for bulk data transfer: instead of generating data into one buffer and then copying the data to the final destination, we first determine the final destination and generate the data directly into the final destination. Applying this idea to the channel interface, we can replace the simple ‘put’ operation with two functions: put_begin obtains the address of the next free buffer in the channel and put_end makes this buffer available to readers of the channel:
      • void* put_begin(channel *ch);
      • void put_end(channel *ch, void* buf);
  • Similarly, the get operation is split into a get_begin and get_end pair:
      • void* get_begin(channel *ch);
      • void get_end(channel *ch, void* buf);
  • Using these operations, we can often rewrite sequences of code such as:
      • int x[100];
      • generate(x);
      • put(ch,x);
        to this more efficient sequence:
      • int *px = put_begin(ch);
      • generate(px);
      • put_end(ch,px);
        And similarly, for get:
      • int x[100];
      • get(ch,x);
      • consume(x);
        to this more efficient sequence:
      • int *px = get_begin(ch);
      • consume(px);
      • get_end(ch,px);
  • Note that doing zero copy correctly requires us to take the lifetimes of variables into account.
  • We can do that using queues with multiple readers, queues with intermediate r/w points, reference counts, or by restricting the decoupling (all readers must be in the same thread and ...) to make lifetimes trivial to track. This can be done by generating custom queue structures to match the code, or custom queues can be built out of a small set of primitives.
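  • A minimal single-writer, single-reader embodiment of such a channel (a sketch: N_ENTRIES corresponds to the 'copies' annotation, and a real implementation would add the barriers or cache maintenance of section 3 on non-coherent systems) might be:
  • #define N_ENTRIES 2 /* from the 'copies' annotation */

    typedef struct {
        void *slot[N_ENTRIES];        /* pre-allocated buffers */
        volatile unsigned head, tail; /* advanced only by reader / writer */
    } channel;

    void *put_begin(channel *ch)
    {
        while (ch->tail - ch->head == N_ENTRIES)
            ; /* FIFO full: spin, block, or wait-for-event */
        return ch->slot[ch->tail % N_ENTRIES];
    }
    void put_end(channel *ch, void *buf) { (void)buf; ch->tail++; }

    void *get_begin(channel *ch)
    {
        while (ch->tail == ch->head)
            ; /* FIFO empty */
        return ch->slot[ch->head % N_ENTRIES];
    }
    void get_end(channel *ch, void *buf) { (void)buf; ch->head++; }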
  • 6.3.3 Dead Code and Data Elimination
  • This section illustrates both how to get better results and also that the resulting threads may not have exactly the same control structure as the original, but one that is very similar.
  • 6.3.4 Allowing Cyclic Thread Dependencies
  • Prior art on decoupling restricts the use of decoupling to cases where the communication between the different threads is acyclic. There are two reasons why prior art has done this:
      • 1. Cyclic thread dependencies can lead to deadlock: that is, two threads may not run in parallel because of data dependencies between them.
      • 2. A particularly common case of cyclic thread dependencies is code such as
  • y = 1;
    while(1) {
     x = f(y);
     y = g(x);
    }
      • Under existing decoupling schemes, puts are always inserted after assignment to any data boundary variable. This would require both a put outside the loop and a put at the end of the loop:
  • y1 = 1;
    put(ch,y1);
    while(1) {
     y2 = get(ch);
     x = f(y2);
     y3 = g(x);
     put(ch,y3);
    }
      • Existing decoupling schemes only generate matched pairs of puts and gets (i.e., there is only one put on each channel and only one get on each channel), so they cannot generate such code. An alternative way of decoupling this code is to generate:
  • y1 = 1;
    while(1) {
     put(ch,y1);
     y2 = get(ch);
     x = f(y2);
     y1 = g(x);
    }
      • This does have matched pairs of puts and gets but breaks the rule of always performing a put after any assignment to a variable so it is also not generated by existing decoupling techniques.
    6.3.5 Exposing Channels to the Programmer
  • It is possible to modify the decoupling algorithm to allow the programmer to insert puts and gets (or put_begin/end, get_begin/end pairs) themselves. The modified decoupling algorithm treats the puts and gets in much the same way that the standard algorithm treats data boundaries. Specifically, it constructs the maximal set of threads such that:
      • Almost all the same conditions as for the standard algorithm apply; in addition:
      • All puts to a channel are in the same thread
      • All gets from a channel are in the same thread
  • For example, given this program:
  • channel ch1;
    put(ch1,0);
for(int i=0; i<N; ++i) {
     int x = f( );
     put(ch1,x);
     int y = g(get(ch1));
     DECOUPLE(y);
     h(x,y);
    }
  • The modified decoupling algorithm will produce:
  •   channel ch1, ch2;
      put(ch1,0);
      parallel sections{
      section{
      for(int i=0; i<N; ++i) {
          int x = f( );
             put(ch1,x);
             int y = get(ch2);
             h(x,y);
         }
    }
      section{
      for(int i=0; i<N; ++i) {
         int y = g(get(ch1));
             put(ch2,y);
         }
  }
}
  • This extension of decoupling is useful for creating additional parallelism because it allows f and g to be called in parallel.
  • Writing code using explicit puts can also be performed as a preprocessing step. For example, we could transform:
  • for(i=0; i<N; ++i) {
     x = f(i);
      y = g(i,x);
      h(i,x,y);
    }
  • To the following equivalent code:
  • x = f(0);
for(i=0; i<N; ++i) {
       y = g(i,x);
       h(i,x,y);
       if (i+1<N) x = f(i+1);
    }
  • Which, when decoupled, gives very similar code to the above.
  • (There are numerous variations on this transformation including computing f(i+1) unconditionally, peeling the last iteration of the loop, etc.)
  • 6.3.6 Alternatives to FIFO Channels
  • A First-In First-Out (FIFO) channel preserves the order of values that pass through it: the first value inserted is the first value extracted, the second value inserted is the second value extracted, etc. Other kinds of channel are possible including:
      • a “stack” which has Last In First Out (LIFO) semantics. Amongst other advantages, stacks can be simpler to implement.
      • a priority queue where entries are prioritized by the writer or according to some property of the entry, and the reader always receives the highest-priority entry in the queue.
      • a merging queue where a new value is not inserted if it matches the value at the back of the queue or, as a variant, if it matches any value in the queue. Omitting duplicate values may help reduce duplicated work.
      • a channel which only tracks the last value written to the queue. That is, the queue logically contains only the most recently written entry. This is useful if the value being passed is time-dependent (e.g., current temperature) and it is desirable to always use the most recent value. Note that with fine-grained decoupling the amount of time between generation of the value and its consumption is usually small so being up to date is not a problem; whereas in coarse-grained decoupling, a lot of time may pass between generation and consumption and the data could easily be out of date if passed using a FIFO structure.
      • A channel which communicates with a hardware device. For example, a DMA device may communicate with a CPU using a memory mapped doubly-linked list of queue entries which identify buffers to be copied or a temperature sensor may communicate with a CPU using a device register which contains the current temperature.
  • Using most of these alternative channels has an effect on program meaning, so we either have to perform an analysis before using a different kind of channel or the programmer can indicate that a different choice is appropriate/allowed.
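  • For instance, the 'last value written' channel reduces to little more than a shared cell (a sketch assuming single-word values, so that reads and writes are atomic on the target):
  • /* Logically a queue that only ever holds the most recent entry. */
    typedef struct { volatile int value; } last_value_channel;

    void lv_put(last_value_channel *ch, int v) { ch->value = v; }
    int  lv_get(const last_value_channel *ch)  { return ch->value; }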
  • 6.3.7 Using Locks
  • In parallel programming, it is often necessary for one thread to have exclusive access to some resource while it is using that resource, to avoid a class of timing-dependent behaviour known as a “race condition” or just a “race”. The regions of exclusive access are known as “critical sections” and are often clearly marked in a program. Exclusive access can be arranged in several ways. For example, one may ‘acquire’ (aka ‘lock’) a ‘lock’ (aka ‘mutex’) before starting to access the resource and ‘release’ (aka ‘unlock’) the lock after using the resource. Exclusive access may also be arranged by disabling pre-emption (such as interrupts) while in a critical section (i.e., a section in which exclusive access is required). In some circumstances, one might also use a ‘lock free’ mechanism where multiple users may use a resource but at some point during use (in particular, at the end), they will detect the conflict, clean up and retry.
  • Some examples of wanting exclusive access include having exclusive access to a hardware accelerator, exclusive access to a block of memory or exclusive access to an input/output device. Note that in these cases, it is usually not necessary to preserve the order of accesses to the resource.
  • The basic decoupling algorithm avoids introducing race conditions by preserving all ordering dependencies on statements that access non-replicated resources. Where locks have been inserted into the program, the basic decoupling algorithm is modified as follows:
      • The ordering dependencies on operations which use shared resources can be relaxed. This requires programmer annotation and/or program analysis which, for each operation which may be reordered, identifies:
        • Which other operations it can be reordered relative to
        • Which operations can simultaneously access the same resource (i.e., without requiring exclusive access)
        • Which critical section each operation occurs in.
      • For example, one might identify a hardware device as a resource, then indicate which operations read from the resource (and so can be executed in parallel with each other) and which operations modify the resource (and so must have exclusive access to the resource).
      • For simplicity, one might identify all operations inside a critical section as having an ordering dependency between them though one can sometimes relax this if the entire critical section lies inside the scope of decoupling.
      • One might determine which critical section each operation occurs in using an analysis which conservatively approximates the set of locks held at all points in the program.
    6.3.8 Multithreaded Input
  • Decoupling can be applied to any sequential section of a parallel program. If the section communicates with the rest of the parallel program, we must determine any ordering dependencies that apply to operations within the section (a safe default is that the order of such operations should be preserved). Indeed, one of the nice properties of decoupling is that it interacts well with other forms of parallelization, including manual parallelization.
  • 6.4 Decoupling Backends
  • The decoupling algorithm generates sections of code that are suitable for execution on separate processors but can be executed on a variety of different execution engines by modifying the “back end” of the compiler. That is, by applying a further transformation to the code after decoupling to better match the hardware or the context we wish it to run in.
  • 6.4.1 Multiprocessor and Multithreaded Processor Backends
  • The most straightforward execution model is to execute each separate section in the decoupled program on a separate processor or, on a processor that supports multiple hardware contexts (i.e., threads), to execute each separate section on a separate thread.
  • Since most programs have at least one sequential section before the separate sections start (e.g., there may be a sequential section to allocate and initialize channels), execution will typically start on one processor which will later synchronize with the other processors/threads to start parallel sections on them.
  • 6.4.2 Using Accelerators
  • In the context of an embedded system and, especially, a System on Chip (SoC), some of the data processing may be performed by separate processors such as general purpose processors, digital signal processors (DSPs), graphics processing units (GPUs), direct memory access (DMA) units, data engines, programmable accelerators or fixed-function accelerators. This data processing can be modelled as a synchronous remote procedure call. For example, a memory copy operation on a DMA engine can be modelled as a function call to perform a memory copy. When such an operation executes, the thread will typically:
      • acquire a lock to ensure it has exclusive access to the DMA engine
      • configure the DMA engine with the source and destination addresses and the data size
      • start the DMA engine to initiate the copy
      • wait for the DMA engine to complete the copy which will be detected either by an interrupt to a control processor or by polling
      • copy out any result from the copy (such as a status value)
      • release the lock on the accelerator
  • This mode of execution can be especially effective because one ‘control processor’ can keep a number of accelerators busy, with the control processor possibly doing little more than deciding which accelerator to start next and on what data. This mode of execution can be usefully combined with all of the following forms of execution.
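  • A sketch of such a synchronous RPC stub for the memory-copy example above (the lock primitives, register layout, flag values and device address are all hypothetical):
  • struct dma_regs {
        volatile unsigned long src_addr, dst_addr;
        volatile unsigned size, ctrl, status;
    };
    #define DMA_START 1u
    #define DMA_DONE  1u
    static struct dma_regs *const dma = (struct dma_regs *)0x40001000;
    extern void lock_acquire(int *lock), lock_release(int *lock);
    static int dma_lock;

    void copy(void *dst, const void *src, unsigned len)
    {
        lock_acquire(&dma_lock);          /* exclusive access to the engine */
        dma->src_addr = (unsigned long)src;
        dma->dst_addr = (unsigned long)dst;
        dma->size     = len;
        dma->ctrl     = DMA_START;        /* start the copy */
        while (!(dma->status & DMA_DONE))
            ;                             /* poll; an interrupt could be used */
        unsigned status = dma->status;    /* copy out any result */
        lock_release(&dma_lock);
        (void)status;
    }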
  • 6.4.3 RTOS Backend
  • Instead of a multiprocessor or multithreaded processor, one can use a thread library, operating system (OS) or real time operating system (RTOS) running on one or more processors to execute the threads introduced by decoupling. This is especially effective when combined with the use of accelerators because running an RTOS does not provide parallelism and hence does not increase performance but using accelerators does provide parallelism and can therefore increase performance.
  • 6.4.4 Transforming to Event-Based Execution
  • Instead of executing threads directly using a thread library, OS or RTOS, one can transform threads into an ‘event-based’ form which can execute more efficiently than threads. The methods can be briefly summarized as follows:
      • Transformations to data representation.
      • The usual representation of threads allocates thread-local variables on a stack and requires one stack per thread. The overhead of managing this stack and some of the space overhead of stacks can be reduced by using a different allocation policy for thread-local variables based on how many copies of the variable can be live at once and on the lifetime of the variables.
      • If only one copy of each variable can be live at once (e.g., if the functions are not required to be re-entrant), then all variables can be allocated statically (i.e., not on a stack or heap).
      • If multiple copies of a variable can be live at once (e.g., if more than one instance of a thread can be live at once), the variables can be allocated on the heap.
      • Transformations to context-switch mechanism
      • When one processor executes more threads than the processor supports, the processor must sometimes switch from executing one thread to executing another; this is known as a ‘context switch’. The usual context-switch mechanism used by threads is to save the values of all registers on the stack or in a reserved area of memory called the “thread control block”, then load all the registers with values from a different thread control block and restart that thread. The advantage of this approach is that a context switch can be performed at almost any point during execution, so any code can be made multithreaded just by using a suitable thread library, OS or RTOS.
      • An alternative mechanism for context switching is to transform each thread to contain explicit context switch points where the thread saves its current context in a thread control block and returns to the scheduler, which selects a new thread to run and starts it. The advantage of this approach is that thread control blocks can be made significantly smaller. If all context switches occur in the top-level function and all thread-local variables can be statically allocated, it is possible to eliminate the stack completely, so that the entire context of a thread reduces to just the program counter value; this makes context switches very cheap and thread control blocks extremely small.
      • A further advantage of performing context switches only at explicit context switch points is that it is easier and faster to ensure that a resource shared between multiple threads is accessed exclusively by at most one thread at a time because, in many cases, it is possible to arrange that pre-emption only happens when the shared resource is not being used by the current thread.
  • Together, these transformations can be viewed as a way of transforming a thread into a state machine with each context switch point representing a state and the code that continues execution from each context switch point viewed as a transition function to determine the next state. Execution of transformed threads can be viewed as having been transformed to an event-based model where all execution occurs in response to external events such as responses from input/output devices or from accelerators. It is not necessary to transform all threads: event-based execution can coexist with threaded execution.
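  • To make the transformation concrete, here is a minimal C sketch (the non-blocking channel API, in_chan, out_chan and f are illustrative assumptions, not part of the method as claimed). The thread-local variables x and y are statically allocated in the task structure and all context switches occur at top level, so the entire context reduces to one state value.

    /* Assumed non-blocking channel API (illustrative only). */
    typedef struct channel channel;
    extern channel in_chan, out_chan;
    int try_get(channel *c, int *v);   /* returns 1 on success, 0 if empty */
    int try_put(channel *c, int v);    /* returns 1 on success, 0 if full  */
    int f(int x);                      /* the data processing kernel       */

    /* Original thread:  for (;;) { x = get(in); y = f(x); put(out, y); }
     * with context switches possible at get() and put().                  */
    typedef struct {
        int state;                     /* the whole thread control block   */
        int x, y;                      /* values live across switch points */
    } task;

    /* One state transition; returns 0 when the task blocks, i.e., at an
     * explicit context switch point, so the scheduler can run another.    */
    int task_step(task *t)
    {
        switch (t->state) {
        case 0: if (!try_get(&in_chan, &t->x)) return 0;
                t->state = 1;          /* fall through to the next state   */
        case 1: t->y = f(t->x);
                t->state = 2;          /* fall through                     */
        case 2: if (!try_put(&out_chan, t->y)) return 0;
                t->state = 0;          /* loop back for the next item      */
                return 1;
        }
        return 0;
    }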
  • 6.4.5 Interrupt-Driven Execution
  • Transforming threads as described above to allow event-based execution is a good match for applications that use accelerators that signal task completion via interrupts. On receiving an interrupt signalling task completion, the following steps occur (a code sketch follows the list):
      • the state of the associated accelerator is updated
      • all threads that could be blocked waiting for that task to complete or for that accelerator to become available are executed. This may lead to further threads becoming unblocked.
      • when there are no runnable threads left, the interrupt handler completes
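  • The handler structure might look like the following sketch (the task table, accel_mark_idle and task_step are assumed names; task_step runs a thread that has been transformed to event-based form until it blocks again):

    enum { MAX_TASKS = 16 };

    typedef struct task task;
    extern task *tasks[MAX_TASKS];
    extern int num_tasks;
    void accel_mark_idle(int id);      /* assumed accelerator state update */
    int task_step(task *t);            /* returns 1 if progress was made   */

    /* Invoked on a 'task complete' interrupt from accelerator 'id'. */
    void accel_irq_handler(int id)
    {
        accel_mark_idle(id);           /* update the accelerator's state   */

        int progress = 1;              /* run every thread unblocked by    */
        while (progress) {             /* the completion; running them may */
            progress = 0;              /* unblock further threads, so      */
            for (int i = 0; i < num_tasks; i++)   /* iterate to a fixpoint */
                if (task_step(tasks[i]))
                    progress = 1;
        }
    }                                  /* no runnable threads remain: the
                                          interrupt handler completes      */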
  • 6.4.6 Polling-Based Execution
  • Transforming threads as described above is also a good match for polling-based execution where the control processor tests for completion of tasks on a set of accelerators by reading a status register associated with each accelerator. This is essentially the same as interrupt-driven execution except that the state of the accelerators is updated by polling and the polling loop executes until all threads complete execution.
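  • Under the same assumptions, polling-based execution replaces the interrupt with a loop over the accelerators' status registers, for example (accel_done, tasks_remaining and the handler sketched above are all assumed names):

    enum { NUM_ACCEL = 4 };
    int accel_done(int id);            /* assumed: reads a status register */
    int tasks_remaining(void);         /* assumed: any thread still live?  */
    void accel_irq_handler(int id);    /* same logic as the interrupt case */

    /* Poll until every thread has run to completion. */
    void run_to_completion(void)
    {
        while (tasks_remaining())
            for (int id = 0; id < NUM_ACCEL; id++)
                if (accel_done(id))
                    accel_irq_handler(id);
    }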
  • 6.4.7 Distributed Scheduling
  • Distributed scheduling can be done in various ways. Some part of a program may be simple enough that it can be implemented using a simple state machine which schedules one invocation of an accelerator after completion of another accelerator. Alternatively, a control processor can hand over execution of a section within a thread to another processor. In both cases, this can be viewed as an RPC-like mechanism (“{foo( ); bar( )@P0;}@P1”). In the first case, one way to implement it is to first transform the thread to event-based form and then opportunistically spot that a sequence of system states can be mapped onto a simple state machine, and/or to perform transformations to make it map better.
  • 6.4.8 Non-Work-Conserving Schedulers and Priorities/Deadlines
  • This section makes two claims: 1) using a priority mechanism, and 2) using a non-work-conserving scheduler, both in the context of decoupling.
  • If a system has to meet a set of deadlines and the threads within the system share resources such as processors, it is common to use a priority mechanism to select which thread to run next. These priorities might be static or they may depend on dynamic properties such as the time until the next deadline or how full/empty input and output queues are.
  • In a multiprocessor system, using a priority mechanism can be problematic because, at the instant one task completes, the set of tasks available to run next may be too small to make a meaningful choice; better schedules result if one waits a small period of time before making the choice. Schedulers that wait in this way are known as non-work-conserving schedulers.
  • 7. Trace Reconstruction
  • 7.1 Overview
  • A long-standing problem of parallelizing compilers is that it is hard to relate the view of execution seen by debug mechanisms to the view of execution the programmer expects from the original sequential program. Our tools can take an execution trace obtained from running a program on parallel hardware and reorder it to obtain a sequential trace that matches the original program. This is especially applicable to, though not limited to, coarse-grained parallelization methods such as ours.
  • To achieve complete reconstruction, it helps if the parallelizing compiler inserts hints in the code that make it easier to match up corresponding parts of the program. In the absence of explicit hints, it may be possible to obtain full reconstruction using debug information to match parts of the program.
  • When there are no explicit hints or debug information, partial reconstruction can be achieved by using points in the program that synchronize with each other to guide the matching process. The resulting trace will not be sequential but will be easier to understand. A useful application is to make it simpler to understand a trace of a program written using an event-based programming style (e.g., a GUI, interrupt handlers, device drivers, etc.).
  • Partial reconstruction could also be used to simplify parallel programs running on systems that use release consistency. Such programs must use explicit memory barriers at all synchronization points so it will be possible to simplify traces to reduce the degree of parallelism the programmer must consider.
  • One simple case of this is reconstructing a ‘message passing’ view of bus traffic.
  • HP has been looking at using trace to enable performance debugging of distributed protocols. Their focus is on data mining and performance not reconstructing a sequential trace.
  • http://portal.acm.org/citation.cfm?id=945445.945454&dl=portal&dl=ACM&type=series&idx=945445&part=Proceedings&WantType=Proceedings&title=ACM%20Symposium%20on%20Operating%20Systems%20Principles&CFID=11111111&CFTOKEN=2222222
  • 7.2 Partial Reconstruction Based on Observed Dataflow
  • Suppose we can identify sections of the system execution, we have a trace that lets us identify when each section was running, and we have a trace of the memory accesses each section performed (or, from knowing properties of some of the sections, we know what memory accesses they would perform without needing a trace). The sections we can identify might be:
      • function calls
      • remote procedure calls
      • execution of a fixed-function accelerator such as a DMA transfer
      • message passing
  • We can summarize the memory accesses of each section in terms of the input data and the output data (what addresses were accessed and, perhaps, what values were read or written).
  • Given a sequence of traces of sections, we can construct a dynamic dataflow graph where each section is a node in a directed graph and there is an edge from a node M to a node N if the section corresponding to M writes to an address x and the section corresponding to N reads from address x and, in the original trace, no write to x happens between M's write to x and N's read from x.
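  • A minimal sketch of this construction (the trace record layout and the table sizes are assumptions): walk the trace in order, remember the last section to write each address, and add an edge whenever a later section reads that address.

    enum { MAX_ADDR = 4096, MAX_SECTIONS = 64 };

    typedef struct { int section; int addr; int is_write; } trace_rec;

    /* edge[m][n] != 0 means section n read a value last written by m. */
    static int edge[MAX_SECTIONS][MAX_SECTIONS];
    static int last_writer[MAX_ADDR];

    void build_dataflow(const trace_rec *t, int n)
    {
        for (int a = 0; a < MAX_ADDR; a++)
            last_writer[a] = -1;                /* no writer seen yet      */
        for (int i = 0; i < n; i++) {
            if (t[i].is_write) {
                last_writer[t[i].addr] = t[i].section;  /* M writes to x   */
            } else {
                int m = last_writer[t[i].addr];         /* N reads from x  */
                if (m >= 0 && m != t[i].section)
                    edge[m][t[i].section] = 1;  /* edge M -> N: no write
                                                   to x happened between   */
            }
        }
    }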
  • This directed dataflow graph shows how different sections communicate with each other and can be used for a variety of purposes:
      • identify potential parallelism
      • identify timing-sensitive behaviour such as race conditions (when combined with a trace of synchronizations between parallel threads): if M writes to x and N reads from x and there is no chain of synchronizations from M to N to ensure that N cannot read from x before M does the write, there is a potential problem
      • identify redundant memory writes (if a value is overwritten before it has been read)
      • provides a simple way to show programmers what is happening in a complex, possibly parallel, system
      • can be analyzed to determine the time between when data is generated and when it is consumed. If the time is long, it might suggest that memory requirements could be reduced by calculating the data nearer the time it is used or, in a parallel or concurrent system, that the generating task can be executed later.
      • can be analyzed to identify the number and identity of consumers of data: it is often possible to manage memory more efficiently or generate data more efficiently if we know what the data is being used for, when it is being used, etc.
  • Many other uses exist.
  • 7.3 Full Reconstruction Based on Parallelization Transformations
  • The first part below describes what is needed in the general case of a program that has been parallelized, where you would like to serialize a trace from a run of the parallel program based on some understanding of what transformations were performed during parallelization (i.e., you know how different parts of the program relate to the original program). The second part describes how to do this specifically when the parallelization process included decoupling. The sketch describes the simplest case in which it can work, but it is possible to relax these restrictions significantly.
  • Here is a brief description of what is required to do trace reconstruction for decoupled programs. That is, to be able to take a trace from the decoupled program and reorder it to obtain a legal trace of the original program.
  • Most relevant are conditions 1-9, which say what we need from trace. Where the conditions do not hold, there need to be mechanisms to achieve the same effect or a way of relaxing the goals so that they can still be met. For example, suppose we can only trace activity on the bus, and two kernels running on the same DE communicate by one leaving its result in DE-local memory and the other using it from there. Then we either add hardware to observe accesses to local memories, or we tweak the schedule to add a spurious DMA copy out of the local memory so that the communication appears on the bus, or we accept that this kind of activity will not be observed.
  • Condition 10 onwards relates mainly to what decoupling aims to achieve. Some conditions, such as conditions 5 and 6, are also relevant because, in practice, it is useful to be able to relax them slightly. For example, condition (5) says that kernels have exclusive access to buffers, but it is obviously acceptable to have multiple readers of the same buffer; it would also be acceptable (in most real programs) for two kernels to (atomically) invoke ‘malloc’ and ‘free’ in the middle of the kernels, even though the particular heap areas returned will depend on the precise interleaving of those calls; and it may even be acceptable for debugging printfs from each kernel to appear in any order.
  • Initial assumptions (to be relaxed later):
      • 1. Trace can see the start and end of each kernel execution and can identify which kernel is being started or is stopping.
      • 2. Trace can see context switches on each processor and can identify which context we are leaving and which context we are entering.
  • Consequences of (1)-(2): We can derive which kernel instance is running on any processor at any time.
      • 3. Trace has a coherent, consistent view of all activity on all processors.
      • 4. Trace can identify the source of all transactions it observes.
        • Two mechanisms that can make this possible are:
        • 1. Trace might observe directly which processor caused a transaction; or
        • 2. Trace might observe some property of the transaction such as the destination address and combine that with some property of the kernels running at that time.
        • Condition 2 can be satisfied if each kernel only accesses buffers that are either:
        • 1. At a static address (and of static length); or
        • 2. At an address (and of a length) that are handed to the kernel at the start of kernel execution and trace can infer what that address and length are.
  • Consequences of (1)-(4): We can identify each transaction with a kernel instance and we can see all transactions a kernel performs.
      • 5. Each kernel instance has exclusive access to each buffer during its execution. That is, all inter-kernel communication occurs at kernel boundaries.
      • 6. Each kernel's transactions only depend on the state of the buffers it accesses and the state of those buffers only depends on the initial state of the system and on transactions that kernels have performed since then.
  • Consequences of (1)-(6): Given a trace consisting of the interleaved transactions of a set of kernel instances, we can reorder the transactions such that all transactions of a kernel are contiguous and the resulting trace satisfies all read after write data dependencies. That is, we can construct a sequentially consistent view of the transactions as though kernels executed atomically and sequentially.
  • Note that there may be many legal traces. e.g., if A (only) writes to address 0 and then 1 and B (only) writes to address 2 and then 3 then the trace ‘0,2,1,3’ could be reordered to ‘0,1,2,3’ or to ‘2,3,0,1’.
      • 7. Sequencing of each kernel instance is triggered by a (single) state machine. There are a number of parallel state machines. (State machines may be in dedicated hardware or a number of state machines may be simulated on a processor.)
      • 8. State machines can synchronize with each other and can wait for completion of a kernel and state transitions can depend (only) on those synchronizations and on the results of kernels.
      • 9. Trace has a sequentially consistent, coherent view of all state transitions of the sequencers and all synchronization.
  • Consequences of (7)-(9): Given a trace of the state transitions and synchronizations, we can reorder them into any of the set of legal transitions those state machines could have made where a transition is legal if it respects synchronization dependencies. Consequences of (1)-(9): Given a trace of all kernel transactions and all state transitions and synchronizations, we can reorder them into any legal trace which respects the same synchronization dependencies and data dependencies.
  • The challenge of trace reconstruction is to show that, if you decouple a program, then the following holds. (Actually, this is what you want to show for almost any way you may parallelize a program.)
      • 10. We assume that we have a single ‘master’ deterministic state machine that corresponds to the set of parallel, deterministic state machines in the following way:
        • a. Any trace of the ‘master’ state machine is a legal trace of the parallel state machine.
        • b. Some traces of the parallel state machine can be reordered into a legal trace of the master state machine.
        • c. Those traces of the parallel state machine that cannot be reordered to give a legal trace of the master are a prefix of a trace that can be reordered to give a legal trace of the master.
        • That is, any run of the parallel machine can be run forward to a point equivalent to a run of the master state machine.
  • (We further assume that we know how to do this reordering and how to identify equivalent points.)
  • Consequences of (1)-(10): We can reorder any trace to match a sequential version of the same program.
  • To show that decoupling gives us property (10) (i.e., that any trace of the decoupled program can be reordered to give a trace of the original program and to show how to do that reordering), we need to establish a relationship between the parallel state machine and the master state machine (i.e., the original program). This relationship is an “embedding” (i.e., a mapping between states in the parallel and the master machines such that the transitions map to each other in the obvious way). It is probably easiest to prove this by considering what happens when we decouple a single state machine (i.e., a program) into two parallel state machines.
  • When we decouple, we take a connected set of states in the original and create a new state machine containing copies of those states but:
      • 1. The two machines synchronize with each other on all transitions into and out of that set of states.
      • 2. The two machines contain a partition of the kernel activations of the original machine.
      • 3. The two machines each contain a subset (which may overlap) of the transitions of the original machine.
  • From this, it follows that the parallel machine can execute the same sequence as the original machine. To show that it must execute an equivalent sequence (i.e., that we can always reorder the trace), we need the following properties of decoupling:
      • 4. All data dependencies are respected: if kernel B reads data written by kernel A, then both are executed in sequence on the same state machine or the state machines will synchronize after A completes and before B starts.
        • Note that this depends on the fact that channels are FIFO queues so data is delivered in order.
  • Extensions of decoupling allow the programmer to indicate that two operations can be executed in either order even though there is a data dependency between them (e.g., both increment a variable atomically). This mostly needs us to relax the definition of what trace reconstruction is meant to do. One major requirement is that the choice of order doesn't have any knock-on effects on control flow.
      • 5. Deadlock should not happen:
        • threads cannot block indefinitely on a put as long as each queue has space for at least one value.
        • threads cannot block indefinitely on a get: either one thread is still making progress towards a put or, if they both hit a get, at least one will succeed.
  • Outline proof: Because they share the same control flow, the two threads perform opposing actions (i.e., a put/get pair) on channels in the same sequence as each other. A thread can only block on a get or a put if it has run ahead of the other thread. Therefore, when one thread is blocked, the other is always runnable.
  • Extensions of decoupling allow for the following:
      • 1. Locks are added by the programmer.
      • To avoid deadlock, we require:
        • The standard condition that locks must always be obtained in a consistent order.
        • If the leading thread blocks on a channel while holding a lock, then the trailing thread cannot block on the same lock.
  • A sufficient (and almost necessary) condition is that a put and a get on the same channel must not be inside corresponding critical sections (in different threads):
  • // Not allowed
    parallel_sections{
     section{ ... lock(1); ... put(ch,x); ... unlock(1); ...}
     section{ ... lock(1); ... get(ch,x); ... unlock(1); ...}
    }
  • which means that the original code cannot have looked like this:
      • ... lock(1); ... DECOUPLE(ch,x); ... unlock(1); ...
  • That is, extreme care must be taken if DECOUPLE occurs inside a critical section, especially when inserting DECOUPLE annotations automatically.
      • 2. Puts and gets don't have to occur in pairs in the program.
  • A useful and safe special case is one where all initialization code does N puts, a loop then contains only put/get pairs, and finalization code does at most N gets. It should be possible to prove that this special case is safe.
  • It might also be possible to prove the following for programs containing arbitrary puts and gets: if the original single-threaded program does not deadlock (i.e., never does a get on an empty channel or a put on a full channel), then neither will the decoupled program.
  • 8. Exploiting Schedule Flexibility
  • 8.1 Overview
  • A long-standing problem of parallelizing compilers is that it is virtually impossible to provide the programmer with a start-stop debugger that lets them debug in terms of their sequential program even though it runs in parallel. In particular, we would like to be able to run the program quickly (on the parallel hardware) for a few minutes and then switch to a sequential view when we want to debug.
  • It is hard (and not necessary) to switch seamlessly from running parallel code to running sequential code, but it is feasible to change the scheduling rules to force the program to run only one task at a time. With compiler help, it is possible to execute in almost the sequence that the original program would have executed. With less compiler help, or where the original program was parallel, it is possible to present a simpler schedule than that of the original program. This method can be applied to interrupt-driven programs too.
  • This same method of tweaking the scheduler while leaving the application unchanged can be used to test programs more thoroughly. Some useful examples:
      • Testing the robustness of a real time system by modifying the runtime of tasks.
      • Making a task longer may cause a deadline to be missed. Making a task longer may detect scheduling anomalies where the system runs faster if one part becomes slower. (Scheduling anomalies usually indicate that priorities have been set incorrectly.) Making tasks take randomly longer times establishes how stable a schedule is.
      • Providing better test coverage in parallel systems. Race conditions and deadlock often have a small window of opportunity which it is hard to detect in testing because the ‘windows’ of several threads have to be aligned for the problem to manifest. By delaying threads by different amounts, we can cause different parts of each thread to overlap so that we can test a variety of alignments. (We can also measure which alignments we have tested so far for test-coverage statistics and for guided search.) This is especially useful for interrupt-driven code.
  • John Regehr did some work on avoiding interrupt overload by delaying and combining interrupts.
  • http://portal.acm.org/citation.cfm?id=945445.945454&dl=portal&dl=ACM&type=series&idx=945445&part=Proceedings&WantType=Proceedings&title=ACM%20Symposium%20on%20Operating%20Systems%20Principles&CFID=11111111&CFTOKEN=2222222
  • but this is really about modifying the (hardware) scheduling of interrupts to have more desirable properties for building real time systems whereas we are more interested in:
      • debugging, tracing and testing systems (and some of these techniques might actually break the real-time properties of the system)
      • thread schedulers (but we still want to do some interrupt tweaking)
  • 8.2 Testing Concurrent Systems
  • Errors in concurrent systems often stem from timing-dependent behaviour. It is hard to find and to reproduce errors because they depend on two independently executing sections of code executing at the same time (on a single-processor system, this means that one section is pre-empted and the other section runs). The problematic sections are often not identified in the code.
  • Concurrent systems often have a lot of flexibility about when a particular piece of code should run: a task may have a deadline, or it may require that it receive 2 seconds of CPU in every 10-second interval, but tasks rarely require that they receive a particular pattern of scheduling.
  • The idea is to use the flexibility that the system provides to explore different sequences from those that a traditional scheduler would provide. In particular, we can use the same scheduler but modify task properties (such as deadlines or priorities) so that the system should still satisfy real time requirements or, more flexibly, use a different scheduler which uses a different schedule.
  • Most schedulers in common use are ‘work conserving schedulers’: if the resources needed to run a task are available and the task is due to execute, the task is started. In contrast, a non-work-conserving scheduler might choose to leave a resource idle for a short time even though it could be used. Non-work-conserving schedulers are normally used to improve efficiency where there is a possibility that a better choice of task will become available if the scheduler delays for a short time.
  • A non-work-conserving scheduler is well suited to testing concurrent systems because it provides more flexibility over the precise timing of different tasks than a work-conserving scheduler does. In particular, we can exploit that flexibility in the following ways (a code sketch follows the list):
      • model the effect of possibly increased runtime of different tasks. e.g., if task A takes 100 microseconds and we want to know what would happen if it took 150 microseconds, the scheduler can delay scheduling any tasks for 50 microseconds after A completes. A special case is uniformly slowing down all tasks to establish the ‘critical scaling factor’. Another interesting thing to watch for is ‘scheduling anomalies’ where a small change in the runtime of a task can have a large effect on the overall schedule and, in particular, where increasing the runtime of one task can cause another task to execute earlier (which can have both good and bad effects).
      • model the effect of variability in the runtime of different tasks by waiting a random amount of time after each task completes
      • cause two tasks to execute at a range of different phases relative to each other by delaying the start of execution of one or the other of the tasks by different amounts. Where the tasks are not periodic, (e.g., they are triggered by external events) you might delay execution of one task until some time after the other task has been triggered.
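  • As a sketch of how such delays might be chosen (the slack computation and record_schedule_choice are assumptions; the next paragraph discusses the real-time constraints), the test scheduler can delay each newly runnable task by a random fraction of its available slack:

    #include <stdlib.h>

    void record_schedule_choice(int task_id, unsigned delay_us);
                                       /* assumed: logs the choice so test
                                          coverage can be measured         */

    /* Called by the test scheduler when a task becomes runnable.
     * 'slack_us' is how long the task can be delayed while still meeting
     * its real-time requirements (computed elsewhere).                    */
    unsigned pick_test_delay(int task_id, unsigned slack_us)
    {
        /* Explore a different phase on each run: delay by a random amount
         * of time up to the slack, so deadlines are still met.            */
        unsigned delay_us = (unsigned)rand() % (slack_us + 1);
        record_schedule_choice(task_id, delay_us);
        return delay_us;
    }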
  • In all these cases, the modification of the schedule is typically done within the constraints of the real-time requirements of the tasks. For example, when a task becomes runnable, one might establish how much ‘slack’ there is in the schedule and then choose to delay the task by at most that amount. In particular, when exploring different phases, if the second event doesn't happen within that period of slack, then the first event must be delivered to the system, and we can hope to explore that phase the next time the event triggers.
  • It is often useful to monitor which different schedules have been explored either to report to the programmer exactly what tests have been performed and which ones found problems or to drive a feedback loop where a test harness keeps testing different schedules until sufficient test coverage has been achieved.
  • 8.3 Debugging Parallel Systems
  • When a sequential program is parallelized, it is often the case that one of the possible schedules the scheduler might choose causes the program to execute in exactly the same order that the original program would have executed. (Where this is not true, such as with a non-preemptive scheduler, it is sometimes possible to insert pre-emption points into the code to make it true.)
  • If the scheduler is able to determine what is currently executing and what would have run next in the original program, the scheduler can choose to execute the thread that would run that piece of code. (Again, it may be necessary to insert instrumentation into the code to help the scheduler figure out the status of each thread so that it can execute them in the correct order.)
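  • A sketch of such a scheduling decision (the next_seq_point instrumentation is an assumption: it stands for compiler-inserted tags giving each pre-emption point its index in the original sequential order):

    extern int next_seq_point(int thread);  /* assumed instrumentation:
                                       index of this thread's next
                                       pre-emption point in the original
                                       sequential order                    */

    static int expected = 0;       /* where the sequential program is at   */

    /* Choose the thread that would run next in the original program. */
    int pick_thread_sequential(int num_threads)
    {
        for (int t = 0; t < num_threads; t++)
            if (next_seq_point(t) == expected) {
                expected++;        /* advance the sequential position      */
                return t;
            }
        return -1;                 /* no thread is at the expected point
                                      yet: wait and try again              */
    }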
  • 8.4 Tracing Parallel Systems
  • The amount of reordering required can be reduced by reordering the execution itself, which might reduce the size of the trace window required, simplify the task of separating out parallel streams of memory accesses, or eliminate the need to reorder the trace at all.
  • 9. Exploiting High Level View
  • 9.1 Overview
  • Working with the whole program at once and following compilation through many different levels of abstraction allows us to exploit information from one level of compilation in a higher or lower level. Some examples:
      • Executing with very abstract models of kernels can give us faster simulation which gives visualization, compiler feedback and regression checks on meeting deadlines.
      • We can plug a high level simulator of one component into a low level system simulation (using back-annotation of timing) and vice-versa.
      • We can simulate at various levels of detail: trace start/stop events (but don't simulate kernels), functional simulation using semihosting, bus traffic simulation, etc.
      • We can use our knowledge of the high level semantics to insert checking to confirm that the high-level semantics is enforced. For example, if a kernel is supposed to access only some address ranges, we can use an MPU to enforce that.
      • We can reconstruct a ‘message-passing view’ of bus traffic.
  • Although illustrative embodiments of the invention have been described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various changes and modifications can be effected therein by one skilled in the art without departing from the scope and spirit of the invention as defined by the appended claims.

Claims (25)

1. A method of transforming a portion of a computer program comprising a list of sequential instructions comprising control code and data processing code and a program separation indicator indicating a point where said sequential instructions may be divided to form separate sections that are capable of being separately executed and that each comprise different data processing code, said method comprising the steps of:
(i) analysing said portion of said program to determine if said sequential instructions can be divided at said point indicated by said program separation indicator and in response to determining that it can:
(iia) providing data communication between said separate sections indicated by said program separation indicator, such that said separate sections can be decoupled from each other, such that at least one of said sections is capable of being separately executed by an execution mechanism that is separate from an execution mechanism executing another of said separate sections, said at least one of said sections being capable of generating data and communicating said data to at least one other of said separate sections; and
in response to determining it can not:
(iib) not performing step (iia).
2. A method according to claim 1, wherein said step (iib) further comprises outputting a warning indicating an error in said computer program.
3. A method according to claim 1, wherein said step (iib) comprises amending said computer program such that said sequential instructions can be divided at said point and then performing step (iia).
4. A method according to claim 3, wherein said step of amending said computer program comprises inserting data transfer instructions at said point indicated by said program separation indicator.
5. A method according to claim 1, wherein said step (iib) comprises merging said two sections together and removing said program separation indicator.
6. A method according to claim 1, wherein said program separation indicator comprises at least one data transfer instruction, said data communication between said separate sections being provided in dependence upon said at least one data transfer instruction.
7. A method according to claim 1, wherein said program separation indicator comprises a plurality of data transfer instructions each corresponding to one of a plurality of data variables.
8. A method according to claim 6, wherein said step (iia) of providing data communication comprises inserting at least one “put data into a data store” instruction and at least one “get data from said data store” instruction into said instruction stream, and dividing said computer program between said put and get instructions to form said at least one separate section.
9. A method according to claim 8, wherein said data store comprises a FIFO buffer.
10. A method according to claim 1, wherein said step (iia) of providing data communication comprises allowing said data to be transferred in an order different to that it is generated in.
11. A method according to claim 1, wherein said step (iia) comprises providing cyclic data communication between said separate sections.
12. A method according to claim 1, wherein said separate sections comprise the same control code.
13. A method according to claim 1, said portion of said computer program comprising a plurality of program separation indicators each indicating a point where said sequential instructions may be divided to form separate sections, each of said separate sections being capable of being separately executed and comprising different data processing code, said method providing data communication between said separate sections indicated by said plurality of program separation indicators.
14. A method according to claim 1, wherein said transformed computer program is suitable for execution upon respective execution mechanisms of a heterogeneous system having a plurality of execution mechanisms.
15. A method according to claim 14, wherein said heterogeneous system comprises an asymmetric memory hierarchy.
16. A method according to claim 14, wherein said control code of at least one of said sections is operable to be processed by a processor of said heterogeneous system and said data processing code of said section is operable to be processed by an execution mechanism under control of said control code processed by said processor.
17. A method according to claim 14, wherein said plurality of execution mechanisms include one or more of:
a general purpose processor;
a direct memory access unit;
a coprocessor;
a VLIW processor;
a digital signal processor; and
a hardware accelerator unit.
18. A method according to claim 1, wherein said method step comprises an initial step performed before step (i) of:
defining said portion of said computer program by marking said computer program with indications delimiting said portion of said sequential instructions within which said at least two sections are to be located.
19. A method according to claim 1, wherein said computer program comprises said portion having a number of instructions to be executed sequentially and at least one further portion having instructions to be performed in parallel with each other.
20. A method according to claim 1, wherein said portion of said computer program comprises an instruction loop comprising at least two data processing instructions, and said at least two sections each comprise said instruction loop each section comprising at least one of said at least two data processing instructions, said at least two sections comprising different data processing instructions.
21. A method according to claim 1, wherein said portion of said computer program comprises a whole computer program.
22. A computer-readable storage medium comprising a computer program for controlling a computer to perform the method as claimed in claim 1.
23. A computer executing a computer program to perform the method of claim 1.
24. A method of transforming a portion of a computer program comprising a list of sequential instructions and a program separation indicator indicating a point where said sequential instructions may be divided to form separate sections that are capable of being separately executed and that each comprise different data processing code, said list of instructions comprising control code and data processing code, said method comprising the step of:
(i) in response to said program separation indicator, providing data communication between said separate sections such that they can be decoupled from each other, such that at least one of said sections is capable of being separately executed by an execution mechanism that is separate from an execution mechanism executing another of said separate sections and said at least one of said sections can communicate data that it generates to at least one other of said separate sections via said provided data communication.
25. A method according to claim 24, comprising a further initial step (0) performed before step (i) of in response to said program separation indicator, analysing said portion of said computer program and determining which of said sequential instructions should be in which of said separate sections prior to providing said data communication.
US11/898,360 2006-10-24 2007-09-11 Analyzing and transforming a computer program for executing on asymmetric multiprocessing systems Abandoned US20080098208A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US11/898,360 US20080098208A1 (en) 2006-10-24 2007-09-11 Analyzing and transforming a computer program for executing on asymmetric multiprocessing systems

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US85375606P 2006-10-24 2006-10-24
US11/898,360 US20080098208A1 (en) 2006-10-24 2007-09-11 Analyzing and transforming a computer program for executing on asymmetric multiprocessing systems

Publications (1)

Publication Number Publication Date
US20080098208A1 true US20080098208A1 (en) 2008-04-24

Family

ID=38219318

Family Applications (5)

Application Number Title Priority Date Filing Date
US11/898,363 Abandoned US20080098207A1 (en) 2006-10-24 2007-09-11 Analyzing diagnostic data generated by multiple threads within an instruction stream
US11/898,360 Abandoned US20080098208A1 (en) 2006-10-24 2007-09-11 Analyzing and transforming a computer program for executing on asymmetric multiprocessing systems
US11/907,881 Active 2029-01-22 US7809989B2 (en) 2006-10-24 2007-10-18 Performing diagnostic operations upon an asymmetric multiprocessor apparatus
US11/976,314 Active 2031-05-19 US8250549B2 (en) 2006-10-24 2007-10-23 Variable coherency support when mapping a computer program to a data processing apparatus
US11/976,315 Active 2031-02-07 US8190807B2 (en) 2006-10-24 2007-10-23 Mapping a computer program to an asymmetric multiprocessing apparatus

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US11/898,363 Abandoned US20080098207A1 (en) 2006-10-24 2007-09-11 Analyzing diagnostic data generated by multiple threads within an instruction stream

Family Applications After (3)

Application Number Title Priority Date Filing Date
US11/907,881 Active 2029-01-22 US7809989B2 (en) 2006-10-24 2007-10-18 Performing diagnostic operations upon an asymmetric multiprocessor apparatus
US11/976,314 Active 2031-05-19 US8250549B2 (en) 2006-10-24 2007-10-23 Variable coherency support when mapping a computer program to a data processing apparatus
US11/976,315 Active 2031-02-07 US8190807B2 (en) 2006-10-24 2007-10-23 Mapping a computer program to an asymmetric multiprocessing apparatus

Country Status (11)

Country Link
US (5) US20080098207A1 (en)
EP (1) EP2076837B1 (en)
JP (1) JP5054115B2 (en)
KR (1) KR101325229B1 (en)
CN (1) CN101529391B (en)
DE (1) DE602007009857D1 (en)
GB (1) GB2443277B (en)
IL (1) IL197314A (en)
MY (1) MY144449A (en)
TW (1) TWI407374B (en)
WO (1) WO2008050076A1 (en)

Cited By (59)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090199028A1 (en) * 2008-02-01 2009-08-06 Arimilli Ravi K Wake-and-Go Mechanism with Data Exclusivity
US20090199029A1 (en) * 2008-02-01 2009-08-06 Arimilli Ravi K Wake-and-Go Mechanism with Data Monitoring
US20090199184A1 (en) * 2008-02-01 2009-08-06 Arimilli Ravi K Wake-and-Go Mechanism With Software Save of Thread State
US20090199197A1 (en) * 2008-02-01 2009-08-06 International Business Machines Corporation Wake-and-Go Mechanism with Dynamic Allocation in Hardware Private Array
US20090199030A1 (en) * 2008-02-01 2009-08-06 Arimilli Ravi K Hardware Wake-and-Go Mechanism for a Data Processing System
US20090199189A1 (en) * 2008-02-01 2009-08-06 Arimilli Ravi K Parallel Lock Spinning Using Wake-and-Go Mechanism
US20100162217A1 (en) * 2008-12-22 2010-06-24 Microsoft Corporation Debugging System Using Static Analysis
US20100241828A1 (en) * 2009-03-18 2010-09-23 Microsoft Corporation General Distributed Reduction For Data Parallel Computing
US20100268915A1 (en) * 2009-04-16 2010-10-21 International Business Machines Corporation Remote Update Programming Idiom Accelerator with Allocated Processor Resources
US20100268791A1 (en) * 2009-04-16 2010-10-21 International Business Machines Corporation Programming Idiom Accelerator for Remote Update
US20100268790A1 (en) * 2009-04-16 2010-10-21 International Business Machines Corporation Complex Remote Update Programming Idiom Accelerator
US20100287341A1 (en) * 2008-02-01 2010-11-11 Arimilli Ravi K Wake-and-Go Mechanism with System Address Bus Transaction Master
US20100293341A1 (en) * 2008-02-01 2010-11-18 Arimilli Ravi K Wake-and-Go Mechanism with Exclusive System Bus Response
US20110016293A1 (en) * 2009-07-15 2011-01-20 Comm. a l' ener. atom. et aux energies alter. Device and method for the distributed execution of digital data processing operations
US20110072006A1 (en) * 2009-09-18 2011-03-24 Microsoft Corporation Management of data and computation in data centers
US20110173419A1 (en) * 2008-02-01 2011-07-14 Arimilli Ravi K Look-Ahead Wake-and-Go Engine With Speculative Execution
US20110173625A1 (en) * 2008-02-01 2011-07-14 Arimilli Ravi K Wake-and-Go Mechanism with Prioritization of Threads
US20110173630A1 (en) * 2008-02-01 2011-07-14 Arimilli Ravi K Central Repository for Wake-and-Go Mechanism
US20110173632A1 (en) * 2008-02-01 2011-07-14 Arimilli Ravi K Hardware Wake-and-Go Mechanism with Look-Ahead Polling
US20110173423A1 (en) * 2008-02-01 2011-07-14 Arimilli Ravi K Look-Ahead Hardware Wake-and-Go Mechanism
US20110173417A1 (en) * 2008-02-01 2011-07-14 Arimilli Ravi K Programming Idiom Accelerators
US20110173631A1 (en) * 2008-02-01 2011-07-14 Arimilli Ravi K Wake-and-Go Mechanism for a Data Processing System
US20110173593A1 (en) * 2008-02-01 2011-07-14 Arimilli Ravi K Compiler Providing Idiom to Idiom Accelerator
US8145849B2 (en) 2008-02-01 2012-03-27 International Business Machines Corporation Wake-and-go mechanism with system bus response
GB2486485A (en) * 2010-12-16 2012-06-20 Imagination Tech Ltd Completing execution of one phase of a computer program before scheduling any instructions for the next phase
US8230201B2 (en) 2009-04-16 2012-07-24 International Business Machines Corporation Migrating sleeping and waking threads between wake-and-go mechanisms in a multiple processor data processing system
US20130010150A1 (en) * 1997-07-15 2013-01-10 Kia Silverbrook Portable handheld device with multi-core image processor
US8533720B2 (en) 2011-02-25 2013-09-10 International Business Machines Corporation Offloading work from one type to another type of processor based on the count of each type of service call instructions in the work unit
US8566831B2 (en) 2011-01-26 2013-10-22 International Business Machines Corporation Execution of work units in a heterogeneous computing environment
US8612952B2 (en) 2010-04-07 2013-12-17 International Business Machines Corporation Performance optimization based on data accesses during critical sections
US20130339923A1 (en) * 2012-06-19 2013-12-19 Charles Chen Xu Data Handling Among Actors in a Dataflow Programming Environment
US8621430B2 (en) 2011-03-03 2013-12-31 International Business Machines Corporation Method for code transformation supporting temporal abstraction of parameters
US8725992B2 (en) 2008-02-01 2014-05-13 International Business Machines Corporation Programming language exposing idiom calls to a programming idiom accelerator
US20140157229A1 (en) * 2012-12-04 2014-06-05 International Business Machines Corporation Streamlining Hardware Initialization Code
US20140195834A1 (en) * 2013-01-04 2014-07-10 Microsoft Corporation High throughput low latency user mode drivers implemented in managed code
US8789939B2 (en) 1998-11-09 2014-07-29 Google Inc. Print media cartridge with ink supply manifold
US8823823B2 (en) 1997-07-15 2014-09-02 Google Inc. Portable imaging device with multi-core processor and orientation sensor
US8866923B2 (en) 1999-05-25 2014-10-21 Google Inc. Modular camera and printer
US8896724B2 (en) 1997-07-15 2014-11-25 Google Inc. Camera system to facilitate a cascade of imaging effects
US8902333B2 (en) 1997-07-15 2014-12-02 Google Inc. Image processing method using sensed eye position
US8910137B2 (en) 2012-04-13 2014-12-09 International Business Machines Corporation Code profiling of executable library for pipeline parallelization
US8908075B2 (en) 1997-07-15 2014-12-09 Google Inc. Image capture and processing integrated circuit for a camera
US8936196B2 (en) 1997-07-15 2015-01-20 Google Inc. Camera unit incorporating program script scanner
US8949809B2 (en) 2012-03-01 2015-02-03 International Business Machines Corporation Automatic pipeline parallelization of sequential code
US9055221B2 (en) 1997-07-15 2015-06-09 Google Inc. Portable hand-held device for deblurring sensed images
US9323543B2 (en) 2013-01-04 2016-04-26 Microsoft Technology Licensing, Llc Capability based device driver framework
US20160313991A1 (en) * 2013-06-16 2016-10-27 President And Fellows Of Harvard College Methods and apparatus for parallel processing
US9652817B2 (en) 2015-03-12 2017-05-16 Samsung Electronics Co., Ltd. Automated compute kernel fusion, resizing, and interleave
US20170236244A1 (en) * 2016-02-12 2017-08-17 Arm Limited Graphics processing systems
US9811319B2 (en) 2013-01-04 2017-11-07 Microsoft Technology Licensing, Llc Software interface for a hardware device
US10296340B2 (en) 2014-03-13 2019-05-21 Arm Limited Data processing apparatus for executing an access instruction for N threads
CN110199269A (en) * 2017-01-23 2019-09-03 三星电子株式会社 Method and electronic device for the data processing between multiprocessor
CN110998540A (en) * 2017-08-01 2020-04-10 微软技术许可有限责任公司 Execution of focus of trace code in debugger
CN111476264A (en) * 2019-01-24 2020-07-31 国际商业机器公司 Testing of robustness against access restricted systems
US10732982B2 (en) 2017-08-15 2020-08-04 Arm Limited Data processing systems
US10869108B1 (en) 2008-09-29 2020-12-15 Calltrol Corporation Parallel signal processing system and method
US20210271666A1 (en) * 2018-09-28 2021-09-02 Marc Brandis Ag Analyzing a processing engine of a transaction-processing system
US11604752B2 (en) 2021-01-29 2023-03-14 Arm Limited System for cross-routed communication between functional units of multiple processing units
US20230195321A1 (en) * 2021-12-17 2023-06-22 Samsung Electronics Co., Ltd. Storage device and operating method thereof

Families Citing this family (207)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7614037B2 (en) * 2004-05-21 2009-11-03 Microsoft Corporation Method and system for graph analysis and synchronization
US8079019B2 (en) * 2007-11-21 2011-12-13 Replay Solutions, Inc. Advancing and rewinding a replayed program execution
GB2443277B (en) * 2006-10-24 2011-05-18 Advanced Risc Mach Ltd Performing diagnostics operations upon an asymmetric multiprocessor apparatus
US8341604B2 (en) * 2006-11-15 2012-12-25 Qualcomm Incorporated Embedded trace macrocell for enhanced digital signal processor debugging operations
US8370806B2 (en) * 2006-11-15 2013-02-05 Qualcomm Incorporated Non-intrusive, thread-selective, debugging method and system for a multi-thread digital signal processor
US8380966B2 (en) * 2006-11-15 2013-02-19 Qualcomm Incorporated Method and system for instruction stuffing operations during non-intrusive digital signal processor debugging
US8533530B2 (en) * 2006-11-15 2013-09-10 Qualcomm Incorporated Method and system for trusted/untrusted digital signal processor debugging operations
US10353797B2 (en) * 2006-12-29 2019-07-16 International Business Machines Corporation Using memory tracking data to inform a memory map tool
US8484516B2 (en) * 2007-04-11 2013-07-09 Qualcomm Incorporated Inter-thread trace alignment method and system for a multi-threaded processor
US8312227B2 (en) * 2007-05-31 2012-11-13 Intel Corporation Method and apparatus for MPI program optimization
CN101329638B (en) * 2007-06-18 2011-11-09 国际商业机器公司 Method and system for analyzing parallelism of program code
US20090007115A1 (en) * 2007-06-26 2009-01-01 Yuanhao Sun Method and apparatus for parallel XSL transformation with low contention and load balancing
US8548777B2 (en) * 2007-09-28 2013-10-01 Rockwell Automation Technologies, Inc. Automated recommendations from simulation
US20090089031A1 (en) * 2007-09-28 2009-04-02 Rockwell Automation Technologies, Inc. Integrated simulation of controllers and devices
US7801710B2 (en) * 2007-09-28 2010-09-21 Rockwell Automation Technologies, Inc. Simulation controls for model variability and randomness
US8181165B2 (en) * 2007-10-30 2012-05-15 International Business Machines Corporation Using annotations to reuse variable declarations to generate different service functions
IL187038A0 (en) * 2007-10-30 2008-02-09 Sandisk Il Ltd Secure data processing for unaligned data
US8402438B1 (en) 2007-12-03 2013-03-19 Cadence Design Systems, Inc. Method and system for generating verification information and tests for software
US8468504B2 (en) * 2007-12-28 2013-06-18 Streaming Networks (Pvt.) Ltd. Method and apparatus for interactive scheduling of VLIW assembly code
US8156474B2 (en) * 2007-12-28 2012-04-10 Cadence Design Systems, Inc. Automation of software verification
US9063778B2 (en) * 2008-01-09 2015-06-23 Microsoft Technology Licensing, Llc Fair stateless model checking
GB2456813B (en) 2008-01-24 2012-03-07 Advanced Risc Mach Ltd Diagnostic context construction and comparison
JP5278336B2 (en) * 2008-02-15 2013-09-04 日本電気株式会社 Program parallelization apparatus, program parallelization method, and program parallelization program
US8615647B2 (en) * 2008-02-29 2013-12-24 Intel Corporation Migrating execution of thread between cores of different instruction set architecture in multi-core processor and transitioning each core to respective on / off power state
EP2257874A4 (en) * 2008-03-27 2013-07-17 Rocketick Technologies Ltd Design simulation using parallel processors
US9678775B1 (en) * 2008-04-09 2017-06-13 Nvidia Corporation Allocating memory for local variables of a multi-threaded program for execution in a single-threaded environment
US8776030B2 (en) * 2008-04-09 2014-07-08 Nvidia Corporation Partitioning CUDA code for execution by a general purpose processor
GB0808576D0 (en) * 2008-05-12 2008-06-18 Xmos Ltd Compiling and linking
FR2931269A1 (en) * 2008-05-16 2009-11-20 Ateji Soc Par Actions Simplifi METHOD AND SYSTEM FOR DEVELOPING PARALLEL PROGRAMS
US9223677B2 (en) 2008-06-11 2015-12-29 Arm Limited Generation of trace data in a multi-processor system
WO2009153621A1 (en) * 2008-06-19 2009-12-23 Freescale Semiconductor, Inc. A system, method and computer program product for scheduling processor entity tasks in a multiple-processing entity system
US9058206B2 (en) * 2008-06-19 2015-06-16 Freescale Semiconductor, Inc. System, method and program product for determining execution flow of the scheduler in response to setting a scheduler control variable by the debugger or by a processing entity
WO2009153620A1 (en) * 2008-06-19 2009-12-23 Freescale Semiconductor, Inc. A system, method and computer program product for scheduling a processing entity task
US8572577B2 (en) * 2008-06-20 2013-10-29 International Business Machines Corporation Monitoring changes to data within a critical section of a threaded program
US8332825B2 (en) * 2008-06-26 2012-12-11 Microsoft Corporation Dynamically monitoring application behavior
US9032377B2 (en) * 2008-07-10 2015-05-12 Rocketick Technologies Ltd. Efficient parallel computation of dependency problems
EP2297647A4 (en) * 2008-07-10 2012-12-12 Rocketick Technologies Ltd Efficient parallel computation of dependency problems
JP2010026851A (en) * 2008-07-22 2010-02-04 Panasonic Corp Complier-based optimization method
US8028113B2 (en) * 2008-08-15 2011-09-27 International Business Machines Corporation Methods and systems for deadlock-free allocation of memory
US20110113221A1 (en) * 2008-08-18 2011-05-12 Telefonaktiebolaget L M Ericsson (Publ) Data Sharing in Chip Multi-Processor Systems
US8230442B2 (en) 2008-09-05 2012-07-24 International Business Machines Corporation Executing an accelerator application program in a hybrid computing environment
US8504344B2 (en) * 2008-09-30 2013-08-06 Cadence Design Systems, Inc. Interface between a verification environment and a hardware acceleration engine
US20100095286A1 (en) * 2008-10-10 2010-04-15 Kaplan David A Register reduction and liveness analysis techniques for program code
US8418146B2 (en) * 2008-11-26 2013-04-09 Microsoft Corporation Sampling techniques for dynamic data-race detection
GB0823329D0 (en) * 2008-12-22 2009-01-28 Geotate Bv Position signal sampling method and apparatus
KR101511273B1 (en) * 2008-12-29 2015-04-10 삼성전자주식회사 System and method for 3d graphic rendering based on multi-core processor
US8527734B2 (en) 2009-01-23 2013-09-03 International Business Machines Corporation Administering registered virtual addresses in a hybrid computing environment including maintaining a watch list of currently registered virtual addresses by an operating system
US9286232B2 (en) * 2009-01-26 2016-03-15 International Business Machines Corporation Administering registered virtual addresses in a hybrid computing environment including maintaining a cache of ranges of currently registered virtual addresses
US8843880B2 (en) * 2009-01-27 2014-09-23 International Business Machines Corporation Software development for a hybrid computing environment
US8255909B2 (en) 2009-01-28 2012-08-28 International Business Machines Corporation Synchronizing access to resources in a hybrid computing environment
US20100191923A1 (en) * 2009-01-29 2010-07-29 International Business Machines Corporation Data Processing In A Computing Environment
US9170864B2 (en) * 2009-01-29 2015-10-27 International Business Machines Corporation Data processing in a hybrid computing environment
CN102317915B (en) * 2009-02-16 2014-03-26 英科伦有限责任公司 Method for analysing the real-time capability of a system
US8205117B2 (en) * 2009-02-25 2012-06-19 Hewlett-Packard Development Company, L.P. Migratory hardware diagnostic testing
JP5316128B2 (en) * 2009-03-17 2013-10-16 トヨタ自動車株式会社 Fault diagnosis system, electronic control unit, fault diagnosis method
US20100242014A1 (en) * 2009-03-17 2010-09-23 Xiaohan Zhu Symmetric multi-processor operating system for asymmetric multi-processor architecture
US8843927B2 (en) * 2009-04-23 2014-09-23 Microsoft Corporation Monitoring and updating tasks arrival and completion statistics without data locking synchronization
US8413108B2 (en) * 2009-05-12 2013-04-02 Microsoft Corporation Architectural data metrics overlay
US9378062B2 (en) * 2009-06-18 2016-06-28 Microsoft Technology Licensing, Llc Interface between a resource manager and a scheduler in a process
US8719831B2 (en) * 2009-06-18 2014-05-06 Microsoft Corporation Dynamically change allocation of resources to schedulers based on feedback and policies from the schedulers and availability of the resources
DE102009025572A1 (en) * 2009-06-19 2010-12-23 Wolfgang Pree Gmbh A method for developing guaranteed real-time systems
US8914799B2 (en) * 2009-06-30 2014-12-16 Oracle America Inc. High performance implementation of the OpenMP tasking feature
JP5452125B2 (en) 2009-08-11 2014-03-26 クラリオン株式会社 Data processing apparatus and data processing method
US8990783B1 (en) 2009-08-13 2015-03-24 The Mathworks, Inc. Scheduling generated code based on target characteristics
US8566804B1 (en) * 2009-08-13 2013-10-22 The Mathworks, Inc. Scheduling generated code based on target characteristics
US8381194B2 (en) 2009-08-19 2013-02-19 Apple Inc. Methods and apparatuses for selective code coverage
US9594656B2 (en) 2009-10-26 2017-03-14 Microsoft Technology Licensing, Llc Analysis and visualization of application concurrency and processor resource utilization
US9430353B2 (en) * 2009-10-26 2016-08-30 Microsoft Technology Licensing, Llc Analysis and visualization of concurrent thread execution on processor cores
US8359588B2 (en) * 2009-11-25 2013-01-22 Arm Limited Reducing inter-task latency in a multiprocessor system
US8392929B2 (en) * 2009-12-15 2013-03-05 Microsoft Corporation Leveraging memory isolation hardware technology to efficiently detect race conditions
US8826234B2 (en) * 2009-12-23 2014-09-02 Intel Corporation Relational modeling for performance analysis of multi-core processors
WO2011083459A1 (en) * 2010-01-08 2011-07-14 Daniel Geist Utilizing temporal assertions in a debugger
US8516467B2 (en) * 2010-01-29 2013-08-20 Nintendo Co., Ltd. Method and apparatus for enhancing comprehension of code time complexity and flow
US9417905B2 (en) * 2010-02-03 2016-08-16 International Business Machines Corporation Terminating an accelerator application program in a hybrid computing environment
US8578132B2 (en) * 2010-03-29 2013-11-05 International Business Machines Corporation Direct injection of data to be transferred in a hybrid computing environment
US8959496B2 (en) * 2010-04-21 2015-02-17 Microsoft Corporation Automatic parallelization in a tracing just-in-time compiler system
US9015443B2 (en) 2010-04-30 2015-04-21 International Business Machines Corporation Reducing remote reads of memory in a hybrid computing environment
US8756590B2 (en) * 2010-06-22 2014-06-17 Microsoft Corporation Binding data parallel device source code
US8972995B2 (en) * 2010-08-06 2015-03-03 Sonics, Inc. Apparatus and methods to concurrently perform per-thread as well as per-tag memory access scheduling within a thread and across two or more threads
US9652365B2 (en) * 2010-08-24 2017-05-16 Red Hat, Inc. Fault configuration using a registered list of controllers
US20120240224A1 (en) * 2010-09-14 2012-09-20 Georgia Tech Research Corporation Security systems and methods for distinguishing user-intended traffic from malicious traffic
US8990551B2 (en) 2010-09-16 2015-03-24 Microsoft Technology Licensing, Llc Analysis and visualization of cluster resource utilization
US20120096445A1 (en) * 2010-10-18 2012-04-19 Nokia Corporation Method and apparatus for providing portability of partially accelerated signal processing applications
US8656496B2 (en) * 2010-11-22 2014-02-18 International Business Machines Corporation Global variable security analysis
US8832659B2 (en) * 2010-12-06 2014-09-09 University Of Washington Through Its Center For Commercialization Systems and methods for finding concurrency errors
US8959501B2 (en) * 2010-12-14 2015-02-17 Microsoft Corporation Type and length abstraction for data types
US20120160272A1 (en) * 2010-12-23 2012-06-28 United Microelectronics Corp. Cleaning method of semiconductor process
KR101785116B1 (en) * 2010-12-24 2017-10-17 한양대학교 산학협력단 Software-defined radio terminal for modem hardware independent radio applications
US8856764B2 (en) * 2011-01-25 2014-10-07 International Business Machines Corporation Distributed static analysis of computer software applications
US8726245B2 (en) * 2011-01-28 2014-05-13 International Business Machines Corporation Static analysis of computer software applications having a model-view-controller architecture
DE102011004363B4 (en) * 2011-02-18 2023-10-05 Airbus Operations Gmbh Control device for controlling network participants, method for operating a computer network and computer network
US9189283B2 (en) * 2011-03-03 2015-11-17 Hewlett-Packard Development Company, L.P. Task launching on hardware resource for client
GB2489278B (en) * 2011-03-24 2019-12-25 Advanced Risc Mach Ltd Improving the scheduling of tasks to be performed by a non-coherent device
US8650542B1 (en) * 2011-03-25 2014-02-11 The Mathworks, Inc. Hierarchical, self-describing function objects
US9015688B2 (en) * 2011-04-01 2015-04-21 Intel Corporation Vectorization of scalar functions including vectorization annotations and vectorized function signatures matching
US9128748B2 (en) * 2011-04-12 2015-09-08 Rocketick Technologies Ltd. Parallel simulation using multiple co-simulators
US8949777B2 (en) * 2011-04-22 2015-02-03 Intel Corporation Methods and systems for mapping a function pointer to the device code
US8855194B2 (en) * 2011-05-09 2014-10-07 Texas Instruments Incorporated Updating non-shadow registers in video encoder
US9043363B2 (en) * 2011-06-03 2015-05-26 Oracle International Corporation System and method for performing memory management using hardware transactions
US9069545B2 (en) * 2011-07-18 2015-06-30 International Business Machines Corporation Relaxation of synchronization for iterative convergent computations
US8918770B2 (en) * 2011-08-25 2014-12-23 Nec Laboratories America, Inc. Compiler for X86-based many-core coprocessors
GB2495959A (en) * 2011-10-26 2013-05-01 Imagination Tech Ltd Multi-threaded memory access processor
US8219684B1 (en) * 2011-11-02 2012-07-10 Google Inc. Redundant data requests with cancellation
US9043765B2 (en) * 2011-11-09 2015-05-26 Microsoft Technology Licensing, Llc Simultaneously targeting multiple homogeneous and heterogeneous runtime environments
US8615614B2 (en) * 2011-11-30 2013-12-24 Freescale Semiconductor, Inc. Message passing using direct memory access unit in a data processing system
US9367687B1 (en) * 2011-12-22 2016-06-14 Emc Corporation Method for malware detection using deep inspection and data discovery agents
US9686152B2 (en) 2012-01-27 2017-06-20 Microsoft Technology Licensing, Llc Techniques for tracking resource usage statistics per transaction across multiple layers of protocols
KR101885211B1 (en) * 2012-01-27 2018-08-29 삼성전자 주식회사 Method and apparatus for resource allocation of GPU
US8793697B2 (en) * 2012-02-23 2014-07-29 Qualcomm Incorporated Method and system for scheduling requests in a portable computing device
US8838861B2 (en) 2012-05-09 2014-09-16 Qualcomm Incorporated Methods and apparatuses for trace multicast across a bus structure, and related systems
US9928109B2 (en) 2012-05-09 2018-03-27 Nvidia Corporation Method and system for processing nested stream events
DE102012011584A1 (en) * 2012-06-13 2013-12-19 Robert Bosch Gmbh Resource management system for automation systems
RU2012127581A (en) * 2012-07-02 2014-01-10 LSI Corporation Source code generator for developing and testing software for multiprocessor environments
RU2012127578A (en) 2012-07-02 2014-01-10 LSI Corporation Software module applicability analyzer for developing and testing software for multiprocessor environments
EP2706420B1 (en) * 2012-09-05 2015-03-18 Siemens Aktiengesellschaft Method for operating an automation device
WO2014074251A1 (en) 2012-11-06 2014-05-15 Coherent Logix, Incorporated Multiprocessor programming toolkit for design reuse
WO2014074783A2 (en) 2012-11-09 2014-05-15 Coherent Logix, Incorporated Real time analysis and control for a multiprocessor system
US20140181830A1 (en) * 2012-12-26 2014-06-26 Mishali Naik Thread migration support for architecturally different cores
US9519568B2 (en) 2012-12-31 2016-12-13 Nvidia Corporation System and method for debugging an executing general-purpose computing on graphics processing units (GPGPU) application
US9207969B2 (en) * 2013-01-25 2015-12-08 Microsoft Technology Licensing, Llc Parallel tracing for performance and detail
US8762916B1 (en) * 2013-02-25 2014-06-24 Xilinx, Inc. Automatic generation of a data transfer network
US8924193B2 (en) * 2013-03-14 2014-12-30 The Mathworks, Inc. Generating variants from file differences
US9471456B2 (en) * 2013-05-15 2016-10-18 Nvidia Corporation Interleaved instruction debugger
US10802876B2 (en) * 2013-05-22 2020-10-13 Massachusetts Institute Of Technology Multiprocessor scheduling policy with deadline constraint for determining multi-agent schedule for a plurality of agents
GB2514618B (en) * 2013-05-31 2020-11-11 Advanced Risc Mach Ltd Data processing systems
IL232836A0 (en) * 2013-06-02 2014-08-31 Rocketick Technologies Ltd Efficient parallel computation of dependency problems
US9292419B1 (en) * 2013-06-04 2016-03-22 The Mathworks, Inc. Code coverage and confidence determination
US9697003B2 (en) 2013-06-07 2017-07-04 Advanced Micro Devices, Inc. Method and system for yield operation supporting thread-like behavior
US9075624B2 (en) 2013-06-24 2015-07-07 Xilinx, Inc. Compilation of system designs
EP3028167A1 (en) * 2013-07-31 2016-06-08 Hewlett Packard Enterprise Development LP Data stream processing using a distributed cache
US10372590B2 (en) * 2013-11-22 2019-08-06 International Business Machines Corporation Determining instruction execution history in a debugger
US20150195383A1 (en) * 2014-01-08 2015-07-09 Cavium, Inc. Methods and systems for single instruction multiple data programmable packet parsers
US9733981B2 (en) * 2014-06-10 2017-08-15 Nxp Usa, Inc. System and method for conditional task switching during ordering scope transitions
US10061591B2 (en) 2014-06-27 2018-08-28 Samsung Electronics Company, Ltd. Redundancy elimination in single instruction multiple data/thread (SIMD/T) execution processing
US10061592B2 (en) 2014-06-27 2018-08-28 Samsung Electronics Co., Ltd. Architecture and execution for efficient mixed precision computations in single instruction multiple data/thread (SIMD/T) devices
US9182990B1 (en) * 2014-07-01 2015-11-10 Google Inc. Method and apparatus for detecting execution of unsupported instructions while testing multiversioned code
US9672029B2 (en) * 2014-08-01 2017-06-06 Vmware, Inc. Determining test case priorities based on tagged execution paths
US10148547B2 (en) 2014-10-24 2018-12-04 Tektronix, Inc. Hardware trigger generation from a declarative protocol description
US9338076B1 (en) 2014-10-24 2016-05-10 Tektronix, Inc. Deriving hardware acceleration of decoding from a declarative protocol description
US20160170767A1 (en) * 2014-12-12 2016-06-16 Intel Corporation Temporary transfer of a multithreaded IP core to single or reduced thread configuration during thread offload to co-processor
US9280389B1 (en) * 2014-12-30 2016-03-08 Tyco Fire & Security Gmbh Preemptive operating system without context switching
US9996354B2 (en) 2015-01-09 2018-06-12 International Business Machines Corporation Instruction stream tracing of multi-threaded processors
US20160224327A1 (en) * 2015-02-02 2016-08-04 Lenovo Enterprise Solutions (Singapore) Pte. Ltd. Linking a Program with a Software Library
US11119903B2 (en) 2015-05-01 2021-09-14 Fastly, Inc. Race condition testing via a scheduling test program
US10102031B2 (en) 2015-05-29 2018-10-16 Qualcomm Incorporated Bandwidth/resource management for multithreaded processors
US9910760B2 (en) * 2015-08-07 2018-03-06 Nvidia Corporation Method and apparatus for interception of synchronization objects in graphics application programming interfaces for frame debugging
US9977693B2 (en) 2015-09-23 2018-05-22 Hanan Potash Processor that uses plural form information
US10140122B2 (en) 2015-09-23 2018-11-27 Hanan Potash Computer processor with operand/variable-mapped namespace
US10061511B2 (en) 2015-09-23 2018-08-28 Hanan Potash Computing device with frames/bins structure, mentor layer and plural operand processing
US10067878B2 (en) 2015-09-23 2018-09-04 Hanan Potash Processor with logical mentor
US10095641B2 (en) 2015-09-23 2018-10-09 Hanan Potash Processor with frames/bins structure in local high speed memory
WO2017062612A1 (en) 2015-10-09 2017-04-13 Arch Systems Inc. Modular device and method of operation
US10534697B2 (en) * 2015-10-27 2020-01-14 Sap Se Flexible configuration framework
US9678788B2 (en) * 2015-11-10 2017-06-13 International Business Machines Corporation Enabling poll/select style interfaces with coherent accelerators
US10884761B2 (en) 2016-03-22 2021-01-05 Lenovo Enterprise Solutions (Singapore) Pte. Ltd Best performance delivery in heterogeneous computing unit environment
US10203747B2 (en) 2016-03-22 2019-02-12 Lenovo Enterprise Solutions (Singapore) Pte. Ltd. Workload placement based on heterogeneous compute performance per watt
US10860499B2 (en) 2016-03-22 2020-12-08 Lenovo Enterprise Solutions (Singapore) Pte. Ltd Dynamic memory management in workload acceleration
US11093286B2 (en) * 2016-04-26 2021-08-17 Hanan Potash Computing device with resource manager and civilware tier
US10303466B1 (en) * 2016-09-07 2019-05-28 Amazon Technologies, Inc. Semantic annotations in source code
US10177795B1 (en) 2016-12-29 2019-01-08 Amazon Technologies, Inc. Cache index mapping
US20180285241A1 (en) * 2017-03-28 2018-10-04 Carnegie Mellon University Energy-interference-free debugger for intermittent energy-harvesting systems
US20180331973A1 (en) * 2017-05-09 2018-11-15 Microsoft Technology Licensing, Llc Increasing virtual machine availability during server updates
US10282274B2 (en) * 2017-06-14 2019-05-07 Microsoft Technology Licensing, Llc Presenting differences between code entity invocations
US10732634B2 (en) 2017-07-03 2020-08-04 Baidu Us Llc Centralized scheduling system using event loop for operating autonomous driving vehicles
US10635108B2 (en) * 2017-07-03 2020-04-28 Baidu Usa Llc Centralized scheduling system using global store for operating autonomous driving vehicles
US20190057017A1 (en) * 2017-08-16 2019-02-21 Microsoft Technology Licensing, Llc Correlation Of Function Calls To Functions In Asynchronously Executed Threads
US10474600B2 (en) 2017-09-14 2019-11-12 Samsung Electronics Co., Ltd. Heterogeneous accelerator for highly efficient learning systems
CN111684769B (en) 2017-11-06 2023-03-24 思想系统公司 Network system including a match processing unit for table-based actions
WO2019118628A1 (en) * 2017-12-12 2019-06-20 Arch Systems Inc. System and method for physical machine monitoring and analysis
JP6955163B2 (en) * 2017-12-26 2021-10-27 富士通株式会社 Information processing equipment, information processing methods and programs
CN112074808A (en) * 2018-02-22 2020-12-11 思想系统公司 Programmable computer IO device interface
US10636112B2 (en) * 2018-03-28 2020-04-28 Intel Corporation Graphics processor register data re-use mechanism
JP7236811B2 (en) * 2018-03-30 2023-03-10 株式会社デンソー Information processing equipment
US11237946B2 (en) * 2018-05-03 2022-02-01 Sap Se Error finder tool
US11468338B2 (en) 2018-09-11 2022-10-11 Apple Inc. Compiling models for dedicated hardware
JP7037097B2 (en) * 2018-10-19 2022-03-16 日本電信電話株式会社 Data processing system, central processing unit and data processing method
US11126532B1 (en) * 2018-11-14 2021-09-21 Teledyne Lecroy, Inc. Method and apparatus for a parallel, metadata-based trace analytics processor
US10824538B2 (en) * 2019-01-22 2020-11-03 Oracle International Corporation Scalable incremental analysis using caller and callee summaries
US11169886B2 (en) * 2019-01-29 2021-11-09 Sap Se Modification of temporary database pages
US10892971B2 (en) 2019-03-12 2021-01-12 Arch Systems Inc. System and method for network communication monitoring
US11782816B2 (en) * 2019-03-19 2023-10-10 Jens C. Jenkins Input/output location transformations when emulating non-traced code with a recorded execution of traced code
US11281560B2 (en) * 2019-03-19 2022-03-22 Microsoft Technology Licensing, Llc Input/output data transformations when emulating non-traced code with a recorded execution of traced code
US11657162B2 (en) * 2019-03-22 2023-05-23 Intel Corporation Adversarial training of neural networks using information about activation path differentials
US11036546B1 (en) 2019-04-16 2021-06-15 Xilinx, Inc. Multi-threaded shared memory functional simulation of dataflow graph
US10860766B1 (en) 2019-05-23 2020-12-08 Xilinx, Inc. Compilation flow for a heterogeneous multi-core architecture
US10802807B1 (en) 2019-05-23 2020-10-13 Xilinx, Inc. Control and reconfiguration of data flow graphs on heterogeneous computing platform
US11204745B2 (en) 2019-05-23 2021-12-21 Xilinx, Inc. Dataflow graph programming environment for a heterogeneous processing system
US11138019B1 (en) 2019-05-23 2021-10-05 Xilinx, Inc. Routing in a compilation flow for a heterogeneous multi-core architecture
US11727265B2 (en) * 2019-06-27 2023-08-15 Intel Corporation Methods and apparatus to provide machine programmed creative support to a user
US11516234B1 (en) * 2019-07-08 2022-11-29 Cisco Technology, Inc. In-process correlation through class field injection
US11068364B2 (en) * 2019-07-12 2021-07-20 Intelliflash By Ddn, Inc. Predictable synchronous data replication
US10949332B2 (en) 2019-08-14 2021-03-16 Microsoft Technology Licensing, Llc Data race analysis based on altering function internal loads during time-travel debugging
US11216446B2 (en) * 2019-08-29 2022-01-04 Snowflake Inc. Identifying software regressions based on query retry attempts in a database environment
US11016849B2 (en) * 2019-09-04 2021-05-25 Red Hat, Inc. Kernel software RAID support for direct-access file systems
CN111427816A (en) * 2020-03-04 2020-07-17 深圳震有科技股份有限公司 Inter-core communication method of AMP system, computer equipment and storage medium
US11216259B1 (en) * 2020-03-31 2022-01-04 Xilinx, Inc. Performing multiple functions in single accelerator program without reload overhead in heterogeneous computing system
US11693795B2 (en) * 2020-04-17 2023-07-04 Texas Instruments Incorporated Methods and apparatus to extend local buffer of a hardware accelerator
US20210382888A1 (en) * 2020-06-08 2021-12-09 Mongodb, Inc. Hedged reads
US11611588B2 (en) * 2020-07-10 2023-03-21 Kyndryl, Inc. Deep learning network intrusion detection
US11360918B1 (en) * 2020-12-21 2022-06-14 Otis Elevator Company Real-time processing system synchronization in a control system
DE102021102460A1 (en) 2021-02-03 2022-08-04 Ford Global Technologies, Llc Procedure for performing a simulation
US11467811B1 (en) * 2021-06-24 2022-10-11 Marvell Asia Pte Ltd Method and apparatus for generating metadata by a compiler
US20230004365A1 (en) * 2021-06-24 2023-01-05 Marvell Asia Pte Ltd Multistage compiler architecture
US11537457B2 (en) * 2021-06-25 2022-12-27 Intel Corporation Low latency remoting to accelerators
US11941291B2 (en) * 2021-09-02 2024-03-26 Micron Technology, Inc. Memory sub-system command fencing
US20220191003A1 (en) * 2021-12-10 2022-06-16 Tamas Mihaly Varhegyi Complete Tree Structure Encryption Software
CN114398019B (en) * 2022-01-24 2024-02-23 广州文石信息科技有限公司 Screen update request processing method and device and electronic ink screen equipment

Family Cites Families (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0396833A1 (en) * 1989-05-12 1990-11-14 International Business Machines Corporation Trace facility for use in a multiprocessing environment
US5692193A (en) 1994-03-31 1997-11-25 Nec Research Institute, Inc. Software architecture for control of highly parallel computer systems
US6539339B1 (en) * 1997-12-12 2003-03-25 International Business Machines Corporation Method and system for maintaining thread-relative metrics for trace data adjusted for thread switches
US6115763A (en) 1998-03-05 2000-09-05 International Business Machines Corporation Multi-core chip providing external core access with regular operation function interface and predetermined service operation services interface comprising core interface units and masters interface unit
US20060117274A1 (en) 1998-08-31 2006-06-01 Tseng Ping-Sheng Behavior processor system and method
US20040154027A1 (en) * 1998-10-14 2004-08-05 Jean-Jacques Vandewalle Method and means for managing communications between local and remote objects in an object oriented client server system in which a client application invokes a local object as a proxy for a remote object on the server
US6480818B1 (en) * 1998-11-13 2002-11-12 Cray Inc. Debugging techniques in a multithreaded environment
US6636950B1 (en) * 1998-12-17 2003-10-21 Massachusetts Institute Of Technology Computer architecture for shared memory access
JP2000293498A (en) * 1999-04-05 2000-10-20 Nec Corp Remote debug system in distributed environment and recording medium
US20020065864A1 (en) 2000-03-03 2002-05-30 Hartsell Neal D. Systems and method for resource tracking in information management environments
US6748583B2 (en) * 2000-12-27 2004-06-08 International Business Machines Corporation Monitoring execution of an hierarchical visual program such as for debugging a message flow
US6857084B1 (en) * 2001-08-06 2005-02-15 Lsi Logic Corporation Multiprocessor system and method for simultaneously placing all processors into debug mode
US6862694B1 (en) * 2001-10-05 2005-03-01 Hewlett-Packard Development Company, L.P. System and method for setting and executing breakpoints
US7318164B2 (en) * 2001-12-13 2008-01-08 International Business Machines Corporation Conserving energy in a data processing system by selectively powering down processors
US6941492B1 (en) * 2002-02-05 2005-09-06 Emc Corporation Debugging tool for efficient switching between targets in a multi-processor environment
US7080283B1 (en) * 2002-10-15 2006-07-18 Tensilica, Inc. Simultaneous real-time trace and debug for multiple processing core systems on a chip
US7243264B2 (en) * 2002-11-01 2007-07-10 Sonics, Inc. Method and apparatus for error handling in networks
US7222343B2 (en) * 2003-01-16 2007-05-22 International Business Machines Corporation Dynamic allocation of computer resources based on thread type
US7444546B2 (en) * 2003-04-17 2008-10-28 Arm Limited On-board diagnostic circuit for an integrated circuit
US7114042B2 (en) * 2003-05-22 2006-09-26 International Business Machines Corporation Method to provide atomic update primitives in an asymmetric heterogeneous multiprocessor environment
US7743382B2 (en) * 2003-11-03 2010-06-22 Ramal Acquisition Corp. System for deadlock condition detection and correction by allowing a queue limit of a number of data tokens on the queue to increase
US7721069B2 (en) * 2004-07-13 2010-05-18 3Plus1 Technology, Inc Low power, high performance, heterogeneous, scalable processor architecture
EP1815332A4 (en) * 2004-09-07 2009-07-15 Starent Networks Corp Process checkpointing and migration in computing systems
GB0420442D0 (en) * 2004-09-14 2004-10-20 Ignios Ltd Debug in a multicore architecture
US7437581B2 (en) * 2004-09-28 2008-10-14 Intel Corporation Method and apparatus for varying energy per instruction according to the amount of available parallelism
JP2006227706A (en) * 2005-02-15 2006-08-31 Matsushita Electric Ind Co Ltd Program development system and program development program
US7665073B2 (en) * 2005-04-18 2010-02-16 Microsoft Corporation Compile time meta-object protocol systems and methods
US7689867B2 (en) * 2005-06-09 2010-03-30 Intel Corporation Multiprocessor breakpoint
US7827551B2 (en) * 2005-09-21 2010-11-02 Intel Corporation Real-time threading service for partitioned multiprocessor systems
US7793278B2 (en) * 2005-09-30 2010-09-07 Intel Corporation Systems and methods for affine-partitioning programs onto multiple processing units
US8490065B2 (en) * 2005-10-13 2013-07-16 International Business Machines Corporation Method and apparatus for software-assisted data cache and prefetch control
US9081609B2 (en) * 2005-12-21 2015-07-14 Xerox Corporation Image processing system and method employing a threaded scheduler
US8533680B2 (en) * 2005-12-30 2013-09-10 Microsoft Corporation Approximating finite domains in symbolic state exploration
US9038040B2 (en) * 2006-01-25 2015-05-19 International Business Machines Corporation Method for partitioning programs between a general purpose core and one or more accelerators
US20070250820A1 (en) * 2006-04-20 2007-10-25 Microsoft Corporation Instruction level execution analysis for debugging software
GB2443277B (en) * 2006-10-24 2011-05-18 Advanced Risc Mach Ltd Performing diagnostics operations upon an asymmetric multiprocessor apparatus
US9229726B2 (en) * 2006-10-26 2016-01-05 International Business Machines Corporation Converged call flow and web service application integration using a processing engine
US20080108899A1 (en) * 2006-11-06 2008-05-08 Nahi Halmann Hand-held ultrasound system with single integrated circuit back-end
EP2006784A1 (en) * 2007-06-22 2008-12-24 Interuniversitair Microelektronica Centrum vzw Methods for characterization of electronic circuits under process variability effects
US9223677B2 (en) * 2008-06-11 2015-12-29 Arm Limited Generation of trace data in a multi-processor system

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5088034A (en) * 1988-01-29 1992-02-11 Hitachi, Ltd. Compiling method for determining programs to be executed parallelly by respective processors in a parallel computer which transfer data with a data identifier to other processors
US5414849A (en) * 1992-10-30 1995-05-09 Hitachi, Ltd. Evaluating method of data division patterns and a program execution time for a distributed memory parallel computer system, and parallel program producing method using such an evaluating method
US5799142A (en) * 1994-09-12 1998-08-25 Nec Corporation Debugging method and debugging system for multi-task programs
US6170051B1 (en) * 1997-08-01 2001-01-02 Micron Technology, Inc. Apparatus and method for program level parallelism in a VLIW processor
US6611956B1 (en) * 1998-10-22 2003-08-26 Matsushita Electric Industrial Co., Ltd. Instruction string optimization with estimation of basic block dependence relations where the first step is to remove self-dependent branching
US7047395B2 (en) * 2001-11-13 2006-05-16 Intel Corporation Reordering serial data in a system with parallel processing flows
US20040078779A1 (en) * 2002-10-22 2004-04-22 Bala Dutt Inducing concurrency in software code
US20050188364A1 (en) * 2004-01-09 2005-08-25 Johan Cockx System and method for automatic parallelization of sequential code
US20060005179A1 (en) * 2004-06-30 2006-01-05 Nec Corporation Program parallelizing apparatus, program parallelizing method, and program parallelizing program
US20080127200A1 (en) * 2006-07-04 2008-05-29 Iti Scotland Limited Techniques for program execution
US20090150872A1 (en) * 2006-07-04 2009-06-11 George Russell Dynamic code update
US20080046875A1 (en) * 2006-08-16 2008-02-21 Gad Haber Program Code Identification System and Method

Cited By (142)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8902340B2 (en) 1997-07-12 2014-12-02 Google Inc. Multi-core image processor for portable device
US9544451B2 (en) 1997-07-12 2017-01-10 Google Inc. Multi-core image processor for portable device
US9338312B2 (en) 1997-07-12 2016-05-10 Google Inc. Portable handheld device with multi-core image processor
US8947592B2 (en) 1997-07-12 2015-02-03 Google Inc. Handheld imaging device with image processor provided with multiple parallel processing units
US9179020B2 (en) 1997-07-15 2015-11-03 Google Inc. Handheld imaging device with integrated chip incorporating on shared wafer image processor and central processor
US8953178B2 (en) 1997-07-15 2015-02-10 Google Inc. Camera system with color display and processor for reed-solomon decoding
US9584681B2 (en) 1997-07-15 2017-02-28 Google Inc. Handheld imaging device incorporating multi-core image processor
US9560221B2 (en) 1997-07-15 2017-01-31 Google Inc. Handheld imaging device with VLIW image processor
US9432529B2 (en) 1997-07-15 2016-08-30 Google Inc. Portable handheld device with multi-core microcoded image processor
US9237244B2 (en) 1997-07-15 2016-01-12 Google Inc. Handheld digital camera device with orientation sensing and decoding capabilities
US9219832B2 (en) 1997-07-15 2015-12-22 Google Inc. Portable handheld device with multi-core image processor
US9197767B2 (en) 1997-07-15 2015-11-24 Google Inc. Digital camera having image processor and printer
US9191530B2 (en) 1997-07-15 2015-11-17 Google Inc. Portable hand-held device having quad core image processor
US8908069B2 (en) 1997-07-15 2014-12-09 Google Inc. Handheld imaging device with quad-core image processor integrating image sensor interface
US9191529B2 (en) 1997-07-15 2015-11-17 Google Inc Quad-core camera processor
US9185247B2 (en) 1997-07-15 2015-11-10 Google Inc. Central processor with multiple programmable processor units
US9185246B2 (en) 1997-07-15 2015-11-10 Google Inc. Camera system comprising color display and processor for decoding data blocks in printed coding pattern
US9168761B2 (en) 1997-07-15 2015-10-27 Google Inc. Disposable digital camera with printing assembly
US9148530B2 (en) 1997-07-15 2015-09-29 Google Inc. Handheld imaging device with multi-core image processor integrating common bus interface and dedicated image sensor interface
US9143635B2 (en) 1997-07-15 2015-09-22 Google Inc. Camera with linked parallel processor cores
US9143636B2 (en) 1997-07-15 2015-09-22 Google Inc. Portable device with dual image sensors and quad-core processor
US9137398B2 (en) 1997-07-15 2015-09-15 Google Inc. Multi-core processor for portable device with dual image sensors
US8908075B2 (en) 1997-07-15 2014-12-09 Google Inc. Image capture and processing integrated circuit for a camera
US9131083B2 (en) 1997-07-15 2015-09-08 Google Inc. Portable imaging device with multi-core processor
US9124737B2 (en) 1997-07-15 2015-09-01 Google Inc. Portable device with image sensor and quad-core processor for multi-point focus image capture
US9124736B2 (en) 1997-07-15 2015-09-01 Google Inc. Portable hand-held device for displaying oriented images
US9060128B2 (en) 1997-07-15 2015-06-16 Google Inc. Portable hand-held device for manipulating images
US9055221B2 (en) 1997-07-15 2015-06-09 Google Inc. Portable hand-held device for deblurring sensed images
US8953060B2 (en) 1997-07-15 2015-02-10 Google Inc. Hand held image capture device with multi-core processor and wireless interface to input device
US8823823B2 (en) 1997-07-15 2014-09-02 Google Inc. Portable imaging device with multi-core processor and orientation sensor
US8953061B2 (en) 1997-07-15 2015-02-10 Google Inc. Image capture device with linked multi-core processor and orientation sensor
US8947679B2 (en) 1997-07-15 2015-02-03 Google Inc. Portable handheld device with multi-core microcoded image processor
US8936196B2 (en) 1997-07-15 2015-01-20 Google Inc. Camera unit incorporating program script scanner
US8937727B2 (en) * 1997-07-15 2015-01-20 Google Inc. Portable handheld device with multi-core image processor
US8934053B2 (en) 1997-07-15 2015-01-13 Google Inc. Hand-held quad core processing apparatus
US8934027B2 (en) 1997-07-15 2015-01-13 Google Inc. Portable device with image sensors and multi-core processor
US8928897B2 (en) 1997-07-15 2015-01-06 Google Inc. Portable handheld device with multi-core image processor
US8922670B2 (en) 1997-07-15 2014-12-30 Google Inc. Portable hand-held device having stereoscopic image camera
US8922791B2 (en) 1997-07-15 2014-12-30 Google Inc. Camera system with color display and processor for Reed-Solomon decoding
US20130010150A1 (en) * 1997-07-15 2013-01-10 Kia Silverbrook Portable handheld device with multi-core image processor
US8913137B2 (en) 1997-07-15 2014-12-16 Google Inc. Handheld imaging device with multi-core image processor integrating image sensor interface
US8913182B2 (en) 1997-07-15 2014-12-16 Google Inc. Portable hand-held device having networked quad core processor
US8913151B2 (en) 1997-07-15 2014-12-16 Google Inc. Digital camera with quad core processor
US8908051B2 (en) 1997-07-15 2014-12-09 Google Inc. Handheld imaging device with system-on-chip microcontroller incorporating on shared wafer image processor and image sensor
US9137397B2 (en) 1997-07-15 2015-09-15 Google Inc. Image sensing and printing device
US8902357B2 (en) 1997-07-15 2014-12-02 Google Inc. Quad-core image processor
US8902333B2 (en) 1997-07-15 2014-12-02 Google Inc. Image processing method using sensed eye position
US8902324B2 (en) 1997-07-15 2014-12-02 Google Inc. Quad-core image processor for device with image display
US8896724B2 (en) 1997-07-15 2014-11-25 Google Inc. Camera system to facilitate a cascade of imaging effects
US8896720B2 (en) 1997-07-15 2014-11-25 Google Inc. Hand held image capture device with multi-core processor for facial detection
US8836809B2 (en) 1997-07-15 2014-09-16 Google Inc. Quad-core image processor for facial detection
US8866926B2 (en) 1997-07-15 2014-10-21 Google Inc. Multi-core processor for hand-held, image capture device
US8789939B2 (en) 1998-11-09 2014-07-29 Google Inc. Print media cartridge with ink supply manifold
US8866923B2 (en) 1999-05-25 2014-10-21 Google Inc. Modular camera and printer
US8341635B2 (en) 2008-02-01 2012-12-25 International Business Machines Corporation Hardware wake-and-go mechanism with look-ahead polling
US20110173593A1 (en) * 2008-02-01 2011-07-14 Arimilli Ravi K Compiler Providing Idiom to Idiom Accelerator
US20090199183A1 (en) * 2008-02-01 2009-08-06 Arimilli Ravi K Wake-and-Go Mechanism with Hardware Private Array
US8788795B2 (en) 2008-02-01 2014-07-22 International Business Machines Corporation Programming idiom accelerator to examine pre-fetched instruction streams for multiple processors
US20090199029A1 (en) * 2008-02-01 2009-08-06 Arimilli Ravi K Wake-and-Go Mechanism with Data Monitoring
US8732683B2 (en) 2008-02-01 2014-05-20 International Business Machines Corporation Compiler providing idiom to idiom accelerator
US8725992B2 (en) 2008-02-01 2014-05-13 International Business Machines Corporation Programming language exposing idiom calls to a programming idiom accelerator
US8640142B2 (en) 2008-02-01 2014-01-28 International Business Machines Corporation Wake-and-go mechanism with dynamic allocation in hardware private array
US8640141B2 (en) 2008-02-01 2014-01-28 International Business Machines Corporation Wake-and-go mechanism with hardware private array
US8880853B2 (en) 2008-02-01 2014-11-04 International Business Machines Corporation CAM-based wake-and-go snooping engine for waking a thread put to sleep for spinning on a target address lock
US20090199184A1 (en) * 2008-02-01 2009-08-06 Arimilli Ravi K Wake-and-Go Mechanism With Software Save of Thread State
US20100287341A1 (en) * 2008-02-01 2010-11-11 Arimilli Ravi K Wake-and-Go Mechanism with System Address Bus Transaction Master
US20110173419A1 (en) * 2008-02-01 2011-07-14 Arimilli Ravi K Look-Ahead Wake-and-Go Engine With Speculative Execution
US20090199030A1 (en) * 2008-02-01 2009-08-06 Arimilli Ravi K Hardware Wake-and-Go Mechanism for a Data Processing System
US8612977B2 (en) 2008-02-01 2013-12-17 International Business Machines Corporation Wake-and-go mechanism with software save of thread state
US20090199028A1 (en) * 2008-02-01 2009-08-06 Arimilli Ravi K Wake-and-Go Mechanism with Data Exclusivity
US20110173625A1 (en) * 2008-02-01 2011-07-14 Arimilli Ravi K Wake-and-Go Mechanism with Prioritization of Threads
US20110173630A1 (en) * 2008-02-01 2011-07-14 Arimilli Ravi K Central Repository for Wake-and-Go Mechanism
US8171476B2 (en) 2008-02-01 2012-05-01 International Business Machines Corporation Wake-and-go mechanism with prioritization of threads
US20100293341A1 (en) * 2008-02-01 2010-11-18 Arimilli Ravi K Wake-and-Go Mechanism with Exclusive System Bus Response
US20110173632A1 (en) * 2008-02-01 2011-07-14 Arimilli Ravi K Hardware Wake-and-Go Mechanism with Look-Ahead Polling
US8516484B2 (en) 2008-02-01 2013-08-20 International Business Machines Corporation Wake-and-go mechanism for a data processing system
US8452947B2 (en) 2008-02-01 2013-05-28 International Business Machines Corporation Hardware wake-and-go mechanism and content addressable memory with instruction pre-fetch look-ahead to detect programming idioms
US20110173423A1 (en) * 2008-02-01 2011-07-14 Arimilli Ravi K Look-Ahead Hardware Wake-and-Go Mechanism
US8386822B2 (en) 2008-02-01 2013-02-26 International Business Machines Corporation Wake-and-go mechanism with data monitoring
US20090199189A1 (en) * 2008-02-01 2009-08-06 Arimilli Ravi K Parallel Lock Spinning Using Wake-and-Go Mechanism
US20110173417A1 (en) * 2008-02-01 2011-07-14 Arimilli Ravi K Programming Idiom Accelerators
US8316218B2 (en) 2008-02-01 2012-11-20 International Business Machines Corporation Look-ahead wake-and-go engine with speculative execution
US8312458B2 (en) 2008-02-01 2012-11-13 International Business Machines Corporation Central repository for wake-and-go mechanism
US8250396B2 (en) 2008-02-01 2012-08-21 International Business Machines Corporation Hardware wake-and-go mechanism for a data processing system
US20110173631A1 (en) * 2008-02-01 2011-07-14 Arimilli Ravi K Wake-and-Go Mechanism for a Data Processing System
US8015379B2 (en) 2008-02-01 2011-09-06 International Business Machines Corporation Wake-and-go mechanism with exclusive system bus response
US20090199197A1 (en) * 2008-02-01 2009-08-06 International Business Machines Corporation Wake-and-Go Mechanism with Dynamic Allocation in Hardware Private Array
US8127080B2 (en) 2008-02-01 2012-02-28 International Business Machines Corporation Wake-and-go mechanism with system address bus transaction master
US8225120B2 (en) 2008-02-01 2012-07-17 International Business Machines Corporation Wake-and-go mechanism with data exclusivity
US8145849B2 (en) 2008-02-01 2012-03-27 International Business Machines Corporation Wake-and-go mechanism with system bus response
US10869108B1 (en) 2008-09-29 2020-12-15 Calltrol Corporation Parallel signal processing system and method
US9274930B2 (en) * 2008-12-22 2016-03-01 Microsoft Technology Licensing, Llc Debugging system using static analysis
US20100162217A1 (en) * 2008-12-22 2010-06-24 Microsoft Corporation Debugging System Using Static Analysis
US8239847B2 (en) * 2009-03-18 2012-08-07 Microsoft Corporation General distributed reduction for data parallel computing
US20100241828A1 (en) * 2009-03-18 2010-09-23 Microsoft Corporation General Distributed Reduction For Data Parallel Computing
US8886919B2 (en) 2009-04-16 2014-11-11 International Business Machines Corporation Remote update programming idiom accelerator with allocated processor resources
US20100268915A1 (en) * 2009-04-16 2010-10-21 International Business Machines Corporation Remote Update Programming Idiom Accelerator with Allocated Processor Resources
US20100268791A1 (en) * 2009-04-16 2010-10-21 International Business Machines Corporation Programming Idiom Accelerator for Remote Update
US8082315B2 (en) * 2009-04-16 2011-12-20 International Business Machines Corporation Programming idiom accelerator for remote update
US20100268790A1 (en) * 2009-04-16 2010-10-21 International Business Machines Corporation Complex Remote Update Programming Idiom Accelerator
US8230201B2 (en) 2009-04-16 2012-07-24 International Business Machines Corporation Migrating sleeping and waking threads between wake-and-go mechanisms in a multiple processor data processing system
US8145723B2 (en) * 2009-04-16 2012-03-27 International Business Machines Corporation Complex remote update programming idiom accelerator
US9569272B2 (en) * 2009-07-15 2017-02-14 Commissariat a l'energie atomique et aux energies alternatives Device and method for the distributed execution of digital data processing operations
US20110016293A1 (en) * 2009-07-15 2011-01-20 Commissariat a l'energie atomique et aux energies alternatives Device and method for the distributed execution of digital data processing operations
US8392403B2 (en) 2009-09-18 2013-03-05 Microsoft Corporation Management of data and computation in data centers
US20110072006A1 (en) * 2009-09-18 2011-03-24 Microsoft Corporation Management of data and computation in data centers
US8612952B2 (en) 2010-04-07 2013-12-17 International Business Machines Corporation Performance optimization based on data accesses during critical sections
US9304812B2 (en) 2010-12-16 2016-04-05 Imagination Technologies Limited Multi-phased and multi-threaded program execution based on SIMD ratio
US11947999B2 (en) 2010-12-16 2024-04-02 Imagination Technologies Limited Multi-phased and multi-threaded program execution based on SIMD ratio
GB2486485A (en) * 2010-12-16 2012-06-20 Imagination Tech Ltd Completing execution of one phase of a computer program before scheduling any instructions for the next phase
GB2486485B (en) * 2010-12-16 2012-12-19 Imagination Tech Ltd Method and apparatus for scheduling the issue of instructions in a microprocessor using multiple phases of execution
US10585700B2 (en) 2010-12-16 2020-03-10 Imagination Technologies Limited Multi-phased and multi-threaded program execution based on SIMD ratio
US8566831B2 (en) 2011-01-26 2013-10-22 International Business Machines Corporation Execution of work units in a heterogeneous computing environment
US8533720B2 (en) 2011-02-25 2013-09-10 International Business Machines Corporation Offloading work from one type to another type of processor based on the count of each type of service call instructions in the work unit
US8621430B2 (en) 2011-03-03 2013-12-31 International Business Machines Corporation Method for code transformation supporting temporal abstraction of parameters
US8949809B2 (en) 2012-03-01 2015-02-03 International Business Machines Corporation Automatic pipeline parallelization of sequential code
US10452369B2 (en) 2012-04-13 2019-10-22 International Business Machines Corporation Code profiling of executable library for pipeline parallelization
US9619360B2 (en) 2012-04-13 2017-04-11 International Business Machines Corporation Code profiling of executable library for pipeline parallelization
US8910137B2 (en) 2012-04-13 2014-12-09 International Business Machines Corporation Code profiling of executable library for pipeline parallelization
US20130339923A1 (en) * 2012-06-19 2013-12-19 Charles Chen Xu Data Handling Among Actors in a Dataflow Programming Environment
US8904371B2 (en) * 2012-06-19 2014-12-02 Telefonaktiebolaget L M Ericsson (Publ) Data handling among actors in a dataflow programming environment
US9158537B2 (en) * 2012-12-04 2015-10-13 International Business Machines Corporation Streamlining hardware initialization code
US20140157229A1 (en) * 2012-12-04 2014-06-05 International Business Machines Corporation Streamlining Hardware Initialization Code
US9021426B2 (en) * 2012-12-04 2015-04-28 International Business Machines Corporation Streamlining hardware initialization code
US20140157230A1 (en) * 2012-12-04 2014-06-05 International Business Machines Corporation Streamlining Hardware Initialization Code
US20140195834A1 (en) * 2013-01-04 2014-07-10 Microsoft Corporation High throughput low latency user mode drivers implemented in managed code
US9811319B2 (en) 2013-01-04 2017-11-07 Microsoft Technology Licensing, Llc Software interface for a hardware device
US9323543B2 (en) 2013-01-04 2016-04-26 Microsoft Technology Licensing, Llc Capability based device driver framework
US10949200B2 (en) * 2013-06-16 2021-03-16 President And Fellows Of Harvard College Methods and apparatus for executing data-dependent threads in parallel
US20160313991A1 (en) * 2013-06-16 2016-10-27 President And Fellows Of Harvard College Methods and apparatus for parallel processing
US10296340B2 (en) 2014-03-13 2019-05-21 Arm Limited Data processing apparatus for executing an access instruction for N threads
US9652817B2 (en) 2015-03-12 2017-05-16 Samsung Electronics Co., Ltd. Automated compute kernel fusion, resizing, and interleave
US10475147B2 (en) * 2016-02-12 2019-11-12 Arm Limited Multiple GPU graphics processing system
US20170236244A1 (en) * 2016-02-12 2017-08-17 Arm Limited Graphics processing systems
CN110199269A (en) * 2017-01-23 2019-09-03 三星电子株式会社 Method and electronic device for the data processing between multiprocessor
CN110998540A (en) * 2017-08-01 2020-04-10 微软技术许可有限责任公司 Execution of focus of trace code in debugger
US10732982B2 (en) 2017-08-15 2020-08-04 Arm Limited Data processing systems
US20210271666A1 (en) * 2018-09-28 2021-09-02 Marc Brandis Ag Analyzing a processing engine of a transaction-processing system
CN111476264A (en) * 2019-01-24 2020-07-31 国际商业机器公司 Testing of robustness against access restricted systems
US11836256B2 (en) 2019-01-24 2023-12-05 International Business Machines Corporation Testing adversarial robustness of systems with limited access
US11604752B2 (en) 2021-01-29 2023-03-14 Arm Limited System for cross-routed communication between functional units of multiple processing units
US20230195321A1 (en) * 2021-12-17 2023-06-22 Samsung Electronics Co., Ltd. Storage device and operating method thereof

Also Published As

Publication number Publication date
JP2010507855A (en) 2010-03-11
US8250549B2 (en) 2012-08-21
EP2076837B1 (en) 2010-10-13
TW200821938A (en) 2008-05-16
GB2443277A (en) 2008-04-30
US20080098207A1 (en) 2008-04-24
KR101325229B1 (en) 2013-11-04
EP2076837A1 (en) 2009-07-08
IL197314A0 (en) 2009-12-24
US20080098262A1 (en) 2008-04-24
CN101529391B (en) 2011-06-15
US7809989B2 (en) 2010-10-05
DE602007009857D1 (en) 2010-11-25
WO2008050076A1 (en) 2008-05-02
IL197314A (en) 2012-12-31
US20080114937A1 (en) 2008-05-15
GB2443277B (en) 2011-05-18
KR20090082254A (en) 2009-07-29
JP5054115B2 (en) 2012-10-24
MY144449A (en) 2011-09-30
US8190807B2 (en) 2012-05-29
US20080215768A1 (en) 2008-09-04
GB0709182D0 (en) 2007-06-20
TWI407374B (en) 2013-09-01
CN101529391A (en) 2009-09-09

Similar Documents

Publication Publication Date Title
US20080098208A1 (en) Analyzing and transforming a computer program for executing on asymmetric multiprocessing systems
US10430190B2 (en) Systems and methods for selectively controlling multithreaded execution of executable code segments
Stratton et al. Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs
Ottoni et al. Automatic thread extraction with decoupled software pipelining
ElTantawy et al. MIMD synchronization on SIMT architectures
Grossman et al. CnC-CUDA: declarative programming for GPUs
Arnold et al. Power aware heterogeneous MPSoC with dynamic task scheduling and increased data locality for multiple applications
Reid et al. SoC-C: efficient programming abstractions for heterogeneous multicore systems on chip
Sorensen et al. Specifying and testing GPU workgroup progress models
Sorensen et al. GPU schedulers: how fair is fair enough?
Ročkai Model checking software
Salcic et al. GALS-HMP: A heterogeneous multiprocessor for embedded applications
Bernard et al. On the compilation of a language for general concurrent target architectures
US20140223419A1 (en) Compiler, object code generation method, information processing apparatus, and information processing method
Lankamp Developing a reference implementation for a microgrid of microthreaded microprocessors
Kumar et al. A Modern Parallel Register Sharing Architecture for Code Compilation
Rutgers Programming models for many-core architectures: a co-design approach
Stavrou et al. Hardware budget and runtime system for data-driven multithreaded chip multiprocessor
Bari Achieving Resilience and Maintaining Performance in OpenSHMEM+ X Applications
Traulsen Reactive processing for synchronous languages and its worst case reaction time analysis
Berkovich Parallel Run-Time Verification
Kreiliger Time-predictable GPU execution
Baudisch Synthesis of Synchronous Programs to Parallel Software Architectures
de Oliveira Jr et al. CML: C modeling language
Naji Timing analysis for time-predictable architectures

Legal Events

Date Code Title Description
AS Assignment

Owner name: REGENTS OF THE UNIVERSITY OF MICHIGAN, THE, MICHIGAN

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:LIN, YUAN;REEL/FRAME:020240/0435

Effective date: 20071029

Owner name: ARM LIMITED, UNITED KINGDOM

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:REID, ALASTAIR DAVID;FORD, SIMON ANDREW;REEL/FRAME:020232/0590

Effective date: 20071003

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION