WO2014003611A1 - Processor work distribution - Google Patents

Processor work distribution

Info

Publication number
WO2014003611A1
Authority
WO
WIPO (PCT)
Prior art keywords
processor
configuration
work
load
working
Prior art date
Application number
PCT/SE2012/050722
Other languages
French (fr)
Inventor
Diarmuid CORCORAN
Dietmar Fiedler
Original Assignee
Telefonaktiebolaget L M Ericsson (Publ)
Priority date
Filing date
Publication date
Application filed by Telefonaktiebolaget L M Ericsson (Publ) filed Critical Telefonaktiebolaget L M Ericsson (Publ)
Priority to PCT/SE2012/050722 priority Critical patent/WO2014003611A1/en
Publication of WO2014003611A1 publication Critical patent/WO2014003611A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the configuration load memory areas 611,612 could together form a shared memory area 610.
  • the shared memory areas 610, 110A and 110B could in turn together also form a single shared memory area.
  • a flow chart illustrating an embodiment of the work distribution method is found in Figure 7. In this embodiment each working processor determines at regular intervals its own processor work load. In Figure 7 the regular intervals are monitored by each working processor 121-124 by setting a timer T1.
  • when T1 fires in step 700, the working processor 121-124 determines its own processor work load in % in step 701. The value is stored in the variable LOAD in the corresponding memory area 111-114 in step 702 and timer T1 is restarted in step 703.
  • when a work distribution processor unit (such as processor PU2 123 in Figure 4) receives a process request 410 in step 704, it optionally estimates the execution time for all load status requests (read and transfer signaling) over the processor bus 130 in step 705. It also estimates the execution time for the received process request 410 in step 706. If the execution time SIGN for the load status requests on the bus 130 is estimated to be higher than the execution time PROC for the received process request 410, processor PU2 123 can decide in step 707 to execute the process request 410 itself in step 708, if its own processor load status allows it.
  • otherwise, processor PU2 123 fetches in step 709 the processor load status from all the working processors and transfers the process request 410 to the least loaded working processor in active mode (in this case processor PU1 122) in step 710.
  • Processor PU1 122 puts in step 711 the process request in an internal queue 1212 and sends in step 712 an update 406 of its QEXEC variable by adding the value of the estimated execution time for the process.
  • At least one of the processor units 121-124 could be configured to instruct one (or several) working processor unit(s) including itself to enter a power save mode. This feature is useful at off-peak hours for energy saving.
  • At least one processor unit (as for example PU2 123) determines at regular intervals the total load status for the processor configuration 100. In the flow chart in Figure 8 the regular intervals are monitored by setting a timer T2. In one embodiment, timers T1 and T2 could very well be one and the same timer.
  • processor unit PU2 123 makes in step 802 a status interrogation 510 about the processor load status for all processor units by reading the status variables LOAD, QEXEC and POW_SAVE in the corresponding memory areas 111-114.
  • the received processor load status values LOAD and QEXEC are added together to form a value of the total processor load for all the working processors.
  • if it is judged in step 803 that the total processor load status for all the working processors in active mode is relatively low (below a first threshold), it is possible that one or several active working processor units, as for example PU1 122, can temporarily be released from duty.
  • Processor unit PU2 123 sends in step 804 an instruction 511 to PU1 122 to enter the power save mode.
  • Processor PU2 123 restarts timer T2 in step 806 and processor PU1 122 can now enter power save mode and reduce the total power consumption for the whole processor configuration 100.
  • instead of using timer T2, it is possible to trigger the method to determine the load status at, for example, each 10, 100 or 1000 times the load distribution is performed. This could be made possible by incrementing or decrementing a counter stored in the memory area (e.g. 113) by the processor unit (e.g. PU2 123).
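The periodic power-save decision described in the bullets above (steps 802-806) can be sketched as follows (illustrative Python; the threshold value, the data layout, and the choice of which unit to release are this sketch's assumptions, not specified by the application):

```python
# Sketch of the power-save check: sum the load of all active working
# units and, when the total is below a first threshold, instruct one
# unit (here: the least loaded, an assumption) to enter power save mode.

def power_save_check(memory_areas, threshold=50):
    """memory_areas: unit id -> {'LOAD': percent, 'POW_SAVE': bool}.
    Returns the unit instructed to enter power save mode, or None."""
    active = {pu: a for pu, a in memory_areas.items() if not a["POW_SAVE"]}
    total_load = sum(a["LOAD"] for a in active.values())
    if total_load < threshold and len(active) > 1:
        victim = min(active, key=lambda pu: active[pu]["LOAD"])
        memory_areas[victim]["POW_SAVE"] = True  # instruction 511
        return victim
    return None

areas = {
    "PU0": {"LOAD": 20, "POW_SAVE": False},
    "PU1": {"LOAD": 10, "POW_SAVE": False},
    "PU2": {"LOAD": 15, "POW_SAVE": False},
}
print(power_save_check(areas))  # PU1 (total load 45 is below 50)
```

Repeated calls release further units until only one remains active, which mirrors the per-T2-tick decision of Figure 8.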

Abstract

The application relates to a method and a processor configuration for processor work distribution. The distribution of work among processor cores or units in a processor configuration is normally implemented by global and/or local queues. If several competing cores want to access those queues to add or remove work items, a locking mechanism for the queue has to be applied, such as semaphores or locks. However, deadlocks can occur when processes wait for semaphores to get released but nobody releases them. The solution to this problem is a processor configuration (100) comprising dedicated memory areas (111-114) storing the individual processor load status for each processor unit (121-124) and a method for work distribution using the stored individual processor load status, thereby not requiring a locking mechanism on the memory.

Description

PROCESSOR WORK DISTRIBUTION
TECHNICAL FIELD
The application relates to a method and a processor configuration for processor work distribution.
BACKGROUND
A multi-core processor is a processor configuration being a single computing component with two or more independent actual processors (also called "cores"), which are the units that read and execute program instructions. The instructions are ordinary CPU instructions such as add, move data, and branch, but the multiple cores can run multiple instructions at the same time, increasing overall speed for programs amenable to parallel computing. Manufacturers typically integrate the cores onto a single integrated circuit die (known as a chip multiprocessor or CMP), or onto multiple dies in a single chip package.
The distribution of processor work load among the processor cores is normally implemented by global and/or local queues. If several competing cores want to access those queues to add or remove work items, a locking mechanism for the queue has to be applied, such as semaphores or locks. This has for example been described on the web site http://en.wikipedia.org/wiki/Semaphore. This locking mechanism can however result in unpredictable behavior because the scheduling/work distribution mechanism is not aware of what the application is doing. Due to such a memory lock operation in the queue the application might have to wait until the memory lock is freed. The time of locking/unlocking depends very much on the type of the application. That makes the overall situation unpredictable, especially if several processor cores try to access the same queue competing for the same semaphore or lock. Furthermore, shared memory communication is very error prone; if a process (an instance of a computer program) forgets to unlock the semaphore the whole system can stop working. Deadlocks can occur when processes wait for semaphores to get released but nobody releases them.
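For contrast, the conventional lock-guarded shared queue described above can be sketched as follows (illustrative Python; the queue and function names are hypothetical, not from the application):

```python
# A shared work queue that every core must lock before use: any
# producer/consumer that forgets to release the lock (or deadlocks on
# it) stalls all the others, and wait times are unpredictable. This is
# the mechanism the application sets out to avoid.
import threading
from collections import deque

work_queue = deque()
queue_lock = threading.Lock()  # the semaphore/lock discussed above

def add_work(item):
    with queue_lock:            # every competing core blocks here
        work_queue.append(item)

def take_work():
    with queue_lock:
        return work_queue.popleft() if work_queue else None

add_work("req-1")
print(take_work())  # req-1
```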
SUMMARY
With this background it is the object of the embodiments described below to obviate at least some of the disadvantages mentioned above. The object is achieved by a processor configuration and a method for processor work distribution that does not require a locking mechanism on the memory when distributing the work.
The processor configuration comprises a plurality of processor units (such as processor cores) and a plurality of memory areas where each memory area is dedicated to a corresponding processor unit. The processor units communicate with the memory areas via a processor bus.
At least two of the processor units are working processor units for executing processes and at least one of the processor units is a work distributing processor unit configured to receive requests for processes to be executed and distributing these to the working processor units for execution. Each processor unit could in different embodiments be a working processor unit only, a distributing processor unit only, or a combination of both.
The working processor units are also configured to store information about their own processor load status in their dedicated memory areas. The processor load status could for example be the processor work load and/or the estimated total executing time of queuing processes.
The work distributing processor unit is configured to, when receiving a request for a process to be executed, read the processor load status for the working processor units from their respective dedicated memory areas and to transfer the request to the working processor unit having the lowest processor load. Optionally, each working processor unit is also configured to be in an active mode or in a power save mode.
In a preferred embodiment, each working processor unit is configured to have read and write access capabilities over the processor bus to its dedicated memory area, whereas work distributing processor units have read-only access capabilities to the other memory areas.
The method for processor work distribution in the processor configuration comprises the steps of storing for each working processor unit information about its current processor load status in the corresponding dedicated memory area. When receiving at the work distribution processor unit a request for a process to be executed, the work distribution processor unit reads the stored processor load status for the working processor units from the corresponding dedicated memory areas and transfers the received request to the working processor unit having the lowest processor load.
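The method steps above can be sketched as follows (illustrative Python, with dictionaries standing in for the dedicated memory areas; all names and the data layout are this sketch's assumptions):

```python
# Lock-free work distribution: each working processor unit writes only
# to its own dedicated memory area, so no memory lock is needed; the
# distributing unit merely reads the other areas and picks the least
# loaded unit.

# One dedicated "memory area" per working processor unit (hypothetical layout).
memory_areas = {
    "PU0": {"LOAD": 20},
    "PU1": {"LOAD": 10},
    "PU2": {"LOAD": 15},
    "PU3": {"LOAD": 17},
}

local_queues = {pu: [] for pu in memory_areas}

def store_load_status(pu, load_percent):
    """A working unit updates its own area only (read/write access)."""
    memory_areas[pu]["LOAD"] = load_percent

def distribute(process_request):
    """The distributing unit reads every area (read-only access) and
    transfers the request to the least-loaded working unit's queue."""
    target = min(memory_areas, key=lambda pu: memory_areas[pu]["LOAD"])
    local_queues[target].append(process_request)
    return target

target = distribute("req-1")  # with the loads above, PU1 (10%) is picked
```

Because every memory slot has exactly one writer, readers never need to take a lock, which is the core of the claimed advantage.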
An advantage with this is that no locking mechanism on the memory is needed, as shared queues between the processor units/cores are avoided. This makes systems using this processor configuration more predictable because it is not necessary to wait for the unlocking of, for example, semaphores. The systems will also be more robust as deadlocks cannot happen. The method also optionally allows individual working processors to enter a power save mode, which enables more power efficient implementations.
The embodiments will now be described in more detail and referring to the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is a block diagram illustrating the processor configuration.
Figure 2 is a block diagram illustrating the read and read/write capabilities for the processor units.
Figure 3 is a block diagram illustrating the working processor units determining its processor load status.
Figure 4 is a block diagram illustrating a work distributing processor unit distributing the work to a working processor unit.
Figure 5 is a block diagram illustrating a processor unit ordering another processor unit to enter power save mode.
Figure 6 is a block diagram illustrating a processing system comprising a plurality of interconnected processor configurations.
Figure 7 is a flow chart illustrating an embodiment of the work distribution method.
Figure 8 is a flow chart illustrating an embodiment of processor units entering active or power save mode.
DETAILED DESCRIPTION
As said above, a multi-core processor is a processor configuration being a single computing component with two or more independent actual processors (called "cores") which are the units that read and execute program instructions. In contrast to multi-core systems, the term multi-CPU refers to multiple physically separate processing units (which often contain special circuitry to facilitate communication with each other). However, the embodiments described below could be implemented in any of these configurations with the same advantageous effect, and therefore the term 'processor unit' corresponds to a CPU in a multi-CPU configuration as well as to a processor core in a multi-core processor configuration. Figure 1 illustrates an embodiment of a processor configuration 100 comprising a plurality of processor units PU0, PU1, PU2, ..., PUN, 121-124 and a plurality of memory areas MA0, MA1, MA2, ..., MAN, 111-114.
The memory areas 111-114 could in one embodiment together form a single shared memory area SMA 110. Each memory area 111-114 is dedicated to one and only one of the processor units 121-124. Each processor unit 121-124 is configured to communicate with the memory areas 111-114 and the other processor units 121-124 via a communication bus 130. At least two of the processor units 121-124 are configured as working processor units for executing processes and at least one of the processor units 121-124 is a work distributing processor unit configured to receive requests for processes to be executed (process requests) and distributing these to the working processor units for execution. Each processor unit could in different embodiments be a working processor unit only, a work distributing processor unit only, or a combination of both. A processor configuration could for example be
- five working processor units and one work distributing processor unit or
- four working processor units and a combined work distributing processor unit/working processor unit or
- six working processor units and two work distributing processor units or
- six processor units where all are combined work distributing processor units and working processor units.
Each working processor unit can be configured with a local queue 1211-1214 in which the transferred process requests are stored. Alternatively, the local queue for each working processor unit can be implemented in the corresponding dedicated memory area 111-114.
Each dedicated memory area 111-114 stores variables related to the status of the corresponding processor unit 121-124. Two variables related to the processor load status of each corresponding processor unit 121-124 are stored in the variables LOAD 1111 and QEXEC 1112 respectively. LOAD stores the current status value of the corresponding processor's work load, normally expressed in percent (%) of the total processing capability. The other variable QEXEC stores the value of the total estimated execution time (in fractions of seconds) for the processes queuing for execution in the local queues 1211-1214.
In different embodiments, the processor load status used in the work distribution method is defined by either the variable LOAD only, the variable QEXEC only or a combination of both variables LOAD and QEXEC where in the combination the execution time in QEXEC is converted to a load percentage and added up with the value of LOAD. Which variable or combination thereof that is used in the work distribution method can depend on the expected processing behavior. For example, in systems where only small peak loads are expected the LOAD variable is preferred whereas in systems where peak loads are expected, the QEXEC variable is preferred.
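The combination of LOAD and QEXEC might be computed as in the following sketch (the conversion factor from queued seconds to a load percentage is an assumption; the application does not specify it):

```python
# Combined processor load status as described above: the queued
# execution time QEXEC is converted to a load percentage and added to
# LOAD. How many seconds of queued work correspond to 100% load (the
# measurement interval) is a hypothetical parameter here.

def effective_load(load_percent, qexec_seconds, interval_seconds=1.0):
    """Return LOAD plus QEXEC expressed as a load percentage."""
    qexec_percent = 100.0 * qexec_seconds / interval_seconds
    return load_percent + qexec_percent

# Example: 15% measured load plus 0.2 s of queued work over a 1 s interval.
print(effective_load(15, 0.2))  # 35.0
```

Using LOAD alone, QEXEC alone, or this sum then corresponds to the three variants the text describes.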
Another variable related to the status of the corresponding processor unit 121-124 is POW_SAVE (1115 in Figure 5), which indicates whether the corresponding processor is in power save mode (POW_SAVE = YES) or in active mode (POW_SAVE = NO) . More about this feature is described further below.
Figures 2 to 4 illustrate an embodiment where all processor units 121-124 are combined working processor units and work distributing processor units. In Figure 2 each one of the processor units 121-124 is configured to have read and write access 211 to its dedicated memory area 111-114 but only read access 212 to the other memory areas.
In one embodiment each working processor 121-124 is further configured to determine at regular intervals its own processor work load. This is illustrated in Figure 3 and in Figure 7. For example, processor units 121-124 determine at a certain point in time that their own current processor work loads are 20%, 10%, 15% and 17% respectively. This load status value is stored in the variable LOAD 1111 in the corresponding dedicated memory area 111-114. The value of the total estimated execution time for the processes queuing for execution (variable QEXEC) can also be updated at regular intervals, but is preferably updated (as in step 712 in Figure 7) each time a process request (410 in Figure 4) is received by the working processor 121-124. A processor unit that is a work distribution processor unit only does not need to update its load status and would only need read access to the other memory areas in order to perform the work distribution. Each processor unit 121-124 configured as a work distribution processor unit is further configured to receive process requests and to forward these to one of the working processor units 121-124. This is illustrated in Figure 4. When for example processor unit PU2, 123 receives a process request 410, it determines in real time the current processor load status of all the working processor units 121-124 by reading 405 out the value from the LOAD variable 1111 and/or the QEXEC variable 1112 in each corresponding memory area 111-114. The processor unit PU2, 123 selects the processor unit having the least load for the moment (e.g. the least processor work load (variable LOAD) or the shortest estimated queued execution time (variable QEXEC)) among the processor units in active mode (including the processor unit PU2, 123 itself). In the example in Figure 4 the processor unit with the least processor work load (10%) is PU1, 122. Processor unit PU2, 123 signals to processor unit PU1, 122 and forwards the process request 410 to the processor unit PU1, 122.
Processor unit PU1, 122 puts the received process request 410 in its internal queue 1212 and stores an update 406 of its QEXEC variable by adding the value of the estimated execution time for the process. Again, in another embodiment, processor unit PU2, 123 can select the processor unit having the shortest estimated queued execution time (variable QEXEC). Each processor unit configured as a combined work distribution processor unit and working processor unit can optionally be configured to determine the signaling overhead on the processor bus 130. If, for example, the execution time of the signaling overhead is larger than the estimated execution time of the process to be executed, the processor unit 123 receiving the process request 410 can execute the process itself without interrogating the other processor units 121,122,124 about their processor load.
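The queueing and QEXEC update steps might look like the following sketch. The class and attribute names are assumptions made for illustration only:

```python
from collections import deque

class WorkingProcessor:
    """Minimal model of a working processor unit with an internal
    queue (1212 in the figures) and a dedicated memory area holding
    the LOAD and QEXEC status variables."""
    def __init__(self, name):
        self.name = name
        self.queue = deque()                # internal process queue
        self.mem = {"LOAD": 0, "QEXEC": 0}  # dedicated memory area

    def accept(self, request, est_exec_time):
        # Enqueue the request and add its estimated execution time
        # to QEXEC (the update 406 in Figure 4).
        self.queue.append(request)
        self.mem["QEXEC"] += est_exec_time

pu1 = WorkingProcessor("PU1")
pu1.accept("req-410", est_exec_time=3)
pu1.accept("req-411", est_exec_time=2)
print(pu1.mem["QEXEC"])  # -> 5
```

Because QEXEC grows with each accepted request and shrinks as processes complete, it tracks the queued backlog rather than the instantaneous CPU load.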
A plurality of processor configurations 100 as described above could be interconnected to form a larger processor system. This may be the case when a high signaling overhead on the processor bus 130 is expected. An embodiment of this system 600 is illustrated in Figure 6. In Figure 6 the processor system 600 comprises two processor configurations 100A, 100B and two configuration load memory areas 611, 612 dedicated to each processor configuration 100A, 100B. The processor configurations 100A, 100B and the configuration load memory areas 611, 612 are interconnected via a processor bus 630.
Each processor configuration 100A, 100B is configured with a selected work distributing processor unit 151,152 configured to store the total configuration load status TA, TB for each processor configuration 100A, 100B in the corresponding configuration load memory areas 611,612. In one embodiment, the selected work distributing processor units 151,152 are configured to have read and write access capabilities 621 to the corresponding configuration load memory areas 611,612 and read only access capabilities 622 to the other configuration load memory areas 611,612.
The selected work distributing processor units 151,152 are further configured to receive a request 640 for execution of a process, to read the total configuration load status TA, TB for each processor configuration 100A, 100B and to transfer the process request to the selected work distributing processor unit 151,152 in the processor configuration 100A, 100B having the lowest total configuration load. The configuration load memory areas 611,612 could together form a shared memory area 610. The shared memory areas 610, 110A and 110B could in turn together also form a single shared memory area. A flow chart illustrating an embodiment of the work distribution method is found in Figure 7. In this embodiment each working processor determines its own processor work load at regular intervals. In Figure 7 the regular intervals are monitored by each working processor 121-124 by setting a timer T1. When T1 fires in step 700 the working processor 121-124 determines its own processor work load in % in step 701. The value is stored in the variable LOAD in the corresponding memory area 111-114 in step 702 and timer T1 is restarted in step 703.
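The two-level selection in Figure 6 — first pick the configuration with the lowest total load TA/TB, then the least loaded unit inside it — could be sketched as below. The load values and names are illustrative assumptions:

```python
# Assumed data layout: per-configuration totals (TA, TB) stored in the
# configuration load memory areas, plus per-unit loads inside each one.
config_loads = {"100A": 62, "100B": 38}
unit_loads = {
    "100A": {"PU0": 35, "PU1": 27},
    "100B": {"PU2": 16, "PU3": 22},
}

def select_target(config_loads, unit_loads):
    """Pick the configuration with the lowest total configuration
    load, then the least loaded unit within that configuration."""
    cfg = min(config_loads, key=config_loads.get)
    pu = min(unit_loads[cfg], key=unit_loads[cfg].get)
    return cfg, pu

print(select_target(config_loads, unit_loads))  # -> ('100B', 'PU2')
```

Only the configuration totals cross the system-level bus 630, which is the point of the two-level scheme: the per-unit load reads stay on the local bus 130 of the chosen configuration.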
When a work distribution processor unit (such as processor PU2 123 in Figure 4) receives a process request 410 in step 704, it optionally estimates the execution time for all load status requests (read and transfer signaling) over the processor bus 130 in step 705. It also estimates the execution time for the received process request 410 in step 706. If the execution time SIGN for the load status requests on the bus 130 is estimated to be higher than the execution time PROC for the received process request 410, processor PU2 123 can take the decision in step 707 to execute in step 708 the process request 410 by itself, if its own processor load status allows it.
If the execution time SIGN for the load status requests is determined in step 707 to be lower than the execution time PROC for the received process request 410, processor PU2 123 fetches in step 709 the processor load status from all the working processors and transfers the process request 410 to the least loaded working processor in active mode (in this case processor PU1 122) in step 710. Processor PU1 122 puts in step 711 the process request in an internal queue 1212 and sends in step 712 an update 406 of its QEXEC variable by adding the value of the estimated execution time for the process.
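The decision in steps 705-710 — compare the estimated bus-signaling time SIGN with the process execution time PROC before interrogating the other units — can be sketched as follows. This is a hypothetical illustration; the variable names follow the flow chart:

```python
def handle_request(own_pu, mem_areas, sign_time, proc_time):
    """If reading the load status over the bus would cost more than
    the process itself (SIGN > PROC), execute locally (step 708);
    otherwise fetch the load status and forward the request to the
    least loaded active working unit (steps 709-710)."""
    if sign_time > proc_time:
        return own_pu  # self-execute, skip the bus round-trips
    active = {pu: m for pu, m in mem_areas.items() if not m["POW_SAVE"]}
    return min(active, key=lambda pu: active[pu]["LOAD"])

areas = {
    "PU1": {"LOAD": 10, "POW_SAVE": False},
    "PU2": {"LOAD": 15, "POW_SAVE": False},
}
print(handle_request("PU2", areas, sign_time=7, proc_time=3))  # -> PU2
print(handle_request("PU2", areas, sign_time=1, proc_time=3))  # -> PU1
```

The short-circuit keeps cheap requests off the bus entirely, which matters exactly in the high-overhead scenario that motivates the multi-configuration system 600.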
As an option, at least one of the processor units 121-124 could be configured to instruct one (or several) working processor unit(s), including itself, to enter a power save mode. This feature is useful at off-peak hours for energy saving. One embodiment of this is illustrated in Figure 5 and in the corresponding Figure 8. At least one processor unit (for example PU2 123) determines at regular intervals the total load status for the processor configuration 100. In the flow chart in Figure 8 the regular intervals are monitored by setting a timer T2. In one embodiment, timers T1 and T2 could very well be one and the same timer.
When timer T2 fires in step 801, processor unit PU2 123 makes in step 802 a status interrogation 510 about the processor load status for all processor units by reading the status variables LOAD, QEXEC and POW_SAVE in the corresponding memory areas 111-114. The received processor load status values LOAD and QEXEC are added together to form a value of the total processor load for all the working processors.
If it is judged in step 803 that the total processor load status for all the working processors in active mode is relatively low (below a first threshold), it is possible that one or several active working processor units, for example PU1 122, can temporarily be released from duty. Processor unit PU2 123 sends in step 804 an instruction 511 to PU1 122 to enter the power save mode. Processor PU1 122 sends in step 805 an update 512 to its corresponding memory area 112 and sets the power save mode variable POW_SAVE = YES. Processor PU2 123 restarts timer T2 in step 806 and processor PU1 122 can now enter power save mode and reduce the total power consumption for the whole processor configuration 100. Conversely, if it is judged in steps 803 and 807 that the total processor load status for all the working processors in active mode together is relatively high (above a second threshold greater than the first), processor unit PU2 123 (or some other load distribution processor unit) can in step 808 instruct PU1 122 to 'wake up' again and enter active mode (updating the power save variable POW_SAVE = NO in step 809).
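The two-threshold test in steps 803 and 807 can be sketched as below. The threshold values 30 and 70 are assumptions for illustration; the patent only requires the second threshold to be greater than the first:

```python
def power_mode_decision(total_load, low=30, high=70):
    """Sketch of steps 803/807 with assumed thresholds: below the
    first threshold a working unit may sleep (POW_SAVE = YES,
    step 804); above the second, greater threshold a sleeping unit
    is woken (POW_SAVE = NO, step 808). The gap between the two
    thresholds gives hysteresis, so units do not flap between modes."""
    if total_load < low:
        return "enter_power_save"
    if total_load > high:
        return "wake_up"
    return "no_change"

print(power_mode_decision(20))  # -> enter_power_save
print(power_mode_decision(50))  # -> no_change
print(power_mode_decision(85))  # -> wake_up
```

Requiring the second threshold to exceed the first is what prevents a unit from being put to sleep and immediately woken when the load hovers near a single cut-off.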
As an alternative to using timer T2 (in step 801), the method to determine the load status could be triggered, for example, every 10, 100 or 1000 times the load distribution is performed. This could be made possible by incrementing or decrementing a counter stored in the memory area (e.g. 113) by the processor unit (e.g. PU2 123).
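The counter-based alternative might look like this minimal sketch (class name and period value are assumptions):

```python
class DispatchCounter:
    """Assumed sketch: trigger the total-load check every Nth work
    distribution instead of on a timer, using a counter stored in
    the memory area (e.g. 113)."""
    def __init__(self, period=100):
        self.period = period
        self.count = 0  # counter value kept in the memory area

    def tick(self):
        """Call once per load distribution; returns True when the
        total load status should be determined (every Nth call)."""
        self.count += 1
        if self.count >= self.period:
            self.count = 0
            return True
        return False

c = DispatchCounter(period=3)
print([c.tick() for _ in range(6)])
# -> [False, False, True, False, False, True]
```

Tying the check to dispatch activity rather than wall-clock time means the power-save evaluation runs more often under heavy traffic and less often when the system is idle.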

Claims

1. A processor configuration (100) for processor work distribution comprising:
- a plurality of processor units (121-124);
- a plurality of memory areas (111-114), each memory area (111-114) being dedicated to a corresponding processor unit (121-124);
- a processor bus (130) for communication between the processor units (121-124) and the memory areas (111-114); wherein at least two of the processor units are working processor units (121,122,124) configured to execute processes and to store information about their own current processor load status in their dedicated memory areas (111,112,114) and wherein at least one processor unit is a work distributing processor unit (123) configured to receive a request (410) for execution of a process and to read the processor load status for the working processor units (121,122,124) from their respective dedicated memory areas (111,112,114) and to transfer the request (410) to the working processor unit (122) having the lowest processor load.
2. A processor configuration (100) as in claim 1 wherein the working processor units (121,122,124) are configured to have read and write access capabilities (211) to their dedicated memory areas (111,112,114) and the work distributing processor unit (123) is configured to have read only access capabilities (212) to all working processor units' dedicated memory areas (111,112,114).
3. A processor configuration (100) as in claim 1 or 2 where the current processor load status is based upon the processor's current work load or the estimated total execution time of queuing processes or a combination thereof.
4. A processor configuration (100) as in claim 3 wherein the working processor units (121,122,124) are further configured to store information about the processor's current work load at regular intervals in the dedicated memory area (111,112,114).
5. A processor configuration (100) as in claim 3 wherein the working processor units (121,122,124) are further configured to store information about the estimated total execution time of queuing processes each time a process request (410) is received.
6. A processor configuration (100) as in any preceding claim wherein at least one of the processor units (121-123) is configured to be both a work distributing processor unit (123) and a working processor unit (122) .
7. A processor configuration (100) as in claim 6 wherein each work distribution processor unit (123) is further configured to estimate the execution time for read and transfer signaling on the processor bus (130) and to estimate the execution time for the process requested to be executed and to execute the process by itself as a working processor if the estimated execution time for the signaling is greater than the estimated execution time for the process.
8. A processor configuration (100) as in any preceding claim wherein each working processor unit (121,122,124) further comprises a local queue (1212) in which the transferred process request (410) is stored before execution.
9. A processor configuration (100) as in any preceding claim wherein at least one of the working processor units (121,122,124) is further configured to go from active mode to power save mode or vice versa when the total configuration load status changes and to store information about this mode in its dedicated memory area (111,112,114).
10. A processor configuration (100) as in any preceding claim wherein at least one of the processor units (123) is further configured to order another working processor unit (122) to go from active mode to power save mode or vice versa.
11. A processor configuration (100) as in any preceding claim wherein the plurality of memory areas (111-114) together form a single shared memory area (110).
12. A processing system (600) comprising
- at least two processor configurations (100A, 100B) according to any of the claims 1 to 11;
- at least two configuration load memory areas (611,612) each configured to store the total configuration load status for a corresponding processor configuration ( 100A, 100B) ;
- a processor bus (630) for communication between the processor configurations (100A, 100B) and the configuration load memory areas (611,612); wherein each processor configuration (100A, 100B) is configured with a selected work distributing processor unit (151,152) configured to store the total configuration load status in the corresponding configuration load memory areas (611,612) and further configured to receive a request (640) for execution of a process and to read the total configuration load status for each processor configuration (100A, 100B) from the configuration load memory areas (611,612) and transfer the request (640) to the selected work distributing processor unit (151,152) in the processor configuration (100A, 100B) having the lowest total configuration load.
13. A processing system (600) as in claim 12 wherein the selected work distributing processor units (151,152) are configured to have read and write access capabilities (621) to the corresponding configuration load memory areas (611,612) and read only access capabilities (622) to the other configuration load memory areas (611,612).
14. A processing system (600) as in claim 12 or 13 wherein all the memory areas (110A, 110B, 610) together form a single shared memory area.
15. A method for distributing processor work in a processor configuration (100) comprising:
- a plurality of processor units (121-124) where at least two of the processor units are working processor units (121,122,124) configured to execute processes and where at least one processor unit is a work distributing processor unit (123);
- a plurality of memory areas (111-114), where each memory area (111-114) is dedicated to one processor unit (121-124);
- a processor bus (130) for communication between the processor units (121-124) and the memory areas (111-114); the method comprising the steps of:
- storing (702) information about the current processor load status for each working processor unit (121,122,124) in the corresponding dedicated memory area (111,112,114);
- receiving (704) at the work distribution processor unit (123) a request (410) for execution of a process;
- reading (709) by the work distribution processor unit (123) the processor load status for the processor units (121,122,124) from the corresponding dedicated memory area (111, 112, 114) ;
- transferring (710) the request (410) to the processor unit (122) having the lowest processor load.
16. A method for distributing work as in claim 15 where the current processor load status is based upon the processor's current work load or the estimated total execution time of queuing processes or a combination thereof.
17. A method for distributing work as in claim 16 including the step of storing (702) information about the processor's current work load at regular intervals.
18. A method for distributing work as in claim 16 including the step of storing (702) information about the estimated total execution time of queuing processes each time a process request (410) is received.
19. A method for distributing work as in any of the claims 15 to 18 wherein at least one of the processor units (121-123) is a combined processor unit configured to be both a work distributing processor unit (123) and a working processor unit (122) and where the method further comprises the steps of:
- estimating (705) the execution time for read and transfer signaling on the processor bus (130);
- estimating (706) the execution time for the process requested to be executed;
- executing (708) the process by the combined processor unit if the estimated execution time for the signaling is greater than the estimated execution time for the process.
20. A method for distributing work as in any of the claims 15 to 19 where the method further comprises the step of: - storing (711) the process request (410) in a local queue (1211,1212,1214) in the working processor unit (121,122,124) for subsequent execution.
PCT/SE2012/050722 2012-06-27 2012-06-27 Processor work distribution WO2014003611A1 (en)


Publications (1)

Publication Number Publication Date
WO2014003611A1 2014-01-03



