CN102508776A

CN102508776A - Automatic construction method for evaluation stimulus of multi-thread cross double-precision short-vector structure

Info

Publication number: CN102508776A
Application number: CN2011103428031A
Authority: CN
Inventors: 李春江; 杜云飞; 易会战; 杨灿群; 黄春; 陈娟; 赵克佳; 王锋; 彭林; 左克
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2011-11-03
Filing date: 2011-11-03
Publication date: 2012-06-20
Anticipated expiration: 2031-11-03
Also published as: CN102508776B

Abstract

The invention discloses an automatic construction method for an evaluation stimulus of a multi-thread cross double-precision short-vector structure. The method comprises the following steps: inputting the vector operation type and the vector length to be evaluated; automatically creating an empty assembly language file to serve as the evaluation stimulus; writing a multi-thread run initialization code segment, an evaluation stimulus control structure, a multi-thread vector operation program segment, an evaluation stimulus synchronization structure, a data segment and data segment initialization statements into the assembly language file; and thereby forming the complete evaluation stimulus of the multi-thread cross double-precision short-vector structure. With this method, evaluation stimuli can be constructed automatically in batches, construction time is shortened, cost is saved, stimulus development efficiency is improved, and both processor verification and processor performance evaluation are made more convenient.

Description

Automatic construction method for an evaluation stimulus of a multi-thread cross double-precision short-vector structure

Technical field

The present invention relates to the fields of processors and of evaluation and testing, and in particular to an automatic construction method for evaluation stimuli targeting processors with a multi-thread cross double-precision short-vector structure.

Background technology

As the integration density of processor chips keeps increasing, implementing double-precision short-vector units in processor cores to support data-intensive scientific and engineering computing has become an important development trend. Extending a multi-threaded processor core with a double-precision short-vector unit can greatly increase the double-precision floating-point performance of the processor. Such a unit requires wide short-vector registers (Intel AVX currently supports 256-bit short vectors holding 4 double-precision values) and a corresponding vector instruction set that supports double-precision computation.

Figure 1 is a schematic diagram of a multi-threaded processor core extended with a double-precision short-vector unit. The core is based on OpenSPARC T2 and adds a vector processing unit (VPU) that supports short-vector operations on 4 double-precision values and can be used concurrently by multiple threads. The core uses round-robin multithreading to support 8 hardware threads, organized in groups of 4. In each clock cycle the processor selects the current instruction of one thread from each group of 4 threads for execution; this instruction may be a vector instruction or a scalar instruction. When an instruction of some thread stalls the pipeline, for example because of a cache miss, the multi-thread cross vector microprocessor fetches and executes instructions from other threads, which hides the latency and keeps the pipeline full. The functional units of the processor core shown in Figure 1 are as follows:

1) Trap logic unit (TLU): updates the machine state and handles exceptions and interrupts. To match the VPU extension of the target processor, the TLU has been extended accordingly to support VPU state updates and exception handling.

2) Instruction fetch unit (IFU): in each clock cycle fetches one instruction from each thread group and issues it to the corresponding execution unit (EXU0/1, FGU, LSU, VPU) according to the instruction type.

3) Integer execution units (EXU0/1): execute integer instructions. The processor contains two integer execution units (numbered 0 and 1); each group of 4 threads shares one integer execution unit.

4) Floating-point and graphics unit (FGU): executes scalar floating-point instructions and instructions that support image processing.

5) Load/store unit (LSU): executes all memory access instructions.

6) Memory management unit (MMU): performs address translation and memory management in cooperation with the LSU during memory accesses.

7) Vector processing unit (VPU): executes the short-vector instructions that operate on 4 double-precision values.

8) Communication unit (Gasket): handles communication between the processor core and the level-2 cache or other processor cores.

To implement the multi-thread cross double-precision short-vector processor, the VPU was added to the original processor core, and the TLU, IFU, LSU and MMU were extended to cooperate with the VPU and support double-precision short-vector operations. In line with these changes to the core structure, the processor implements a short-vector instruction set comprising vector memory access instructions, vector computation instructions, vector compare instructions, vector shift instructions, state manipulation instructions and so on.

The processor core with the added vector processing unit (VPU) supports at most 8 hardware threads; multiple threads using the VPU concurrently constitute the multi-thread cross double-precision short-vector architecture. The instruction stream of each thread that uses the vector unit contains several types of instructions. On this processor, instructions are executed as follows:

In each clock cycle the IFU obtains two instructions, from two different threads, out of the current instructions of the 8 hardware threads, and decides by instruction type which functional unit each instruction is sent to for execution. If both are integer instructions, they can be issued simultaneously to the two integer execution units. If both are memory access instructions, vector floating-point instructions or scalar floating-point instructions, one is issued first and the other is issued in the next clock cycle. When several threads that use the VPU execute in the processor at the same time, vector memory access instructions and vector computation instructions from different threads execute concurrently on the LSU and the VPU.

This multi-thread cross double-precision short-vector structure hides the latency of long-latency instructions and improves the overall performance of the processor.

Compared with traditional SIMD extensions aimed at streaming-media computing, a double-precision short-vector unit uses different registers, a different data path and an entirely different instruction set. Verifying processors of this architecture and evaluating their performance therefore requires a large number of evaluation stimuli. The evaluation stimuli used in processor verification and performance evaluation are assembly language programs for the target processor architecture. During verification, a test stimulus is loaded onto the test platform of the processor and run in order to verify the correctness of the processor design; in addition, processor performance can be evaluated from the execution time of the test stimulus and the amount of computation it contains. These programs are usually written by hand by development and test engineers, which is laborious and time-consuming. Because instruction set architectures and short-vector extension methods differ between processors, existing evaluation stimuli for multi-threaded use of short-vector functional units cannot be inherited or reused.

Summary of the invention

The technical problem to be solved by the present invention is as follows: in view of the problems of the prior art, the invention provides an automatic construction method for evaluation stimuli of a multi-thread cross double-precision short-vector structure that is easy to use, reduces the workload of personnel and shortens the time required.

To solve the above technical problem, the present invention adopts the following technical solution:

An automatic construction method for an evaluation stimulus of a multi-thread cross double-precision short-vector structure, characterized by comprising the following steps:

(1) inputting the vector operation type to be evaluated and the vector length;

(2) automatically creating an empty assembly language file to serve as the evaluation stimulus;

(3) writing the following contents into said assembly language file:

(3.1) a multi-thread run initialization code segment;

(3.2) an evaluation stimulus control structure, comprising: a code segment for starting the multi-thread execution mode, which sets the multi-thread enable register so that the processor enters the multi-threaded working state; and a thread selection and jump code segment, which reads the private thread number register of each thread and jumps to the code of the corresponding thread according to the thread number;

(3.3) a multi-thread vector operation program segment, comprising: a main thread vector operation code segment, used to distribute the computation task among the threads, compute the start address and vector length of the operands, read the source operand vector and the destination operand vector, and execute the short-vector operation in a loop; and a slave thread vector operation code segment, used to read the source operand vector and the destination operand vector and execute the short-vector operation;

(3.4) an evaluation stimulus synchronization structure, comprising: a main thread synchronization code segment, used to check and wait until all threads have completed their vector operations; and a slave thread synchronization code segment, used to signal to the main thread whether this thread has completed its vector operation;

(3.5) a data segment and data segment initialization statements, said data segment being a multi-thread shared data segment that contains the source operand vector and the destination operand vector shared by the threads;

(4) taking the assembly language file obtained in step (3) as the automatically generated evaluation stimulus of the multi-thread cross double-precision short-vector structure.
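As a non-limiting illustration, steps (1) to (4) can be sketched as a small generator program. The sketch below is written in Python under stated assumptions: the function names, the operation name `vadd` used in the usage note, and the `!` comment prefix (the usual SPARC assembler comment character) are illustrative choices and are not taken from the original text. Each section body is a commented placeholder rather than real instructions of the target processor.

```python
from pathlib import Path

# Section order follows steps (3.1) to (3.5). Bodies are placeholders only.
SECTIONS = [
    ("3.1", "multi-thread run initialization code segment"),
    ("3.2", "control structure: enable multi-threading, select thread and jump"),
    ("3.3", "multi-thread vector operation program segment"),
    ("3.4", "synchronization structure"),
    ("3.5", "data segment and data segment initialization statements"),
]


def build_stimulus(op_type: str, vector_length: int) -> str:
    """Return the text of one stimulus file as commented placeholders."""
    lines = [f"! stimulus: op={op_type} length={vector_length}"]
    for step, title in SECTIONS:
        lines.append(f"! ({step}) {title}")
        lines.append(f"!   <code for step {step} goes here>")
    return "\n".join(lines) + "\n"


def write_stimulus(path: Path, op_type: str, vector_length: int) -> Path:
    """Create the file empty (step 2), fill it (step 3), return it (step 4)."""
    path.write_text("")
    path.write_text(build_stimulus(op_type, vector_length))
    return path
```

For example, `write_stimulus(Path("foo.s"), "vadd", 64)` would produce one file whose sections appear in the order listed in step (3).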

As a further improvement of the present invention:

In said step (3.3), the multi-thread vector operation program segment is created by the following steps:

(3.3.1) according to the input vector operation type and vector length, distributing the computation task among the threads (the main thread and all slave threads), and then determining the start position and length of the vector on which each thread operates;

(3.3.2) for each thread, computing the source operand address and the loop count from the thread number and the vector length, and setting the base address register and the loop count register;

(3.3.3) for each thread, computing the destination operand address from the thread number, and setting the destination operand base address register;

(3.3.4) for each thread, according to its computation task, inserting the assembly instructions for vector load, vector operation and result write-back into the text section of the assembly language program, thereby forming the main thread vector operation code segment and the slave thread vector operation code segments.
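The address and loop-count calculation of steps (3.3.1) to (3.3.3) can be illustrated as follows. The original text does not state how the task is divided, so this Python sketch assumes a contiguous block distribution in whole short vectors of 4 doubles; the function name `partition` and the tuple layout are hypothetical.

```python
LANES = 4  # 4 doubles per short vector, as described for the VPU


def partition(vector_length: int, n_threads: int, elem_bytes: int = 8):
    """Split a vector into contiguous per-thread chunks of whole short vectors.

    Returns a list of tuples
    (thread_id, start_element, length, byte_offset, loop_count),
    where byte_offset is added to the operand base address and loop_count
    is the number of short-vector iterations the thread executes.
    """
    if vector_length % LANES:
        raise ValueError("this sketch requires a multiple of 4 elements")
    total_vecs = vector_length // LANES
    base, extra = divmod(total_vecs, n_threads)
    plan, start = [], 0
    for tid in range(n_threads):
        # The first `extra` threads take one additional short vector.
        vecs = base + (1 if tid < extra else 0)
        length = vecs * LANES
        plan.append((tid, start, length, start * elem_bytes, vecs))
        start += length
    return plan
```

With 8 threads and a vector of 64 elements, every thread receives 8 elements and runs 2 loop iterations. When the number of short vectors is not divisible by the thread count, the lower-numbered threads take the remainder in this sketch; other distributions are equally possible.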

In said step (3.5), the shared data segment is constructed by the following steps:

(3.5.1) generating a double-precision vector as the source operand with a random number generator for double-precision floating-point data, the vector length being specified by the user, and converting the double-precision values in the vector to hexadecimal to form the source operand vector;

(3.5.2) reserving storage space for the destination operand according to the input vector length, to serve as the destination operand vector.
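A minimal sketch of step (3.5.1) in Python is given below. It assumes that "hexadecimal" means the 64-bit IEEE 754 bit pattern of each double written big-endian, and that the random values are drawn uniformly from [-1, 1]; neither detail is specified in the original text, and the function names are illustrative.

```python
import random
import struct


def double_to_hex(x: float) -> str:
    """Big-endian IEEE 754 bit pattern of a double as 16 hex digits."""
    return struct.pack(">d", x).hex()


def hex_to_double(h: str) -> float:
    """Inverse of double_to_hex, useful for checking generated data."""
    return struct.unpack(">d", bytes.fromhex(h))[0]


def source_vector(length: int, seed: int = 0):
    """Generate `length` random doubles and return them as hex strings."""
    rng = random.Random(seed)  # seeded so that a stimulus is reproducible
    return [double_to_hex(rng.uniform(-1.0, 1.0)) for _ in range(length)]
```

Writing the bit pattern rather than a decimal literal keeps the data in the stimulus file exact, so precomputed reference results can be compared without rounding concerns.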

Said data segment further contains a lock variable and a thread count variable for use by the evaluation stimulus synchronization structure. The synchronization structure uses the lock variable to ensure that only one thread updates the thread count variable at any time, and uses the thread count variable to determine, and guarantee, that the main thread continues with subsequent operations only after all threads have completed their own operations.

After said step (3.4) is completed, the following are also written into the assembly language file: a main thread result comparison code segment for verifying the correctness of the main thread's vector operation results, a slave thread result comparison code segment for verifying the correctness of the slave threads' vector operation results, and an error reporting code segment that is executed when a result comparison fails. In said step (3.5), the data segment further contains a correct result vector that is read by the main thread and slave thread result comparison code segments.

Compared with the prior art, the advantages of the present invention are as follows:

1. The method adopts the idea of component-based program design: basic assembly language code segments serve as the building blocks from which the assembly language program is constructed. Automatically constructing evaluation stimuli for the multi-thread cross double-precision short-vector structure in this way helps to develop stimuli for this kind of architecture quickly and reduces the workload of personnel.

2. The invention can be implemented as a program: the vector operation type and vector length to be evaluated or verified are given as input, and the output evaluation stimulus is an assembly language program in which multiple threads use short-vector instructions to complete the computation task. By running this program repeatedly with different operation types and vector lengths, evaluation stimuli can be constructed automatically in batches, which shortens construction time, saves cost, improves stimulus development efficiency and greatly facilitates processor verification and performance evaluation.
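Batch construction as described in advantage 2 amounts to running the generator over every combination of operation type and vector length. The following Python sketch shows one possible driver; the function `build_batch`, the file naming scheme and the `build` callback (which stands in for whatever function produces the text of a single stimulus) are assumptions made for illustration.

```python
import itertools
from pathlib import Path


def build_batch(out_dir, op_types, lengths, build):
    """Write one stimulus file per (operation type, vector length) pair.

    `build(op, n)` must return the full text of one stimulus file.
    Returns the list of paths written, in generation order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for op, n in itertools.product(op_types, lengths):
        path = out_dir / f"{op}_{n}.s"
        path.write_text(build(op, n))
        written.append(path)
    return written
```

Calling it with two operation types and two vector lengths yields four files, one per combination.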

Description of drawings

Fig. 1 is a schematic diagram of the processor core structure of the multi-thread cross double-precision short-vector structure.

Fig. 2 is a schematic diagram of the overall flow of the specific embodiment of the invention.

Fig. 3 is a schematic flow diagram of the evaluation stimulus constructed in the specific embodiment of the invention.

Embodiment

The present invention is described in further detail below with reference to the accompanying drawings and a specific embodiment.

As shown in Figure 2, the automatic construction method of the invention is used to construct an evaluation stimulus for evaluating and verifying the multi-thread cross double-precision short-vector architecture shown in Figure 1. The steps are as follows:

1. Input the vector operation type to be evaluated and the vector length.

2. Automatically create an empty assembly source file foo.s.

3. Write assembly language program text with the following contents into the foo.s file:

3.1 An assembly language code segment for multi-thread run initialization. The invention uses a conventional initialization code segment; the initialization required is the same for all constructed assembly language programs.

3.2 The evaluation stimulus control structure, which comprises:

3.2.1 A code segment for starting the multi-thread execution mode, which sets the multi-thread enable register so that the processor enters the multi-threaded working state;

3.2.2 A thread selection and jump code segment, which reads the private thread number register of each thread and jumps to the code of the corresponding thread according to the thread number. The thread whose thread number is 0 is the main thread; all other threads are slave threads.

3.3 The multi-thread vector operation program segment, which comprises:

3.3.1 The main thread vector operation code segment, used to distribute the computation task among the threads, compute the start address and vector length of the operands, read the source operand vector and the destination operand vector, and execute the short-vector operation in a loop. It is constructed as follows:

A. The start address of the main thread's operands and its vector length are computed from its thread number and the task distribution;

B. The source operand address and the loop count are computed from the thread number and the vector length, and the base address register and the loop count register are set;

C. The destination operand address is computed from the thread number, and the destination operand base address register is set;

D. According to the main thread's computation task, the assembly instructions for vector load, vector operation and result write-back are inserted into the text section of the assembly language program, forming the main thread vector operation code segment.

3.3.2 The slave thread vector operation code segment, used to read the source operand vector and the destination operand vector and execute the short-vector operation. It is constructed as follows:

A. The start address of the slave thread's operands and its vector length are computed from its thread number and the task distribution;

B. The source operand address and the loop count are computed from the thread number and the vector length, and the base address register and the loop count register are set;

C. The destination operand address is computed from the thread number, and the destination operand base address register is set;

D. According to the slave thread's computation task, the assembly instructions for vector load, vector operation and result write-back are inserted into the text section of the assembly language program, forming the slave thread vector operation code segment.

3.4 The evaluation stimulus verification structure, which comprises:

3.4.1 A main thread result comparison code segment, used to verify the correctness of the main thread's vector operation results;

3.4.2 A slave thread result comparison code segment, used to verify the correctness of the slave threads' vector operation results;

3.4.3 An error reporting code segment, executed when a result comparison fails.

For a test stimulus used for verification, result comparison code segments that check the correctness of the vector operation results must be inserted after the vector operation code segments of the main thread and the slave threads. At run time, the result comparison code segment reads the numerical results that each thread has just computed and compares them with the corresponding precomputed correct results; if the values differ, the error reporting code is executed.
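The precomputation and comparison described above can be modelled on the generator side as follows. This Python sketch assumes a two-operand element-wise operation and a bit-for-bit comparison of results; the operation table `OPS` and all function names are hypothetical, since the original text only states that differing values trigger the error reporting code.

```python
import struct

# Illustrative operation table; the real instruction set is not listed here.
OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}


def bits(x: float) -> int:
    """64-bit IEEE 754 pattern of a double as an integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def expected(op: str, src_a, src_b):
    """Precompute the correct result vector stored in the data segment."""
    return [OPS[op](a, b) for a, b in zip(src_a, src_b)]


def first_mismatch(computed, correct):
    """Index of the first element whose bit pattern differs, else None."""
    for i, (c, e) in enumerate(zip(computed, correct)):
        if bits(c) != bits(e):
            return i
    return None
```

A return value other than `None` corresponds to the case in which the stimulus would branch to its error reporting code segment. Comparing bit patterns distinguishes `0.0` from `-0.0`, which a plain numeric comparison would treat as equal.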

3.5 The evaluation stimulus synchronization structure, which comprises:

3.5.1 The main thread synchronization code segment, used to check and wait until all threads have completed their vector operations. If all threads have completed their operations, the main thread reports and exits; otherwise the main thread repeats the check in a loop, waiting for all threads to complete their vector operations.

3.5.2 The slave thread synchronization code segment, used to signal to the main thread whether this thread has completed its vector operation. After a slave thread completes its double-precision short-vector operation, it increments the thread count variable by 1 and enters a busy-wait state.

3.6 The data segment and data segment initialization statements. The data segment contains the source operand vector, the destination operand vector, the lock variable, the thread count variable and the correct results used for verification; the source operand vector and the destination operand vector form the multi-thread shared data segment. The data segment and its initialization statements are constructed as follows:

3.6.1 Construct the shared data segment:

A. Source operand vector. A double-precision vector is generated as the source operand with a random number generator for double-precision floating-point data; the vector length is specified by the user. The double-precision values in the vector are converted to hexadecimal to form the source operand vector, and the alignment of the data segment is specified as align 32.

B. Destination operand vector. Storage space for the destination operand is reserved according to the input vector length, to serve as the destination operand vector; the alignment of the data segment is align 32.

C. Correct result vector. For a test stimulus used for verification, a correct result data segment must be created, which holds the precomputed correct results. This data segment is read by the result comparison code segments (created in step 3.4) of the verification stimulus.

D. Create the lock variable. The lock variable ensures that only one thread at a time updates the thread count variable. In this embodiment, a 64-bit integer lock variable is created in the shared data area with initial value 0 and alignment align 8. A value of 0 means unlocked; a value of 1 means locked. At run time, each thread loops on a compare-and-swap instruction that reads the lock variable and tests the lock state; a value of 0 means that no thread is using the shared data. The thread that obtains the lock first changes the lock variable to 1, then reads and writes the shared data to perform the vector operation, and after completing the vector operation restores the lock variable to 0 with a compare-and-swap instruction.

E. Create the thread count variable. The thread count variable is used to determine, and guarantee, that the main thread continues with subsequent operations only after all threads have completed their own operations. In this embodiment, a 64-bit integer thread count variable is created in the shared data area with initial value 0 and alignment align 8. At run time, a thread that has obtained the lock and completed its vector operation increments the thread count variable by 1; when the main thread finds that the thread counter equals the total number of threads, it concludes that all threads have completed their vector operations, reports and exits.
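The layout of items A to E can be sketched as generator output. In the Python sketch below, the label names (`src_vec`, `dst_vec`, `ref_vec`, `lock_var`, `thread_count`) are hypothetical, and the directives `.align`, `.xword` and `.skip` are an assumption about the target assembler based on common SPARC assembler syntax; the original text only specifies the alignments (32 and 8), the 64-bit width and the initial value 0.

```python
def data_segment(src_hex, dst_len, expected_hex=()):
    """Return data-segment text for one stimulus.

    src_hex      -- source operands as 16-digit hex strings (item A)
    dst_len      -- number of 8-byte destination elements to reserve (item B)
    expected_hex -- optional precomputed correct results (item C)
    """
    out = [".data", ".align 32", "src_vec:"]
    out += [f"    .xword 0x{h}" for h in src_hex]
    out += [".align 32", "dst_vec:", f"    .skip {8 * dst_len}"]
    if expected_hex:
        out += [".align 32", "ref_vec:"]
        out += [f"    .xword 0x{h}" for h in expected_hex]
    # Items D and E: 64-bit lock and thread counter, both initially 0.
    out += [".align 8", "lock_var:", "    .xword 0",
            ".align 8", "thread_count:", "    .xword 0"]
    return "\n".join(out) + "\n"
```

The reference vector is emitted only when expected results are supplied, which matches the distinction in the text between stimuli used for verification and stimuli used only for performance evaluation.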

4. The assembly language file obtained in step 3 is taken as the automatically generated evaluation stimulus of the multi-thread cross double-precision short-vector structure.

In the above steps, the order in which the vector operation program segment, the verification structure and the synchronization structure of the evaluation stimulus are written into the foo.s file is not restricted. They may be written in the order of the steps above, or in the order shown in Figure 2, in which the three structures of the main thread are written first and the corresponding structures of each slave thread are then written one by one. As shown in Figure 3, the execution flow of the evaluation stimulus generated automatically in this embodiment is as follows:

(1) Initialize the multi-thread execution environment.

(2) Read the thread number register of the hardware thread and jump to the code entry of the main thread or of a slave thread according to the thread number: if the thread number is 0, go to step (3), the main thread entry; otherwise go to step (5), the corresponding slave thread entry.

(3) The main thread completes its double-precision short-vector operation according to its thread number and the task distribution. It then executes the code that acquires the lock variable; once the lock has been obtained, it increments the thread counter by 1 and releases the lock variable.

(4) The main thread reads the thread counter and checks whether it equals the total number of threads. If they are equal, all threads have completed their vector operations; the program terminates, reports the system state and exits. If they are not equal, this step is repeated until they are equal.

(5) The corresponding slave thread completes its double-precision short-vector operation according to its thread number and the task distribution. It then executes the code that acquires the lock variable; once the lock has been obtained, it increments the thread counter by 1, releases the lock variable and enters a busy-wait state.
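The run-time flow of steps (1) to (5) can be modelled in software to check the synchronization logic before stimuli are generated. The Python model below is only an analogy under stated assumptions: operating-system threads replace hardware threads, `threading.Lock` replaces the lock variable and its compare-and-swap loop, doubling each element replaces the actual short-vector operation, and slave threads simply return instead of busy-waiting.

```python
import threading


def run_stimulus_model(n_threads: int = 8, length: int = 64):
    """Model the stimulus flow; returns (destination vector, final count)."""
    assert length % n_threads == 0, "sketch assumes an even split"
    src = [float(i) for i in range(length)]
    dst = [0.0] * length
    lock = threading.Lock()   # stands in for the lock variable
    state = {"count": 0}      # stands in for the thread count variable
    chunk = length // n_threads

    def work(tid: int) -> None:
        lo = tid * chunk
        for i in range(lo, lo + chunk):  # this thread's share of the task
            dst[i] = 2.0 * src[i]
        with lock:                       # acquire, count += 1, release
            state["count"] += 1

    slaves = [threading.Thread(target=work, args=(t,))
              for t in range(1, n_threads)]
    for t in slaves:
        t.start()
    work(0)                              # thread 0 is the main thread
    while True:                          # step (4): poll until all are done
        with lock:
            if state["count"] == n_threads:
                break
    for t in slaves:
        t.join()
    return dst, state["count"]
```

The main thread leaves the polling loop only when the counter equals the total number of threads, which is the exit condition described in step (4).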

The above is only a preferred embodiment of the present invention. The scope of protection of the invention is not limited to the above embodiment; all technical solutions within the concept of the invention fall within its scope of protection. It should be noted that, for those of ordinary skill in the art, improvements and refinements made without departing from the principle of the invention shall also be regarded as falling within the scope of protection of the invention.

Claims

1. An automatic construction method for an evaluation stimulus of a multi-thread cross double-precision short-vector structure, characterized by comprising the following steps:

(1) inputting the vector operation type to be evaluated and the vector length;

(3) writing the following contents into said assembly language file:

(3.1) a multi-thread run initialization code segment;

(3.4) an evaluation stimulus synchronization structure, comprising: a main thread synchronization code segment, used to check and wait until all threads have completed their vector operations; and a slave thread synchronization code segment, used to mark that this thread has completed its vector operation;

2. The automatic construction method for an evaluation stimulus of a multi-thread cross double-precision short-vector structure according to claim 1, characterized in that in said step (3.3) the multi-thread vector operation program segment is created by the following steps:

(3.3.1) according to the input vector operation type and vector length, distributing the computation task among the threads, and then determining the start position and length of the vector on which each thread operates;

3. The automatic construction method for an evaluation stimulus of a multi-thread cross double-precision short-vector structure according to claim 2, characterized in that in said step (3.5) the shared data segment is constructed by the following steps:

4. The automatic construction method for an evaluation stimulus of a multi-thread cross double-precision short-vector structure according to claim 3, characterized in that said data segment further contains a lock variable and a thread count variable for use by the evaluation stimulus synchronization structure; the synchronization structure uses the lock variable to ensure that only one thread updates the thread count variable at any time, and uses the thread count variable to determine, and guarantee, that the main thread continues with subsequent operations only after all threads have completed their own operations.

5. The automatic construction method for an evaluation stimulus of a multi-thread cross double-precision short-vector structure according to claim 1, 2, 3 or 4, characterized in that after said step (3.4) is completed, the following are also written into the assembly language file: a main thread result comparison code segment for verifying the correctness of the main thread's vector operation results, a slave thread result comparison code segment for verifying the correctness of the slave threads' vector operation results, and an error reporting code segment that is executed when a result comparison fails; and in said step (3.5), the data segment further contains a correct result vector that is read by the main thread and slave thread result comparison code segments.