FPUvsSoftfloat
The goal of the FPU vs. SoftFloat project was to compare two floating point implementations on the JOP: the hardware floating point unit (FPU), which has already been integrated into the JOP, and a software solution (SoftFloat). First, the integration of the FPU into the JOP as an external module was tested thoroughly. Then a test algorithm was derived from the existing SoftFloat project to check the results of both units against each other, assuming that SoftFloat is always right. Next, a performance evaluation based on this test algorithm measured the number of ticks used per operation. Finally, to further improve performance, the invocation of a hardware FPU operation was integrated into the microcode of the Java Virtual Machine running on the JOP.
This work has been performed by Stephan Ramberger.
Contestants
FPU
The Floating Point Unit has been developed by Jidan Al-Eryani at the Vienna University of Technology as a pipelined implementation of floating point calculation in VHDL. It consists of four modules (Add/Subtract, Multiplication, Division, Square Root) that can be integrated individually into a design.
SoftFloat
SoftFloat is a Java floating point class based on the SoftFloat IEC/IEEE Floating-point Arithmetic Package, Release 2b. It was derived by Wolfgang Puffitsch for the purpose of using floating point operations with the JOP. It is included in the JOP CVS at opencores and can be found at jop/java/target/src/common/com/jopdesign/sys/SoftFloat.java
Realization
Optimization of the FPU connection
sc_fpu.vhd
| sc_fpu.vhd | |
| Revision | 1.2 (diff) |
| Date | 2007-06-29 |
| Path | jop/vhdl/scio |
Setting the start flag start_i has been moved from the section where the master reads from the FPU to the one where the operation is written. The master-reads-from-fpu section has been removed as the ready counter has its own process. This results in the following write sequence that should be adhered to:
- Operand A
- Operand B
- Operator
Now it is assured that the FPU will start working as soon as possible, and does not delay the read operation.
cvs revision 1.1:

-- master writes to FPU
process(clk_i, reset_i)
begin
    if (reset_i='1') then
        opa_i <= (others => '0');
        opb_i <= (others => '0');
        fpu_op_i <= (others => '0');
    elsif rising_edge(clk_i) then
        if wr_i='1' then
            if address_i="0000" then
                opa_i <= wr_data_i;
            elsif address_i="0001" then
                opb_i <= wr_data_i;
            elsif address_i="0010" then
                fpu_op_i <= wr_data_i(2 downto 0);
            end if;
        end if;
    end if;
end process;

cvs revision 1.2 (modification):

-- master writes to FPU
process(clk_i, reset_i)
begin
    if (reset_i='1') then
        opa_i <= (others => '0');
        opb_i <= (others => '0');
        fpu_op_i <= (others => '0');
        start_i <= '0';
    elsif rising_edge(clk_i) then
        start_i <= '0';
        if wr_i='1' then
            if address_i="0000" then
                opa_i <= wr_data_i;
            elsif address_i="0001" then
                opb_i <= wr_data_i;
            elsif address_i="0010" then
                fpu_op_i <= wr_data_i(2 downto 0);
                start_i <= '1';
            end if;
        end if;
    end if;
end process;
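The effect of revision 1.2 on the write sequence can be illustrated with a small host-side Java model. This is only a sketch under stated assumptions, not JOP or VHDL code: the class FpuModel and its methods are hypothetical, the result is computed with the host's float arithmetic (whose rounding may differ from the FPU), and only the four operation codes listed for the microcode (add = 0, sub = 1, mul = 2, div = 3) are modelled.

```java
// Hypothetical host-side model of the sc_fpu.vhd write interface (revision 1.2).
final class FpuModel {
    // Register offsets as decoded from address_i in sc_fpu.vhd.
    static final int ADDR_A = 0, ADDR_B = 1, ADDR_OP = 2;
    // Operation codes as given for the microcode: add=0, sub=1, mul=2, div=3.
    static final int OP_ADD = 0, OP_SUB = 1, OP_MUL = 2, OP_DIV = 3;

    private int opA, opB, op, result;
    private int starts; // number of start pulses generated so far

    /** Models one write from the master to the FPU. */
    void write(int address, int data) {
        switch (address) {
            case ADDR_A: opA = data; break;
            case ADDR_B: opB = data; break;
            case ADDR_OP:
                op = data & 0x7;    // fpu_op_i <= wr_data_i(2 downto 0)
                starts++;           // start_i is set only by this write
                result = compute(); // the model finishes instantly
                break;
            default: break;         // other addresses are ignored
        }
    }

    /** Models reading the result register. */
    int read() { return result; }

    int startCount() { return starts; }

    private int compute() {
        float a = Float.intBitsToFloat(opA);
        float b = Float.intBitsToFloat(opB);
        float r;
        switch (op) {
            case OP_ADD: r = a + b; break;
            case OP_SUB: r = a - b; break;
            case OP_MUL: r = a * b; break;
            case OP_DIV: r = a / b; break;
            default: r = Float.NaN; break; // not covered by this sketch
        }
        return Float.floatToRawIntBits(r);
    }

    public static void main(String[] args) {
        FpuModel fpu = new FpuModel();
        fpu.write(ADDR_A, Float.floatToRawIntBits(1.5f));
        fpu.write(ADDR_B, Float.floatToRawIntBits(2.25f));
        fpu.write(ADDR_OP, OP_ADD); // the operator goes last and starts the FPU
        System.out.println(Float.intBitsToFloat(fpu.read()));
    }
}
```

In this model the two operand writes never start a calculation. Only the operator write does, which is why the operator has to be written last.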
Addition to the microcode
jvm.asm
| jvm.asm | |
| Revision | 1.51 (diff) |
| Date | 2007-07-31 |
| Path | jop/asm/src |
The Java microcode has been extended to handle the floating point operations fadd, fsub, fmul and fdiv natively. They are called for the corresponding arithmetic operation on floating point numbers. Two modifications have been made to the source: additional defines that address the locations of the external operands and operator, and the implementation of the four operations.
The following defines are used to access the memory locations mapped to the FPU, which are specified in the file Const.java at jop/java/target/src/common/com/jopdesign/sys/.
#ifdef FPU_ATTACHED
// assuming the FPU at IO_FPU = IO_BASE + 0x70 = -16
fpu_const_a = -16 // Const.java: FPU_A = IO_FPU + 0;
fpu_const_b = -15 // Const.java: FPU_B = IO_FPU + 1;
fpu_const_op = -14 // Const.java: FPU_OP = IO_FPU + 2;
fpu_const_res = -13 // Const.java: FPU_RES = IO_FPU + 3;
#endif
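The address arithmetic in these defines can be checked with a few lines of host-side Java. This is a sketch only: the class name FpuAddresses is hypothetical, and IO_BASE = -128 is an assumption inferred from the comment "IO_BASE + 0x70 = -16", not a value copied from Const.java.

```java
// Hypothetical helper that recomputes the FPU register addresses.
final class FpuAddresses {
    // Assumption: inferred from "IO_BASE + 0x70 = -16" in the defines.
    static final int IO_BASE = -128;
    static final int IO_FPU  = IO_BASE + 0x70;

    static final int FPU_A   = IO_FPU + 0; // operand A
    static final int FPU_B   = IO_FPU + 1; // operand B
    static final int FPU_OP  = IO_FPU + 2; // operator
    static final int FPU_RES = IO_FPU + 3; // result

    public static void main(String[] args) {
        System.out.println("FPU_A   = " + FPU_A);
        System.out.println("FPU_B   = " + FPU_B);
        System.out.println("FPU_OP  = " + FPU_OP);
        System.out.println("FPU_RES = " + FPU_RES);
    }
}
```

If the FPU is mapped to a different I/O offset, both the constants in Const.java and the microcode defines have to be changed together.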
The fadd block below is representative of the four implemented floating point operations that take advantage of the hardware FPU. It basically consists of three external memory writes (jopsys_wrmem) followed by an external memory read (jopsys_rdmem).
To obtain the three other operations, substitute the parameter of the third ldi instruction (the one with a literal immediate instead of a predefined constant) with the number of the desired operation, where add = 0, sub = 1, mul = 2 and div = 3.
fadd:
    ldi fpu_const_b // load address: FPU_B
    stmwa // store memory address
    stmwd // store memory data - b already on stack
    wait
    wait // execute 1+nws
    ldi fpu_const_a // load address: FPU_A
    stmwa // store memory address
    stmwd // store memory data - a already on stack
    wait
    wait // execute 1+nws
    ldi 0 // load FPU_OP_ADD (data)
    ldi fpu_const_op // load FPU_OP (address)
    stmwa // store FPU_OP
    stmwd // store FPU_OP_ADD
    wait
    wait // execute 1+nws
    ldi fpu_const_res // load address of FPU_RES
    stmra // read memory
    wait
    wait // execute 1+nws
    ldmrd nxt // read ext. mem
Invocation of the FPU microcode
To take advantage of the FPU-specific microcode for floating point operations, it is only necessary to activate the line
#GCC_PARAMS="-DFPU_ATTACHED"
in the Makefile (see below) by removing the leading hash (#) from the parameter assignment. This passes the define to the compiler that generates the jvm{usb|ser}.asm. It is also recommended to include the FPU itself, e.g. by using the project cycfpu.
Makefile
| Makefile | |
| Revision | 1.67 (diff) |
| Date | 2007-09-02 |
| Path | jop |
The problem in activating the FPU microcode is how to pass the information from the Makefile, where the basic setup takes place, to the compiler that preprocesses the microcode definition (jvm.asm), regardless of the programming interface (serial or USB). Therefore a new parameter (GCC_PARAMS) has been introduced in the Makefile. It is passed on to the shell that starts the build process with a call to the batch file jopser.bat or jopusb.bat, where it is used as a command line parameter for gcc. Basically it is an additional command line define for the preprocessor (-DFPU_ATTACHED) that activates the FPU microcode in jvm.asm.
GCC_PARAMS=""
# uncomment this if you want floating point operations in hardware
# ATTN: be sure to choose 'cycfpu' as QPROJ else no FPU will be available
GCC_PARAMS="-DFPU_ATTACHED"
Section jopser where the parameter gets exported (also valid for jopusb)
cd asm && export GCC_PARAMS=$(GCC_PARAMS) && ./jopser.bat
jopser.bat
| jopser.bat | |
| Revision | 1.4 (diff) |
| Date | 2007-09-02 |
| Path | jop/asm |
The parameter (GCC_PARAMS) exported from the Makefile is used as a command line parameter for gcc.
gcc -x c -E -C -P %GCC_PARAMS% src\jvm.asm > generated\jvmser.asm
jopusb.bat
| jopusb.bat | |
| Revision | 1.2 (diff) |
| Date | 2007-09-02 |
| Path | jop/asm |
The parameter (GCC_PARAMS) exported from the Makefile is used as a command line parameter for gcc.
gcc -x c -E -C -P -DUSB %GCC_PARAMS% src\jvm.asm > generated\jvmusb.asm
Tests
FPUvsSoftFloat.java
| FPUvsSoftFloat.java | |
| Revision | 1.1 (initial) |
| Date | 2007-08-13 |
| Path | jop/java/target/src/test/fpu |
The floating point test program is based on GenFloatTest.java, which is already available in the CVS tree at jop/java/target/src/test/fpu/. Again, an example of a typical test run is appended below. Given an array float nums[] of floating point numbers, the function loops through every element and performs the operation against every element of the array. Three different test runs are performed for each pair of operands. First, a call to the SoftFloat reference implementation retrieves the value to be tested against. Then a block of Native.wrMem() operations communicates with the FPU via the SimpCon interface to acquire the result of the FPU. Finally, the microcode implementation is tested by adding (subtracting/multiplying/dividing) both floating point numbers using the native operator. At the end of the result block the function TestValues(..) tests the reference value against the calculated result and prints all necessary information to the output.
public static void TestADD()
{
    float fA,fB,fJOP;
    int iA,iB,iFPU,iSoft,iJOP;
    int iTimerCnt;
    iADDFPUError = 0;
    iADDJOPError = 0;
    iADDSoftTime = 0; iADDSoftMaxTime = 0; iADDSoftMinTime = 0x7FFFFFFF;
    iADDFPUTime = 0; iADDFPUMaxTime = 0; iADDFPUMinTime = 0x7FFFFFFF;
    iADDJOPTime = 0; iADDJOPMaxTime = 0; iADDJOPMinTime = 0x7FFFFFFF;
    iFPU = 0;
    System.out.println("Testing ADD (+) - START -");
    for (int i=0; i<nums.length; i++) {
        for (int k=0; k<nums.length; k++) {
            // acquire new test values as float and raw int bits
            fA = nums[i];
            fB = nums[k];
            iA = Float.floatToRawIntBits(fA);
            iB = Float.floatToRawIntBits(fB);
            // test run for SoftFloat
            iTimerCnt = Native.rd(Const.IO_CNT);
            iSoft = SoftFloat.float32_add(iA,iB);
            iTimerCnt = Native.rd(Const.IO_CNT) - iTimerCnt;
            iADDSoftTime += iTimerCnt;
            if (iTimerCnt > iADDSoftMaxTime) iADDSoftMaxTime = iTimerCnt;
            if (iTimerCnt < iADDSoftMinTime) iADDSoftMinTime = iTimerCnt;
            // test run for attached FPU via SimpCon
            iTimerCnt = Native.rd(Const.IO_CNT);
            Native.wrMem(iA, Const.FPU_A);
            Native.wrMem(iB, Const.FPU_B);
            Native.wrMem(Const.FPU_OP_ADD, Const.FPU_OP);
            iFPU = Native.rdMem(Const.FPU_RES);
            iTimerCnt = Native.rd(Const.IO_CNT) - iTimerCnt;
            iADDFPUTime += iTimerCnt;
            if (iTimerCnt > iADDFPUMaxTime) iADDFPUMaxTime = iTimerCnt;
            if (iTimerCnt < iADDFPUMinTime) iADDFPUMinTime = iTimerCnt;
            // test run for microcode implementation
            iTimerCnt = Native.rd(Const.IO_CNT);
            fJOP = fA + fB;
            iTimerCnt = Native.rd(Const.IO_CNT) - iTimerCnt;
            iADDJOPTime += iTimerCnt;
            if (iTimerCnt > iADDJOPMaxTime) iADDJOPMaxTime = iTimerCnt;
            if (iTimerCnt < iADDJOPMinTime) iADDJOPMinTime = iTimerCnt;
            iJOP = Float.floatToRawIntBits(fJOP);
            // comparison of FPU results against the SoftFloat implementation
            iADDFPUError += TestValues("ADD","FPU",i,k,iSoft,iFPU,1);
            iADDJOPError += TestValues("ADD","JOP",i,k,iSoft,Float.floatToRawIntBits(fJOP),1);
        }
    }
    System.out.println("Testing ADD (+) - FINISHED -");
    System.out.println("----------------------------");
}
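The body of TestValues(..) is not shown here. A minimal sketch of the kind of check it performs might look as follows, assuming a bit-exact comparison of the raw int values. The class BitCompare, the method compareBits and its reduced parameter list are hypothetical and are not the author's implementation.

```java
// Hypothetical stand-in for the comparison step; not the original TestValues(..).
final class BitCompare {
    /**
     * Compares the reference bits against the calculated bits.
     * Returns 0 on a match and 1 on a mismatch, so the return
     * value can be added to an error counter.
     */
    static int compareBits(String op, String unit, int expected, int actual) {
        if (expected == actual) {
            return 0;
        }
        System.out.println(op + " " + unit
                + ": expected 0x" + Integer.toHexString(expected)
                + ", got 0x" + Integer.toHexString(actual));
        return 1;
    }

    public static void main(String[] args) {
        int one = Float.floatToRawIntBits(1.0f);
        int errors = 0;
        errors += compareBits("ADD", "FPU", one, one);     // match
        errors += compareBits("ADD", "FPU", one, one + 1); // off by one bit
        System.out.println("errors: " + errors);
    }
}
```

A bit-exact comparison is strict: it treats +0.0 and -0.0 as different values and distinguishes NaNs with different payloads. Whether the original test relaxes these cases is not stated.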
Results
Measurements
The basis for the performance evaluation is an internal counter that counts clock cycles and can be accessed via a call to Native.rd() and the location Const.IO_CNT. Immediately before and after a floating point operation the counter will be read and its difference calculated.
| ticks/instruction | ADD | SUB | MUL | DIV |
|---|---|---|---|---|
| SoftFloat | 526 | 579 | 19923 | 137689 |
| FPU | 49 | 49 | 54 | 76 |
| FPU /w microcode | 42 | 42 | 47 | 69 |
The downside of this type of measurement is the influence of the read operation itself. However, a single call to Native.rd(Const.IO_CNT) can also be measured, so the following code fragment is used to determine the number of ticks it takes, and the result is subtracted from the values in the table above. Another way of correcting the measurement would be to count the low-level instructions and subtract the number of cycles used to execute them.
int iTime, iTimer;
iTime = Native.rd(Const.IO_CNT);
iTimer = Native.rd(Const.IO_CNT) - iTime;
// Time per Native.rd() instruction is iTimer
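The correction can be sketched on a host JVM with a deterministic stand-in for the cycle counter. Everything in this example is an assumption made for illustration: the class OverheadCorrection, the FakeCounter and its cost of 8 ticks per read are hypothetical, chosen so that the numbers mirror the FPU ADD entry (49 ticks raw, 41 ticks corrected). On the JOP the counter would be read with Native.rd(Const.IO_CNT) instead.

```java
import java.util.function.IntSupplier;

// Hypothetical host-side sketch of the overhead correction.
final class OverheadCorrection {

    /** Ticks between two back-to-back counter reads, i.e. the cost of one read. */
    static int readOverhead(IntSupplier counter) {
        int start = counter.getAsInt();
        return counter.getAsInt() - start;
    }

    /** Raw ticks measured around op; this includes the cost of one counter read. */
    static int measureRaw(IntSupplier counter, Runnable op) {
        int start = counter.getAsInt();
        op.run();
        return counter.getAsInt() - start;
    }

    /** Raw measurement minus the separately measured read overhead. */
    static int measureCorrected(IntSupplier counter, Runnable op) {
        return measureRaw(counter, op) - readOverhead(counter);
    }

    /** Deterministic stand-in counter: every read costs READ_TICKS ticks. */
    static final class FakeCounter implements IntSupplier {
        static final int READ_TICKS = 8; // assumed value for this sketch
        private int ticks;

        /** Simulates an operation that takes n ticks. */
        void advance(int n) { ticks += n; }

        @Override
        public int getAsInt() {
            ticks += READ_TICKS;
            return ticks;
        }
    }

    public static void main(String[] args) {
        FakeCounter counter = new FakeCounter();
        Runnable op = () -> counter.advance(41); // simulated 41-tick operation
        System.out.println("raw:       " + measureRaw(counter, op));
        System.out.println("overhead:  " + readOverhead(counter));
        System.out.println("corrected: " + measureCorrected(counter, op));
    }
}
```

The sketch relies on the read overhead being constant, which is what makes a simple subtraction valid.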
| ticks/instruction | ADD | SUB | MUL | DIV |
|---|---|---|---|---|
| SoftFloat | 517 | 571 | 19915 | 137681 |
| FPU | 41 | 41 | 46 | 68 |
| FPU /w microcode | 34 | 34 | 39 | 61 |
| ticks/instruction | ADD | SUB | MUL | DIV |
|---|---|---|---|---|
| SoftFloat | 517 | 571 | 19915 | 137681 |
| SoftFloat (MIN / MAX) | 300 / 1650 | 300 / 1686 | 179 / 42186 | 179 / 478719 |
| FPU | 41 | 41 | 46 | 68 |
| FPU (MIN / MAX) | 41 / 41 | 41 / 41 | 46 / 46 | 68 / 68 |
| FPU /w microcode | 34 | 34 | 39 | 61 |
| FPU /w microcode (MIN / MAX) | 34 / 34 | 34 / 34 | 39 / 39 | 61 / 61 |
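To put the corrected figures into perspective, the speedup of the hardware variants over SoftFloat can be computed directly from the table values. The following sketch only performs this arithmetic. The class name Speedup is hypothetical, and the ratios are derived from the table rather than measured separately.

```java
// Hypothetical helper that derives speedup factors from the corrected table.
final class Speedup {
    static final String[] OPS = {"ADD", "SUB", "MUL", "DIV"};
    // Corrected ticks per instruction, copied from the table.
    static final int[] SOFT  = {517, 571, 19915, 137681};
    static final int[] FPU   = {41, 41, 46, 68};
    static final int[] MICRO = {34, 34, 39, 61};

    /** How many times faster the fast variant is than the slow one. */
    static double ratio(int slowTicks, int fastTicks) {
        return (double) slowTicks / fastTicks;
    }

    public static void main(String[] args) {
        for (int i = 0; i < OPS.length; i++) {
            System.out.println(OPS[i]
                    + ": SoftFloat/FPU = " + ratio(SOFT[i], FPU[i])
                    + ", SoftFloat/microcode = " + ratio(SOFT[i], MICRO[i])
                    + ", FPU/microcode = " + ratio(FPU[i], MICRO[i]));
        }
    }
}
```

With these numbers, the microcode variant is roughly 15 times faster than SoftFloat for addition and more than 2000 times faster for division.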
Errors
During the performance evaluation of the FPU, an error in the multiplication of denormalized numbers was found, and an error report was posted to the corresponding Yahoo group that maintains the FPU. Since the error also affects the serial multiplier, it is highly probable that it originates in the pre-normalization routines.
