FPUvsSoftfloat

From JopWiki


The FPU vs. SoftFloat task compared two floating point implementations on the JOP: the hardware floating point unit (FPU), which had already been integrated into the JOP, and a software solution (SoftFloat). First, the integration of the FPU into the JOP as an external module was thoroughly tested. A test algorithm was then derived from the existing SoftFloat project to check the results of both units against each other, taking SoftFloat as the reference that is assumed to be always correct. Next, the same test algorithm was used for a performance evaluation that measures the number of ticks used per operation. Finally, to further improve performance, the invocation of hardware FPU operations was integrated into the microcode of the Java Virtual Machine running on the JOP.

This work has been performed by Stephan Ramberger.



Contestants

FPU

The Floating Point Unit was developed by Jidan Al-Eryani at the Vienna University of Technology as a pipelined floating point implementation in VHDL. It consists of four modules (Add/Subtract, Multiplication, Division, Square Root) that can be integrated into a design individually.


SoftFloat

SoftFloat is a Java floating point class based on the SoftFloat IEC/IEEE Floating-point Arithmetic Package, Release 2b, derived by Wolfgang Puffitsch to make floating point operations available on the JOP. It is included in the JOP CVS at opencores and can be found at jop/java/target/src/common/com/jopdesign/sys/SoftFloat.java


Realization

Optimization of the FPU connection

sc_fpu.vhd

Revision: 1.2
Date:     2007-06-29
Path:     jop/vhdl/scio

Setting the start flag start_i has been moved from the section where the master reads from the FPU to the section where the operation is written. The master-reads-from-FPU section has been removed, as the ready counter has its own process. This results in the following write sequence, which should be adhered to:

  1. Operand A
  2. Operand B
  3. Operator

This ensures that the FPU starts working as soon as possible and that the read operation is not delayed.

CVS revision 1.1 (first listing) and CVS revision 1.2 (second listing, with the modification):
-- master writes to FPU
process(clk_i, reset_i)

begin

  if (reset_i='1') then
    opa_i <= (others => '0');
    opb_i <= (others => '0');
    fpu_op_i <= (others => '0');

  elsif rising_edge(clk_i) then


    if wr_i='1' then
      if address_i="0000" then
          opa_i <= wr_data_i;
      elsif address_i="0001" then
          opb_i <= wr_data_i;
      elsif address_i="0010" then
          fpu_op_i <=wr_data_i(2 downto 0);

      end if;
    end if;

  end if;

end process;
 
 -- master writes to FPU
process(clk_i, reset_i)

begin

  if (reset_i='1') then
    opa_i <= (others => '0');
    opb_i <= (others => '0');
    fpu_op_i <= (others => '0');
    start_i <= '0';
  elsif rising_edge(clk_i) then
    start_i <= '0';

    if wr_i='1' then
      if address_i="0000" then
          opa_i <= wr_data_i;
      elsif address_i="0001" then
          opb_i <= wr_data_i;
      elsif address_i="0010" then
          fpu_op_i <=wr_data_i(2 downto 0);
          start_i <= '1';
      end if;
    end if;

  end if;

end process;
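
The effect of the revision 1.2 write process can be illustrated with a small behavioural model in Java. This is only a sketch under stated assumptions: the class FpuRegisterModel, its fields and the counter startPulses are hypothetical names introduced for illustration and are not part of the JOP sources. It mirrors the three register writes and shows that only the operator write raises the start flag, which is why the operator has to be written last.

```java
/** Hypothetical behavioural model of the sc_fpu write process (revision 1.2). */
final class FpuRegisterModel {
    // Register offsets as decoded from address_i in the VHDL listing.
    static final int ADDR_OPA = 0;
    static final int ADDR_OPB = 1;
    static final int ADDR_OP  = 2;

    int opa;          // mirrors opa_i
    int opb;          // mirrors opb_i
    int op;           // mirrors fpu_op_i (3 bits)
    int startPulses;  // counts how often start_i would be set to '1'

    /** Models one rising clock edge with wr_i = '1'. */
    void write(int address, int data) {
        switch (address) {
            case ADDR_OPA:
                opa = data;
                break;
            case ADDR_OPB:
                opb = data;
                break;
            case ADDR_OP:
                op = data & 0x7;  // wr_data_i(2 downto 0)
                startPulses++;    // start_i <= '1' on the operator write
                break;
            default:
                break;            // other addresses are ignored
        }
    }

    public static void main(String[] args) {
        FpuRegisterModel fpu = new FpuRegisterModel();
        fpu.write(ADDR_OPA, Float.floatToRawIntBits(1.5f));
        fpu.write(ADDR_OPB, Float.floatToRawIntBits(2.25f));
        System.out.println("start pulses before operator write: " + fpu.startPulses);
        fpu.write(ADDR_OP, 0); // 0 = add
        System.out.println("start pulses after operator write:  " + fpu.startPulses);
    }
}
```

If the operator were written before one of the operands, the model (like the hardware) would start the calculation with a stale operand value.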


Addition to the microcode

jvm.asm

Revision: 1.51
Date:     2007-07-31
Path:     jop/asm/src

The Java microcode has been extended with native implementations of the floating point operations fadd, fsub, fmul and fdiv, which are invoked for the corresponding arithmetic operations on floating point numbers. Two modifications have been made to the source: additional defines that address the locations of the external operands and operator, and the implementation of the four operations.

The following defines are used to access the memory locations mapped to the FPU, which are specified in the file Const.java at jop/java/target/src/common/com/jopdesign/sys/.

 #ifdef FPU_ATTACHED
   // assuming the FPU at IO_FPU = IO_BASE + 0x70 = -16
   fpu_const_a   = -16  // Const.java: FPU_A   = IO_FPU + 0;
   fpu_const_b   = -15  // Const.java: FPU_B   = IO_FPU + 1;
   fpu_const_op  = -14  // Const.java: FPU_OP  = IO_FPU + 2;
   fpu_const_res = -13  // Const.java: FPU_RES = IO_FPU + 3;
 #endif
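
The numeric values of the defines follow directly from the address arithmetic in the comments. The sketch below recomputes them in Java. The class name FpuAddresses is hypothetical, and the value of IO_BASE is an assumption derived from the comment above (IO_BASE + 0x70 = -16 implies IO_BASE = -128, i.e. 0xffffff80); the authoritative definitions are those in Const.java.

```java
/** Hypothetical recomputation of the FPU addresses used in jvm.asm. */
final class FpuAddresses {
    // Assumption: IO_BASE = -128, implied by "IO_BASE + 0x70 = -16".
    static final int IO_BASE = 0xffffff80;
    static final int IO_FPU  = IO_BASE + 0x70;

    static final int FPU_A   = IO_FPU + 0; // operand A
    static final int FPU_B   = IO_FPU + 1; // operand B
    static final int FPU_OP  = IO_FPU + 2; // operator, starts the calculation
    static final int FPU_RES = IO_FPU + 3; // result

    public static void main(String[] args) {
        System.out.println("FPU_A   = " + FPU_A);
        System.out.println("FPU_B   = " + FPU_B);
        System.out.println("FPU_OP  = " + FPU_OP);
        System.out.println("FPU_RES = " + FPU_RES);
    }
}
```

Because the microcode uses literal values rather than the Java constants, the defines in jvm.asm have to be adjusted by hand if the FPU is mapped to a different I/O address.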
 


The fadd block below is representative of the four implemented floating point operations that take advantage of the hardware FPU. It consists of three external memory writes (jopsys_wrmem) followed by an external memory read (jopsys_rdmem).

To obtain the other three operations, substitute the parameter of the third ldi instruction (the one with a literal immediate rather than a predefined constant) with the number of the desired operation: add = 0, sub = 1, mul = 2, div = 3.

 fadd:
    ldi fpu_const_b     // load address: FPU_B
    stmwa               // store memory address
    stmwd               // store memory data - b already on stack
    wait
    wait                // execute 1+nws
    ldi fpu_const_a     // load address: FPU_A
    stmwa               // store memory address
    stmwd               // store memory data - a already on stack
    wait
    wait                // execute 1+nws
    ldi 0               // load FPU_OP_ADD (data)
    ldi fpu_const_op    // load FPU_OP (address)
    stmwa               // store FPU_OP
    stmwd               // store FPU_OP_ADD
    wait
    wait                // execute 1+nws
    ldi fpu_const_res   // load address of FPU_RES
    stmra               // read memory
    wait
    wait                // execute 1+nws
    ldmrd nxt           // read ext. mem
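
The order of the writes in the fadd block follows from the operand stack: operand b is on top of the stack, so it is written first, then a, and the operator last. The Java sketch below replays this order on a simulated stack. It is an illustration only: the class FaddSequenceSketch and the method writesFor are hypothetical and do not exist in JOP, and the addresses are the literal values from the defines above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical replay of the write order used by the FPU microcode. */
final class FaddSequenceSketch {
    static final int FPU_A  = -16;
    static final int FPU_B  = -15;
    static final int FPU_OP = -14;

    // Operation numbers as given in the text.
    static final int OP_ADD = 0, OP_SUB = 1, OP_MUL = 2, OP_DIV = 3;

    /** Returns the {address, data} pairs written for one operation, in order. */
    static List<int[]> writesFor(Deque<Integer> stack, int opCode) {
        List<int[]> writes = new ArrayList<>();
        int b = stack.pop();                    // top of stack: operand b
        writes.add(new int[] {FPU_B, b});
        int a = stack.pop();                    // next on stack: operand a
        writes.add(new int[] {FPU_A, a});
        writes.add(new int[] {FPU_OP, opCode}); // operator last, starts the FPU
        return writes;
    }

    public static void main(String[] args) {
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(Float.floatToRawIntBits(1.5f));  // a
        stack.push(Float.floatToRawIntBits(2.25f)); // b
        for (int[] w : writesFor(stack, OP_ADD)) {
            System.out.println("write 0x" + Integer.toHexString(w[1]) + " to address " + w[0]);
        }
    }
}
```

Changing the last argument to OP_SUB, OP_MUL or OP_DIV corresponds to replacing the immediate of the third ldi instruction.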
 

Invocation of the FPU microcode

To take advantage of the FPU specific microcode for floating point operations it is only necessary to activate the line

#GCC_PARAMS="-DFPU_ATTACHED"

in the Makefile (see below) by removing the leading hash (#) from the parameter assignment. This passes the define to the preprocessor that generates jvm{usb|ser}.asm. It is also recommended to include the FPU itself, e.g. by using the project cycfpu.


Makefile

Revision: 1.67
Date:     2007-09-02
Path:     jop

The difficulty in activating the FPU microcode is passing the information from the Makefile, where the basic setup takes place, to the preprocessor that processes the microcode definition (jvm.asm), regardless of the programming interface (serial or USB). A new parameter, GCC_PARAMS, has therefore been introduced in the Makefile. It is exported to the shell that starts the build process with a call to the batch file jopser.bat or jopusb.bat, where it is used as a command line parameter for gcc. In effect it is an additional preprocessor define (-DFPU_ATTACHED) that activates the FPU microcode in jvm.asm.

 GCC_PARAMS=""

 # uncomment this if you want floating point operations in hardware
 # ATTN: be sure to choose 'cycfpu' as QPROJ else no FPU will be available
 GCC_PARAMS="-DFPU_ATTACHED"
 

Section jopser, where the parameter gets exported (also valid for jopusb):

cd asm && export GCC_PARAMS=$(GCC_PARAMS) && ./jopser.bat


jopser.bat

Revision: 1.4
Date:     2007-09-02
Path:     jop/asm

The parameter (GCC_PARAMS) exported from the Makefile is used as a command line parameter for gcc:

gcc -x c -E -C -P %GCC_PARAMS% src\jvm.asm > generated\jvmser.asm

jopusb.bat

Revision: 1.2
Date:     2007-09-02
Path:     jop/asm

The parameter (GCC_PARAMS) exported from the Makefile is used as a command line parameter for gcc:

gcc -x c -E -C -P -DUSB %GCC_PARAMS% src\jvm.asm > generated\jvmusb.asm

Tests

FPUvsSoftFloat.java

Revision: 1.1 (initial)
Date:     2007-08-13
Path:     jop/java/target/src/test/fpu

The floating point test program is based on GenFloatTest.java, which is already available in the CVS tree at jop/java/target/src/test/fpu/. An example of a typical test run is appended below. Given an array float nums[] of floating point numbers, the function loops over every element and performs the operation against every element of the array, including itself. Three test runs are performed for each pair. First, a call to the SoftFloat reference implementation retrieves the value to be tested against. Then a block of Native.wrMem() operations followed by a Native.rdMem() communicates with the FPU via the SimpCon interface to acquire the result of the FPU. Finally, the microcode implementation is tested by adding (subtracting/multiplying/dividing) both floating point numbers using the native operator. At the end of the result block the function TestValues(..) tests the reference value against the calculated results and prints all necessary information to the output.


 public static void TestADD()
 {
   float fA,fB,fJOP;
   int iA,iB,iFPU,iSoft,iJOP;
   int iTimerCnt;

   iADDFPUError = 0;
   iADDJOPError = 0;
   iADDSoftTime = 0; iADDSoftMaxTime = 0; iADDSoftMinTime = 0x7FFFFFFF;
   iADDFPUTime  = 0; iADDFPUMaxTime  = 0; iADDFPUMinTime  = 0x7FFFFFFF;
   iADDJOPTime  = 0; iADDJOPMaxTime  = 0; iADDJOPMinTime  = 0x7FFFFFFF;
   iFPU = 0;

   System.out.println("Testing ADD (+) - START -");
   for (int i=0; i<nums.length; i++) {
     for (int k=0; k<nums.length; k++) {
       // acquire new test values as float and raw int bits
       fA = nums[i];
       fB = nums[k];
       iA = Float.floatToRawIntBits(fA);
       iB = Float.floatToRawIntBits(fB);

       // test run for SoftFloat
       iTimerCnt = Native.rd(Const.IO_CNT);
       iSoft = SoftFloat.float32_add(iA,iB);
       iTimerCnt = Native.rd(Const.IO_CNT) - iTimerCnt;
       iADDSoftTime += iTimerCnt;
       if (iTimerCnt > iADDSoftMaxTime) iADDSoftMaxTime = iTimerCnt;
       if (iTimerCnt < iADDSoftMinTime) iADDSoftMinTime = iTimerCnt;

       // test run for attached FPU via SimpCon
       iTimerCnt = Native.rd(Const.IO_CNT);
       Native.wrMem(iA, Const.FPU_A);
       Native.wrMem(iB, Const.FPU_B);
       Native.wrMem(Const.FPU_OP_ADD, Const.FPU_OP);
       iFPU = Native.rdMem(Const.FPU_RES);
       iTimerCnt = Native.rd(Const.IO_CNT) - iTimerCnt;
       iADDFPUTime += iTimerCnt;
       if (iTimerCnt > iADDFPUMaxTime) iADDFPUMaxTime = iTimerCnt;
       if (iTimerCnt < iADDFPUMinTime) iADDFPUMinTime = iTimerCnt;

       // test run for microcode implementation
       iTimerCnt = Native.rd(Const.IO_CNT);
       fJOP = fA + fB;
       iTimerCnt = Native.rd(Const.IO_CNT) - iTimerCnt;
       iADDJOPTime += iTimerCnt;
       if (iTimerCnt > iADDJOPMaxTime) iADDJOPMaxTime = iTimerCnt;
       if (iTimerCnt < iADDJOPMinTime) iADDJOPMinTime = iTimerCnt;

       iJOP = Float.floatToRawIntBits(fJOP);

       // comparison of FPU results against the SoftFloat implementation
       iADDFPUError += TestValues("ADD","FPU",i,k,iSoft,iFPU,1);
       iADDJOPError += TestValues("ADD","JOP",i,k,iSoft,iJOP,1);
     }
   }
   System.out.println("Testing ADD (+) - FINISHED -");
   System.out.println("----------------------------");
 }
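
Each of the three measurement blocks in TestADD() repeats the same bookkeeping for the total, minimum and maximum tick counts. As a sketch of how this could be factored out, the hypothetical helper class below (TickStats is not part of the JOP test sources) keeps the same three values with the same initial values as the listing. The average() method is an additional assumption about how per-instruction figures could be derived from the total.

```java
/** Hypothetical helper that collects tick statistics for one test series. */
final class TickStats {
    int total = 0;          // sum of all measured tick counts
    int max = 0;            // largest single measurement
    int min = 0x7FFFFFFF;   // smallest single measurement
    int samples = 0;        // number of recorded measurements

    /** Records one measurement, as done inline in TestADD(). */
    void record(int ticks) {
        total += ticks;
        samples++;
        if (ticks > max) max = ticks;
        if (ticks < min) min = ticks;
    }

    /** Integer average; returns 0 if nothing has been recorded. */
    int average() {
        return samples == 0 ? 0 : total / samples;
    }

    public static void main(String[] args) {
        TickStats fpu = new TickStats();
        int[] measured = {41, 49, 45}; // made-up sample values
        for (int t : measured) {
            fpu.record(t);
        }
        System.out.println("min=" + fpu.min + " max=" + fpu.max + " avg=" + fpu.average());
    }
}
```

With such a helper, each measurement block would reduce to reading the counter twice and calling record() with the difference.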
 

Results

Measurements

The basis for the performance evaluation is an internal counter that counts clock cycles and can be read with Native.rd() at the location Const.IO_CNT. The counter is read immediately before and after a floating point operation, and the difference is calculated.


Ticks per instruction without measurement correction

ticks/instruction     ADD     SUB      MUL       DIV
SoftFloat             526     579    19923    137689
FPU                    49      49       54        76
FPU w/ microcode       42      42       47        69


The downside of this type of measurement is the influence of the read operation itself. However, since a single call to Native.rd(Const.IO_CNT) can also be measured, we use the following code fragment to determine the number of ticks it takes and subtract the result from the values in the table above. Another way of correcting the measurement would be to count the low-level instructions and subtract the number of cycles used to execute them.

  int iTime, iTimer;

  iTime  = Native.rd(Const.IO_CNT);
  iTimer = Native.rd(Const.IO_CNT) - iTime;

  // Time per Native.rd() instruction is iTimer

Correction value

                            ticks
Native.rd(Const.IO_CNT)         8

Ticks per instruction with measurement correction of -8 ticks per call to Native.rd()

ticks/instruction     ADD     SUB      MUL       DIV
SoftFloat             517     571    19915    137681
FPU                    41      41       46        68
FPU w/ microcode       34      34       39        61
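
The correction itself is a plain subtraction of the measured read overhead from each raw value. The following Java sketch applies it to the uncorrected FPU row. The class TickCorrection and its method names are hypothetical; only the overhead of 8 ticks and the raw values are taken from the measurements above.

```java
/** Hypothetical helper applying the measurement correction. */
final class TickCorrection {
    // Ticks consumed by one Native.rd(Const.IO_CNT), as measured above.
    static final int READ_OVERHEAD = 8;

    /** Removes the counter read overhead from a raw measurement. */
    static int corrected(int rawTicks) {
        return rawTicks - READ_OVERHEAD;
    }

    public static void main(String[] args) {
        String[] ops = {"ADD", "SUB", "MUL", "DIV"};
        int[] rawFpu = {49, 49, 54, 76}; // uncorrected FPU row
        for (int i = 0; i < ops.length; i++) {
            System.out.println(ops[i] + ": " + corrected(rawFpu[i]) + " ticks");
        }
    }
}
```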


MIN/MAX ticks with measurement correction of -8 ticks per call to Native.rd()

                           ADD            SUB             MUL               DIV
                        MIN    MAX     MIN    MAX     MIN     MAX      MIN      MAX
SoftFloat
  ticks/instruction        517            571           19915            137681
  MIN / MAX             300   1650     300   1686     179   42186      179   478719
FPU
  ticks/instruction         41             41              46                68
  MIN / MAX              41     41      41     41      46      46       68       68
FPU w/ microcode
  ticks/instruction         34             34              39                61
  MIN / MAX              34     34      34     34      39      39       61       61

Errors

An error in the multiplication of denormalized numbers was found during the performance evaluation of the FPU, and an error report was posted to the corresponding Yahoo group that maintains the FPU. Since the error also affects the serial multiplier, it is highly probable that it originates in the pre-normalization routines.
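
For readers unfamiliar with the term, the Java sketch below shows what a denormalized single precision operand looks like at bit level and what a standard JVM computes for a simple denormal multiplication. It does not reproduce the reported FPU error, whose failing operands are not listed here. The class DenormalExample and the method isDenormal are hypothetical names used for illustration only.

```java
/** Hypothetical illustration of denormalized float operands. */
final class DenormalExample {
    /** True if the bit pattern encodes a denormalized (subnormal) float. */
    static boolean isDenormal(int bits) {
        int exponent = (bits >>> 23) & 0xFF; // 8 exponent bits
        int fraction = bits & 0x007FFFFF;    // 23 fraction bits
        return exponent == 0 && fraction != 0;
    }

    public static void main(String[] args) {
        float a = Float.MIN_VALUE;           // smallest positive denormal
        float product = a * 2.0f;            // result is still denormal
        int bits = Float.floatToRawIntBits(product);
        System.out.println("product bits: 0x" + Integer.toHexString(bits));
        System.out.println("denormal:     " + isDenormal(bits));
    }
}
```

A check of this kind could be used to select the denormal entries of nums[] when narrowing down which operand pairs disagree with SoftFloat.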
