User:DanClemmensen/Current
This is the log for the re-started project.
During the interregnum:
- 2008-12-25 got the new Core i7 computer
- 2008-12-30 finished build-out of new computer with 64-bit Gentoo
- 2009-01 restored the old Xilinx HOWTO after the old wiki moved. Now at Xilinx ISE WebPACK (10 and Earlier)
- 2009-04 Extracted the hard disks from all my old computers and copied them to the new computer (1TB drive)
Restart:
- 2009-09-01 restarted JOP project
- 2009-09-05 completed installation of Xilinx WebPack 11.1
- 2009-09-10 created a new HOWTO for the new ISE, at Xilinx ISE Webpack 11
- 2009-09-10 made minor changes to the makefile to accommodate the new ISE.
- 2009-09-10 built and loaded JOP with the test bootloader. It worked as expected.
- 2009-09-12 built the Xilinx UNISIM library for GHDL, using this makefile.
- 2009-09-12 built and ran JOP with GHDL and viewed it with GTKWave. It worked.
- 2009-09-13 researched Plasma, registered at OpenCores.
- 2009-09-14 Downloaded Plasma
- 2009-09-14 Downloaded and built the gnu_mips toolchain
- 2009-09-14 Built Plasma tools, built Plasma bootloader.
- 2009-09-15 Adapted my Jop s3Esk Xilinx batch makefile to Plasma
- 2009-09-15 Built Plasma FPGA image and loaded it
- 2009-09-15 built Plasma kernel and (some) apps.
- 2009-09-15 built Plasma with a "blink" bootloader and ran it. It works!
- 2009-09-16 added LCD peripheral to Plasma, built and tested
- 2009-09-16 added SPI peripheral to Plasma, built.
- 2009-09-17 investigating data2MEM. manual(pdf), howto
- 2009-09-18 improved the SPI selector
- 2009-09-19 integrated data2MEM into my toolchain. successfully updated the boot image!
- 2009-09-20 integrated the kernel append tools. successfully added the kernel to the image.
- 2009-09-21 (Plasma) validated writing to DDR!
- 2009-09-22 (Plasma) successfully loaded a dummy kernel from platform flash to DDR and executed it!
- 2009-09-23 Simplified and improved the makefile.
- 2009-09-24 (Plasma) loaded real kernel. Ethernet works!
- 2009-09-24 Begin upgrading to recent JOP sources.
- 2009-09-25 Upgraded to new sources
- 2009-09-26 Began work on makefile
- 2009-09-27 More work on makefile
- 2009-09-28 Tools are now working
- 2009-10-03 integrated my old VHDL into the latest JOP sources
- 2009-10-03 got a clean build of the integrated project.
- 2009-10-04 Microcode has grown too long for one BRAM. I must rework my mem layout to use 2 BRAMs.
- 2009-10-06 upgraded XBlockGen for multi-BRAM initialization.
- 2009-10-06 upgraded jvm_rom BRAM template to use two BRAMs.
- 2009-10-08 built and loaded the new version: it did not work.
- 2009-10-09 building for sim using GHDL. GHDL is very picky about syntax, so I'm making minor changes.
- 2009-10-10 built and ran sim.
- 2009-10-10 built for target and loaded. Finally! JBC boot loader running in the JBC memory.
- 2009-10-17 completed design and code of tools to use data2MEM for arbitrary BRAM load.
- 2009-10-23 suspended work on sophisticated use of data2mem -- the program is too unstable.
- 2009-10-23 switched back to simple brute-force use of data2mem, i.e., run it once for each address space.
- 2009-10-24 resume work on JBC bootloader. built a test app ("HelloLCD.") How to init the "special pointers?"
- 2009-10-25 wrote a tiny utility to convert mem_main.dat to a vmem file. Updated makefile to append to PROM, loaded PROM.
- 2009-11-07 re-investigating oc_ddr project. It has a cleaner DDR interface than plasma, but uses DCMs
- 2009-11-08 GHDL does not like the Xilinx UNISIM DCM module. Working on isolating the problem.
- 2009-11-10 fixed oc_ddr in simulation.
- 2009-11-11 ran oc_ddr on the target. It seems to work.
- 2009-11-26 converted oc_ddr main to alternate reads and writes. tested and ran
- 2009-11-27 added a dummy simpcon interface to oc_ddr main and integrated with JOP.
- 2009-11-28 successfully ran the dummy!
Plasma
I am again working on the DDR problem. I am now investigating the Plasma project, by Steve Rhoads, which is "available" at OpenCores. This project implements a MIPS processor. It:
- runs on the Spartan 3E starter Kit
- public domain "license"
- uses VHDL
- implements the DDR interface
- implements the ethernet interface
- is actually up and working
So we have a real worked example that uses the same hardware. Wonderful! The only drawback so far is that you must register before you can download, and the registration process apparently requires human intervention at the server end. This took about six hours, but you can view the source in the SVN repository via the web viewer even without registering, so I researched it during that time. Of interest: Plasma preloads a bootloader and executes object code before it accesses DDR. It depends on code written in C to do the elaborate DDR initialization. I have plenty of space in my JOP bootloader, so perhaps I will do the same.
On my Linux machine, I downloaded Plasma and downloaded and built the MIPS-targeted GNU cross tool chain using the instructions on the Plasma web site. I then built the Plasma tools and bootloader. I adapted my Xilinx batch makefile for Plasma and (finally) built the Plasma FPGA. As a side effect, I cleaned up my JOP s3Esk makefile.
I added the LCD peripheral and the SPI peripheral. These are preliminary to testing the DDR. The SPI peripheral (VERY crude) is used by the bootloader to read the DDR image from the "PROM" (i.e., the platform flash).
This scheme is now working, and I have stored the kernel in the platform flash and then loaded it to DDR, where it successfully runs all its functions including Ethernet, DHCP client, TCP/IP, and Web server.
Next steps:
- analyze DDR and ethernet internal interfaces and adapt for JOP/Simpcon
Data2MEM
Xilinx provides the Data2MEM application as part of ISE, and it is included in the WebPack. It is intended to make it easy to update the BRAMs in the FPGA bitfile without needing to re-synthesize the whole FPGA. Even with a modern computer, the full synthesis can take more than ten minutes, while a BRAM update should take a few seconds, so the Data2MEM functionality is very attractive when debugging the bootloader. Unfortunately, the data2MEM documentation is incomprehensible.
Data2MEM has one essential function: it can update the BRAM contents within an FPGA image file. Unfortunately, Xilinx chose to add related capabilities to the program, and the descriptions of all this extra stuff completely obscure the fundamental function.
In keeping with our philosophy of using open-source tools as much as possible, we will not depend on data2MEM for any of its ancillary functions. We will perform the conversion to the ".mem" format that data2MEM uses for data ourselves, pre-normalizing all BRAMs to the simplest 9-bit format regardless of the actual internal format. We continue to use XBlockGen (in JOP) and ram_image.exe (in Plasma) to create the VHDL for the BRAMs, and extend them to create the 9-bit "mem" files we will use with data2MEM.
To update the FPGA image, data2MEM accepts three inputs and emits one output. The output is the updated FPGA image file. The inputs are:
- bitfile-- the FPGA image to be updated
- .mem -- the contents of the BRAMS to be updated
- .bmm -- a file that describes the mapping from the mem file to the BRAMS.
The big problems with the documentation are, first, understanding the function of the .bmm file (i.e., that it provides the mapping); second, understanding the semantics of the .bmm file (i.e., how the BRAMs are described and how the locations in the input file are described); and finally, determining the actual "physical names" of the BRAMs.
Mapping. The data2MEM documentation describes "address spaces" that you associate with each BRAM. The fundamental concept that is never made clear is that the .mem file is seen as a byte-addressable memory, and the addresses refer to addresses in the .mem file, not to the BRAM address spaces as used within the FPGA. The BMM format provides control for all sorts of elaborate transformations. We use none of them, but just provide a .mem file with a contiguous set of 16-bit words that contain 9-bit values. We create a .bmm file that treats each BRAM as a completely independent set of 2048 values.
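As a rough illustration of the "contiguous 16-bit words holding 9-bit values" idea, here is a minimal Python sketch (the real work is done by XBlockGen and ram_image.exe). The function name `format_mem` is hypothetical, and the exact text layout (one `@` address line followed by four-hex-digit words) is an assumption to be checked against the Data2MEM manual.

```python
def format_mem(values, start_address=0, words_per_line=8):
    """Render 9-bit BRAM values as .mem-style text.

    Assumed layout (not verified against Data2MEM): one '@<hex address>'
    line, then each value written as a 16-bit word of four hex digits.
    """
    lines = ["@%08X" % start_address]
    for i in range(0, len(values), words_per_line):
        chunk = values[i:i + words_per_line]
        for v in chunk:
            if not 0 <= v < 512:
                raise ValueError("value %r does not fit in 9 bits" % (v,))
        lines.append(" ".join("%04X" % v for v in chunk))
    return "\n".join(lines) + "\n"
```

Each BRAM would get its own call with 2048 values, matching the "completely independent set of 2048 values" treatment in the .bmm file.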
BRAM identification. You must tell data2MEM which BRAMs to fill. You tell it the physical X,Y coordinates of the BRAM within the FPGA. This location is assigned to your BRAM by the PAR tool during the process of building the bitfile. The Xilinx tools try to help you with this: you can provide a symbolic name for your BRAMs (the "BRAM instance name") in your .bmm file, and the tools will give you the X,Y coordinates. Unfortunately, the "BRAM instance name" is not completely under your control, so this is of limited use. Data2MEM is of no help in determining the "BRAM instance names" of your BRAMs: you are on your own for that, and that is the only difficult step. We mitigate this by using a regular expression match on the xdl file emitted by the build process.
My "tool chain" for Plasma: first compile into an ELF file (.axf), then objcopy to an srec, then convert to data2MEM's brain-damaged input format using the srec_cat program from the open-source SRecord package, and finally use data2MEM:
- $(MIPS_TOOLS)objcopy -I elf32-big -O srec $(TOOLS)/test.axf generated/test.srec
- srec_cat generated/test.srec -O generated/test.mem -vmem 8 -line_length 22
- xilinx data2mem -bm ISE_scripts/plasma_code.bmm -bd generated/test.mem -o h output
The critical part of the BMM file for use in the annotation step is the "BRAM_instance_name." This is the fully-qualified hierarchical name of the particular BRAM instance in the design. It is related to the names of the components and entities in your entire FPGA design, but the way the name is derived is not completely obvious. Therefore, we use a trick to find the hierarchical names that the tool chain derives. The ever-useful Ken Chapman (of PicoBlaze fame) tells us how. Basically, after you have routed your design the first time, you get a huge opaque design file named main.ncd. You can get a (big) description of the result in a readable form by running:
- xdl -ncd2xdl main.ncd
which creates main.xdl. Now, find the relevant BRAMs by searching for "RAMB16" to find lines similar to:
- inst "program_rom/ram_1024_x_18" "RAMB16",placed BMR8C2 RAMB16_X1Y4
The stuff inside the first set of quotes (program_rom/ram_1024_x_18 in this example) is the BRAM instance name.
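The search can be automated. The following Python sketch pulls the instance name and the X,Y location out of lines shaped like the example above. The pattern is an assumption based on that single sample line; real xdl output may wrap lines or use other primitive type names, so treat it as a starting point.

```python
import re

# Matches lines such as:
#   inst "program_rom/ram_1024_x_18" "RAMB16",placed BMR8C2 RAMB16_X1Y4
# Group 1 is the BRAM instance name, group 2 the placement (e.g. X1Y4).
RAMB16_RE = re.compile(
    r'inst\s+"([^"]+)"\s+"RAMB16[^"]*"\s*,\s*placed\s+\S+\s+RAMB16_(X\d+Y\d+)'
)

def find_brams(xdl_text):
    """Return a list of (instance_name, location) tuples found in xdl text."""
    return RAMB16_RE.findall(xdl_text)
```

Calling `find_brams(open("main.xdl").read())` would then give the names and locations needed for the BMM file.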
Now build your input BMM file using these instance names. You can also immediately create the "annotated" BMM file manually this time by using the location from this line: X1Y4 in this example. Otherwise, you will need to run the synthesis again to get it to annotate your BMM file, which is silly this first time. We use the BMM to simplify life if a later synthesis moves the BRAMs. Note, however, that you will need to manually re-discover the "BRAM instance names" if you make substantial changes in your component hierarchy. Ken recommends that you use distinctive, readily-recognizable names for your BRAM components to avoid confusion.
After generating the FPGA bitfile (named main-fpga.bit in this example) I update the boot BRAMs to create an updated bitfile as follows:
- data2mem -bm plasma_code_bd.bmm -bd boot.mem -bt main-fpga.bit -o b merged.bit
For JOP, we have two independent memory spaces to initialize: the microcode and the bytecode cache. The data2MEM documentation shows how to create a single BMM with multiple memory spaces, but nobody has actually figured out how to make this work. (Multiple blocks in a single space work just fine, as in the Plasma example.) On the other hand, other tools in the chain can handle multiple BMM files. Therefore, the trick for JOP is to use Data2MEM independently for each memory space, with separate BMM files.
Appending the main code image
I want to use the empty half of the program flash to store the main code image (the kernel in Plasma, or the Java application in JOP). I intended to write a little host application to append the file, but srec_cat can handle it, with a little help from the upstream build tools. For Plasma, I want to follow the FPGA image with three words: load start, load end, and bss end, followed by the kernel image. I wrote the bootloader to read and discard the FPGA image, then read the three words, then read the image and write it to RAM a word at a time, and then initialize the BSS area to zero. To generate the three words, I simply prepended a tiny assembler module containing them to the kernel, and then told the linker to start 12 bytes (three words) early and to emit the object code as an srec file. To append this file, we first use Xilinx promgen to create the FPGA PROM file named main-fpga.mcs. Then we use srec_cat (again!) as follows:
- offset=$(echo "ibase=16;45480-(10000000-0C)"|bc)
- srec_cat main-fpga.mcs -Intel kernel.srec -OF $offset -o main.mcs -Intel
This says: input main-fpga.mcs in Intel format, then input kernel.srec, but shift each of its addresses by adding an offset. Then create a file in Intel MCS format named main.mcs. OK, I suppose the offset (which is decimal -268151668) requires a small explanation. The FPGA image is 0x45480 bytes long. We want to load the kernel at location 0x10000000, but we put three words (12 bytes) in front of it and told the linker to start at 0x0ffffff4, so the addresses in the srec file start there, while the addresses in the PROM device are offset by 0x45480-0x0ffffff4, which is -268151668. This number must be recomputed for any new FPGA size and any new code load base address. We use the Unix "bc" command to do the arithmetic, feeding it input from the "echo" command, and we put the output of bc into the shell variable named "offset." The bc command outputs a base 10 number as a string, which is what the srec_cat command expects to see.
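The same arithmetic, written out as a small Python sketch so it can be rechecked when the FPGA image size or load address changes. The function and parameter names are illustrative only; the numbers are the ones from the text.

```python
def prom_offset(fpga_image_len, load_base, header_bytes=12):
    """Offset to add to the srec addresses so the kernel follows the FPGA image.

    header_bytes is the three-word header (load start, load end, bss end)
    that the linker places just below load_base.
    """
    link_start = load_base - header_bytes   # 0x0ffffff4 in the example
    return fpga_image_len - link_start

offset = prom_offset(0x45480, 0x10000000)
print(offset)  # -268151668
```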
Unfortunately, the Xilinx iMPACT loader cannot handle the next-to-last line in the emitted file, which is a "type 5" record. I wrote a tiny sed script to remove it and piped the srec_cat output to it:
- srec_cat main-fpga.mcs -Intel kernel.srec -OF $offset -o - -Intel|sed /^:04000005/d >main.mcs
As you see, I changed the srec_cat output file to "-", which means "standard output" and sed discards any lines matching the type 5 record pattern and outputs the result to main.mcs.
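For anyone without sed at hand, the same filter is easy to express in Python. This is only a sketch of the equivalent logic: like the sed pattern, it drops lines that begin with `:04000005` and passes everything else through unchanged. The function name is made up for illustration.

```python
def strip_type5(hex_text):
    """Remove Intel-hex lines starting with ':04000005' (the type 5 record)."""
    kept = [line for line in hex_text.splitlines()
            if not line.startswith(":04000005")]
    return "\n".join(kept) + "\n"
```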
We will use a different technique for JOP, but the general strategy is similar: start by reviewing the srec_cat documentation.
Next steps:
- get the Plasma RTOS running.
- port DDR and ethernet to JOP
- port the plasma RTOS to JOP.
Boot compression
The PF (platform flash) has more than 240KB of unused area. The uncompressed Plasma RTOS is less than 64K of code now. We can write a decompressor in less than 8K, leaving 232K of space for a compressed kernel blob. MIPS code compresses 8 to 1 or better, so we have room for a really big kernel. For JOP, bytecode will not compress as much, but it is a lot more compact in the first place: I think the equivalent functionality will take much less space.
The boot loader will be unmodified. It loads code where it is told to do so, so we will tell it to load a self-decompressing executable to high memory. This file will consist of a small uncompressed decompressor plus a data blob containing the compressed kernel. The decompressor will decompress the kernel into low memory and then execute it. Since we don't need big kernels yet, I will defer this. The relevant worked examples are available as part of the open-source zlib project.
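A host-side sketch of the packing step, using Python's standard zlib module as a stand-in for whatever the final tool would be. The size budget comes from the figures above; the function names and the choice of compression level 9 are assumptions, and the on-target decompressor is not shown.

```python
import zlib

PF_FREE = 240 * 1024                   # unused platform flash, from the text
DECOMPRESSOR = 8 * 1024                # space reserved for the decompressor
BLOB_BUDGET = PF_FREE - DECOMPRESSOR   # 232K left for the compressed kernel

def pack_kernel(kernel):
    """Compress a kernel image and check that it fits in the flash budget."""
    blob = zlib.compress(kernel, 9)
    if len(blob) > BLOB_BUDGET:
        raise ValueError("compressed kernel is %d bytes, budget is %d"
                         % (len(blob), BLOB_BUDGET))
    return blob

def unpack_kernel(blob):
    """What the decompressor stub would do before jumping to the kernel."""
    return zlib.decompress(blob)
```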
Makefile concepts
The basic target output of the makefile is a PROM image. This image consists of four parts: the FPGA image with "empty" BRAMs, the JOP microcode BRAM contents, the "Boot" bytecode BRAM contents, and the kernel. The last stage of the make combines these four parts and loads the PROM. The build of the "empty" FPGA consumes approximately 99% of the CPU time needed for a complete make. Therefore, the makefile only rebuilds the empty image when it is necessary, that is, when there is a change to any VHDL file other than the BRAM images.
When a non-BRAM VHDL file changes, we re-run the entire Xilinx tool chain. This takes several minutes and creates the "empty" fpga bitfile.
When the boot java source code changes, we rebuild the boot code, extract the boot method, and build the boot BRAM mem file.
When the microcode source changes, we re-run Jopa and then re-build the jvm mem files.
When any Java kernel source changes we rebuild the kernel.
As a completely separate step, we can also build and run the simulation. We do a complete rebuild whenever any VHDL or generated VHDL changes. This differs from the actual target. To make this work, we must use a slightly different dependency list for the sim target than we do for the PROM target. In particular, the PROM target uses "placeholder" BRAM vhdl, while the sim uses generated BRAM vhdl. In addition, the sim target uses a sim "top" vhdl and may also use additional vhdl files to simulate off-chip circuitry.
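The PROM-target rules above can be summarized as a small lookup. This Python sketch is not the makefile itself, only a restatement of which products each kind of change invalidates; the stage and category names are invented for illustration.

```python
# Which build product each kind of source change invalidates (PROM target).
REBUILDS = {
    "vhdl": ["empty_fpga"],       # full Xilinx tool chain, several minutes
    "boot_java": ["boot_mem"],    # boot method -> boot BRAM mem file
    "microcode": ["jvm_mem"],     # Jopa -> jvm mem files
    "kernel_java": ["kernel"],    # Java kernel image
}

def stages_for(changed_kinds):
    """Return the stages to re-run, ending with the PROM merge if needed."""
    stages = []
    for kind in ("vhdl", "boot_java", "microcode", "kernel_java"):
        if kind in changed_kinds:
            stages.extend(REBUILDS[kind])
    if stages:
        stages.append("prom")     # final step combines the four parts
    return stages
```

The point of the arrangement is visible here: only a "vhdl" change ever pulls in the slow "empty_fpga" stage.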
jtbl BRAM
Jopa emits jtbl.vhd, which is a generic translate instead of an explicit memory. This breaks the quick build, because we must re-synthesize whenever jtbl.vhd changes. The Xilinx tool chain is not synthesizing jtbl into a BRAM, so we cannot use data2MEM when jtbl changes. This means we are stuck with long synthesis times for microcode changes. There are several ways to avoid this, but since I ultimately want to get rid of the jtbl I will simply live with it, and use data2MEM only for the JBC bootloader. Update: Since we are forced to use 2 BRAMs for the microcode anyway (see below), we can add unused space between routines in the microcode. If done carefully, this permits us to preserve the jtbl even when the microcode changes. This will require a small set of compile options for Jopa. When Jopa is compiled with "initialize space," it will intelligently add nops in front of entry points and create a jtbl. When compiled with "preserve," it will take the jtbl as input and force the entry points to match the existing jtbl.
New BRAM layout for microcode
After finally getting a clean VHDL build, I checked the memory sizes. Sadly, the microcode has grown in the last 18 months. It was 0x3e4 x 10 without the boot code. It is now 0x4da x 12 without the boot code, but a BRAM holds 0x400 x 18, so I now need two BRAMs. I don't know if it is better to use them "wide" or "long," so I will go "wide." Thus, each 12-bit word will have 8 bits in one BRAM and 4 bits in the other. This requires changes to the VHDL template and to XBlockGen. The changes were fairly simple.
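The "wide" split amounts to slicing each microcode word into two fields. A minimal Python sketch follows; which BRAM gets the low 8 bits and which gets the high 4 bits is an assumption here, since the text does not say.

```python
def split_wide(words):
    """Split 12-bit words into an 8-bit image and a 4-bit image."""
    low = [w & 0xFF for w in words]           # assumed: low 8 bits in BRAM A
    high = [(w >> 8) & 0xF for w in words]    # assumed: high 4 bits in BRAM B
    return low, high

def join_wide(low, high):
    """Recombine the two images, as the VHDL would do on each read."""
    return [(h << 8) | l for l, h in zip(low, high)]
```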
If BRAM usage is critical, we can reduce the JVM ROM back to a single BRAM of 1365*13. Observe that the BRAM hardware has two separately-configurable ports, but we only need one port. So, configure one port as 2048*9 and the other as 4096*4. To create a 13-bit word, we invert the address bits for the 4-bit read, so the 4-bit words start at the top of the BRAM while the 9-bit words start at the bottom, and the words do not overlap until the address is >1364 (decimal). The code to build the image is straightforward. The VHDL is also straightforward. The current JVM, without boot loaders, is 1246 decimal words of 12 bits.
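The 1365-word limit can be checked with a little arithmetic. The Python sketch below assumes the two port widths alias the same 16384 data bits linearly (x9 address a uses data bits 8a..8a+7 with its ninth bit in the separate parity array, and x4 address b uses data bits 4b..4b+3); that mapping is my assumption and should be confirmed against the Spartan-3E block RAM documentation.

```python
def fits(n_words):
    """True if n_words 13-bit words fit without the two fields colliding."""
    top_of_9bit = 8 * n_words                 # first data bit NOT used by the x9 port
    bottom_of_4bit = 4 * (4096 - n_words)     # lowest data bit used by the inverted x4 port
    return top_of_9bit <= bottom_of_4bit

# Largest word count for which the 9-bit and 4-bit regions stay disjoint.
max_words = max(n for n in range(1, 2049) if fits(n))
print(max_words)  # 1365
```

Under that assumption the limit is 12n <= 16384, which gives 1365 words (addresses 0 through 1364), comfortably more than the current 1246.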
JBC bootloader
This code is restricted Java bytecode that resides in the method cache. It needs to perform all functions currently performed by the microcode bootloader. So far, it can read from the program flash and should be able to write to the DDR as soon as I port the DDR VHDL. Both word 0 (length) and word 1 (pointer to special pointers) are copied into memory: the file's first word becomes the first word in memory. However, the microcode also performs some steps after the RAM is loaded:
- copy contents of word 0, add one, and store in the heap pointer.
- copy contents of word 1 to mp: pointer to the "special" pointer list
- copy contents of mp[1] to jjp
- copy contents of mp[2] to jjhp
- load mp to TOS
- jmp invoke_main.
The variables (mp, cp, heap, jjp, jjhp) are in the "internal mem" (i.e., the stack BRAM) and are accessible from the Java code using the native methods RdIntMem and WrIntMem, so we only need to worry about doing a "jmp invoke_main" with the mp in TOS. By examining the microcode, we see that "push mp, jmp invoke_main" is equivalent to "push mp[0], jmp invoke," and "invoke" is already a native method.
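To make the post-load steps concrete, here is a Python model of them operating on a list that stands in for main memory. It is only a sketch: the dictionary stands in for the internal-memory variables, "mp[1]" is read as "the word at address mp+1" (my interpretation), and the real bootloader would use RdIntMem/WrIntMem and the native invoke instead.

```python
def post_load(ram):
    """Model the steps performed after the application is loaded into RAM.

    Returns the internal-memory variables and the value to push before
    calling the native 'invoke'.
    """
    internal = {}
    internal["heap"] = ram[0] + 1     # word 0 is the length; heap starts after it
    mp = ram[1]                       # word 1 points to the special pointer list
    internal["mp"] = mp
    internal["jjp"] = ram[mp + 1]
    internal["jjhp"] = ram[mp + 2]
    return internal, ram[mp]          # "push mp[0], jmp invoke"
```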
Integrating the OC_DDR project
I reverted to the OC_DDR project as a basis for my DDR. Plasma does not use DCMs, and therefore runs the DDR at 25MHz, without the DLL. The manufacturer states that this mode is unsupported, so I decided to try to use DCMs and a 100MHz clock. This requires interesting settings in the UCF file and in other places. Initially, the 100MHz clock is used only within the DDR controller. The rest of JOP runs at 50MHz.
Given my primitive understanding of VHDL and FPGA design, I decided to decompose the problem. First, I implemented the standalone DDR project: this worked. Then, I modified the standalone project to perform alternating reads and writes instead of continuous writes followed by continuous reads, since the original code did not make it clear how to terminate a command. Then, I moved the entire DDR test into my JOP FPGA, converting the LED output of the oc_ddr into a status word that is read via a simpcon interface. I wrote a "bootloader" that reads this word and displays it on the LEDs. This dummy simpcon interface within oc_ddr runs at 50MHz while the rest runs at 100MHz. This worked.
This shows that all the elaborate DCM and UCF stuff was synthesized correctly even when the FPGA is much more fully populated than it was with the simple test, and that the JOP functionality and the DDR functionality both work. I can now implement and debug the real simpcon interface without worrying about these other issues. I wanted to isolate this step because I have never tried to bridge between two clock domains before.
Since the internal DDR interface timing is not well documented, I derived it by experiment, code analysis, and observation. The result is extremely conservative and is therefore quite slow, with no attempt at pipelining. I intend to get the simpcon interface working before I try to optimize it.
Clock domain crossing
I continue to struggle with the DDR interface. My various random solutions are failing, and I'm reasonably sure the problem relates to crossing the clock domain boundary. I finally decided to search the web, and I found many references to a circuit called "the flancter," by Rob Weinstein. I'm reasonably sure I am properly registering the actual data, so I only need a single "my turn, your turn" signal, and the flancter appears to provide this nicely, so I will try this. If I can eventually speed JOP up to at least 70MHz, I can shift to a synchronous design and avoid the problem, but the asynchronous approach is more flexible and I hope it will let me get operational.
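A behavioral sketch of the flancter as I understand the published circuit: two flip-flops, one clocked in each domain, with the flag formed as the XOR of their outputs. This Python model ignores clocks, metastability, and the synchronizers that each side needs before reading the flag; it only shows why one side can set the flag and the other can clear it without sharing a clock.

```python
class Flancter:
    """Toy model: set side and reset side each own one flip-flop."""

    def __init__(self):
        self.set_q = False     # flip-flop in the "set" clock domain
        self.reset_q = False   # flip-flop in the "reset" clock domain

    def set(self):
        # Set-domain FF loads the inverse of the reset FF, making them differ.
        self.set_q = not self.reset_q

    def reset(self):
        # Reset-domain FF copies the set FF, making them equal again.
        self.reset_q = self.set_q

    @property
    def flag(self):
        # Flag is the XOR of the two flip-flop outputs.
        return self.set_q != self.reset_q
```

In the "my turn, your turn" use described above, the simpcon side would call the set operation and the oc_ddr side the reset operation.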
On sober consideration, I may be able to just use a single flip-flop: set it on the rising edge of the simpcon "set" signal, and clear it based on the level of the oc_ddr "clear" signal. This works because the handshake is quite long in this design, so the two transitions will never occur during a single cycle of either machine. I will try this first, and move to the flancter only if needed.
