From: Frank Buss on 22 Aug 2006 18:06

jacko wrote:

> search for MSL16 as a compact example of stack machine, i would use
> slightly different ops, and things if i did it.

The paper at
http://www.cse.cuhk.edu.hk/~phwl/mt/public/archives/old/msl16/fccm98_fcpu.pdf
says it needs 175 CLBs on a Xilinx FPGA. And
http://www.xilinx.com/publications/xcellonline/xcell_48/xc_picoblaze48.htm
says that the PicoBlaze needs 76 slices (311 slices, if you add serial ports and timers). I'm not sure if this is valid for every FPGA, but somewhere I've read that 4 slices = 1 CLB, so the MSL16 (about 700 slices) needs more than 9 times as much logic as the PicoBlaze. This is not my idea of a small core.

--
Frank Buss, fb(a)frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
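The comparison above is plain arithmetic. A quick sketch of it, taking the post's rule of thumb of 4 slices per CLB as an assumption (it may not hold for every FPGA family):

```python
# Rough area comparison using the figures quoted in the post.
# SLICES_PER_CLB = 4 is the post's rule of thumb, not a verified constant.
SLICES_PER_CLB = 4

msl16_clbs = 175         # MSL16, figure quoted from the paper
picoblaze_slices = 76    # PicoBlaze core without serial ports and timers

msl16_slices = msl16_clbs * SLICES_PER_CLB
ratio = msl16_slices / picoblaze_slices

print(msl16_slices)      # 700
print(round(ratio, 1))   # 9.2
```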
From: Frank Buss on 22 Aug 2006 21:00

Jim Granville wrote:

> One stack machine, that is still small, but could help greatly with
> software flows (being an already defined std)
> is the Instruction List language of IEC 61131-3
>
> http://www.3s-software.com/index.shtml?CoDeSys_IL
>
> and
>
> http://en.wikipedia.org/wiki/Instruction_list

This looks very interesting, because every command has only one operand, which makes developing the core really easy and leaves much space for addressing modes etc. in an opcode, even with 8 bit opcodes.

I'll try to mix this with my last addressing modes. I don't need "call", because this is only a jump, where the return address is stored somewhere (I don't need recursion).

4 bits: instruction

lda: load accu
sta: store accu
or: accu = accu or argument
xor: "
and: "
add: "
sub: "
cmp: "
bcc: branch if carry clear
bcs: branch if carry set
bne: branch if zero set
beq: branch if zero clear
jmp: jump
inp: read from port or special register (pc, flags, i/o ports, timer etc.)
outp: write to port or special register

I don't need it, but the last possible instruction could be rti, return from interrupt, which restores pc, accu and the flags, which are saved on interrupt entry. With inp/outp the interrupt address and frequency could be set up.

4 bits: address mode (pc relative with a 16 bit argument doesn't make much sense, so all useful combinations fit in 4 bits)

immediate, 8 bit argument
immediate, 16 bit argument
immediate, no arguments, #0
immediate, no arguments, #1

8 bit transfer width:
address, 8 bit argument
address, 16 bit argument
address, pc relative, 8 bit argument
address indirect, 8 bit argument
address indirect, 16 bit argument
address indirect, pc relative, 8 bit argument

16 bit transfer width:
address, 8 bit argument
address, 16 bit argument
address, pc relative, 8 bit argument
address indirect, 8 bit argument
address indirect, 16 bit argument
address indirect, pc relative, 8 bit argument

The "pc relative" address modes add the argument to the pc to get the value. This can be used for short branches and jumps, but also for some kind of local variables. Let's try the swap algorithm:

; swap 6 byte source and destination MACs
p1:   .dw 0
p2:   .dw 0
tmp:  .db 0
      lda #5
      sta p1     (pc relative)
      lda #11
      sta p2     (pc relative)
loop: lda (p1)   (address indirect, pc relative)
      sta tmp    (address, pc relative)
      lda (p2)   (address indirect, pc relative)
      sta (p1)   (address indirect, pc relative)
      lda tmp    (address, pc relative)
      sta (p2)   (address indirect, pc relative)
      lda p2     (pc relative)
      sub #1     (one byte, because #1 needs no operand)
      sta p2     (pc relative)
      lda p1     (pc relative)
      sub #1     (one byte, because #1 needs no operand)
      sta p1     (pc relative)
      bcc loop   (pc relative)

37 bytes

This is not as good as my RISC idea (20 bytes), but the code is much easier to understand: you don't need to think about it when reading and writing it. But maybe this is only because ages ago I wrote some demos and intros on the C64 (6502), which uses a similar instruction set :-)

Do you think the core for this design would be smaller than PicoBlaze or my RISC idea?

BTW: There are some nice constructs possible for smaller code, like some kind of zero-page as implemented in the 6502, because the lda/sta instructions could be used with 8 bit argument addresses. But code size and speed are not so important for me; a small core is more important, and maybe assembler code that is easy to write, to avoid implementing a GCC backend for my CPU.

--
Frank Buss, fb(a)frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
From: Jim Granville on 22 Aug 2006 22:19

Frank Buss wrote:

> Jim Granville wrote:
>
>>One stack machine, that is still small, but could help greatly with
>>software flows (being an already defined std)
>>is the Instruction List language of IEC 61131-3
>>
>>http://www.3s-software.com/index.shtml?CoDeSys_IL
>>
>>and
>>
>>http://en.wikipedia.org/wiki/Instruction_list
>
> This looks very interesting, because every command has only one operand,
> which makes developing the core really easy and leaves much space for
> addressing modes etc. in an opcode, even with 8 bit opcodes.
>
> I'll try to mix this with my last addressing modes. I don't need "call",
> because this is only a jump, where the return address is stored somewhere
> (I don't need recursion).
>
> 4 bits: instruction
> lda: load accu
> sta: store accu
> or: accu = accu or argument
> xor: "
> and: "
> add: "
> sub: "
> cmp: "
> bcc: branch if carry clear
> bcs: branch if carry set
> bne: branch if zero set
> beq: branch if zero clear
> jmp: jump
> inp: read from port or special register (pc, flags, i/o ports, timer etc.)
> outp: write to port or special register
>
> I don't need it, but the last possible instruction could be rti, return
> from interrupt, which restores pc, accu and the flags, which are saved on
> interrupt entry. With inp/outp the interrupt address and frequency could be
> set up.
>
> 4 bits: address mode (pc relative with a 16 bit argument doesn't make much
> sense, so all useful combinations fit in 4 bits)
>
> immediate, 8 bit argument
> immediate, 16 bit argument
> immediate, no arguments, #0
> immediate, no arguments, #1
>
> 8 bit transfer width:
> address, 8 bit argument
> address, 16 bit argument
> address, pc relative, 8 bit argument
> address indirect, 8 bit argument
> address indirect, 16 bit argument
> address indirect, pc relative, 8 bit argument
>
> 16 bit transfer width:
> address, 8 bit argument
> address, 16 bit argument
> address, pc relative, 8 bit argument
> address indirect, 8 bit argument
> address indirect, 16 bit argument
> address indirect, pc relative, 8 bit argument
>
> The "pc relative" address modes add the argument to the pc to get the
> value. This can be used for short branches and jumps, but also for
> some kind of local variables. Let's try the swap algorithm:
>
> ; swap 6 byte source and destination MACs
> p1:   .dw 0
> p2:   .dw 0
> tmp:  .db 0
>       lda #5
>       sta p1     (pc relative)
>       lda #11
>       sta p2     (pc relative)
> loop: lda (p1)   (address indirect, pc relative)
>       sta tmp    (address, pc relative)
>       lda (p2)   (address indirect, pc relative)
>       sta (p1)   (address indirect, pc relative)
>       lda tmp    (address, pc relative)
>       sta (p2)   (address indirect, pc relative)
>       lda p2     (pc relative)
>       sub #1     (one byte, because #1 needs no operand)
>       sta p2     (pc relative)
>       lda p1     (pc relative)
>       sub #1     (one byte, because #1 needs no operand)
>       sta p1     (pc relative)
>       bcc loop   (pc relative)
>
> 37 bytes
>
> This is not as good as my RISC idea (20 bytes), but the code is much easier
> to understand: you don't need to think about it when reading and writing it.
> But maybe this is only because ages ago I wrote some demos and
> intros on the C64 (6502), which uses a similar instruction set :-)
>
> Do you think the core for this design would be smaller than PicoBlaze or my
> RISC idea?

The core can certainly be made very small; it depends on the datatypes you decide to support. I've been looking at putting the very similar, but venerable, MC14500 ICU into CPLDs (effectively IL with only a Boolean type).

Note that the IL syntax allows brackets, and I think has an implicit stack, a bit like reverse-polish calculators. See this example I got from the web (derived from a ladder diagram):

Read as O:001/00 = I:000/00 AND ( I:000/01 OR ( I:000/02 AND NOT I:000/03) )

Label   Opcode  Operand     Comment
START:  LD      %I:000/00   (* Load input bit 00 *)
        AND(    %I:000/01   (* Start a branch and load input bit 01 *)
        OR(     %I:000/02   (* Load input bit 02 *)
        ANDN    %I:000/03   (* Load input bit 03 and invert *)
        )
        )
        ST      %O:001/00   (* SET the output bit 00 *)

With the implicit stack, your swap becomes

        LD      VarNameA
        LD      VarNameB
        ST      VarNameA
        ST      VarNameB

This also makes the assembler a little more complex, as it needs to re-order, and be bracket aware, before final code generation :)

> BTW: There are some nice constructs possible for smaller code, like
> some kind of zero-page as implemented in the 6502, because the lda/sta
> instructions could be used with 8 bit argument addresses. But code size
> and speed are not so important for me; a small core is more important, and
> maybe assembler code that is easy to write, to avoid implementing a GCC
> backend for my CPU.

Another good reference site I've found is this:

http://www.tracemode.com/products/dev/

They offer a free (117MB) version, which I have not had the time to look at yet. Something like this should (hopefully) allow simulation and development of IL code, as the software aspects of this will be the key elements. If you can keep to a defined type/operator subset of IL, then this should also be somewhat portable.

I did see that some of their IL examples suggest two operands, but the standards docs I have here do not mention that? It could be that two operands simply do an implicit load of the first one, and that this is done to make the code slightly more readable.

-jg
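To make the bracket handling concrete, here is a minimal Python sketch that evaluates the Boolean example above. It is one reading of the semantics described in the post (an opening bracket parks the current result and the pending operator, a closing bracket combines them), not a conformant IEC 61131-3 implementation. The names `run_il`, `PROGRAM`, `I0`..`I3` and `O0` are hypothetical stand-ins for the `%I:000/0x` and `%O:001/00` operands.

```python
def run_il(program, inputs):
    """Evaluate a Boolean-only IL program and return the stored outputs."""
    ops = {"AND": lambda a, b: a and b, "OR": lambda a, b: a or b}
    cr = False        # current result, i.e. the accumulator
    deferred = []     # stack of (pending operator, saved current result)
    outputs = {}
    for opcode, operand in program:
        if opcode == "LD":
            cr = inputs[operand]
        elif opcode in ("AND(", "OR("):
            deferred.append((opcode[:-1], cr))  # park the left-hand side
            cr = inputs[operand]                # start the bracketed term
        elif opcode == "ANDN":
            cr = cr and not inputs[operand]
        elif opcode == ")":
            op, saved = deferred.pop()
            cr = ops[op](saved, cr)
        elif opcode == "ST":
            outputs[operand] = cr
        else:
            raise ValueError(f"unsupported opcode: {opcode}")
    return outputs


# O0 = I0 AND ( I1 OR ( I2 AND NOT I3 ) )
PROGRAM = [
    ("LD",   "I0"),
    ("AND(", "I1"),
    ("OR(",  "I2"),
    ("ANDN", "I3"),
    (")",    None),
    (")",    None),
    ("ST",   "O0"),
]

result = run_il(PROGRAM, {"I0": True, "I1": False, "I2": True, "I3": False})
print(result["O0"])  # True
```

In hardware the `deferred` list would be a small stack of one-bit results plus operator codes, which is the extra state the brackets cost compared with a pure accumulator design.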
From: Frank Buss on 23 Aug 2006 06:04

Martin Schoeberl wrote:

> What do you mean with 'very close to the hardware'? I try to
> avoid vendor specific library elements as much as possible and
> stay with plain VHDL. If you mean that the VHDL coding style
> is more hardware oriented, than I agree.

Yes, this is what I meant, e.g. figures 5.6 to 5.9 of your thesis, where you describe the processor pipeline with gates, and which is implemented like this in VHDL. But maybe this is the normal case and I'm just too new to VHDL to write and interconnect components in this way.

http://www.jopdesign.com/thesis/thesis.pdf

> I started directly
> in an FPGA implementation and did almost no simulation.

Why not? When I was implementing the CRC32 check for my network core, I tested the algorithm with a VHDL testbench (Ethernet packet send and receive now works at 10 Mbit and 100 Mbit on my Spartan 3E starter kit). The turnaround times are faster with simulation and it is very easy to debug, compared with debugging a synthesized core in hardware. The same was true for my DS2432 ROM id reader, where I wrote the testbench first and then implemented the reader.

http://www.frank-buss.de/vhdl/spartan3e.html

--
Frank Buss, fb(a)frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
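A testbench of the kind described above typically compares the hardware result against a software reference. The post does not show its testbench, so the following is only a sketch of such a reference: a bit-serial CRC-32 in the reflected form used for the Ethernet frame check sequence, cross-checked against Python's `zlib.crc32`. The function name `crc32_bitwise` is hypothetical.

```python
import zlib


def crc32_bitwise(data: bytes) -> int:
    """Bit-serial CRC-32 (reflected polynomial 0xEDB88320)."""
    crc = 0xFFFFFFFF                      # initial value: all ones
    for byte in data:
        crc ^= byte
        for _ in range(8):                # one shift per input bit
            if crc & 1:
                crc = (crc >> 1) ^ 0xEDB88320
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF               # final inversion


check = crc32_bitwise(b"123456789")
print(hex(check))                         # 0xcbf43926
print(check == zlib.crc32(b"123456789"))  # True
```

The inner loop processes one bit per iteration, which maps directly onto a one-bit-per-clock shift register and makes it a convenient golden model for simulation.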
From: Walter Banks on 23 Aug 2006 08:43

Jim Granville wrote:

> The tiniest CPUs do not need a stack, and interrupts do not need to be
> re-entrant, so a faster context switch is to re-map the Registers, Flags
> (and even PC ? ) onto a different area in BRAM.
> You can share this resource by INTs re-map top-down, and calls re-map
> bottom up - with a hardware trap when they collide :)

Once you see clearly the relationship between features and cost, a lot can be removed. Interrupts can be removed at extremely low cost to applications. Neither the Microchip PIC12 nor the Freescale RS08 has interrupts. In the RS08 C compiler we developed some software IP to go into a power down mode where possible and launch execution threads that are compiled to run to completion. The threads are typically short, and as a side effect run to completion makes local re-use easy.

C compilers implemented for small processors work well without either a data or subroutine return stack. Two of the processors we have written compilers for in the last couple of years both used an accessible return register. Flow control analysis in the compiler makes nested subroutines user transparent.

The instruction set reduction in the RS08 from the S08 parent had a 4-6% impact on application performance.

Walter..
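The bank re-mapping idea quoted at the top of this post can be sketched as a small behavioural model: calls claim register banks from the bottom of the block RAM upward, interrupts claim them from the top downward, and an exception stands in for the hardware trap when the two regions meet. The bank count, class name and method names are all hypothetical; the thread gives only the idea.

```python
class BankCollision(Exception):
    """Stands in for the hardware trap when the two regions meet."""


class BankAllocator:
    def __init__(self, num_banks):
        self.call_bank = 0          # bank 0 holds the main context
        self.int_bank = num_banks   # no interrupt bank in use yet

    def enter_call(self):
        """Re-map to the next bank upward for a subroutine call."""
        if self.call_bank + 1 >= self.int_bank:
            raise BankCollision("call region reached interrupt region")
        self.call_bank += 1
        return self.call_bank

    def leave_call(self):
        self.call_bank -= 1

    def enter_interrupt(self):
        """Re-map to the next bank downward for an interrupt."""
        if self.int_bank - 1 <= self.call_bank:
            raise BankCollision("interrupt region reached call region")
        self.int_bank -= 1
        return self.int_bank

    def leave_interrupt(self):
        self.int_bank += 1


banks = BankAllocator(num_banks=4)
print(banks.enter_call())       # 1
print(banks.enter_interrupt())  # 3
print(banks.enter_call())       # 2
try:
    banks.enter_interrupt()     # only banks 0..2 and 3 exist: nothing left
except BankCollision as trap:
    print("trap:", trap)
```

A context switch in this model is just a change of the bank index, which is why it can be faster than pushing registers onto a stack.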