From: John_H on 21 Apr 2010 06:26

On Apr 20, 1:00 pm, Eric <eric.lafor...(a)gmail.com> wrote:
> Hi All,
>
> I've recently published a paper exploring how to implement memories
> with multiple read and write ports on existing FPGAs. I figured it
> might be of interest to some.
>
> Summary, paper, slides, and example code are here:
> http://www.eecg.utoronto.ca/~laforest/multiport/index.html
>
> There are no patents or other additional IP encumbrances on the code.
> If you have any comments or other feedback, I'd like to hear it.
>
> Eric LaForest
> PhD student, ECE Dept.
> University of Toronto
> http://www.eecg.utoronto.ca/~laforest/

Could you mention here or on your page what you mean by "multipumping"?
If you mean time-multiplexed access, I can see why multipumping is bad.
[The "pure logic" approach also isn't obvious.]

Do you update the LVT in the same way I might update the RAM value in a
many-write BlockRAM?

To implement a many-write BlockRAM, each "write bank" is maintained
separately. To write a new value, only the write bank for the associated
write port is updated, but with the XOR of the write data with the reads
of all the other write banks at the same address. By reading the same
address from all write banks, the XOR of all the reads is the most recent
write value (unless there are multiple writes in the same clock cycle).

Since the LVT only has to be as wide as the number of entries, I can see
how this would be very beneficial for wide data and many write ports.
Aside from wide data, however, I don't see (without going into the
attachments on that page) how updating the LVT is any different than
updating the memory in the first place.
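For illustration, here is a minimal behavioural Verilog sketch of the XOR
write-bank trick described above, sized for 2 write ports and 1 read port.
The module and signal names are made up for this sketch; it is not
John_H's code or the paper's, and the Block-RAM replication a real
implementation needs is only noted in the comments.

// Behavioural sketch of the XOR write-bank idea, 2 writes / 1 read.
// A Block-RAM version would replicate each write bank so it can be read
// at the other write port's address and at every read port's address in
// the same cycle, and would register those reads.
module xor_mem_2w1r #(
    parameter WIDTH = 32,
    parameter DEPTH = 1024,
    parameter AW    = 10
)(
    input                  clk,
    input                  we0, we1,
    input      [AW-1:0]    waddr0, waddr1,
    input      [WIDTH-1:0] wdata0, wdata1,
    input      [AW-1:0]    raddr,
    output     [WIDTH-1:0] rdata
);
    // One "write bank" per write port.
    reg [WIDTH-1:0] bank0 [0:DEPTH-1];
    reg [WIDTH-1:0] bank1 [0:DEPTH-1];

    // Zero the banks so the XOR identity holds from the start (in
    // hardware the RAMs would be initialised to zero).
    integer i;
    initial begin
        for (i = 0; i < DEPTH; i = i + 1) begin
            bank0[i] = {WIDTH{1'b0}};
            bank1[i] = {WIDTH{1'b0}};
        end
    end

    // A write stores its data XORed with the *other* bank's current
    // contents at the same address. Two writes to the same address in
    // the same cycle are assumed not to happen.
    always @(posedge clk) begin
        if (we0) bank0[waddr0] <= wdata0 ^ bank1[waddr0];
        if (we1) bank1[waddr1] <= wdata1 ^ bank0[waddr1];
    end

    // A read XORs the same address from every bank; everything except
    // the most recently written value cancels out.
    assign rdata = bank0[raddr] ^ bank1[raddr];
endmodule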
From: Weng Tianxiang on 23 Apr 2010 22:39

On Apr 22, 3:45 pm, John_H <newsgr...(a)johnhandwork.com> wrote:
> On Apr 22, 1:55 pm, Eric <eric.lafor...(a)gmail.com> wrote:
> > On Apr 22, 12:36 pm, rickman <gnu...(a)gmail.com> wrote:
> > > I guess I don't understand what you are accomplishing with this.
> > > Block rams in FPGAs are almost always multiported. Maybe not N way
> > > ported, but you assume they are single ported when they are dual
> > > ported.
> >
> > But what if you want more ports, say 2-write/4-read, without wait
> > states?
> > I assume them to be "simply dual-ported", which means one write port
> > and one read port, both operating concurrently. It is also possible to
> > run them in "true dual port" mode, where each port can either read or
> > write in a cycle. Some of the designs in the paper do that.
> >
> > > Can you give a general overview of what you are doing without using
> > > jargon? I took a look and didn't get it at first glance.
> >
> > OK. Let me try:
> >
> > Assume a big, apparently multiported memory of some given capacity and
> > number of ports. Inside it, I use a small multiported memory
> > implemented using only the fabric of an FPGA, which stores only the
> > number of the write port which wrote last to a given address. Thus
> > this small memory is of the same depth as the whole memory, but much
> > narrower, hence it scales better.
> >
> > When you read at a given address from the big memory, internally you
> > use that address to look up which write port wrote there last, and use
> > that information to steer the read to the correct internal memory bank
> > which will hold the data you want. These banks are built up of
> > multiple Block RAMs so as to have one write port each, and as many
> > read ports as the big memory appears to have.
> >
> > The net result is a memory which appears to have multiple read and
> > write ports which can all work simultaneously, but which leaves the
> > bulk of the storage to Block RAMs instead of the FPGA fabric, which
> > makes for better speed and smaller area.
> >
> > Does that help?
> >
> > Eric
>
> I appreciate the elaboration here in the newsgroup.
>
> The "true dual port" nature of the BlockRAMs allows one independent
> address on each of the two ports with a separate write enable for each
> port. The behavior of the BlockRAM can be modified to provide read
> data based on the new write data, old data, or no change in the read
> data value from last cycle (particularly helpful for multi-pumping).
>
> For an M write, N read memory, your approach appears to need M x (N+1)
> memories since you can have M writes all happening at the same time N
> accesses are made to the same "most recently written" memory. Please
> correct me if I'm wrong. This is the same number of memories required
> with the XOR approach but without the LVT overhead. The time delay in
> reading the LVT and multiplexing the memories feels like it would be
> cumbersome. While this might not add "wait states", it appears the
> system would not be able to run terribly quickly. XORs are pretty
> quick.
>
> There are always more ways to approach a problem than any one group
> can come up with. Kudos on your effort to bring a better approach to
> a tough system level issue for difficult designs.

John_H,

What is the XOR method in this regard? Can you give an explanation, or
give a hint on the source?

Weng
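Eric's description maps to something like the following behavioural
Verilog, shown here only as an illustration of the LVT idea. The names
and the 2-write/2-read sizing are assumptions of this sketch, not taken
from the published code, and the per-read-port Block-RAM replication of
the real designs is only noted in a comment.

// Behavioural sketch of the LVT approach, 2 writes / 2 reads.
module lvt_mem_2w2r #(
    parameter WIDTH = 32,
    parameter DEPTH = 1024,
    parameter AW    = 10
)(
    input                  clk,
    input                  we0, we1,
    input      [AW-1:0]    waddr0, waddr1,
    input      [WIDTH-1:0] wdata0, wdata1,
    input      [AW-1:0]    raddr0, raddr1,
    output     [WIDTH-1:0] rdata0, rdata1
);
    // One bank per write port. In a Block-RAM mapping each bank is
    // replicated once per read port (M x N RAMs in total), since a
    // simple-dual-port RAM gives only one read port per copy.
    reg [WIDTH-1:0] bank0 [0:DEPTH-1];
    reg [WIDTH-1:0] bank1 [0:DEPTH-1];

    // The Live Value Table: as deep as the memory, but only
    // ceil(log2(number of write ports)) bits wide -- here 1 bit.
    // It is built from fabric registers so it can accept both writes
    // and both reads in the same cycle.
    reg lvt [0:DEPTH-1];

    // Each write updates its own bank and records its port number in
    // the LVT at the same address. Simultaneous writes to the same
    // address are assumed not to occur.
    always @(posedge clk) begin
        if (we0) begin bank0[waddr0] <= wdata0; lvt[waddr0] <= 1'b0; end
        if (we1) begin bank1[waddr1] <= wdata1; lvt[waddr1] <= 1'b1; end
    end

    // Each read consults the LVT and is steered to the bank that holds
    // the most recently written value for that address.
    assign rdata0 = lvt[raddr0] ? bank1[raddr0] : bank0[raddr0];
    assign rdata1 = lvt[raddr1] ? bank1[raddr1] : bank0[raddr1];
endmodule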
From: Weng Tianxiang on 25 Apr 2010 00:03

On Apr 23, 7:39 pm, Weng Tianxiang <wtx...(a)gmail.com> wrote:
> On Apr 22, 3:45 pm, John_H <newsgr...(a)johnhandwork.com> wrote:
> [snip]
>
> John_H,
>
> What is the XOR method in this regard? Can you give an explanation, or
> give a hint on the source?
>
> Weng

Eric,

Here is my answer to your paper.

1. It is an excellent paper. If I had to give it a mark from 0 to 100, I
would give it 90.

2. It really has an inventive idea. It solidly resolves a problem: with
the FPGA chips available today, how do you build a block RAM with the
required number of read/write ports out of block RAMs that have a
limited, fixed number of write/read ports? In this regard, your method is
the best!!! I will use it if I need to.

3. The references are good, but they lack the most important items:
patents filed in this area by Altera and Xilinx. They must have their
techniques patented. No matter what, big or small, both companies patent
everything they think is important to them.

4. If you want to extend your inventive idea to new block RAMs with
multiple write/read ports built into the FPGA, that is not a good idea:
there is essentially no difficulty in building a block RAM with multiple
write/read ports. If you add 4 sets of read decoders and 4 sets of write
decoders to a current block RAM, it immediately has 4 write/read ports,
with no data racing in the design at all. The question is whether there
is a big enough need in the market for FPGA manufacturers to do so.

5. Your final conclusion about write-and-read scheduling is not right.
People using your method still face write-and-read scheduling. For
example, suppose there is a waiting pool that receives write and read
requests and can hold 200 of each: deciding which write and read requests
to deliver to your block RAM ports is still a write-and-read scheduling
problem. So your method doesn't eliminate the write-and-read scheduling
problem. You can only say that your solution doesn't introduce a new
scheduling problem, and that it comes with a new write/read rule: if a
write and a read to the same address are issued in the same clock cycle,
the read returns the data that is about to be overwritten by the write
issued in that cycle.

Here is an answer to John_H and Rick:

> For an M write, N read memory, your approach appears to need M x (N+1)
> memories since you can have M writes all happening at the same time N
> accesses are made to the same "most recently written" memory. Please
> correct me if I'm wrong.

John_H is wrong. It needs M block RAMs plus another column used as the
LVT. The LVT column needs M write ports and N read ports, and its data
width is ceil(log2 M).

Each write port writes its data into its own block RAM and at the same
time writes its column number into the LVT column, so the LVT column
stores the column number of the latest write to each address.

On a read, the read port first reads the LVT column to get the column
number that holds the latest write data for the read address, then reads
that column (block RAM) into its read output register.

It is really a clever idea!!!

Weng
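The write/read procedure described above can be exercised against the
behavioural lvt_mem_2w2r sketch shown earlier in this thread. Both the
testbench and that module are illustrations only (hypothetical names, not
Eric's code); this just shows a read being steered to the bank of the
latest write.

// Tiny simulation example: two ports write, port 1 later overwrites
// address 3, and both read ports then fetch address 3 at once.
module lvt_mem_tb;
    reg         clk = 0;
    reg         we0 = 0, we1 = 0;
    reg  [9:0]  waddr0 = 0, waddr1 = 0, raddr0 = 0, raddr1 = 0;
    reg  [31:0] wdata0 = 0, wdata1 = 0;
    wire [31:0] rdata0, rdata1;

    lvt_mem_2w2r dut (
        .clk(clk),
        .we0(we0), .we1(we1),
        .waddr0(waddr0), .waddr1(waddr1),
        .wdata0(wdata0), .wdata1(wdata1),
        .raddr0(raddr0), .raddr1(raddr1),
        .rdata0(rdata0), .rdata1(rdata1)
    );

    always #5 clk = ~clk;

    initial begin
        // Both ports write to different addresses in the same cycle.
        @(negedge clk);
        we0 = 1; waddr0 = 10'd3; wdata0 = 32'hAAAA_0000;
        we1 = 1; waddr1 = 10'd7; wdata1 = 32'h0000_BBBB;
        @(negedge clk);
        // Port 1 then overwrites address 3, so the LVT now points read
        // traffic for address 3 at bank 1.
        we0 = 0;
        waddr1 = 10'd3; wdata1 = 32'h1234_5678;
        @(negedge clk);
        we1 = 0;
        // Both read ports hit address 3 in the same cycle.
        raddr0 = 10'd3; raddr1 = 10'd3;
        #1 $display("rdata0=%h rdata1=%h", rdata0, rdata1);
        // expected: rdata0=12345678 rdata1=12345678
        $finish;
    end
endmodule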
From: John_H on 25 Apr 2010 22:57

On Apr 25, 12:03 am, Weng Tianxiang <wtx...(a)gmail.com> wrote:
> Each write port writes its data into its own block RAM and at the same
> time writes its column number into the LVT column, so the LVT column
> stores the column number of the latest write to each address.
>
> On a read, the read port first reads the LVT column to get the column
> number that holds the latest write data for the read address, then
> reads that column (block RAM) into its read output register.

So what happens when the multiple reads all want to read from the same
memory, because that happens to be where the last value was written for
each of those reads?

As to the XOR, I don't have code to share; I developed it a while ago
for some asynchronous stuff and it applies well to multi-port writes.

As I try to put together a clearer explanation, I find I may have been
wrong about the memory count for the XOR approach, such that the LVT
would use fewer. I still believe the LVT approach requires M*N BlockRAMs
for an M-write, N-read multi-port memory, plus the LVT; I'm having
trouble remembering why I thought the "+1" was needed. The XOR approach
appears to need M*(M-1+N) memories.

If you have 3 write ports and 4 read ports, you'll need 3 sets of *6*
memories. The 6 memories in each "write bank" set all have the same
write address and write data, corresponding to that write port. When a
write occurs, the *read* value for that write address is accessed from
the other two write banks, so each set of 6 must provide a read address
for the other two write banks' addresses. The four reads have their own
memories for their own addresses in each of the write banks.

When the write occurs, the read values for that write address are
retrieved from the other write banks, and the XOR of those values along
with the new write data is written to all the memories in that write
bank's set of 6. When a read is performed, the read value within each
write bank is retrieved and the values (3 in this case) are XORed to get
the original data.

newDataWr0^oldDataWr1^oldDataWr2 overwrites oldDataWr0 for the write.

The later reads retrieve oldDataWr0, oldDataWr1, and oldDataWr2, but
since oldDataWr0 was updated to newDataWr0^oldDataWr1^oldDataWr2, the
XOR of the three read values is:

  oldDataWr0^oldDataWr1^oldDataWr2
    == (newDataWr0^oldDataWr1^oldDataWr2)^oldDataWr1^oldDataWr2
    == newDataWr0^(oldDataWr1^oldDataWr1)^(oldDataWr2^oldDataWr2)
    == newDataWr0^(0)^(0)
    == newDataWr0

So with a little coordination, the XOR approach requires M*(M-1+N)
BlockRAMs for an M write, N read port memory, along with the XORs.

The LVT needs a memory for each write port but requires multiples of
them to accommodate every read port in case the multiple reads for any
one cycle are all from the same write bank for the most recently
updated value.

Depending on the complexity of the LVT, the number of write ports, and
the allowable latencies, the LVT could be a more effective approach.
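Tallying the counts given above for the 3-write / 4-read example (a
restatement, using the formulas from this post and the LVT width from the
previous one):

\[
\underbrace{M(M-1+N)}_{\text{XOR}} = 3\,(3-1+4) = 18 \text{ BlockRAMs},
\qquad
\underbrace{M \cdot N}_{\text{LVT}} = 3 \cdot 4 = 12 \text{ BlockRAMs}
\;+\; \text{an LVT of } \lceil \log_2 3 \rceil = 2 \text{ bits per word.}
\]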
From: Weng Tianxiang on 26 Apr 2010 00:04

On Apr 25, 7:57 pm, John_H <newsgr...(a)johnhandwork.com> wrote:
> [snip]
>
> The LVT needs a memory for each write port but requires multiples of
> them to accommodate every read port in case the multiple reads for any
> one cycle are all from the same write bank for the most recently
> updated value.

I don't see any problem reading from any of the 9 block RAMs into one
read register with a select value, or into any number of read registers
with a select value.

Your XOR method is absolutely inferior to the LVT method, even though I
don't have full knowledge of your XOR method. The LVT method is very
clean, very fast, and easy to implement, and so flexible that even 2
block RAMs, each having 2 independent write ports and 2 read ports, can
easily be expanded into 4 write ports and 4 read ports.

Weng