From: Kaz Kylheku on 18 Dec 2009 16:21

On 2009-12-16, Tamas K Papp <tkpapp(a)gmail.com> wrote:
> On Wed, 16 Dec 2009 18:04:14 +0000, Kaz Kylheku wrote:
>
>> Without cycles, you can't express graph structures, so an entire branch
>> of computer science goes out the window. Many useful algorithms
>
> I disagree. Graphs can be represented, for example, in a matrix
> (a[i,j] is nonzero iff i->j, etc). You can use this to reconstruct a
> lookup table, basically emulating pointers.
>
> Which would, of course, be a pretty pointless thing to do -- but
> the OP's questions have been pretty pointless so far, so maybe this
> would be a harmonious outcome :-)

If we have a graph structure, it means that given some node object N, we
can navigate to some adjacent object. For instance, if the edges are
numbered from zero, we can call a function like (neighbor N 3) to obtain
the fourth neighbor of N.

In order to be able to do this with the matrix representation, the object
N has to have a backpointer to the matrix. And that creates the cyclical
reference at the storage level: N points to the array, and the array
contains N somewhere.

The only way you can eliminate the circularity is if the nodes do not
know who their neighbors are (do not have a backreference to the matrix).
But that's a different kind of graph then. I don't want to be constrained
into working only with that kind of graph, just because my memory
management is dumb.
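To make the backpointer concrete, here is a minimal Common Lisp sketch;
the GRAPH and NODE structures and the NEIGHBOR function are hypothetical
names made up for illustration, not taken from any library:

  ;; Hypothetical sketch: a graph stored as an adjacency matrix, with
  ;; nodes keeping a backpointer to the graph so (neighbor n k) can work.

  (defstruct graph
    matrix     ; adjacency matrix: (aref matrix i j) nonzero iff i -> j
    nodes)     ; vector of node objects, indexed by row number

  (defstruct node
    index      ; this node's row/column in the matrix
    graph)     ; backpointer: the node refers to the graph, and the
               ; graph's NODES vector refers back to the node -- the
               ; storage-level cycle described above

  (defun neighbor (n k)
    "Return node N's K-th neighbor (counting from zero), or NIL."
    (let* ((g (node-graph n))
           (m (graph-matrix g))
           (i (node-index n))
           (seen -1))
      (dotimes (j (array-dimension m 1))
        (unless (zerop (aref m i j))
          (when (= (incf seen) k)
            (return-from neighbor (aref (graph-nodes g) j)))))))

A refcounter never sees either the node or the graph drop to zero while
the other is still reachable, which is exactly the circularity at issue.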
From: Kaz Kylheku on 18 Dec 2009 16:27

On 2009-12-16, George Neuner <gneuner2(a)comcast.net> wrote:
> On Wed, 16 Dec 2009 18:04:14 +0000 (UTC), Kaz Kylheku
> <kkylheku(a)gmail.com> wrote:
>
>> Refcounting can't be deferred,
>
> There are lazy refcounters which defer recycling until the memory is

But do they defer the counting? The actual bumping up and down of
refcounts? That's a cost.

My point was that in short-lived programs, GC has zero cost, because it
never happens, and the program code itself contains vanishingly few
traces of anything that caters to the task of memory management.

Refcounting programs have to always be refcounting. The code to do so is
compiled in. If you want a refcounted program to have zero overhead in
cases where it is expected to be short lived, you have to compile a
separate image of that program, in which the inline refcounting
instructions are removed.
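As a hedged sketch of what "compiled in" means here, the RC-SETF,
RC-RETAIN and RC-RELEASE names below are invented for illustration; the
point is only that every counted assignment carries an increment and a
decrement, whether or not the program lives long enough for reclamation
to matter:

  ;; Hypothetical sketch of compiled-in refcounting bookkeeping.

  (defstruct counted
    (refcount 1)
    payload)

  (defun rc-retain (obj)
    (when obj (incf (counted-refcount obj)))
    obj)

  (defun rc-release (obj)
    (when (and obj (zerop (decf (counted-refcount obj))))
      ;; reclaim immediately; a real system would also release whatever
      ;; PAYLOAD refers to
      (setf (counted-payload obj) nil)))

  (defmacro rc-setf (place new-value)
    "Assign NEW-VALUE to PLACE, adjusting the refcounts of the old and
  new values.  This bookkeeping is emitted at every such assignment."
    (let ((new (gensym "NEW")) (old (gensym "OLD")))
      `(let ((,new (rc-retain ,new-value))
             (,old ,place))
         (setf ,place ,new)
         (rc-release ,old)
         ,new)))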
From: Kaz Kylheku on 18 Dec 2009 17:22

On 2009-12-16, Pillsy <pillsbury(a)gmail.com> wrote:
> On Dec 16, 1:04 pm, Kaz Kylheku <kkylh...(a)gmail.com> wrote:
>
>> On 2009-12-16, Bennu Strelitzia <bennu.strelit...(a)gmail.com> wrote:
>
>> > i.e. reference counting could do the trick for an efficient minimal
>> > implementation if there were no referential cycles, I hate the real
>
>> Reference counting is not efficient. Except in some highly contrived
>> benchmarks that don't reflect real-world use, it is slower than garbage
>> collection. Even simple mark-and-sweep can beat reference counting in
>> realistic situations.
>
> The following is a very naive observation.
>
> The one thing it seems like reference counting could allow you to do
> is immediate recycling of space. For example, if I want to sum two
> vectors, using
>
> (map 'vector #'+ a-big-vector another-big-vector)
>
> I'll get a third big vector that has to be consed. If I know that
> A-BIG-VECTOR isn't being used by anything else, I can recycle it using
> MAP-INTO instead, but in general I have no idea. However, if there
> were a reference count on those big vectors, the implementation could
> make the decision for me at run-time. I'm primarily thinking of using
> reference counting as a way of augmenting garbage collection, not as a
> replacement for it; indeed, you could limit it to only work with some
> data types or objects, like big simple arrays of floats or characters.
>
> There's no way this is a new idea, and it's entirely possible I'm
> overlooking some trade-offs that keep it from being worthwhile.

Your idea depends on the assumption that ``it is a big deal that a large
vector has been allocated for temporary use and is now garbage.''

Why is that a big deal? A larger vector is no more difficult to hunt down
and reclaim than a small one.

One way vectors can be implemented is as small control records, which
themselves live in a garbage-collected heap, while the vector data lives
in an ordinary heap that isn't garbage-collected. So then it's a large
/number/ of vectors that contributes to GC times, rather than large
vector size. Under this representation, allocating a vector contributes
only a small record to the garbage-collected heap, which must be
traversed in preparation for marking, and then traversed again to sweep
it for unreachable objects.

So the remaining reason why we might be concerned about a large vector
being garbage is that we are holding on to a large amount of memory
longer than necessary, preventing it from being used elsewhere.

This can be solved without reference counting. Simply provide an
operation which lets the program explicitly shrink the vector. If a ten
million element vector is reduced to size zero (or to a tiny, nonzero
size, like 1), the memory can be reclaimed instantly.

A shrunk vector is better than one which is prematurely deleted, because
it isn't garbage. It's safe. The worst that can happen is that some code
thinks that the vector is larger than it really is; but that only results
in out-of-bounds array references, which can be checked. We can disallow
direct interior pointers into vector data, such that everything goes
through indexing which is checked against the size (even for displaced-to
arrays).
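A hedged sketch of the shrinking idea, using the standard ADJUST-ARRAY on
adjustable vectors; whether a given implementation hands the storage back
to the system right away is implementation-dependent, but the shrunk
vectors remain safe to reference:

  ;; Sketch of "shrink instead of delete".  SUM-VECTORS is just an
  ;; example producer; the interesting part is shrinking the inputs
  ;; once they are no longer needed.

  (defun sum-vectors (a b)
    (map 'vector #'+ a b))

  (defun demo ()
    (let* ((big-a (make-array 1000000 :initial-element 1 :adjustable t))
           (big-b (make-array 1000000 :initial-element 2 :adjustable t))
           (result (sum-vectors big-a big-b)))
      ;; The inputs are garbage-to-be: shrink them to size zero.  A stale
      ;; reference that still indexes into them gets a bounds error
      ;; rather than reading reclaimed memory.
      (adjust-array big-a 0)
      (adjust-array big-b 0)
      (length result)))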
From: Kaz Kylheku on 18 Dec 2009 17:58

On 2009-12-17, George Neuner <gneuner2(a)comcast.net> wrote:
> On Wed, 16 Dec 2009 11:29:01 -0800 (PST), Pillsy <pillsbury(a)gmail.com>
> wrote:
>
>> On Dec 16, 1:04 pm, Kaz Kylheku <kkylh...(a)gmail.com> wrote:
>>
>>> On 2009-12-16, Bennu Strelitzia <bennu.strelit...(a)gmail.com> wrote:
>>
>>> > i.e. reference counting could do the trick for an efficient minimal
>>> > implementation if there were no referential cycles, I hate the real
>>
>>> Reference counting is not efficient. Except in some highly contrived
>>> benchmarks that don't reflect real-world use, it is slower than garbage
>>> collection. Even simple mark-and-sweep can beat reference counting in
>>> realistic situations.
>>
>> The following is a very naive observation.
>>
>> The one thing it seems like reference counting could allow you to do
>> is immediate recycling of space. For example, if I want to sum two
>> vectors, using
>>
>> (map 'vector #'+ a-big-vector another-big-vector)
>>
>> I'll get a third big vector that has to be consed. If I know that
>> A-BIG-VECTOR isn't being used by anything else, I can recycle it using
>> MAP-INTO instead, but in general I have no idea. However, if there
>> were a reference count on those big vectors, the implementation could
>> make the decision for me at run-time. I'm primarily thinking of using
>> reference counting as a way of augmenting garbage collection, not as a
>> replacement for it; indeed, you could limit it to only work with some
>> data types or objects, like big simple arrays of floats or characters.
>>
>> There's no way this is a new idea, and it's entirely possible I'm
>> overlooking some trade-offs that keep it from being worthwhile.
>
> Refcounters (even lazy ones) eventually have to scan all the garbage
> blocks.

That's right.

Suppose a pointer P is the program's last reference to a huge DAG of
millions of objects rooted at an object R. When the program, through
pointer P, drops its last reference to R, the entire DAG becomes garbage.
What this means is that each object in the DAG is visited, and its
refcount is dropped.

So this action may cause a drastic pause in the execution of the program:
the very kind of pause that garbage collection is accused of, and which
refcounting is supposed to eliminate.
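A minimal sketch of that cascade; the DAG-NODE structure and
DROP-REFERENCE are hypothetical names, chosen only to illustrate that the
"free" is proportional to the size of the dead DAG:

  ;; Hypothetical sketch: dropping the last reference to the root of a
  ;; DAG under refcounting.

  (defstruct dag-node
    (refcount 1)
    (children '()))

  (defun drop-reference (node)
    (when (zerop (decf (dag-node-refcount node)))
      ;; NODE just became garbage; each child loses one reference, and
      ;; the cascade can run through millions of objects before control
      ;; returns to the program.
      (dolist (child (dag-node-children node))
        (drop-reference child))))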
From: Kaz Kylheku on 18 Dec 2009 18:25

On 2009-12-16, Rahul Jain <rjain(a)nyct.net> wrote:
> Kaz Kylheku <kkylheku(a)gmail.com> writes:
>
>> Every time an object's refcount is modified, an entire cache line
>> becomes dirty and has to be flushed to main memory. When another
>> processor in the system wants to access anything which lands in that
>> cache line, having noticed that it's dirty, it has to fetch the latest
>> version of that cache line from main memory. This means that two or
>> more processors can't work with the same refcounted object in a way
>> that is scalable, if at all they touch the refcounts---even if they are
>> treating the objects as read-only! As the refcounts are bumped, you
>> have performance-wasting cache ping-pong, and this can happen even with
>> separate objects (false sharing) if they land wholly or partially into
>> the same cache line. Refcounts have to be updated in a thread-safe way,
>> using expensive atomic operations that can cause stalls of thousands of
>> cycles. The consequence is that a refcounted program simply will not
>> scale well to a multiprocessor system.
>
> Thanks, Kaz. I hadn't even thought of that issue. It's a huge one for
> modern programs!

Here is another one.

The most naive garbage collection has a sweep phase. But let me present
an argument that this sweep phase is better than refcounting.

Whenever refcounting reclaims an object, it is because a refcount has
reached zero. These zero refcounts are discovered dynamically by chasing
the references among objects: essentially, the order in which objects are
discovered to be unreachable is random. What's worse, sometimes
refcounted objects are shared. An unreachable object may have a refcount
that is greater than one, and has to be visited that many times before
its count drops to zero. These visits may be spread apart such that the
object is evacuated from the cache and fetched again.

Under naive mark-and-sweep GC, objects are discovered to be unreachable
in a nice, predictable order: they are visited by marching a pointer
through the heap (often a heap of identically-sized objects, like conses,
or the header records of vectors, etc). Each object is visited once. It
is discovered that it was not marked with the ``reachable'' flag, and so
it is updated in some way to recycle it. (A small sketch of such a sweep
appears at the end of this post.)

Marching through memory in a linear way is something that processors can
do fast.

Firstly, we avoid fetching any of the data into the L1 cache more than
once, since we are making one pass.

Secondly, the access pattern is such that it is susceptible to speedup by
prefetching. Some modern processors have L2-to-L1 prefetch, which is
similar to prefetching the blocks of a file under the assumption that it
is being sequentially accessed. This means that while you are working
with one cache line full of data, the processor can be generating the
cache refill for the next one.

Thirdly, on some processors there are cache hint instructions: you can
tell the processor that you expect to be accessing a particular address.
If you're sweeping through objects in a linear heap, you know the address
of any number of objects in advance by simple pointer arithmetic. Using
cache hints, we can tell the processor to load several cache lines ahead.
If you're traversing objects under refcounting, you can generate a cache
hint only for the immediately reachable objects. To get at any objects
which are two hops away, you have to load the first-hop objects into the
cache and check their type, so you know where their references are.
Fourthly, small objects are packed several to a cache line. If your
processor has 64-byte L1 cache lines and you have 8-byte conses, you get
8 conses to a cache line. It's much faster to access several objects in
the same cache line than several objects scattered all over the place.
This is similar to traversing a disk-based database in record-bucket
order, compared to fetching records in random order.

Fifthly, DRAM memories support burst-mode access, which works only for
consecutive addresses. The memory bandwidth of linear access is much
higher than that of random access. Burst mode allows fast main-memory to
L2-cache fills. Again, this works in your favor if you're linearly
scanning. We have had decently high-resolution displays with decent
refresh rates even in the dark ages of slow memory with 1000+ nanosecond
access times. High-resolution video memories have for decades
demonstrated that predictable, linear access to memory can be trivially
optimized with a bit of parallelism, and that high bandwidth can be
achieved even though the latency of random accesses is high.

Summary: visiting dead objects in random order with refcounting: bad.
Dumb sweep through dead objects in address order: better.
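Here is the promised sketch of such a sweep, over a made-up heap whose
liveness is tracked in a single bit vector (one bit per fixed-size object
slot, 1 meaning the mark phase reached it); the names and layout are
assumptions for illustration, and the point is only the access pattern: a
single forward march in address order.

  (defun sweep (mark-bits)
    "Walk the heap in address order: collect the indices of unmarked
  (dead) objects into a free list, and clear the marks of live objects
  for the next collection cycle."
    (let ((free '()))
      (dotimes (i (length mark-bits) (nreverse free))
        (if (zerop (sbit mark-bits i))
            (push i free)                   ; dead: recycle this slot
            (setf (sbit mark-bits i) 0))))) ; live: unmark for next time

  ;; Example:
  ;; (sweep (make-array 5 :element-type 'bit
  ;;                      :initial-contents '(1 0 1 1 0)))
  ;; => (1 4)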