Ticket #207 (closed enhancement: fixed)
Garbage collection within a replicated graph
Reported by: Bernhard Haumacher (haui at haumacher dot de)
An object gets included into a replicated object graph by storing a reference to that object in a non-transient instance variable of another object that is already part of the graph. In Java, objects become subject to garbage collection when they are no longer reachable by reference. But objects that are part of a replicated object graph are registered in an identifier table used for extracting modifications and applying patches to them. To enable garbage collection, this registration must not use strong references, which would prevent those objects from being collected even when they are no longer referenced, neither from within the replicated graph nor from outside it.
This feature does not target garbage collection of whole replicated objects.
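The registration described above can be sketched in Java as an identifier table that holds weak references. This is a minimal illustration, not the actual implementation; the names IdTable, register, and lookup are hypothetical.

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

// Hypothetical identifier table: maps replication identifiers to graph
// objects. Weak references let the VM collect an object once it is no
// longer reachable from the replica or from the application.
class IdTable {
    private final Map<Long, WeakReference<Object>> entries = new HashMap<>();

    void register(long id, Object obj) {
        entries.put(id, new WeakReference<>(obj));
    }

    // Returns null if the identifier is unknown or the object has been
    // garbage collected in the meantime (a "locally dead" object).
    Object lookup(long id) {
        WeakReference<Object> ref = entries.get(id);
        return ref == null ? null : ref.get();
    }
}
```

The null result of lookup is exactly the situation the later sections deal with: an identifier that is still known to other nodes but whose local object has died.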
There are three reasons why a local object that was attached to a replicated graph should become eligible for garbage collection:
- The object is locally unused because all references to it were removed from the local replica.
- After redistributing a partially replicated object, the object is no longer marked as required on the local node, or it was referenced only from objects that are themselves no longer required on the local node.
- Within a partially replicated object, the application may decide to insert a new object on a node where it is not ultimately required. This happens frequently when constructing a partially replicated object on one node and distributing it from there. The situation is similar to the one described above, but occurs even without explicitly redistributing a partially replicated object.
There are two options for enabling garbage collection of objects that are associated with replicated object graphs:
- Explicitly searching for objects that are no longer referenced from the replicated graph and removing them from the internal structures. After removal, they become eligible for the virtual machine's regular garbage collection, provided the application does not hold any further references to them.
- Cooperating with the local virtual machine's garbage collector by using special reference objects instead of directly referencing objects within the replicated graph.
Explicitly searching for unreferenced objects:
- Advantage: The additional overhead caused by creating reference objects and accessing parts of the replicated graph through an additional indirection is saved.
- Disadvantage: Searching for unreferenced objects introduces additional costs, and it is unclear at which intervals to start the search. In the end, one would re-implement a full-fledged GC algorithm for objects within replicated graphs.
- Disadvantage: Two-stage garbage collection (within a replicated graph and within the whole virtual machine) introduces problematic semantics. Because a replicated object is accessed directly, the application may keep references to arbitrary graph objects outside the replicated graph. When an object is detected to be detached from a replicated graph and is excluded from further updates, those references become dangling references to outdated copies. When such references are reinserted later, problems get worse, because the outdated copy and the original object coexist in the replicated graph.
Cooperation with the local garbage collector:
- Advantage: The virtual machine's garbage collector is reused for replicated object graphs.
- Advantage: Semantics are less problematic, because the application never operates on dangling references to outdated copies.
- Disadvantage: Overhead due to an additional indirection when inserting objects into the replicated graph and when searching for modifications.
- Disadvantage: Updates to objects can get lost in certain situations: an object that is associated with a replicated graph is modified and removed from all graph objects on one node within one update step. If there are no further references to that object, it may be garbage collected before the update mechanism can extract the changes. In the case of partial replication, corresponding objects on other nodes may still be reachable if they are referenced from another graph object that is distributed only to those nodes.
Cooperating with the local garbage collector has far fewer semantic problems and is less complex than re-implementing garbage collection for replicated object graphs. Which solution would perform better is unclear. Therefore, the cooperating approach should be chosen.
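The cooperating approach can be sketched with the standard java.lang.ref machinery: each graph object is held through a weak reference that remembers its replication identifier, and the VM enqueues that reference once the referent dies. The names GraphRef and drain are illustrative assumptions, not part of the actual implementation.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Set;

// A weak reference that carries the replication identifier of its
// referent. The VM enqueues it on the given queue once the referent has
// been garbage collected.
class GraphRef extends WeakReference<Object> {
    final long id;

    GraphRef(long id, Object referent, ReferenceQueue<Object> queue) {
        super(referent, queue);
        this.id = id;
    }

    // Drain the queue and retire the identifiers of locally dead objects
    // from the graph's bookkeeping.
    static void drain(ReferenceQueue<Object> queue, Set<Long> liveIds) {
        GraphRef r;
        while ((r = (GraphRef) queue.poll()) != null) {
            liveIds.remove(r.id);
        }
    }
}
```

Note that retiring an identifier this way is asynchronous: as the following sections show, other nodes cannot know when it happens, which is the root of the consistency problems discussed below.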
Cooperating with the local garbage collector does not solve all problems. In combination with partial replication (see ticket:171), the following situation may occur: Suppose the following partially replicated object graph, where R and B are distributed to nodes #0 and #1, but object A is only distributed to node #0 and object C is only distributed to node #1. Both A and C reference object D, which is implicitly distributed to all nodes where it is referenced.
#0:                 #1:
      [R]                 [R]
   +---+---+           +---+---+
   |   |   |           |   |   |
  [A] [B]  C           A  [B] [C]
   |                           |
   +--(D)                 (D)--+
The application now removes the reference A-D in the partition on node #0. The resulting update does not change the partition on node #1, because A is not distributed to that node.
#0:                 #1:
      [R]                 [R]
   +---+---+           +---+---+
   |   |   |           |   |   |
  [A] [B]  C           A  [B] [C]
                               |
      ?D?                 (D)--+
This operation removed the last reference to object D on node #0. This does not mean that D is immediately garbage collected, but it may become subject to garbage collection at any time in the future. As a consequence, node #0 cannot tell node #1 during the update that object D was removed from its local partition.
Now a new reference B-D is added on node #1, but before node #1 sends its update to node #0, the local garbage collector on node #0 deletes the local copy of object D. Since node #1 does not know that #0 no longer has a local copy of D, it sends only the new reference B-D within its update, not the whole object D.
#0:                 #1:
      [R]                 [R]
   +---+---+           +---+---+
   |   |   |           |   |   |
  [A] [B]  C           A  [B] [C]
       |                   |   |
       ?                  (D)--+
But on node #0, the target of the reference B-D no longer exists. When the object for D's identifier is retrieved, a null value is fetched. Therefore, the replicated graph is in an inconsistent state after applying the update from node #1.
An even worse situation could occur if node #1 additionally modified D by introducing a reference D-E to a new object E. The patch for object D could then not be applied, because there is no longer a local copy of D on node #0; the patch would have to be skipped. In consequence, the new object E, which is marshaled along with the patch for D, would not be installed on node #0.
When objects are referenced through "special" references, they cannot be prevented from being garbage collected at an arbitrary point in time. Therefore, no node can rely on its local information about whether some object is still available on another node.
To solve those problems, a node must be able to revive a locally dead object. The application cannot distinguish the revived object from the dead one, because it cannot hold a reference to a dead and garbage collected object.
There are several costly and nearly impractical options for reviving locally dead objects:
- Deferring reference updates to locally dead objects and skipping patches for them. Whenever a reference to a locally dead object is encountered, the corresponding instance variable that should receive the update could be remembered for recovery, and patches that should be applied to locally dead objects could be skipped. During recovery, all locally dead objects that were referenced within a patch could be requested from the sender of the patch.
This requires a major modification of object serialization, because the instance variable that must be fixed later has to be remembered. In Java, this can only be solved with reflection.
When a patch for a locally dead object is received, it must be skipped without knowing the type of the object the patch should have been applied to. This requires the patch format to be traversable without knowing the type of its target object, which in turn requires additional (type or size) information to be transmitted along with each patch record.
Skipping a patch may introduce downstream faults: if the whole patch is skipped and a new object is included within that patch, referencing this object from a later patch record will also fail. This causes more recoveries than absolutely necessary.
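One way such a traversable patch format could look is a simple length-prefixed record layout, so a receiver can skip a record without knowing its target's type. PatchIO and the record layout below are illustrative assumptions, not the ticket's actual wire format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical patch record layout: [target id: long][body length: int][body].
// The explicit length makes a record skippable when its target is dead.
class PatchIO {
    static void writeRecord(DataOutputStream out, long targetId, byte[] body) {
        try {
            out.writeLong(targetId);
            out.writeInt(body.length); // size prefix makes the record skippable
            out.write(body);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static long readTargetId(DataInputStream in) {
        try {
            return in.readLong();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Skip the rest of a record whose target turned out to be locally dead.
    static void skipRecord(DataInputStream in) {
        try {
            int len = in.readInt();
            in.skipBytes(len);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The size prefix is exactly the "additional (type or size) information" mentioned above; it costs four bytes per record but keeps the patch stream traversable without deserializing skipped bodies.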
- Remembering the classes of objects beyond their death and reallocating instances just in time. Whenever an object is registered with a weak reference within the administrative structures of a replicated graph, its class object could be remembered along with the reference. When a locally dead object is referenced within a patch, a plain uninitialized instance of this class could be created and used as the new referent. After the patch has been applied completely, the complete state of all resurrected objects could be requested from the sender and applied as a secondary patch.
The ability to create an uninitialized instance of an object for later initialization must be included in the "patchable" interface. Such plain object creation is only possible for non-immutable classes. An object of type String, in contrast, must be completely initialized during construction; therefore, references to locally dead string objects could not be resurrected with this approach. Array objects require the length of the array at the time the array is (re-)constructed. More generally: to (re-)construct an object, all its final fields are required (and must be remembered beyond its death).
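The class-remembering option might be sketched as follows. Slot is a hypothetical name, and the reflective no-argument construction merely stands in for the plain-object creation that the "patchable" interface would have to provide; as noted above, it fails for immutable classes and arrays.

```java
import java.lang.ref.WeakReference;

// Hypothetical entry in the graph's administrative structures: next to
// the weak reference, the object's class is remembered beyond its death.
class Slot {
    final WeakReference<Object> ref;
    final Class<?> type;

    Slot(Object obj) {
        this.ref = new WeakReference<>(obj);
        this.type = obj.getClass();
    }

    // Return the live object, or reallocate an uninitialized shell whose
    // state must be patched afterwards (e.g. requested from the sender).
    Object getOrRevive() {
        Object o = ref.get();
        if (o != null) {
            return o;
        }
        try {
            // Stand-in for "plain object creation"; only works for classes
            // with an accessible no-argument constructor.
            return type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("class cannot be revived", e);
        }
    }
}
```

A revived shell carries none of the original state, which is why a secondary patch from the sender is required before the object may be used.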
Fortunately, there is a trivial solution to the reconstruction problem: for each object within a replicated graph, there is a backup copy that is required for finding differences during the update process. (Note: the approach described in the following may cause inconsistencies and unwanted updates to replicated objects; see ticket:181 for a resolution.) This backup copy can be used to apply a patch in case a patch for a locally dead object is received, and it can be used to revive the original object in case a reference to a locally dead object is received within a patch.
Problem: Detached objects
The local garbage collector of a virtual machine can only detect that an object is no longer referenced within the whole VM. It cannot detect that an object was detached from a replicated graph by removing all references to it from the graph while keeping a reference outside the graph. This problem gets even worse with partial replication (see ticket:171): here, the application might detach an object from the graph on one node while keeping it connected to the graph on other nodes.
Solution: Detached objects
Once an object is associated with a replicated graph, it stays associated until its death. An object may die locally while copies stay alive on other nodes. An object that is locally dead may even revive if a reference to a corresponding object is imported into the local part of the replicated graph through an update.
Problem: Updates for locally dead objects
An object may die at any time, even after an update of the replicated graph has been constructed locally and sent to other nodes. If a corresponding object of a locally dead object is modified on another node, the resulting update cannot be applied to the node with the locally dead object.
Solution: Updates for locally dead objects
Note: The solution as described in the following is broken (see ticket:181 for details). An update to a locally dead object is applied to its backup copy, which is still alive, until the death of the object is detected during the next active update. The backup copy can also be used to revive a locally dead object in case a reference to it is imported through an update.
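The backup-copy mechanism might be sketched like this; Entry and resolve are illustrative names, and the real bookkeeping (diffing, patch application) is omitted.

```java
import java.lang.ref.WeakReference;

// Hypothetical entry for a replicated object: a weak reference to the
// live object plus a strong reference to the backup copy used for
// diffing. The backup survives the local death of the object.
class Entry<T> {
    WeakReference<T> live;
    final T backup; // strongly referenced; receives updates after local death

    Entry(T obj, T backupCopy) {
        this.live = new WeakReference<>(obj);
        this.backup = backupCopy;
    }

    // Resolve the live object, reviving it from the backup copy if the
    // local copy has already been garbage collected.
    T resolve() {
        T o = live.get();
        if (o == null) {
            o = backup;                     // revive from the backup copy
            live = new WeakReference<>(o);  // re-register the revived object
        }
        return o;
    }
}
```

Handing out the backup as the revived object is precisely what ticket:181 identifies as broken: the backup may lag behind the state other nodes expect, so this sketch reproduces the flaw described above rather than fixing it.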
Implemented since 1.08b.
- Severity changed from normal to trivial
- Component changed from uka.karo to JP trafo
- Priority changed from normal to highest
- Version set to 1.05d
- Milestone changed from 2.0 to 1.05f
- Type changed from enhancement to defect
- Severity changed from trivial to minor
- Component changed from JP trafo to uka.karo
- Priority changed from highest to low
- Version changed from 1.05d to 1.04b
- Milestone changed from 1.05f to 1.03g
- Type changed from defect to task
- Severity changed from minor to critical
- Component changed from uka.karo to uka.lang
- Priority changed from low to lowest
- Version changed from 1.04b to 1.04e
- Milestone changed from 1.03g to 1.03k
- Type changed from task to defect