Event Sourced Content Repository

Tech

Sebastian Kurfürst, 25.09.2017

I am currently on the way back from Kiel, where we've been doing a week-long sprint for the event sourced Content Repository of Neos. It was quite an amazing week – really intense, really long days and nights, and a mind-bending concept. With this post, I am trying to explain our ideas, as a way to document what we want to achieve and also to help me explain the concepts in a better way. Many of the things outlined below are part of our prototype implementation already, though it is not yet clear to me how many unknown issues we still have – practice will tell us!

Don't worry if you do not understand everything written below; as a user of the Content Repository you will mostly not need to deal with these things. We'll still provide the familiar NodeInterface as a compatibility layer, so not much will change for you.

The text below is also meant as a reference and a "Gedankenstütze" for myself and the others involved in the workshop.

Event Sourcing in the Content Repository (CR) - Basic Idea

For a basic explanation of CQRS and event sourcing, see this post. It explains the concepts of the Event Stream, Projections, Soft and Hard Constraints, and Eventual Consistency vs. Immediate Consistency.

The basic idea of the event sourced CR is that we want to store all modifications which have been made in the event stream. This enables features such as history and rollback, audit logging and conflict detection on publish, and should give us far better performance and scalability than our current Content Repository. We want to exploit the fact that 95% or more of all CR access operations only read from the CR and do not write to it. Thus, we want reading to be really fast. In order to achieve that, we will pre-calculate data when a write occurs, ensuring that very few operations need to be done on read. Through the use of projections, we can build up read models in a (somewhat) denormalized way, optimizing for the common access patterns.

The following example shows a high-level overview of what happens when the user triggers the creation of a node:

The CreateNodeCommand encapsulates the user's intent to create a node. It contains information where the node shall be created, what the required values are, what the node type is, etc.
Then, we do some soft constraint checks (like checking whether the parent node exists). We can check this by querying the Graph Projection; however, we must be aware that there are cases where the soft constraint cannot be enforced (if concurrent requests race with each other).
After the soft constraint checks have completed, one or more NodeWasCreatedEvents are created from the command, e.g. taking auto-created child nodes into account.
At the time the Event(s) are stored and persisted in the Event Store, they have actually happened and are part of the system's history and contribute to the system's state.
We then update the projections. The most important projection is the graph projection, which is an efficient way to store the content tree in different dimensions and workspaces. Despite the name, the Graph Projection "lives" by default in a few database tables.
There might be other projections, like a search index projection; which are updated after the events are stored in the Event Store.

When replaying the events from the event store, it must be guaranteed that the projection will end up with exactly the same state. This means the projection is not allowed to e.g. read configuration from Node Types or other settings; but it is only allowed to consume events.
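The flow described above can be sketched in a few lines of Python. This is only an illustration and not the actual Neos code: apart from CreateNodeCommand and NodeWasCreatedEvent, all class, field and function names are made up, and the event store is just an in-memory list.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class CreateNodeCommand:
    node_id: str
    parent_id: Optional[str]
    node_type: str


@dataclass(frozen=True)
class NodeWasCreatedEvent:
    node_id: str
    parent_id: Optional[str]
    node_type: str


class GraphProjection:
    """Read model (hypothetical): node -> parent, built only from events."""

    def __init__(self) -> None:
        self.parent_of: Dict[str, Optional[str]] = {}

    def apply(self, event: NodeWasCreatedEvent) -> None:
        self.parent_of[event.node_id] = event.parent_id


@dataclass
class ContentRepository:
    event_store: List[NodeWasCreatedEvent] = field(default_factory=list)
    graph: GraphProjection = field(default_factory=GraphProjection)

    def handle(self, command: CreateNodeCommand) -> None:
        # Soft constraint: the parent must exist in the projection.
        parent = command.parent_id
        if parent is not None and parent not in self.graph.parent_of:
            raise ValueError(f"parent {parent!r} does not exist")
        event = NodeWasCreatedEvent(command.node_id, parent, command.node_type)
        self.event_store.append(event)  # from now on, the event has happened
        self.graph.apply(event)         # projections are updated afterwards


def replay(events: List[NodeWasCreatedEvent]) -> GraphProjection:
    """Rebuild the projection from scratch, consuming nothing but events."""
    projection = GraphProjection()
    for event in events:
        projection.apply(event)
    return projection
```

Because the projection reads nothing except the events, replay(cr.event_store) ends up in the same state as the projection that was updated incrementally.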

Content Streams a.k.a. Workspaces

Neos has the concept of a "workspace": If the user changes content, he only changes the content in his personal workspace. After he has completed all his changes, he can publish the workspace to the live website. This can be roughly compared to branching in Git and merging back. You can also build review workflows with workspaces, e.g. by publishing from the user's workspace to a review workspace, and then from the review workspace to the live workspace.

Today, when the user logs in for the first time, his user workspace is still empty and he will fully see the contents of the live website "shining through" – I always think of a fully transparent layer in Photoshop, which you can look straight through. When the user starts modifying content, the user workspace contains the modified Nodes, effectively hiding the corresponding nodes in the live workspace.

While this intuitively seems very useful and sensible, there are some major issues: First, the "shine-through" logic needs to be implemented at runtime (which is slow). Second, the shine-through is very hard to implement in all cases, especially if you want to support moving of nodes combined with multiple content dimensions (e.g. translations). Furthermore, there was no good way to detect merge conflicts, let alone resolve them; whoever published last always won.

We need to divide the problem into smaller sub-problems!

In the Event Sourced Content Repository (displayed on the right in the image above), every workspace conceptually has its own event store. However, as the user-sebastian workspace "branched off" (or "forked") at a certain time from the live workspace, we do not need to copy all events into their own store; we can simply remember where we branched off.

Then, conceptually there are different projections for the live content tree and the user-sebastian workspace.

As long as nobody has modified the live workspace during our changes, we can directly "merge" back to the live workspace – as there cannot be any conflicts. However, if some other change went live in the meantime, we cannot be sure that the changes from our workspace and the live workspace won't interfere or conflict – that's why merges are forbidden if something changed in the base workspace in the meantime.
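To make the "remember where we branched off" idea and the merge rule a bit more tangible, here is a minimal Python sketch. It is an assumption-laden illustration, not the prototype: events are plain strings, and ContentStream, fork, can_merge and merge are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ContentStream:
    name: str
    own_events: List[str] = field(default_factory=list)
    base: Optional["ContentStream"] = None
    base_version: int = 0  # how many base events existed when we forked

    def fork(self, name: str) -> "ContentStream":
        # No events are copied; we only remember where we branched off.
        return ContentStream(name, base=self, base_version=len(self.events()))

    def events(self) -> List[str]:
        inherited = self.base.events()[: self.base_version] if self.base else []
        return inherited + self.own_events

    def can_merge(self) -> bool:
        # Merging is only allowed if the base has not changed since the fork.
        return self.base is not None and len(self.base.events()) == self.base_version

    def merge(self) -> None:
        if not self.can_merge():
            raise RuntimeError("base changed since the fork; rebase first")
        self.base.own_events.extend(self.own_events)
        self.base_version = len(self.base.events())
        self.own_events = []
```

A stream forked from live keeps seeing live as it was at fork time, even if live moves on; it simply can no longer be merged directly once that has happened.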

Conflict Resolution: Rebase

Of course, we need a way to detect and fix potential conflicts; and to finally be able to merge. Our idea is the following: Instead of having some "black-box" merge mechanism, we want to support something like a "rebase": This way, we can re-use the soft constraints for conflict and consistency detection. We want to automatically rebase in the background, while the user is continuing to edit. To fully support this, we are introducing the concept of a "content stream", which is an additional indirection between workspace and the data inside the workspace.

Every workspace which contains some changes has a "current" content stream assigned, which is the one the user currently sees and pushes his changes to. While a rebase is in progress, a "next" content stream is assigned to the workspace as well. Once the rebase has succeeded, the "next" content stream becomes the "current" one, and the old one is removed.
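The interplay of "current" and "next" content stream during a rebase could look roughly like the following Python sketch. Everything here is hypothetical (the Workspace class, its field names, events as strings, the shape of the soft constraint check); it only illustrates that the user keeps the current stream until the rebase has gone through.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Event = str
# Soft constraint check: may `event` be applied on top of the given events?
ConstraintCheck = Callable[[List[Event], Event], bool]


@dataclass
class Workspace:
    name: str
    current_stream: List[Event]               # what the user sees and edits
    next_stream: Optional[List[Event]] = None  # only set while rebasing

    def rebase(self, base: List[Event], own: List[Event], check: ConstraintCheck) -> bool:
        """Re-apply the workspace's own events on top of the latest base."""
        self.next_stream = list(base)
        for event in own:
            if not check(self.next_stream, event):
                self.next_stream = None  # conflict: keep the current stream
                return False
            self.next_stream.append(event)
        # Success: "next" becomes "current"; the old stream is dropped.
        self.current_stream, self.next_stream = self.next_stream, None
        return True
```

The same check that guards normal commands is re-used here, which is the point of preferring a rebase over a black-box merge.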

The concept of a content stream is also relevant for optimizing performance: we only need to create a content stream once a user starts editing, whereas a workspace exists for every user in the system.

Nodes, Dimensions and Node Aggregates

The Content Repository so far is centered around the concept of "Nodes", which we also have in the event sourced CR; albeit with some subtle, but important differences. The image below shows the differences and namings today and in the future event sourced version of the CR. We have good reasons for this change, which I can hopefully explain further down in the post. I'm starting with this comparison image to get the terminology straight. Below, I'll always use the new terminology.

Dimensions as the way to model content variants

We now need to introduce the concept of Dimensions before we get back to the details of the CR. The problem at hand is that you sometimes have another representation of the same information. Very often you translate your website into multiple languages, or want to change the content depending on the user's location. This is generalized into the concept of dimensions. In Neos, it is possible to have not just one axis of these content variants, but several of them (e.g. language and country). With the new CR, we are formalizing this model some more:

A Dimension represents a single axis of content variation. It specifies what the allowed discrete Dimension Values for this dimension are. As an example, the dimension "Language" might have the values "de" (indicating German content), and "en" (indicating English content).
All dimensions which are configured form the Dimension Space, which is n-dimensional (with n being the number of dimensions).
A Dimension Space Point is a single point in the Dimension Space. It is characterized by an n-tuple (i.e. a tuple with n elements); where the first tuple element must be one of the first dimension values, etc.
Thus, the Dimension Space for three dimensions contains x * y * z Dimension Space Points, with x being the number of Dimension Values in the 1st Dimension, y being the number of Dimension Values in the 2nd Dimension, and z being the number of Dimension Values in the 3rd dimension.
Not all of these Dimension Space Points make sense for a particular project (e.g. the combinations "Language fr, country fr" and "Language de, country ch" might be allowed, but "Language de, country fr" might not make sense). That's why we allow restricting the dimension space further using some constraints. The result of applying these constraints to the Dimension Space is the Allowed Dimension Subspace (which is a subset of the Dimension Space). In the example below, the Allowed Dimension Subspace is depicted by the circle outlines.
Using these constraints only makes sense for 2 or more dimensions. For a 1-dimensional space, you can just remove the unneeded dimension value.
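In code, the Dimension Space is simply the Cartesian product of all Dimension Values, and the Allowed Dimension Subspace is a filtered version of it. The following Python sketch shows this; the function names and the way a constraint is expressed (a plain predicate) are assumptions for illustration only.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

DimensionSpacePoint = Tuple[str, ...]


def dimension_space(dimensions: Dict[str, List[str]]) -> List[DimensionSpacePoint]:
    """All n-tuples with one value per dimension, in configuration order."""
    return list(product(*dimensions.values()))


def allowed_subspace(
    points: List[DimensionSpacePoint],
    is_allowed: Callable[[DimensionSpacePoint], bool],
) -> List[DimensionSpacePoint]:
    """Apply a constraint (here: any predicate) to the full Dimension Space."""
    return [point for point in points if is_allowed(point)]


# Hypothetical configuration: 2 languages x 3 countries = 6 points.
dimensions = {"language": ["de", "fr"], "country": ["de", "ch", "fr"]}
space = dimension_space(dimensions)

# Hypothetical constraint: only these combinations make sense.
sensible = {("de", "de"), ("de", "ch"), ("fr", "fr")}
allowed = allowed_subspace(space, lambda point: point in sensible)
```

With two languages and three countries the full space has 2 * 3 = 6 points, of which only three survive the constraint in this made-up configuration.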

It is important to remember that from a user's perspective, the bigger the dimension space is, the more content the editor has to produce. Thus, we think that in many real-world projects, the Allowed Dimension Subspace will not be huge. However, the CR is designed in a way to also support large dimension spaces.

Dimension Fallbacks

In the 2-dimensional-space example above, every piece of content would need to be created four times, which means a lot of work for editors. In the above example, the Swiss version of the website and the German one are quite similar; the editors might only want to change certain header images and some intro texts on landing pages, but otherwise keep the German contents. Just copying the German content is not feasible, as this would need to be done for every future change on the German website: Over time, the Swiss and German variants would drift further and further apart.

That's why we support so-called dimension fallback rules (which are illustrated as orange inheritance-arrows in the image above); i.e. you can specify things like "If I did not specialize a content in Swiss-German (ch), I should use the German (de) version".

Across a single dimension, the fallback rules form the intra-dimensional fallback graph (orange above). Starting from these, we can calculate the inter-dimensional fallback graph (displayed in green above), which is the fallback of all Dimension Space Points.
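One possible way to derive fallbacks between Dimension Space Points from the per-dimension rules is sketched below in Python. Note the assumption: here a point falls back to another one if they differ in exactly one dimension and the value in that dimension follows a fallback rule; the text above does not spell out the exact algorithm, and the dimension values used in the example are made up.

```python
from typing import Dict, List, Set, Tuple

Point = Tuple[str, ...]
# Intra-dimensional fallbacks: one dict per dimension, value -> fallback value.
IntraFallbacks = List[Dict[str, str]]


def direct_fallbacks(point: Point, intra: IntraFallbacks) -> List[Point]:
    """Points reached by following one fallback rule in exactly one dimension."""
    result = []
    for index, value in enumerate(point):
        parent = intra[index].get(value)  # None if the value has no fallback
        if parent is not None:
            result.append(point[:index] + (parent,) + point[index + 1:])
    return result


def all_fallbacks(point: Point, intra: IntraFallbacks) -> Set[Point]:
    """Every point reachable by repeatedly following direct fallbacks."""
    seen: Set[Point] = set()
    todo = [point]
    while todo:
        for fallback in direct_fallbacks(todo.pop(), intra):
            if fallback not in seen:
                seen.add(fallback)
                todo.append(fallback)
    return seen


# Hypothetical rules: language "ch" falls back to "de";
# a second, made-up dimension lets "b2b" fall back to "all".
intra = [{"ch": "de"}, {"b2b": "all"}]
```

A real implementation would additionally have to respect the Allowed Dimension Subspace, which this sketch ignores.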

Nodes and Node Aggregates

A Node is located in exactly one Dimension Space Point, but it can be visible in more Dimension Space Points. Let's take the 1-dimensional example from above to illustrate this:

A Node located in the "language=fr" Dimension Space Point is visible exactly in French.
A Node located in the "language=de" Dimension Space Point is visible in German and in Swiss (because of the dimension fallback), as long as no Node located in language=ch exists.

Every node has an individual identity, but we need a way to group nodes which "mean the same" together. As an example, we need to be able to find out whether the current page exists in French or not. In order to do this, we group together related nodes from different Dimension Space Points into the same Node Aggregate.

A Node Aggregate has an identifier (a UUID) and the node type; to ensure all nodes in the different Dimension Space Points are at least adhering to the same "interface", i.e. have the same node type.

The Node Aggregate's identifier is the external identity of a node; meaning this will be used for referencing other nodes throughout the system, i.e. for links and references.

A Node Aggregate has to ensure the invariant that all Nodes belonging to it are visible in different Dimension Space Points. This will be done by implementing a hard constraint in a "NodeCreationAndDeletionAggregateRoot", using the Node Aggregate UUID as stream name.
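The invariant can be illustrated with a small Python sketch. It is deliberately simplified and hypothetical: Dimension Space Points are plain strings, the class and method names are invented, and nothing here reflects how the aggregate root is actually implemented.

```python
from typing import Dict, Optional, Set


class NodeAggregate:
    """Groups the nodes that "mean the same" across Dimension Space Points."""

    def __init__(self, identifier: str, node_type: str) -> None:
        self.identifier = identifier
        self.node_type = node_type
        # node identifier -> Dimension Space Points the node is visible in
        self._visible_in: Dict[str, Set[str]] = {}

    def add_node(self, node_id: str, node_type: str, visible_in: Set[str]) -> None:
        if node_type != self.node_type:
            raise ValueError("all nodes of an aggregate share one node type")
        # Hard constraint: no two nodes may be visible in the same point.
        for other_id, points in self._visible_in.items():
            overlap = points & visible_in
            if overlap:
                raise ValueError(f"{sorted(overlap)} already covered by {other_id}")
        self._visible_in[node_id] = set(visible_in)

    def resolve(self, point: str) -> Optional[str]:
        """Return the node visible in the given point, if there is one."""
        for node_id, points in self._visible_in.items():
            if point in points:
                return node_id
        return None
```

Since at most one node of the aggregate covers any given point, resolve can answer questions like "does this page exist in French?" unambiguously.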

Node Type Strategies

Especially when moving nodes and using different dimensions, the current behavior is as follows:

If a Document Node is moved, we currently also move the node in the other dimensions.
If a Content Node is moved (i.e. inside a Document Node), the other dimensions are not touched.

The rationale behind this is as follows: Most users think of their website as a single big tree of pages, even though there are multiple translations of it. They usually know where the content is located in the tree. If we did not move Document Node Variants alongside, it would be very likely that the user would start translating, move a single variant somewhere else, and then, over time, the connection between the different variants would diminish. If a visitor of the website then navigated from the German to the English version of a page, he might end up at a totally different part of the website (which he would not expect at all).

Currently, the behavior when moving is quite hard-coded, and only the approaches above exist. However, for certain use-cases, one might want to adjust this behavior and fine-tune it. That's why we are introducing the concept of a Move Strategy, which is configured on the Node Type. This is also the reason that the Node Aggregate contains the Node Type: We want all Nodes in an aggregate to behave in a similar manner, e.g. when moving nodes. We won't just make a "Move Strategy" configurable, but also other strategies as we need them.

Identity of a Node

To sum it up, the following properties are the main invariants for Nodes:

Nodes are uniquely identified by a Content Stream Identifier and a Node Identifier.
When creating a Node, you additionally need to specify a Dimension Space Point (where the node resides).
A Node can be visible in multiple Dimension Space Points.
No two nodes belonging to the same node aggregate share the same Dimension Space Point where they are visible. Thus a node can be uniquely resolved by Node Aggregate Identifier, Content Stream Identifier and Dimension Space Point.

Moving a Node in the Dimension Space: Generalization, Specialization, Translation

We now introduce the three main ways by which a node can be moved in the Dimension Space, just to ensure we all use the same wording.

Below, we show an exemplary 1-dimensional fallback graph (inter- and intra-dimensional). mul is the base language, de and en inherit from it; and de has the special cases "ch" and "at". Let's think about a node which originally is in the "de" language (outlined by the blue border in the diagram below):

This node in content dimension "de" will, by default, also be visible in "ch" and "at" (if there does not exist another node in the same Node Aggregate in ch or at). When creating the node in "de", the command handler will look at the fallback graph above and see that the node is also visible in "ch" and "at". In the event, all three Dimension Space Points (de, ch, at) will be stored.

When the user wants to create a specialized version of the Node for schweizerdütsch (ch), we call this a specialization. In this case, a new Node (belonging to the same Node Aggregate) will be created for ch; and the "ch" dimension space point will be removed from the "de" node from above (as the German content does not shine through anymore, but has been replaced by specialized Swiss content).

When the user erroneously created content as "de", but wants to move the content up the hierarchy in the fallback tree (as multilanguage / mul), we call this a generalization.

Last but not least, when the user wants to copy a Node to a different part of the fallback tree (neither ancestors nor children), we call this a translation.
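Using the example fallback graph from this section (mul as base, de and en below it, ch and at below de), the three terms and the visibility rule can be expressed in a short Python sketch. The function names and the representation of the fallback graph as a simple "point -> parent" dict are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set

# Fallback graph from the example: each point maps to what it falls back to.
FALLBACK: Dict[str, Optional[str]] = {
    "mul": None, "de": "mul", "en": "mul", "ch": "de", "at": "de",
}


def ancestors(point: str, fallback: Dict[str, Optional[str]]) -> List[str]:
    """All points that `point` (transitively) falls back to."""
    result = []
    current = fallback[point]
    while current is not None:
        result.append(current)
        current = fallback[current]
    return result


def classify_move(source: str, target: str, fallback: Dict[str, Optional[str]]) -> str:
    if source == target:
        raise ValueError("source and target must differ")
    if source in ancestors(target, fallback):
        return "specialization"  # target is below source, e.g. de -> ch
    if target in ancestors(source, fallback):
        return "generalization"  # target is above source, e.g. de -> mul
    return "translation"         # anything else, e.g. de -> en


def visible_points(origin: str, fallback: Dict[str, Optional[str]], occupied: Set[str]) -> Set[str]:
    """Points where the node at `origin` is visible; `occupied` holds the
    points in which nodes of the same aggregate are located (incl. origin)."""
    visible = set()
    for point in fallback:
        current: Optional[str] = point
        while current is not None and current not in occupied:
            current = fallback[current]
        if current == origin:
            visible.add(point)
    return visible
```

As long as only the "de" node exists, it is visible in de, ch and at; once a specialized "ch" node is added to the aggregate, "ch" drops out of the German node's visibility.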

Implementation Details of the Content Graph Projection

Our main read model is the so-called content graph, giving us the traditional tree-shaped projection we need for the usual access patterns. Our projection basically has a nodes table and a hierarchy-edge table in the database.

The Content graph is optimized for reading; meaning we want to traverse the content tree for a particular workspace and Dimension Space Point really efficiently, without calculating fallbacks. On the other side, a modification should also be relatively cheap, especially:

We want to let the database do as many operations as possible.
Moving of nodes should just relocate the edges, and not touch the nodes.
The fork content stream operation must be relatively cheap.

We conceptually give each edge a color, corresponding to a combined identifier of Content Stream Identifier + Dimension Space Point. When we traverse all edges with the same color, we navigate the content tree as we know it from within Neos.

There's one important caveat to note: The graph projection must be correct for all content streams at the same time. This must be handled specifically when the Fork Content Stream operation takes place.

The image above illustrates a simple case with two content streams and a single Dimension Space point (i.e. a 0-dimensional dimension space). After NodeWasCreated, the projection contains the root node and NodeA, connected through a single "live" edge (the black one). On processing the ContentStreamWasForked event, all black edges are duplicated and painted green (i.e. flagged from live to user-sebastian). The NodeA contains the "original" text string; meaning this is the string visible in both content streams. Now, we have the situation shown in the image above.

Now, a PropertyWasModified on user-sebastian needs to be processed. We cannot just modify NodeA directly, as we would then also modify the projection for the live content stream. Instead, we need to do a Copy on Write: If a Node has incoming edges from more than our current content stream, we need to copy the node, change the edges, and then update the node property. The Node's identifier is still "NodeA", both in content stream live and content stream user-sebastian. This is all displayed in the diagram below.

The same applies if the live content stream changes after another content stream was forked from it: Even here, we need to do the Copy on Write, to ensure the forked content stream sees the point in time where the fork occurred.

Note: We never need to join the two nodes together again (the "green" NodeA and the "black" NodeA), because if a merge happens, we just replay the events on the live content stream, and discard the green one.

Identifiers in the Graph Projection

In order to do the copy on write in the database correctly, we need a way to distinguish the two nodes from above (remember, they both have the Node Identity "NodeA"). We do this by adding a new, projection-internal identifier for implementing Copy on Write, which we call (currently) Node Anchor Point.

In summary, we have the following three levels of identifiers for a node:

The Node Aggregate Identifier is the external identity: You can store it persistently, send it to another system, link to it, ...
The Node Identifier is an identifier you are not allowed to remember persistently (in user code), but you need it for e.g. adding a node as child of another node.
The Node Anchor Point is a purely internal implementation detail of the Graph Projection, needed to implement Copy on Write. It is assigned inside the graph projector, and is never passed to the outside world.
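To tie the colored edges, the fork operation, Copy on Write and the Node Anchor Point together, here is a compact in-memory Python sketch. It is not the real projector (which works on database tables): the class and method names are invented, anchor points are simple integers, and, as a simplification, the root node gets an incoming edge without a parent so that it can be looked up like any other node.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, List, Optional, Tuple

# Edge color: (Content Stream Identifier, Dimension Space Point)
Color = Tuple[str, str]


@dataclass
class Edge:
    color: Color
    parent: Optional[int]  # Node Anchor Point of the parent (None for root)
    child: int             # Node Anchor Point of the child


class ContentGraph:
    def __init__(self) -> None:
        self._anchors = count(1)          # projection-internal identifiers
        self.nodes: Dict[int, dict] = {}  # anchor point -> node record
        self.edges: List[Edge] = []

    def find(self, color: Color, node_id: str) -> int:
        for edge in self.edges:
            if edge.color == color and self.nodes[edge.child]["node_id"] == node_id:
                return edge.child
        raise KeyError(node_id)

    def create_node(self, color: Color, node_id: str, parent_id: Optional[str], properties: dict) -> None:
        anchor = next(self._anchors)
        self.nodes[anchor] = {"node_id": node_id, "properties": dict(properties)}
        parent = self.find(color, parent_id) if parent_id is not None else None
        self.edges.append(Edge(color, parent, anchor))

    def fork(self, source: Color, target: Color) -> None:
        # Only edges are duplicated ("repainted"); nodes stay shared.
        self.edges += [Edge(target, e.parent, e.child) for e in self.edges if e.color == source]

    def set_property(self, color: Color, node_id: str, name: str, value: str) -> None:
        anchor = self.find(color, node_id)
        shared = any(e.child == anchor and e.color != color for e in self.edges)
        if shared:
            # Copy on Write: copy the node and re-point our own edges to it.
            copy = next(self._anchors)
            old = self.nodes[anchor]
            self.nodes[copy] = {"node_id": node_id, "properties": dict(old["properties"])}
            for edge in self.edges:
                if edge.color == color:
                    if edge.child == anchor:
                        edge.child = copy
                    if edge.parent == anchor:
                        edge.parent = copy
            anchor = copy
        self.nodes[anchor]["properties"][name] = value

    def get_property(self, color: Color, node_id: str, name: str) -> str:
        return self.nodes[self.find(color, node_id)]["properties"][name]
```

Both copies keep the Node Identifier "NodeA"; only the anchor point differs, and it never leaves the projection.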

Summing Up

I've tried to give a comprehensive overview from the basic concepts down to all the details I can remember from last week's workshop. Don't hesitate to ask questions: I am reachable on Twitter as @skurfuerst, and you'll find us on slack.neos.io in #project-cr-rewrite.

I'll leave the current status and our next steps up for another post, so that this can serve as a reference of our ideas.