title: Graph Data Modeling Fundamentals
updated: 2022-07-30 13:32:01Z
created: 2022-07-17 19:04:53Z

What is Graph Data Modeling?

Why model?

If you will use a Neo4j graph to support part or all of your application, you must collaboratively work with your stakeholders to design a graph that will:

  • Answer the key use cases for the application.
  • Provide the best Cypher statement performance for the key use cases.

Components of a Neo4j graph

The Neo4j components that are used to define the graph data model are:

  • Nodes
  • Labels
  • Relationships
  • Properties
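
As a minimal sketch of how the four components appear together in Cypher (the Person and Movie labels, the ACTED_IN type, and all property keys and values here are illustrative assumptions, not a real dataset):

```cypher
// Two nodes, each with one label and one property
CREATE (p:Person {name: 'Ann Example'})
CREATE (m:Movie {title: 'Sample Movie'})
// A relationship with a type, a direction, and a property
CREATE (p)-[:ACTED_IN {role: 'Lead'}]->(m)
RETURN p, m
```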

Data modeling process

Here are the steps to create a graph data model:

  1. Understand the domain and define specific use cases (questions) for the application.
  2. Develop the initial graph data model:
     a. Model the nodes (entities).
     b. Model the relationships between nodes.
  3. Test the use cases against the initial data model.
  4. Create the graph (instance model) with test data using Cypher.
  5. Test the use cases, including performance, against the graph.
  6. Refactor (improve) the graph data model due to a change in the key use cases or for performance reasons.
  7. Implement the refactoring on the graph and retest using Cypher.

Graph data modeling is an iterative process. Your initial graph data model is a starting point, but as you learn more about the use cases, or if the use cases change, the initial graph data model will need to change. In addition, you may find that, especially when the graph scales, you need to refactor the graph to achieve the best performance for your key use cases.

Refactoring is very common in the development process. A Neo4j graph has an optional schema which is quite flexible, unlike the schema in an RDBMS. A Cypher developer can easily modify the graph to represent an improved data model.

The Domain

Understanding the domain for your application

Before you begin the data modeling process you must:

  • Identify the stakeholders and developers of the application.
  • With the stakeholders and developers:
    • Describe the application in detail.
    • Identify the users of the application (people, systems).
    • Agree upon the use cases for the application.
    • Rank the importance of the use cases.

Purpose of the Model

Types of models

When performing the graph data modeling process for an application, you will need at least two types of models:

  • Data model
  • Instance model

Data model

The data model describes the labels, relationships, and properties for the graph. It does not have specific data that will be created in the graph.

Here is an example of a data model: 0e5c55b7a519831b5ba0393544641782.png

In the data model, nothing uniquely identifies a particular node with a given label. The data model is nevertheless important because it defines the names that will be used for labels, relationship types, and properties when the graph is created and used by the application.

Style guidelines for modeling

As you begin the graph data modeling process, it is important that you agree upon how labels, relationship types, and property keys are named. Labels, relationship types, and property keys are case-sensitive, unlike Cypher keywords, which are case-insensitive.

A Neo4j best practice is to use the following when you name the elements of the graph, but you are free to use any convention for your application.

  • A label is a single identifier that begins with a capital letter and can be CamelCase.
    • Examples: Person, Company, GitHubRepo
  • A relationship type is a single identifier that is in all capital letters with the underscore character.
    • Examples: FOLLOWS, MARRIED_TO
  • A property key for a node or a relationship is a single identifier that begins with a lower-case letter and can be camelCase.
    • Examples: deptId, firstName

Note: Property key names need not be unique. For example, a Person node and a Movie node can each have the property key tmdbId.
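
A short sketch of these conventions in use, built from the example names above (the property values are made up for illustration):

```cypher
// Label in CamelCase, relationship type in upper case with underscores,
// property keys in camelCase
CREATE (a:Person {firstName: 'Ann'})-[:MARRIED_TO]->(b:Person {firstName: 'Bob'});

// Keywords are case-insensitive, so "match" behaves like MATCH,
// but :person would be a different label from :Person
match (p:Person) return p.firstName;
```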

Instance model

An important part of the graph data modeling process is to test the model against the use cases. To do this, you need to have a set of sample data that you can use to see if the use cases can be answered with the model.

Here is an example of an instance model: 4097a690bcdbf2af251898baec7a4adf.png
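
An instance model can be loaded as a handful of sample nodes and relationships. The sketch below assumes a recipe domain; the Recipe and Ingredient labels, the USES relationship type, and the name property are hypothetical names chosen for illustration:

```cypher
// MERGE only creates a node or relationship if it is missing,
// so this script can be rerun safely while iterating on test data
MERGE (r:Recipe {name: 'Pancakes'})
MERGE (flour:Ingredient {name: 'Flour'})
MERGE (egg:Ingredient {name: 'Egg'})
MERGE (r)-[:USES]->(flour)
MERGE (r)-[:USES]->(egg)
```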

Modeling Nodes

Defining labels

Entities are the dominant nouns in your application use cases:

  • What ingredients are used in a recipe?
  • Who is married to this person?

The entities of your use cases will be the labeled nodes in the graph data model.

Node properties

Node properties are used to:

  • Uniquely identify a node.
  • Answer specific details of the use cases for the application.
  • Return data.

Properties for nodes

In addition to the unique identifier, which is used to uniquely identify a node, we must revisit the use cases to determine the types of data a node must hold.
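
One way to enforce a unique identifier is a uniqueness constraint. This is a sketch assuming Neo4j 4.4 or later and assuming that tmdbId is the identifying property of Person nodes; the constraint name is arbitrary:

```cypher
// Reject any second Person node with an already-used tmdbId
CREATE CONSTRAINT person_tmdbId IF NOT EXISTS
FOR (p:Person) REQUIRE p.tmdbId IS UNIQUE
```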

Modeling Relationships

Relationships are connections between entities

Connections are the verbs in your use cases:

  • What ingredients are used in a recipe?
  • Who is married to this person?

At a glance, connections are straightforward things, but their micro- and macro-design are arguably the most critical factors in graph performance. Using “connections are verbs” is a fine shorthand to get started.

Naming relationships

Choosing good names (types) for the relationships in the graph is important. Relationship types need to be intuitive to stakeholders and developers alike, and they must not be confusable with an entity name.

Relationship direction

When you create a relationship in Neo4j, a direction must either be specified explicitly or inferred by the left-to-right direction in the pattern specified. At runtime, during a query, direction is typically not required.
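
The difference between creating and querying can be sketched as follows (the Person nodes and their firstName values are assumed to exist already):

```cypher
// CREATE needs an explicit direction
MATCH (a:Person {firstName: 'Ann'}), (b:Person {firstName: 'Bob'})
CREATE (a)-[:FOLLOWS]->(b);

// A query may omit the direction and match the relationship either way
MATCH (a:Person {firstName: 'Ann'})-[:FOLLOWS]-(other:Person)
RETURN other.firstName;
```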

Fanout

The main risk of fanout is that it can lead to very dense nodes, or supernodes. These are nodes that have hundreds of thousands of incoming or outgoing relationships. Supernodes need to be handled carefully.
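
A rough way to spot candidate supernodes is to rank nodes by their number of relationships. This sketch scans the whole graph, so it is only suitable for test-sized data:

```cypher
// The ten most densely connected nodes, counting both directions
MATCH (n)-[r]-()
RETURN n, count(r) AS degree
ORDER BY degree DESC
LIMIT 10
```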

Properties for relationships

Properties for a relationship are used to enrich how two nodes are related. When you define a property for a relationship, it is because your use cases ask a specific question about how two nodes are related, not just that they are related.
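
For example, a use case such as "which role did each actor play in a given movie?" asks about how the nodes are related. A sketch, assuming a hypothetical role property on ACTED_IN and hypothetical name and title properties on the nodes:

```cypher
// The answer lives on the relationship, not on either node
MATCH (a:Actor)-[r:ACTED_IN]->(m:Movie {title: $title})
RETURN a.name AS actor, r.role AS role
```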

Testing the Model

You use the use cases to design the data model, which includes:

  • Labels for the nodes.
  • Relationship types and direction.
  • Properties for the nodes and relationships.

Implement the data model with a small set of test data and test against the graph.

Test each use case against the graph by executing Cypher queries, and refactor the data model if a use case cannot be answered. You may specify what the expected result of each query should be. Testing with more data is also worthwhile, because it lets you test scalability.

The Cypher code used to test the use cases needs to be carefully reviewed for correctness.
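
A use case test is simply a query whose result can be compared with what you expect. The sketch below covers "What ingredients are used in a recipe?" and assumes hypothetical Recipe and Ingredient labels, a USES relationship type, and a name property:

```cypher
// $recipeName is a query parameter supplied by the caller
MATCH (r:Recipe {name: $recipeName})-[:USES]->(i:Ingredient)
RETURN i.name AS ingredient
ORDER BY ingredient
```

Prefixing the query with PROFILE shows the execution plan and the work done at each step, which helps when testing performance.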

Refactoring the Graph

Refactoring

Refactoring is the process of changing the data model and the graph. There are three reasons to refactor:

  • The graph as modeled does not answer all of the use cases.
  • A new use case has come up that you must account for in your data model.
  • The Cypher for the use cases does not perform optimally, especially when the graph scales.

To refactor the data model and the graph, you must:

  1. Design the new data model.
  2. Write Cypher code to transform the existing graph to implement the new data model.
  3. Retest all use cases, possibly with updated Cypher code.

Labels in the Graph

In Cypher, you cannot parameterize labels, so keeping a value such as the country as a property makes the Cypher code more flexible. Limit the number of labels on a node to 4. The primary reason to add labels to nodes is to reduce the amount of data accessed at runtime.
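
Both points can be sketched in Cypher. The first statement adds a more specific label so that queries about actors touch fewer nodes. The second keeps the country as a property so that it can be passed as a parameter. The countries list property and the title property are assumptions made for illustration:

```cypher
// Add an Actor label to every Person who has acted in a movie
MATCH (p:Person)
WHERE (p)-[:ACTED_IN]->()
SET p:Actor;

// A property value can be a parameter; a label cannot
MATCH (m:Movie)
WHERE $country IN m.countries
RETURN m.title;
```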

Retesting After Refactoring

  • After refactoring the graph, revisit all queries for all use cases.
  • Rewrite any Cypher queries for use cases that are affected by the refactoring.

Avoid These Labels

  • Do not use the same type of label in different contexts.
  • Labels should be “semantically orthogonal”, a term that means that labels should have nothing to do with one another. Avoid labels that overlap in meaning.
  • Avoid labeling your nodes to represent hierarchies.

Eliminating Duplicate Data

  • Avoid duplicating data in your graph.
  • Eliminating duplication can improve query performance.
    • When the same value is stored as a property on many nodes, all of those nodes must be retrieved to match the property.
    • Example: refactoring a list property into nodes:
// Create one Language node per distinct value in the languages list property
MATCH (m:Movie)
UNWIND m.languages AS language
WITH language, collect(m) AS movies
MERGE (l:Language {name: language})
WITH l, movies
UNWIND movies AS m
// Connect each movie to each of its Language nodes
MERGE (m)-[:IN_LANGUAGE]->(l);
// Remove the now-redundant list property
MATCH (m:Movie)
SET m.languages = null;
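
After this refactoring, a query for movies in one language can start from a single Language node and traverse to the movies, rather than inspecting every Movie node. A sketch, assuming Movie nodes have a title property:

```cypher
// $language is a query parameter, for example 'Italian'
MATCH (l:Language {name: $language})<-[:IN_LANGUAGE]-(m:Movie)
RETURN m.title AS title
```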

Eliminating Complex Data in Nodes

Storing complex data in node properties may not be beneficial for a couple of reasons:

  • Duplicate data.
  • Queries related to the information in the nodes require that all nodes be retrieved.

Using Specific Relationships

Relationships are fast to traverse and they do not take up a lot of space in the graph.

In most cases where we specialize relationships, we keep the original generic relationships as existing queries still need to use them.

The code to refactor the graph to add these specialized relationships uses the APOC library.

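// Add a year-specific relationship alongside each existing ACTED_IN relationship
MATCH (n:Actor)-[:ACTED_IN]->(m:Movie)
// Arguments: start node, relationship type, identifying properties,
// properties to set on create, end node
CALL apoc.merge.relationship(n,
                              'ACTED_IN_' + left(m.released,4),
                              {},
                              {},
                              m ) YIELD rel
RETURN count(*) AS `Number of relationships merged`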

The APOC library has an apoc.merge.relationship procedure that allows you to dynamically create relationships in the graph. The code above uses the 4 leftmost characters of the released property for a Movie node to create the name of the relationship.
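
A query can then traverse only the specialized relationship instead of filtering every ACTED_IN relationship by year. A sketch, assuming some movies were released in 1995 and that the name and title properties exist:

```cypher
// Only relationships of type ACTED_IN_1995 are traversed
MATCH (a:Actor {name: $name})-[:ACTED_IN_1995]->(m:Movie)
RETURN m.title AS title
```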

Intermediate Nodes

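Sometimes you need a relationship that connects more than two nodes. Mathematics allows this with the concept of a hyperedge, but it is impossible in Neo4j, where a relationship always connects exactly two nodes. An intermediate node is used instead.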

69b4c46435ed52c1fe5be0ba6a074be5.png Email is the new intermediate node.

ae805ac0f184fdb6cf93d6b038af28a9.png

Intermediate nodes allow you to:

  • Deduplicate information.
  • Connect more than two nodes in a single context.
  • Share data in the graph.
  • Relate something to a relationship.
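
A minimal sketch of an Email intermediate node connecting one sender with two recipients. The SENT, ADDRESSED_TO, and COPIED_TO relationship types and all property keys and values are hypothetical names chosen for illustration:

```cypher
CREATE (ann:Person {name: 'Ann'})
CREATE (bob:Person {name: 'Bob'})
CREATE (cat:Person {name: 'Cat'})
// The email itself becomes a node, so it can connect more than two people
CREATE (e:Email {subject: 'Status update'})
CREATE (ann)-[:SENT]->(e)
CREATE (e)-[:ADDRESSED_TO]->(bob)
CREATE (e)-[:COPIED_TO]->(cat)
```

Because the email is a node, data such as the subject is stored once and shared by all connected people.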