If you use a Neo4j graph to support part or all of your application, you must work with your stakeholders to design a graph that will:
- Answer the key use cases for the application.
- Provide the best Cypher statement performance for the key use cases.
### Components of a Neo4j graph
The Neo4j components that are used to define the graph data model are:
- Nodes
- Labels
- Relationships
- Properties

Graph data modeling is an iterative process. Your initial graph data model is a starting point, but as you learn more about the use cases, or if the use cases change, the initial model will need to change. In addition, especially as the graph scales, you may need to modify (refactor) the graph to achieve the best performance for your key use cases.
Refactoring is very common in the development process. A Neo4j graph has an optional schema which is quite flexible, unlike the schema in an RDBMS. A Cypher developer can easily modify the graph to represent an improved data model.
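
As a rough illustration of how small such a refactoring can be, the sketch below adds a second label to existing nodes. It assumes a hypothetical graph in which `Person` nodes are connected to `Movie` nodes by `ACTED_IN` relationships; those names are assumptions chosen for the example, not a prescribed model.

```cypher
// Assumed model: (:Person)-[:ACTED_IN]->(:Movie)
// Add the Actor label to every Person who has acted in at least one movie.
MATCH (p:Person)-[:ACTED_IN]->(:Movie)
WITH DISTINCT p
SET p:Actor
RETURN count(p) AS `Number of nodes labeled`
```

No schema migration is needed: the existing nodes keep their `Person` label and properties, and queries that rely on them continue to work.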
# The Domain
## Understanding the domain for your application
Before you begin the data modeling process you must:
- Identify the stakeholders and developers of the application.
- With the stakeholders and developers:
  - Describe the application in detail.
  - Identify the users of the application (people, systems).
  - Agree upon the use cases for the application.
  - Rank the importance of the use cases.
# Purpose of the Model
## Types of models
When performing the graph data modeling process for an application, you will need at least two types of models:
- Data model
- Instance model
## Data model
The data model describes the labels, relationships, and properties for the graph. It does not have specific data that will be created in the graph.
In the data model, there is nothing that uniquely identifies a node with a given label. A graph data model is important, however, because it defines the names that will be used for labels, relationship types, and properties when the graph is created and used by the application.
## Style guidelines for modeling
As you begin the graph data modeling process, it is important that you agree upon how labels, relationship types, and property keys are named. Labels, relationship types, and property keys are case-sensitive, unlike Cypher keywords which are case-insensitive.
A Neo4j best practice is to use the following when you name the elements of the graph, but you are free to use any convention for your application.
- A label is a single identifier that begins with a capital letter and can be CamelCase.
  - Examples: Person, Company, GitHubRepo
- A relationship type is a single identifier in all capital letters, with underscore characters separating words.
  - Examples: FOLLOWS, MARRIED_TO
- A property key for a node or a relationship is a single identifier that begins with a lower-case letter and can be camelCase.
  - Examples: deptId, firstName

Note: Property key names need not be unique. For example, a Person node and a Movie node can each have the property key tmdbId.
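
The conventions above can be sketched in Cypher as follows. All data values are hypothetical and exist only to show the naming style; the last two statements illustrate that keywords are case-insensitive while labels are not.

```cypher
// Labels in CamelCase, relationship types in UPPER_SNAKE_CASE,
// property keys in camelCase. Sample values are made up.
CREATE (a:Person {firstName: 'Ada', tmdbId: 1})
CREATE (b:Person {firstName: 'Ben', tmdbId: 2})
CREATE (m:Movie {title: 'Example Movie', tmdbId: 1}) // same key as Person
CREATE (a)-[:MARRIED_TO]->(b)
CREATE (a)-[:FOLLOWS]->(b);

// Keywords may be written in any case, so this finds the Person nodes.
match (p:Person) return p.firstName;

// Labels are case-sensitive: person is not Person, so this finds nothing.
MATCH (p:person) RETURN p.firstName;
```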
## Instance model
An important part of the graph data modeling process is to **test** the model against the use cases. To do this, you need to have a set of sample data that you can use to see if the use cases can be answered with the model.
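
A minimal sketch of such a test, assuming a recipe application with hypothetical `Recipe` and `Ingredient` labels and a hypothetical `USES` relationship type: first create a small instance model, then check that a use-case question can be answered from it.

```cypher
// Sample data for the instance model (all names are assumptions).
CREATE (r:Recipe {name: 'Pancakes'})
CREATE (f:Ingredient {name: 'Flour'})
CREATE (e:Ingredient {name: 'Egg'})
CREATE (r)-[:USES]->(f)
CREATE (r)-[:USES]->(e);

// Use case: what ingredients are used in a recipe?
MATCH (:Recipe {name: 'Pancakes'})-[:USES]->(i:Ingredient)
RETURN i.name AS ingredient;
```

If the query cannot be written, or returns something other than what the use case expects, that is a signal to revise the data model before loading real data.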
Entities are the dominant **nouns** in your application use cases:
- What **ingredients** are used in a **recipe**?
- Who is married to this **person**?
The entities of your use cases will be the labeled nodes in the graph data model.
## Node properties
Node properties are used to:
- Uniquely identify a node.
- Answer specific details of the use cases for the application.
- Return data.
## Properties for nodes
In addition to the unique identifier, which is used to uniquely identify a node, we must revisit the use cases to determine the types of data a node must hold.
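
The sketch below shows one way these roles might look in practice. It assumes that `tmdbId` is the identifying property for `Person` nodes and that the use cases also need a name and a birth date; the values are hypothetical. The constraint uses the `REQUIRE` syntax of recent Neo4j versions, which older versions may not accept.

```cypher
// Enforce uniqueness of the identifying property (assumed to be tmdbId).
CREATE CONSTRAINT person_tmdbId IF NOT EXISTS
FOR (p:Person) REQUIRE p.tmdbId IS UNIQUE;

// The identifier locates the node; the other properties hold the
// details that the use cases ask about. Values are made up.
MERGE (p:Person {tmdbId: 1001})
SET p.name = 'Example Person', p.born = date('1970-01-01');

// Return data for a specific node.
MATCH (p:Person {tmdbId: 1001})
RETURN p.name AS name, p.born AS born;
```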
# Modeling Relationships
## Relationships are connections between entities
Connections are the verbs in your use cases:
- What ingredients are used in a recipe?
- Who is married to this person?
At a glance, connections are straightforward things, but their micro- and macro-design are arguably the most critical factors in graph performance. Using “connections are verbs” is a fine shorthand to get started.
## Naming relationships
Choosing good names (types) for the relationships in the graph is important. Relationship types need to be intuitive to stakeholders and developers alike, and they must not be confusable with entity names.
## Relationship direction
When you create a relationship in Neo4j, a **direction must either be specified explicitly** or **inferred by the left-to-right direction** in the pattern specified. At runtime, during a query, direction is typically not required.
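
As a small sketch of this, with hypothetical `Person` nodes: the relationship is written with an explicit direction, and the query then reads it without specifying one.

```cypher
// Write: the arrow gives the relationship an explicit direction.
MERGE (a:Person {name: 'Ann'})
MERGE (b:Person {name: 'Bob'})
MERGE (a)-[:MARRIED_TO]->(b);

// Read: no arrow, so the pattern matches the relationship
// regardless of the direction in which it was stored.
MATCH (p:Person {name: 'Bob'})-[:MARRIED_TO]-(spouse:Person)
RETURN spouse.name AS spouse;
```

For a symmetric relationship such as MARRIED_TO, this means a single stored relationship is enough; there is no need to create one in each direction.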
## Fanout
The main risk of fanout is that it can lead to very dense nodes, or supernodes. These are nodes that have hundreds of thousands of incoming or outgoing relationships. Supernodes need to be handled carefully.
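
One rough way to look for candidate supernodes is to count the relationships attached to each node. The threshold of 100000 below is an assumption taken from the description above, not a fixed rule, and the query scans the whole graph, so it is better suited to a test copy than to a busy production database.

```cypher
// Count relationships in either direction for every node and
// list the densest ones. The threshold is an assumed cut-off.
MATCH (n)-[r]-()
WITH n, count(r) AS degree
WHERE degree > 100000
RETURN labels(n) AS labels, degree
ORDER BY degree DESC
LIMIT 10;
```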
Properties for a relationship are used to enrich how two nodes are related. When you define a property for a relationship, it is because your use cases ask a specific question about how two nodes are related, not just that they are related.
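
For instance, a use case that asks which role an actor played in a movie is a question about how the two nodes are related. A minimal sketch, assuming a hypothetical `role` property on an `ACTED_IN` relationship and made-up values:

```cypher
// Store the detail on the relationship, not on either node.
MERGE (a:Actor {name: 'Example Actor'})
MERGE (m:Movie {title: 'Example Movie'})
MERGE (a)-[r:ACTED_IN]->(m)
SET r.role = 'Example Role';

// Use case: who acted in this movie, and in which role?
MATCH (a:Actor)-[r:ACTED_IN]->(:Movie {title: 'Example Movie'})
RETURN a.name AS actor, r.role AS role;
```
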
- Queries related to the information in the nodes require that all nodes be retrieved.
# Using Specific Relationships
Relationships are fast to traverse and they do not take up a lot of space in the graph.
In most cases where we specialize relationships, we keep the original generic relationships because existing queries still need to use them.
The code to refactor the graph to add these specialized relationships uses the **APOC library**.
```
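// For every ACTED_IN relationship, merge a year-specific relationship
// between the same actor and movie.
MATCH (n:Actor)-[r:ACTED_IN]->(m:Movie)
CALL apoc.merge.relationship(n,
  'ACTED_IN_' + left(m.released,4), // type built from the release year
  {},   // identifying properties
  {},   // properties set on create
  m,    // end node
  {}    // properties set on match
) YIELD rel
RETURN COUNT(*) AS `Number of relationships merged`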
```
The APOC library has an **apoc.merge.relationship** procedure that allows you to **dynamically create relationships in the graph**. The code above uses the 4 leftmost characters of the released property of a Movie node to build the relationship type.
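
Once the specialized relationships exist, a query for a single year can name the relationship type directly. The sketch below assumes that some movies were released in 1995 and that nodes carry `name` and `title` properties; both are assumptions for illustration.

```cypher
// Specialized relationship: only the 1995 relationships are traversed.
MATCH (a:Actor)-[:ACTED_IN_1995]->(m:Movie)
RETURN a.name AS actor, m.title AS movie;

// Generic equivalent: every ACTED_IN relationship is traversed and
// each Movie node is read to check its released property.
MATCH (a:Actor)-[:ACTED_IN]->(m:Movie)
WHERE m.released STARTS WITH '1995'
RETURN a.name AS actor, m.title AS movie;
```

The first form avoids reading the Movie nodes just to filter on the year, which is the motivation for specializing the relationship.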
# Intermediate Nodes
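Sometimes you need a relationship that connects more than two nodes. Mathematics allows this with the concept of a hyperedge, but it is not possible in Neo4j, where a relationship always connects exactly two nodes. The solution is to model the connection as an intermediate node.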
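
A minimal sketch of the idea, assuming a hypothetical employment scenario in which a person, a company, and a role all take part in the same fact. The `Employment` label, the relationship types, and the values are assumptions made for this example.

```cypher
// The Employment node stands in for the hyperedge that would
// connect Person, Company, and Role at once.
MERGE (p:Person {name: 'Ann'})
MERGE (c:Company {name: 'Acme'})
MERGE (r:Role {title: 'Engineer'})
CREATE (e:Employment {startDate: date('2020-01-01')})
CREATE (p)-[:HAS_EMPLOYMENT]->(e)
CREATE (e)-[:AT_COMPANY]->(c)
CREATE (e)-[:IN_ROLE]->(r);

// Use case: where has Ann worked, in which role, and since when?
MATCH (:Person {name: 'Ann'})-[:HAS_EMPLOYMENT]->(e:Employment)-[:AT_COMPANY]->(c:Company),
      (e)-[:IN_ROLE]->(r:Role)
RETURN c.name AS company, r.title AS role, e.startDate AS since;
```

Because the intermediate node is an ordinary node, it can hold its own properties and can later be connected to further nodes without changing the existing ones.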