You’ve probably seen or interacted with a graph, whether you realized it or not. Our world is made up of relationships. Who we know, how we interact, how we transact – graphs structure information in a way that makes these inherent relationships explicit.
Analytically speaking, knowledge graphs provide the most intuitive means of synthesizing and representing connections within and between data sets. A knowledge graph is a technical artifact “that presents data visually as entities and the relationships between them.” It provides the analyst with a digital model of a problem. And it looks like this…
This article discusses what makes a great graph and answers some common questions related to its technical implementation.
Graphs can represent almost anything where there is interaction or exchange. Entities (or nodes) can be people, companies, documents, geographic locations, bank accounts, cryptocurrency wallets, physical assets, etc. Edges (or links) can represent conversations, phone calls, emails, academic appointments, network packet transfer, ad impressions and conversions, financial transactions, personal relationships, etc.
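To make this concrete, here is a minimal sketch in Python using the networkx library; the entity names, attributes, and amounts are invented purely for illustration.

```python
import networkx as nx

# Heterogeneous entities become nodes; interactions become edges.
# All names and attribute values below are illustrative only.
G = nx.MultiDiGraph()
G.add_node("Jane Doe", kind="person")
G.add_node("acct_001", kind="bank_account")
G.add_node("wallet_9f3", kind="crypto_wallet")

G.add_edge("Jane Doe", "acct_001", kind="owns")
G.add_edge("acct_001", "wallet_9f3", kind="transfers_to", amount=5000)
```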
So what makes a great graph?
- The purpose of the graph is clear.
The domain of graph-based solutions includes an analytical environment (often powered by a graph database), graph analysis techniques, and graph visualization techniques. Graphs, like most analytical tools, require specific use cases. Graphs can be used to visualize connections within and between data sets, to uncover latent connections, to simulate the spread of information or model contagion, to model network traffic or social behavior, to identify the most influential actors in a social network, and many other use cases. Who is using the graph? What are these users trying to accomplish analytically or visually? Are they exploring an organization's data? Are they answering specific questions? Are they analyzing, modeling, simulating, predicting? Understanding the use cases that the graph-based solution is meant to address is the first step in establishing the purpose of the graph and identifying the graph domain.
- The graph is domain-specific.
Probably the biggest mistake in implementing graph-based solutions is the attempt to create a master graph. One graph to rule them all. In other words, all of the company’s data in one graph. The graph is not a master data management (MDM) solution or a replacement for a data warehouse, even if the organization has a scalable graph database. The most successful graphs represent a given domain of analytical investigation. For example, a financial intelligence graph might contain companies, beneficial ownership structures, financial transactions, financial institutions, and high net worth individuals. A pattern-of-life location graph might contain high-volume signal data such as IP addresses and mobile phone data, along with physical locations, technical assets, and individuals. Once the purpose and domain of a graph are clear, architects can move on to the data available and/or needed to build the graph.
- The graph has a clear outline.
A graph stored in a graph database will have a schema that dictates its structure. In other words, the schema specifies the types of entities that exist in the graph and the relationships that are allowed between them. One advantage of a graph database over other types of databases is that the schema is flexible and can be updated as new data, entities, and relationship types are added to the graph over time. Graph data engineers make many decisions when designing a graph database, translating the ontology (the conceptual structure of a data set) into a schema that makes sense for the graph being built. If the data is well understood within the organization, the graph architecture process can often begin with schema creation; if the nature of the graph and the data sets involved is more exploratory, ontology design may need to come first.
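As a rough illustration, a schema can be thought of as a whitelist of entity types and the relationship types allowed between them. The sketch below is hypothetical Python, not the schema language of any particular graph database; the type and relationship names echo the sample schema discussed next.

```python
# Hypothetical schema: which relationship types are allowed between
# which entity types. A real graph database enforces this natively.
SCHEMA = {
    ("person", "person"): {"is_related_to", "is_married_to"},
    ("person", "location"): {"lives_at"},
    ("person", "company"): {"invests_in"},
    ("document", "person"): {"mentions"},
}

def edge_allowed(src_type: str, dst_type: str, rel: str) -> bool:
    """Validate a proposed edge against the schema before inserting it."""
    return rel in SCHEMA.get((src_type, dst_type), set())

assert edge_allowed("person", "location", "lives_at")
assert not edge_allowed("location", "person", "lives_at")  # wrong direction
```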
Consider the sample schema in the image below. There are five types of entities: people (yellow), physical and virtual locations (blue), documents (grey), companies (pink), and financial accounts (green). Several types of relationships are allowed between entities, for example, “is_related_to”, “mentions”, and “invests_in”. This is a directed graph, meaning that the direction of a relationship carries meaning: two people are_married_to_each_other (a bidirectional link), while one person lives_at a place (a directed link).
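In code, directionality might look like the following networkx sketch; the person names and the address are hypothetical. A bidirectional relationship is simply stored as a pair of directed edges.

```python
import networkx as nx

G = nx.DiGraph()

# Directed link: meaningful in one direction only.
G.add_edge("Alice", "12 Grimmauld Place", rel="lives_at")

# Bidirectional link: represented as two directed edges.
G.add_edge("Alice", "Bob", rel="is_married_to")
G.add_edge("Bob", "Alice", rel="is_married_to")
```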
- There is a clear mechanism for connecting data sets.
Connections between entities in different data sets are not always explicit in the data. Simply importing two data sets into a graph environment can result in many nodes with no connections between them.
Consider a medical dataset that has an entry for Tom Marvolo Riddle, and a voter registration dataset that has an entry for TM Riddle and an entry for Merope Riddle Gaunt. In the medical dataset, Merope Gaunt is listed as the mother of Tom Riddle. In the voter registration dataset, no family members are described. How are the entries for Tom Marvolo Riddle and TM Riddle de-duplicated when the datasets are merged in the graph? There should not be two separate nodes for Tom Riddle and TM Riddle, since they are the same person. How are Tom Riddle and Merope Gaunt connected, and how is their connection specified, as in the image below: connected, related, mother/son? Is the relationship weighted?
These questions require not only a data engineering team to specify the graph schema and implement the graph layout, but also some sort of entity resolution process, which I've written about previously.
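As a taste of what entity resolution involves, here is a deliberately naive matching heuristic in Python. It flags "TM Riddle" and "Tom Marvolo Riddle" as the same candidate person; production entity resolution would weigh many more signals (dates of birth, addresses, fuzzy string similarity) than this sketch does.

```python
import re

def normalize(name: str) -> list[str]:
    """Lowercase, strip punctuation, and split into tokens."""
    return re.sub(r"[^\w\s]", "", name).lower().split()

def same_person(a: str, b: str) -> bool:
    """Naive heuristic: surnames match, and the shorter name's first
    token matches the initials of the longer name's given names."""
    ta, tb = normalize(a), normalize(b)
    if ta[-1] != tb[-1]:  # surnames must agree
        return False
    short, long_ = sorted((ta, tb), key=len)
    initials = "".join(tok[0] for tok in long_[:-1])
    return short[0] == initials or short[:-1] == long_[:-1]

print(same_person("TM Riddle", "Tom Marvolo Riddle"))        # True
print(same_person("Merope Riddle Gaunt", "Tom Marvolo Riddle"))  # False
```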
- The graph is designed to scale.
Graph data is pre-joined in the graph data store, which means that single-hop queries run faster than on traditional databases: for example, querying Tom Riddle and seeing all of his immediate connections. Analytical operations that traverse the graph, however, are quite slow: for example, 'show me the shortest path between Tom Riddle and Minerva McGonagall' or 'which character has the highest eigenvector centrality in Harry Potter and the Half-Blood Prince'. As a rule of thumb, latency on graph operations increases exponentially with the density of the graph (the ratio of existing connections to all possible connections). Most graph visualization tools struggle to render more than a few tens of thousands of nodes on screen.
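The contrast is easy to demonstrate with networkx on a toy graph; the character network below is invented. Neighbor lookups touch only the pre-joined adjacency, while shortest paths and eigenvector centrality must traverse or iterate over the whole graph, which is where latency grows on large, dense graphs.

```python
import networkx as nx

# Toy character network; the edges are invented for illustration.
G = nx.Graph()
G.add_edges_from([
    ("Tom Riddle", "Merope Gaunt"),
    ("Tom Riddle", "Horace Slughorn"),
    ("Horace Slughorn", "Minerva McGonagall"),
])

# Single-hop query: cheap, the adjacency is effectively pre-joined.
print(list(G.neighbors("Tom Riddle")))

# Whole-graph analytics: cost grows quickly with size and density.
print(nx.shortest_path(G, "Tom Riddle", "Minerva McGonagall"))
print(nx.eigenvector_centrality(G))

# Density: existing edges / possible edges, a key driver of latency.
print(nx.density(G))
```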
If an organization is looking for a scalable graph solution that serves multiple concurrent analyst users, a tailored graph data architecture is required. This includes a scalable graph database, various graph data engineering processes, and a front-end visualization tool.
- The graph has a solution to handle temporality.
Once a graph solution is built, one of the biggest challenges is maintaining it. Connecting five data sets into a graph database and rendering the resulting graph analysis environment produces a snapshot in time. What is the periodicity of those data sets, and how often should the graph be updated, e.g., weekly, monthly, quarterly, in real time? Is data overwritten or appended? Are deleted entities removed from the graph or retained? How are updated data sets delivered, e.g., as delta tables or as the entire data set again? If there are temporal elements in the data, how are they represented?
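One common pattern, sketched below in Python under the assumption of append-only loads, is to stamp every edge with the date and source of its load so that earlier states of the graph can be reconstructed rather than overwritten; the field names are hypothetical.

```python
import networkx as nx
from datetime import date

G = nx.MultiDiGraph()

# Append-only load: each edge records when and from which dataset it
# arrived, so nothing is silently overwritten.
G.add_edge("Merope Gaunt", "Tom Riddle", rel="mother_of",
           loaded_on=date(2024, 1, 15), source="medical_records")

def as_of(g: nx.MultiDiGraph, when: date) -> nx.MultiDiGraph:
    """Reconstruct the graph as it existed on a given date."""
    snapshot = nx.MultiDiGraph()
    snapshot.add_nodes_from(g.nodes(data=True))
    for u, v, data in g.edges(data=True):
        if data["loaded_on"] <= when:
            snapshot.add_edge(u, v, **data)
    return snapshot
```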
- The graph-based solution is designed by graph data engineers.
Graphs are beautiful. They are intuitive to humans, engaging, and highly visual. Conceptually, they are deceptively simple. Put together some data sets, specify relationships between them, merge the data, and a graph is born. Analyze the graph and generate pretty pictures. But the data engineering challenges associated with architecting a scalable graph-based solution are non-trivial.
Tool and technology selection, schema design, graph data engineering, approaches to entity resolution and data deduplication, and an architecture suited to the intended use are just a few of the challenges. What matters is having a true graph team at the helm when designing an enterprise graph-based solution. A graph visualization capability does not a graph solution make. Simple point-and-click self-service software may work for a single analyst, but it is a far cry from a graph analytics environment that can serve the whole organization. Graph data engineers, methodologists, and solution architects with graph expertise are required to build a high-fidelity graph-based solution in light of all the challenges mentioned above.
Conclusion
I’ve seen how graphs have changed many real-world analytics organizations. Regardless of the analytics domain, much of an analyst’s work is manual. There are numerous technology products that attempt to automate analyst workflows or create point-and-click solutions. Despite these efforts, the fundamental problem remains: the data an analyst needs is rarely easily accessible through an interface, let alone interconnected and ready for iterative exploration. Data is provided to analysts through a variety of platforms, application programming interfaces (APIs), and query tools, all of which require varying levels of technical acumen to access. It is then up to the analyst to manually synthesize the data and draw meaningful analytical conclusions.
Graph-based solutions bring all of an analyst’s relevant data into one place and represent it in an intuitive way. This gives the analyst the ability to click through entities and connections as the analysis requires. I have personally helped teams build anti-money laundering solutions, identify malicious actors and illicit financial transactions, intercept migrants lost at sea, track the movement of illegal substances, tackle illegal wildlife trafficking, and predict migration routes – all with graph-based solutions. To harness the power of graph solutions for an analytics organization, you first need to build a great graph – a solid foundation on which to build stronger, more impactful analytical research.