Exploring cancer types with neo4j | by David Wells | August, 2024

How to identify and visualize clusters in knowledge graphs

In this post, we will identify and visualize different groups of cancer types by analyzing disease ontology as a knowledge graph. Specifically, we will set up neo4j in a Docker container, import the ontology, generate graph clusters and embeddings, before using dimension reduction to plot these clusters and gain some insights. Although we are using `disease_ontology` as an example, the same steps can be used to explore any ontology or graph database.

Cancer types shown as embeddings and colored by cluster, image by the author

In a graph database, instead of storing data as rows (like a spreadsheet or relational database), data is stored as nodes and relationships between nodes. For example, in the figure below, we see that melanoma and carcinoma are subcategories of the cell type cancer tumor (shown by the relationship SCO). With this type of data, we can clearly see that melanoma and carcinoma are related, even though this is not explicitly stated in the data.

Example of a graph database, image by the author

Ontologies are a formalized set of concepts and relationships between those concepts. They are much easier for computers to parse than free text, and therefore it is easier to extract meaning from them. Ontologies are widely used in the life sciences, and you may find an ontology that interests you at https://obofoundry.org/. Here we will focus on the disease ontology, which shows how different types of diseases relate to each other.

Neo4j is a tool for managing, querying and analyzing graph databases. To facilitate its configuration, we will use a Docker container.

docker run \
-it --rm \
--publish=7474:7474 --publish=7687:7687 \
--env NEO4J_AUTH=neo4j/123456789 \
--env NEO4J_PLUGINS='["graph-data-science","apoc","n10s"]' \
neo4j:5.17.0

In the above command, the `--publish` flags expose the ports that let Python query the database directly (7687) and let us access it via a browser (7474). The `NEO4J_PLUGINS` argument specifies which plugins to install. Unfortunately, the Windows Docker image doesn’t seem to be able to handle the installation, so to proceed on Windows, you’ll need to install Neo4j Desktop manually. Don’t worry though, the other steps should still work for you.

While neo4j is running, you can access your database by going to http://localhost:7474/ in your browser, or you can use the Python driver to connect as shown below. Note that we are using the port we published with our docker command above and we are authenticating with the username and password we also defined above.

from neo4j import GraphDatabase

URI = "bolt://localhost:7687"
AUTH = ("neo4j", "123456789")
driver = GraphDatabase.driver(URI, auth=AUTH)
driver.verify_connectivity()

Once you have your neo4j database set up, it’s time to get some data. The neo4j n10s plugin is designed to import and handle ontologies; you can use it to integrate your data into an existing ontology or to explore the ontology itself. Using the cypher commands below, we first set up some configurations to make the results clearer, then set a uniqueness constraint, and finally import the disease ontology.

CALL n10s.graphconfig.init({ handleVocabUris: "IGNORE" });
CREATE CONSTRAINT n10s_unique_uri FOR (r:Resource) REQUIRE r.uri IS UNIQUE;
CALL n10s.onto.import.fetch("http://purl.obolibrary.org/obo/doid.owl", "RDF/XML");

To see how this can be done with the Python driver, see the full code here https://github.com/DAWells/do_onto/blob/main/import_ontology.py
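
The three Cypher statements above can also be sent from Python. The following is a minimal sketch, not the code from the linked repository: the helper name `import_ontology` is hypothetical, and it assumes a version 5 Python driver that provides `Driver.execute_query`.

```python
# Cypher statements from the section above, in the order they must run.
IMPORT_STATEMENTS = [
    'CALL n10s.graphconfig.init({ handleVocabUris: "IGNORE" })',
    "CREATE CONSTRAINT n10s_unique_uri "
    "FOR (r:Resource) REQUIRE r.uri IS UNIQUE",
    "CALL n10s.onto.import.fetch("
    '"http://purl.obolibrary.org/obo/doid.owl", "RDF/XML")',
]


def import_ontology(driver, statements=IMPORT_STATEMENTS):
    """Send each statement to the database, one query at a time.

    `driver` is any object with an `execute_query(query)` method, such as
    the object returned by `neo4j.GraphDatabase.driver` (assumption: a
    driver version that offers `execute_query`).
    Returns the number of statements that were sent.
    """
    for statement in statements:
        driver.execute_query(statement)
    return len(statements)


# Usage, assuming the `driver` object created earlier in this post:
#   import_ontology(driver)
```

Keeping the statements in a list makes it easy to swap in a different ontology URL without touching the function.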

Now that we have imported the ontology, you can explore it by opening http://localhost:7474/ in your web browser. This allows you to explore your ontology a bit manually, but we are interested in the bigger picture, so let's do some analysis. Specifically, we will perform Louvain clustering and generate fast random projection embeddings.

Louvain clustering is a clustering algorithm for networks like this. In short, it identifies sets of nodes that are more connected to each other than to the larger set of nodes; this set is then defined as a cluster. When applied to an ontology, it is a fast way to identify a set of related concepts. Fast random projection, on the other hand, produces an embedding for each node, i.e. a numerical vector where more similar nodes have more similar vectors. With these tools we can identify which diseases are similar and quantify that similarity.
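
To make "more connected to each other than to the larger set of nodes" concrete, the sketch below computes modularity, the score that the Louvain method tries to maximize. It only scores a given partition on a toy graph; it is not the Louvain algorithm itself, nor the neo4j implementation, and the function name and toy data are made up for illustration.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    community_of: dict mapping every node to a community id.
    """
    edges = list(edges)
    m = len(edges)
    internal = defaultdict(int)  # edges with both ends in the community
    degree = defaultdict(int)    # summed node degrees per community
    for u, v in edges:
        degree[community_of[u]] += 1
        degree[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    # Observed share of internal edges minus the share expected by chance.
    return sum(
        internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree
    )


# Toy graph: two triangles joined by a single edge (c-d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_clusters = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_cluster = {node: 0 for node in two_clusters}

print(modularity(edges, two_clusters))  # positive: each triangle is dense
print(modularity(edges, one_cluster))   # zero: no better than chance
```

Splitting the graph into its two triangles scores higher than lumping everything together, which is the kind of partition a modularity-based method favors.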

To generate embeddings and clusters, we need to “project” the parts of our graph that we are interested in. Because ontologies are often very large, working on a subset like this is an easy way to speed up the computation and avoid memory errors. In this example, we are only interested in cancers and no other disease types. We do this with the Cypher query below; we match the node with the label “cancer” and any nodes that are related to it by one or more SCO or SCO_RESTRICTION relationships. Because we want to include relationships between cancer types, we have a second MATCH clause that returns the connected cancer nodes and their relationships.

MATCH (cancer:Class {label:"cancer"})<-[:SCO|SCO_RESTRICTION *1..]-(n:Class)
WITH n
MATCH (n)-[:SCO|SCO_RESTRICTION]->(m:Class)
WITH gds.graph.project(
"proj", n, m, {}, {undirectedRelationshipTypes: ['*']}
) AS g
RETURN g.graphName AS graph, g.nodeCount AS nodes, g.relationshipCount AS rels

Once we have the projection (which we have called “proj”) we can calculate the clusters and embeddings and write them back to the original graph. Finally, by querying the graph, we can obtain the new embeddings and clusters for each cancer type, which we can export to a csv file.

CALL gds.fastRP.write(
'proj',
{embeddingDimension: 128, randomSeed: 42, writeProperty: 'embedding'}
) YIELD nodePropertiesWritten
CALL gds.louvain.write(
"proj",
{writeProperty: "louvain"}
) YIELD communityCount
MATCH (cancer:Class {label:"cancer"})<-[:SCO|SCO_RESTRICTION *0..]-(n)
RETURN DISTINCT
n.label as label,
n.embedding as embedding,
n.louvain as louvain
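
The post does not show how the export itself was done, so the following is only one possible sketch using Python's standard library. The helper names `write_export` and `read_export` are hypothetical; the column names follow the RETURN clause above, and the file name `export.csv` matches the one read later in this post.

```python
import ast
import csv

FIELDS = ["label", "embedding", "louvain"]


def write_export(rows, path="export.csv"):
    """Write rows (dicts with label, embedding, louvain) to a CSV file.

    The embedding list is stored in its string form, e.g. "[0.1, -0.2]",
    so that it can be parsed back with ast.literal_eval.
    """
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        for row in rows:
            writer.writerow({
                "label": row["label"],
                "embedding": str(list(row["embedding"])),
                "louvain": row["louvain"],
            })


def read_export(path="export.csv"):
    """Read the CSV back, restoring the embedding list and cluster id."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [
            {
                "label": row["label"],
                "embedding": ast.literal_eval(row["embedding"]),
                "louvain": int(row["louvain"]),
            }
            for row in csv.DictReader(handle)
        ]


# Usage with the Python driver (assumes `driver` and a `query` string
# holding the Cypher above):
#   records, summary, keys = driver.execute_query(query)
#   write_export(record.data() for record in records)
```

Storing the embedding as a string keeps the file a plain three-column CSV, at the cost of a parsing step when it is loaded again.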

Let’s take a look at some of these clusters to see which types of cancer are grouped together. After we have loaded the exported data into a Pandas dataframe in Python, we can inspect the individual clusters.

Group 2168 is a group of pancreatic cancers.

nodes[nodes.louvain == 2168]["label"].tolist()
#array(['"islet cell tumor"',
#       '"non-functioning pancreatic endocrine tumor"',
#       '"pancreatic ACTH hormone producing tumor"',
#       '"pancreatic somatostatinoma"',
#       '"pancreatic vasoactive intestinal peptide producing tumor"',
#       '"pancreatic gastrinoma"', '"pancreatic delta cell neoplasm"',
#       '"pancreatic endocrine carcinoma"',
#       '"pancreatic non-functioning delta cell tumor"'], dtype=object)

Group 174 is a larger group of cancers, but mainly carcinomas.

nodes[nodes.louvain == 174]["label"]
#array(['"head and neck cancer"', '"glottis carcinoma"',
#       '"head and neck carcinoma"', '"squamous cell carcinoma"',
#...
#       '"pancreatic squamous cell carcinoma"',
#       '"pancreatic adenosquamous carcinoma"',
#...
#       '"mixed epithelial/mesenchymal metaplastic breast carcinoma"',
#       '"breast mucoepidermoid carcinoma"'], dtype=object)

These are sensible groupings based on organ or cancer type that will be useful for visualization. On the other hand, the embeddings are still too high-dimensional to be meaningfully visualized. Fortunately, TSNE is a very useful method for dimension reduction. In this case, we used TSNE to reduce the 128-dimensional embedding to 2, while keeping closely related nodes close together. We can see that this worked by plotting these two dimensions as a scatterplot and coloring them according to Louvain clusters. If these two methods match, we should see the nodes grouped by color.

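import ast

import matplotlib.pyplot as plt
import pandas as pd
from matplotlib import cm
from matplotlib.colors import Normalize
from sklearn.manifold import TSNE

nodes = pd.read_csv("export.csv")
nodes["louvain"] = pd.Categorical(nodes.louvain)
# Each embedding is stored as a string, so parse it back into a list
embedding = nodes.embedding.apply(ast.literal_eval)
embedding = pd.DataFrame(embedding.tolist())
# Reduce the 128-dimensional embeddings to 2 dimensions
tsne = TSNE()
x = tsne.fit_transform(embedding)
fig, axes = plt.subplots()
axes.scatter(
    x[:, 0],
    x[:, 1],
    c=cm.tab20(Normalize()(nodes["louvain"].cat.codes)),
)
plt.show()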

TSNE projection of cancer embeddings colored by cluster, image by the author

This is exactly what we see: similar cancer types are grouped together and appear as single-color clusters. Note that some nodes of the same color are far apart; this is because we have to reuse some colors, as there are 29 clusters and only 20 colors. This gives us a great overview of the structure of our knowledge graph, but we can also add our own data.
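
If the reused colors are a problem, one option is to generate as many colors as there are clusters. The sketch below is an alternative to the `tab20` colormap used above, not what was done for the figure; the function name `distinct_colors` and its default saturation and brightness are arbitrary choices.

```python
import colorsys


def distinct_colors(n, saturation=0.65, value=0.9):
    """Return n RGB tuples (channels in 0..1) with evenly spaced hues."""
    return [
        colorsys.hsv_to_rgb(i / n, saturation, value) for i in range(n)
    ]


colors = distinct_colors(29)  # one color per Louvain cluster

# With the variables from the plotting code above, each point could be
# colored by its cluster code:
#   axes.scatter(
#       x[:, 0], x[:, 1],
#       c=[colors[code] for code in nodes["louvain"].cat.codes],
#   )
```

The trade-off is that 29 evenly spaced hues are all different but neighboring ones can be hard to tell apart by eye, so a legend or labels may still be needed.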

Below, we plot the frequency of each cancer type as node size and the mortality rate as opacity (Bray et al. 2024). I only had access to this data for a few of the cancer types, so I have only plotted those nodes. We can see that liver cancer does not have a particularly high incidence overall. However, the incidence of liver cancer is much higher than that of other cancers within its cluster (shown in purple), such as oropharyngeal, laryngeal, and nasopharyngeal cancer.

Frequency and mortality of cancers, colored by cluster, image by the author
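
Mapping a statistic onto node size or opacity comes down to rescaling it into a range that is readable on a plot. The following is a rough sketch of that step, not the code behind the figure: the helper `rescale`, the output ranges, and the input numbers are all placeholders and are not values from Bray et al. (2024).

```python
def rescale(values, low, high):
    """Linearly map values onto the interval [low, high]."""
    smallest, largest = min(values), max(values)
    if largest == smallest:
        # All values equal: place every point in the middle of the range.
        return [(low + high) / 2 for _ in values]
    span = largest - smallest
    return [low + (v - smallest) / span * (high - low) for v in values]


# Placeholder numbers only, chosen to make the arithmetic easy to follow.
incidence = [1.0, 2.0, 5.0]
mortality = [0.2, 0.4, 1.0]

sizes = rescale(incidence, 20, 300)    # marker areas for scatter's `s`
alphas = rescale(mortality, 0.2, 1.0)  # 0 = transparent, 1 = opaque

# With matplotlib these could be passed to a scatter plot, assuming a
# version that accepts one alpha value per point:
#   axes.scatter(xs, ys, s=sizes, alpha=alphas)
```

Keeping the lower bounds above zero means the cancer type with the smallest value still shows up on the plot.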

Here we have used the disease ontology to group different cancer types into clusters, which gives us the context for comparing these diseases. We hope this small project has shown you how to visually explore an ontology and add that information to your own data.

You can check the complete code of this project at https://github.com/DAWells/do_onto.

Bray, F., Laversanne, M., Sung, H., Ferlay, J., Siegel, R.L., Soerjomataram, I., & Jemal, A. (2024). Global cancer statistics 2022: GLOBOCAN estimates of worldwide incidence and mortality for 36 cancer types in 185 countries. CA: A Cancer Journal for Clinicians, 74(3), 229–263.