As a reminder, the degree-weighted path count (DWPC) measures the prevalence of a metapath between a specific source and target node [1]. It equals the sum of path degree products (PDPs), each of which scores a single path based on the degrees of the nodes along that path.
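To make the definition concrete, here is a minimal Python sketch. It assumes that each edge on a path contributes two degrees (one per endpoint, counted along that edge's type) and that the damping exponent is 0.4, matching the `d ^ -0.4` term in the Cypher queries in this thread. The function names `path_degree_product` and `dwpc` and the toy degrees are hypothetical, chosen only for illustration.

```python
from math import prod


def path_degree_product(degrees, damping=0.4):
    """Score one path: multiply every degree along it, each raised to -damping."""
    return prod(d ** -damping for d in degrees)


def dwpc(degrees_per_path, damping=0.4):
    """Sum the PDPs of all paths for one source-target-metapath combination."""
    return sum(path_degree_product(degrees, damping) for degrees in degrees_per_path)


# Made-up degrees for two paths along a two-edge metapath (four degrees per path).
toy_paths = [
    [1, 32, 32, 1],  # passes through a high-degree node, so it is downweighted
    [1, 1, 1, 1],    # only degree-1 endpoints, so its PDP is 1.0
]
toy_dwpc = dwpc(toy_paths)
```

Because 32 raised to the power -0.4 is 0.25, the first toy path scores 0.0625 and the second scores 1.0, so the toy DWPC is about 1.0625. The higher the degrees along a path, the less that path contributes.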
Traditionally, the DWPC sums the PDPs for all paths connecting the source and target node along a specified metapath. Here I propose a new type of DWPC that sums only the paths that traverse the same intermediate node at a specified position. In other words, traditional DWPCs are defined for a source–target–metapath combination, whereas the proposed DWPCs are defined for a source–target–metapath–position combination and are computed once per node occupying that position. Position refers to an intermediate metanode, although the approach would also work with an intermediate metaedge as the position. Note that choosing either the source or target metanode as the position is equivalent to the traditional DWPC.
The purpose of this approach is to assess how much each intermediate node (or edge) contributes to the DWPC. For a given position, the "partial" DWPCs sum to the traditional DWPC. This approach doesn't replace traditional DWPCs: the two serve different needs and answer different questions.
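The relationship between partial and traditional DWPCs can be written out explicitly. The notation here is an assumption introduced for illustration, not established usage: P is the set of paths between source s and target t along metapath M, k is the chosen position, and p_k is the node that path p visits at position k.

```latex
\mathrm{DWPC}(s, t, M) = \sum_{p \in P} \mathrm{PDP}(p)
\qquad
\mathrm{DWPC}_k(s, t, M, v) = \sum_{\substack{p \in P \\ p_k = v}} \mathrm{PDP}(p)

\sum_{v} \mathrm{DWPC}_k(s, t, M, v)
  = \sum_{v} \sum_{\substack{p \in P \\ p_k = v}} \mathrm{PDP}(p)
  = \sum_{p \in P} \mathrm{PDP}(p)
  = \mathrm{DWPC}(s, t, M)
```

The middle step holds because every path visits exactly one node at position k, so grouping by that node partitions P.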
I'm not satisfied with the traditional versus partial nomenclature. @alizee, any advice?
Prelude: I recently helped @cgreene with a grant proposal titled "Network-based algorithms for drug discovery from genetic associations" (application 1R01HG009516-01A1). For this proposal, we wanted to show an example where considering the tissue-specificity of paths helped identify the mechanisms of drug efficacy. In the course of this analysis, we came up with the partial DWPC method and the following example (the tissue-specific additions are not included below).
Enalapril treats coronary artery disease (CAD) by inhibiting angiotensin-converting enzyme (ACE) [1]. Traditionally, if we were interested in potential pathways contributing to drug efficacy, we might search for CbGpPWpGaD paths (Compound–binds–Gene–participates–Pathway–participates–Gene–associates–Disease) between enalapril and CAD. Below is the Cypher query to return all paths, ranked by PDP (run the query at https://neo4j.het.io):
MATCH path = (n0:Compound)-[:BINDS_CbG]-(n1)-[:PARTICIPATES_GpPW]-
  (n2)-[:PARTICIPATES_GpPW]-(n3)-[:ASSOCIATES_DaG]-(n4:Disease)
USING JOIN ON n2
WHERE n0.name = 'Enalapril'
  AND n4.name = 'coronary artery disease'
  AND n1 <> n3
WITH
  path,
  [
    size((n0)-[:BINDS_CbG]-()),
    size(()-[:BINDS_CbG]-(n1)),
    size((n1)-[:PARTICIPATES_GpPW]-()),
    size(()-[:PARTICIPATES_GpPW]-(n2)),
    size((n2)-[:PARTICIPATES_GpPW]-()),
    size(()-[:PARTICIPATES_GpPW]-(n3)),
    size((n3)-[:ASSOCIATES_DaG]-()),
    size(()-[:ASSOCIATES_DaG]-(n4))
  ] AS degrees
RETURN
  substring(reduce(s = '', node IN nodes(path) | s + '–' + node.name), 1) AS nodes,
  reduce(pdp = 1.0, d IN degrees | pdp * d ^ -0.4) AS PDP
ORDER BY PDP DESC
Overall, 757 paths were returned. The top 3 paths are:
nodes
PDP
Enalapril–ACE–Metabolism of Angiotensinogen to Angiotensins–ACE2–coronary artery disease
Now let's assume we're more interested in the contributions of specific pathway nodes than in specific paths. In other words, we don't really care which genes got us to a pathway; we just want an overall score per pathway. In this case, we can select n2 as the position. Now we're computing a DWPC for Enalapril–binds–Gene–participates–Pathway–participates–Gene–associates–coronary artery disease, where Pathway (n2) is the position. The query becomes:
MATCH path = (n0:Compound)-[:BINDS_CbG]-(n1)-[:PARTICIPATES_GpPW]-
  (n2)-[:PARTICIPATES_GpPW]-(n3)-[:ASSOCIATES_DaG]-(n4:Disease)
USING JOIN ON n2
WHERE n0.name = 'Enalapril'
  AND n4.name = 'coronary artery disease'
  AND n1 <> n3
WITH
  path,
  n2 AS pathway,
  [
    size((n0)-[:BINDS_CbG]-()),
    size(()-[:BINDS_CbG]-(n1)),
    size((n1)-[:PARTICIPATES_GpPW]-()),
    size(()-[:PARTICIPATES_GpPW]-(n2)),
    size((n2)-[:PARTICIPATES_GpPW]-()),
    size(()-[:PARTICIPATES_GpPW]-(n3)),
    size((n3)-[:ASSOCIATES_DaG]-()),
    size(()-[:ASSOCIATES_DaG]-(n4))
  ] AS degrees
RETURN
  pathway.identifier AS pathway_id,
  pathway.name AS pathway_name,
  count(*) AS PC,
  sum(reduce(pdp = 1.0, d IN degrees | pdp * d ^ -0.4)) AS DWPC
ORDER BY DWPC DESC, pathway_name
40 pathways are returned, of which the top 5 are displayed below:
| pathway_id | pathway_name | PC | DWPC |
| --- | --- | --- | --- |
| WP554_r84372 | ACE Inhibitor Pathway | 11 | 0.0015 |
| PC7_8339 | Transmembrane transport of small molecules | 150 | 0.0008 |
| PC7_5323 | Metabolism of Angiotensinogen to Angiotensins | 3 | 0.0005 |
| PC7_7290 | SLC-mediated transmembrane transport | 40 | 0.0004 |
| PC7_5322 | Metabolism | 309 | 0.0004 |
As shown, we now have a ranking of pathways based on their contribution to the overall CbGpPWpGaD metapath. Currently, I don't see a huge role for this approach for feature extraction, but think it's useful for following up on specific predictions and highlighting mechanisms of drug efficacy.
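For readers who want to see the aggregation outside Cypher, here is a rough Python sketch of the same grouping. The paths and degrees are made-up toy values, not Hetionet data, and `partial_dwpcs` is a hypothetical helper name. Each path is a tuple of node names paired with its list of degrees, and `position` is the index of the intermediate node to group by (2 for the pathway in CbGpPWpGaD).

```python
from collections import defaultdict
from math import prod


def partial_dwpcs(paths, position, damping=0.4):
    """Return {node at `position`: (path count, partial DWPC)}."""
    counts = defaultdict(int)
    scores = defaultdict(float)
    for nodes, degrees in paths:
        key = nodes[position]
        counts[key] += 1
        scores[key] += prod(d ** -damping for d in degrees)
    return {key: (counts[key], scores[key]) for key in scores}


# Toy CbGpPWpGaD-shaped paths: (compound, gene, pathway, gene, disease) with
# eight made-up degrees each (two per edge). These are not real Hetionet values.
toy_paths = [
    (("drug", "G1", "PW_a", "G2", "disease"), [2, 1, 3, 4, 4, 2, 5, 6]),
    (("drug", "G1", "PW_a", "G3", "disease"), [2, 1, 3, 4, 4, 1, 2, 6]),
    (("drug", "G4", "PW_b", "G2", "disease"), [2, 3, 1, 9, 9, 2, 5, 6]),
]

by_pathway = partial_dwpcs(toy_paths, position=2)

# The traditional DWPC is the sum over all paths, regardless of pathway.
total = sum(prod(d ** -0.4 for d in degrees) for _, degrees in toy_paths)

# Rank pathways by partial DWPC, highest first (mirrors ORDER BY DWPC DESC).
ranked = sorted(by_pathway.items(), key=lambda item: item[1][1], reverse=True)
```

In this toy data `PW_a` collects two paths and `PW_b` one, and the partial DWPCs add back up to `total`, which is the property that lets each pathway's score be read as a share of the traditional DWPC.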
Pouya Khankhanian: Agree with "I think it's useful for following up on specific predictions and highlighting mechanisms of drug efficacy". Especially if the function to display this result is embedded in a button on the neo4j interface.
I'd love to see the weight given to various nodes in the top predictions for epilepsy, especially the ones in the top 100 which were not classified as AEDs.
The previous comment discussed grouping paths by an intermediate node and then calculating partial DWPCs. This comment introduces an alternative grouping method: grouping either by the source edge (first edge in the path) or target edge (last edge in the path).
Here's the intuition behind this approach. In a hetnet, a node derives its meaning from its relationships. For example, our algorithm is based solely on relationships. Therefore, a good way to investigate a prediction is to consider which edges of either the source compound or target disease mattered. We can do this for a specific source–target–metapath combination by grouping paths by their source or target edge.
For example, the following query takes the enalapril–CAD example and asks which target edges are composing the CbGpPWpGaD paths.
MATCH path = (n0:Compound)-[:BINDS_CbG]-(n1)-[:PARTICIPATES_GpPW]-
  (n2)-[:PARTICIPATES_GpPW]-(n3)-[:ASSOCIATES_DaG]-(n4:Disease)
USING JOIN ON n2
WHERE n0.name = 'Enalapril'
  AND n4.name = 'coronary artery disease'
  AND n1 <> n3
WITH
  path,
  [
    size((n0)-[:BINDS_CbG]-()),
    size(()-[:BINDS_CbG]-(n1)),
    size((n1)-[:PARTICIPATES_GpPW]-()),
    size(()-[:PARTICIPATES_GpPW]-(n2)),
    size((n2)-[:PARTICIPATES_GpPW]-()),
    size(()-[:PARTICIPATES_GpPW]-(n3)),
    size((n3)-[:ASSOCIATES_DaG]-()),
    size(()-[:ASSOCIATES_DaG]-(n4))
  ] AS degrees, n3, n4
RETURN
  n4.name AS target_name,
  type(relationships(path)[3]) AS target_edge_type,
  n3.name AS n3_name,
  sum(reduce(pdp = 1.0, d IN degrees | pdp * d ^ -0.4)) AS DWPC
ORDER BY DWPC DESC
The top five results are:
| target_name | target_edge_type | n3_name | DWPC |
| --- | --- | --- | --- |
| coronary artery disease | ASSOCIATES_DaG | SLC22A3 | 0.00072 |
| coronary artery disease | ASSOCIATES_DaG | ACE2 | 0.00058 |
| coronary artery disease | ASSOCIATES_DaG | REN | 0.00044 |
| coronary artery disease | ASSOCIATES_DaG | SLC6A6 | 0.00038 |
| coronary artery disease | ASSOCIATES_DaG | NR3C2 | 0.00025 |
These are the top ranking CAD-associated genes that participate in pathways with enalapril targets. As shown by the DWPC column, several of the top target edges are contributing to a similar extent. There is no one CAD-associated gene that is responsible for the bulk of the CbGpPWpGaD DWPC.
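The edge-based grouping can be sketched in Python as well. This is an illustration under stated assumptions rather than the implementation behind these results: the paths and degrees are invented toy values, `dwpc_by_edge` is a hypothetical helper, and an edge is identified by its pair of endpoint names (which is enough here because the metapath fixes the edge type).

```python
from collections import defaultdict
from math import prod


def dwpc_by_edge(paths, end="source", damping=0.4):
    """Sum PDPs per first edge (end='source') or last edge (end='target')."""
    if end not in ("source", "target"):
        raise ValueError("end must be 'source' or 'target'")
    totals = defaultdict(float)
    for nodes, degrees in paths:
        # The first two nodes span the source edge; the last two span the target edge.
        edge = (nodes[0], nodes[1]) if end == "source" else (nodes[-2], nodes[-1])
        totals[edge] += prod(d ** -damping for d in degrees)
    return dict(totals)


# Toy paths: (compound, gene, pathway, gene, disease) plus eight made-up degrees.
toy_paths = [
    (("drug", "G1", "PW_a", "G2", "disease"), [2, 1, 3, 4, 4, 2, 5, 6]),
    (("drug", "G1", "PW_a", "G3", "disease"), [2, 1, 3, 4, 4, 1, 2, 6]),
    (("drug", "G4", "PW_b", "G2", "disease"), [2, 3, 1, 9, 9, 2, 5, 6]),
]

by_source_edge = dwpc_by_edge(toy_paths, end="source")
by_target_edge = dwpc_by_edge(toy_paths, end="target")

# Fraction of the total DWPC carried by each target edge.
total = sum(by_target_edge.values())
target_shares = {edge: score / total for edge, score in by_target_edge.items()}
```

Dividing each group's score by the total, as in `target_shares`, gives a quick way to tell whether the contributions are spread evenly or dominated by one relationship.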
In instances where a single edge accounts for the bulk of the total DWPC, you know that a single relationship is driving the score. For example, we can rewrite the above query to analyze the source edge:
MATCH path = (n0:Compound)-[:BINDS_CbG]-(n1)-[:PARTICIPATES_GpPW]-
  (n2)-[:PARTICIPATES_GpPW]-(n3)-[:ASSOCIATES_DaG]-(n4:Disease)
USING JOIN ON n2
WHERE n0.name = 'Enalapril'
  AND n4.name = 'coronary artery disease'
  AND n1 <> n3
WITH
  path,
  [
    size((n0)-[:BINDS_CbG]-()),
    size(()-[:BINDS_CbG]-(n1)),
    size((n1)-[:PARTICIPATES_GpPW]-()),
    size(()-[:PARTICIPATES_GpPW]-(n2)),
    size((n2)-[:PARTICIPATES_GpPW]-()),
    size(()-[:PARTICIPATES_GpPW]-(n3)),
    size((n3)-[:ASSOCIATES_DaG]-()),
    size(()-[:ASSOCIATES_DaG]-(n4))
  ] AS degrees, n0, n1
RETURN
  n0.name AS source_name,
  type(head(relationships(path))) AS source_edge_type,
  n1.name AS n1_name,
  sum(reduce(pdp = 1.0, d IN degrees | pdp * d ^ -0.4)) AS DWPC
ORDER BY DWPC DESC
| source_name | source_edge_type | n1_name | DWPC |
| --- | --- | --- | --- |
| Enalapril | BINDS_CbG | ACE | 0.00273 |
| Enalapril | BINDS_CbG | SLCO1A2 | 0.00081 |
| Enalapril | BINDS_CbG | ABCB1 | 0.00081 |
| Enalapril | BINDS_CbG | SLC22A7 | 0.00068 |
These results show that enalapril's binding to ACE is driving the CbGpPWpGaD DWPC. In other words, if enalapril did not bind ACE, the CbGpPWpGaD DWPC would be ~40% lower (the total CbGpPWpGaD DWPC between enalapril and CAD is 0.00677).
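The ~40% figure follows from the ACE source-edge DWPC and the total DWPC quoted above. It rests on the simplifying assumption that removing the enalapril–ACE edge would leave every other path's PDP unchanged. In practice enalapril's binds degree would drop by one, which would slightly raise the remaining PDPs.

```latex
\frac{\mathrm{DWPC}_{\text{ACE edge}}}{\mathrm{DWPC}_{\text{total}}}
  = \frac{0.00273}{0.00677} \approx 0.40
```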
Nicolas Danchin, Michel Cucherat, Christian Thuillez, Eric Durand, Zena Kadri, Philippe G. Steg (2006) Archives of Internal Medicine. doi:10.1001/archinte.166.7.787