Introduction
-
A novel definition of the health-records-based referral path, together with novel salient features of referral paths drawn from both network science and time series analysis.
-
Quantification of a physician’s position in the U.S. national cardiovascular referral network using centrality and other measures, computed with big-data techniques because traditional algorithms are infeasible at this scale.
-
Investigation of the patterns of millions of referral paths in the referral network, which are validated by statistical tests.
-
Effective classification and regression models derived from novel referral path features and referral networks that distinguish (a) teaching status of a hospital and (b) patient treatment outcomes. These models identify key predictors among network measures that are relevant to optimizing the healthcare system.
Materials, notation, and methodology
Materials
Referral path
Referral network and computation of edge weights
(a) Raw visiting records | ||
Patient | Physician | date;HRR;HRRcity;state;zipcode;workRVU;specialty;PHN;teaching type; etc. |
α | A | 2011-01-01;1010;Hanover;NH;03755;1.0;family practice;First hospital;0;etc. |
α | B | 2011-01-10;1020;Boston;MA;02101;3.0;internal medicine;Second hospital;1; etc. |
α | C | 2011-02-01;1050;New York;NY;10021;4.0;cardiology;Third hospital;1;etc. |
β | B | 2011-03-01;1012;Lebanon;NH;03784;2.0;family practice;Fourth hospital;0;etc. |
β | C | 2011-03-20;1022;Newton;MA;02461;5.0;vascular surgery;Fifth hospital;1; etc. |
(b) Referral path | ||
Patient | Node(date;#visiting records;RVU), divided by "->" | |
α | A(2011-01-01;1.0;1.0)->B(2011-01-10;1.0;3.0)->C(2011-02-01;1.0;4.0) | |
β | B(2011-03-01;1.0;2.0)->C(2011-03-20;1.0;5.0) | |
(c) Edges in the national referral network with the weights over all referral paths | ||
Directed edge | Weights of an edge | |
A->B | 3; 4; 4.82; 12.14; 23.42 | |
B->C | 5; 5; 5.12; 12.32; 18.22 |
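As a rough illustration of how panel (a) leads to panel (b), the following Python sketch groups the visiting records by patient, orders them by date, and chains the physicians into a path. The rule that consecutive visits to the same physician are merged into one node, and the plain referral count per directed edge, are assumptions made for this sketch. The five edge weights in panel (c) are aggregated over all national referral paths and are not reproduced here.

```python
from collections import Counter, defaultdict
from datetime import date

# Records from panel (a): (patient, physician, visit date, work RVU).
records = [
    ("alpha", "A", date(2011, 1, 1), 1.0),
    ("alpha", "B", date(2011, 1, 10), 3.0),
    ("alpha", "C", date(2011, 2, 1), 4.0),
    ("beta", "B", date(2011, 3, 1), 2.0),
    ("beta", "C", date(2011, 3, 20), 5.0),
]


def build_paths(records):
    """Group visits by patient and order them by date.

    Assumption: consecutive visits to the same physician are merged
    into a single node whose record count and RVU are summed.
    """
    by_patient = defaultdict(list)
    for patient, physician, day, rvu in records:
        by_patient[patient].append((day, physician, rvu))

    paths = {}
    for patient, visits in by_patient.items():
        nodes = []
        for day, physician, rvu in sorted(visits):
            if nodes and nodes[-1]["physician"] == physician:
                nodes[-1]["n_records"] += 1
                nodes[-1]["rvu"] += rvu
            else:
                nodes.append(
                    {"physician": physician, "date": day, "n_records": 1, "rvu": rvu}
                )
        paths[patient] = nodes
    return paths


def count_referrals(paths):
    """Count how often each directed physician pair is adjacent on a path."""
    edge_counts = Counter()
    for nodes in paths.values():
        for src, dst in zip(nodes, nodes[1:]):
            edge_counts[(src["physician"], dst["physician"])] += 1
    return edge_counts


paths = build_paths(records)
edges = count_referrals(paths)
for patient, nodes in paths.items():
    print(patient, "->".join(node["physician"] for node in nodes))
print(dict(edges))
```

On these five records the B->C edge is counted once for each of the two patients, which is the kind of per-edge aggregation that the weights in panel (c) summarize at national scale.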
Referral path features
-
Path length. The total number of physicians on a referral path. A physician could be counted multiple times if the patient visits the physician again. It is 5 in Fig. 2.
-
Average time gap between referrals on the referral path: \(\frac {T_{N}-T_{1}}{N-1}\).
-
Time range. \(T_{N}-T_{1}\), the gap between the first visit and the last.
-
Recurrence. A binary variable recording whether there exist \(i,j\) with \(1\leqslant i<j\leqslant N\) and \(P_{i}=P_{j}\). It is true (set to “1”) in Fig. 2 because of multiple occurrences of physicians A and B.
-
Number of nodes before recurrence. This is defined as \(\min \{j\}-1\), where \((i,j)\) satisfy the above recurrence condition. It refers to the first reappearance of a node. In our example, it is 3 since the first three nodes A, B, C are different from each other before the first duplicate node, B.
-
Physician distribution entropy. This is the standard probabilistic definition of entropy \(\left (-{\sum \nolimits }_{x}p(x)\log _{2}p(x)\right)\), computed here from the physician occurrence probabilities over the path. In Fig. 2, the frequencies of A, B, C are 2, 2, 1 respectively, so the probability distribution is (0.4, 0.4, 0.2) and the physician distribution entropy is 1.522.
-
Hospital distribution entropy. The entropy of the derived physicians’ hospital distribution is another feature of diversity. Since we assume A and C are from the same hospital, the frequency distribution is (3,2) and the corresponding entropy is 0.971.
-
HRR distribution entropy. The entropy of the physicians’ HRR distribution is another feature of diversity. It has the same value as the hospital (PHN) distribution entropy, 0.971, under the assumption that A and C are in the same HRR.
-
Main hospital. It is a derived referral path feature of the hospital in which the most physicians on the referral path are working. It is the hospital with A and C in Fig. 2.
-
Main or dominant HRR. The HRR in which the most physicians are working. It is the HRR with A and C in Fig. 2.
-
Number of pairs of nodes with reciprocal referrals on a referral path: \({\sum \nolimits }_{i,j} 1\left (1\leqslant i <j \leqslant N-1, P_{i}=P_{j+1}, P_{i+1} =P_{j}\right)\), where \(N\) is the path length. There are two pairs of nodes, (A,B) and (B,C), which have such reciprocal relations.
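The features above can be checked on a small example. The sketch below assumes the path A, B, C, B, A, which is consistent with every value quoted for Fig. 2 (length 5, frequencies 2, 2, 1, three nodes before recurrence, two reciprocal pairs), although the figure itself is not reproduced here. The visit dates and the hospital labels `H1` and `H2` are hypothetical.

```python
from collections import Counter
from datetime import date
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of the empirical distribution of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())


def nodes_before_recurrence(path):
    """Return min{j} - 1 for the first repeated physician, or None if none."""
    seen = set()
    for index, physician in enumerate(path):  # index is 0-based
        if physician in seen:
            return index  # 0-based index equals the 1-based j minus 1
        seen.add(physician)
    return None


def reciprocal_pairs(path):
    """Count index pairs i < j where the edge at j reverses the edge at i."""
    n = len(path)
    return sum(
        1
        for i in range(n - 1)
        for j in range(i + 1, n - 1)
        if path[i] == path[j + 1] and path[i + 1] == path[j]
    )


# Assumed path, consistent with the values quoted for Fig. 2.
path = ["A", "B", "C", "B", "A"]
# Hypothetical visit dates, one per node.
visit_dates = [
    date(2011, 1, 1),
    date(2011, 1, 11),
    date(2011, 1, 21),
    date(2011, 1, 31),
    date(2011, 2, 10),
]
# Hypothetical hospital labels; A and C share a hospital as in the text.
hospital_of = {"A": "H1", "B": "H2", "C": "H1"}

features = {
    "length": len(path),
    "time_range_days": (visit_dates[-1] - visit_dates[0]).days,
    "avg_time_gap_days": (visit_dates[-1] - visit_dates[0]).days / (len(path) - 1),
    "recurrence": int(len(set(path)) < len(path)),
    "nodes_before_recurrence": nodes_before_recurrence(path),
    "physician_entropy": round(entropy(path), 3),
    "hospital_entropy": round(entropy([hospital_of[p] for p in path]), 3),
    "reciprocal_pairs": reciprocal_pairs(path),
}
print(features)
```

The two entropies come out as 1.522 and 0.971, matching the values stated in the list.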
Node position features
-
Number of paths that contain the node.
-
Average index of the first-time occurrence in all paths. In Fig. 2, the index of first-time occurrence for nodes A,B,C is 1,2,3, respectively, so we can take the average over all referral paths.
-
Number of cross-HRR referrals proposed by the node. In Fig. 2, given the assumption that nodes A and C are from the same HRR, node A sends patients to node B in another HRR. Nodes B and C also form an edge that spans HRRs.
-
Number of cross-hospital referrals proposed by the node. In Fig. 2, given the assumption that nodes A and C are from the same PHN, node A sends patients to node B in another hospital. The same is true of nodes B and C.
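A minimal sketch of how such node position features might be accumulated over a collection of paths is given below. The two toy paths and the HRR labels `R1` and `R2` are hypothetical, and counting one cross-HRR referral per adjacent pair on a path (rather than per distinct pair) is an assumption.

```python
from collections import defaultdict


def node_position_features(paths, hrr_of):
    """Accumulate per-physician position features over all paths."""
    n_paths = defaultdict(int)
    first_index_sum = defaultdict(int)
    cross_hrr = defaultdict(int)

    for path in paths:
        # 1-based index of the first occurrence of each physician on this path.
        first_seen = {}
        for index, physician in enumerate(path, start=1):
            first_seen.setdefault(physician, index)
        for physician, index in first_seen.items():
            n_paths[physician] += 1
            first_index_sum[physician] += index
        # A referral is credited to its source node when it crosses HRRs.
        for src, dst in zip(path, path[1:]):
            if hrr_of[src] != hrr_of[dst]:
                cross_hrr[src] += 1

    return {
        physician: {
            "n_paths": n_paths[physician],
            "avg_first_index": first_index_sum[physician] / n_paths[physician],
            "cross_hrr_referrals": cross_hrr[physician],
        }
        for physician in n_paths
    }


# Hypothetical data: A and C share an HRR, B is in another one.
toy_paths = [["A", "B", "C", "B", "A"], ["B", "C"]]
toy_hrr = {"A": "R1", "B": "R2", "C": "R1"}
node_features = node_position_features(toy_paths, toy_hrr)
for physician, values in sorted(node_features.items()):
    print(physician, values)
```

The cross-hospital count follows the same pattern with a hospital lookup in place of `hrr_of`.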
Results
National, HRR and PHN network measures
Year | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 |
---|---|---|---|---|---|---|
# nodes | 272353 | 296008 | 313051 | 323042 | 334452 | 347586 |
# edges | 5708791 | 5948185 | 6313136 | 6544847 | 6785594 | 7047586 |
Exponent of indegree power law | 3.08 | 2.80 | 1.55 | 2.76 | 1.54 | 2.74 |
p-value of indegree power law test | 0.97 | 0.89 | 0.21 | 0.85 | 0.22 | 0.82 |
Exponent of outdegree power law | 3.01 | 2.69 | 2.71 | 2.66 | 2.56 | 2.68 |
p-value of outdegree power law test | 0.9 | 0.94 | 0.93 | 0.96 | 0.91 | 0.93 |
Size of the largest connected component | 271898 | 295405 | 312412 | 322452 | 333727 | 346711 |
(in, in) degree assortativity | -0.094 | -0.088 | -0.084 | -0.085 | -0.083 | -0.084 |
Self in/out degree correlation | 0.983 | 0.982 | 0.983 | 0.983 | 0.983 | 0.984 |
Reciprocity of #referral | 0.878 | 0.890 | 0.896 | 0.901 | 0.902 | 0.896 |
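Two of the measures in this table can be illustrated on a toy directed graph. The sketch below uses the common unweighted definitions: reciprocity as the fraction of directed edges whose reverse edge also exists, and self in/out degree correlation as the Pearson correlation between each node's indegree and outdegree. Whether the reported values use exactly these definitions, for example weighting by the number of referrals, is an assumption, and the toy edge list is hypothetical.

```python
from math import sqrt


def reciprocity(edges):
    """Fraction of distinct directed edges (u, v) for which (v, u) also exists."""
    edge_set = set(edges)
    mutual = sum(1 for (u, v) in edge_set if (v, u) in edge_set)
    return mutual / len(edge_set)


def in_out_degree_correlation(edges):
    """Pearson correlation between indegree and outdegree across nodes."""
    edge_set = set(edges)
    nodes = sorted({node for edge in edge_set for node in edge})
    indegree = {node: 0 for node in nodes}
    outdegree = {node: 0 for node in nodes}
    for u, v in edge_set:
        outdegree[u] += 1
        indegree[v] += 1

    xs = [indegree[node] for node in nodes]
    ys = [outdegree[node] for node in nodes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy graph with two mutual pairs and one one-way edge.
toy_edges = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "B"), ("A", "C")]
toy_reciprocity = reciprocity(toy_edges)
toy_correlation = in_out_degree_correlation(toy_edges)
print(toy_reciprocity, toy_correlation)
```

At the scale of the national network, with several hundred thousand nodes and millions of edges, the same quantities would be computed with a graph library or a distributed framework rather than plain Python loops.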
Referral path features
Year | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 |
---|---|---|---|---|---|---|
#referral paths | 4.44M | 4.45M | 4.54M | 4.59M | 4.63M | 4.66M |
Avg length | 3.850 | 3.907 | 3.983 | 4.023 | 4.061 | 4.115 |
Avg gap for a referral | 8.509 | 8.506 | 8.369 | 8.352 | 8.230 | 8.060 |
Avg time range | 24.247 | 24.727 | 24.969 | 25.245 | 25.192 | 25.109 |
Percent of paths with recurrent nodes | 33.418 | 32.879 | 32.836 | 32.784 | 32.573 | 32.301 |
Avg #nodes before recurrence | 4.087 | 4.130 | 4.179 | 4.196 | 4.223 | 4.271 |
Avg physician entropy | 1.400 | 1.410 | 1.423 | 1.427 | 1.436 | 1.448 |
Avg hospital entropy | 0.475 | 0.473 | 0.476 | 0.459 | 0.480 | 0.481 |
Avg HRR entropy | 0.107 | 0.109 | 0.108 | 0.105 | 0.112 | 0.116 |
Avg bidirectional pairs in a path | 0.450 | 0.455 | 0.465 | 0.474 | 0.476 | 0.479 |
Patterns of referral paths
Year | 2007 | 2008 | 2009 | 2010 | 2011 |
---|---|---|---|---|---|
Clustering coefficient | 75.0 | 74.9 | 74.9 | 74.8 | 74.7 |
Betweenness centrality | 74.9 | 74.7 | 74.8 | 74.7 | 74.5 |
Eigenvector centrality | 74.3 | 74.2 | 74.2 | 74.1 | 74.0 |
PageRank centrality | 74.8 | 74.6 | 74.7 | 74.6 | 74.5 |
h-index | 70.7 | 70.6 | 70.8 | 70.8 | 70.8 |
(a) Top 5 specialties. |
Cardiovascular disease |
Internal medicine |
Family practice |
Interventional cardiology |
Pulmonary disease |
(b) Top 5 cross-specialty referrals. |
Internal medicine → cardiovascular disease |
Cardiovascular disease → internal medicine |
Family practice → cardiovascular disease |
Cardiovascular disease → family practice |
Internal medicine → family practice |
Node position measure | Node feature about referral path | Correlation coefficient |
---|---|---|
Betweenness centrality | #paths with the node | 0.607 |
PageRank centrality | #paths with the node | 0.852 |
PageRank centrality | #paths with multiple occurrences | 0.740 |
h-index | #paths with the node | 0.783 |
h-index | #cross-PHN referral proposed by the physician | 0.640 |
Year | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 |
---|---|---|---|---|---|---|
Random network | 3.60E−03 | 3.10E−03 | 3.00E−03 | 2.90E−03 | 2.80E−03 | 2.80E−03 |
Referral network | 25.13 | 24.64 | 24.95 | 24.97 | 24.95 | 24.96 |
Three illustrative analyses
Teaching status classification
Feature Group | Features |
---|---|
PHN level network measures | #nodes, #edges, gini coefficient of indegree distribution, gini coefficient of outdegree distribution, alpha of indegree power law test, alpha of outdegree power law test, diameter, global clustering coefficient, local clustering coefficient, (in, in) assortativity, self degree correlation, reciprocity of # referral, reciprocity of RVUs |
Difference (in - out) of edge weights on PHN traffic map | Degree, #different referred patients, #referral, geometric mean of #visit, geometric mean of RVUs, ranking index based weight |
PHN position on PHN traffic map | Local clustering coefficient, PageRank, h-index |
average feature of referral paths in the PHN | Length, avg-time-gap, avg-time-range, recurrent node, # nodes before recurrence, phy-entropy, PHN-entropy, HRR-entropy, common connected nodes between neighbors, bidirectional pairs |
Average node position of the PHN in the national referral network | Local clustering coefficient, PageRank, h-index |
LR | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | average F-score |
---|---|---|---|---|---|---|---|
Recall | 0.844 | 0.902 | 0.805 | 0.882 | 0.830 | 0.792 | |
Precision | 0.704 | 0.712 | 0.733 | 0.667 | 0.780 | 0.792 | |
F-score | 0.768 | 0.796 | 0.767 | 0.759 | 0.804 | 0.792 | 0.781 |
SVM | |||||||
Recall | 0.791 | 0.762 | 0.717 | 0.774 | 0.825 | 0.914 | |
Precision | 0.756 | 0.780 | 0.805 | 0.750 | 0.750 | 0.762 | |
F-score | 0.773 | 0.771 | 0.759 | 0.762 | 0.786 | 0.831 | 0.780 |
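For reference, the F-score rows are consistent with the harmonic mean of precision and recall. The short sketch below recomputes the LR rows from the rounded values in the table; because the inputs are already rounded to three decimals, agreement is expected only to within rounding.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    return 2 * precision * recall / (precision + recall)


# LR rows of the table, 2006 to 2011.
lr_recall = [0.844, 0.902, 0.805, 0.882, 0.830, 0.792]
lr_precision = [0.704, 0.712, 0.733, 0.667, 0.780, 0.792]
lr_reported_f = [0.768, 0.796, 0.767, 0.759, 0.804, 0.792]

lr_f = [f_score(p, r) for p, r in zip(lr_precision, lr_recall)]
lr_average_f = sum(lr_f) / len(lr_f)

for year, computed, reported in zip(range(2006, 2012), lr_f, lr_reported_f):
    print(year, round(computed, 3), reported)
print("average", round(lr_average_f, 3))
```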
Feature Name | Estimated Coefficient | 95% Confidence Interval | P-value |
---|---|---|---|
Gini coefficient of degree distribution in PHN network | 2.823 | (0.844, 4.802) | 5.18E-03 |
Global clustering coefficient of PHN network | -10.693 | (-13.218, -8.167) | < 2E-16 |
(in, in) degree assortativity | 4.813 | (2.981, 6.646) | 2.63E-07 |
Difference (in-out) of # referrals on PHN traffic map | 3.678 | (1.630, 5.726) | 4.32E-04 |
h-index of a hospital on the PHN traffic map | 7.862 | (5.877, 9.847) | 8.37E-15 |
Avg time range of a referral path | 5.138 | (2.157, 8.119) | 7.29E-04 |
Ratio of referral paths with recurrent nodes | -12.950 | (-16.614, -9.286) | 4.29E-12 |
Avg #nodes before recurrent nodes | 6.139 | (3.844, 8.434) | 1.58E-07 |
Avg #bidirected pairs on referral paths in the PHN | 6.459 | (2.407, 10.512) | 1.78E-03 |
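One way to read a row of this coefficient table is sketched below, assuming the intervals are symmetric Wald intervals from a logistic regression (an assumption; the text does not state how they were obtained). Under that assumption the standard error, the z statistic, and the two-sided p-value can be recovered from the estimate and its interval, and the exponentiated estimate is the odds ratio per unit of the feature, on whatever scale the feature was entered.

```python
from math import erfc, exp, sqrt

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal


def wald_summary(estimate, ci_low, ci_high):
    """Recover SE, z, and two-sided p-value from a symmetric 95% Wald interval."""
    standard_error = (ci_high - ci_low) / (2 * Z_95)
    z = estimate / standard_error
    p_value = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return {
        "standard_error": standard_error,
        "z": z,
        "p_value": p_value,
        "odds_ratio": exp(estimate),
    }


# Row "Gini coefficient of degree distribution in PHN network".
gini_row = wald_summary(2.823, 0.844, 4.802)
# Row "(in, in) degree assortativity".
assortativity_row = wald_summary(4.813, 2.981, 6.646)
print(gini_row)
print(assortativity_row)
```

For these two rows the recovered p-values are close to the reported 5.18E-03 and 2.63E-07, which supports the Wald reading of the intervals.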
Patient clinical outcome and treatment received classification
Group of Features | Features and ID |
---|---|
Network measures in the dominant HRR | 1:#nodes, 2:#edges, 3:indegree gini coefficient, 4:outdegree gini coefficient, 5:indegree power law test alpha, 6:outdegree power law test alpha, 7: diameter, 8:global clustering coefficient, 9:local clustering coefficient, 10: (in, in) assortativity, 11:self in/out degree coefficient, 12:referral reciprocity, 13:RVU reciprocity |
Referral path sequence | 14:#nodes, 15:average time gap, 16: time range, 17:indicator of recurrence, 18: #nodes before recurrence, 19:physician distribution entropy, 20: PHN distribution entropy, 21:HRR distribution entropy, 22:average #common connected nodes between neighbors, 23:#pairs of nodes with reciprocal referrals, 37:#change points, 38:#previous referral path in the same year, 39:distance between the first visited hospital and the end one, 40:total RVU, 41:month of the first visit, 42:#visited teaching hospitals, 43:specialty of the key physician, 44:specialty of the last physician, 45:#visited PHN with negative (in-out) degree on PHN traffic map, 46:#visited PHN with positive (in-out) degree on PHN traffic map, 47:sum of (in-out) degree for all PHN on the referral path, 60:indicator of admitted by emergency department for the first node |
Average node positions on the referral path | 24:local clustering coefficient, 25:PageRank, 26:h-index, 27:#paths which contains the node, 28:#paths where the node is the starting one, 29:#paths where the node is the end one, 30:index of the first-time occurrence, 31:#paths where the node occurs multiple times, 32:#cross-HRR referrals proposed by the node, 33:#cross-PHN referrals proposed by the node |
Average weights of edges in the national referral network covered by the referral path | 34:#referrals, 35:RVU, 36:ranking based weight |
Last physician on the referral path | 48:RVU, 49:month of visit, 50:local clustering coefficient, 51:PageRank, 52:h-index, 53:#paths which contains the node, 54:#paths where the node is the starting one, 55:#paths where the node is the end one, 56:average index of the first-time occurrence, 57:#paths where the node occurs multiple times, 58:#cross-HRR referrals proposed by the node, 59:#cross-PHN referrals proposed by the node |
Patient history information | 61:age, 62:indicator of HIV, 63:indicator of asthmatic lung disease, 64:indicator of cancer, 65:indicator of dementia, 66:indicator of diabetes, 67:indicator of liver disease, 68:indicator of chronic non-asthmatic lung disease, 69:indicator of chronic renal disease |
-
Feature engineering. Encode categorical attributes, such as the specialty of the key physician and the month of the admission date. Features are extracted from both the referral path that exactly matches the AMI record and the immediately preceding referral path within the 90-day period before it, in order to capture the association between referral path features and subsequent treatment outcomes.
-
10-fold cross-validation. Partition the original sample into 10 folds; each fold serves as the test set in turn while the remaining folds form the training set.
-
Undersampling. Undersample the majority class in the training set to balance the ratio of positive to negative cases.
-
Feature selection. Apply Random Forest (RF) to rank features by their importance (Genuer et al. 2010), and select a subset of important features for the classification models. Here the importance of a given feature is the increase in the mean error of a tree in the forest when the observed values of this feature are randomly permuted.
-
Voting for the final label. Collect the prediction of each classification model and vote for the final prediction of each test case.
-
XGBoost (Chen and Guestrin 2016). Upgrade the gradient boosting model from GBDT to XGBoost, which strengthens the regularization of trees to control overfitting.
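The resampling and voting steps above can be sketched with the Python standard library alone. This is not the authors' implementation: the 1:1 class ratio after undersampling, the fixed random seed, and the simple majority vote are assumptions, and the classifiers themselves (Random Forest, XGBoost, and so on) are left out.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each fold is the test set once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def undersample(train_indices, labels, seed=0):
    """Randomly drop majority-class cases until both classes are equal in size."""
    rng = random.Random(seed)
    positives = [i for i in train_indices if labels[i] == 1]
    negatives = [i for i in train_indices if labels[i] == 0]
    minority, majority = sorted((positives, negatives), key=len)
    return minority + rng.sample(majority, len(minority))


def majority_vote(predictions):
    """Return the most common label among the models' predictions."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical imbalanced labels: 20 positive and 80 negative cases.
labels = [1] * 20 + [0] * 80
for fold_id, (train, test) in enumerate(k_fold_indices(len(labels))):
    balanced_train = undersample(train, labels)
    n_positive = sum(labels[i] for i in balanced_train)
    print(fold_id, len(balanced_train), n_positive, len(test))

print(majority_vote([1, 0, 1]))
```

Only the training indices are undersampled, so each test fold keeps the original class ratio. With an even number of models a tie is possible, and `majority_vote` then returns whichever tied label appears first in the list.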
PCI | 2007 | 2008 | 2009 | 2010 | 2011 | Average F-score |
---|---|---|---|---|---|---|
Recall | 0.703 | 0.700 | 0.702 | 0.695 | 0.694 | |
Precision | 0.572 | 0.574 | 0.585 | 0.597 | 0.607 | |
F-score | 0.631 | 0.630 | 0.638 | 0.642 | 0.647 | 0.638 |
death1yr | ||||||
Recall | 0.702 | 0.698 | 0.710 | 0.704 | 0.682 | |
Precision | 0.640 | 0.632 | 0.639 | 0.650 | 0.633 | |
F-score | 0.669 | 0.663 | 0.672 | 0.675 | 0.657 | 0.667 |
Age group | Death1yr | PCI |
---|---|---|
Age<=75 | 0.592 | 0.695 |
Age>75 | 0.687 | 0.565 |
Rank | Death1yr | PCI |
---|---|---|
1 | Total RVU of the referral path | Average time gap on the referral path |
2 | Total RVU of the previous referral path | Indicator of patient’s age in 66-70 |
3 | Average time gap on the referral path | Average PageRank values of all physicians on the referral path |
4 | Time range of the referral path | Indicator of the key physician’s specialty on the referral path as “interventional cardiology” |
5 | Average index of the first-time occurrence on a referral path for the last physician | Indicator of patient’s age in 76+ |
6 | Local clustering coefficient of the last physician on the referral path | The number of referral paths that include the last physician |
7 | Times of being the end node on a referral path of the last physician on the referral path | Indicator of the key physician’s specialty on the referral path as “interventional cardiology” |
8 | Times of being the first node on a referral path for the last physician | Average #involved paths among physicians on the referral path |
9 | indicator of patient’s age in 76+ | Average times of being the first node on a referral path for all physicians on the referral path |
10 | Average times of being the end node on a referral path for all physicians on the referral path | Times of being the first node on a referral path for the last physician |
(a) death1yr | |||
Feature | Estimate | 95% CI | p-value |
#nodes in dominant HRR | −0.243 | (−0.389, −0.098) | 1.03E−03 |
Physician distribution entropy | −0.313 | (−0.625, −0.0013) | 0.049 |
PHN distribution entropy | −0.528 | (−0.692, −0.365) | 2.34E−10 |
#pairs of nodes with reciprocal referrals | −2.496 | (−3.666, −1.325) | 2.93E−05 |
Avg. PageRank values on a referral path | −2.290 | (−2.803, −1.778) | <2E−16 |
Avg. index of first occurrence | −0.569 | (−0.974, −0.164) | 0.0059 |
Avg. proposed #cross-PHN referrals | 1.628 | (0.961, 2.295) | 1.73E−06 |
Avg. #referrals on the corresponding edges | 8.696 | (4.771, 12.620) | 1.41E−05 |
Avg. ranking-based weight on the corresponding edges | −3.973 | (−6.426, −1.519) | 0.0015 |
#previous paths | 2.204 | (1.908, 2.500) | <2E−16 |
Total RVU | 11.414 | (10.461, 12.367) | <2E−16 |
Times of being the end node of the last physician | −2.985 | (−4.075, −1.896) | 7.89E−08 |
Avg. first occurrence index of the last physician | 4.869 | (4.176, 5.562) | <2E−16 |
Times of occurring multiple times of the last physician | 1.778 | (1.041, 2.514) | 2.23E−06 |
(b) PCI | |||
Feature | Estimate | 95% CI | p-value |
Physician distribution entropy | −0.368 | (−0.678, −0.058) | 0.019 |
PHN distribution entropy | 0.547 | (0.359, 0.734) | 1.08E−08 |
Avg. #common connected nodes between neighbors | 0.487 | (0.097, 0.877) | 0.014 |
Avg. PageRank values on a referral path | 3.874 | (3.337, 4.411) | <2E−16 |
Avg. proposed #cross-PHN referrals | −1.738 | (−2.278, −1.197) | 2.89E−10 |
Avg. #referrals on the corresponding edges | −2.222 | (−3.822, −0.622) | 0.0065 |
#previous paths | −1.845 | (−2.155, −1.533) | <2E−16 |
Total RVU | −2.113 | (−2.909, −1.315) | 2.02E−07 |
Local clustering coefficient of the last physician | −1.352 | (−1.969, −0.735) | 1.76E−05 |
Avg. first occurrence index of the last physician | −3.024 | (−4.034, −2.013) | 4.48E−09 |
Linear regression analysis of log(total 1yr payments)
Feature | Estimate | CI | p-value |
---|---|---|---|
#nodes in dominant HRR | 0.121 | (0.099, 0.142) | < 2E−16 |
Referral reciprocity in dominant HRR | 0.209 | (0.167, 0.251) | < 2E−16 |
#nodes ∗ | −2.588 | (−2.992, −2.183) | < 2E−16 |
Physician distribution entropy | 1.365 | (1.321, 1.410) | < 2E−16 |
PHN distribution entropy ∗ | 0.413 | (0.347, 0.480) | < 2E−16 |
Avg. #common connected nodes between neighbors | −0.357 | (−0.432, −0.282) | < 2E−16 |
#pairs of nodes with reciprocal referrals | 2.618 | (2.374, 2.863) | < 2E−16 |
Avg. local clustering coefficient on the referral path | −1.222 | (−1.326, −1.117) | < 2E−16 |
Avg. PageRank values on the referral path | 0.983 | (0.888, 1.077) | < 2E−16 |
Avg. index of first occurrence on the referral path | 0.341 | (0.235, 0.447) | 3.05E−10 |
Avg. proposed #cross-PHN referrals | −0.592 | (−0.685, −0.498) | < 2E−16 |
Avg. #referrals on the corresponding edges | −0.567 | (−0.902, −0.232) | 9.25E−04 |
Avg. ranking-based weight on the corresponding edges ∗ | 0.775 | (0.485, 1.064) | 1.59E−07 |
#previous paths ∗ | 0.304 | (0.212, 0.396) | 9.28E−11 |
Total RVU ∗ | 5.028 | (4.604, 5.451) | < 2E−16 |
Month of the first visit | categorical | varies by group | < 2E−16 |
Specialty of the key physician | categorical | varies by group | < 2E−16 |
Month of the last visit | categorical | varies by group | < 2E−16 |
Avg. first occurrence index of the last physician ∗ | −0.433 | (−0.686, −0.179) | 7.99E−04 |
In-depth study of a hospital
(s, t, w) | (s, t, w) | (s, t, w) | (s, t, w) | (s, t, w) |
(20, 7, 2) | (1, 6, 2) | (22, 6, 2) | (6, 7, 11) | (10, 7, 2) |
(7, 6, 10) | (6, 10, 3) | (7, 19, 2) | (3, 12, 2) | (12, 2, 2) |
(17, 6, 2) | (17, 11, 3) | (6, 11, 4) | (7, 17, 3) | (14, 3, 2) |
(6, 17, 5) | (8, 10, 2) | (14, 10, 2) | (19, 6, 3) | (4, 2, 2) |
(2, 4, 4) | (10, 11, 3) | (11, 24, 2) | (3, 14, 2) | (11, 6, 2) |
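The edge list above can be summarized per node with a few lines of Python. Reading each triple (s, t, w) as source node, target node, and edge weight is an assumption, since the text does not expand the abbreviations. The sketch computes the weighted indegree, the weighted outdegree, and their difference for each node.

```python
from collections import defaultdict

# The 25 triples from the table, read as (source, target, weight).
hospital_edges = [
    (20, 7, 2), (1, 6, 2), (22, 6, 2), (6, 7, 11), (10, 7, 2),
    (7, 6, 10), (6, 10, 3), (7, 19, 2), (3, 12, 2), (12, 2, 2),
    (17, 6, 2), (17, 11, 3), (6, 11, 4), (7, 17, 3), (14, 3, 2),
    (6, 17, 5), (8, 10, 2), (14, 10, 2), (19, 6, 3), (4, 2, 2),
    (2, 4, 4), (10, 11, 3), (11, 24, 2), (3, 14, 2), (11, 6, 2),
]

in_weight = defaultdict(int)
out_weight = defaultdict(int)
for source, target, weight in hospital_edges:
    out_weight[source] += weight
    in_weight[target] += weight

all_nodes = sorted(set(in_weight) | set(out_weight))
# Positive balance: the node receives more weight than it sends.
balance = {node: in_weight[node] - out_weight[node] for node in all_nodes}

for node in all_nodes:
    print(node, in_weight[node], out_weight[node], balance[node])
```

Under this reading, nodes 6 and 7 carry most of the weight in both directions, and the heaviest edges are the pair 6 -> 7 (weight 11) and 7 -> 6 (weight 10).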