Introduction
Contributions
-
1. The SARL heuristic with the divide and conquer methodology can increase the efficiency and scalability of association rule mining but still maintain reasonable accuracy.
-
2. The rule ranking algorithm calculates the importance and ranks the rules, so the investigator does not have to search through millions of rules.
-
3. We consider gene-disease associations. Each important association rule between genes and disease can be identified and highlighted in the result.
Related work
Our solution
Data preprocessing
Dataset reduction
Converting into transactional datasets
Extracting disease information
Calculating gene importance
The SARL-heuristic
Definitions
A scalable heuristic algorithm—SARL-heuristic
-
Step 1: Find frequent one and two itemsets using the Apriori algorithm (when minsup is high) or the direct generation method (when minsup is low).
-
Step 2: Construct the item association graph (IAG) from the result of step 1.
-
Step 3: Partition the IAG using the multilevel k-way partitioning algorithm (MLkP).
-
Step 4: Partition the dataset according to the result of step 3.
-
Step 5: Call the modified Apriori algorithm or the FP-Growth algorithm to mine frequent itemsets on each transaction partition.
-
Step 6: Find the union of the results found from each partition.
-
Step 7: Generate association rules by running the Apriori-ap-genrules on the frequent itemsets found from step 6.
An example
Assay 1 | Assay 2 | Assay 3 | Assay 4 | Assay 5 | Assay 6 | Assay 7 | Assay 8 | |
---|---|---|---|---|---|---|---|---|
Gene 1 | 0.11 | 0.03 | 1.51 | 0.34 | 10.21 | 0.01 | 0.28 | 1.33 |
Gene 2 | 5.23 | 5.78 | 1.37 | 1.44 | 7.65 | 21.35 | 1.98 | 1.28 |
Gene 3 | 1.32 | 4.89 | 1.05 | 1.37 | 8.45 | 17.56 | 1.79 | 1.79 |
Gene 4 | 1.56 | 0.97 | 0.05 | 0.12 | 1.02 | 1.34 | 0.19 | 1.12 |
Gene 5 | 6.33 | 1.12 | 0.13 | 0.46 | 0.89 | 1.88 | 0.3 | 0.98 |
Assay 1 | Assay 2 | Assay 3 | Assay 4 | Assay 5 | Assay 6 | Assay 7 | |
---|---|---|---|---|---|---|---|
Gene 1 | 12.09091 | 44.33333 | 0.880795 | 3.911765 | 0.130264 | 133 | 4.75 |
Gene 2 | 0.244742 | 0.221453 | 0.934307 | 0.888889 | 0.16732 | 0.059953 | 0.646465 |
Gene 3 | 1.356061 | 0.366053 | 1.704762 | 1.306569 | 0.211834 | 0.101936 | 1 |
Gene 4 | 0.717949 | 1.154639 | 22.4 | 9.333333 | 1.098039 | 0.835821 | 5.894737 |
Gene 5 | 0.154818 | 0.875 | 7.538462 | 2.130435 | 1.101124 | 0.521277 | 3.266667 |
Assay 1 | Assay 2 | Assay 3 | Assay 4 | Assay 5 | Assay 6 | Assay 7 | |
---|---|---|---|---|---|---|---|
Gene 1 | 3.595851 | 5.47032 | − 0.18312 | 1.96782 | − 2.94048 | 7.055282 | 2.247928 |
Gene 2 | − 2.03067 | − 2.17493 | − 0.09803 | − 0.16993 | − 2.57932 | − 4.06002 | − 0.62936 |
Gene 3 | 0.439422 | − 1.44987 | 0.76957 | 0.385784 | − 2.23899 | − 3.29426 | 0 |
Gene 4 | − 0.47805 | 0.207442 | 4.485427 | 3.222392 | 0.13493 | − 0.25873 | 2.559427 |
Gene 5 | − 2.69135 | − 0.19265 | 2.91427 | 1.091148 | 0.138976 | − 0.93988 | 1.707819 |
Assay 1 | Assay 2 | Assay 3 | Assay 4 | Assay 5 | Assay 6 | Assay 7 | |
---|---|---|---|---|---|---|---|
Gene 1 | 1 | 1 | 0 | 1 | − 1 | 1 | 1 |
Gene 2 | − 1 | − 1 | 0 | 0 | − 1 | − − 1 | 0 |
Gene 3 | 0 | − 1 | 0 | 0 | − 1 | − 1 | 0 |
Gene 4 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
Gene 5 | − 1 | 0 | 1 | 1 | 0 | 0 | 1 |
Gene 1 | Gene 2 | Gene 3 | Gene 4 | Gene 5 | |
---|---|---|---|---|---|
Assay 1 | 1 | − 1 | 0 | 0 | − 1 |
Assay 2 | 1 | − 1 | − 1 | 0 | 0 |
Assay 3 | 0 | 0 | 0 | 1 | 1 |
Assay 4 | 1 | 0 | 0 | 1 | 1 |
Assay 5 | − 1 | − 1 | − 1 | 0 | 0 |
Assay 6 | 1 | − 1 | − 1 | 0 | 0 |
Assay 7 | 1 | 0 | 0 | 1 | 1 |
TID | Items |
---|---|
T000 | − 1, + 2, + 5 |
T001 | − 1, + 2, + 3 |
T002 | − 4, − 5 |
T003 | − 1, − 4, − 5 |
T004 | + 1, + 2, + 3 |
T005 | − 1, + 2, + 3 |
T006 | − 1, − 4, − 5 |
Frequent Itemsets | Support |
---|---|
{− 1} | 5 |
{+ 2} | 4 |
{+ 3} | 3 |
{− 4} | 3 |
{− 5} | 3 |
Frequent Itemsets | Support |
---|---|
{− 1, + 2} | 3 |
{− 1, + 3} | 2 |
{− 1, − 4} | 2 |
{− 1, − 5} | 2 |
{+ 2, + 3} | 3 |
{− 4, − 5} | 2 |
TID | Items |
---|---|
T001 | − 1, + 2, + 3 |
T005 | − 1, + 2, + 3 |
TID | Items |
---|---|
None | None |
Frequent Itemsets | Support |
---|---|
{− 1, + 2, + 3} | 2 |
Frequent Itemsets | Support |
---|---|
None | N/A |
Frequent Itemsets | Support |
---|---|
{− 1} | 5 |
{+ 2} | 4 |
{+ 3} | 3 |
{− 4} | 3 |
{− 5} | 3 |
{− 1, + 2} | 3 |
{− 1, + 3} | 2 |
{− 1, − 4} | 2 |
{− 1, − 5} | 2 |
{+ 2, + 3} | 3 |
{− 4, − 5} | 2 |
{− 1, + 2, + 3} | 2 |
Rules | Confidence |
---|---|
{+ 2} {− 1} | 0.75 |
{+ 3} {+ 2} | 1 |
{− 5} {− 1} | 1 |
{+ 2} {+ 3} | 0.75 |
{− 5} {− 4} | 1 |
{− 4} {− 5} | 1 |
{− 1, + 3} {+ 2} | 1 |
The SARL (scalable association rule learning) heuristic
Ranking of association rules
Experiments and results
-
OS: macOS Big Sur
-
CPU: Apple M1
-
Memory: 8 GB
-
Disk: 256 GB, SSD
-
Programming Language: Python 3.7
E-MTAB-9030—microRNA profiling of muscular dystrophies dataset
-
File size: 4 KB
-
Number of genes: 29
-
Number of assays: 15
SARL | Apriori | |
---|---|---|
5 | 0.083 | 1.746 |
4 | 0.176 | 4.035 |
3 | 0.457 | 10.932 |
2 | 1.233 | 32.054 |
E-MTAB-8615
Molecular characterisation of TP53 mutated squamous cell carcinomas of the lung identifies BIRC5 as a putative target for therapy
-
File size: 73.4 MB
-
Number of genes: 58,202
-
Number of assays: 209
SARL | Apriori | |
---|---|---|
10 | 1.93 | 0.01 |
9 | 1.78 | 0.01 |
8 | 1.65 | 0.01 |
7 | 1.8 | 0.01 |
6 | 1.67 | 0.01 |
5 | 1.72 | 0.01 |
4 | 1.84 | 0.02 |
3 | 1.69 | 0.77 |
2 | 44.14 | 278.98 |
1 | Interrupted | Interrupted |
E-MTAB-6703—A microarray meta-dataset of breast cancer
-
File size: 780.2 MB
-
Number of genes: 20,546
-
Number of assays: 2302
SARL | Apriori | |
---|---|---|
5 | 0.023 | 0.011 |
4 | 0.017 | 0.02 |
3 | 0.034 | 1.446 |
2 | 0.142 | 99.756 |
Discussion and conclusions
-
Develop a parallel version of the SARL heuristic and its implementation. The transaction partitions can be considered as independent datasets, and we can easily run the modified Apriori algorithm or FP-Growth algorithm on each of the transaction partition in parallel and then merge the results (frequent three or higher itemsets) together along with the frequent one and two itemsets to obtain the total frequent itemsets. Each parallel processor does not need to communicate with others during the computation since all the information needed is already included in the local dataset. This would result in maximum utilization of each processor.
-
A better algorithm may be used to predict the proper thresholds of the ternary discretization. The current ternary discretization is based on empirical methods and may need several tests to find the best thresholds that reduce the dataset to a smaller size while keeping enough information. A statistical analysis of the dataset may help to decide the boundaries. Furthermore, we should consider incorporating deep neural network approach in this process to predict the best threshold.
-
Incremental Learning on Multiple Datasets: Nowadays, new microarray datasets are added to databases around the world on a daily basis. Among them, some of them focus on the same sets of genes, and others may have overlapping gene components. This brings an interesting question: can we learn association rules across multiple microarray datasets to get a larger number of more convincing rules? The answer is yes. It is possible and quite useful to learn association rules from multiple datasets. In fact, a prominent advantage of the SARL algorithm is the ability to do incremental learning across multiple datasets.
-
\(c1=\frac{A1}{B1}\)
-
\(c2=\frac{A2}{B2}\)
-
\(c3=\frac{A3}{B3}\)
Assay 1 | Assay 2 | |
---|---|---|
G1 | 5 | 7 |
G2 | 6 | 2 |
G3 | 3 | 1 |
Assay 1 | Assay 2 | |
---|---|---|
G1 | 5 | 2 |
G2 | 3 | 5 |
G3 | 8 | 7 |
G4 | 3 | 5 |
Assay 1 | Assay 2 | Assay 3 | |
---|---|---|---|
G1 | 5 | 7 | 1.43 |
G2 | 6 | 2 | 7.5 |
G3 | 3 | 1 | 57.14 |