LCA based cluster + code clean up #93

Yanay1 · 2021-03-24T23:42:30Z

No description provided.

mattjones315

Thanks for doing this work!

The LCA-based clustering code looks good, though I think it'd be nice to have a test or two showing that it works. The Hotspot refactor looks very nice! This is exactly along the lines I was thinking, and it will be nice to be able to re-enter the pipeline to recluster the genes.

There's still quite a bit of code cleanup that needs to happen though, especially around the tree clustering. On the one hand, we'll need to name these functions more expressively; on the other, we'll have to choose one or two options for clustering. I tend to prefer a depth-based clustering, as I've said before, because it's very clean to implement and clear in interpretation.

Let's get these comments resolved before merging in!

mattjones315 · 2021-03-29T18:27:35Z

R/Microclusters.R

-treeCluster <- function(tree, reach=10) {
+#' 
+#' @export
+treeCluster1 <- function(tree, reach=10) {


Let's have names that are more informative than treeCluster1, treeCluster2, etc etc.

I'm not sure what the difference is off the bat. Ideally we'd have one clustering approach implemented.

Also, I'd prefer to use a more informative name than reach to designate the number of intended clusters.

Is this supposed to be a depth-based clustering technique? What is depth(u,v)? Is this the depth of the LCA or the number of edges separating them?

Yep, forgot the LCA call!

+renamed the functions
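
For reference, a minimal base-R sketch of the two readings of depth(u, v) raised above. It assumes a tree stored in the `phylo` edge-matrix layout (tips numbered 1..n, root n+1, each row of `edge` being parent then child). The helpers `pathToRoot`, `lcaDepth` and `edgeDistance` are hypothetical names for illustration, not functions from this PR.

```r
# Toy tree ((A,B),(C,D)) in the phylo layout: tips 1..4, root 5.
# Each row of `edge` is (parent, child).
toyTree <- list(
  edge = matrix(c(5, 6,
                  5, 7,
                  6, 1,
                  6, 2,
                  7, 3,
                  7, 4), ncol = 2, byrow = TRUE),
  tip.label = c("A", "B", "C", "D")
)

# Nodes on the path from `node` up to the root, starting at `node`.
pathToRoot <- function(tree, node) {
  path <- node
  repeat {
    row <- which(tree$edge[, 2] == node)
    if (length(row) == 0) break   # the root has no parent
    node <- tree$edge[row, 1]
    path <- c(path, node)
  }
  path
}

# Reading 1: depth of the LCA, counted in edges from the root.
lcaDepth <- function(tree, u, v) {
  pu <- pathToRoot(tree, u)
  pv <- pathToRoot(tree, v)
  lca <- pu[pu %in% pv][1]        # first shared ancestor walking up from u
  length(pathToRoot(tree, lca)) - 1
}

# Reading 2: number of edges separating u and v.
edgeDistance <- function(tree, u, v) {
  pu <- pathToRoot(tree, u)
  pv <- pathToRoot(tree, v)
  lca <- pu[pu %in% pv][1]
  (match(lca, pu) - 1) + (match(lca, pv) - 1)
}
```

On this toy tree, tips A and B share an LCA one edge below the root and are two edges apart, while A and C meet at the root and are four edges apart, so the two definitions rank pairs in opposite directions.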

mattjones315 · 2021-03-29T18:35:26Z

R/Projections.R

+    function(object, K = round(sqrt(length(object$tip.label))), lcaKNN=FALSE, minSize=20) {
+        if (lcaKNN) {
+          k <- lcaBasedTreeKNN(object, minSize = minSize)
+        } else {
+          k <- find_knn_parallel_tree(object, K)
+        }


It's unclear -- why do we have two separate options here?

I propose we populate the KNN before this step and pass it in with the object (either as a slot or as an extra argument). It makes less sense to me to have a boolean argument here specifying the approach because it can cause downstream inconsistencies if another function uses a different approach for computing the KNN graph.

updating docstring to:
#' @param lcaKNN whether to use LCA based KNN (cluster by minimum size), if false defaults to cophenetic distance (random tie breaking).
#' WARNING: lcaKNN doesn't perform well with broad multifurcations

This is for when you want to use lcaKNN, which uses all neighbors from a clade if the clade size > min size. This function is the first place the KNN are calculated for the object.

This docstring is a lot more clear now!

I wonder whether we should add a new step in the pipeline that just calculates and populates the KNN. That way we can simply read it back from the object whenever we need the KNN graph.
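
A minimal sketch of that "populate once, read everywhere" idea, assuming a plain list stands in for the Vision object and a toy 1-D distance stands in for the tree-based KNN. `computeKnn`, `populateKnn` and `getKnn` are hypothetical names, not part of the package.

```r
# Hypothetical stand-in for a KNN routine: the k nearest neighbors of each
# item by absolute difference between 1-D coordinates.
computeKnn <- function(coords, k) {
  d <- abs(outer(coords, coords, "-"))
  diag(d) <- Inf                       # never pick an item as its own neighbor
  idx <- lapply(seq_len(nrow(d)), function(i) order(d[i, ])[seq_len(k)])
  do.call(rbind, idx)                  # one row per item, k columns
}

# Pipeline step: compute the graph once and store it on the object.
populateKnn <- function(object, k) {
  object$knn <- computeKnn(object$coords, k)
  object
}

# Downstream code reads the stored graph instead of choosing its own method.
getKnn <- function(object) {
  if (is.null(object$knn)) stop("KNN not populated; run populateKnn() first")
  object$knn
}

obj <- list(coords = c(0, 1, 2, 10))
obj <- populateKnn(obj, k = 2)
```

With this shape, the choice between the LCA-based and cophenetic approaches would be made in one place, and every later step would see the same graph.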

mattjones315 · 2021-03-29T19:07:38Z

R/Utilities.R

+
+
+
+#' Depth of tip1 parent immediately after LCA(tip1, tip2)


Why do we want the depth of the parent immediately after the LCA? Why can't we work with the LCA depth?

Trying to use this so that merged clusters appear next to each other in the plotly graph.

If clades A, B and C all have LCA D, then when we merge we need to choose between merging A and B, A and C, or B and C. The plotly tree is sorted by depth, so I was trying to use the depth of the node immediately after the LCA so that one can distinguish between the three child clades. I am kind of souring on this idea though, since it really is arbitrary. I still don't really know what the best solution for dealing with multifurcations is. Maybe we can have really small clusters (1/2/3 cells)?

mattjones315 · 2021-03-29T19:08:08Z

R/methods-Module.R

+#' @return the modified Vision object
+#' 
+#' @export
+hsAnalyze <- function(object, model="normal", tree=FALSE, 


Let's call this something more informative like runHotspot or something

I went with hsAnalyze since it matches the VISION pattern of main analysis function just being called analyze

I was more saying that using the name hsAnalyze is not clear because it's not clear a priori that hs means Hotspot. Since hotspotAnalyze sounds a bit burdensome I suggested runHotspot.

It doesn't really matter to me what you name it as long as you don't use the abbreviation hs.

mattjones315 · 2021-03-29T19:08:49Z

R/methods-Module.R

+    # Init Hotspot
+    hs <- hsInit(object, model, tree, num_umi)
+    # Init Hotspot KNN
+    hs <- hsCreateKnnGraph(hs, object, n_neighbors=NULL, nn_precomp=NULL, wt_precomp=NULL)


If we haven't gotten nn_precomp and wt_precomp to work here, let's remove this from the argument list (for clarity)

I think we can remove it when merging into master if it's not working, but is it OK to keep it for this PR into yr-cass?

No, let's remove it from this PR because we'll want to just merge in this PR to master when we're ready.

mattjones315 · 2021-03-29T19:09:12Z

R/methods-Module.R

+    # Init Hotspot KNN
+    hs <- hsCreateKnnGraph(hs, object, n_neighbors=NULL, nn_precomp=NULL, wt_precomp=NULL)
+    # perform Hotspot analysis and store results in R
+    hs_genes <- hsComputeAutoCorrelations(hs, number_top_genes=1000, autocorrelation_fdr=0.05)


You're not propagating the number_top_genes and autocorrelation_fdr arguments passed into the function.
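
A minimal sketch of the fix, assuming the outer function takes `number_top_genes` and `autocorrelation_fdr` as arguments. The stub below is a hypothetical stand-in for `hsComputeAutoCorrelations` that only echoes what it receives, and the defaults are an assumption copied from the literals in the diff.

```r
# Hypothetical stub standing in for hsComputeAutoCorrelations; it only echoes
# the settings it receives so the forwarding is visible.
computeAutoCorrelationsStub <- function(hs, number_top_genes, autocorrelation_fdr) {
  list(number_top_genes = number_top_genes,
       autocorrelation_fdr = autocorrelation_fdr)
}

# The wrapper forwards the caller's values instead of restating the literals
# 1000 and 0.05 at the call site.
analyzeSketch <- function(hs, number_top_genes = 1000, autocorrelation_fdr = 0.05) {
  computeAutoCorrelationsStub(hs,
                              number_top_genes = number_top_genes,
                              autocorrelation_fdr = autocorrelation_fdr)
}

res <- analyzeSketch(hs = NULL, number_top_genes = 250, autocorrelation_fdr = 0.01)
```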

mattjones315

This is looking a lot better!

What's our default tree-clustering method? Is there a way for a user to choose which one to use? Because if not, I don't think we need to have four different tree clustering methods.

Let's resolve my leftover comments and then we can merge it in to the branch!

mattjones315 · 2021-03-31T01:06:16Z

R/Microclusters.R

 #'
 #' @param tree object of class phylo
 #' @param reach number of clusters to attempt to generate
 #' @return List of clusters, each entry being a vector of indices representing
 #' samples in the cluster.
 #' 
 #' @export
-treeCluster1 <- function(tree, reach=10) {
+depthBasedTreeCluster <- function(tree, reach=10) {


This function name is a lot more clear! But I don't like the argument reach -- let's use something more informative.

How about target?

No, I think depth would be more reasonable.

We should have informative names for the other clustering algorithms too.

What about numClusters?

No - because this parameter does not control the number of clusters, but rather the depth at which you cut. numClusters would be unclear to a user.

The depth isn't the parameter though, we're doing a binary search for the depth to yield the specified number of clusters.
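
A minimal base-R sketch of the search described in the last comment, to make the distinction concrete. It is not the code of `depthBasedTreeCluster`: it assumes a `phylo`-style edge matrix, counts depth in edges from the root, and treats a cut at depth d as yielding one cluster per node at depth d plus one singleton per tip shallower than d. `nodeDepths`, `findCutDepth` and `targetClusters` are hypothetical names.

```r
# Toy tree (((A,B),C),(D,E)) in the phylo layout: tips 1..5, root 6.
exampleTree <- list(
  edge = matrix(c(6, 7,
                  6, 9,
                  7, 8,
                  7, 3,
                  8, 1,
                  8, 2,
                  9, 4,
                  9, 5), ncol = 2, byrow = TRUE),
  tip.label = c("A", "B", "C", "D", "E")
)

# Depth (edges from the root) of every node, filled in level by level.
nodeDepths <- function(tree) {
  depth <- rep(NA_integer_, max(tree$edge))
  root <- setdiff(tree$edge[, 1], tree$edge[, 2])
  depth[root] <- 0L
  frontier <- root
  while (length(frontier) > 0) {
    rows <- which(tree$edge[, 1] %in% frontier)
    children <- tree$edge[rows, 2]
    depth[children] <- depth[tree$edge[rows, 1]] + 1L
    frontier <- children
  }
  depth
}

# Binary search for the shallowest cut depth giving at least `targetClusters`
# clusters. The cluster count never decreases as the cut moves deeper, which
# is what makes binary search valid here.
findCutDepth <- function(tree, targetClusters) {
  depth <- nodeDepths(tree)
  nTips <- length(tree$tip.label)
  countAt <- function(d) sum(depth == d) + sum(depth[seq_len(nTips)] < d)
  lo <- 0L
  hi <- max(depth)
  while (lo < hi) {
    mid <- (lo + hi) %/% 2L
    if (countAt(mid) >= targetClusters) hi <- mid else lo <- mid + 1L
  }
  list(depth = lo, clusters = countAt(lo))
}
```

On the toy tree the cluster counts at depths 0 to 3 are 1, 2, 4 and 5, so a request for 3 clusters lands on depth 2 and actually yields 4. That gap is why the argument reads as a target for the number of clusters and not as a cut depth.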

mattjones315 · 2021-03-31T01:06:54Z

R/methods-Module.R

@@ -33,9 +33,9 @@ hsAnalyze <- function(object, model="normal", tree=FALSE,
    # Init Hotspot
    hs <- hsInit(object, model, tree, num_umi)
    # Init Hotspot KNN
-    hs <- hsCreateKnnGraph(hs, object, n_neighbors=NULL, nn_precomp=NULL, wt_precomp=NULL)
+    hs <- hsCreateKnnGraph(hs, object, n_neighbors=n_neighbors, nn_precomp=nn_precomp, wt_precomp=wt_precomp)


To my point above, let's get rid of this in this PR until it works.

mattjones315 · 2021-04-01T22:24:32Z

This is looking great! I believe the only thing left is to rename some of the arguments in the tree-based clustering algorithms. Thanks for all the hard work here!

mattjones315

Looking good! I left some comments on the vignettes.

I don't think we need two separate vignettes for PhyloVision - one should do.

And when it comes to the Hotspot vignette, let's add some more discussion around what parameters you can modulate, etc.

There are some other things we'll have to change here regarding the relative filepaths. For example, you're loading VISION from your desktop and assuming we have signatures stored in a certain place. (In fact, let's assume that users have this package installed and can load it using library(VISION).) Let's also make sure this will work when we distribute it, either by including toy data with the package or by uploading it somewhere safe like Google Drive and downloading it for the purposes of the vignette.

mattjones315 · 2021-04-23T21:35:41Z

vignettes/metastasisPhyloVision.Rmd

+lg7_tree <- read.tree("/data/yosef2/users/mattjones/projects/PhyloVision/data/metastasis_data/lg7_tree_hybrid_priors.alleleThresh.processed.txt")
+lg4_tree <- read.tree("/data/yosef2/users/mattjones/projects/PhyloVision/data/metastasis_data/lg4_tree_hybrid_priors.alleleThresh.processed.txt")
+


No need to do the analysis for both trees -- let's only use the LG4 tree.

Also, can we add this tree to the data directory so it's distributed with the package?

For file locations should I just put like "..." or something else? How do we want to distribute the data?

Updated to have users set a path variable-- still not sure how to distribute the data though?

^ @mattjones315
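
One possible shape for the path handling, as a hedged sketch: look for the file under the installed package's `extdata` directory first, and fall back to a user-supplied directory otherwise. `resolveDataPath` and the file name in the usage line are hypothetical, and this assumes the data would be shipped under `inst/extdata`, which has not been decided in this thread.

```r
# Look for a file shipped under inst/extdata of an installed package; if it is
# not there, fall back to a directory the user points at.
resolveDataPath <- function(filename, package, fallbackDir) {
  shipped <- system.file("extdata", filename, package = package)
  if (nzchar(shipped)) return(shipped)   # system.file() returns "" when nothing is found
  file.path(fallbackDir, filename)
}

# Hypothetical file name, for illustration only.
treePath <- resolveDataPath("lg4_tree.txt", package = "VISION", fallbackDir = tempdir())
```

The vignette could then call `read.tree(treePath)` without hard-coding a cluster path, and users who download the data separately would only need to set the fallback directory.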

mattjones315 · 2021-04-23T21:36:29Z

vignettes/phlyoVision.Rmd

+title: "VisCas Vignette"
+author: "Yanay Rosen"
+date: "9/30/2020"
+output: html_document


What's the difference between the document metastasisPhyloVision and this one phyloVision.Rmd?

I think we only need one vignette showcasing Phylovision.

phyloVision.Rmd replicates the Chan 2019 example from the Hotspot paper; metastasisPhyloVision is for the LG4 and LG7 trees. I included it for replicability for the paper but can remove it!

mattjones315 · 2021-04-23T21:37:27Z

vignettes/phlyoVision.Rmd

+```{r inspectModules}
+hs <- loadHotspotObject(bytes=vis@Hotspot[[1]])
+library(reticulate)
+use_python('/usr/bin/python3')
+```
+```{python modulesPlot}
+import matplotlib.pyplot as plt
+import hotspot
+hs.plot_local_correlations()
+plt.show()


Can we move this section to before we view results in browser? And point people to the Hotspot vignette for more in depth examples of how to work with parameters?

Fixed in latest commit!

mattjones315 · 2021-04-23T21:38:03Z

vignettes/spatialHotspot.Rmd

+vis <- Vision(expr, signatures=c(sig), latentSpace = pos, meta=meta) # TODO add relevant signatures
+```
+
+Next, we can perform the normal Vision analysis using the tree as the latent space. We need to tell Vision to use the Tree as the latent space and to calculate neighbors. 


We're not using the tree as the latent space here - can you be more specific about what this is doing with the spatial barcodes?

Let's be careful to make it clear that you don't need to run Hotspot on spatial data, and that we have applications on phylogenies & RNA-seq datasets.

Fixed in latest commit!

Yanay1 · 2021-06-15T05:12:04Z

Merging into staging!

code clean up and trying new tree cluster

a154302

mattjones315 requested changes Mar 29, 2021

Yanay1 added 3 commits March 30, 2021 13:36

code clean up and trying new tree cluster

40d718f

code clean up and trying new tree cluster

161b1b0

code clean up and trying new tree cluster

682c1d3

mattjones315 requested changes Mar 31, 2021

code clean up and trying new tree cluster

59828bf

Yanay1 added 2 commits April 1, 2021 16:21

code clean up and trying new tree cluster

37fbca0

Added vignettes

ef8814e

mattjones315 requested changes Apr 23, 2021

Yanay1 added 18 commits April 28, 2021 12:17

updating vignettes

a1427c3

updating vignettes

91183ff

updating vignettes

1f077ad

updating namespace

ca55791

updating vignettes pathing

efa0ba2

updating vignettes

ffd1296

adding spatial data

fea62b0

adding embryo data

5a49955

adding embryo data

42d3fd8

updating main

f973b3b

updating upper left

c9b50bc

updating modules

d16b718

updating vision methods and all classes

8bbe1ce

Updating heatmap

67b26cb

renamed pv.rmd

7310ca9

updating docs

2aeaa03

updating docs

415283c

updating docs

1ba5a45

Yanay1 merged commit a2de46b into yr-cass Jun 15, 2021
