-
Notifications
You must be signed in to change notification settings - Fork 33
LCA based cluster + code clean up #93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this work!
The LCA-based clustering code looks good though I think it'd be nice to have a test or two showing it works. The Hotspot refactor looks very nice! This is exactly on the lines I was thinking and will be nice to re-enter the pipeline to recluster the genes.
There's still quite a bit of code cleanup that needs to happen though, especially around the tree clustering. On one hand we'll need to name these functions more expressively, and on the second we'll have to choose one or two options for clustering. I tend to prefer a depth based clustering as I've said before because it's very clean to implement and clear in interpretation.
Let's get these comments resolved before merging in!
R/Microclusters.R
Outdated
treeCluster <- function(tree, reach=10) { | ||
#' | ||
#' @export | ||
treeCluster1 <- function(tree, reach=10) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's have names that are more informative than treeCluster1
, treeCluster2
, etc etc.
I'm not sure what the difference is off the bat. Ideally we'd have one clustering approach implemented.
Also, I'd prefer to use a more informative name than reach
to designate the number of intended clusters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this supposed to be a depth-based clustering technique? What is depth(u,v)
? Is this the depth of the LCA or the number of edges separating them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, forgot the LCA call!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+renamed the functions
function(object, K = round(sqrt(length(object$tip.label))), lcaKNN=FALSE, minSize=20) { | ||
if (lcaKNN) { | ||
k <- lcaBasedTreeKNN(object, minSize = minSize) | ||
} else { | ||
k <- find_knn_parallel_tree(object, K) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unclear -- why do we have two separate options here?
I propose we populate the KNN before this step and pass it in with the object (either as a slot or as an extra argument). It makes less sense to me to have a boolean argument here specifying the approach because it can cause downstream inconsistencies if another function uses a different approach for computing the KNN graph.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updating docstring to:
#' @param lcaKNN whether to use LCA based KNN (cluster by minimum size), if false defaults to cophenetic distance (random tie breaking).
#' WARNING: lcaKNN doesn't perform well with broad multifurcations
this is if you want to use the lcaKNN where you use all neighbors from clade if clade size > min size. This function is the first time the knn are calculated for the object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This docstring is a lot more clear now!
I'm curious if we don't want to add a new step in the pipeline that just calculates & populates the KNN. That way we can just access this object again if we ever need to get the KNN graph.
|
||
|
||
|
||
#' Depth of tip1 parent immediately after LCA(tip1, tip2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we want the depth of the parent immediately after the LCA? Why can't we work with the LCA depth?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trying to use this so that merged clusters appear next to each other in the plotly graph.
If clades A, B and C all have LCA D, when we merge A and B for example we need to choose how to merge between A and B or A and C or B and C. The plotly tree is sorted by depth, so I was trying to use the depths of the node after LCA so one can distinguish between the three child clades. I am kind of souring on this idea though since it really is arbitrary, I still don't really know what the best solution for dealing with multifurcations is. Maybe we can have really small clusters (1/2/3 cells)?
R/methods-Module.R
Outdated
#' @return the modified Vision object | ||
#' | ||
#' @export | ||
hsAnalyze <- function(object, model="normal", tree=FALSE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's call this something more informative like runHotspot
or something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went with hsAnalyze since it matches the VISION pattern of main analysis function just being called analyze
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was more saying that using the name hsAnalyze
is not clear because it's not clear a priori that hs
means Hotspot
. Since hotspotAnalyze
sounds a bit burdensome I suggested runHotspot
.
It doesn't really matter to me what you name it as long as you don't use the abbreviation hs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed!
R/methods-Module.R
Outdated
# Init Hotspot | ||
hs <- hsInit(object, model, tree, num_umi) | ||
# Init Hotspot KNN | ||
hs <- hsCreateKnnGraph(hs, object, n_neighbors=NULL, nn_precomp=NULL, wt_precomp=NULL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we haven't gotten nn_precomp
and wt_precomp
to work here, let's remove this from the argument list (for clarity)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can remove it if it's not working when merging into master but is it ok to keep it for this pr into yr-cass?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, let's remove it from this PR because we'll want to just merge in this PR to master when we're ready.
R/methods-Module.R
Outdated
# Init Hotspot KNN | ||
hs <- hsCreateKnnGraph(hs, object, n_neighbors=NULL, nn_precomp=NULL, wt_precomp=NULL) | ||
# perform Hotspot analysis and store results in R | ||
hs_genes <- hsComputeAutoCorrelations(hs, number_top_genes=1000, autocorrelation_fdr=0.05) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're not propagating the number_top_genes
and autocorrelation_fdr
arguments passed into the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking a lot better!
What's our default tree-clustering method? Is there a way for a user to choose which one to use? Because if not, I don't think we need to have four different tree clustering methods.
Let's resolve my leftover comments and then we can merge it in to the branch!
R/Microclusters.R
Outdated
#' | ||
#' @param tree object of class phylo | ||
#' @param reach number of clusters to attempt to generate | ||
#' @return List of clusters, each entry being a vector of indices representing | ||
#' samples in the cluster. | ||
#' | ||
#' @export | ||
treeCluster1 <- function(tree, reach=10) { | ||
depthBasedTreeCluster <- function(tree, reach=10) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function name is a lot more clear! But I don't like the argument reach
-- let's use something more informative.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about target?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I think depth
would be more reasonable.
We should have informative names for the other clustering algorithms too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about numClusters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No - because this parameter does not control the number of clusters, but rather the depth at which you cut. numClusters
would be unclear to a user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The depth isn't the parameter though, we're doing a binary search for the depth to yield the specified number of clusters.
R/methods-Module.R
Outdated
@@ -33,9 +33,9 @@ hsAnalyze <- function(object, model="normal", tree=FALSE, | |||
# Init Hotspot | |||
hs <- hsInit(object, model, tree, num_umi) | |||
# Init Hotspot KNN | |||
hs <- hsCreateKnnGraph(hs, object, n_neighbors=NULL, nn_precomp=NULL, wt_precomp=NULL) | |||
hs <- hsCreateKnnGraph(hs, object, n_neighbors=n_neighbors, nn_precomp=nn_precomp, wt_precomp=wt_precomp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To my point above, let's get rid of this in this PR until it works.
This is looking great! I believe the only thing left is to rename some of the arguments in the tree-based clustering algorithms. Thanks for all the hard work here! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good! I left some comments on the vignettes.
I don't think we need two separate vignettes for PhyloVision - one should do.
And when it comes to the Hotspot vignette, let's add some more discussion around what parameters you can modulate, etc.
There are some other things we'll have to change here regarding the relative filepaths. For example, you're loading VISION from your desktop and assuming we have signatures stored in a certain place. (In fact, let's assume that the users have this package installed and can load it in using library(VISION)
.) Let's make sure this will work when we distribute it also by including toy data with the package or uploading it somewhere safe like Google Drive and download it for the purpose of the vignette.
vignettes/metastasisPhyloVision.Rmd
Outdated
lg7_tree <- read.tree("/data/yosef2/users/mattjones/projects/PhyloVision/data/metastasis_data/lg7_tree_hybrid_priors.alleleThresh.processed.txt") | ||
lg4_tree <- read.tree("/data/yosef2/users/mattjones/projects/PhyloVision/data/metastasis_data/lg4_tree_hybrid_priors.alleleThresh.processed.txt") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to do the analysis for both trees -- let's only use the LG4 tree.
Also, can we add this tree to the data
directory so it's distributed with the package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For file locations should I just put like "..." or something else? How do we want to distribute the data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to have users set a path variable-- still not sure how to distribute the data though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vignettes/phlyoVision.Rmd
Outdated
title: "VisCas Vignette" | ||
author: "Yanay Rosen" | ||
date: "9/30/2020" | ||
output: html_document |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the difference between the document metastasisPhyloVision
and this one phyloVision.Rmd
?
I think we only need one vignette showcasing Phylovision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
phyloVision.rmd replicates chan 2019 example from Hotspot paper, metastasisPhyloVision is for the lg4 and lg7 trees, I included it for replicability for the paper but can remove!
vignettes/phlyoVision.Rmd
Outdated
```{r inspectModules} | ||
hs <- loadHotspotObject(bytes=vis@Hotspot[[1]]) | ||
library(reticulate) | ||
use_python('/usr/bin/python3') | ||
``` | ||
```{python modulesPlot} | ||
import matplotlib.pyplot as plt | ||
import hotspot | ||
hs.plot_local_correlations() | ||
plt.show() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move this section to before we view results in browser? And point people to the Hotspot vignette for more in depth examples of how to work with parameters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in latest commit!
vignettes/spatialHotspot.Rmd
Outdated
vis <- Vision(expr, signatures=c(sig), latentSpace = pos, meta=meta) # TODO add relevant signatures | ||
``` | ||
|
||
Next, we can perform the normal Vision analysis using the tree as the latent space. We need to tell Vision to use the Tree as the latent space and to calculate neighbors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're not using the tree as the latent space here - can you be more specific about what this is doing with the spatial barcodes?
Let's be careful to make it clear that you don't need to run Hotspot on spatial data, and that we have applications on phylogenies & RNA-seq datasets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in latest commit!
Merging into staging! |
No description provided.