Skip to content

Conversation

Yanay1
Copy link
Collaborator

@Yanay1 Yanay1 commented Mar 24, 2021

No description provided.

Copy link
Collaborator

@mattjones315 mattjones315 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this work!

The LCA-based clustering code looks good though I think it'd be nice to have a test or two showing it works. The Hotspot refactor looks very nice! This is exactly on the lines I was thinking and will be nice to re-enter the pipeline to recluster the genes.

There's still quite a bit of code cleanup that needs to happen though, especially around the tree clustering. On one hand we'll need to name these functions more expressively, and on the second we'll have to choose one or two options for clustering. I tend to prefer a depth based clustering as I've said before because it's very clean to implement and clear in interpretation.

Let's get these comments resolved before merging in!

treeCluster <- function(tree, reach=10) {
#'
#' @export
treeCluster1 <- function(tree, reach=10) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's have names that are more informative than treeCluster1, treeCluster2, etc etc.

I'm not sure what the difference is off the bat. Ideally we'd have one clustering approach implemented.

Also, I'd prefer to use a more informative name than reach to designate the number of intended clusters.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this supposed to be a depth-based clustering technique? What is depth(u,v)? Is this the depth of the LCA or the number of edges separating them?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, forgot the LCA call!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+renamed the functions

Comment on lines +396 to +401
function(object, K = round(sqrt(length(object$tip.label))), lcaKNN=FALSE, minSize=20) {
if (lcaKNN) {
k <- lcaBasedTreeKNN(object, minSize = minSize)
} else {
k <- find_knn_parallel_tree(object, K)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unclear -- why do we have two separate options here?

I propose we populate the KNN before this step and pass it in with the object (either as a slot or as an extra argument). It makes less sense to me to have a boolean argument here specifying the approach because it can cause downstream inconsistencies if another function uses a different approach for computing the KNN graph.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updating docstring to:
#' @param lcaKNN whether to use LCA based KNN (cluster by minimum size), if false defaults to cophenetic distance (random tie breaking).
#' WARNING: lcaKNN doesn't perform well with broad multifurcations

this is if you want to use the lcaKNN where you use all neighbors from clade if clade size > min size. This function is the first time the knn are calculated for the object.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This docstring is a lot more clear now!

I'm curious if we don't want to add a new step in the pipeline that just calculates & populates the KNN. That way we can just access this object again if we ever need to get the KNN graph.




#' Depth of tip1 parent immediately after LCA(tip1, tip2)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we want the depth of the parent immediately after the LCA? Why can't we work with the LCA depth?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to use this so that merged clusters appear next to each other in the plotly graph.

If clades A, B and C all have LCA D, when we merge A and B for example we need to choose how to merge between A and B or A and C or B and C. The plotly tree is sorted by depth, so I was trying to use the depths of the node after LCA so one can distinguish between the three child clades. I am kind of souring on this idea though since it really is arbitrary, I still don't really know what the best solution for dealing with multifurcations is. Maybe we can have really small clusters (1/2/3 cells)?

#' @return the modified Vision object
#'
#' @export
hsAnalyze <- function(object, model="normal", tree=FALSE,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's call this something more informative like runHotspot or something

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with hsAnalyze since it matches the VISION pattern of main analysis function just being called analyze

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was more saying that using the name hsAnalyze is not clear because it's not clear a priori that hs means Hotspot. Since hotspotAnalyze sounds a bit burdensome I suggested runHotspot.

It doesn't really matter to me what you name it as long as you don't use the abbreviation hs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed!

# Init Hotspot
hs <- hsInit(object, model, tree, num_umi)
# Init Hotspot KNN
hs <- hsCreateKnnGraph(hs, object, n_neighbors=NULL, nn_precomp=NULL, wt_precomp=NULL)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we haven't gotten nn_precomp and wt_precomp to work here, let's remove this from the argument list (for clarity)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove it if it's not working when merging into master but is it ok to keep it for this pr into yr-cass?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, let's remove it from this PR because we'll want to just merge in this PR to master when we're ready.

# Init Hotspot KNN
hs <- hsCreateKnnGraph(hs, object, n_neighbors=NULL, nn_precomp=NULL, wt_precomp=NULL)
# perform Hotspot analysis and store results in R
hs_genes <- hsComputeAutoCorrelations(hs, number_top_genes=1000, autocorrelation_fdr=0.05)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're not propagating the number_top_genes and autocorrelation_fdr arguments passed into the function.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

Copy link
Collaborator

@mattjones315 mattjones315 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking a lot better!

What's our default tree-clustering method? Is there a way for a user to choose which one to use? Because if not, I don't think we need to have four different tree clustering methods.

Let's resolve my leftover comments and then we can merge it in to the branch!

#'
#' @param tree object of class phylo
#' @param reach number of clusters to attempt to generate
#' @return List of clusters, each entry being a vector of indices representing
#' samples in the cluster.
#'
#' @export
treeCluster1 <- function(tree, reach=10) {
depthBasedTreeCluster <- function(tree, reach=10) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function name is a lot more clear! But I don't like the argument reach -- let's use something more informative.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about target?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I think depth would be more reasonable.

We should have informative names for the other clustering algorithms too.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about numClusters?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No - because this parameter does not control the number of clusters, but rather the depth at which you cut. numClusters would be unclear to a user.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The depth isn't the parameter though, we're doing a binary search for the depth to yield the specified number of clusters.

@@ -33,9 +33,9 @@ hsAnalyze <- function(object, model="normal", tree=FALSE,
# Init Hotspot
hs <- hsInit(object, model, tree, num_umi)
# Init Hotspot KNN
hs <- hsCreateKnnGraph(hs, object, n_neighbors=NULL, nn_precomp=NULL, wt_precomp=NULL)
hs <- hsCreateKnnGraph(hs, object, n_neighbors=n_neighbors, nn_precomp=nn_precomp, wt_precomp=wt_precomp)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To my point above, let's get rid of this in this PR until it works.

@mattjones315
Copy link
Collaborator

This is looking great! I believe the only thing left is to rename some of the arguments in the tree-based clustering algorithms. Thanks for all the hard work here!

Copy link
Collaborator

@mattjones315 mattjones315 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! I left some comments on the vignettes.

I don't think we need two separate vignettes for PhyloVision - one should do.

And when it comes to the Hotspot vignette, let's add some more discussion around what parameters you can modulate, etc.

There are some other things we'll have to change here regarding the relative filepaths. For example, you're loading VISION from your desktop and assuming we have signatures stored in a certain place. (In fact, let's assume that the users have this package installed and can load it in using library(VISION).) Let's make sure this will work when we distribute it also by including toy data with the package or uploading it somewhere safe like Google Drive and download it for the purpose of the vignette.

Comment on lines 39 to 41
lg7_tree <- read.tree("/data/yosef2/users/mattjones/projects/PhyloVision/data/metastasis_data/lg7_tree_hybrid_priors.alleleThresh.processed.txt")
lg4_tree <- read.tree("/data/yosef2/users/mattjones/projects/PhyloVision/data/metastasis_data/lg4_tree_hybrid_priors.alleleThresh.processed.txt")

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to do the analysis for both trees -- let's only use the LG4 tree.

Also, can we add this tree to the data directory so it's distributed with the package?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For file locations should I just put like "..." or something else? How do we want to distribute the data?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to have users set a path variable-- still not sure how to distribute the data though?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 2 to 5
title: "VisCas Vignette"
author: "Yanay Rosen"
date: "9/30/2020"
output: html_document
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference between the document metastasisPhyloVision and this one phyloVision.Rmd?

I think we only need one vignette showcasing Phylovision.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

phyloVision.rmd replicates chan 2019 example from Hotspot paper, metastasisPhyloVision is for the lg4 and lg7 trees, I included it for replicability for the paper but can remove!

Comment on lines 69 to 78
```{r inspectModules}
hs <- loadHotspotObject(bytes=vis@Hotspot[[1]])
library(reticulate)
use_python('/usr/bin/python3')
```
```{python modulesPlot}
import matplotlib.pyplot as plt
import hotspot
hs.plot_local_correlations()
plt.show()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this section to before we view results in browser? And point people to the Hotspot vignette for more in depth examples of how to work with parameters?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in latest commit!

vis <- Vision(expr, signatures=c(sig), latentSpace = pos, meta=meta) # TODO add relevant signatures
```

Next, we can perform the normal Vision analysis using the tree as the latent space. We need to tell Vision to use the Tree as the latent space and to calculate neighbors.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're not using the tree as the latent space here - can you be more specific about what this is doing with the spatial barcodes?

Let's be careful to make it clear that you don't need to run Hotspot on spatial data, and that we have applications on phylogenies & RNA-seq datasets.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in latest commit!

@Yanay1
Copy link
Collaborator Author

Yanay1 commented Jun 15, 2021

Merging into staging!

@Yanay1 Yanay1 merged commit a2de46b into yr-cass Jun 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants