API

Wrapper functions

RACCOON (Resolution-Adaptive for Coarse-to-fine Clustering OptimizatiON) F. Comitani @2018-2022 A. Maheshwari @2019

raccoon.main.classify(new_data, old_data, membership, refpath='./rc_data', **kwargs)

Wrapper function to classify new data with KNN on a previous IterativeClustering output.

Parameters:
  • new_data (matrix or pandas dataframe) – data to classify in dataframe-compatible format.

  • old_data (matrix or pandas dataframe) – reference data on which the hierarchy was built.

  • membership (matrix or pandas dataframe) – one-hot-encoded clusters assignment table from the original run.

  • refpath (string) – path to the location where trained umap files (pkl) are stored (default subdirectory rc_data of current folder).

  • kwargs (dict) – keyword arguments for KNN.

Returns:

one-hot-encoded clusters membership of the projected data.

Return type:

(pandas dataframe)
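
The idea of classifying new samples with KNN and returning a one-hot-encoded membership can be sketched in plain Python. This is only an illustration under simplifying assumptions: a single flat level, Euclidean distance and majority voting. The helper name knn_one_hot and the cluster names "A" and "B" are hypothetical, and the sketch does not reproduce how classify walks the hierarchy or reuses the saved UMAP maps.

```python
import math
from collections import Counter


def knn_one_hot(new_points, ref_points, ref_labels, k=3):
    """Give each new point the majority label of its k nearest reference
    points and return one one-hot row (dict) per new point.
    Hypothetical helper for illustration, not RACCOON's code."""
    classes = sorted(set(ref_labels))
    rows = []
    for p in new_points:
        # Indices of the k reference points closest to p (Euclidean distance).
        nearest = sorted(range(len(ref_points)),
                         key=lambda i: math.dist(p, ref_points[i]))[:k]
        winner = Counter(ref_labels[i] for i in nearest).most_common(1)[0][0]
        rows.append({c: int(c == winner) for c in classes})
    return rows


ref_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
              (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
ref_labels = ["A", "A", "A", "B", "B", "B"]
membership = knn_one_hot([(0.05, 0.1), (5.0, 5.1)], ref_points, ref_labels)
print(membership)
```

Each row has exactly one entry set to 1, which is the shape of table the membership argument and the return value describe.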

raccoon.main.cluster(data, **kwargs)

Wrapper function to set up and create an IterativeClustering object, then run the top-down iterations and logging.

Parameters:
  • data (pandas dataframe) – dataframe with samples as rows and features as columns.

  • kwargs (dict) – keyword arguments for IterativeClustering.

Returns:

clus_opt (pandas dataframe): one-hot-encoded clusters membership of data.

tree (anytree object): anytree structure with information on the clusters hierarchy.

raccoon.main.resume(data, refpath='./rc_data', lab=None, **kwargs)

Wrapper function to resume an IterativeClustering run from checkpoint files.

Parameters:
  • data (pandas dataframe) – dataframe with samples as rows and features as columns.

  • refpath (string) – path to checkpoint files parent folder (default subdirectory rc_data of current folder).

  • lab (list, array or pandas series) – list of labels corresponding to each sample (for plotting only).

  • kwargs (dict) – keyword arguments for KNN and IterativeClustering.

Returns:

new_clus (pandas dataframe): one-hot-encoded clusters membership of the whole data.

tree (anytree object): anytree structure with information on the clusters hierarchy.

raccoon.main.update(new_data, old_data, membership, tolerance=0.1, prob_cut=0.25, refpath='./rc_data', out_path='./', **kwargs)

Wrapper function to update a previous IterativeClustering output with new data. Runs KNN first on the new data points to identify the closest matching clusters. These points are then added to each cluster along the hierarchy and the objective function is recalculated. If this score is lowered beyond the given threshold, the cluster under scrutiny is scrapped, together with its offspring, and rebuilt from scratch.

Parameters:
  • new_data (matrix or pandas dataframe) – data to classify in dataframe-compatible format.

  • old_data (matrix or pandas dataframe) – reference data on which the hierarchy was built.

  • membership (matrix or pandas dataframe) – one-hot-encoded clusters assignment table from the original run.

  • tolerance (float) – objective score change threshold, beyond which clusters will have to be recalculated.

  • prob_cut (float) – probability cutoff; when running the KNN, samples with less than this probability of belonging to any assigned class will be treated as noise and won’t impact the clusters score review.

  • refpath (string) – path to the location where trained umap files (pkl) are stored (default subdirectory rc_data of current folder).

  • out_path (string) – path to the location where output files will be saved (default current folder).

  • kwargs (dict) – keyword arguments for KNN and IterativeClustering.

Returns:

one-hot-encoded perturbed clusters membership.

Return type:

(pandas dataframe)
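
The roles of prob_cut and tolerance can be illustrated with a small sketch. The helper names split_noise and needs_rebuild are hypothetical, and the sketch assumes that the score change is measured as an absolute drop of the objective score; the library may measure it differently.

```python
def split_noise(probabilities, prob_cut=0.25):
    """Separate confidently assigned samples from noise.

    probabilities maps a sample id to a {cluster: probability} dict.
    A sample is kept only if its best class probability reaches prob_cut.
    """
    kept, noise = {}, []
    for sample, probs in probabilities.items():
        best = max(probs, key=probs.get)
        if probs[best] >= prob_cut:
            kept[sample] = best
        else:
            noise.append(sample)
    return kept, noise


def needs_rebuild(old_score, new_score, tolerance=0.1):
    """Assumption: a cluster is rebuilt when the score drops by more
    than tolerance in absolute terms."""
    return (old_score - new_score) > tolerance


probabilities = {
    "s1": {"A": 0.90, "B": 0.10},
    "s2": {"A": 0.20, "B": 0.15},  # below prob_cut for every class
}
kept, noise = split_noise(probabilities, prob_cut=0.25)
print(kept, noise)
print(needs_rebuild(0.62, 0.48), needs_rebuild(0.62, 0.58))
```

In this toy case only s1 would contribute to the score review, and only the first score change would trigger a rebuild.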

CPU/GPU interface

Parallelizable functions interface for RACCOON F. Comitani @2020-2022

class raccoon.interface.Interface

Bases: object

Interface for parallelizable functions.

__init__()
cluster()
decompose()
dim_red()
dunn()
static filter_key(dct, keys)

Remove entry from dictionary by key.

Parameters:
  • dct (dict) – dictionary to change.

  • keys (obj) – key or list of keys to filter.

Returns:

(dict): filtered dictionary.
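
A minimal sketch of this kind of filter is shown below. It assumes that a new dictionary is returned and the input is left untouched, which the description above does not specify; the standalone function is an illustration, not the method's actual body.

```python
def filter_key(dct, keys):
    """Return a copy of dct without the given key or list of keys."""
    if not isinstance(keys, (list, tuple, set)):
        keys = [keys]  # accept a single key as well as a collection
    return {k: v for k, v in dct.items() if k not in keys}


kwargs = {"metric": "euclidean", "eps": 0.5, "n_jobs": -1}
print(filter_key(kwargs, "n_jobs"))
print(filter_key(kwargs, ["eps", "n_jobs"]))
```

A helper like this is handy for dropping keyword arguments that one backend accepts and another does not.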

get_value(var)

Returns value of given variable.

Parameters:

var (any) – input variable.

Returns:

value of the input variable.

Return type:

(any)

label_bin()
louvain()
n_neighbor()
one_hot()
silhouette()
class raccoon.interface.InterfaceCPU

Bases: Interface

Interface for CPU functions.

Load the required CPU libraries.

__init__()

Load the required CPU libraries.

cluster(pj, **kwargs)

Sets up clusters identification object with DBSCAN.

Parameters:
  • pj (DataFrame) – projected data to cluster.

  • kwargs (dict) – keyword arguments for clusters identification.

Returns:

clusters identification object.

Return type:

(obj)

cluster_louvain(pj, **kwargs)

Sets up clusters identification object with Louvain.

Parameters:
  • pj (DataFrame) – adjacency matrix to cluster.

  • kwargs (dict) – keyword arguments for clusters identification.

Returns:

clusters identification object.

Return type:

(obj)

decompose(**kwargs)

Sets up features filtering object.

Parameters:

kwargs (dict) – keyword arguments for features filtering.

Returns:

features filtering object.

Return type:

(obj)

dim_red(**kwargs)

Sets up dimensionality reduction object.

Parameters:

kwargs (dict) – keyword arguments for dimensionality reduction.

Returns:

dimensionality reduction object.

Return type:

(obj)

dunn(points, labels, **kwargs)

Calculates the Dunn index score for a set of points and clusters labels. WARNING: slow!

Parameters:
  • points (self.df.DataFrame) – points coordinates.

  • labels (self.df.Series) – clusters membership labels.

  • kwargs (dict) – keyword arguments for pairwise distances (e.g. metric).

Returns:

dunn index on given points.

Return type:

(float)
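
For reference, one common definition of the Dunn index is the smallest distance between points of different clusters divided by the largest distance between points of the same cluster. The sketch below implements that definition in plain Python with Euclidean distance. Which variant of the index the library uses is not stated here, so treat this as an assumption; the function name dunn_index is hypothetical.

```python
import math
from itertools import combinations


def dunn_index(points, labels):
    """Dunn index: minimum inter-cluster distance over maximum
    intra-cluster diameter (higher means better separated clusters)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the Dunn index needs at least two clusters")

    # Largest distance between two points sharing a cluster.
    max_diameter = max(
        (math.dist(a, b)
         for members in clusters.values()
         for a, b in combinations(members, 2)),
        default=0.0,
    )
    # Smallest distance between two points from different clusters.
    min_separation = min(
        math.dist(a, b)
        for ca, cb in combinations(clusters.values(), 2)
        for a in ca for b in cb
    )
    if max_diameter == 0.0:
        return math.inf  # every cluster is a single point
    return min_separation / max_diameter


points = [(0, 0), (0, 1), (4, 0), (4, 1)]
labels = [0, 0, 1, 1]
print(dunn_index(points, labels))
```

All pairwise distances are evaluated, so the cost grows quadratically with the number of points, which is consistent with the warning above.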

get_value(var, pandas=False)

Returns value of given variable.

Parameters:
  • var (any) – input variable.

  • pandas (bool) – if True, do nothing.

Returns:

value of the input variable.

Return type:

(any)

inv_cov(data)

Attempts to find the inverse of the covariance matrix; if the matrix is singular, uses the Moore-Penrose pseudoinverse.

Parameters:

data (self.df.DataFrame or ndarray) – matrix containing the datapoints.

Returns:

the (pseudo)inverted covariance matrix.

Return type:

(ndarray)

label_bin(**kwargs)

Sets up label binarizer object.

Parameters:

kwargs (dict) – keyword arguments for label binarizer.

Returns:

label binarizer object.

Return type:

(obj)

n_neighbor(**kwargs)

Sets up nearest neighbors object.

Parameters:

kwargs (dict) – keyword arguments for nearest neighbors.

Returns:

nearest neighbors object.

Return type:

(obj)

one_hot(**kwargs)

Sets up one-hot encoder object.

Parameters:

kwargs (dict) – keyword arguments for encoder.

Returns:

encoder object.

Return type:

(obj)

set(var)

Wrapper for Python set, GPU friendly.

Parameters:

var (any) – input variable.

Returns:

set of the input variable.

Return type:

(set)

silhouette(points, labels, **kwargs)

Calculates the silhouette score for a set of points and clusters labels.

Parameters:
  • points (self.df.DataFrame) – points coordinates.

  • labels (self.df.Series) – clusters membership labels.

  • kwargs (dict) – keyword arguments for silhouette score (e.g. metric).

Returns:

silhouette score on given points.

Return type:

(float)
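
As a reminder of what this score measures, the mean silhouette coefficient can be written in a few lines of plain Python. For each point, a is the mean distance to the other members of its cluster, b is the smallest mean distance to the members of any other cluster, and the coefficient is (b - a) / max(a, b). The sketch assumes Euclidean distance and is not the library's implementation.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (range -1 to 1)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    def mean_dist(i, members):
        # Mean distance from point i to the listed members, excluding i itself.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton contributes a coefficient of 0
        a = mean_dist(i, own)
        b = min(mean_dist(i, members)
                for other, members in clusters.items() if other != label)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


points = [(0, 0), (0, 1), (4, 0), (4, 1)]
labels = [0, 0, 1, 1]
score = silhouette_score(points, labels)
print(round(score, 4))
```

Values close to 1 indicate compact, well-separated clusters; values near 0 or below suggest overlapping or misassigned points.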

class raccoon.interface.InterfaceGPU

Bases: Interface

Interface for GPU functions.

Load the required GPU libraries.

__init__()

Load the required GPU libraries.

build_graph(pj)

Builds a graph from an adjacency matrix.

Parameters:

pj (DataFrame) – adjacency matrix to cluster.

Returns:

cuGraph undirected graph.

Return type:

(Graph)

cluster(pj, **kwargs)

Sets up clusters identification object with DBSCAN.

Parameters:
  • pj (DataFrame) – projected data to cluster.

  • kwargs (dict) – keyword arguments for clusters identification.

Returns:

clusters identification object.

Return type:

(obj)

cluster_louvain(pj, **kwargs)

Sets up clusters identification object with Louvain.

Parameters:
  • pj (Graph) – cuGraph undirected graph from adjacency matrix.

  • kwargs (dict) – keyword arguments for clusters identification.

Returns:

clusters identification object.

Return type:

(obj)

decompose(**kwargs)

Sets up features filtering object.

Parameters:

kwargs (dict) – keyword arguments for features filtering.

Returns:

features filtering object.

Return type:

(obj)

dim_red(**kwargs)

Sets up dimensionality reduction object.

Parameters:

kwargs (dict) – keyword arguments for dimensionality reduction.

Returns:

dimensionality reduction object.

Return type:

(obj)

dunn(points, labels, **kwargs)

Calculates the Dunn index score for a set of points and clusters labels. WARNING: slow!

Parameters:
  • points (self.df.DataFrame) – points coordinates.

  • labels (self.df.Series) – clusters membership labels.

  • kwargs (dict) – keyword arguments for pairwise distances (e.g. metric).

Returns:

dunn index on given points.

Return type:

(float)

get_value(var, pandas=False)

Returns value of given variable, transferring it from GPU to CPU.

Parameters:
  • var (any) – input variable.

  • pandas (bool) – if True, transform cudf to pandas.

Returns:

value of the input variable.

Return type:

(any)

inv_cov(data)

Attempts to find the inverse of the covariance matrix; if the matrix is singular, uses the Moore-Penrose pseudoinverse.

Parameters:

data (self.df.DataFrame or ndarray) – matrix containing the datapoints.

Returns:

the (pseudo)inverted covariance matrix.

Return type:

(ndarray)

label_bin(**kwargs)

Sets up label binarizer object.

Parameters:

kwargs (dict) – keyword arguments for label binarizer.

Returns:

label binarizer object.

Return type:

(obj)

n_neighbor(**kwargs)

Sets up nearest neighbors object.

Parameters:

kwargs (dict) – keyword arguments for nearest neighbors.

Returns:

nearest neighbors object.

Return type:

(obj)

one_hot(**kwargs)

Sets up one-hot encoder object.

Parameters:

kwargs (dict) – keyword arguments for encoder.

Returns:

encoder object.

Return type:

(obj)

set(var)

Wrapper for Python set, GPU friendly.

Parameters:

var (any) – input variable.

Returns:

set of the input variable.

Return type:

(set)

silhouette(points, labels, **kwargs)

Calculates the silhouette score for a set of points and clusters labels.

Parameters:
  • points (self.df.DataFrame) – points coordinates.

  • labels (self.df.Series) – clusters membership labels.

  • kwargs (dict) – keyword arguments for silhouette score (e.g. metric).

Returns:

silhouette score on given points.

Return type:

(float)

Clustering

Clustering classes and functions for RACCOON F. Comitani @2018-2022 A. Maheshwari @2019

raccoon.clustering.DEBUG_R = 15

Suppress UMAP and numpy warnings.

class raccoon.clustering.DataGlobal

Bases: object

Static container for the input data to be filled by the user at the first iteration.

dataset = None
labels = None
class raccoon.clustering.IterativeClustering(data, lab=None, transform=None, supervised=False, supervised_weight=0.5, dim=2, epochs=5000, lr=0.05, nei_range='logspace', nei_points=25, nei_factor=1.0, neicap=100, skip_equal_dim=True, skip_dimred=False, metric_map='cosine', metric_clu='euclidean', pop_cut=50, filter_feat='variance', ffrange='logspace', ffpoints=25, optimizer='grid', search_candid=10, search_iter=10, tpe_patience=5, score='silhouette', baseline=-1e-05, norm=None, dyn_mesh=False, max_mesh=20, min_mesh=4, clu_algo='SNN', cparm_range='guess', min_sam_dbscan=None, outliers='ignore', noise_ratio=0.3, min_csize=None, name='0', debug=False, max_depth=None, save_map=True, RPD=False, out_path='', depth=0, chk=False, gpu=False, _user=True)

Bases: object

To perform top-down iterative clustering on a samples x features matrix.

Initialize the class.

Parameters:
  • data (matrix, pandas dataframe or pandas index) – if first call (_user==True), input data in pandas dataframe-compatible format (samples as rows, features as columns), otherwise index of samples to carry downstream during the iteration calls.

  • lab (list, array or pandas series) – list of labels corresponding to each sample (for plotting only).

  • transform (list of Pandas DataFrame indices) – list of indices of the samples in the initial matrix that should be transformed-only and not used for training the dimensionality reduction map.

  • supervised (bool) – if True, use labels for supervised dimensionality reduction with UMAP (default False, works only if lab is not None).

  • supervised_weight (float) – how much weight is given to the labels in supervised UMAP (default 0.5).

  • dim (integer) – number of dimensions of the target projection (default 2).

  • epochs (integer) – number of UMAP epochs (default 5000).

  • lr (float) – UMAP learning rate (default 0.05).

  • nei_range (array, list of integers, string or function) – list of nearest neighbors values to be used in the search; if ‘logspace’ (default), take an adaptive range based on the dataset size at each iteration with logarithmic spacing (recommended); if a function is provided, it will be used to define the neighbors range at each step (see the manual for more details).

  • nei_points (int or list of int) – number of grid points for the neighbors search; if list, each value will be subsequently used at the next iteration until all values are exhausted (works only with optimizer=‘grid’ and nei_range=‘logspace’, default 25).

  • nei_factor (float) – scaling factor for ‘logspace’ and ‘sqrt’ selections in nei_range (default 1.0).

  • neicap (int) – maximum number of neighbors (recommended with low-memory systems, default 100).

  • skip_equal_dim (bool) – if True, whenever the target dimensionality corresponds to the dimensionality of the input data, the dimensionality reduction step will be skipped (saves time, default True).

  • skip_dimred (bool) – if True, skip the non-linear dimensionality reduction step (default False).

  • metric_map (string) – metric to be used in UMAP distance calculations (default cosine).

  • metric_clu (string) – metric to be used in clusters identification and clustering score calculations (default euclidean). Warning: ‘cosine’ does not work with HDBSCAN, normalize to ‘l2’ and use ‘euclidean’ instead.

  • pop_cut (integer) – minimum number of samples for a cluster to be considered valid, if a cluster is found with a lower population than this threshold, it will not be further explored (default 50).

  • filter_feat (string) – set the method to filter features in preprocessing: if ‘variance’ (default), remove low-variance genes; if ‘MAD’, remove low median absolute deviation genes; if ‘correlation’, remove correlated genes; if ‘tSVD’, use truncated singular value decomposition (LSA).

  • ffrange (array, list or string) – if filter_feat==‘variance’/‘MAD’/‘correlation’, percentage values for the low-variance/correlation removal cutoff search: if ‘logspace’ (default), take a range between .3 and .9 with logarithmic spacing (recommended, will take the extremes if optimizer==‘de’); if ‘kde’, kernel density estimation will be used to find a single optimal low-variance cutoff (not compatible with optimizer==‘de’). If filter_feat==‘tSVD’, values for the number of output components search: if ‘logspace’ (default), take a range between the number of features times .3 and .9 with logarithmic spacing (recommended, will take the extremes if optimizer==‘de’).

  • ffpoints (int or list of int) – number of grid points for the feature removal cutoff search; if list, each value will be subsequently used at the next iteration until all values are exhausted (works only with ffrange=‘logspace’, default 25).

  • optimizer (string) – choice of parameters optimizer, can be either ‘grid’ for grid search, ‘de’ for differential evolution, ‘tpe’ for Tree-structured Parzen Estimators with Optuna, or ‘auto’ for automatic (default is ‘grid’). Automatic will choose between grid search and DE depending on the number of search points (DE if >25), works only if dyn_mesh is True.

  • search_candid (int or list of int) – size of the candidate solutions population in DE or TPE. If list, each value will be subsequently used at the next iteration until all values are exhausted (this last option works only with optimizer=’de’ and ‘tpe’, default 10).

  • search_iter (int or list of int) – maximum number of iterations of differential evolution. If list, each value will be subsequently used at the next iteration until all values are exhausted (works only with optimizer=’de’, default 10).

  • tpe_patience (int) – number of TPE iterations below the tolerance before interrupting the search (default 5).

  • score (string or function) – objective function of the optimization, to be provided as a string (currently only ‘dunn’ and ‘silhouette’ are available, default ‘silhouette’). Alternatively, a scoring function can be provided, it must take a feature array, an array-like list of labels and a metric, in the same format as sklearn.metrics.silhouette_score.

  • baseline (float) – baseline score. Candidate parameters below this score will be automatically excluded (default -1e-5).

  • norm (string) – normalization factor before dimensionality reduction, not needed if metric_map is cosine; if None (default), don’t normalize.

  • dyn_mesh (bool) – if true, adapt the number of mesh points (candidates and iteration in DE, candidates in tpe) to the population, overrides nei_points, search_candid, search_iter and ffpoints (default False).

  • max_mesh (int) – maximum number of points for the dyn_mesh option (hit at 10000 samples, default 20); this is a single dimension, the actual mesh will contain n*n points.

  • min_mesh (int) – minimum number of points for the dyn_mesh option (hit at 50 samples, default 4, must be >3 if optimizer=‘de’); this is a single dimension, the actual mesh will contain n*n points.

  • clu_algo (string) – selects which algorithm to use for clusters identification. Choose among ‘DBSCAN’, ‘SNN’ (Shared Nearest Neighbors DBSCAN, default), ‘HDBSCAN’, or ‘louvain’ (Louvain community detection with SNN).

  • cparm_range (array, list) – clusters identification parameter range to be explored (default ‘guess’). When ‘DBSCAN’ this corresponds to epsilon (if ‘guess’ attempts to identify it by the elbow method); When ‘HDBSCAN’ this corresponds to the minimum number of samples required by the clusters (if ‘guess’ adapts it on the dataset population).

  • min_sam_dbscan (int) – minimum number of samples to define a core, used in DBSCAN and HDBSCAN; if None (default), set to 2*target_dim.

  • outliers (string) – selects how to deal with outlier points in the clusters assignment: if ‘ignore’ (default), discard them; if ‘reassign’, try to assign them to other clusters with KNN if more than 10% of the total population was flagged.

  • noise_ratio (float) – maximum percentage cutoff of samples that can be labelled as noise before discarding the result (relevant only for clustering algorithms that label border points as noise, default .3).

  • min_csize (int) – Minimum population size of clusters. If None, keep all clusters, else, clusters below this threshold will be discarded as soon as they are identified (default None).

  • name (string) – name of current clustering level (should be left as default, ‘0’, unless continuing from a previous run).

  • debug (boolean) – specifies whether algorithm is run in debug mode (default is False).

  • max_depth (int) – Specify the maximum number of search iterations, if None (default), keep going while possible. 0 stops the algorithm immediately, 1 stops it after the first level.

  • save_map (boolean) – if active, saves the trained maps to disk (default is True). Needed to run the k-NN classifier.

  • RPD (boolean) – specifies whether to save RPD distributions for each cluster (default is False). Warning: this option is unstable and not recommended.

  • out_path (string) – path to the location where outputs will be saved (default, save to the current folder).

  • depth (integer) – current depth of search (should be left as default 0, unless continuing from a previous run).

  • chk (bool) – save checkpoints (default False, recommended for big jobs).

  • gpu (bool) – Activate GPU version (requires RAPIDS).

  • _user (bool) – Boolean switch to distinguish initial user input versus iteration calls, do not change.
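
The score parameter accepts a custom function that takes a feature array, an array-like list of labels and a metric, in the same format as sklearn.metrics.silhouette_score. A minimal sketch of such a function follows. The name compactness_score is hypothetical, the sketch supports only the euclidean metric, and it assumes that higher values are better, as with the silhouette score. It is meant to show the expected call shape, not to be a good objective: compactness alone rewards splitting the data into very small clusters.

```python
import math


def compactness_score(points, labels, metric="euclidean"):
    """Negative mean distance of each point to its cluster centroid.

    Takes (features, labels, metric) like sklearn.metrics.silhouette_score;
    values closer to zero mean tighter clusters.
    """
    if metric != "euclidean":
        raise ValueError("this sketch only supports the euclidean metric")
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        # Centroid = coordinate-wise mean of the cluster members.
        centroid = [sum(coord) / len(members) for coord in zip(*members)]
        total += sum(math.dist(p, centroid) for p in members)
    return -total / len(points)


points = [(0, 0), (0, 2), (4, 0), (4, 2)]
print(compactness_score(points, [0, 0, 1, 1]))  # left/right split
print(compactness_score(points, [0, 1, 0, 1]))  # top/bottom split
```

On this toy data the left/right split groups closer points together, so it receives the higher (less negative) score.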

__init__(data, lab=None, transform=None, supervised=False, supervised_weight=0.5, dim=2, epochs=5000, lr=0.05, nei_range='logspace', nei_points=25, nei_factor=1.0, neicap=100, skip_equal_dim=True, skip_dimred=False, metric_map='cosine', metric_clu='euclidean', pop_cut=50, filter_feat='variance', ffrange='logspace', ffpoints=25, optimizer='grid', search_candid=10, search_iter=10, tpe_patience=5, score='silhouette', baseline=-1e-05, norm=None, dyn_mesh=False, max_mesh=20, min_mesh=4, clu_algo='SNN', cparm_range='guess', min_sam_dbscan=None, outliers='ignore', noise_ratio=0.3, min_csize=None, name='0', debug=False, max_depth=None, save_map=True, RPD=False, out_path='', depth=0, chk=False, gpu=False, _user=True)

Initialize the class.

_elbow(pj)

Estimates the point of flex of a pairwise distances plot.

Parameters:

pj (pandas dataframe/numpy matrix) – projection of samples in the low-dimensionality space obtained with UMAP, or adjacency matrix if SNN.

Returns:

elbow value.

Return type:

(float)
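
One common way to locate an elbow in a sorted distance curve is to pick the point farthest from the straight line joining the first and last points. The sketch below implements that heuristic in plain Python. It is an assumption that this resembles what _elbow does; the function name elbow_value and the sample values are hypothetical.

```python
import math


def elbow_value(values):
    """Return the value at the elbow of the sorted curve, defined as the
    point with the largest perpendicular distance from the chord between
    the first and last points. The x axis is the rank, the y axis the value;
    no rescaling of the axes is applied in this sketch."""
    ys = sorted(values)
    n = len(ys)
    if n < 3:
        raise ValueError("need at least three values to look for an elbow")
    x0, y0, x1, y1 = 0, ys[0], n - 1, ys[-1]
    chord = math.hypot(x1 - x0, y1 - y0)
    best_index, best_dist = 0, -1.0
    for i, y in enumerate(ys):
        # Perpendicular distance from (i, y) to the chord.
        dist = abs((y1 - y0) * (i - x0) - (x1 - x0) * (y - y0)) / chord
        if dist > best_dist:
            best_index, best_dist = i, dist
    return ys[best_index]


distances = [0.1, 0.11, 0.12, 0.13, 0.15, 0.9, 2.0, 3.5]
print(elbow_value(distances))
```

For DBSCAN-like algorithms, a value found this way is often used as a starting guess for epsilon.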

_features_removal(cutoff)

Either remove features with low variance/MAD, or high correlation from dataset according to a specified threshold (cutoff) or apply truncated SVD to reduce features to a certain number (cutoff).

Parameters:

cutoff (string or float) – if filter_feat==‘variance’/‘MAD’/‘correlation’, percentage value for the low-variance/MAD/high-correlation removal cutoff, if ‘kde’ kernel density estimation will be used to find a single optimal low-variance/MAD/high-correlation cutoff; if filter_feat==‘tSVD’, dimensionality of the output data.

Returns:

(pandas dataframe): reduced-dimensionality input data.

(tsvd object): trained tsvd instance, None if ‘variance’/‘MAD’/‘correlation’.

_find_clusters(pj, cparm, cse=None, algorithm=None)

Runs the selected density-based clusters identification algorithm.

Parameters:
  • pj (dataframe or matrix) – points coordinates.

  • cparm (float) – clustering parameter.

  • cse (int) – value of cluster_selection_epsilon for HDBSCAN.

  • algorithm (string) – value of algorithm for HDBSCAN.

Returns:

list of assigned clusters.

Return type:

(list of int)
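
To make the role of the clustering parameter concrete, here is a compact plain-Python DBSCAN, in which eps plays the part of cparm and min_samples the part of min_sam_dbscan. It is an illustrative sketch with a brute-force neighbor search and Euclidean distance, not the implementation this method calls.

```python
import math


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise.
    A point is a core point if at least min_samples points (itself
    included) lie within distance eps."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None means not visited yet
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point, keep expanding
    return labels


points = [(0, 0), (0, 0.5), (0.5, 0),
          (5, 5), (5, 5.5), (5.5, 5),
          (10, 0)]
assigned = dbscan(points, eps=1.0, min_samples=3)
print(assigned)
```

The isolated point at (10, 0) has no neighbors within eps and is labelled as noise, which is the kind of point the outliers and noise_ratio options deal with.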

_guess_parm(pj)

Estimate a range for the clustering identification parameter.

Parameters:

pj (pandas dataframe) – projection of samples in the low-dimensionality space obtained with UMAP.

Returns:

estimated range.

Return type:

(numpy range)

_level_check()

Stop the iterative search if a given max_depth parameter has been reached.
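
The stopping rule described for max_depth (None keeps going, 0 stops immediately, 1 stops after the first level) can be expressed as a one-line check. The sketch assumes that depth counts from 0, matching the default depth=0 in the class signature; the function name level_reached is hypothetical.

```python
def level_reached(depth, max_depth=None):
    """Return True when the search should stop at the current depth."""
    return max_depth is not None and depth >= max_depth


print(level_reached(0, None))  # no limit: keep going
print(level_reached(0, 0))     # stop immediately
print(level_reached(0, 1))     # first level is allowed to run
print(level_reached(1, 1))     # stop after the first level
```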

_objective_function(params)

Objective function for Differential Evolution.

Parameters:

params (list) – a list containing a single feature cutoff and a UMAP nearest neighbors parameter.

Returns:

a tuple containing the loss value for the given set of parameters; a series with the cluster membership identified for each sample; the optimal clustering parameter value found; a low dimensionality data projection from UMAP; a set of genes kept after low variance/MAD removal, nan if tSVD; the trained tsvd instance, None if ‘variance’/’MAD’.

Return type:

(tuple (float, pd.Series, float, pd.DataFrame, pd.Index, tsvd object))

_optimize_params()

Wrapper function for the parameters optimization.

Returns:

a tuple containing the silhouette score corresponding to the best set of parameters; a series with the cluster membership identified for each sample; the optimal clusters identification parameter value found; the total number of clusters determined by the search; the optimal number of nearest neighbors used with UMAP; a low dimensionality data projection from UMAP; the optimal cutoff value used for the features removal step; the set of genes kept after low variance/MAD removal, nan if tSVD; the trained tsvd instance, None if ‘variance’/’MAD’; the percentage of points forcefully assigned to a class if outliers=’reassign’; the list of all scores evaluated and their parameters.

Return type:

(tuple (float, pd.Series, float, int, int, pd.DataFrame, float, pd.Index, tsvd object, float, list of floats))

_plot(n_nei, proj, cut_opt, keepfeat, decomposer, clus_opt, scoreslist)

Produce a number of plots to visualize the clustering outcome at each stage of the iterative search.

Parameters:
  • n_nei (integer) – optimal number of nearest neighbors (used in UMAP) that was found through grid search.

  • proj (pandas dataframe of floats) – optimal reduced dimensionality data matrix.

  • cut_opt (int or float) – optimal features removal cutoff.

  • keepfeat (pandas index) – set of genes kept after low variance/MAD removal, nan if tSVD.

  • decomposer (tsvd object) – trained tsvd instance.

  • clus_opt (pandas series) – cluster membership series.

  • scoreslist (list of float) – list of all scores evaluated and their parameters.

_run_grid_instances(nnrange)

Run Grid Search to find the optimal set of parameters by maximizing the clustering score.

Parameters:

nnrange (numpy range) – UMAP nearest neighbors range.

Returns:

a tuple containing the list of best parameters; a list containing score, labels, clustering parameter, projected points, trained maps, filtered features and trained low-information filter from the best scoring model; a matrix containing all the explored models’ parameters and their scores (useful for plotting the hyperspace).

Return type:

(tuple (list of floats, list of objects, list of floats))

_run_single_instance(cutoff, nn)

Run a single instance of clusters search for a given features cutoff and UMAP nearest neighbors number.

Parameters:
  • cutoff (float) – features cutoff.

  • nn (int) – UMAP nearest neighbors value.

Returns:

a tuple containing the silhouette score corresponding to the best set of parameters; a series with the cluster membership identified for each sample; the optimal clustering parameter value found; a low dimensionality data projection from UMAP; a set of genes kept after low variance/MAD removal, nan if ‘tSVD’; the trained tsvd instance, None if ‘variance’/’MAD’.

Return type:

(tuple (float, pd.Series, float, pd.DataFrame, pd.Index, tsvd object))

iterate()

Iteratively clusters the input data, by first optimizing the parameters, binarizing the resulting labels, plotting and repeating.

snn(points, num_neigh)

Calculates Shared Nearest Neighbor (SNN) matrix

Parameters:
  • points (dataframe or matrix) – points coordinates.

  • num_neigh (int) – number of neighbors considered to define the similarity of two points.

Returns:

SNN matrix as input for DBSCAN.

Return type:

(matrix)
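
As an illustration of the shared-nearest-neighbour idea, here is a minimal pure-Python sketch. It is not RACCOON's implementation: the helper names (knn_indices, snn_distance_matrix), the Euclidean metric, the exclusion of each point from its own neighbour list, and the conversion to a distance as 1 - shared/num_neigh are all assumptions made for the example.

```python
import math


def knn_indices(points, num_neigh):
    """For each point, return the set of indices of its num_neigh nearest neighbours."""
    neighbours = []
    for i, p in enumerate(points):
        # Euclidean distance to every other point (the point itself is excluded).
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbours.append({j for _, j in dists[:num_neigh]})
    return neighbours


def snn_distance_matrix(points, num_neigh):
    """Hypothetical SNN distance: 1 - (number of shared neighbours / num_neigh)."""
    nbrs = knn_indices(points, num_neigh)
    n = len(points)
    mat = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = len(nbrs[i] & nbrs[j])
            mat[i][j] = mat[j][i] = 1.0 - shared / num_neigh
    return mat


# Two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
mat = snn_distance_matrix(points, num_neigh=2)
```

A matrix of this kind could be passed to a density-based algorithm that accepts precomputed distances; whether the library returns a similarity or a distance matrix is not stated above.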

Classification

Basic k-nearest neighbours classifier for RACCOON F. Comitani @2020-2022

class raccoon.classification.KNN(data, ori_data, ori_clust, refpath='./rc_data/', out_path='', root='0', debug=False, gpu=False)

Bases: object

To perform a basic distance-weighted k-nearest neighbours classification.

Initialize the class.

Parameters:
  • data (matrix or pandas dataframe) – input data in pandas dataframe-compatible format (samples as row, features as columns).

  • ori_data (matrix or pandas dataframe) – original data clustered with RACCOON in pandas dataframe-compatible format (samples as row, features as columns).

  • ori_clust (matrix or pandas dataframe) – original RACCOON output one-hot-encoded class membership in pandas dataframe-compatible format (samples as row, classes as columns).

  • refpath (string) – path to the location where trained umap files (pkl) are stored (default subdirectory rc_data of current folder).

  • out_path (string) – path to the location where outputs will be saved (default save to the current folder).

  • root (string) – name of the root node, parent of all the classes within the first clustering level. Needed to identify the appropriate pkl file (default ‘0’).

  • debug (boolean) – specifies whether algorithm is run in debug mode (default is False).

  • gpu (bool) – activate GPU version (requires RAPIDS).

__init__(data, ori_data, ori_clust, refpath='./rc_data/', out_path='', root='0', debug=False, gpu=False)

Initialize the class.

Parameters:
  • data (matrix or pandas dataframe) – input data in pandas dataframe-compatible format (samples as row, features as columns).

  • ori_data (matrix or pandas dataframe) – original data clustered with RACCOON in pandas dataframe-compatible format (samples as row, features as columns).

  • ori_clust (matrix or pandas dataframe) – original RACCOON output one-hot-encoded class membership in pandas dataframe-compatible format (samples as row, classes as columns).

  • refpath (string) – path to the location where trained umap files (pkl) are stored (default subdirectory rc_data of current folder).

  • out_path (string) – path to the location where outputs will be saved (default save to the current folder).

  • root (string) – name of the root node, parent of all the classes within the first clustering level. Needed to identify the appropriate pkl file (default ‘0’).

  • debug (boolean) – specifies whether algorithm is run in debug mode (default is False).

  • gpu (bool) – activate GPU version (requires RAPIDS).

_build_hierarchy()

Builds a dictionary with information on the classes hierarchy.

_dampen_child_prob()

Renormalize the probabilities of a child class according to that of its parent.

assign_membership()

Identifies class membership probabilities with a distance-weighted k-nearest neighbours algorithm.

gpu

Set up for CPU or GPU run.

membership

Configure log.

raccoon.classification.local_KNN(proj, labs, nnei, metric, interface, as_series=False)

Performs a k-nearest neighbours search and assigns a single-level clusters membership based on the neighbours.

Parameters:
  • proj (dataframe) – data projection onto which nearest neighbours will be searched. Must include all data, original plus the new datapoints to be searched.

  • labs (dataframe) – class assignment for the original data only.

  • nnei (int) – number of nearest neighbours to consider.

  • metric (string) – metric to measure neighbours distances.

  • interface (obj) – CPU/GPU numeric functions interface.

  • as_series (bool) – if true, return result as series, else return as one-hot-encoded matrix (default False).

Returns:

class assignment for the new data.

Return type:

(array or matrix)
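
The distance-weighted vote behind a k-nearest neighbours classifier can be sketched as follows. This is only an illustration under stated assumptions: the function name weighted_knn_probs, the inverse-distance weighting and the Euclidean metric are choices made for the example, and the library's own weighting scheme may differ.

```python
import math
from collections import defaultdict


def weighted_knn_probs(ref_points, ref_labels, query, nnei):
    """Class probabilities for one query point from inverse-distance-weighted votes."""
    # Keep the nnei reference points closest to the query.
    nearest = sorted(
        (math.dist(query, p), lab) for p, lab in zip(ref_points, ref_labels)
    )[:nnei]
    votes = defaultdict(float)
    for dist, lab in nearest:
        # Closer neighbours weigh more; the small constant avoids division by zero.
        votes[lab] += 1.0 / (dist + 1e-12)
    total = sum(votes.values())
    return {lab: weight / total for lab, weight in votes.items()}


ref_points = [(0, 0), (0, 1), (5, 5), (5, 6)]
ref_labels = ["a", "a", "b", "b"]
probs = weighted_knn_probs(ref_points, ref_labels, query=(0, 0.5), nnei=3)
```

Taking the label with the highest probability yields a single-level assignment for the new point.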

Update

To update previous RACCOON clustering runs with new data. F. Comitani @2021-2022

class raccoon.update.UpdateClusters(data, ori_data, ori_clu, refpath='./rc_data/', out_path='./', tolerance=0.1, prob_cut=0.25, min_csize=None, score='silhouette', metric_clu='cosine', root='0', debug=False, gpu=False, **kwargs)

Bases: object

Adds new data to the dataset and identifies clusters that need to be updated. Runs KNN first on the new data points to identify the closest matching clusters. These points are then added to each cluster along the hierarchy and the objective function is recalculated. If this score is lowered beyond the given threshold, the cluster under scrutiny is scrapped, together with its offspring, and re-built from scratch.

Initialize the class.

Parameters:
  • data (matrix or pandas dataframe) – input data in pandas dataframe-compatible format (samples as row, features as columns).

  • ori_data (matrix or pandas dataframe) – original data clustered with RACCOON in pandas dataframe-compatible format (samples as row, features as columns).

  • ori_clu (matrix or pandas dataframe) – original RACCOON output one-hot-encoded class membership in pandas dataframe-compatible format (samples as row, classes as columns).

  • refpath (string) – path to the location where trained umap files (pkl) are stored (default subdirectory rc_data of current folder).

  • out_path (string) – path to the location where outputs will be saved (default save to the current folder).

  • tolerance (float) – objective score change threshold, beyond which clusters will have to be recalculated (default 1e-1).

  • prob_cut (float) – probability cutoff: when running the KNN, samples with less than this value of probability to any assigned class will be treated as noise and won’t impact the clusters score review (default 0.25).

  • min_csize (int) – minimum number of samples in a cluster, if None keep all clusters (default is None).

  • score (string) – objective function of the optimization (currently only ‘dunn’ and ‘silhouette’ are available, default ‘silhouette’).

  • metric_clu (string) – metric to be used in clusters identification and clustering score calculations (default ‘cosine’). Warning: ‘cosine’ does not work with HDBSCAN, normalize to ‘l2’ and use ‘euclidean’ instead.

  • root (string) – name of the root node, parent of all the classes within the first clustering level (default ‘0’).

  • debug (boolean) – specifies whether algorithm is run in debug mode (default is False).

  • gpu (bool) – activate GPU version (requires RAPIDS).

  • kwargs (dict) – keyword arguments for IterativeClustering.

__init__(data, ori_data, ori_clu, refpath='./rc_data/', out_path='./', tolerance=0.1, prob_cut=0.25, min_csize=None, score='silhouette', metric_clu='cosine', root='0', debug=False, gpu=False, **kwargs)

Initialize the class.

Parameters:
  • data (matrix or pandas dataframe) – input data in pandas dataframe-compatible format (samples as row, features as columns).

  • ori_data (matrix or pandas dataframe) – original data clustered with RACCOON in pandas dataframe-compatible format (samples as row, features as columns).

  • ori_clu (matrix or pandas dataframe) – original RACCOON output one-hot-encoded class membership in pandas dataframe-compatible format (samples as row, classes as columns).

  • refpath (string) – path to the location where trained umap files (pkl) are stored (default subdirectory rc_data of current folder).

  • out_path (string) – path to the location where outputs will be saved (default save to the current folder).

  • tolerance (float) – objective score change threshold, beyond which clusters will have to be recalculated (default 1e-1).

  • prob_cut (float) – probability cutoff: when running the KNN, samples with less than this value of probability to any assigned class will be treated as noise and won’t impact the clusters score review (default 0.25).

  • min_csize (int) – minimum number of samples in a cluster, if None keep all clusters (default is None).

  • score (string) – objective function of the optimization (currently only ‘dunn’ and ‘silhouette’ are available, default ‘silhouette’).

  • metric_clu (string) – metric to be used in clusters identification and clustering score calculations (default ‘cosine’). Warning: ‘cosine’ does not work with HDBSCAN, normalize to ‘l2’ and use ‘euclidean’ instead.

  • root (string) – name of the root node, parent of all the classes within the first clustering level (default ‘0’).

  • debug (boolean) – specifies whether algorithm is run in debug mode (default is False).

  • gpu (bool) – activate GPU version (requires RAPIDS).

  • kwargs (dict) – keyword arguments for IterativeClustering.

clu

Apply probability cutoff and assign each sample to a unique path along the hierarchy.

find_and_update()

Update the clusters by adding the new data and rebuilding them if needed.

gpu

Set up for CPU or GPU run.

kwargs

Setup logging.

paramdata

Run KNN.

run_knn()

Run a single KNN instance to project the new datapoints onto the old hierarchy.

Returns:

the one-hot-encoded cluster assignment dataframe for the new dataset.

Return type:

(dataframe)

single_update(clu_name)

Update the clusters by adding the new data and rebuilding them if needed.

Parameters:

clu_name (str) – name of the cluster to update.

Returns:

the updated one-hot-encoded cluster assignment dataframe for the given cluster.

Return type:

(dataframe)

Optimizers

Basic differential evolution implementation for RACCOON F. Comitani @2019-2022

Based on https://nathanrooy.github.io/posts/2017-08-27/simple-differential-evolution-with-python

Storn, R.; Price, K. (1997). “Differential evolution - a simple and efficient heuristic for global optimization over continuous spaces”. Journal of Global Optimization. 11 (4): 341–359. doi:10.1023/A:1008202821328.

optim.de._clamp(x, min_val, max_val)

Force a number between bounds.

Parameters:
  • x (float) – input value to be clamped.

  • min_val (float) – lower bound.

  • max_val (float) – upper bound.

Returns:

clamped value.

Return type:

(float)
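
Clamping is a one-liner in Python; the sketch below shows one common way to write it. The function name clamp is illustrative and this is not necessarily how the library implements it.

```python
def clamp(x, min_val, max_val):
    """Force x into the closed interval [min_val, max_val]."""
    return max(min_val, min(x, max_val))
```

In a differential evolution step this keeps mutated candidates inside the search bounds.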

optim.de._differential_evolution(loss_fun, bounds, integers=None, n_candidates=10, mutation=0.6, recombination=0.7, maxiter=20, tol=0.0001, seed=None)

Basic Differential Evolution implementation.

Parameters:
  • loss_fun (function) – objective function, takes a set of parameters to be optimized and returns a single float value.

  • bounds (tuple) – minimum and maximum boundaries for the parameters to optimize.

  • integers (list of booleans or None) – list with information on which parameters are integers, if None (default) treat every parameter as float.

  • n_candidates (int) – size of the candidate solutions population.

  • mutation (float) – scaling factor for the mutation step.

  • recombination (float) – recombination (crossover) rate.

  • maxiter (float) – maximum number of generations.

  • tol (float) – solution improvement tolerance, if after 3 generations the best solution is not improved by at least this value, stop the iteration.

  • seed (int) – seed for the random numbers generator.

Returns:

a tuple containing the list of best parameters; a list containing score, labels, clustering parameter, projected points, trained maps, filtered features and trained low-information filter from the best scoring model; a matrix containing all the explored models’ parameters and their scores (useful for plotting the hyperspace).

Return type:

(tuple (list of floats, list of objects, list of floats))
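
For readers unfamiliar with the method, the following is a minimal sketch of a classic rand/1/bin differential evolution loop using the same parameter names as above. It is a simplified illustration, not the library's code: it omits the integers and tol options, returns only the best parameters and their loss, and the toy sphere loss is an assumption made for the example.

```python
import random


def differential_evolution(loss_fun, bounds, n_candidates=10, mutation=0.6,
                           recombination=0.7, maxiter=20, seed=None):
    """Minimize loss_fun over box bounds; returns (best_params, best_loss)."""
    rng = random.Random(seed)
    # Random initial population inside the bounds.
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_candidates)]
    scores = [loss_fun(ind) for ind in pop]
    for _ in range(maxiter):
        for i in range(n_candidates):
            # Pick three distinct candidates, all different from the target i.
            a, b, c = rng.sample([j for j in range(n_candidates) if j != i], 3)
            # Mutation: donor = a + F * (b - c), clamped to the bounds.
            donor = [
                min(max(pop[a][k] + mutation * (pop[b][k] - pop[c][k]), lo), hi)
                for k, (lo, hi) in enumerate(bounds)
            ]
            # Binomial crossover; one position is always taken from the donor.
            forced = rng.randrange(len(bounds))
            trial = [
                donor[k] if (rng.random() < recombination or k == forced) else pop[i][k]
                for k in range(len(bounds))
            ]
            # Greedy selection: keep the trial if it is at least as good.
            trial_score = loss_fun(trial)
            if trial_score <= scores[i]:
                pop[i], scores[i] = trial, trial_score
    best = min(range(n_candidates), key=scores.__getitem__)
    return pop[best], scores[best]


def sphere(params):
    """Toy loss with its minimum at the origin."""
    return sum(p * p for p in params)


best_params, best_score = differential_evolution(
    sphere, [(-5.0, 5.0), (-5.0, 5.0)], maxiter=100, seed=0
)
```

Fixing seed makes a run reproducible, which helps when comparing parameter settings.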

optim.de._tostring(x)

Combine a list of numbers into a single string with underscore as separator.

Parameters:

x (list of floats) – list of numbers to combine.

Returns:

combined string.

Return type:

(str)
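
A possible one-line equivalent is shown below. The name tostring is illustrative, and the sketch assumes each number is formatted with Python's default str conversion, which the description above does not specify.

```python
def tostring(x):
    """Join a list of numbers into one underscore-separated string."""
    return "_".join(str(v) for v in x)
```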

Tree-structured Parzen Estimators optimization for RACCOON F. Comitani @2021

class optim.tpe.EarlyStoppingCallback(patience=5, tolerance=0.0001, direction='minimize')

Bases: object

Early stopping callback for Optuna.

Initialize early stopping.

Parameters:
  • patience (int) – number of rounds to wait after reaching the plateau before stopping the study (default 5).

  • tolerance (float) – solution improvement tolerance (default 1e-4).

  • direction (str) – direction of the optimization, it can be either “minimize” or “maximize” in accordance with Optuna’s format (default “minimize”).

__init__(patience=5, tolerance=0.0001, direction='minimize')

Initialize early stopping.

Parameters:
  • patience (int) – number of rounds to wait after reaching the plateau before stopping the study (default 5).

  • tolerance (float) – solution improvement tolerance (default 1e-4).

  • direction (str) – direction of the optimization, it can be either “minimize” or “maximize” in accordance with Optuna’s format (default “minimize”).
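
The patience logic of an early-stopping callback can be sketched as below. This is a hypothetical re-implementation for illustration, not the library's class: the name EarlyStopping and its internal counters are assumptions. It relies only on two members of an Optuna study, the best_value property and the stop() method, and on Optuna invoking callbacks as callback(study, trial).

```python
class EarlyStopping:
    """Stop a study once the best value has not improved for `patience` trials."""

    def __init__(self, patience=5, tolerance=1e-4, direction="minimize"):
        self.patience = patience
        self.tolerance = tolerance
        self.direction = direction
        self.best = None   # best value seen so far
        self.stalled = 0   # consecutive trials without improvement

    def __call__(self, study, trial):
        value = study.best_value
        if self.best is None:
            improved = True
        elif self.direction == "minimize":
            improved = value < self.best - self.tolerance
        else:
            improved = value > self.best + self.tolerance
        if improved:
            self.best = value
            self.stalled = 0
        else:
            self.stalled += 1
            if self.stalled >= self.patience:
                # Ask the study to finish after the current trial.
                study.stop()
```

With Optuna installed, an instance would be passed in the callbacks list of study.optimize.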

class optim.tpe.Objective(bounds, obj_func)

Bases: object

Objective function class for Optuna.

Initialize the objective object.

Parameters:
  • obj_func (function) – objective function; takes a set of parameters to be optimized and returns a single float value.

  • bounds (tuple) – minimum and maximum boundaries for the parameters to optimize.

__init__(bounds, obj_func)

Initialize the objective object.

Parameters:
  • obj_func (function) – objective function; takes a set of parameters to be optimized and returns a single float value.

  • bounds (tuple) – minimum and maximum boundaries for the parameters to optimize.

callback(study, trial)

Stores the best results.

Parameters:
  • study (optuna.Study) – the study to interrupt.

  • trial (optuna.Trial) – the current trial.

optim.tpe._optuna_tpe(obj_func, bounds, n_candidates=20, patience=5, tol=0.0001, seed=None)

Tree-structured Parzen Estimators optimization with Optuna.

Parameters:
  • obj_func (function) – objective function; takes a set of parameters to be optimized and returns a single float value.

  • bounds (tuple) – minimum and maximum boundaries for the parameters to optimize.

  • n_candidates (int) – maximum number of candidate points in the hyperspace to explore (default 20).

  • patience (int) – number of rounds to wait after reaching the plateau before stopping the study (default 5).

  • tol (float) – solution improvement tolerance (default 1e-4).

  • seed (int) – seed for the random numbers generator (default None).

Returns:

a tuple containing the list of best parameters; a list containing score, labels, clustering parameter, projected points, trained maps, filtered features and trained low-information filter from the best scoring model; a matrix containing all the explored models’ parameters and their scores (useful for plotting the hyperspace).

Return type:

(tuple (list of floats, list of objects, list of floats))

Utils

Utility functions for RACCOON F. Comitani @2018-2022 A. Maheshwari @2019

utils.functions._calc_RPD(mh, labs, interface, plot=True, name='rpd', path='')

Calculate and plot the relative pairwise distance (RPD) distribution for each cluster.

See XXX for the definition. DEPRECATED: UNSTABLE, only works with cosine.

Parameters:
  • mh (pandas dataframe) – dataframe containing reduced dimensionality data.

  • labs (pandas series) – clusters membership for each sample.

  • interface (obj) – CPU/GPU numeric functions interface.

  • plot (boolean) – True to generate plot, saves the RPD values only otherwise.

  • name (string) – name of output violin plot .png file.

Returns:

each internal array represents the RPD values of the corresponding cluster #.

Return type:

vals (array of arrays of floats)

utils.functions._drop_collinear(data, interface, thresh=0.75)

Drop collinear features above the ‘thresh’ % of correlation.

WARNING: very slow! Use tSVD instead!

Parameters:
  • data (pandas dataframe) – input pandas dataframe (samples as row, features as columns).

  • interface (obj) – CPU/GPU numeric functions interface.

  • thresh (float) – percentage threshold for the correlation.

utils.functions._drop_min_KDE(data, interface, type='variance')

Use kernel density estimation to guess the optimal cutoff for low-variance removal.

Parameters:
  • data (pandas dataframe) – input pandas dataframe (samples as row, features as columns).

  • interface (obj) – CPU/GPU numeric functions interface.

  • type (string) – measure of variability, to be chosen between variance (‘variance’) or median absolute deviation (‘MAD’).

utils.functions._near_zero_var_drop(data, interface, thresh=0.99, type='variance')

Drop features with low variance/MAD based on a threshold after sorting them, converting to a cumulative function and keeping the ‘thresh’ % most variant features.

Parameters:
  • data (pandas dataframe) – input pandas dataframe (samples as row, features as columns).

  • interface (obj) – CPU/GPU numeric functions interface.

  • thresh (float) – percentage threshold for the cumulative variance/MAD.

  • type (string) – measure of variability, to be chosen between variance (‘variance’) or median absolute deviation (‘MAD’).
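
One way to read the cumulative thresholding described above is sketched here in plain Python. This is an interpretation, not the library's code: the function name keep_most_variant, the dict-of-lists input and the use of population variance are assumptions, and thresh is treated as a fraction of the total variance.

```python
from statistics import pvariance


def keep_most_variant(columns, thresh=0.99):
    """Return the feature names that together account for `thresh` of the total variance.

    columns: dict mapping a feature name to its list of values.
    """
    variances = {name: pvariance(vals) for name, vals in columns.items()}
    total = sum(variances.values())
    if total == 0:
        return []
    kept, cumulative = [], 0.0
    # Walk the features from most to least variant, accumulating variance.
    for name, var in sorted(variances.items(), key=lambda kv: kv[1], reverse=True):
        if cumulative >= thresh * total:
            break
        kept.append(name)
        cumulative += var
    return kept


columns = {
    "a": [0, 10, 0, 10],  # high variance
    "b": [1, 2, 1, 2],    # low variance
    "c": [5, 5, 5, 5],    # constant
}
kept_99 = keep_most_variant(columns, thresh=0.99)
kept_995 = keep_most_variant(columns, thresh=0.995)
```

Raising thresh retains more of the low-variance tail, while constant features are dropped first.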

utils.functions.calc_score(points, labels, score, metric_clu, interface)

Select and calculate scoring function for optimization.

Parameters:
  • points (dataframe or matrix) – points coordinates.

  • labels (series or matrix) – clusters assignment.

  • score (str) – score type.

  • metric_clu (str) – metric to use in the scoring functions.

  • interface (obj) – CPU/GPU numeric functions interface.

Returns:

clustering score.

Return type:

(float)

utils.functions.loc_cat(labels, indices, supervised)

Selects labels in supervised UMAP and transforms them to categories.

Parameters:
  • indices (array-like) – list of indices.

  • supervised (bool) – True if running supervised UMAP.

Returns:

sliced labels series as categories if it exists.

Return type:

(Series)

utils.functions.one_hot_encode(labs_opt, name, interface, min_pop=None, rename=True)

Build and return a one-hot-encoded clusters membership dataframe.

Parameters:
  • labs_opt (pandas series) – cluster membership series or list.

  • name (str) – parent cluster name.

  • interface (obj) – CPU/GPU numeric functions interface.

  • min_pop (int) – population threshold for clusters, if None, keep all produced clusters.

  • rename (bool) – rename columns expanding the parent cluster name (default True).

Returns:

one-hot-encoded cluster membership dataframe.

Return type:

tmplab (pandas dataframe)

utils.functions.setup(out_path=None, paramdata=True, chk=False, RPD=False, suffix='', delete=True)

Set up folders that are written to during clustering, as well as a log file where all standard output is sent. If such folders are already present in the path, delete them.

Parameters:
  • out_path (string) – path where output files will be saved.

  • paramdata (bool) – if true create parameters csv table (default True).

  • chk (bool) – if true create checkpoints subdirectory (default False).

  • RPD (bool) – deprecated, if true create RPD distributions base pickle (default False).

  • suffix (string) – suffix to add to the log file

  • delete (bool) – if true delete folders if already present, user confirmation will always be required before deleting folders (default True).

utils.functions.setup_log(out_path, suffix='')

Set up logging.

Parameters:
  • out_path (string) – path where output files will be saved.

  • suffix (string) – suffix to add to the log file

utils.functions.sigmoid(x, interface, a=0, b=1)

Sigmoid function

Parameters:
  • x (float) – position at which to evaluate the function

  • interface (obj) – CPU/GPU numeric functions interface.

  • a (float) – center parameter

  • b (float) – slope parameter

Returns:

sigmoid function evaluated at position x

Return type:

(float)
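
A sketch of a logistic sigmoid with a center and a slope parameter follows. The exact functional form used by the library is not given above, so the form 1 / (1 + exp(-b (x - a))) is an assumption, and the interface argument is dropped in favour of the standard math module.

```python
import math


def sigmoid(x, a=0.0, b=1.0):
    """Logistic function centred at a with slope b (assumed form)."""
    return 1.0 / (1.0 + math.exp(-b * (x - a)))
```

With the defaults a=0 and b=1 this reduces to the standard logistic function; very large negative arguments would overflow math.exp and need separate handling.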

utils.functions.sort_len_num(lista)

Sort elements of a list by length first, then by numbers.

Parameters:

lista (list) – the list to sort.

Returns:

the sorted list.

Return type:

(list)
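
A possible reading of this ordering is sketched below, assuming the entries are strings made of digits and underscores such as cluster names. Comparing first by length and then by the string itself orders same-length names numerically; the actual implementation may handle other formats differently.

```python
def sort_len_num(lista):
    """Sort strings by length first, then by their content."""
    return sorted(lista, key=lambda s: (len(s), s))


ordered = sort_len_num(["0_10", "0_2", "0_1", "0"])
```

Sorting by length first places a parent such as '0' ahead of its longer-named children.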

utils.functions.unique_assignment(tab, root, interface)

Assigns samples to their maximum probability class-path along the hierarchy, starting from a probability matrix.

Parameters:
  • tab (pandas dataframe) – original cluster membership probabilities table.

  • root (str) – name of the root class.

  • interface (obj) – CPU/GPU numeric functions interface.

Returns:

one-hot-encoded cluster membership dataframe.

Return type:

tab (pandas dataframe)

Auxiliary classes for RACCOON F. Comitani @2022

class utils.classes.IdentityProjection(**kwargs)

Bases: object

To be used when the target space dimensionality corresponds to the input space and the dimensionality reduction step should be skipped.

Initialize the class.

Parameters:

kwargs – keyword arguments will be ignored.

__init__(**kwargs)

Initialize the class.

Parameters:

kwargs – keyword arguments will be ignored.

fit(data, *args, **kwargs)

Initialize the class, setting the number of neighbors to the square root of the dataset size and the dimensionality to that of the dataset.

Parameters:
  • args – positional arguments will be ignored.

  • kwargs – keyword arguments will be ignored.

fit_transform(data, **kwargs)

Empty fit_transform function

Parameters:

data (any) – object to be returned.

Returns:

object to be returned (keyword arguments are ignored).

Return type:

data (any)

identity(data)

Identity function. Returns the input data.

Parameters:

data (any) – object to be returned.

Returns:

object to be returned.

Return type:

data (any)

transform(data)

Empty transform function

Parameters:

data (any) – object to be returned.

Returns:

object to be returned.

Return type:

data (any)
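
A pass-through projector with a scikit-learn-like interface could look like the sketch below. It is an illustrative stand-in rather than the library's class: the attribute names n_neighbors and n_components are assumptions, and the data is taken to be a non-empty list of equal-length rows.

```python
import math


class IdentityProjection:
    """Drop-in replacement for a dimensionality reducer that returns its input unchanged."""

    def __init__(self, **kwargs):
        # Keyword arguments are accepted for interface compatibility and ignored.
        self.n_neighbors = None
        self.n_components = None

    def fit(self, data, *args, **kwargs):
        # Record the square root of the sample count and the input dimensionality.
        self.n_neighbors = int(math.sqrt(len(data)))
        self.n_components = len(data[0])
        return self

    def transform(self, data):
        return data

    def fit_transform(self, data, **kwargs):
        self.fit(data)
        return data


data = [[1, 2], [3, 4], [5, 6], [7, 8]]
proj = IdentityProjection(n_neighbors=5)
out = proj.fit_transform(data)
```

Keeping the same method names lets the rest of a pipeline call the projector without special-casing the skipped step.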

Plotting functions for RACCOON F. Comitani @2018-2022

class utils.plots.Palettes

Bases: object

midpal = ['#355C7D', '#6C5B7B', '#C06C84', '#F67280', '#F8B195']
midpalmap = <matplotlib.colors.LinearSegmentedColormap object>
nupal = ['#247ba0', '#70c1b3', '#b2dbbf', '#f3ffbd', '#ff7149']
nupalmap = <matplotlib.colors.LinearSegmentedColormap object>

utils.plots._plot_cut(df, df_cut, name='./gene_cut.png', path='')

Plot variance distributions before and after the low-variance removal step.

Parameters:
  • df (pandas dataframe) – original data before cutting low variance columns.

  • df_cut (pandas dataframe) – data after cutting low variance columns.

  • name (string) – name of resulting .png file.

utils.plots._plot_score(scores, parm_opt, xlab, name='./scores.png', path='')

Plot optimization score through iterations and highlights the optimal choice.

Parameters:
  • scores (list of float) – list of scores through the clustering parameter iterations.

  • parm_opt (float) – optimal parameter.

  • xlab (string) – x axis label.

  • name (string) – name of resulting .png file.

  • path (string) – path where output pictures should be saved.

utils.plots._plot_score_surf(scores, parm_opt, name='./scores_surf.png', path='')

Plot parameters optimization surface.

Parameters:
  • scores (list of float) – list of scores through the clustering parameter iterations.

  • parm_opt (list of float) – optimal parameters.

  • name (string) – name of resulting .png file.

  • path (string) – path where output pictures should be saved.

utils.plots.plot_homogeneity(df1, df2, name='./homogeneity.png', path='')

Plot a heatmap of the homogeneity score between two sets of clusters.

Parameters:
  • df1 (pandas dataframe) – first one-hot-encoded clusters membership table.

  • df2 (pandas dataframe) – second one-hot-encoded clusters membership table.

  • name (string) – name of output plot .png file.

  • path (string) – path where output pictures should be saved.

utils.plots.plot_map(df, labels, name='./projection.png', path='')

Generate a 2-dimensional scatter plot with color-coded labels.

Parameters:
  • df (pandas dataframe) – 2d input data.

  • labels (series) – label for each sample.

  • name (string) – name of output plot .png file.

  • path (string) – path where output pictures should be saved.

utils.plots.plot_violin(vals, name='./rpdd.png', path='')

Generate a set of separate violin plots from given values.

Parameters:
  • vals (array of arrays of floats) – each internal array contains the values of a single violin plot.

  • name (string) – name of output plot .png file.

  • path (string) – path where output pictures should be saved.