adobo package¶
Subpackages¶
Submodules¶
adobo.IO module¶
adobo.bio module¶
Summary¶
Functions related to biology.
-
adobo.bio.
cell_cycle_predict
(obj, clf, tr_features, name=(), verbose=False)¶ Predicts cell cycle phase
Notes
The classifier is trained on mouse data, so it should only be used on mouse data unless it is retrained on data from another species. Genes must be identified by Ensembl identifiers (prefixed with ‘ENSMUSG’); gene symbols alone are not sufficient. Results are returned as a column in the meta_cells data frame of the passed object. Probability scores are not returned.
- Parameters
obj (adobo.data.dataset) – A data class object.
clf (sklearn.linear_model.SGDClassifier) – The classifier.
tr_features (list) – Training features.
name (tuple) – A tuple of normalizations to use. If it is empty, all available normalizations are used.
verbose (bool) – Be verbose. Default: False
- Returns
- Return type
Modifies the passed object.
-
adobo.bio.
cell_cycle_train
(verbose=False)¶ Trains a cell cycle classifier using Stochastic Gradient Descent with data from Buettner et al.
Notes
Genes are selected from GO:0007049
The classifier only needs to be trained once; on subsequent calls it is loaded (deserialized) from disk.
- Parameters
verbose (bool) – Be verbose or not. Default: False
References
- 1
Buettner et al. (2015) Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat Biotech.
- Returns
sklearn.linear_model.SGDClassifier – A trained classifier.
list – Containing training features.
-
adobo.bio.
cell_type_predict
(obj, name=(), clustering=(), min_cluster_size=10, cell_type_markers=None, verbose=False)¶ Predicts cell types using the expression of marker genes
Notes
Gene identifiers should be gene symbols, not Ensembl identifiers or other identifier types.
- Parameters
obj (adobo.data.dataset) – A data class object.
name (tuple) – A tuple of normalizations to use. If it is empty, all available normalizations are used.
clustering (tuple, optional) – Specifies the clustering outcomes to work on.
min_cluster_size (int) – Minimum number of cells per cluster; clusters smaller than this are ignored. Default: 10
cell_type_markers (pandas.DataFrame) – Source of gene markers used to define cell types. This is set to None as default, indicating that PanglaoDB markers will be used. To use custom markers, set this to a pandas data frame where the first column is a gene and the second column is the name of the cell type (every cell type will have multiple rows). Default: None
verbose (bool) – Be verbose or not. Default: False
- Returns
- Return type
Modifies the passed object.
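The custom marker format described for cell_type_markers can be sketched as below. The gene symbols, cell type names and column names are illustrative placeholders only; the description above only requires that the first column holds the gene and the second the cell type. The commented call at the end assumes that `exp` is an existing adobo.data.dataset.

```python
import pandas as pd

# First column: gene symbol; second column: cell type.
# Each cell type spans several rows, one row per marker gene.
markers = pd.DataFrame(
    [
        ("CD3E", "T cells"),
        ("CD3D", "T cells"),
        ("CD79A", "B cells"),
        ("MS4A1", "B cells"),
    ],
    columns=["gene", "cell_type"],  # column names are arbitrary here
)

# Number of marker genes listed for every cell type.
markers_per_type = markers.groupby("cell_type")["gene"].count().to_dict()

# Hypothetical usage, assuming `exp` is an adobo.data.dataset:
# ad.bio.cell_type_predict(exp, cell_type_markers=markers)
```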
adobo.bulk module¶
adobo.clustering module¶
Summary¶
This module contains functions to cluster data.
-
adobo.clustering.
generate
(obj, k=10, name=None, distance='euclidean', graph='snn', clust_alg='leiden', prune_snn=0.067, res=0.8, save_graph=True, seed=42, verbose=False)¶ A wrapper function for generating single cell clusters from a shared nearest neighbor graph with the Leiden algorithm
- Parameters
obj (adobo.data.dataset) – A dataset class object.
k (int) – Number of nearest neighbors. Default: 10
name (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on all normalizations available.
distance (str) – Distance metric to use. See here for valid choices: https://tinyurl.com/y4bckf7w Default: ‘euclidean’
target ({‘irlb’, ‘svd’}) – The dimensionality reduction result to run on. Default: irlb
graph ({‘snn’}) – Type of graph to generate. Only shared nearest neighbor (snn) supported at the moment.
clust_alg ({‘leiden’, ‘louvain’, ‘walktrap’, ‘spinglass’, ‘multilevel’, ‘infomap’, ‘label_prop’, ‘leading_eigenvector’}) – Clustering algorithm to be used.
prune_snn (float) – Threshold for pruning the SNN graph, i.e. the edges with lower value (Jaccard index) than this will be removed. Set to 0 to disable pruning. Increasing this value will result in fewer edges in the graph. Default: 0.067
res (float) – Resolution parameter for the Leiden algorithm _only_; change to modify cluster resolution. Default: 0.8
save_graph (bool) – To save the graph or not. Default: True
seed (int) – For reproducibility.
verbose (bool) – Be verbose or not.
References
- 1
Yang et al. (2016) A Comparative Analysis of Community Detection Algorithms on Artificial Networks. Scientific Reports
- Returns
A dict containing cluster sizes (number of cells), only if retx is set to True.
- Return type
dict
-
adobo.clustering.
igraph
(snn_graph, clust_alg)¶ Runs clustering functions within igraph
- Parameters
snn_graph (pandas.DataFrame) – Source and target nodes.
clust_alg ({‘walktrap’, ‘spinglass’, ‘multilevel’, ‘infomap’, ‘label_prop’, ‘leading_eigenvector’}) – Specifies the community detection algorithm.
References
- 1
Pons & Latapy (2006) Computing Communities in Large Networks Using Random Walks, Journal of Graph Algorithms and Applications
- 2
Reichardt & Bornholdt (2006) Statistical mechanics of community detection, Physical Review E
- Returns
- Return type
Nothing. Modifies the passed object.
-
adobo.clustering.
knn
(comp, k=10, distance='euclidean')¶ Nearest neighbour search. Finds the k nearest neighbours of each cell.
- Parameters
comp (pandas.DataFrame) – A pandas data frame containing PCA components.
k (int) – Number of nearest neighbors. Default: 10
target ({‘irlb’, ‘svd’}) – The dimensionality reduction result to run the NN search on. Default: irlb
distance (str) – Distance metric to use. See here for valid choices: https://tinyurl.com/y4bckf7w
- Returns
Array containing indices.
- Return type
numpy.ndarray
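As an illustration of what a k-nearest-neighbour search over PCA components returns, here is a minimal brute-force sketch using Euclidean distance. It is not adobo's implementation: the function name `knn_sketch` is hypothetical, and excluding each cell from its own neighbour list is an assumption.

```python
import numpy as np

def knn_sketch(comp, k=10):
    """Return an (n_cells, k) array of nearest-neighbour indices (Euclidean)."""
    comp = np.asarray(comp, dtype=float)
    # Pairwise squared Euclidean distances via broadcasting.
    diff = comp[:, None, :] - comp[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    # Assumption: a cell is not counted as its own neighbour.
    np.fill_diagonal(dist, np.inf)
    # Sort every row by distance and keep the k closest cells.
    return np.argsort(dist, axis=1)[:, :k]

# Four cells on a line: cells 0 and 1 are close, as are cells 2 and 3.
points = np.array([[0.0], [1.0], [10.0], [11.5]])
nn_idx = knn_sketch(points, k=2)
```

The brute-force distance matrix needs memory quadratic in the number of cells, so this form is only practical for small inputs.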
-
adobo.clustering.
leiden
(snn_graph, res=0.8, seed=42)¶ Runs the Leiden algorithm
- Parameters
snn_graph (pandas.DataFrame) – Source and target nodes.
res (float) – Resolution parameter; change to modify cluster resolution. Default: 0.8
seed (int) – For reproducibility.
References
- 1
- 2
Traag et al. (2018) https://arxiv.org/abs/1810.08473
- Returns
- Return type
Nothing. Modifies the passed object.
-
adobo.clustering.
louvain
(snn_graph, res=0.8, seed=42)¶ Runs the Louvain algorithm
- Parameters
snn_graph (pandas.DataFrame) – Source and target nodes.
res (float) – Resolution parameter; change to modify cluster resolution. Default: 0.8
seed (int) – For reproducibility.
References
- 1
- 2
https://perso.uclouvain.be/vincent.blondel/research/louvain.html
- 3
Blondel et al., Fast unfolding of communities in large networks (2008), Journal of Statistical Mechanics: Theory and Experiment
- Returns
- Return type
Nothing. Modifies the passed object.
-
adobo.clustering.
snn
(nn_idx, k=10, prune_snn=0.067, verbose=False)¶ Computes a Shared Nearest Neighbor (SNN) graph
Notes
Link weights are number of shared nearest neighbors. The sum of SNN similarities over all KNNs is retrieved with linear algebra.
- Parameters
nn_idx (numpy.ndarray) – NumPy array generated using knn().
k (int) – Number of nearest neighbors. Default: 10
prune_snn (float) – Threshold for pruning the SNN graph, i.e. the edges with lower value (Jaccard index) than this will be removed. Set to 0 to disable pruning. Increasing this value will result in fewer edges in the graph. Default: 0.067
verbose (bool) – Be verbose or not.
References
- Returns
- Return type
Nothing. Modifies the passed object.
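The idea of counting shared neighbours with linear algebra and pruning weak edges by Jaccard index can be sketched as follows. This is a minimal sketch under stated assumptions, not adobo's code: `snn_sketch` is a hypothetical name, the output is a dense matrix rather than an edge list, and the Jaccard index is computed as shared / (2k - shared).

```python
import numpy as np

def snn_sketch(nn_idx, prune_snn=0.067):
    """Build a Jaccard-weighted SNN adjacency matrix from kNN indices."""
    n, k = nn_idx.shape
    # Indicator matrix: A[i, j] = 1 if cell j is among the k neighbours of cell i.
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), nn_idx.ravel()] = 1.0
    # A single matrix product counts the shared neighbours of every pair of cells.
    shared = A @ A.T
    # Jaccard index = |intersection| / |union|; both neighbour sets have size k.
    jaccard = shared / (2 * k - shared)
    np.fill_diagonal(jaccard, 0.0)
    # Pruning: edges with a Jaccard index below the threshold are removed.
    jaccard[jaccard < prune_snn] = 0.0
    return jaccard

nn = np.array([[1, 2], [0, 2], [3, 1], [2, 1]])
graph_default = snn_sketch(nn)               # default threshold keeps weak edges
graph_pruned = snn_sketch(nn, prune_snn=0.5)  # stricter threshold removes them
```

Raising `prune_snn` in the second call leaves fewer edges, which matches the behaviour described for the parameter.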
adobo.data module¶
Summary¶
This module contains a data storage class.
-
class
adobo.data.
dataset
(raw_mat, desc='no desc set', output_file=None, input_file=None, sparse=True, verbose=False)¶ Bases:
object
Storage container for raw, imputed and normalized data as well as analysis results.
-
_assays
¶ Holding information about what functions have been applied.
- Type
dict
-
count_data
¶ Raw read count matrix.
- Type
pandas.DataFrame
-
imp_count_data
¶ Raw data after imputing dropouts.
- Type
pandas.DataFrame
-
_low_quality_cells
¶ Low quality cells identified with adobo.preproc.find_low_quality_cells().
- Type
list
-
_norm_data
¶ Stores all analysis results. A nested dictionary.
- Type
dict
-
meta_cells
¶ A data frame containing meta data for cells.
- Type
pandas.DataFrame
-
meta_genes
¶ A data frame containing meta data for genes.
- Type
pandas.DataFrame
-
desc
¶ A string describing the dataset.
- Type
str
-
sparse
¶ Represent the data in a sparse data structure. Will save memory at the expense of time. Default: True
- Type
bool
-
output_file
¶ A filename that will be used when calling save().
- Type
str, optional
-
version
¶ The adobo package version used to create this data object.
- Type
str
-
add_meta_data
(axis, key, data, type_='cat')¶ Add meta data to the adobo object
Notes
Meta data can be added to cells or genes.
The parameter name ‘type_’ has an underscore to avoid conflict with Python’s internal type keyword.
- Parameters
axis ({‘cells’, ‘genes’}) – Are the data for cells or genes?
key (str) – The variable name for your data. No whitespaces and special characters.
data (numpy.ndarray, list or pandas.Series) – Data to add. Can be a basic Python list, a numpy array or a Pandas Series with an index. If the data type is numpy array or list, then the length must match the length of cells or genes. If the data type is a Pandas series, then the length does not need to match as long as the index is there. Data can be continuous or categorical and this must be specified with type_.
type_ ({‘cat’, ‘cont’}) – Specify if data are categorical or continuous. cat means categorical data and cont means continuous data. Default: ‘cat’
- Returns
- Return type
Nothing.
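The difference between passing a list and passing an indexed pandas Series can be illustrated with plain pandas. This sketch only shows how pandas aligns data on the index; that adobo stores meta data in exactly this way is an assumption, and the cell names and values are made up.

```python
import pandas as pd

cells = ["c1", "c2", "c3", "c4"]
meta_cells = pd.DataFrame(index=cells)

# A list or array carries no labels, so its length must equal the number of cells.
meta_cells["batch"] = ["a", "a", "b", "b"]

# A Series is matched on its index, so it may cover only some of the cells;
# cells without a value end up as missing (NaN).
score = pd.Series({"c3": 0.7, "c1": 0.2})
meta_cells["score"] = score
```

Here "batch" would correspond to type_='cat' and "score" to type_='cont'.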
-
assays
()¶ Displays a basic summary of the dataset and what analyses have been performed on it.
-
delete
(what)¶ Deletes analysis results from the norm_data dictionary
- Parameters
what (str, tuple) – A string (or tuple of strings) specifying keys of what you want to delete. Each string should be a key in norm_data.
Example
>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.normalize.norm(exp)
>>> ad.hvg.find_hvg(exp)
>>> ad.dr.pca(exp)
>>> ad.clustering.generate(exp)
>>> exp.norm_data['standard']['clusters']['leiden']
{'membership': V1        6
 V2        0
 V3        5
 V4        1
 V5        0
          ..
 V8377     1
 V8378    10
 V8379    11
 V8380     3
 V8381     2
 Length: 8381, dtype: int64}
>>> exp.delete(what=('clusters',))
>>> exp.norm_data['standard']['clusters']['leiden']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'leiden'
- Returns
- Return type
Nothing.
-
df_mem_usage
(var)¶ Memory usage for a data frame in megabytes
- Parameters
var (str) – Variable name as a string.
- Returns
Memory used in megabytes, with two decimals.
- Return type
float
-
get_assay
(name, lang=False)¶ Get info if a function has been applied.
-
is_normalized
()¶ Checks if normalized data can be found
- Returns
True if it is normalized otherwise False.
- Return type
bool
-
property
low_quality_cells
¶
-
property
norm_data
¶
-
print_dict
()¶
-
save
(filename=None, compress=True, verbose=False)¶ Serializes the object
Notes
This is a method so that the filename does not have to be remembered; it can be specified once with the output_file parameter when the object is created. Load the serialized object with joblib.load.
- Parameters
filename (str) – Output filename. Default: None
compress (bool) – Save with data compression or not. Default: True
verbose (bool) – Be verbose or not. Default: False
- Returns
- Return type
Nothing.
-
set_assay
(name, key=1)¶ Set the assay that was applied.
-
adobo.de module¶
adobo.dr module¶
Summary¶
Functions for dimensional reduction.
-
adobo.dr.
force_graph
(obj, name=(), iterations=1000, edgeWeightInfluence=1.0, jitterTolerance=1.0, barnesHutOptimize=True, scalingRatio=2.0, gravity=1.0, strongGravityMode=False, verbose=False)¶ Generates a force-directed graph
- Parameters
obj (adobo.data.dataset) – A data class object.
name (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on the last normalization that was applied.
iterations (int) – Number of iterations. Default: 1000
edgeWeightInfluence (float) – How much influence to give to edge weights. 0 is no influence and 1 is normal. Default: 1.0
jitterTolerance (float) – Amount of swinging allowed. Lower values give less speed and more precision. Default: 1.0
barnesHutOptimize (bool) – Run Barnes Hut optimization. Default: True
scalingRatio (float) – Amount of repulsion, higher values make a more sparse graph. Default: 2.0
gravity (float) – Attracts nodes to the center. Prevents islands from drifting away. Default: 1.0
strongGravityMode (bool) – A stronger gravity view. Default: False
verbose (bool) – Be verbose or not.
References
- Returns
- Return type
None
-
adobo.dr.
genes2scores
(obj, normalization=None, genes=[], bins=25, ctrl=100, retx=True, metadata=None)¶ Create cell scores from a list of genes
- Parameters
obj (adobo.data.dataset) – A dataset class object.
normalization (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on all normalizations available.
genes (list) – A list of genes to compute scores from.
bins (int) – Number of expression bins to be used. Default: 25
ctrl (int) – Number of control genes in each bin. Default: 100
retx (bool) – Return scores. Default: True
metadata (str) – If this is set to a string, then the scores will be set as a meta data variable with this column name. Default: None
References
- 1
Tirosh et al. (2016) Science. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq
- Returns
- Return type
Nothing. Modifies the passed object.
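The roles of the bins and ctrl parameters can be illustrated with a simplified version of control-matched gene set scoring in the style of Tirosh et al. This is a sketch under assumptions, not adobo's implementation: `score_cells` is a hypothetical name, genes are binned by the rank of their average expression, and the score is the mean expression of the gene set minus the mean expression of control genes drawn from the same bins.

```python
import numpy as np

def score_cells(expr, gene_idx, bins=25, ctrl=100, seed=0):
    """expr: genes x cells array; gene_idx: row indices of the gene set."""
    rng = np.random.default_rng(seed)
    avg = expr.mean(axis=1)
    # Assign every gene to one of `bins` equally sized bins by expression rank.
    ranks = np.argsort(np.argsort(avg))
    gene_bin = ranks * bins // len(avg)
    gene_idx = np.asarray(gene_idx)
    control = []
    for g in gene_idx:
        # Candidate controls: genes in the same bin that are not in the gene set.
        pool = np.setdiff1d(np.flatnonzero(gene_bin == gene_bin[g]), gene_idx)
        if pool.size:
            control.extend(rng.choice(pool, size=min(ctrl, pool.size), replace=False))
    control = np.unique(np.array(control, dtype=int))
    # Score = mean expression of the set minus mean expression of its controls.
    return expr[gene_idx].mean(axis=0) - expr[control].mean(axis=0)

# 40 genes x 3 cells; genes 0 and 1 are elevated in the first cell only.
expr = np.ones((40, 3))
expr[[0, 1], 0] = 5.0
scores = score_cells(expr, gene_idx=[0, 1], bins=4, ctrl=5)
```

Matching controls by expression bin keeps the score from simply reflecting how highly expressed the gene set is overall.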
-
adobo.dr.
irlb
(data_norm, scale=True, ncomp=75, var_weigh=True, seed=None)¶ Truncated SVD by implicitly restarted Lanczos bidiagonalization
Notes
The augmented implicitly restarted Lanczos bidiagonalization algorithm (IRLBA) finds a few approximate largest singular values and corresponding singular vectors using a method of Baglama and Reichel.
Cells should be rows and genes as columns.
- Parameters
data_norm (pandas.DataFrame) – A pandas data frame containing normalized gene expression data.
scale (bool) – Scales input data prior to PCA. Default: True
ncomp (int) – Number of components to return. Default: 75
var_weigh (bool) – Weigh by the variance of each component. Default: True
seed (int) – For reproducibility. Default: None
References
- 1
Baglama et al (2005) Augmented Implicitly Restarted Lanczos Bidiagonalization Methods SIAM Journal on Scientific Computing
- 2
- Returns
pandas.DataFrame – A pandas.DataFrame containing the components (columns).
pandas.DataFrame – A pandas.DataFrame containing the contributions of every gene (rows).
-
adobo.dr.
jackstraw
(obj, normalization=None, permutations=500, ncomp=None, subset_frac_genes=0.05, score_thr=0.001, fdr=0.01, retx=True, verbose=False)¶ Determine the number of relevant PCA components.
Notes
Permutes a subset of the data matrix and compares PCA scores with the original. The final output is a p-value for each component generated using a Chi-sq test.
- Parameters
obj (adobo.data.dataset) – A dataset class object.
normalization (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on all normalizations available.
permutations (int) – Number of permutations to run. Default: 500
ncomp (int) – Number of principal components to calculate significance for. If None, significance is calculated for all components previously saved by adobo.dr.pca(). Default: None
subset_frac_genes (float) – Proportion of genes to use. Default: 0.05
score_thr (float) – Threshold for significance. Default: 0.001
fdr (float) – Acceptable false discovery rate. Default: 0.01
retx (bool) – In addition to modifying the object, also return the results. Default: True
verbose (bool) – Be verbose. Default: False
References
- 1
Chung & Storey (2015) Statistical significance of variables driving systematic variation in high-dimensional data, Bioinformatics https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4325543/
- Returns
pandas.DataFrame – A genes by principal component data frame containing empirical p-values for the significance of every gene of the PC.
pandas.DataFrame – A data frame containing a single p-value for every PC generated from a Chi^2 test. Can be used to select the number of components to include by examining p-values.
-
adobo.dr.
pca
(obj, method='irlb', normalization=None, ncomp=75, genes='hvg', scale=True, var_weigh=True, use_combat=False, verbose=False, seed=42)¶ Runs Principal Component Analysis (PCA)
Notes
Scaling of the data is achieved by setting scale=True (default), which will center (subtract the column mean) and scale columns (divide by their standard deviation).
- Parameters
obj (adobo.data.dataset) – A dataset class object.
method ({‘irlb’, ‘svd’}) – Method to use for PCA. This does not matter much. Default: irlb
normalization (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on all normalizations available.
ncomp (int) – Number of components to return. Default: 75
genes ({‘hvg’, ‘all’} or list) – If a string, the allowed values are ‘hvg’ to use only the highly variable genes or ‘all’ to use all genes. If a list, then the list specifies the list of genes to use. Default: ‘hvg’
scale (bool) – Scales input data prior to PCA. Default: True
use_combat (bool) – Use ComBat corrected data or not. Default: False
var_weigh (bool) – Weigh by the variance of each component. Default: True
verbose (bool) – Be noisy or not. Default: False
seed (int) – For reproducibility (only irlb). Default: 42
References
- 1
- 2
Baglama et al (2005) Augmented Implicitly Restarted Lanczos Bidiagonalization Methods SIAM Journal on Scientific Computing
- 3
- 4
- Returns
Modifies the passed object. Results are stored in two dictionaries in the passed object: dr (containing the components) and dr_gene_contr (containing gene loadings).
- Return type
None
-
adobo.dr.
regress
(obj, target_vars=[], normalization=None)¶ Regress out the effects of certain meta data variables.
Notes
This function can be used to remove known confounding variables such as ambient gene expression modules, cell cycle genes or known experimental batches. It fits a linear model using numpy’s least square method (numpy.linalg.lstsq), predicts expression values from the model and then extracts the residuals, which become the new expression values.
- Parameters
obj (adobo.data.dataset) – A dataset class object.
target_vars (list) – A list of target meta data variables.
normalization (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on the last normalization used.
- Returns
- Return type
Nothing. Modifies the passed object.
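The regression step described in the Notes (fit a linear model with numpy.linalg.lstsq, then keep the residuals) can be sketched as follows. The function name `regress_out` and the inclusion of an intercept column are assumptions made for illustration; they are not taken from adobo's source.

```python
import numpy as np

def regress_out(expr, covariates):
    """expr: cells x genes; covariates: cells x variables. Returns residuals."""
    # Design matrix with an intercept column (an assumption of this sketch).
    X = np.column_stack([np.ones(len(covariates)), covariates])
    # Least-squares fit of all genes at once.
    coef, *_ = np.linalg.lstsq(X, expr, rcond=None)
    # Residuals (observed minus fitted values) become the new expression values.
    return expr - X @ coef

# One gene measured in four cells; the second batch is shifted by +10.
batch = np.array([0.0, 0.0, 1.0, 1.0])
expr = np.array([[1.0], [3.0], [11.0], [13.0]])
residuals = regress_out(expr, batch)
```

After regression the batch shift is gone and only the within-batch variation remains.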
-
adobo.dr.
svd
(data_norm, scale=True, ncomp=75, only_sdev=False)¶ Principal component analysis via singular value decomposition
- Parameters
data_norm (pandas.DataFrame) – A pandas data frame containing normalized gene expression data. Preferably this should be a subset of the normalized gene expression matrix containing highly variable genes.
scale (bool) – Scales input data prior to PCA. Default: True
ncomp (int) – Number of components to return. Default: 75
only_sdev (bool) – Only return the standard deviation of the components. Default: False
References
- Returns
pandas.DataFrame – A pandas.DataFrame containing the components (columns). Only if only_sdev=False.
pandas.DataFrame – A pandas.DataFrame containing the contributions of every gene (rows). Only if only_sdev=False.
pandas.DataFrame – A pandas.DataFrame containing standard deviations of components. Only if only_sdev is set to True.
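For orientation, PCA via singular value decomposition can be sketched with NumPy as below. This is a generic sketch rather than adobo's code: `pca_svd` is a hypothetical name, the input is assumed to be cells (rows) by genes (columns), and the data are always centered and scaled, corresponding to scale=True.

```python
import numpy as np

def pca_svd(X, ncomp=2):
    """X: cells x genes. Returns (components, gene loadings, standard deviations)."""
    X = np.asarray(X, dtype=float)
    # Center every gene and divide by its standard deviation.
    X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    comp = U[:, :ncomp] * s[:ncomp]             # cell coordinates per component
    loadings = Vt[:ncomp].T                     # contribution of every gene
    sdev = s[:ncomp] / np.sqrt(X.shape[0] - 1)  # standard deviation per component
    return comp, loadings, sdev

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
comp, loadings, sdev = pca_svd(X, ncomp=3)
```

Because every scaled gene has unit variance, the squared standard deviations of all components sum to the number of genes.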
-
adobo.dr.
tsne
(obj, run_on_PCA=True, name=None, perplexity=30, n_iter=2000, seed=None, verbose=False, **args)¶ Projects data to a two dimensional space using the tSNE algorithm.
Notes
It is recommended to perform this function on data in PCA space. This function calls sklearn.manifold.TSNE(), and any additional parameters will be passed to it.
- Parameters
obj (adobo.data.dataset) – A dataset class object.
run_on_PCA (bool) – To run tSNE on PCA components or not. If False then runs on the entire normalized gene expression matrix. Default: True
name (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on all normalizations available.
perplexity (float) – From [1]: The perplexity is related to the number of nearest neighbors that is used in other manifold learning algorithms. Larger datasets usually require a larger perplexity. Consider selecting a value between 5 and 50. Different values can result in significantly different results. Default: 30
n_iter (int) – Number of iterations. Default: 2000
seed (int) – For reproducibility. Default: None
verbose (bool) – Be verbose. Default: False
References
- 1
van der Maaten, L.J.P.; Hinton, G.E. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9:2579-2605, 2008.
- 2
https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
- Returns
- Return type
Nothing. Modifies the passed object.
-
adobo.dr.
umap
(obj, run_on_PCA=True, name=None, n_neighbors=15, distance='euclidean', n_epochs=None, learning_rate=1.0, min_dist=0.1, spread=1.0, seed=None, verbose=False, **args)¶ Projects data to a low-dimensional space using the Uniform Manifold Approximation and Projection (UMAP) algorithm
Notes
UMAP is a non-linear data reduction algorithm.
- Parameters
obj (adobo.data.dataset) – A dataset class object.
run_on_PCA (bool) – To run UMAP on PCA components or not. If False then runs on the entire normalized gene expression matrix. Default: True
name (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on all normalizations available.
n_neighbors (int) – The size of local neighborhood (in terms of number of neighboring sample points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general values should be in the range 2 to 100. Default: 15
distance (str) – The metric to use to compute distances in high dimensional space. Default: ‘euclidean’
n_epochs (int) – The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings. If None is specified a value will be selected based on the size of the input dataset (200 for large datasets, 500 for small). Default: None
learning_rate (float) – The initial learning rate for the embedding optimization. Default: 1.0
min_dist (float) – The effective minimum distance between embedded points. Default: 0.1
spread (float) – The effective scale of embedded points. Default: 1.0
seed (int) – For reproducibility. Default: None
verbose (bool) – Be verbose. Default: False
References
- 1
McInnes L, Healy J, Melville J (2018) UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, https://arxiv.org/abs/1802.03426
- 2
- 3
- Returns
- Return type
Nothing. Modifies the passed object.
adobo.hvg module¶
Summary¶
Functions for detection of highly variable genes.
-
adobo.hvg.
brennecke
(data_norm, log, ercc=None, fdr=0.1, ngenes=1000, minBiolDisp=0.5, verbose=False)¶ Implements the method of Brennecke et al. (2013) to identify highly variable genes
Notes
Fits data using GLM with Fisher Scoring. GLM code copied from (credits to @madrury for this code): https://github.com/madrury/py-glm
- Parameters
data_norm (pandas.DataFrame) – A pandas data frame containing normalized gene expression data.
log (bool) – If normalized data were log transformed or not.
ercc (pandas.DataFrame) – A pandas data frame containing normalized ERCC spikes.
fdr (float) – False Discovery Rate considered significant.
minBiolDisp (float) – Minimum percentage of variance due to biological factors.
ngenes (int) – Number of top highly variable genes to return.
verbose (bool) – Be verbose or not.
References
- 1
Brennecke et al. (2013) Nature Methods https://doi.org/10.1038/nmeth.2645
- Returns
A list containing highly variable genes.
- Return type
list
-
adobo.hvg.
chen2016
(data_norm, log, fdr=0.1, ngenes=1000)¶ This function implements the approach from Chen (2016) to identify highly variable genes.
Notes
Expression counts should be normalized and on a log scale.
- Parameters
data_norm (pandas.DataFrame) – A pandas data frame containing normalized gene expression data.
log (bool) – If normalized data were log transformed or not.
fdr (float) – False Discovery Rate considered significant.
ngenes (int) – Number of top highly variable genes to return.
References
- Returns
A list containing highly variable genes.
- Return type
list
-
adobo.hvg.
find_hvg
(obj, method='seurat', normalization=None, ngenes=1000, fdr=0.1, use_combat=False, verbose=False)¶ Finding highly variable genes
Notes
A wrapper function around the individual HVG functions, which can also be called directly.
The method ‘brennecke’ should not be applied on ‘fqn’ normalized data.
- Parameters
obj (adobo.data.dataset) – A dataset class object.
method ({‘seurat’, ‘brennecke’, ‘scran’, ‘chen2016’, ‘mm’}) – Specifies the method to be used.
normalization (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on all normalizations available.
ngenes (int) – Number of genes to return.
fdr (float) – False Discovery Rate threshold for significant genes applied to those methods that use it (brennecke, chen2016, mm). Note that the number of returned genes might be fewer than specified by ngenes because of FDR consideration.
use_combat (bool) – Use combat-adjusted data. Default: False
verbose (bool) – Be verbose or not.
Example
>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.normalize.norm(exp)
>>> ad.hvg.find_hvg(exp)
References
- 1
Yip et al. (2018) Briefings in Bioinformatics https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bby011/4898116
- Returns
- Return type
Nothing. Modifies the passed object.
-
adobo.hvg.
mm
(data_norm, log, fdr=0.1, ngenes=1000)¶ This function implements the approach from Andrews (2018).
Notes
Input should be normalized but not log-transformed.
- Parameters
data_norm (pandas.DataFrame) – A pandas data frame containing normalized counts.
fdr (float) – False Discovery Rate considered significant.
ngenes (int) – Number of top highly variable genes to return.
References
- Returns
A list containing highly variable genes.
- Return type
list
-
adobo.hvg.
scran
(data_norm, log, ngenes=1000, ercc=None)¶ This function implements the approach from the scran R package
Notes
Expression counts should be normalized and on a log scale.
Outline of the steps:
fits a polynomial regression model to mean and variance of the technical genes
decomposes the total variance of the biological genes by subtracting the technical variance predicted by the fit
sort based on biological variance
- Parameters
data_norm (pandas.DataFrame) – A pandas data frame containing normalized gene expression data.
log (bool) – If normalized data were log transformed or not.
ercc (pandas.DataFrame) – A pandas data frame containing normalized ERCC spikes.
ngenes (int) – Number of top highly variable genes to return.
References
- 1
Lun ATL, McCarthy DJ, Marioni JC (2016). “A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor.” F1000Research, https://doi.org/10.12688/f1000research.9501.2
- Returns
A list containing highly variable genes.
- Return type
list
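The three steps outlined in the Notes can be sketched with NumPy as follows. This is a simplified illustration, not the scran or adobo implementation: the function name, the polynomial degree of 2 and the simulated data are all assumptions.

```python
import numpy as np

def rank_by_biological_variance(bio, tech, ngenes=1000, degree=2):
    """bio, tech: genes x cells arrays of log-normalized expression."""
    # 1. Fit variance as a polynomial function of the mean on the technical genes.
    coef = np.polyfit(tech.mean(axis=1), tech.var(axis=1, ddof=1), deg=degree)
    # 2. Biological variance = total variance minus predicted technical variance.
    bio_var = bio.var(axis=1, ddof=1) - np.polyval(coef, bio.mean(axis=1))
    # 3. Sort genes by biological variance, largest first.
    return np.argsort(bio_var)[::-1][:ngenes]

rng = np.random.default_rng(0)
# Six spike-in genes with purely technical noise (sd 0.1) around means 1..6.
tech = rng.normal(loc=np.arange(1, 7)[:, None], scale=0.1, size=(6, 50))
# Four endogenous genes; only the third one (index 2) varies strongly.
bio = rng.normal(
    loc=[[2.0], [3.0], [4.0], [5.0]],
    scale=[[0.1], [0.1], [3.0], [0.1]],
    size=(4, 50),
)
top = rank_by_biological_variance(bio, tech, ngenes=1)
```

The function returns row indices; mapping them back to gene names is left to the caller.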
-
adobo.hvg.
seurat
(data, ngenes=1000, num_bins=20)¶ Retrieves a list of highly variable genes using Seurat’s strategy
Notes
The function bins the genes according to average expression, then calculates dispersion for each bin as variance to mean ratio. Within each bin, Z-scores are calculated and returned. Z-scores are ranked and the top 1000 are selected. Input data should be normalized first.
- Parameters
data (pandas.DataFrame) – A pandas data frame object containing raw read counts (rows=genes, columns=cells).
ngenes (int) – Number of top highly variable genes to return.
num_bins (int) – Number of bins to use.
References
- Returns
A list containing highly variable genes.
- Return type
list
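The binning strategy described in the Notes can be sketched as below. It is a simplified illustration and not Seurat's or adobo's exact procedure: the function name `hvg_dispersion`, the use of equal-width bins and the handling of bins with zero spread are assumptions.

```python
import numpy as np

def hvg_dispersion(data, ngenes=1000, num_bins=20):
    """data: genes x cells array of normalized expression. Returns gene row indices."""
    mean = data.mean(axis=1)
    # Dispersion = variance-to-mean ratio (guarding against division by zero).
    disp = data.var(axis=1, ddof=1) / np.maximum(mean, 1e-12)
    # Equal-width bins over the range of average expression.
    edges = np.linspace(mean.min(), mean.max(), num_bins + 1)
    bin_id = np.digitize(mean, edges[1:-1])
    z = np.zeros_like(disp)
    for b in np.unique(bin_id):
        members = bin_id == b
        sd = disp[members].std()
        if sd > 0:
            # Z-score of the dispersion within the bin.
            z[members] = (disp[members] - disp[members].mean()) / sd
    # Rank the Z-scores and keep the top genes.
    return np.argsort(z)[::-1][:ngenes]

data = np.array([
    [1.0, 1.0, 1.0, 1.0],      # low expression, no variation
    [0.9, 1.1, 0.9, 1.1],      # low expression, small variation
    [0.0, 2.0, 0.0, 2.0],      # low expression, high variation
    [10.0, 10.0, 10.0, 10.0],  # high expression, no variation
    [9.9, 10.1, 9.9, 10.1],    # high expression, small variation
    [5.0, 15.0, 5.0, 15.0],    # high expression, high variation
])
top = hvg_dispersion(data, ngenes=2, num_bins=2)
```

Computing Z-scores within bins lets a variable gene stand out against genes of similar expression level, rather than against all genes at once.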
adobo.normalize module¶
Summary¶
This module contains functions to normalize raw read counts.
-
adobo.normalize.
ComBat
(obj, normalization=None, meta_cells_var=None, mean_only=True, par_prior=True, verbose=False)¶ Adjust for batch effects in datasets where the batch covariate is known
Notes
ComBat is a classical method for batch correction and it has been shown to perform well on single cell data. The drawback of using ComBat is that all cells in a batch are used for estimating model parameters. This implementation follows the ComBat function in the R package SVA.
Commands should run in this order:
>>> ad.normalize.norm(exp)
>>> exp.add_meta_data(axis='cells', key='batch', data=batch_vector)
>>> ad.normalize.ComBat(exp, meta_cells_var='batch', verbose=True)
>>> ad.hvg.find_hvg(exp, use_combat=True)
>>> ad.dr.pca(exp, use_combat=True, verbose=True)
- Parameters
obj (adobo.data.dataset) – A data class object.
normalization (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on all normalizations available.
meta_cells_var (str) – Meta data variable. Should be a column name in data.dataset.meta_cells.
mean_only (bool) – Mean only version of ComBat. Default: True
par_prior (bool) – True indicates parametric adjustments will be used, False indicates non-parametric adjustments will be used. Default: True
verbose (bool) – Be verbose or not. Default: False
References
- 1
Johnson et al. (2007) Biostatistics. Adjusting batch effects in microarray expression data using empirical Bayes methods.
- 2
Buttner et al. (2019) Nat Met. A test metric for assessing single-cell RNA-seq batch correction
- Returns
- Return type
Modifies the passed object.
-
adobo.normalize.
clean_matrix
(data, obj, remove_low_qual=True, remove_mito=True, meta=False)¶
-
adobo.normalize.
clr
(data, axis='genes')¶ Performs centered log ratio normalization similar to Seurat
- Parameters
data (pandas.DataFrame) – A pandas data frame object containing raw read counts (rows=genes, columns=cells).
axis ({'genes', 'cells'}) – Normalize over genes or cells. Default: ‘genes’
References
- 1
Hafemeister et al. (2019) https://www.biorxiv.org/content/10.1101/576827v1
Example
>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.normalize.norm(exp, method='clr')
- Returns
A normalized data matrix with same dimensions as before.
- Return type
pandas.DataFrame
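As a point of reference, the textbook centered log ratio transform with a pseudocount can be sketched as follows. Seurat's variant and adobo's implementation may differ in detail; `clr_sketch`, the pseudocount of 1 and the numeric `axis` argument (0 centers every cell over its genes when rows are genes) are assumptions of this sketch.

```python
import numpy as np

def clr_sketch(counts, axis=0, pseudocount=1.0):
    """Centered log ratio: log-transform, then subtract the mean log value."""
    # The pseudocount keeps zero counts finite after the log transform.
    logged = np.log(np.asarray(counts, dtype=float) + pseudocount)
    # Subtracting the mean log equals dividing by the geometric mean.
    return logged - logged.mean(axis=axis, keepdims=True)

# Three genes (rows) measured in two cells (columns).
counts = np.array([[0, 3], [1, 7], [3, 0]])
clr_values = clr_sketch(counts)
```

The output has the same dimensions as the input, and the values sum to zero along the normalized axis.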
-
adobo.normalize.
fqn
(data)¶ Performs full quantile normalization (FQN)
Notes
FQN has been shown to perform well on single cell data and was a popular normalization scheme for microarray data. The present function does not handle ties well.
- Parameters
data (pandas.DataFrame) – A pandas data frame object containing raw read counts (rows=genes, columns=cells).
References
- 1
Bolstad et al. (2003) Bioinformatics https://academic.oup.com/bioinformatics/article/19/2/185/372664
- 2
Cole et al. (2019) Cell Systems https://www.biorxiv.org/content/10.1101/235382v2
Example
>>> import adobo as ad >>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True) >>> ad.normalize.norm(exp, method='fqn')
- Returns
A normalized data matrix with same dimensions as before.
- Return type
pandas.DataFrame
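Full quantile normalization can be sketched in a few lines of NumPy. This is a generic sketch, not adobo's code: `fqn_sketch` is a hypothetical name, and ties are simply broken by order of appearance, in line with the note that ties are not handled well.

```python
import numpy as np

def fqn_sketch(data):
    """data: genes x cells. Gives every cell (column) the same distribution."""
    data = np.asarray(data, dtype=float)
    # Rank of every value within its column (ties broken by order of appearance).
    order = np.argsort(data, axis=0, kind="stable")
    ranks = np.argsort(order, axis=0, kind="stable")
    # Reference distribution: mean of the sorted columns, rank by rank.
    reference = np.sort(data, axis=0).mean(axis=1)
    # Replace every value by the reference value of its rank.
    return reference[ranks]

data = np.array([
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 5.0, 6.0],
    [4.0, 2.0, 8.0],
])
normalized = fqn_sketch(data)
```

After normalization every column contains the same set of values, only in a different order.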
-
adobo.normalize.
norm
(obj, method='standard', name=None, use_imputed=False, log=True, log_func=<ufunc 'log2'>, small_const=1, remove_low_qual=True, remove_mito=True, gene_lengths=None, scaling_factor=10000, axis='genes', ngenes=2000, nworkers='auto', retx=False, verbose=False)¶ Normalizes gene expression data
Notes
A wrapper function around the individual normalization functions, which can also be called directly.
- Parameters
obj (
adobo.data.dataset
) – A dataset class object.
method ({‘standard’, ‘rpkm’, ‘fqn’, ‘clr’, ‘vsn’}) – Specifies the method to use. standard refers to the simplest normalization strategy, which scales the counts by the total number of reads per cell. rpkm performs RPKM normalization and requires the gene_lengths parameter to be set. fqn performs a full-quantile normalization. clr performs centered log ratio normalization. vsn performs a variance stabilizing normalization. Default: standard
name (str) – A chosen name for the normalization. It is used for storing and retrieving this normalization for plotting later. If None or an empty string, then it is set to the value of method.
use_imputed (bool) – Use imputed data. If set to True, then
adobo.preproc.impute()
must have been run previously. Default: False
log (bool) – Perform log transformation. Default: True
log_func (numpy.func) – Logarithmic function to use. For example: np.log2, np.log1p, np.log10, etc. Default: np.log2
small_const (float) – A small constant to add to expression values to avoid log’ing genes with zero expression. Default: 1
remove_low_qual (bool) – Remove low quality cells and uninformative genes identified by prior steps. Default: True
remove_mito (bool) – Remove mitochondrial genes (if these have been detected with adobo.preproc.find_mitochondrial_genes). Default: True
gene_lengths (
pandas.Series
or str) – A pandas.Series
containing the gene lengths in base pairs, with gene names set as index. The names must match the gene names used in data (the order does not need to match and any symbols not found in the data will be discarded). Normally gene lengths should be the combined length of exons for every gene. If gene_lengths is a str, then it is taken as a filename and loaded; the first column is gene names and the second column is the length, with one space as field separator. gene_lengths needs to be set _only_ if method=’rpkm’. Default: None
scaling_factor (int) – Scaling factor used to multiply the scaled counts with. Only used for method=’standard’. Default: 10000
axis ({‘genes’, ‘cells’}) – Only applicable when method=”clr”, defines the axis to normalize across. Default: ‘genes’
ngenes (int) – For method=’vsn’, number of genes to use when estimating parameters. Default: 2000
nworkers (int or {‘auto’}) – For method=’vsn’. If a string, then the only accepted value is ‘auto’, and the number of worker processes will be the total number of detected physical cores. If an integer then it specifies the number of worker processes. Default: ‘auto’
retx (bool) – Return the normalized data as well. Default: False
verbose (bool) – Be verbose or not. Default: False
References
- 1
Cole et al. (2019) Cell Systems https://www.biorxiv.org/content/10.1101/235382v2
Example
>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.normalize.norm(exp)
- Returns
- Return type
Nothing. Modifies the passed object.
-
adobo.normalize.
rpkm
(data, gene_lengths)¶ Normalize expression values as RPKM
Notes
This method should be used if you need to adjust for gene length, such as in a SMART-Seq2 protocol.
- Parameters
data (
pandas.DataFrame
) – A pandas data frame object containing raw read counts (rows=genes, columns=cells).
gene_lengths (
pandas.Series
or str) – Should contain the gene lengths in base pairs, with gene names set as index. The names must match the gene names used in data. Normally gene lengths should be the combined length of exons for every gene. If gene_lengths is a str, then it is taken as a file path and loaded; the first column is gene names and the second column is the length, with one space as field separator. An alternative format is a single column of combined exon lengths, where the number of rows matches the number of rows in the raw read count matrix and the rows are in the same order.
References
- 1
Conesa et al. (2016) Genome Biology https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0881-8
- Returns
A normalized data matrix with same dimensions as before.
- Return type
pandas.DataFrame
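The RPKM calculation itself is short: reads per kilobase of gene length per million reads in the cell. The sketch below assumes a genes-by-cells NumPy array and a vector of gene lengths already aligned with the rows; the function name `rpkm_sketch` is hypothetical and the name matching and file loading described above are left out.

```python
import numpy as np

def rpkm_sketch(counts, gene_lengths):
    """Reads per kilobase per million mapped reads.

    counts: genes x cells array of raw read counts.
    gene_lengths: gene lengths in base pairs, in the same order as the rows.
    Assumes every cell has at least one read.
    """
    counts = np.asarray(counts, dtype=float)
    lengths = np.asarray(gene_lengths, dtype=float).reshape(-1, 1)
    depth = counts.sum(axis=0, keepdims=True)  # total reads per cell
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return counts * 1e9 / (lengths * depth)

counts = np.array([[10, 20],
                   [30, 20]])
lengths = [1000, 2000]
values = rpkm_sketch(counts, lengths)
print(values)
```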
-
adobo.normalize.
standard
(data, scaling_factor=10000)¶ Performs a standard normalization by scaling with the total read depth per cell and then multiplying with a scaling factor.
- Parameters
data (
pandas.DataFrame
) – A pandas data frame object containing raw read counts (rows=genes, columns=cells).
scaling_factor (int) – Scaling factor used to multiply the scaled counts with. Default: 10000
References
- 1
Evans et al. (2018) Briefings in Bioinformatics https://academic.oup.com/bib/article/19/5/776/3056951
Example
>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.normalize.norm(exp, method='standard')
- Returns
A normalized data matrix with same dimensions as before.
- Return type
pandas.DataFrame
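As an illustration of what scaling by read depth means, the sketch below divides each column (cell) of a genes-by-cells NumPy array by its total and multiplies by the scaling factor. The function name `depth_normalize` and the handling of empty cells are assumptions for this example, not necessarily what adobo does internally.

```python
import numpy as np

def depth_normalize(counts, scaling_factor=10000):
    """Scale each cell so that its counts sum to scaling_factor."""
    counts = np.asarray(counts, dtype=float)
    depth = counts.sum(axis=0, keepdims=True)  # total reads per cell
    # Cells without any reads are left at zero instead of producing NaN.
    scaled = np.divide(counts, depth, out=np.zeros_like(counts), where=depth > 0)
    return scaled * scaling_factor

raw = np.array([[2, 0, 5],
                [6, 0, 5],
                [2, 0, 10]])
normalized = depth_normalize(raw)
# Optional log transform with a constant of 1 to avoid log(0).
logged = np.log2(normalized + 1)
print(normalized.sum(axis=0))
```

The final log step mirrors the log, log_func and small_const defaults of the norm wrapper (log2 with a constant of 1).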
-
adobo.normalize.
vsn
(data, min_cells=5, gmean_eps=1, ngenes=2000, nworkers='auto', verbose=False)¶ Performs variance stabilizing normalization based on a negative binomial regression model with regularized parameters
Notes
Use only with UMI counts. Adopts a subset of the functionality of vst in the R package sctransform.
- Parameters
data (
pandas.DataFrame
) – A pandas data frame object containing raw read counts (rows=genes, columns=cells).
min_cells (int) – Minimum number of cells expressing a gene for the gene to be used. Default: 5
gmean_eps (float) – A small constant to avoid log(0)=-Inf. Default: 1
ngenes (int) – Number of genes to use when estimating parameters. Default: 2000
nworkers (int or {‘auto’}) – If a string, then the only accepted value is ‘auto’, and the number of worker processes will be the total number of detected physical cores. If an integer then it specifies the number of worker processes. Default: ‘auto’
verbose (bool) – Be verbose or not. Default: False
References
Example
>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.normalize.norm(exp, method='vsn')
- Returns
A data matrix with adjusted counts.
- Return type
pandas.DataFrame
adobo.plotting module¶
Summary¶
Functions for plotting scRNA-seq data.
-
adobo.plotting.
cell_viz
(obj, reduction=None, normalization=(), clustering=(), metadata=(), genes=(), highlight=None, highlight_color=('black', 'red'), selection_mode=False, edges=False, cell_types=False, trajectory=None, filename=None, marker_size=0.8, font_size=8, colors='adobo', title=None, legend=True, legend_marker_scale=10, legend_position=(1, 1), min_cluster_size=10, figsize=(10, 10), margins=None, dark=False, aspect_ratio='equal', verbose=False, **args)¶ Generates a 2d scatter plot from an embedding
- Parameters
obj (
adobo.data.dataset
) – A data class object.
reduction ({‘tsne’, ‘umap’, ‘pca’, ‘force_graph’}) – The dimensional reduction to use. Default is to use the last one generated.
normalization (tuple) – A tuple of normalization to use. If it has the length zero, then the last generated will be used.
clustering (tuple) – Specifies the clustering outcomes to plot. If None, then the last generated clustering is plotted.
metadata (tuple, optional) – Specifies the metadata variables to plot.
genes (tuple, optional) – Specifies genes to plot. Can also be a regular expression matching a single gene name.
highlight (int or str) – Highlight a cluster or a single cell. Integer if cluster and string if a cell.
highlight_color (tuple) – The colors to use when highlighting a cluster. Should be a tuple of length two. The first item is the color of all clusters other than the selected one; the second item is the color of the highlighted cluster.
selection_mode (bool) – Enables interactive selection of cells. Prints the IDs of the cells inside the rectangle. Default: False
edges (bool) – Draw edges (only applicable if reduction=’force_graph’). Default: False
cell_types (bool) – Print cell type predictions, applicable if
adobo.bio.cell_type_predict()
has been run. Default: False
trajectory (str, optional) – The trajectory to plot. For example ‘slingshot’. Default: None
filename (str, optional) – Name of an output file instead of showing on screen.
marker_size (float) – The size of the markers. Default: 0.8
font_size (float) – Font size. Default: 8
colors ({‘adobo’, ‘random’} or list) – Can be: (i) “adobo” or “random”; or (ii) a list of colors with the same length as the number of factors. If colors is set to “adobo”, then colors are retrieved from
adobo._constants.CLUSTER_COLORS_DEFAULT
(but if the number of clusters exceeds 50, then random colors will be used). Default: ‘adobo’
title (str) – An optional title of the plot.
legend (bool) – Add legend or not. Default: True
legend_marker_scale (int) – Scale the markers in the legend. Default: 10
legend_position (tuple) – A tuple of length two describing the position of the legend. Default: (1,1)
min_cluster_size (int) – Can be used to prevent clusters below a certain number of cells from being plotted. Default: 10
figsize (tuple) – Figure size in inches. Default: (10, 10)
margins (dict) – Can be used to adjust margins. Should be a dict with one or more of the keys: ‘left’, ‘bottom’, ‘right’, ‘top’, ‘wspace’, ‘hspace’. Set verbose=True to figure out the present values. Default: None
dark (bool) – Make the background color black. Default: False
aspect_ratio ({‘equal’, ‘auto’}) – Set the aspect of the axis scaling, i.e. the ratio of y-unit to x-unit. Default: ‘equal’
verbose (bool) – Be verbose or not. Default: False
- Returns
- Return type
None
-
adobo.plotting.
exp_genes
(obj, normalization=None, clust_alg=None, cluster=None, min_cluster_size=10, violin=True, scale='width', fontsize=10, figsize=(10, 5), linewidth=0.5, filename=None, title=None, **args)¶ Compare number of expressed genes across clusters
- Parameters
obj (
adobo.data.dataset
) – A data class object.
normalization (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on the last normalization that was applied.
clust_alg (str) – Name of the clustering strategy. If empty or None, the last one will be used.
cluster (list or int) – List of cluster identifiers to plot. If a list, then expecting a list of cluster indices. An integer specifies only one cluster index. If None, then shows the expression across all clusters. Default: None
min_cluster_size (int) – Can be used to prevent clusters below a certain number of cells from being plotted. Default: 10
violin (bool) – Draws a violin plot (otherwise a box plot). Default: True
scale ({‘width’, ‘area’}) – If area, each violin will have the same area. If width, each violin will have the same width. Default: ‘width’
fontsize (int) – Specifies font size. Default: 10
figsize (tuple) – Figure size in inches. Default: (10, 5)
linewidth (float) – Border width. Default: 0.5
filename (str, optional) – Write to a file instead of showing the plot on screen.
title (str) – Title of the plot. Default: None
**args – Passed on into seaborn’s violinplot and boxplot functions
Example
>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.normalize.norm(exp, method='standard')
>>> ad.hvg.find_hvg(exp)
>>> ad.dr.pca(exp)
>>> ad.clustering.generate(exp, clust_alg='leiden')
>>> ad.plotting.exp_genes(exp)
- Returns
- Return type
Nothing
-
adobo.plotting.
genes_violin
(obj, normalization='', clust_alg=None, cluster=None, gene=None, rank_func=<function median>, top=10, violin=True, scale='width', fontsize=10, figsize=(10, 5), linewidth=0.5, filename=None, **args)¶ Plots individual genes using a violin plot (or box plot). Can be used to plot the top genes in the total dataset or the top genes in individual clusters. A specific gene can also be selected using the parameter gene.
- Parameters
obj (
adobo.data.dataset
) – A data class object.
normalization (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on the last normalization that was applied.
clust_alg (str) – Name of the clustering strategy. If empty or None, the last one will be used.
cluster (list or int) – List of cluster identifiers to plot. If a list, then expecting a list of cluster indices. An integer specifies only one cluster index. If None, then shows the expression across all clusters. Default: None
gene (str) – Compare a single gene across all clusters (can also be a regular expression, but it must match a single gene). If this is None, then the top is plotted based on the ranking function specified below. Default: None
rank_func (np.median) – Ranking function. numpy’s median is the default.
top (int) – Specifies the number of top scoring genes to include. Default: 10
violin (bool) – Draws a violin plot (otherwise a box plot). Default: True
scale ({‘width’, ‘area’}) – If area, each violin will have the same area. If width, each violin will have the same width. Default: ‘width’
fontsize (int) – Specifies font size. Default: 10
figsize (tuple) – Figure size in inches. Default: (10, 5)
linewidth (float) – Border width. Default: 0.5
filename (str, optional) – Write to a file instead of showing the plot on screen.
**args – Passed on into seaborn’s violinplot and boxplot functions
Example
>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.normalize.norm(exp, method='standard')
>>> ad.hvg.find_hvg(exp)
>>> ad.dr.pca(exp)
>>> ad.clustering.generate(exp, clust_alg='leiden')
>>>
>>> # top 10 genes in cluster 0
>>> ad.plotting.genes_violin(exp, top=10, cluster=0)
>>>
>>> # top 10 genes across all clusters
>>> ad.plotting.genes_violin(exp, top=10)
>>>
>>> # plotting one gene across all clusters
>>> ad.plotting.genes_violin(exp, gene='ENSG00000163220')
>>>
>>> # same, but using a box plot
>>> ad.plotting.genes_violin(exp, gene='ENSG00000163220', violin=False)
- Returns
- Return type
Nothing
-
adobo.plotting.
jackstraw_barplot
(obj, normalization=None, fontsize=12, figsize=(15, 6), filename=None, title=None, **args)¶ Make a barplot of jackstraw p-values for principal components
- Parameters
obj (
adobo.data.dataset
) – A data class object.
normalization (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on the last normalization that was applied.
fontsize (int) – Specifies font size. Default: 12
figsize (tuple) – Figure size in inches. Default: (15, 6)
linewidth (float) – Border width. Default: 0.5
filename (str, optional) – Write to a file instead of showing the plot on screen.
title (str) – Title of the plot. Default: None
**args – Passed on into seaborn’s violinplot and boxplot functions
- Returns
- Return type
Nothing
-
adobo.plotting.
overall
(obj, what='cells', how='histogram', bin_size=100, cut_off=None, color='#E69F00', title=None, filename=None, **args)¶ Generates a plot of read counts per cell or expressed genes per cell
- Parameters
obj (
adobo.data.dataset
) – A data class object.
what ({‘cells’, ‘genes’}) – If ‘cells’ then plots the number of reads per cell. If ‘genes’, then plots the number of expressed genes per cell. Default: ‘cells’
how ({‘histogram’, ‘boxplot’, ‘barplot’, ‘violin’}) – Type of plot to generate. Default: ‘histogram’
bin_size (int) – If how is a histogram, then this is the bin size. Default: 100
cut_off (int) – Set a cut off for genes or reads by drawing a red line and print the number of cells over and under the cut off. Only valid if how=’histogram’. Default: None
color (str) – Color of the plot. Default: ‘#E69F00’
title (str) – Change the default title of the plot. Default: None
filename (str, optional) – Write plot to file instead of showing it on the screen. Default: None
- Returns
- Return type
None
-
adobo.plotting.
overall_scatter
(obj, color_kept='#E69F00', color_filtered='red', title=None, filename=None, **args)¶ Generates a scatter plot showing the total number of reads on one axis and the number of detected genes on the other axis
- Parameters
obj (
adobo.data.dataset
) – A data class object.
color_kept (str) – Color of the plot. Default: ‘#E69F00’
color_filtered (str) – Color of the cells that have been filtered out. Default: ‘red’
title (str) – Title of the plot. Default: None
filename (str, optional) – Write plot to file instead of showing it on the screen. Default: None
- Returns
- Return type
None
-
adobo.plotting.
pca_contributors
(obj, normalization=None, how='heatmap', clust_alg=None, cluster=None, all_genes=False, dim=[0, 1, 2], top=20, color='#E69F00', fontsize=6, figsize=(10, 5), filename=None, verbose=False, **args)¶ Examine the top contributing genes to each PCA component. Optionally, one can examine the PCA components of a cell cluster instead.
Note
The function takes half of the genes from those with the top negative scores and the other half from those with the top positive scores. Additional parameters are passed into
matplotlib.pyplot.savefig().
- Parameters
obj (
adobo.data.dataset
) – A data class object.
normalization (str) – The name of the normalization to operate on. If empty or None, the last one generated is used. Default: None
how ({‘heatmap’, ‘barplot’}) – How to visualize, can be barplot or heatmap. If ‘barplot’, then shows the PCA scores. If ‘heatmap’, then visualizes the expression of genes with top PCA scores. Default: ‘heatmap’
clust_alg (str) – Name of the clustering strategy. If empty or None, the last one generated is used. Default: None
cluster (int) – Name of the cluster.
all_genes (bool) – If cluster is set, then indicates if PCA should be computed on all genes or only on the highly variable genes. Default: False
dim (list or int) – If list, then it specifies indices of components to plot. If integer, then it specifies the first components to plot. First component has index zero. Default: [0, 1, 2]
top (int) – Specifies the number of top scoring genes to include (i.e. will use this many positive/negative scoring genes). Default: 20
color (str) – Color of the bars. Default: “#E69F00”
fontsize (int) – Specifies font size. Default: 6
figsize (tuple) – Figure size in inches. Default: (10, 5)
filename (str, optional) – Write to a file instead of showing the plot on screen. File type is determined by the filename extension.
verbose (bool) – Be verbose or not. Default: False
Example
>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.normalize.norm(exp, method='standard')
>>> ad.hvg.find_hvg(exp)
>>> ad.dr.pca(exp)
>>> ad.plotting.pca_contributors(exp, dim=4)
>>> # decomposition of a specific cluster
>>> ad.clustering.generate(exp, clust_alg='leiden')
>>> ad.plotting.pca_contributors(exp, dim=4, cluster=0)
- Returns
- Return type
Nothing
-
adobo.plotting.
pca_elbow
(obj, normalization=None, comp_max=100, all_genes=False, filename=None, font_size=8, figsize=(6, 4), color='#E69F00', title='PCA elbow plot', **args)¶ Generates a PCA elbow plot
Notes
Can be useful for determining the number of components to include. Here, PCA is computed using singular value decomposition.
- Parameters
obj (
adobo.data.dataset
) – A data class object.
normalization (str) – The name of the normalization to operate on. If empty or None, the last one generated is used. Default: None
comp_max (int) – Maximum number of components to include. Default: 100
all_genes (bool) – Run on all genes, i.e. not only highly variable genes. Default: False
filename (str, optional) – Name of an output file instead of showing on screen.
font_size (float) – Font size. Default: 8
figsize (tuple) – Figure size in inches. Default: (6, 4)
color (str) – Color of the line. Default: ‘#E69F00’
title (str) – A plot title.
- Returns
- Return type
Nothing.
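The quantity plotted in an elbow plot can be obtained directly from the singular values. The sketch below is a rough illustration in NumPy, assuming a cells-by-genes array; the function name `explained_variance_ratio` and the toy data are made up for this example and do not reflect how adobo stores or selects its data.

```python
import numpy as np

def explained_variance_ratio(data, comp_max=100):
    """Fraction of variance explained by each principal component.

    data: cells x genes array. Components are ordered from most to
    least variance, which is the order used on the x-axis of an elbow plot.
    """
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    # Singular values are returned in descending order.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return (variances / variances.sum())[:comp_max]

rng = np.random.default_rng(0)
# Toy data: two high-variance directions followed by three noisy ones.
data = rng.normal(size=(50, 5)) * np.array([10, 5, 1, 1, 1])
ratios = explained_variance_ratio(data)
print(ratios)
```

The "elbow" is the component after which the ratios stop dropping sharply.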
-
adobo.plotting.
tree
(obj, normalization='', clust_alg=None, method='complete', cell_types=True, min_cluster_size=10, fontsize=8, figsize=(10, 5), filename=None, title=None, **args)¶ Generates a dendrogram of cluster relationships
- Parameters
obj (
adobo.data.dataset
) – A data class object.
normalization (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on the last normalization that was applied.
clust_alg (str) – Name of the clustering strategy. If empty or None, the last one will be used.
method ({‘complete’, ‘single’, ‘average’, ‘weighted’, ‘centroid’, ‘median’, ‘ward’}) – The linkage algorithm to use. Default: ‘complete’
cell_types (bool) – Add putative cell type annotations (if available). Default: True
min_cluster_size (int) – Can be used to prevent clusters below a certain number of cells from being plotted. Default: 10
fontsize (int) – Specifies font size. Default: 8
figsize (tuple) – Figure size in inches. Default: (10, 5)
filename (str) – Write to a file instead of showing the plot on screen. Default: None
title (str) – Plot title.
Example
>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.preproc.symbol_switch(exp, species='human')
>>> ad.normalize.norm(exp, method='standard')
>>> ad.hvg.find_hvg(exp)
>>> ad.dr.pca(exp)
>>> ad.clustering.generate(exp, clust_alg='leiden')
>>> ad.bio.cell_type_predict(exp, verbose=True)
>>> ad.plotting.tree(exp)
- Returns
- Return type
Nothing
adobo.preproc module¶
Summary¶
Functions for pre-processing scRNA-seq data.
-
adobo.preproc.
find_ercc
(obj, ercc_pattern='^ERCC[_-]\\S+$', verbose=False)¶ Flag ERCC spikes
- Parameters
obj (
adobo.data.dataset
) – A data class object.
ercc_pattern (str, optional) – A regular expression matching ERCC gene symbols. Default: “^ERCC[_-]\S+$”
verbose (bool, optional) – Be verbose or not. Default: False
- Returns
Number of detected ercc spikes.
- Return type
int
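To see what the default pattern accepts, the sketch below runs it over a few gene names with Python's `re` module. The helper `find_ercc_names` and the gene list are made up for this illustration; only the pattern itself comes from the signature above.

```python
import re

# Default pattern from the find_ercc signature: "ERCC", then "_" or "-",
# then one or more non-whitespace characters up to the end of the name.
ERCC_PATTERN = re.compile(r"^ERCC[_-]\S+$")

def find_ercc_names(gene_names, pattern=ERCC_PATTERN):
    """Return the names that look like ERCC spike-ins."""
    return [name for name in gene_names if pattern.search(name)]

genes = ["ERCC-00002", "ERCC_00130", "ERCC1", "Gapdh"]
spikes = find_ercc_names(genes)
print(len(spikes))
```

Requiring the separator keeps ordinary genes whose symbols merely start with "ERCC" (such as ERCC1) from being flagged as spikes.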
-
adobo.preproc.
find_low_quality_cells
(obj, rRNA_genes, sd_thres=3, seed=42, verbose=False)¶ Statistical detection of low quality cells using Mahalanobis distances
Notes
Mahalanobis distances are computed from five quality metrics. A robust estimate of covariance is used in the Mahalanobis function. By default, cells with Mahalanobis distances three standard deviations from the mean are considered outliers. The five metrics are:
the log-transformed number of molecules detected
the number of genes detected
the percentage of reads mapping to ribosomal RNA genes
the percentage of reads mapping to mitochondrial genes
ERCC recovery (if available)
- Parameters
obj (
adobo.data.dataset
) – A data class object.
rRNA_genes (list or str) – Either a list of rRNA genes or a string containing the path to a file containing the rRNA genes (one gene per line).
sd_thres (float) – Number of standard deviations to consider significant, i.e. cells whose Mahalanobis distance lies this many standard deviations from the mean are considered low quality. Set to a higher value to remove fewer cells. Default: 3
seed (float) – For the random number generator. Default: 42
verbose (bool) – Be verbose or not. Default: False
- Returns
A list of low quality cells that were identified, and also modifies the passed object.
- Return type
list
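The outlier rule described in the notes can be sketched as follows. This is a simplified illustration in NumPy: it uses the ordinary sample covariance rather than a robust estimate, a one-sided cutoff of mean plus sd_thres standard deviations, and synthetic metrics, all of which are assumptions made for the example. The function name `mahalanobis_outliers` is hypothetical.

```python
import numpy as np

def mahalanobis_outliers(metrics, sd_thres=3):
    """Flag rows whose Mahalanobis distance is unusually large.

    metrics: cells x metrics array of quality measurements.
    Returns the distances and a boolean mask of flagged cells.
    """
    metrics = np.asarray(metrics, dtype=float)
    diff = metrics - metrics.mean(axis=0)
    # Plain sample covariance; a robust estimator would down-weight outliers.
    inv_cov = np.linalg.inv(np.cov(metrics, rowvar=False))
    # d_i = sqrt(diff_i^T * inv_cov * diff_i) for every cell i.
    dist = np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))
    cutoff = dist.mean() + sd_thres * dist.std()
    return dist, dist > cutoff

rng = np.random.default_rng(42)
metrics = rng.normal(size=(200, 3))                 # 200 ordinary cells
metrics = np.vstack([metrics, [10.0, 10.0, 10.0]])  # one extreme cell
dist, is_outlier = mahalanobis_outliers(metrics)
print(np.flatnonzero(is_outlier))
```

Because the extreme cell also inflates the sample covariance, a robust covariance estimate (as the notes describe) generally separates such cells more clearly than this simplified version.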
-
adobo.preproc.
find_mitochondrial_genes
(obj, mito_pattern='^mt-', genes=None, verbose=False)¶ Finds mitochondrial genes and adds the percentage of mitochondrial expression (of total expression) to the cellular meta data
- Parameters
obj (
adobo.data.dataset
) – A data class object.
mito_pattern (str) – A regular expression matching mitochondrial gene symbols. Default: “^mt-”
genes (list, optional) – Instead of using mito_pattern, specify a list of genes that are mitochondrial.
verbose (boolean) – Be verbose or not. Default: False
Example
>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.preproc.find_mitochondrial_genes(exp)
- Returns
Number of mitochondrial genes detected.
- Return type
int
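A minimal sketch of the two things this function reports, assuming a genes-by-cells NumPy array and a matching list of gene symbols: the number of genes matching the pattern and the percentage of each cell's counts that come from them. The function name `percent_mito` and the toy data are hypothetical, and the match here is case-sensitive, which may differ from adobo's behaviour.

```python
import re
import numpy as np

def percent_mito(counts, gene_names, mito_pattern="^mt-"):
    """Return (number of mitochondrial genes, percent mito counts per cell)."""
    counts = np.asarray(counts, dtype=float)
    is_mito = np.array([re.search(mito_pattern, g) is not None for g in gene_names])
    total = counts.sum(axis=0)
    mito = counts[is_mito].sum(axis=0)
    # Cells without any counts get 0 percent instead of NaN.
    pct = np.divide(mito, total, out=np.zeros_like(total), where=total > 0) * 100
    return int(is_mito.sum()), pct

genes = ["mt-Nd1", "mt-Co1", "Actb"]
counts = np.array([[1, 0],
                   [1, 0],
                   [8, 10]])
n_mito, pct = percent_mito(counts, genes)
print(n_mito, pct)
```

With a case-sensitive match like this one, human symbols such as MT-ND1 would need a pattern like "^MT-" instead.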
-
adobo.preproc.
impute
(obj, filtered=True, res=0.5, drop_thre=0.5, nworkers='auto', verbose=True)¶ Impute dropouts using the method described in Li (2018) Nature Communications
Notes
Dropouts are artifacts in scRNA-seq data. One method to alleviate the problem with dropouts is to perform imputation (i.e. replacing missing data points with predicted values).
The present method uses a different procedure for subpopulation identification as compared with the original paper.
- Parameters
obj (
adobo.data.dataset
) – A data class object.
filtered (bool) – If data have been filtered using
adobo.preproc.simple_filter()
, run imputation on filtered data; otherwise runs on the entire raw read count matrix. Default: True
res (float) – Resolution parameter for the Leiden clustering, change to modify cluster resolution. Default: 0.5
drop_thre (float) – Drop threshold. Default: 0.5
nworkers (int or {‘auto’}) – If a string, then the only accepted value is ‘auto’, and the number of worker processes will be the total number of detected physical cores. If an integer then it specifies the number of worker processes. Default: ‘auto’
verbose (bool) – Be verbose or not. Default: True
References
- 1
Li & Li (2018) An accurate and robust imputation method scImpute for single-cell RNA-seq data https://www.nature.com/articles/s41467-018-03405-7
- Returns
- Return type
Modifies the passed object.
-
adobo.preproc.
mad_outlier
(obj, nmads=3, verbose=False)¶ Outlier detection based on median absolute deviation
Notes
Removes cells for which either of two quality metrics lies more than a given number of median absolute deviations (MADs) below the median. The quality metrics are the log of the library size and the log of the number of detected genes. The principle is similar to Lun et al. The default is three MADs.
- Parameters
obj (
adobo.data.dataset
) – A data class object.
nmads (int) – Number of median absolute deviations below the median for the cell to be considered an outlier. Default: 3
verbose (bool) – Be verbose or not. Default: False
References
- 1
Lun et al. (2016) F1000Res, A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5112579/
- Returns
- Return type
Modifies the passed object.
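The rule from the notes can be written in a few lines. The sketch below is an illustration with NumPy and made-up numbers; the function name `low_mad_outliers` is hypothetical, and the MAD is used unscaled, which may differ from what adobo or the workflow by Lun et al. does.

```python
import numpy as np

def low_mad_outliers(values, nmads=3):
    """Flag values lying more than nmads MADs below the median."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    return values < median - nmads * mad

# Toy per-cell metrics: the last cell has very few reads and genes.
library_size = np.array([1000, 1100, 950, 1050, 990, 10])
detected_genes = np.array([500, 520, 480, 510, 495, 8])

# A cell is an outlier if either log-transformed metric is too low.
flagged = (low_mad_outliers(np.log(library_size))
           | low_mad_outliers(np.log(detected_genes)))
print(flagged)
```

Only the low tail is tested, so cells with unusually many reads or genes are not removed by this rule.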
-
adobo.preproc.
reset_filters
(obj)¶ Resets cell and gene filters
- Parameters
obj (
adobo.data.dataset
) – A data class object.
- Returns
- Return type
Nothing. Modifies the passed object.
-
adobo.preproc.
simple_filter
(obj, what='cells', minreads=1000, maxreads=None, mingenes=None, maxgenes=None, min_exp=0.001, verbose=False)¶ Removes cells with too few reads or genes with very low expression
Notes
Default is to remove cells.
- Parameters
obj (
adobo.data.dataset
) – A data class object.
what ({‘cells’, ‘genes’}) – Determines what should be filtered from the expression matrix. If ‘cells’, then cells are filtered. If ‘genes’, then genes are filtered. Default: ‘cells’
minreads (int, optional) – When filtering cells, defines the minimum number of reads per cell needed to keep the cell. Default: 1000
maxreads (int, optional) – When filtering cells, defines the maximum number of reads allowed to keep the cell. Useful for filtering out suspected doublets. Default: None
mingenes (float, int) – When filtering cells, defines the minimum number of genes that must be expressed in a cell to keep it. Default: None
maxgenes (float, int) – When filtering cells, defines the maximum number of genes that a cell is allowed to express to keep it. Default: None
min_exp (float, int) – Used to set a threshold for how to filter out genes. If integer, defines the minimum number of cells that must express a gene to keep the gene. If float, defines the minimum fraction of cells that must express the gene to keep the gene. Set to None to ignore this option. Default: 0.001
verbose (bool, optional) – Be verbose or not. Default: False
Example
>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.preproc.simple_filter(exp, what='cells', minreads=1500)
>>> ad.preproc.simple_filter(exp, what='genes')
- Returns
Number of removed cells or genes.
- Return type
int
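The thresholds above translate into simple boolean masks. The sketch below assumes a genes-by-cells NumPy array; the helper names `cells_to_keep` and `genes_to_keep` are hypothetical, and treating the thresholds as inclusive is an assumption rather than documented behaviour.

```python
import numpy as np

def cells_to_keep(counts, minreads=1000, maxreads=None):
    """Boolean mask over columns (cells) based on total read count."""
    reads = np.asarray(counts).sum(axis=0)
    keep = reads >= minreads
    if maxreads is not None:
        keep &= reads <= maxreads
    return keep

def genes_to_keep(counts, min_exp=0.001):
    """Boolean mask over rows (genes) based on how many cells express them.

    A float min_exp is read as a fraction of cells, an int as a cell count.
    """
    counts = np.asarray(counts)
    n_expressing = (counts > 0).sum(axis=1)
    if isinstance(min_exp, float):
        min_cells = min_exp * counts.shape[1]
    else:
        min_cells = min_exp
    return n_expressing >= min_cells

counts = np.array([[900, 10, 0],
                   [800, 20, 0],
                   [0, 5, 0]])
cell_mask = cells_to_keep(counts, minreads=1000)
gene_mask = genes_to_keep(counts, min_exp=2)
print(int((~cell_mask).sum()), int((~gene_mask).sum()))
```

Counting the False entries of a mask gives the number of removed cells or genes, which corresponds to the integer the function is documented to return.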
-
adobo.preproc.
symbol_switch
(obj, species)¶ Changes gene symbol format
Notes
If gene symbols are in the format ENS[0-9]+, this function changes gene identifiers to symbol_ENS[0-9]+.
- Parameters
obj (
adobo.data.dataset
) – A data class object.
species ({‘human’, ‘mouse’}) – Species. Default: ‘human’
Example
>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.preproc.symbol_switch(exp, species='human')
- Returns
- Return type
Modifies the passed object.
adobo.traj module¶
Summary¶
Functions for trajectory analysis.
-
adobo.traj.
slingshot
(obj, name=(), min_cluster_size=10, verbose=False)¶ Trajectory analysis on the cluster level following the strategy in the R package slingshot
Notes
Slingshot’s approach takes cells in a low dimensional space (UMAP is used below) and a clustering to generate a graph where vertices are clusters.
Only Slingshot’s ‘getLineages’ method is used at the moment.
References
- 1
Street et al. (2018) BMC Genomics. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics
- 2
https://bioconductor.org/packages/release/bioc/html/slingshot.html
- Parameters
obj (
adobo.data.dataset
) – A data class object.
name (tuple) – A tuple of normalization to use. If it has the length zero, then all available normalizations will be used.
min_cluster_size (int) – Minimum number of cells per cluster to include the cluster. Default: 10
verbose (bool, optional) – Be verbose or not. Default: False
- Returns
- Return type
Nothing. Modifies the passed object.
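The cluster-level graph mentioned in the notes is, in Slingshot, a minimum spanning tree over the clusters. The sketch below is a rough illustration of that idea in NumPy: it connects cluster centroids in a low-dimensional embedding using plain Euclidean distances and Prim's algorithm. The function name `cluster_mst` and the toy embedding are hypothetical, and Slingshot's own distance measure and lineage extraction are not reproduced here.

```python
import numpy as np

def cluster_mst(embedding, labels):
    """Minimum spanning tree between cluster centroids.

    embedding: cells x dimensions array (for example UMAP coordinates).
    labels: cluster label of every cell.
    Returns the cluster ids and a list of (cluster, cluster) edges.
    """
    embedding = np.asarray(embedding, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels).tolist()
    centroids = np.array([embedding[labels == c].mean(axis=0) for c in clusters])
    # Pairwise Euclidean distances between centroids.
    dist = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)

    # Prim's algorithm: grow the tree from the first cluster, always adding
    # the shortest edge that reaches a cluster not yet in the tree.
    in_tree = [0]
    edges = []
    while len(in_tree) < len(clusters):
        best = None
        for i in in_tree:
            for j in range(len(clusters)):
                if j not in in_tree and (best is None or dist[i, j] < dist[best]):
                    best = (i, j)
        edges.append((clusters[best[0]], clusters[best[1]]))
        in_tree.append(best[1])
    return clusters, edges

embedding = [[0, 0], [0, 1], [5, 0], [5, 1], [10, 0], [10, 1]]
labels = [0, 0, 1, 1, 2, 2]
clusters, edges = cluster_mst(embedding, labels)
print(edges)
```

Lineages can then be read off the tree as paths from a chosen starting cluster to the leaves.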