adobo package¶

Subpackages¶

Submodules¶

adobo.IO module¶

adobo.bio module¶

Summary¶

Functions related to biology.

adobo.bio.cell_cycle_predict(obj, clf, tr_features, name=(), verbose=False)¶

Predicts cell cycle phase

Notes

The classifier is trained on mouse data, so it should _only_ be used on mouse data unless it is retrained on data from another species. Gene identifiers must be Ensembl identifiers (prefixed with ‘ENSMUSG’); plain gene symbols are not enough. Results are stored as a column in the meta_cells data frame of the passed object. Probability scores are not returned.

Parameters

obj (adobo.data.dataset) – A data class object.
clf (sklearn.linear_model.SGDClassifier) – The classifier.
tr_features (list) – Training features.
name (tuple) – A tuple of normalizations to use. If the tuple is empty, all available normalizations are used. Default: ()
verbose (bool) – Be verbose. Default: False

Returns

Return type

Modifies the passed object.

adobo.bio.cell_cycle_train(verbose=False)¶

Trains a cell cycle classifier using Stochastic Gradient Descent with data from Buettner et al.

Notes

Genes are selected from GO:0007049

The classifier only needs to be trained once; on subsequent calls it is loaded (deserialized) from disk.

Parameters: verbose (bool) – Be verbose or not. Default: False

References

1: Buettner et al. (2015) Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat Biotech.

Returns

sklearn.linear_model.SGDClassifier – A trained classifier.
list – Containing training features.

adobo.bio.cell_type_predict(obj, name=(), clustering=(), min_cluster_size=10, cell_type_markers=None, verbose=False)¶

Predicts cell types using the expression of marker genes

Notes

Gene identifiers should be in symbol form, not ensembl identifiers, etc.

Parameters

obj (adobo.data.dataset) – A data class object.
name (tuple) – A tuple of normalizations to use. If the tuple is empty, all available normalizations are used. Default: ()
clustering (tuple, optional) – Specifies the clustering outcomes to work on.
min_cluster_size (int) – Minimum number of cells per cluster; clusters smaller than this are ignored. Default: 10
cell_type_markers (pandas.DataFrame) – Source of gene markers used to define cell types. This is set to None as default, indicating that PanglaoDB markers will be used. To use custom markers, set this to a pandas data frame where the first column is a gene and the second column is the name of the cell type (every cell type will have multiple rows). Default: None
verbose (bool) – Be verbose or not. Default: False

Returns

Return type

Modifies the passed object.
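
The custom marker table described for cell_type_markers can be sketched as follows. This is only an illustration of the two-column layout (gene first, cell type second); the column names and the chosen genes are assumptions, not something the package prescribes.

```python
import pandas as pd

# Hypothetical custom markers: first column is the gene symbol,
# second column is the cell type. Column names are arbitrary here.
markers = pd.DataFrame({
    "gene": ["CD3D", "CD3E", "MS4A1", "CD79A"],
    "cell_type": ["T cells", "T cells", "B cells", "B cells"],
})

# Every cell type spans several rows, one row per marker gene.
genes_per_type = markers.groupby("cell_type")["gene"].apply(list).to_dict()
```

A frame of this shape would then be passed as cell_type_markers=markers.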

adobo.bulk module¶

adobo.clustering module¶

Summary¶

This module contains functions to cluster data.

adobo.clustering.generate(obj, k=10, name=None, distance='euclidean', graph='snn', clust_alg='leiden', prune_snn=0.067, res=0.8, save_graph=True, seed=42, verbose=False)¶

A wrapper function for generating single cell clusters from a shared nearest neighbor graph with the Leiden algorithm

Parameters

obj (adobo.data.dataset) – A dataset class object.
k (int) – Number of nearest neighbors. Default: 10
name (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on all normalizations available.
distance (str) – Distance metric to use. See here for valid choices: https://tinyurl.com/y4bckf7w Default: ‘euclidean’
target ({‘irlb’, ‘svd’}) – The dimensionality reduction result to run on. Default: irlb
graph ({‘snn’}) – Type of graph to generate. Only shared nearest neighbor (snn) supported at the moment.
clust_alg ({‘leiden’, ‘louvain’, ‘walktrap’, ‘spinglass’, ‘multilevel’, ‘infomap’, ‘label_prop’, ‘leading_eigenvector’}) – Clustering algorithm to be used. Default: ‘leiden’
prune_snn (float) – Threshold for pruning the SNN graph, i.e. the edges with lower value (Jaccard index) than this will be removed. Set to 0 to disable pruning. Increasing this value will result in fewer edges in the graph. Default: 0.067
res (float) – Resolution parameter for the Leiden algorithm _only_; change to modify cluster resolution. Default: 0.8
save_graph (bool) – To save the graph or not. Default: True
seed (int) – For reproducibility.
verbose (bool) – Be verbose or not.

References

1: Yang et al. (2016) A Comparative Analysis of Community Detection Algorithms on Artificial Networks. Scientific Reports

Returns: A dict containing cluster sizes (number of cells), only if retx is set to True.
Return type: dict

adobo.clustering.igraph(snn_graph, clust_alg)¶

Runs clustering functions within igraph

Parameters

snn_graph (pandas.DataFrame) – Source and target nodes.
clust_alg ({‘walktrap’, ‘spinglass’, ‘multilevel’, ‘infomap’, ‘label_prop’, ‘leading_eigenvector’}) – Specifies the community detection algorithm.

References

1: Pons & Latapy (2006) Computing Communities in Large Networks Using Random Walks, Journal of Graph Algorithms and Applications
2: Reichardt & Bornholdt (2006) Statistical mechanics of community detection, Physical Review E

Returns
Return type: Nothing. Modifies the passed object.

adobo.clustering.knn(comp, k=10, distance='euclidean')¶

Nearest neighbour search. Finds the k nearest neighbours of each cell.

Parameters

comp (pandas.DataFrame) – A pandas data frame containing PCA components.
k (int) – Number of nearest neighbors. Default: 10
target ({‘irlb’, ‘svd’}) – The dimensionality reduction result to run the NN search on. Default: irlb
distance (str) – Distance metric to use. See here for valid choices: https://tinyurl.com/y4bckf7w

Returns

Array containing indices.

Return type

numpy.ndarray
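
To illustrate what such an index array looks like, here is a minimal brute-force sketch using NumPy and Euclidean distance. It is not adobo's implementation; the function name knn_brute and the toy data are hypothetical.

```python
import numpy as np

def knn_brute(comp, k=10):
    """Return an (n_cells, k) array of nearest-neighbour indices (Euclidean)."""
    comp = np.asarray(comp, dtype=float)
    # Pairwise squared Euclidean distances between all cells (rows).
    diff = comp[:, None, :] - comp[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    # Exclude each cell from its own neighbour list.
    np.fill_diagonal(dist, np.inf)
    # Sort each row by distance and keep the k closest columns.
    return np.argsort(dist, axis=1)[:, :k]

# Toy "PCA components": two well-separated groups of three cells each.
comp = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                 [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
nn_idx = knn_brute(comp, k=2)
```

Row i of nn_idx holds the indices of the k cells closest to cell i. A brute-force search needs memory quadratic in the number of cells, so it is only practical for small inputs.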

adobo.clustering.leiden(snn_graph, res=0.8, seed=42)¶

Runs the Leiden algorithm

Parameters

snn_graph (pandas.DataFrame) – Source and target nodes.
res (float) – Resolution parameter, change to modify cluster resolution. Default: 0.8
seed (int) – For reproducibility.

References

1: https://github.com/vtraag/leidenalg
2: Traag et al. (2018) https://arxiv.org/abs/1810.08473

Returns
Return type: Nothing. Modifies the passed object.

adobo.clustering.louvain(snn_graph, res=0.8, seed=42)¶

Runs the Louvain algorithm

Parameters

snn_graph (pandas.DataFrame) – Source and target nodes.
res (float) – Resolution parameter, change to modify cluster resolution. Default: 0.8
seed (int) – For reproducibility.

References

1: https://github.com/taynaud/python-louvain
2: https://perso.uclouvain.be/vincent.blondel/research/louvain.html
3: Blondel et al., Fast unfolding of communities in large networks (2008), Journal of Statistical Mechanics: Theory and Experiment

Returns
Return type: Nothing. Modifies the passed object.

adobo.clustering.snn(nn_idx, k=10, prune_snn=0.067, verbose=False)¶

Computes a Shared Nearest Neighbor (SNN) graph

Notes

Link weights are number of shared nearest neighbors. The sum of SNN similarities over all KNNs is retrieved with linear algebra.

Parameters

nn_idx (numpy.ndarray) – Numpy array generated using knn()
k (int) – Number of nearest neighbors. Default: 10
prune_snn (float) – Threshold for pruning the SNN graph, i.e. the edges with lower value (Jaccard index) than this will be removed. Set to 0 to disable pruning. Increasing this value will result in fewer edges in the graph. Default: 0.067
verbose (bool) – Be verbose or not.

References

1: http://mlwiki.org/index.php/SNN_Clustering

Returns
Return type: Nothing. Modifies the passed object.
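
The idea of counting shared neighbours with linear algebra and pruning on the Jaccard index can be sketched as below. This is a simplified illustration under the assumption that every cell has exactly k neighbours; the function name snn_edges and the dense matrix are choices made for the example, not adobo's code.

```python
import numpy as np

def snn_edges(nn_idx, prune_snn=0.067):
    """Build SNN edges weighted by the Jaccard index of neighbour sets."""
    n, k = nn_idx.shape
    # Binary membership matrix: adj[i, j] = 1 if j is among i's neighbours.
    adj = np.zeros((n, n))
    adj[np.repeat(np.arange(n), k), nn_idx.ravel()] = 1.0
    # The matrix product counts neighbours shared by each pair of cells.
    shared = adj @ adj.T
    # Jaccard index: |A & B| / |A | B|, with |A| = |B| = k.
    jaccard = shared / (2 * k - shared)
    np.fill_diagonal(jaccard, 0.0)
    # Keep each undirected edge once, dropping those at or below the threshold.
    src, dst = np.nonzero(np.triu(jaccard > prune_snn))
    return [(int(i), int(j), float(jaccard[i, j])) for i, j in zip(src, dst)]

# Hypothetical neighbour indices for four cells with k=2.
nn_idx = np.array([[1, 2], [0, 2], [0, 1], [0, 1]])
edges = snn_edges(nn_idx, prune_snn=0.067)
strict = snn_edges(nn_idx, prune_snn=0.5)
```

Raising prune_snn removes weak edges: with the toy input, only the pair of cells with identical neighbour sets survives a threshold of 0.5.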

adobo.data module¶

Summary¶

This module contains a data storage class.

class adobo.data.dataset(raw_mat, desc='no desc set', output_file=None, input_file=None, sparse=True, verbose=False)¶

Bases: object

Storage container for raw, imputed and normalized data as well as analysis results.

_assays¶

Holding information about what functions have been applied.

Type: dict

count_data¶

Raw read count matrix.

Type: pandas.DataFrame

imp_count_data¶

Raw data after imputing dropouts.

Type: pandas.DataFrame

_low_quality_cells¶

Low quality cells identified with adobo.preproc.find_low_quality_cells().

Type: list

_norm_data¶

Stores all analysis results. A nested dictionary.

Type: dict

meta_cells¶

A data frame containing meta data for cells.

Type: pandas.DataFrame

meta_genes¶

A data frame containing meta data for genes.

Type: pandas.DataFrame

desc¶

A string describing the dataset.

Type: str

sparse¶

Represent the data in a sparse data structure. Will save memory at the expense of time. Default: True

Type: bool

output_file¶

A filename that will be used when calling save().

Type: str, optional

version¶

The adobo package version used to create this data object.

Type: str

add_meta_data(axis, key, data, type_='cat')¶

Add meta data to the adobo object

Notes

Meta data can be added to cells or genes.

The parameter name ‘type_’ has a trailing underscore to avoid shadowing Python’s built-in type function.

Parameters

axis ({‘cells’, ‘genes’}) – Are the data for cells or genes?
key (str) – The variable name for your data. Must not contain whitespace or special characters.
data (numpy.ndarray, list or pandas.Series) – Data to add. Can be a basic Python list, a numpy array or a Pandas Series with an index. If the data type is numpy array or list, then the length must match the length of cells or genes. If the data type is a Pandas series, then the length does not need to match as long as the index is there. Data can be continuous or categorical and this must be specified with type_.
type_ ({‘cat’, ‘cont’}) – Specify if data are categorical or continuous. cat means categorical data and cont means continuous data. Default: ‘cat’

Returns

Return type

Nothing.
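
Why an indexed pandas Series does not need to match the number of cells can be seen with plain pandas index alignment. The sketch below only illustrates that general mechanism; how adobo handles it internally is an assumption, and the cell names are hypothetical.

```python
import pandas as pd

cells = ["cell1", "cell2", "cell3", "cell4"]   # hypothetical cell names
meta_cells = pd.DataFrame(index=cells)

# A Series covering only some cells; the index says which ones.
batch = pd.Series(["a", "b"], index=["cell3", "cell1"])

# Align on the index; cells without a value become missing (NaN).
meta_cells["batch"] = batch.reindex(meta_cells.index)
```

A list or numpy array carries no index, which is why its length has to match the number of cells or genes exactly.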

assays()¶: Displays a basic summary of the dataset and what analyses have been performed on it.

delete(what)¶

Deletes analysis results from the norm_data dictionary

Parameters: what (str, tuple) – A string (or tuple of strings) specifying keys of what you want to delete. Each string should be a key in norm_data.

Example

>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.normalize.norm(exp)
>>> ad.hvg.find_hvg(exp)
>>> ad.dr.pca(exp)
>>> ad.clustering.generate(exp)
>>> exp.norm_data['standard']['clusters']['leiden']
    {'membership': V1        6
    V2        0
    V3        5
    V4        1
    V5        0
             ..
    V8377     1
    V8378    10
    V8379    11
    V8380     3
    V8381     2
    Length: 8381, dtype: int64}
>>> exp.delete(what=('clusters',))
>>> exp.norm_data['standard']['clusters']['leiden']
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    KeyError: 'leiden'

Returns
Return type: Nothing.

df_mem_usage(var)¶

Memory usage of a data frame in megabytes

Parameters: var (str) – Variable name as a string.
Returns: Megabytes used, rounded to two decimals.
Return type: float

get_assay(name, lang=False)¶: Get info if a function has been applied.

is_normalized()¶

Checks if normalized data can be found

Returns: True if it is normalized otherwise False.
Return type: bool

property low_quality_cells¶

property norm_data¶

print_dict()¶

save(filename=None, compress=True, verbose=False)¶

Serializes the object

Notes

This is a method so that the filename does not have to be remembered: it can be specified once with the output_file parameter when the object is created. Load the saved object with joblib.load.

Parameters

filename (str) – Output filename. Default: None
compress (bool) – Save with data compression or not. Default: True
verbose (bool) – Be verbose or not. Default: False

Returns

Return type

Nothing.

set_assay(name, key=1)¶: Set the assay that was applied.

adobo.de module¶

adobo.dr module¶

Summary¶

Functions for dimensionality reduction.

adobo.dr.force_graph(obj, name=(), iterations=1000, edgeWeightInfluence=1.0, jitterTolerance=1.0, barnesHutOptimize=True, scalingRatio=2.0, gravity=1.0, strongGravityMode=False, verbose=False)¶

Generates a force-directed graph

Parameters

obj (adobo.data.dataset) – A data class object.
name (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on the last normalization that was applied.
iterations (int) – Number of iterations. Default: 1000
edgeWeightInfluence (float) – How much influence to give to edge weights. 0 is no influence and 1 is normal. Default: 1.0
jitterTolerance (float) – Amount of swinging tolerated. Lower values give less speed and more precision. Default: 1.0
barnesHutOptimize (bool) – Run Barnes Hut optimization. Default: True
scalingRatio (float) – Amount of repulsion, higher values make a more sparse graph. Default: 2.0
gravity (float) – Attracts nodes to the center. Prevents islands from drifting away. Default: 1.0
strongGravityMode (bool) – A stronger gravity view. Default: False
verbose (bool) – Be verbose or not.

References

1: https://en.wikipedia.org/wiki/Force-directed_graph_drawing

Returns
Return type: None

adobo.dr.genes2scores(obj, normalization=None, genes=[], bins=25, ctrl=100, retx=True, metadata=None)¶

Create cell scores from a list of genes

Parameters

obj (adobo.data.dataset) – A dataset class object.
normalization (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on all normalizations available.
genes (list) – A list of genes to compute scores from.
bins (int) – Number of expression bins to be used. Default: 25
ctrl (int) – Number of control genes in each bin. Default: 100
retx (bool) – Return scores. Default: True
metadata (str) – If this is set to a string, then the scores will be set as a meta data variable with this column name. Default: None

References

1: Tirosh et al. (2016) Science. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq

Returns
Return type: Nothing. Modifies the passed object.
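
The scoring idea of Tirosh et al. (the average expression of the gene set minus the average of control genes drawn from the same expression bins) can be sketched as follows. This is a simplified illustration, not adobo's code: the function name gene_set_score, the rank-based binning, and the toy matrix are assumptions.

```python
import numpy as np

def gene_set_score(expr, gene_idx, bins=25, ctrl=100, seed=0):
    """Score cells: mean of the gene set minus mean of binned controls.

    expr is a genes x cells array; gene_idx lists row indices of the set.
    """
    rng = np.random.default_rng(seed)
    expr = np.asarray(expr, dtype=float)
    gene_idx = np.asarray(gene_idx)
    avg = expr.mean(axis=1)
    # Split genes into bins of similar average expression (by rank).
    rank = avg.argsort().argsort()
    gene_bin = rank * bins // len(avg)
    candidates = np.setdiff1d(np.arange(len(avg)), gene_idx)
    control = []
    for g in gene_idx:
        # Draw control genes from the same expression bin as gene g.
        pool = candidates[gene_bin[candidates] == gene_bin[g]]
        control.extend(rng.choice(pool, size=min(ctrl, len(pool)), replace=False))
    control = np.unique(control).astype(int)
    return expr[gene_idx].mean(axis=0) - expr[control].mean(axis=0)

# Toy data: 20 genes x 6 cells; genes 10 and 11 are raised in cells 0-2.
expr = np.tile(np.arange(20.0)[:, None], (1, 6))
gene_set = [10, 11]
expr[gene_set, :3] += 5.0
scores = gene_set_score(expr, gene_set, bins=5, ctrl=3)
```

In the toy data the control genes are constant across cells, so the first three cells score exactly 5 units higher than the rest.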

adobo.dr.irlb(data_norm, scale=True, ncomp=75, var_weigh=True, seed=None)¶

Truncated SVD by implicitly restarted Lanczos bidiagonalization

Notes

The augmented implicitly restarted Lanczos bidiagonalization algorithm (IRLBA) finds a few approximate largest singular values and corresponding singular vectors using a method of Baglama and Reichel.

Cells should be rows and genes as columns.

Parameters

data_norm (pandas.DataFrame) – A pandas data frame containing normalized gene expression data.
scale (bool) – Scales input data prior to PCA. Default: True
ncomp (int) – Number of components to return. Default: 75
var_weigh (bool) – Weigh by the variance of each component. Default: True
seed (int) – For reproducibility. Default: None

References

1: Baglama et al (2005) Augmented Implicitly Restarted Lanczos Bidiagonalization Methods SIAM Journal on Scientific Computing
2: https://github.com/bwlewis/irlbpy

Returns

pandas.DataFrame – A data frame containing the components (columns).
pandas.DataFrame – A data frame containing the contributions of every gene (rows).

adobo.dr.jackstraw(obj, normalization=None, permutations=500, ncomp=None, subset_frac_genes=0.05, score_thr=0.001, fdr=0.01, retx=True, verbose=False)¶

Determine the number of relevant PCA components.

Notes

Permutes a subset of the data matrix and compares PCA scores with the original. The final output is a p-value for each component generated using a Chi-sq test.

Parameters

obj (adobo.data.dataset) – A dataset class object.
normalization (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on all normalizations available.
permutations (int) – Number of permutations to run. Default: 500
ncomp (int) – Number of principal components to calculate significance for. If None, significance is calculated for all components previously saved by adobo.dr.pca(). Default: None
subset_frac_genes (float) – Proportion of genes to use. Default: 0.05
score_thr (float) – Threshold for significance. Default: 0.001
fdr (float) – Acceptable false discovery rate. Default: 0.01
retx (bool) – In addition to modifying the object, also return the results. Default: True
verbose (bool) – Be verbose. Default: False

References

1: Chung & Storey (2015) Statistical significance of variables driving systematic variation in high-dimensional data, Bioinformatics https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4325543/

Returns

pandas.DataFrame – A genes by principal components data frame containing empirical p-values for the significance of every gene in every PC.
pandas.DataFrame – A data frame containing a single p-value for every PC, generated from a Chi^2 test. Can be used to select the number of components to include by examining the p-values.

adobo.dr.pca(obj, method='irlb', normalization=None, ncomp=75, genes='hvg', scale=True, var_weigh=True, use_combat=False, verbose=False, seed=42)¶

Runs Principal Component Analysis (PCA)

Notes

Scaling of the data is achieved by setting scale=True (default), which will center (subtract the column mean) and scale columns (divide by their standard deviation).

Parameters

obj (adobo.data.dataset) – A dataset class object.
method ({‘irlb’, ‘svd’}) – Method to use for PCA. This does not matter much. Default: irlb
normalization (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on all normalizations available.
ncomp (int) – Number of components to return. Default: 75
genes ({‘hvg’, ‘all’} or list) – If a string, the allowed values are ‘hvg’ to use only the highly variable genes or ‘all’ to use all genes. If a list, then the list specifies the list of genes to use. Default: ‘hvg’
scale (bool) – Scales input data prior to PCA. Default: True
use_combat (bool) – Use ComBat corrected data or not. Default: False
var_weigh (bool) – Weigh by the variance of each component. Default: True
verbose (bool) – Be noisy or not. Default: False
seed (int) – For reproducibility (only irlb). Default: 42

References

1: https://en.wikipedia.org/wiki/Principal_component_analysis
2: Baglama et al (2005) Augmented Implicitly Restarted Lanczos Bidiagonalization Methods SIAM Journal on Scientific Computing
3: https://github.com/bwlewis/irlbpy
4: https://tinyurl.com/yyt6df5x

Returns

Modifies the passed object. Results are stored in two dictionaries in the passed object: dr (containing the components) and dr_gene_contr (containing gene loadings).

Return type

None
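
The centering and scaling described in the notes, followed by a decomposition, can be sketched with plain NumPy. This is a minimal illustration using a full SVD rather than IRLBA, and it does not reproduce the var_weigh option; the function name pca_svd and the random toy data are assumptions.

```python
import numpy as np

def pca_svd(x, ncomp=75, scale=True):
    """PCA of a cells-by-genes matrix via singular value decomposition."""
    x = np.asarray(x, dtype=float)
    # Center every column (gene) on its mean.
    x = x - x.mean(axis=0)
    if scale:
        # Divide every column by its sample standard deviation.
        x = x / x.std(axis=0, ddof=1)
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    comp = u[:, :ncomp] * s[:ncomp]   # cell coordinates (components)
    contr = vt[:ncomp].T              # gene loadings
    return comp, contr

rng = np.random.default_rng(42)
x = rng.normal(size=(20, 5))          # 20 cells, 5 genes (toy data)
comp, contr = pca_svd(x, ncomp=2)
```

Here comp has one row per cell and contr one row per gene, which corresponds to the split between components and gene loadings described above.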

adobo.dr.regress(obj, target_vars=[], normalization=None)¶

Regress out the effects of certain meta data variables.

Notes

This function can be used to remove known confounding variables such as ambient gene expression modules, cell cycle genes or known experimental batches. It fits a linear model using numpy’s least-squares method (numpy.linalg.lstsq), predicts expression values from the model and then extracts the residuals, which become the new expression values.

Parameters

obj (adobo.data.dataset) – A dataset class object.
target_vars (list) – A list of target meta data variables.
normalization (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on the last normalization used.

Returns

Return type

Nothing. Modifies the passed object.
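
The fit, predict, and take-residuals procedure from the notes can be sketched with numpy.linalg.lstsq as below. It is a minimal illustration on a cells-by-genes array; the function name regress_out, the intercept column, and the simulated batch variable are assumptions made for the example.

```python
import numpy as np

def regress_out(expr, covariates):
    """Return residuals of expr (cells x genes) after a linear fit."""
    expr = np.asarray(expr, dtype=float)
    covariates = np.asarray(covariates, dtype=float)
    # Design matrix: an intercept column plus the confounding variables.
    design = np.column_stack([np.ones(len(covariates)), covariates])
    # Least-squares fit of every gene on the design matrix at once.
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
    # Residuals = observed values minus values predicted by the model.
    return expr - design @ coef

rng = np.random.default_rng(0)
batch = rng.integers(0, 2, size=50)          # hypothetical confounder
expr = rng.normal(size=(50, 3)) + 2.0 * batch[:, None]
resid = regress_out(expr, batch)
```

The residuals are orthogonal to every column of the design matrix, so the linear effect of the confounder is removed from each gene.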

adobo.dr.svd(data_norm, scale=True, ncomp=75, only_sdev=False)¶

Principal component analysis via singular value decomposition

Parameters

data_norm (pandas.DataFrame) – A pandas data frame containing normalized gene expression data. Preferably this should be a subset of the normalized gene expression matrix containing highly variable genes.
scale (bool) – Scales input data prior to PCA. Default: True
ncomp (int) – Number of components to return. Default: 75
only_sdev (bool) – Only return the standard deviation of the components. Default: False

References

1: https://tinyurl.com/yyt6df5x

Returns

pandas.DataFrame – A data frame containing the components (columns). Only if only_sdev=False.
pandas.DataFrame – A data frame containing the contributions of every gene (rows). Only if only_sdev=False.
pandas.DataFrame – A data frame containing standard deviations of components. Only if only_sdev=True.

adobo.dr.tsne(obj, run_on_PCA=True, name=None, perplexity=30, n_iter=2000, seed=None, verbose=False, **args)¶

Projects data to a two dimensional space using the tSNE algorithm.

Notes

It is recommended to perform this function on data in PCA space. This function calls sklearn.manifold.TSNE(), and any additional parameters will be passed to it.

Parameters

obj (adobo.data.dataset) – A dataset class object.
run_on_PCA (bool) – To run tSNE on PCA components or not. If False then runs on the entire normalized gene expression matrix. Default: True
name (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on all normalizations available.
perplexity (float) – From [1]: The perplexity is related to the number of nearest neighbors that is used in other manifold learning algorithms. Larger datasets usually require a larger perplexity. Consider selecting a value between 5 and 50. Different values can result in significantly different results. Default: 30
n_iter (int) – Number of iterations. Default: 2000
seed (int) – For reproducibility. Default: None
verbose (bool) – Be verbose. Default: False

References

1: van der Maaten, L.J.P.; Hinton, G.E. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9:2579-2605, 2008.
2: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html

Returns
Return type: Nothing. Modifies the passed object.

adobo.dr.umap(obj, run_on_PCA=True, name=None, n_neighbors=15, distance='euclidean', n_epochs=None, learning_rate=1.0, min_dist=0.1, spread=1.0, seed=None, verbose=False, **args)¶

Projects data to a low-dimensional space using the Uniform Manifold Approximation and Projection (UMAP) algorithm

Notes

UMAP is a non-linear dimensionality reduction algorithm.

Parameters

obj (adobo.data.dataset) – A dataset class object.
run_on_PCA (bool) – To run UMAP on PCA components or not. If False then runs on the entire normalized gene expression matrix. Default: True
name (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on all normalizations available.
n_neighbors (int) – The size of local neighborhood (in terms of number of neighboring sample points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general values should be in the range 2 to 100. Default: 15
distance (str) – The metric to use to compute distances in high dimensional space. Default: ‘euclidean’
n_epochs (int) – The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings. If None is specified a value will be selected based on the size of the input dataset (200 for large datasets, 500 for small). Default: None
learning_rate (float) – The initial learning rate for the embedding optimization. Default: 1.0
min_dist (float) – The effective minimum distance between embedded points. Default: 0.1
spread (float) – The effective scale of embedded points. Default: 1.0
seed (int) – For reproducibility. Default: None
verbose (bool) – Be verbose. Default: False

References

1: McInnes L, Healy J, Melville J (2018) UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, https://arxiv.org/abs/1802.03426
2: https://github.com/lmcinnes/umap
3: https://umap-learn.readthedocs.io/en/latest/

Returns
Return type: Nothing. Modifies the passed object.

adobo.hvg module¶

Summary¶

Functions for detection of highly variable genes.

adobo.hvg.brennecke(data_norm, log, ercc=None, fdr=0.1, ngenes=1000, minBiolDisp=0.5, verbose=False)¶

Implements the method of Brennecke et al. (2013) to identify highly variable genes

Notes

Fits data using GLM with Fisher Scoring. GLM code copied from (credits to @madrury for this code): https://github.com/madrury/py-glm

Parameters

data_norm (pandas.DataFrame) – A pandas data frame containing normalized gene expression data.
log (bool) – If normalized data were log transformed or not.
ercc (pandas.DataFrame) – A pandas data frame containing normalized ercc spikes.
fdr (float) – False Discovery Rate considered significant.
minBiolDisp (float) – Minimum percentage of variance due to biological factors.
ngenes (int) – Number of top highly variable genes to return.
verbose (bool) – Be verbose or not.

References

1: Brennecke et al. (2013) Nature Methods https://doi.org/10.1038/nmeth.2645

Returns: A list containing highly variable genes.
Return type: list

adobo.hvg.chen2016(data_norm, log, fdr=0.1, ngenes=1000)¶

This function implements the approach from Chen (2016) to identify highly variable genes.

Notes

Expression counts should be normalized and on a log scale.

Parameters

data_norm (pandas.DataFrame) – A pandas data frame containing normalized gene expression data.
log (bool) – If normalized data were log transformed or not.
fdr (float) – False Discovery Rate considered significant.
ngenes (int) – Number of top highly variable genes to return.

References

1: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-2897-6
2: https://github.com/hillas/scVEGs/blob/master/scVEGs.r

Returns: A list containing highly variable genes.
Return type: list

adobo.hvg.find_hvg(obj, method='seurat', normalization=None, ngenes=1000, fdr=0.1, use_combat=False, verbose=False)¶

Finding highly variable genes

Notes

A wrapper function around the individual HVG functions, which can also be called directly.

The method ‘brennecke’ should not be applied on ‘fqn’ normalized data.

Parameters

obj (adobo.data.dataset) – A dataset class object.
method ({‘seurat’, ‘brennecke’, ‘scran’, ‘chen2016’, ‘mm’}) – Specifies the method to be used.
normalization (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on all normalizations available.
ngenes (int) – Number of genes to return.
fdr (float) – False Discovery Rate threshold for significant genes applied to those methods that use it (brennecke, chen2016, mm). Note that the number of returned genes might be fewer than specified by ngenes because of FDR consideration.
use_combat (bool) – Use combat-adjusted data. Default: False
verbose (bool) – Be verbose or not.

Example

>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.normalize.norm(exp)
>>> ad.hvg.find_hvg(exp)

References

1: Yip et al. (2018) Briefings in Bioinformatics https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bby011/4898116

Returns
Return type: Nothing. Modifies the passed object.

adobo.hvg.mm(data_norm, log, fdr=0.1, ngenes=1000)¶

This function implements the approach from Andrews (2018).

Notes

Input should be normalized but not log-transformed.

Parameters

data_norm (pandas.DataFrame) – A pandas data frame containing normalized counts.
fdr (float) – False Discovery Rate considered significant.
ngenes (int) – Number of top highly variable genes to return.

References

1: https://doi.org/10.1093/bioinformatics/bty1044
2: https://github.com/tallulandrews/M3Drop

Returns: A list containing highly variable genes.
Return type: list

adobo.hvg.scran(data_norm, log, ngenes=1000, ercc=None)¶

This function implements the approach from the scran R package

Notes

Expression counts should be normalized and on a log scale.

Outline of the steps:

fits a polynomial regression model to mean and variance of the technical genes
decomposes the total variance of the biological genes by subtracting the technical variance predicted by the fit
sorts the genes by biological variance

Parameters

data_norm (pandas.DataFrame) – A pandas data frame containing normalized gene expression data.
log (bool) – If normalized data were log transformed or not.
ercc (pandas.DataFrame) – A pandas data frame containing normalized ercc spikes.
ngenes (int) – Number of top highly variable genes to return.

References

1: Lun ATL, McCarthy DJ, Marioni JC (2016). “A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor.” F1000Research, https://doi.org/10.12688/f1000research.9501.2

Returns: A list containing highly variable genes.
Return type: list
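
The three steps outlined in the notes can be sketched with numpy.polyfit. This is an illustration only: the function name decompose_variance, the polynomial degree, and the simulated spike-in data are assumptions, and the scran package itself uses a more elaborate trend fit.

```python
import numpy as np

def decompose_variance(bio, tech, degree=2):
    """Biological variance = total variance - fitted technical variance."""
    bio = np.asarray(bio, dtype=float)
    tech = np.asarray(tech, dtype=float)
    # Step 1: fit variance as a polynomial of the mean for technical genes.
    coef = np.polyfit(tech.mean(axis=1), tech.var(axis=1, ddof=1), degree)
    # Step 2: subtract the predicted technical variance from the total.
    total = bio.var(axis=1, ddof=1)
    bio_var = total - np.polyval(coef, bio.mean(axis=1))
    # Step 3: order genes from highest to lowest biological variance.
    return np.argsort(-bio_var), bio_var

rng = np.random.default_rng(1)
# Hypothetical spike-ins: four technical genes measured in 30 cells.
tech = rng.poisson(lam=[[1], [5], [10], [20]], size=(4, 30))
# Three biological genes with the same mean but different spread.
bio = np.array([[4, 6, 4, 6], [0, 10, 0, 10], [5, 5, 5, 5]])
order, bio_var = decompose_variance(bio, tech)
```

Because the three toy genes share the same mean, the same technical variance is subtracted from each, and the ranking follows their total variance.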

adobo.hvg.seurat(data, ngenes=1000, num_bins=20)¶

Retrieves a list of highly variable genes using Seurat’s strategy

Notes

The function bins the genes according to average expression, then calculates the dispersion of each gene as its variance-to-mean ratio. Within each bin, Z-scores of the dispersion are calculated. The Z-scores are ranked and the top ngenes genes (1000 by default) are selected. Input data should be normalized first.

Parameters

data (pandas.DataFrame) – A pandas data frame object containing normalized gene expression data (rows=genes, columns=cells).
ngenes (int) – Number of top highly variable genes to return.
num_bins (int) – Number of bins to use.

References

1: https://cran.r-project.org/web/packages/Seurat/index.html

Returns: A list containing highly variable genes.
Return type: list
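
The binning and Z-scoring strategy described in the notes can be sketched as follows. This is a simplified illustration: the function name hvg_dispersion, the equal-width bins, and the toy matrix are assumptions, and Seurat's own binning details may differ.

```python
import numpy as np

def hvg_dispersion(data, gene_names, ngenes=1000, num_bins=20):
    """Rank genes by binned, z-scored dispersion (genes x cells input)."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=1)
    # Dispersion as variance-to-mean ratio for every gene.
    disp = data.var(axis=1, ddof=1) / mean
    # Assign genes to equal-width bins of average expression.
    edges = np.linspace(mean.min(), mean.max(), num_bins + 1)
    bins = np.clip(np.digitize(mean, edges[1:-1]), 0, num_bins - 1)
    z = np.zeros_like(disp)
    for b in np.unique(bins):
        idx = bins == b
        sd = disp[idx].std(ddof=1) if idx.sum() > 1 else 0.0
        # Z-score of the dispersion within the bin (0 if undefined).
        z[idx] = (disp[idx] - disp[idx].mean()) / sd if sd > 0 else 0.0
    top = np.argsort(-z)[:ngenes]
    return [gene_names[i] for i in top]

genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
data = np.array([[1, 2, 1, 2],
                 [0, 3, 0, 3],
                 [1, 2, 2, 1],
                 [10, 10, 11, 9],
                 [5, 15, 5, 15],
                 [10, 11, 9, 10]])
top = hvg_dispersion(data, genes, ngenes=2, num_bins=2)
```

Scoring within bins means a gene is compared only with genes of similar average expression, so one gene from the low bin and one from the high bin are selected here.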

adobo.normalize module¶

Summary¶

This module contains functions to normalize raw read counts.

adobo.normalize.ComBat(obj, normalization=None, meta_cells_var=None, mean_only=True, par_prior=True, verbose=False)¶

Adjust for batch effects in datasets where the batch covariate is known

Notes

ComBat is a classical method for batch correction and it has been shown to perform well on single cell data. The drawback of using ComBat is that all cells in a batch are used for estimating model parameters. This implementation follows the ComBat function in the R package SVA.

Commands should be run in this order:

>>> ad.normalize.norm(exp)
>>> exp.add_meta_data(axis='cells', key='batch', data=batch_vector)
>>> ad.normalize.ComBat(exp, meta_cells_var='batch', verbose=True)
>>> ad.hvg.find_hvg(exp, use_combat=True)
>>> ad.dr.pca(exp, use_combat=True, verbose=True)

Parameters

obj (adobo.data.dataset) – A data class object.
normalization (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on all normalizations available.
meta_cells_var (str) – Meta data variable. Should be a column name in data.dataset.meta_cells.
mean_only (bool) – Mean only version of ComBat. Default: True
par_prior (bool) – True indicates parametric adjustments will be used, False indicates non-parametric adjustments will be used. Default: True
verbose (bool) – Be verbose or not. Default: False

References

1: Johnson et al. (2007) Biostatistics. Adjusting batch effects in microarray expression data using empirical Bayes methods.
2: Büttner et al. (2019) Nat Methods. A test metric for assessing single-cell RNA-seq batch correction

Returns
Return type: Modifies the passed object.

adobo.normalize.clean_matrix(data, obj, remove_low_qual=True, remove_mito=True, meta=False)¶

adobo.normalize.clr(data, axis='genes')¶

Performs centered log ratio normalization similar to Seurat

Parameters

data (pandas.DataFrame) – A pandas data frame object containing raw read counts (rows=genes, columns=cells).
axis ({'genes', 'cells'}) – Normalize over genes or cells. Default: ‘genes’

References

1: Hafemeister et al. (2019) https://www.biorxiv.org/content/10.1101/576827v1

Example

>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.normalize.norm(exp, method='clr')

Returns: A normalized data matrix with same dimensions as before.
Return type: pandas.DataFrame
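
As a generic illustration of a centered log ratio, the sketch below takes the log of each value (with a pseudocount) and subtracts the mean log of its row, which is equivalent to dividing by the row's geometric mean. It is not adobo's or Seurat's exact formula; the function name clr_rows and the pseudocount of 1 are assumptions.

```python
import numpy as np

def clr_rows(counts, pseudocount=1.0):
    """Centered log ratio of each row: log(x) minus the row's mean log(x)."""
    logged = np.log(np.asarray(counts, dtype=float) + pseudocount)
    # Subtracting the mean log equals dividing by the geometric mean.
    return logged - logged.mean(axis=1, keepdims=True)

counts = np.array([[0, 1, 3], [10, 10, 10]])
norm = clr_rows(counts)
```

To normalize over the other axis, the matrix can be transposed before and after the call.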

adobo.normalize.fqn(data)¶

Performs full quantile normalization (FQN)

Notes

FQN has been shown to perform well on single cell data and was a popular normalization scheme for microarray data. The present function does not handle ties well.

Parameters: data (pandas.DataFrame) – A pandas data frame object containing raw read counts (rows=genes, columns=cells).

References

1: Bolstad et al. (2003) Bioinformatics https://academic.oup.com/bioinformatics/article/19/2/185/372664
2: Cole et al. (2019) Cell Systems https://www.biorxiv.org/content/10.1101/235382v2

Example

>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.normalize.norm(exp, method='fqn')

Returns: A normalized data matrix with same dimensions as before.
Return type: pandas.DataFrame
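
The idea behind full quantile normalization can be sketched in a few lines of NumPy: every cell (column) is forced onto the same reference distribution, which is the mean of the sorted columns. The name fqn_sketch is hypothetical and this is not adobo's code. Like the function documented above, the sketch breaks ties arbitrarily by sort order.

```python
import numpy as np

def fqn_sketch(mat):
    """Full quantile normalization of a genes x cells matrix (illustrative only)."""
    mat = np.asarray(mat, dtype=float)
    # rank of every value within its own column (0 = smallest)
    ranks = np.argsort(np.argsort(mat, axis=0), axis=0)
    # reference distribution: mean of the k-th smallest value across columns
    reference = np.sort(mat, axis=0).mean(axis=1)
    # replace each value by the reference value of its rank
    return reference[ranks]

raw = np.array([[5, 4, 3], [2, 1, 4], [3, 4, 6], [4, 2, 8]])
normalized = fqn_sketch(raw)
```

After the transform every column contains exactly the same set of values, only in a different order.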

adobo.normalize.norm(obj, method='standard', name=None, use_imputed=False, log=True, log_func=<ufunc 'log2'>, small_const=1, remove_low_qual=True, remove_mito=True, gene_lengths=None, scaling_factor=10000, axis='genes', ngenes=2000, nworkers='auto', retx=False, verbose=False)¶

Normalizes gene expression data

Notes

A wrapper function around the individual normalization functions, which can also be called directly.

Parameters

obj (adobo.data.dataset) – A dataset class object.
method ({‘standard’, ‘rpkm’, ‘fqn’, ‘clr’, ‘vsn’}) – Specifies the method to use. standard refers to the simplest normalization strategy involving scaling genes by total number of reads per cell. rpkm performs RPKM normalization and requires the gene_lengths parameter to be set. fqn performs a full-quantile normalization. clr performs centered log ratio normalization. vsn performs a variance stabilizing normalization. Default: standard
name (str) – A chosen name for the normalization. It is used for storing and retrieving this normalization for plotting later. If None or an empty string, then it is set to the value of method.
use_imputed (bool) – Use imputed data. If set to True, then adobo.preproc.impute() must have been run previously. Default: False
log (bool) – Perform log transformation. Default: True
log_func (numpy.func) – Logarithmic function to use. For example: np.log2, np.log1p, np.log10, etc. Default: np.log2
small_const (float) – A small constant to add to expression values to avoid log’ing genes with zero expression. Default: 1
remove_low_qual (bool) – Remove low quality cells and uninformative genes identified by prior steps. Default: True
remove_mito (bool) – Remove mitochondrial genes (if these have been detected with adobo.preproc.find_mitochondrial_genes). Default: True
gene_lengths (pandas.Series or str) – A pandas.Series containing the gene lengths in base pairs and gene names set as index. The names must match the gene names used in data (the order does not need to match and any symbols not found in the data will be discarded). Normally gene lengths should be the combined length of exons for every gene. If gene_lengths is a str then it is taken as a filename and loaded; first column is gene names and second column is the length, field separator is one space. gene_lengths needs to be set _only_ if method=’rpkm’. Default: None
scaling_factor (int) – Scaling factor used to multiply the scaled counts with. Only used for method=’standard’. Default: 10000
axis ({‘genes’, ‘cells’}) – Only applicable when method=”clr”, defines the axis to normalize across. Default: ‘genes’
ngenes (int) – For method=’vsn’, number of genes to use when estimating parameters. Default: 2000
nworkers (int or {‘auto’}) – For method=’vsn’. If a string, then the only accepted value is ‘auto’, and the number of worker processes will be the total number of detected physical cores. If an integer then it specifies the number of worker processes. Default: ‘auto’
retx (bool) – Return the normalized data as well. Default: False
verbose (bool) – Be verbose or not. Default: False

References

1: Cole et al. (2019) Cell Systems https://www.biorxiv.org/content/10.1101/235382v2

Example

>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.normalize.norm(exp)

Returns
Return type: Nothing. Modifies the passed object.

adobo.normalize.rpkm(data, gene_lengths)¶

Normalize expression values as RPKM

Notes

This method should be used if you need to adjust for gene length, such as in a SMART-Seq2 protocol.

Parameters

data (pandas.DataFrame) – A pandas data frame object containing raw read counts (rows=genes, columns=cells).
gene_lengths (pandas.Series or str) – Should contain the gene lengths in base pairs and gene names set as index. The names must match the gene names used in data. Normally gene lengths should be the combined length of exons for every gene. If gene_lengths is a str then it is taken as a file path and the file is loaded; the first column is gene names and the second column is the length, and the field separator is one space. An alternative format is a single column of combined exon lengths where the total number of rows matches the number of rows in the raw read counts matrix, in the same order.

References

1: Conesa et al. (2016) Genome Biology https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0881-8

Returns: A normalized data matrix with same dimensions as before.
Return type: pandas.DataFrame
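
For reference, RPKM (reads per kilobase of transcript per million mapped reads) is counts × 10^9 / (gene length in bp × total reads in the cell). A minimal NumPy sketch follows; rpkm_sketch is a hypothetical name and the toy numbers are made up for illustration.

```python
import numpy as np

def rpkm_sketch(counts, gene_lengths):
    """RPKM for a genes x cells count matrix (illustrative only)."""
    counts = np.asarray(counts, dtype=float)
    lengths = np.asarray(gene_lengths, dtype=float)   # base pairs, one per gene
    depth = counts.sum(axis=0)                        # total reads per cell
    # 1e9 = 1e3 (per kilobase) * 1e6 (per million reads)
    return counts * 1e9 / (lengths[:, None] * depth[None, :])

raw = np.array([[10, 20], [30, 20]])    # 2 genes x 2 cells
lengths = np.array([1000, 2000])        # combined exon length per gene
values = rpkm_sketch(raw, lengths)
```

The result keeps the dimensions of the input matrix.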

adobo.normalize.standard(data, scaling_factor=10000)¶

Performs a standard normalization by scaling with the total read depth per cell and then multiplying with a scaling factor.

Parameters

data (pandas.DataFrame) – A pandas data frame object containing raw read counts (rows=genes, columns=cells).
scaling_factor (int) – Scaling factor used to multiply the scaled counts with. Default: 10000

References

1: Evans et al. (2018) Briefings in Bioinformatics https://academic.oup.com/bib/article/19/5/776/3056951

Example

>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.normalize.norm(exp, method='standard')

Returns: A normalized data matrix with same dimensions as before.
Return type: pandas.DataFrame
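
The arithmetic of this depth normalization is simple enough to show directly. The sketch below assumes a genes x cells NumPy array and also shows the log step that adobo.normalize.norm applies by default (log2 with small_const=1). The name standard_sketch is hypothetical, and a cell with zero total reads would cause a division by zero here.

```python
import numpy as np

def standard_sketch(counts, scaling_factor=10000):
    """Scale each cell by its total read depth (illustrative only)."""
    counts = np.asarray(counts, dtype=float)
    depth = counts.sum(axis=0)              # total reads per cell (column)
    return counts / depth * scaling_factor

raw = np.array([[10, 0], [30, 5], [60, 15]])
scaled = standard_sketch(raw)
logged = np.log2(scaled + 1)                # log transform with a small constant
```

After scaling, every cell sums to scaling_factor, which makes cells with different sequencing depths comparable.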

adobo.normalize.vsn(data, min_cells=5, gmean_eps=1, ngenes=2000, nworkers='auto', verbose=False)¶

Performs variance stabilizing normalization based on a negative binomial regression model with regularized parameters

Notes

Use only with UMI counts. Adopts a subset of the functionality of vst in the R package sctransform.

Parameters

data (pandas.DataFrame) – A pandas data frame object containing raw read counts (rows=genes, columns=cells).
min_cells (int) – Minimum number of cells expressing a gene for the gene to be used. Default: 5
gmean_eps (float) – A small constant to avoid log(0)=-Inf. Default: 1
ngenes (int) – Number of genes to use when estimating parameters. Default: 2000
nworkers (int or {‘auto’}) – If a string, then the only accepted value is ‘auto’, and the number of worker processes will be the total number of detected physical cores. If an integer then it specifies the number of worker processes. Default: ‘auto’
verbose (bool) – Be verbose or not. Default: False

References

1: https://cran.r-project.org/web/packages/sctransform/index.html
2: https://www.biorxiv.org/content/10.1101/576827v1

Example

>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.normalize.norm(exp, method='vsn')

Returns: A data matrix with adjusted counts.
Return type: pandas.DataFrame

adobo.plotting module¶

Summary¶

Functions for plotting scRNA-seq data.

adobo.plotting.cell_viz(obj, reduction=None, normalization=(), clustering=(), metadata=(), genes=(), highlight=None, highlight_color=('black', 'red'), selection_mode=False, edges=False, cell_types=False, trajectory=None, filename=None, marker_size=0.8, font_size=8, colors='adobo', title=None, legend=True, legend_marker_scale=10, legend_position=(1, 1), min_cluster_size=10, figsize=(10, 10), margins=None, dark=False, aspect_ratio='equal', verbose=False, **args)¶

Generates a 2d scatter plot from an embedding

Parameters

obj (adobo.data.dataset) – A data class object
reduction ({‘tsne’, ‘umap’, ‘pca’, ‘force_graph’}) – The dimensional reduction to use. Default is to use the last one generated.
normalization (tuple) – A tuple of normalization to use. If it has the length zero, then the last generated will be used.
clustering (tuple) – Specifies the clustering outcomes to plot. If it has the length zero, then the last generated clustering is plotted.
metadata (tuple, optional) – Specifies the metadata variables to plot.
genes (tuple, optional) – Specifies genes to plot. Can also be a regular expression matching a single gene name.
highlight (int or str) – Highlight a cluster or a single cell. Integer if cluster and string if a cell.
highlight_color (tuple) – The colors to use when highlighting a cluster. Should be a tuple of length two. The first item is the color of all clusters other than the selected one; the second item is the color of the highlighted cluster.
selection_mode (bool) – Enables interactive selection of cells. Prints the IDs of the cells inside the rectangle. Default: False
edges (bool) – Draw edges (only applicable if reduction=’force_graph’). Default: False
cell_types (bool) – Print cell type predictions, applicable if adobo.bio.cell_type_predict() has been run. Default: False
trajectory (str, optional) – The trajectory to plot. For example ‘slingshot’. Default: None
filename (str, optional) – Name of an output file instead of showing on screen.
marker_size (float) – The size of the markers. Default: 0.8
font_size (float) – Font size. Default: 8
colors ({‘adobo’, ‘random’} or list) – Can be: (i) “adobo” or “random”; or (ii) a list of colors with the same length as the number of factors. If colors is set to “adobo”, then colors are retrieved from adobo._constants.CLUSTER_COLORS_DEFAULT (but if the number of clusters exceeds 50, then random colors will be used). Default: adobo
title (str) – An optional title of the plot.
legend (bool) – Add legend or not. Default: True
legend_marker_scale (int) – Scale the markers in the legend. Default: 10
legend_position (tuple) – A tuple of length two describing the position of the legend. Default: (1,1)
min_cluster_size (int) – Can be used to prevent clusters below a certain number of cells to be plotted. Default: 10
figsize (tuple) – Figure size in inches. Default: (10, 10)
margins (dict) – Can be used to adjust margins. Should be a dict with one or more of the keys: ‘left’, ‘bottom’, ‘right’, ‘top’, ‘wspace’, ‘hspace’. Set verbose=True to figure out the present values. Default: None
dark (bool) – Make the background color black. Default: False
aspect_ratio ({‘equal’, ‘auto’}) – Set the aspect of the axis scaling, i.e. the ratio of y-unit to x-unit. Default: ‘equal’
verbose (bool) – Be verbose or not. Default: False

Returns

Return type

None

adobo.plotting.exp_genes(obj, normalization=None, clust_alg=None, cluster=None, min_cluster_size=10, violin=True, scale='width', fontsize=10, figsize=(10, 5), linewidth=0.5, filename=None, title=None, **args)¶

Compare number of expressed genes across clusters

Parameters

obj (adobo.data.dataset) – A data class object
normalization (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on the last normalization that was applied.
clust_alg (str) – Name of the clustering strategy. If empty or None, the last one will be used.
cluster (list or int) – List of cluster identifiers to plot. If a list, then expecting a list of cluster indices. An integer specifies only one cluster index. If None, then shows the expression across all clusters. Default: None
min_cluster_size (int) – Can be used to prevent clusters below a certain number of cells to be plotted. Default: 10
violin (bool) – Draws a violin plot (otherwise a box plot). Default: True
scale ({‘width’, ‘area’}) – If area, each violin will have the same area. If width, each violin will have the same width. Default: ‘width’
fontsize (int) – Specifies font size. Default: 10
figsize (tuple) – Figure size in inches. Default: (10, 5)
linewidth (float) – Border width. Default: 0.5
filename (str, optional) – Write to a file instead of showing the plot on screen.
title (str) – Title of the plot. Default: None
**args – Passed on into seaborn’s violinplot and boxplot functions

Example

>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.normalize.norm(exp, method='standard')
>>> ad.hvg.find_hvg(exp)
>>> ad.dr.pca(exp)
>>> ad.clustering.generate(exp, clust_alg='leiden')
>>> ad.plotting.exp_genes(exp)

Returns
Return type: Nothing

adobo.plotting.genes_violin(obj, normalization='', clust_alg=None, cluster=None, gene=None, rank_func=<function median>, top=10, violin=True, scale='width', fontsize=10, figsize=(10, 5), linewidth=0.5, filename=None, **args)¶

Plots individual genes using a violin plot (or box plot). Can be used to plot the top genes in the total dataset or the top genes in individual clusters. A specific gene can also be selected using the parameter gene.

Parameters

obj (adobo.data.dataset) – A data class object
normalization (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on the last normalization that was applied.
clust_alg (str) – Name of the clustering strategy. If empty or None, the last one will be used.
cluster (list or int) – List of cluster identifiers to plot. If a list, then expecting a list of cluster indices. An integer specifies only one cluster index. If None, then shows the expression across all clusters. Default: None
gene (str) – Compare a single gene across all clusters (can also be a regular expression, but it must match a single gene). If this is None, then the top is plotted based on the ranking function specified below. Default: None
rank_func (np.median) – Ranking function. numpy’s median is the default.
top (int) – Specifies the number of top scoring genes to include. Default: 10
violin (bool) – Draws a violin plot (otherwise a box plot). Default: True
scale ({‘width’, ‘area’}) – If area, each violin will have the same area. If width, each violin will have the same width. Default: ‘width’
fontsize (int) – Specifies font size. Default: 10
figsize (tuple) – Figure size in inches. Default: (10, 5)
linewidth (float) – Border width. Default: 0.5
filename (str, optional) – Write to a file instead of showing the plot on screen.
**args – Passed on into seaborn’s violinplot and boxplot functions

Example

>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.normalize.norm(exp, method='standard')
>>> ad.hvg.find_hvg(exp)
>>> ad.dr.pca(exp)
>>> ad.clustering.generate(exp, clust_alg='leiden')
>>>
>>> # top 10 genes in cluster 0
>>> ad.plotting.genes_violin(exp, top=10, cluster=0)
>>>
>>> # top 10 genes across all clusters
>>> ad.plotting.genes_violin(exp, top=10)
>>>
>>> # plotting one gene across all clusters
>>> ad.plotting.genes_violin(exp, gene='ENSG00000163220')
>>>
>>> # same, but using a box plot
>>> ad.plotting.genes_violin(exp, gene='ENSG00000163220', violin=False)

Returns
Return type: Nothing
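
To make the role of rank_func and top concrete, the sketch below ranks genes by a summary statistic computed across cells and keeps the highest scoring ones. It assumes a genes x cells NumPy array and a matching list of gene names; top_genes_sketch is a hypothetical helper, not part of adobo.

```python
import numpy as np

def top_genes_sketch(expr, gene_names, top=10, rank_func=np.median):
    """Return the `top` genes with the highest rank_func score (illustrative only)."""
    expr = np.asarray(expr, dtype=float)
    scores = rank_func(expr, axis=1)            # one score per gene
    order = np.argsort(scores)[::-1][:top]      # highest score first
    return [gene_names[i] for i in order]

expr = np.array([[1, 2, 3], [10, 11, 12], [5, 5, 5]])
selected = top_genes_sketch(expr, ['a', 'b', 'c'], top=2)
```

Any function that accepts an axis argument, such as np.mean or np.max, could be swapped in for np.median in this sketch.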

adobo.plotting.jackstraw_barplot(obj, normalization=None, fontsize=12, figsize=(15, 6), filename=None, title=None, **args)¶

Make a barplot of jackstraw p-values for principal components

Parameters

obj (adobo.data.dataset) – A data class object
normalization (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on the last normalization that was applied.
fontsize (int) – Specifies font size. Default: 12
figsize (tuple) – Figure size in inches. Default: (15, 6)
linewidth (float) – Border width. Default: 0.5
filename (str, optional) – Write to a file instead of showing the plot on screen.
title (str) – Title of the plot. Default: None
**args – Passed on into seaborn’s violinplot and boxplot functions

Returns

Return type

Nothing

adobo.plotting.overall(obj, what='cells', how='histogram', bin_size=100, cut_off=None, color='#E69F00', title=None, filename=None, **args)¶

Generates a plot of read counts per cell or expressed genes per cell

Parameters

obj (adobo.data.dataset) – A data class object.
what ({‘cells’, ‘genes’}) – If ‘cells’ then plots the number of reads per cell. If ‘genes’, then plots the number of expressed genes per cell. Default: ‘cells’
how ({‘histogram’, ‘boxplot’, ‘barplot’, ‘violin’}) – Type of plot to generate. Default: ‘histogram’
bin_size (int) – If how is a histogram, then this is the bin size. Default: 100
cut_off (int) – Sets a cut-off for genes or reads by drawing a red line and printing the number of cells over and under the cut-off. Only valid if how=’histogram’. Default: None
color (str) – Color of the plot. Default: ‘#E69F00’
title (str) – Change the default title of the plot. Default: None
filename (str, optional) – Write plot to file instead of showing it on the screen. Default: None

Returns

Return type

None

adobo.plotting.overall_scatter(obj, color_kept='#E69F00', color_filtered='red', title=None, filename=None, **args)¶

Generates a scatter plot showing the total number of reads on one axis and the number of detected genes on the other axis

Parameters

obj (adobo.data.dataset) – A data class object.
color_kept (str) – Color of the plot. Default: ‘#E69F00’
color_filtered (str) – Color of the cells that have been filtered out. Default: ‘red’
title (str) – Title of the plot. Default: None
filename (str, optional) – Write plot to file instead of showing it on the screen. Default: None

Returns

Return type

None

adobo.plotting.pca_contributors(obj, normalization=None, how='heatmap', clust_alg=None, cluster=None, all_genes=False, dim=[0, 1, 2], top=20, color='#E69F00', fontsize=6, figsize=(10, 5), filename=None, verbose=False, **args)¶

Examine the top contributing genes to each PCA component. Optionally, one can examine the PCA components of a cell cluster instead.

Note

The function takes half the genes with top negative scores and the other half from genes with positive scores. Additional parameters are passed into matplotlib.pyplot.savefig().

Parameters

obj (adobo.data.dataset) – A data class object
normalization (str) – The name of the normalization to operate on. If empty or None, the last one generated is used. Default: None
how ({‘heatmap’, ‘barplot’}) – How to visualize, can be barplot or heatmap. If ‘barplot’, then shows the PCA scores. If ‘heatmap’, then visualizes the expression of genes with top PCA scores. Default: ‘heatmap’
clust_alg (str) – Name of the clustering strategy. If empty or None, the last one generated is used. Default: None
cluster (int) – Name of the cluster.
all_genes (bool) – If cluster is set, then indicates if PCA should be computed on all genes or only on the highly variable genes. Default: False
dim (list or int) – If list, then it specifies indices of components to plot. If integer, then it specifies the first components to plot. First component has index zero. Default: [0, 1, 2]
top (int) – Specifies the number of top scoring genes to include (i.e. will use this many positive/negative scoring genes). Default: 20
color (str) – Color of the bars. Default: ‘#E69F00’
fontsize (int) – Specifies font size. Default: 6
figsize (tuple) – Figure size in inches. Default: (10, 5)
filename (str, optional) – Write to a file instead of showing the plot on screen. File type is determined by the filename extension.
verbose (bool) – Be verbose or not. Default: False

Example

>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.normalize.norm(exp, method='standard')
>>> ad.hvg.find_hvg(exp)
>>> ad.dr.pca(exp)
>>> ad.plotting.pca_contributors(exp, dim=4)
>>> # decomposition of a specific cluster
>>> ad.clustering.generate(exp, clust_alg='leiden')
>>> ad.plotting.pca_contributors(exp, dim=4, cluster=0)

Returns
Return type: Nothing
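
The selection rule described in the note above (half of the genes from the most negative scores, half from the most positive) can be sketched as follows. The function name, the loadings vector, and the exact tie and rounding behaviour are assumptions for illustration; adobo may select genes differently in detail.

```python
import numpy as np

def top_contributors_sketch(loadings, gene_names, top=20):
    """Pick top // 2 most negative and top // 2 most positive genes (illustrative only)."""
    loadings = np.asarray(loadings, dtype=float)
    half = top // 2
    order = np.argsort(loadings)                          # ascending by score
    negative = [gene_names[i] for i in order[:half]]      # most negative first
    positive = [gene_names[i] for i in order[::-1][:half]]  # most positive first
    return negative, positive

scores = [0.5, -0.9, 0.1, -0.2, 0.8, 0.0]    # hypothetical scores for one component
names = ['g0', 'g1', 'g2', 'g3', 'g4', 'g5']
neg, pos = top_contributors_sketch(scores, names, top=4)
```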

adobo.plotting.pca_elbow(obj, normalization=None, comp_max=100, all_genes=False, filename=None, font_size=8, figsize=(6, 4), color='#E69F00', title='PCA elbow plot', **args)¶

Generates a PCA elbow plot

Notes

Can be useful for determining the number of components to include. Here, PCA is computed using singular value decomposition.

Parameters

obj (adobo.data.dataset) – A data class object
normalization (str) – The name of the normalization to operate on. If empty or None, the last one generated is used. Default: None
comp_max (int) – Maximum number of components to include. Default: 100
all_genes (bool) – Run on all genes, i.e. not only highly variable genes. Default: False
filename (str, optional) – Name of an output file instead of showing on screen.
font_size (float) – Font size. Default: 8
figsize (tuple) – Figure size in inches. Default: (6, 4)
color (str) – Color of the line. Default: ‘#E69F00’
title (str) – A plot title.

Returns

Return type

Nothing.
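
The quantity shown in an elbow plot is the fraction of variance explained by each component. With PCA computed through singular value decomposition, as mentioned in the notes above, it can be obtained from the singular values alone. The sketch assumes a matrix with observations (cells) in rows, so a genes x cells matrix would need to be transposed first; explained_variance_sketch is a hypothetical name.

```python
import numpy as np

def explained_variance_sketch(mat, comp_max=100):
    """Explained variance ratio per principal component via SVD (illustrative only)."""
    mat = np.asarray(mat, dtype=float)
    centered = mat - mat.mean(axis=0)                 # center every feature
    singular = np.linalg.svd(centered, compute_uv=False)
    variance = singular ** 2 / (mat.shape[0] - 1)     # variance along each component
    return (variance / variance.sum())[:comp_max]

rng = np.random.default_rng(0)
ratios = explained_variance_sketch(rng.normal(size=(20, 5)))
```

Plotting ratios against the component index gives the elbow curve; the point where the curve flattens suggests how many components to keep.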

adobo.plotting.tree(obj, normalization='', clust_alg=None, method='complete', cell_types=True, min_cluster_size=10, fontsize=8, figsize=(10, 5), filename=None, title=None, **args)¶

Generates a dendrogram of cluster relationships

Parameters

obj (adobo.data.dataset) – A data class object
normalization (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on the last normalization that was applied.
clust_alg (str) – Name of the clustering strategy. If empty or None, the last one will be used.
method ({‘complete’, ‘single’, ‘average’, ‘weighted’, ‘centroid’, ‘median’, ‘ward’}) – The linkage algorithm to use. Default: ‘complete’
cell_types (bool) – Add putative cell type annotations (if available). Default: True
min_cluster_size (int) – Can be used to prevent clusters below a certain number of cells to be plotted. Default: 10
fontsize (int) – Specifies font size. Default: 8
figsize (tuple) – Figure size in inches. Default: (10, 5)
filename (str) – Write to a file instead of showing the plot on screen. Default: None
title (str) – Plot title.

Example

>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.preproc.symbol_switch(exp, species='human')
>>> ad.normalize.norm(exp, method='standard')
>>> ad.hvg.find_hvg(exp)
>>> ad.dr.pca(exp)
>>> ad.clustering.generate(exp, clust_alg='leiden')
>>> ad.bio.cell_type_predict(exp, verbose=True)
>>> ad.plotting.tree(exp)

Returns
Return type: Nothing

adobo.preproc module¶

Summary¶

Functions for pre-processing scRNA-seq data.

adobo.preproc.find_ercc(obj, ercc_pattern='^ERCC[_-]\\S+$', verbose=False)¶

Flag ERCC spikes

Parameters

obj (adobo.data.dataset) – A data class object.
ercc_pattern (str, optional) – A regular expression matching ERCC gene symbols. Default: “^ERCC[_-]\S+$”
verbose (bool, optional) – Be verbose or not. Default: False

Returns

Number of detected ercc spikes.

Return type

int
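
To see which identifiers the default pattern from the signature above accepts, it can be tried with Python's re module on a few made-up gene names. How adobo applies the pattern internally (for example, whether matching is case sensitive) is not shown here.

```python
import re

ercc_pattern = r'^ERCC[_-]\S+$'    # default pattern from the signature above

# hypothetical gene names; ERCC1 is an ordinary gene, not a spike-in
genes = ['ERCC-00002', 'ERCC_00130', 'ERCC1', 'GAPDH']

spikes = [g for g in genes if re.search(ercc_pattern, g)]
print(spikes)   # ['ERCC-00002', 'ERCC_00130']
```

The required separator after the ERCC prefix is what keeps the gene ERCC1 from being flagged as a spike-in.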

adobo.preproc.find_low_quality_cells(obj, rRNA_genes, sd_thres=3, seed=42, verbose=False)¶

Statistical detection of low quality cells using Mahalanobis distances

Notes

Mahalanobis distances are computed from five quality metrics. A robust estimate of covariance is used in the Mahalanobis function. By default, cells with Mahalanobis distances more than three standard deviations from the mean are considered outliers. The five metrics are:

log-transformed number of molecules detected

the number of genes detected

the percentage of reads mapping to ribosomal genes

the percentage of reads mapping to mitochondrial genes

ERCC recovery (if available)

Parameters

obj (adobo.data.dataset) – A data class object.
rRNA_genes (list or str) – Either a list of rRNA genes or a string containing the path to a file containing the rRNA genes (one gene per line).
sd_thres (float) – Number of standard deviations to consider significant, i.e. cells are considered low quality if their distance is more than this many standard deviations from the mean. Set higher to remove fewer cells. Default: 3
seed (float) – For the random number generator. Default: 42
verbose (bool) – Be verbose or not. Default: False

Returns

A list of low quality cells that were identified, and also modifies the passed object.

Return type

list
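
The thresholding logic described in the notes above can be sketched with NumPy. For brevity this example swaps in the ordinary sample covariance instead of the robust covariance estimate that the function is documented to use, and it works on two made-up metrics rather than five. The function name and the simulated data are assumptions for illustration only.

```python
import numpy as np

def mahalanobis_outliers_sketch(metrics, sd_thres=3):
    """Flag rows whose Mahalanobis distance is unusually large (illustrative only)."""
    metrics = np.asarray(metrics, dtype=float)          # cells x quality metrics
    centered = metrics - metrics.mean(axis=0)
    # plain sample covariance; NOT the robust estimate mentioned above
    inv_cov = np.linalg.inv(np.cov(metrics, rowvar=False))
    # distance of every cell from the centre, scaled by the covariance
    dist = np.sqrt(np.einsum('ij,jk,ik->i', centered, inv_cov, centered))
    cutoff = dist.mean() + sd_thres * dist.std()
    return np.flatnonzero(dist > cutoff)                # indices of flagged cells

rng = np.random.default_rng(0)
cells = np.vstack([rng.normal(size=(200, 2)), [10.0, 10.0]])   # last cell is extreme
flagged = mahalanobis_outliers_sketch(cells)
```

Raising sd_thres moves the cutoff further from the mean, so fewer cells are flagged.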

adobo.preproc.find_mitochondrial_genes(obj, mito_pattern='^mt-', genes=None, verbose=False)¶

Finds mitochondrial genes and adds the percentage of mitochondrial expression (out of total expression) to the cellular meta data

Parameters

obj (adobo.data.dataset) – A data class object.
mito_pattern (str) – A regular expression matching mitochondrial gene symbols. Default: “^mt-”
genes (list, optional) – Instead of using mito_pattern, specify a list of genes that are mitochondrial.
verbose (boolean) – Be verbose or not. Default: False

Example

>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.preproc.find_mitochondrial_genes(exp)

Returns: Number of mitochondrial genes detected.
Return type: int
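
The two steps described above, matching gene symbols against mito_pattern and computing the mitochondrial share of each cell's expression, can be sketched like this. The helper name, the mouse-style gene symbols, and the counts are made up for illustration, and the sketch matches case-sensitively, which may or may not be what adobo does.

```python
import re
import numpy as np

def percent_mito_sketch(counts, gene_names, mito_pattern='^mt-'):
    """Percent of each cell's counts that come from mitochondrial genes (illustrative only)."""
    counts = np.asarray(counts, dtype=float)             # genes x cells
    is_mito = np.array([re.search(mito_pattern, g) is not None for g in gene_names])
    return 100 * counts[is_mito].sum(axis=0) / counts.sum(axis=0)

genes = ['mt-Co1', 'mt-Nd1', 'Actb', 'Gapdh']
raw = [[5, 0], [5, 20], [40, 40], [50, 40]]
pct = percent_mito_sketch(raw, genes)
```

A high mitochondrial percentage is commonly read as a sign of a stressed or damaged cell, which is why it is worth storing per cell.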

adobo.preproc.impute(obj, filtered=True, res=0.5, drop_thre=0.5, nworkers='auto', verbose=True)¶

Impute dropouts using the method described in Li (2018) Nature Communications

Notes

Dropouts are artifacts in scRNA-seq data. One method to alleviate the problem with dropouts is to perform imputation (i.e. replacing missing data points with predicted values).

The present method uses a different procedure for subpopulation identification as compared with the original paper.

Parameters

obj (adobo.data.dataset) – A data class object.
filtered (bool) – If data have been filtered using adobo.preproc.simple_filter(), run imputation on filtered data; otherwise runs on the entire raw read count matrix. Default: True
res (float) – Resolution parameter for the Leiden clustering, change to modify cluster resolution. Default: 0.5
drop_thre (float) – Drop threshold. Default: 0.5
nworkers (int or {‘auto’}) – If a string, then the only accepted value is ‘auto’, and the number of worker processes will be the total number of detected physical cores. If an integer then it specifies the number of worker processes. Default: ‘auto’
verbose (bool) – Be verbose or not. Default: True

References

1: Li & Li (2018) An accurate and robust imputation method scImpute for single-cell RNA-seq data https://www.nature.com/articles/s41467-018-03405-7
2: https://github.com/Vivianstats/scImpute

Returns
Return type: Modifies the passed object.

adobo.preproc.mad_outlier(obj, nmads=3, verbose=False)¶

Outlier detection based on median absolute deviation

Notes

Removes cells that fall more than a given number of median absolute deviations (MADs) below the median of either of two quality metrics. The quality metrics are the log of the library size and the log of the number of detected genes. The principle is similar to Lun et al. Three MADs is the default.

Parameters

obj (adobo.data.dataset) – A data class object.
nmads (int) – Number of median absolute deviations below the median for the cell to be considered an outlier. Default: 3
verbose (bool) – Be verbose or not. Default: False

References

1: Lun et al. (2016) F1000Res, A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5112579/

Returns
Return type: Modifies the passed object.
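
The rule from the notes above, applied to a single metric, looks roughly like the following. The sketch uses the unscaled MAD (no consistency constant), which is an assumption; the function name and the library sizes are invented for illustration.

```python
import numpy as np

def mad_outliers_sketch(library_sizes, nmads=3):
    """Flag cells more than nmads MADs below the median log library size (illustrative only)."""
    logged = np.log(np.asarray(library_sizes, dtype=float))
    median = np.median(logged)
    mad = np.median(np.abs(logged - median))     # median absolute deviation
    return logged < median - nmads * mad         # True marks an outlier

sizes = [1000, 1100, 950, 1050, 990, 10]
is_outlier = mad_outliers_sketch(sizes)
```

Only the cell with ten reads falls below the threshold in this toy input. The same test would be repeated for the log number of detected genes, and a cell failing either test would be removed.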

adobo.preproc.reset_filters(obj)¶

Resets cell and gene filters

Parameters: obj (adobo.data.dataset) – A data class object.
Returns
Return type: Nothing. Modifies the passed object.

adobo.preproc.simple_filter(obj, what='cells', minreads=1000, maxreads=None, mingenes=None, maxgenes=None, min_exp=0.001, verbose=False)¶

Removes cells with too few reads or genes with very low expression

Notes

Default is to remove cells.

Parameters

obj (adobo.data.dataset) – A data class object.
what ({‘cells’, ‘genes’}) – Determines what should be filtered from the expression matrix. If ‘cells’, then cells are filtered. If ‘genes’, then genes are filtered. Default: ‘cells’
minreads (int, optional) – When filtering cells, defines the minimum number of reads per cell needed to keep the cell. Default: 1000
maxreads (int, optional) – When filtering cells, defines the maximum number of reads allowed to keep the cell. Useful for filtering out suspected doublets. Default: None
mingenes (float, int) – When filtering cells, defines the minimum number of genes that must be expressed in a cell to keep it. Default: None
maxgenes (float, int) – When filtering cells, defines the maximum number of genes that a cell is allowed to express to keep it. Default: None
min_exp (float, int) – Used to set a threshold for how to filter out genes. If integer, defines the minimum number of cells that must express a gene to keep the gene. If float, defines the minimum fraction of cells that must express the gene to keep the gene. Set to None to ignore this option. Default: 0.001
verbose (bool, optional) – Be verbose or not. Default: False

Example

>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.preproc.simple_filter(exp, what='cells', minreads=1500)
>>> ad.preproc.simple_filter(exp, what='genes')

Returns: Number of removed cells or genes.
Return type: int
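
Because min_exp means different things depending on its type, a small sketch of the two interpretations may help. It assumes that a gene counts as expressed in a cell when its count is above zero and that the threshold is inclusive; both are assumptions, as is the function name.

```python
import numpy as np

def genes_to_keep_sketch(counts, min_exp=0.001):
    """Boolean mask of genes passing the min_exp threshold (illustrative only)."""
    counts = np.asarray(counts)                     # genes x cells
    n_expressing = (counts > 0).sum(axis=1)         # cells expressing each gene
    if isinstance(min_exp, int):
        # integer: absolute number of cells
        return n_expressing >= min_exp
    # float: fraction of all cells
    return n_expressing / counts.shape[1] >= min_exp

raw = np.array([[1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 1]])
keep_by_count = genes_to_keep_sketch(raw, min_exp=2)
keep_by_fraction = genes_to_keep_sketch(raw, min_exp=0.75)
```

With four cells, min_exp=2 keeps genes expressed in at least two cells, whereas min_exp=0.75 keeps only genes expressed in at least three of them.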

adobo.preproc.symbol_switch(obj, species)¶

Changes gene symbol format

Notes

If gene symbols are in the format ENS[0-9]+, this function changes gene identifiers to symbol_ENS[0-9]+.

Parameters

obj (adobo.data.dataset) – A data class object.
species ({‘human’, ‘mouse’}) – Species.

Example

>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.preproc.symbol_switch(exp, species='human')

Returns
Return type: Modifies the passed object.

adobo.traj module¶

Summary¶

Functions for trajectory analysis.

adobo.traj.slingshot(obj, name=(), min_cluster_size=10, verbose=False)¶

Trajectory analysis on the cluster level following the strategy in the R package slingshot

Notes

Slingshot’s approach takes cells in a low dimensional space (UMAP is used below) and a clustering to generate a graph where vertices are clusters.

Only Slingshot’s ‘getLineages’ method is used at the moment.

References

1: Street et al. (2018) BMC Genomics. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics
2: https://bioconductor.org/packages/release/bioc/html/slingshot.html

Parameters

obj (adobo.data.dataset) – A data class object.
name (tuple) – A tuple of normalization to use. If it has the length zero, then all available normalizations will be used.
min_cluster_size (int) – Minimum number of cells per cluster to include the cluster. Default: 10
verbose (bool, optional) – Be verbose or not. Default: False

Returns

Return type

Nothing. Modifies the passed object.
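
The cluster-level graph described in the notes can be illustrated with a minimum spanning tree over cluster centers, which is the basic idea behind Slingshot's lineage step. The sketch below is a simplification under stated assumptions: it uses plain Euclidean distances between centers (Slingshot itself uses a covariance-aware distance), starts Prim's algorithm from the first cluster, and the function name and toy embedding are invented.

```python
import numpy as np

def cluster_mst_sketch(embedding, labels):
    """Minimum spanning tree between cluster centers, via Prim's algorithm (illustrative only)."""
    embedding = np.asarray(embedding, dtype=float)    # cells x dimensions
    labels = np.asarray(labels)
    clusters = sorted(set(labels.tolist()))
    centers = np.array([embedding[labels == c].mean(axis=0) for c in clusters])
    # pairwise Euclidean distances between cluster centers
    dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    in_tree, edges = [0], []
    while len(in_tree) < len(clusters):
        best = None
        # cheapest edge from the tree to a cluster not yet in it
        for i in in_tree:
            for j in range(len(clusters)):
                if j not in in_tree and (best is None or dist[i, j] < dist[best]):
                    best = (i, j)
        edges.append((clusters[best[0]], clusters[best[1]]))
        in_tree.append(best[1])
    return edges

points = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 0], [5, 1]]
edges = cluster_mst_sketch(points, [0, 0, 1, 1, 2, 2])
```

Each edge connects two clusters; a path through the tree can then be read as a candidate lineage.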
