adobo package

Submodules

adobo.IO module

adobo.bio module

Summary

Functions related to biology.

adobo.bio.cell_cycle_predict(obj, clf, tr_features, name=(), verbose=False)

Predicts cell cycle phase

Notes

The classifier is trained on mouse data, so it should _only_ be used on mouse data unless it is retrained on data from another species. Gene identifiers must be Ensembl identifiers (prefixed with ‘ENSMUSG’); gene symbols alone are not enough. Results are returned as a column in the meta_cells data frame of the passed object. Probability scores are not returned.

Parameters
  • obj (adobo.data.dataset) – A data class object.

  • clf (sklearn.linear_model.SGDClassifier) – The classifier.

  • tr_features (list) – Training features.

  • name (tuple) – A tuple of normalizations to use. If the tuple is empty, all available normalizations are used.

  • verbose (bool) – Be verbose. Default: False

Returns

Return type

Modifies the passed object.
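
Because the classifier expects mouse Ensembl gene identifiers, it can help to check the identifiers before predicting. The following is a minimal sketch and not part of adobo: the helper name is hypothetical, and the pattern assumes the usual form of ‘ENSMUSG’ followed by 11 digits with an optional version suffix.

```python
import re

# Assumption: mouse Ensembl gene IDs are "ENSMUSG" + 11 digits,
# optionally followed by a ".<version>" suffix.
ENSMUSG_PATTERN = re.compile(r"^ENSMUSG\d{11}(\.\d+)?$")


def non_mouse_ensembl_ids(gene_ids):
    """Return the identifiers that do not look like mouse Ensembl gene IDs."""
    return [g for g in gene_ids if not ENSMUSG_PATTERN.match(g)]


genes = ["ENSMUSG00000020914", "Top2a", "ENSG00000131747"]
bad = non_mouse_ensembl_ids(genes)
print(bad)
```

An empty result suggests the identifiers are in the expected form; gene symbols and human identifiers are reported.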

adobo.bio.cell_cycle_train(verbose=False)

Trains a cell cycle classifier using Stochastic Gradient Descent with data from Buettner et al.

Notes

Genes are selected from GO:0007049

The classifier only needs to be trained once; on subsequent calls it is loaded (deserialized) from disk.

Parameters

verbose (bool) – Be verbose or not. Default: False

References

1

Buettner et al. (2015) Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat Biotech.

Returns

  • sklearn.linear_model.SGDClassifier – A trained classifier.

  • list – Containing training features.

adobo.bio.cell_type_predict(obj, name=(), clustering=(), min_cluster_size=10, cell_type_markers=None, verbose=False)

Predicts cell types using the expression of marker genes

Notes

Gene identifiers should be in symbol form, not ensembl identifiers, etc.

Parameters
  • obj (adobo.data.dataset) – A data class object.

  • name (tuple) – A tuple of normalizations to use. If the tuple is empty, all available normalizations are used.

  • clustering (tuple, optional) – Specifies the clustering outcomes to work on.

  • min_cluster_size (int) – Minimum number of cells per cluster; clusters smaller than this are ignored. Default: 10

  • cell_type_markers (pandas.DataFrame) – Source of gene markers used to define cell types. This is set to None as default, indicating that PanglaoDB markers will be used. To use custom markers, set this to a pandas data frame where the first column is a gene and the second column is the name of the cell type (every cell type will have multiple rows). Default: None

  • verbose (bool) – Be verbose or not. Default: False

Returns

Return type

Modifies the passed object.
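
A custom marker table for cell_type_markers could be laid out as below. This is only a sketch of the structure described above (first column gene, second column cell type, several rows per cell type); the column names are hypothetical, since only the column order is specified.

```python
import pandas as pd

# One row per (gene, cell type) pair; every cell type spans several rows.
markers = pd.DataFrame(
    {
        "gene": ["CD3D", "CD3E", "CD79A", "MS4A1"],
        "cell_type": ["T cells", "T cells", "B cells", "B cells"],
    }
)

# Quick sanity check: list the marker genes of every cell type.
genes_per_type = markers.groupby("cell_type")["gene"].apply(list).to_dict()
print(genes_per_type)
```

Such a data frame would then be passed as cell_type_markers=markers.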

adobo.bulk module

adobo.clustering module

Summary

This module contains functions to cluster data.

adobo.clustering.generate(obj, k=10, name=None, distance='euclidean', graph='snn', clust_alg='leiden', prune_snn=0.067, res=0.8, save_graph=True, seed=42, verbose=False)

A wrapper function for generating single cell clusters from a shared nearest neighbor graph with the Leiden algorithm

Parameters
  • obj (adobo.data.dataset) – A dataset class object.

  • k (int) – Number of nearest neighbors. Default: 10

  • name (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on all normalizations available.

  • distance (str) – Distance metric to use. See here for valid choices: https://tinyurl.com/y4bckf7w Default: ‘euclidean’

  • target ({‘irlb’, ‘svd’}) – The dimensionality reduction result to run on. Default: irlb

  • graph ({‘snn’}) – Type of graph to generate. Only shared nearest neighbor (snn) supported at the moment.

  • clust_alg ({‘leiden’, ‘louvain’, ‘walktrap’, ‘spinglass’, ‘multilevel’, ‘infomap’, ‘label_prop’, ‘leading_eigenvector’}) – Clustering algorithm to be used. Default: ‘leiden’

  • prune_snn (float) – Threshold for pruning the SNN graph, i.e. the edges with lower value (Jaccard index) than this will be removed. Set to 0 to disable pruning. Increasing this value will result in fewer edges in the graph. Default: 0.067

  • res (float) – Resolution parameter for the Leiden algorithm _only_; change to modify cluster resolution. Default: 0.8

  • save_graph (bool) – To save the graph or not. Default: True

  • seed (int) – For reproducibility.

  • verbose (bool) – Be verbose or not.

References

1

Yang et al. (2016) A Comparative Analysis of Community Detection Algorithms on Artificial Networks. Scientific Reports

Returns

A dict containing cluster sizes (number of cells); returned only if retx is set to True.

Return type

dict

adobo.clustering.igraph(snn_graph, clust_alg)

Runs clustering functions within igraph

Parameters
  • snn_graph (pandas.DataFrame) – Source and target nodes.

  • clust_alg ({‘walktrap’, ‘spinglass’, ‘multilevel’, ‘infomap’, ‘label_prop’, ‘leading_eigenvector’}) – Specifies the community detection algorithm.

References

1

Pons & Latapy (2006) Computing Communities in Large Networks Using Random Walks, Journal of Graph Algorithms and Applications

2

Reichardt & Bornholdt (2006) Statistical mechanics of community detection, Physical Review E

Returns

Return type

Nothing. Modifies the passed object.

adobo.clustering.knn(comp, k=10, distance='euclidean')

Nearest neighbour search. Finds the k nearest neighbours of each cell.

Parameters
  • comp (pandas.DataFrame) – A pandas data frame containing PCA components.

  • k (int) – Number of nearest neighbors. Default: 10

  • target ({‘irlb’, ‘svd’}) – The dimensionality reduction result to run the NN search on. Default: irlb

  • distance (str) – Distance metric to use. See here for valid choices: https://tinyurl.com/y4bckf7w

Returns

Array containing indices.

Return type

numpy.ndarray
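
To illustrate what the returned index array contains, here is a brute-force nearest neighbour search in NumPy. It is a sketch only: the function name is hypothetical, only the Euclidean distance is handled, and excluding each cell from its own neighbour list is an assumption rather than documented behaviour.

```python
import numpy as np


def knn_indices(comp, k=10):
    """Brute-force k nearest neighbours (Euclidean); rows of `comp` are cells."""
    comp = np.asarray(comp, dtype=float)
    # Pairwise Euclidean distances between all cells.
    diff = comp[:, None, :] - comp[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Assumption: a cell is not counted as its own neighbour.
    np.fill_diagonal(dist, np.inf)
    # Indices of the k smallest distances in every row.
    return np.argsort(dist, axis=1)[:, :k]


points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.5]])
nn_idx = knn_indices(points, k=1)
print(nn_idx)
```

Row i of the result lists the indices of the cells closest to cell i.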

adobo.clustering.leiden(snn_graph, res=0.8, seed=42)

Runs the Leiden algorithm

Parameters
  • snn_graph (pandas.DataFrame) – Source and target nodes.

  • res (float) – Resolution parameter, change to modify cluster resolution. Default: 0.8

  • seed (int) – For reproducibility.

References

1

https://github.com/vtraag/leidenalg

2

Traag et al. (2018) https://arxiv.org/abs/1810.08473

Returns

Return type

Nothing. Modifies the passed object.

adobo.clustering.louvain(snn_graph, res=0.8, seed=42)

Runs the Louvain algorithm

Parameters
  • snn_graph (pandas.DataFrame) – Source and target nodes.

  • res (float) – Resolution parameter, change to modify cluster resolution. Default: 0.8

  • seed (int) – For reproducibility.

References

1

https://github.com/taynaud/python-louvain

2

https://perso.uclouvain.be/vincent.blondel/research/louvain.html

3

Blondel et al., Fast unfolding of communities in large networks (2008), Journal of Statistical Mechanics: Theory and Experiment

Returns

Return type

Nothing. Modifies the passed object.

adobo.clustering.snn(nn_idx, k=10, prune_snn=0.067, verbose=False)

Computes a Shared Nearest Neighbor (SNN) graph

Notes

Link weights are number of shared nearest neighbors. The sum of SNN similarities over all KNNs is retrieved with linear algebra.

Parameters
  • nn_idx (numpy.ndarray) – Numpy array generated using knn()

  • k (int) – Number of nearest neighbors. Default: 10

  • prune_snn (float) – Threshold for pruning the SNN graph, i.e. the edges with lower value (Jaccard index) than this will be removed. Set to 0 to disable pruning. Increasing this value will result in fewer edges in the graph. Default: 0.067

  • verbose (bool) – Be verbose or not.

References

1

http://mlwiki.org/index.php/SNN_Clustering

Returns

Return type

Nothing. Modifies the passed object.
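
The shared-neighbour counts and the Jaccard pruning described above can be sketched with a matrix product. This is an illustration under stated assumptions, not the adobo implementation: the function name is hypothetical, every row of nn_idx is assumed to hold k distinct neighbours, and edges are kept when their Jaccard index is at least prune_snn.

```python
import numpy as np


def snn_edges(nn_idx, prune_snn=0.067):
    """Build SNN edges (i, j, shared_count) from a kNN index array."""
    nn_idx = np.asarray(nn_idx)
    n, k = nn_idx.shape
    # Binary adjacency: adj[i, j] = 1 if j is among the k neighbours of i.
    adj = np.zeros((n, n))
    adj[np.repeat(np.arange(n), k), nn_idx.ravel()] = 1
    # adj @ adj.T counts the neighbours shared by every pair of cells.
    shared = adj @ adj.T
    # Jaccard index of two neighbour sets of size k: |A & B| / |A | B|.
    jaccard = shared / (2 * k - shared)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if shared[i, j] > 0 and jaccard[i, j] >= prune_snn:
                edges.append((i, j, int(shared[i, j])))
    return edges


# Two groups of three cells; neighbours are always within the same group.
nn_idx = [[1, 2], [0, 2], [0, 1], [4, 5], [3, 5], [3, 4]]
edges = snn_edges(nn_idx)
print(edges)
```

Raising prune_snn removes weakly connected pairs, which is why larger values give a sparser graph.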

adobo.data module

Summary

This module contains a data storage class.

class adobo.data.dataset(raw_mat, desc='no desc set', output_file=None, input_file=None, sparse=True, verbose=False)

Bases: object

Storage container for raw, imputed and normalized data as well as analysis results.

_assays

Holding information about what functions have been applied.

Type

dict

count_data

Raw read count matrix.

Type

pandas.DataFrame

imp_count_data

Raw data after imputing dropouts.

Type

pandas.DataFrame

_low_quality_cells

Low quality cells identified with adobo.preproc.find_low_quality_cells().

Type

list

_norm_data

Stores all analysis results. A nested dictionary.

Type

dict

meta_cells

A data frame containing meta data for cells.

Type

pandas.DataFrame

meta_genes

A data frame containing meta data for genes.

Type

pandas.DataFrame

desc

A string describing the dataset.

Type

str

sparse

Represent the data in a sparse data structure. Will save memory at the expense of time. Default: True

Type

bool

output_file

A filename that will be used when calling save().

Type

str, optional

version

The adobo package version used to create this data object.

Type

str

add_meta_data(axis, key, data, type_='cat')

Add meta data to the adobo object

Notes

Meta data can be added to cells or genes.

The parameter name ‘type_’ has a trailing underscore to avoid shadowing Python’s built-in type function.

Parameters
  • axis ({‘cells’, ‘genes’}) – Are the data for cells or genes?

  • key (str) – The variable name for your data. Must not contain whitespace or special characters.

  • data (numpy.ndarray, list or pandas.Series) – Data to add. Can be a basic Python list, a numpy array or a Pandas Series with an index. If the data type is numpy array or list, then the length must match the length of cells or genes. If the data type is a Pandas series, then the length does not need to match as long as the index is there. Data can be continuous or categorical and this must be specified with type_.

  • type_ ({‘cat’, ‘cont’}) – Specify if data are categorical or continuous. cat means categorical data and cont means continuous data. Default: ‘cat’

Returns

Return type

Nothing.
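
The difference between passing a list and passing an indexed Series can be illustrated with plain pandas. This sketch shows general pandas index alignment on a stand-in data frame; it is not adobo's internal code, and the cell names are made up.

```python
import pandas as pd

# Stand-in for the meta_cells data frame: one row per cell.
meta_cells = pd.DataFrame(index=["cell1", "cell2", "cell3"])

# A list carries no labels, so its length must equal the number of cells.
meta_cells["batch"] = ["a", "a", "b"]

# A Series is matched on its index, so it may cover only some of the cells;
# cells missing from the Series receive NaN.
score = pd.Series({"cell3": 0.9, "cell1": 0.1})
meta_cells["score"] = score
print(meta_cells)
```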

assays()

Displays a basic summary of the dataset and what analyses have been performed on it.

delete(what)

Deletes analysis results from the norm_data dictionary

Parameters

what (str, tuple) – A string (or tuple of strings) specifying keys of what you want to delete. Each string should be a key in norm_data.

Example

>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.normalize.norm(exp)
>>> ad.hvg.find_hvg(exp)
>>> ad.dr.pca(exp)
>>> ad.clustering.generate(exp)
>>> exp.norm_data['standard']['clusters']['leiden']
    {'membership': V1        6
    V2        0
    V3        5
    V4        1
    V5        0
             ..
    V8377     1
    V8378    10
    V8379    11
    V8380     3
    V8381     2
    Length: 8381, dtype: int64}
>>> exp.delete(what=('clusters',))
>>> exp.norm_data['standard']['clusters']['leiden']
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    KeyError: 'leiden'

Returns

Return type

Nothing.
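
The traceback in the example reports a missing ‘leiden’ key rather than a missing ‘clusters’ key, which suggests that the deleted entry is emptied rather than removed. A minimal sketch of that behaviour on a plain nested dictionary follows; the function name and the reset-to-empty-dict behaviour are assumptions inferred from the example.

```python
def delete_results(norm_data, what):
    """Reset the given result keys to empty dicts in every normalization."""
    if isinstance(what, str):
        what = (what,)
    for results in norm_data.values():
        for key in what:
            if key in results:
                # Assumption: the entry is emptied, not removed.
                results[key] = {}


norm_data = {
    "standard": {
        "hvg": {"genes": ["G1", "G2"]},
        "clusters": {"leiden": {"membership": [0, 1, 0]}},
    }
}
delete_results(norm_data, what=("clusters",))
print(norm_data["standard"]["clusters"])
```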

df_mem_usage(var)

Memory usage of a data frame in megabytes

Parameters

var (str) – Variable name as a string.

Returns

Megabytes used, rounded to two decimals.

Return type

float

get_assay(name, lang=False)

Reports whether a function has been applied.

is_normalized()

Checks if normalized data can be found

Returns

True if it is normalized otherwise False.

Return type

bool

property low_quality_cells

property norm_data

print_dict()

save(filename=None, compress=True, verbose=False)

Serializes the object

Notes

This is a method so that the filename does not need to be remembered: it can be specified once with the output_file parameter when the object is created. Load the saved object with joblib.load.

Parameters
  • filename (str) – Output filename. Default: None

  • compress (bool) – Save with data compression or not. Default: True

  • verbose (bool) – Be verbose or not. Default: False

Returns

Return type

Nothing.

set_assay(name, key=1)

Set the assay that was applied.

adobo.de module

adobo.dr module

Summary

Functions for dimensional reduction.

adobo.dr.force_graph(obj, name=(), iterations=1000, edgeWeightInfluence=1.0, jitterTolerance=1.0, barnesHutOptimize=True, scalingRatio=2.0, gravity=1.0, strongGravityMode=False, verbose=False)

Generates a force-directed graph

Parameters
  • obj (adobo.data.dataset) – A data class object.

  • name (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on the last normalization that was applied.

  • iterations (int) – Number of iterations. Default: 1000

  • edgeWeightInfluence (float) – How much influence to give to edge weights. 0 is no influence and 1 is normal. Default: 1.0

  • jitterTolerance (float) – Amount of swing tolerated. Lower values give less speed and more precision. Default: 1.0

  • barnesHutOptimize (bool) – Run Barnes Hut optimization. Default: True

  • scalingRatio (float) – Amount of repulsion, higher values make a more sparse graph. Default: 2.0

  • gravity (float) – Attracts nodes to the center. Prevents islands from drifting away. Default: 1.0

  • strongGravityMode (bool) – A stronger gravity view. Default: False

  • verbose (bool) – Be verbose or not.

References

1

https://en.wikipedia.org/wiki/Force-directed_graph_drawing

Returns

Return type

None

adobo.dr.genes2scores(obj, normalization=None, genes=[], bins=25, ctrl=100, retx=True, metadata=None)

Create cell scores from a list of genes

Parameters
  • obj (adobo.data.dataset) – A dataset class object.

  • normalization (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on all normalizations available.

  • genes (list) – A list of genes to compute scores from.

  • bins (int) – Number of expression bins to be used. Default: 25

  • ctrl (int) – Number of control genes in each bin. Default: 100

  • retx (bool) – Return scores. Default: True

  • metadata (str) – If this is set to a string, then the scores will be set as a meta data variable with this column name. Default: None

References

1

Tirosh et al. (2016) Science. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq

Returns

Return type

Nothing. Modifies the passed object.
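
The roles of bins and ctrl can be illustrated with a small score function in the spirit of Tirosh et al.: genes are binned by mean expression, control genes are drawn from the bins of the target genes, and the score is the mean target expression minus the mean control expression. This is a sketch under those assumptions, not the adobo code; the function name and the genes x cells array layout are hypothetical.

```python
import numpy as np


def gene_set_scores(expr, gene_idx, bins=25, ctrl=100, seed=0):
    """Score cells for a gene set; `expr` is a genes x cells array."""
    expr = np.asarray(expr, dtype=float)
    rng = np.random.default_rng(seed)
    # Rank genes by mean expression and split the ranking into equal-sized bins.
    order = np.argsort(expr.mean(axis=1))
    bin_of = np.empty(expr.shape[0], dtype=int)
    for b, members in enumerate(np.array_split(order, bins)):
        bin_of[members] = b
    # For every target gene, draw control genes from the same expression bin.
    control = set()
    for g in gene_idx:
        pool = np.setdiff1d(np.flatnonzero(bin_of == bin_of[g]), gene_idx)
        size = min(ctrl, pool.size)
        control.update(rng.choice(pool, size=size, replace=False).tolist())
    control = sorted(control)
    # Score = mean target expression minus mean control expression, per cell.
    return expr[gene_idx].mean(axis=0) - expr[control].mean(axis=0)


rng = np.random.default_rng(1)
expr = rng.random((200, 6))
expr[:5, :3] += 5.0  # genes 0-4 are high in the first three cells
scores = gene_set_scores(expr, gene_idx=[0, 1, 2, 3, 4], bins=5, ctrl=10)
print(scores.round(2))
```

Cells in which the gene set is highly expressed relative to similarly expressed control genes receive the highest scores.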

adobo.dr.irlb(data_norm, scale=True, ncomp=75, var_weigh=True, seed=None)

Truncated SVD by implicitly restarted Lanczos bidiagonalization

Notes

The augmented implicitly restarted Lanczos bidiagonalization algorithm (IRLBA) finds a few approximate largest singular values and corresponding singular vectors using a method of Baglama and Reichel.

Cells should be in rows and genes in columns.

Parameters
  • data_norm (pandas.DataFrame) – A pandas data frame containing normalized gene expression data.

  • scale (bool) – Scales input data prior to PCA. Default: True

  • ncomp (int) – Number of components to return. Default: 75

  • var_weigh (bool) – Weigh by the variance of each component. Default: True

  • seed (int) – For reproducibility. Default: None

References

1

Baglama et al (2005) Augmented Implicitly Restarted Lanczos Bidiagonalization Methods SIAM Journal on Scientific Computing

2

https://github.com/bwlewis/irlbpy

Returns

  • pd.DataFrame – A pandas.DataFrame containing the components (columns).

  • pd.DataFrame – A pandas.DataFrame containing the contributions of every gene (rows).

adobo.dr.jackstraw(obj, normalization=None, permutations=500, ncomp=None, subset_frac_genes=0.05, score_thr=0.001, fdr=0.01, retx=True, verbose=False)

Determine the number of relevant PCA components.

Notes

Permutes a subset of the data matrix and compares PCA scores with the original. The final output is a p-value for each component generated using a Chi-sq test.

Parameters
  • obj (adobo.data.dataset) – A dataset class object.

  • normalization (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on all normalizations available.

  • permutations (int) – Number of permutations to run. Default: 500

  • ncomp (int) – Number of principal components to calculate significance for. If None, significance is calculated for all components previously saved by adobo.dr.pca(). Default: None

  • subset_frac_genes (float) – Proportion of genes to use. Default: 0.05

  • score_thr (float) – Threshold for significance. Default: 0.001

  • fdr (float) – Acceptable false discovery rate. Default: 0.01

  • retx (bool) – In addition to modifying the object, also return the results. Default: True

  • verbose (bool) – Be verbose. Default: False

References

1

Chung & Storey (2015) Statistical significance of variables driving systematic variation in high-dimensional data, Bioinformatics https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4325543/

Returns

  • pandas.DataFrame – A genes by principal component data frame containing empirical p-values for the significance of every gene of the PC.

  • pandas.DataFrame – A data frame containing a single p-value for every PC generated from a Chi^2 test. Can be used to select the number of components to include by examining the p-values.

adobo.dr.pca(obj, method='irlb', normalization=None, ncomp=75, genes='hvg', scale=True, var_weigh=True, use_combat=False, verbose=False, seed=42)

Runs Principal Component Analysis (PCA)

Notes

Scaling of the data is achieved by setting scale=True (default), which will center (subtract the column mean) and scale columns (divide by their standard deviation).

Parameters
  • obj (adobo.data.dataset) – A dataset class object.

  • method ({‘irlb’, ‘svd’}) – Method to use for PCA. This does not matter much. Default: irlb

  • normalization (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on all normalizations available.

  • ncomp (int) – Number of components to return. Default: 75

  • genes ({‘hvg’, ‘all’} or list) – If a string, the allowed values are ‘hvg’ to use only the highly variable genes or ‘all’ to use all genes. If a list, then the list specifies the list of genes to use. Default: ‘hvg’

  • scale (bool) – Scales input data prior to PCA. Default: True

  • use_combat (bool) – Use ComBat corrected data or not. Default: False

  • var_weigh (bool) – Weigh by the variance of each component. Default: True

  • verbose (bool) – Be noisy or not. Default: False

  • seed (int) – For reproducibility (only irlb). Default: 42

References

1

https://en.wikipedia.org/wiki/Principal_component_analysis

2

Baglama et al (2005) Augmented Implicitly Restarted Lanczos Bidiagonalization Methods SIAM Journal on Scientific Computing

3

https://github.com/bwlewis/irlbpy

4

https://tinyurl.com/yyt6df5x

Returns

Modifies the passed object. Results are stored in two dictionaries in the passed object: dr (containing the components) and dr_gene_contr (containing gene loadings).

Return type

None

adobo.dr.regress(obj, target_vars=[], normalization=None)

Regress out the effects of certain meta data variables.

Notes

This function can be used to remove known confounding variables such as ambient gene expression modules, cell cycle genes or known experimental batches. It fits a linear model using numpy’s least square method (numpy.linalg.lstsq), predicts expression values from the model and then extracts the residuals, which become the new expression values.

Parameters
  • obj (adobo.data.dataset) – A dataset class object.

  • target_vars (list) – A list of target meta data variables.

  • normalization (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on the last normalization used.

Returns

Return type

Nothing. Modifies the passed object.
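
The regression step described in the Notes can be sketched with numpy.linalg.lstsq as follows. The function name and the cells x genes layout are assumptions made for the illustration; only the use of least squares and residuals comes from the description above.

```python
import numpy as np


def regress_out(expr, covariates):
    """Residuals of `expr` (cells x genes) after fitting `covariates` (cells x vars)."""
    expr = np.asarray(expr, dtype=float)
    covariates = np.asarray(covariates, dtype=float)
    # Design matrix with an intercept column.
    design = np.column_stack([np.ones(len(covariates)), covariates])
    # Least-squares fit of all genes at once.
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
    predicted = design @ coef
    # The residuals become the new expression values.
    return expr - predicted


rng = np.random.default_rng(0)
batch = rng.integers(0, 2, size=50).astype(float)
expr = rng.normal(size=(50, 3)) + 4.0 * batch[:, None]  # batch shifts every gene
resid = regress_out(expr, batch[:, None])
print(np.abs(resid.T @ batch).max())
```

After the fit, the residuals are orthogonal to the regressed variable, so the batch shift no longer contributes to the expression values. Because of the intercept column, every gene is also centred at zero.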

adobo.dr.svd(data_norm, scale=True, ncomp=75, only_sdev=False)

Principal component analysis via singular value decomposition

Parameters
  • data_norm (pandas.DataFrame) – A pandas data frame containing normalized gene expression data. Preferably this should be a subset of the normalized gene expression matrix containing highly variable genes.

  • scale (bool) – Scales input data prior to PCA. Default: True

  • ncomp (int) – Number of components to return. Default: 75

  • only_sdev (bool) – Only return the standard deviation of the components. Default: False

References

1

https://tinyurl.com/yyt6df5x

Returns

  • pd.DataFrame – A pandas.DataFrame containing the components (columns). Only if only_sdev=False.

  • pd.DataFrame – A pandas.DataFrame containing the contributions of every gene (rows). Only if only_sdev=False.

  • pd.DataFrame – A pandas.DataFrame containing standard deviations of components. Only if only_sdev is set to True.
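
For orientation, PCA via singular value decomposition can be sketched in a few lines of NumPy. This is a generic illustration with cells in rows and genes in columns, not the adobo implementation; the function name and the tuple return value are assumptions.

```python
import numpy as np


def pca_svd(data, ncomp=2, scale=True):
    """PCA of a cells x genes array via SVD; returns (components, loadings, sdev)."""
    x = np.asarray(data, dtype=float)
    x = x - x.mean(axis=0)                 # centre every gene (column)
    if scale:
        x = x / x.std(axis=0, ddof=1)      # unit variance per gene
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    components = u[:, :ncomp] * s[:ncomp]  # cell coordinates on each PC
    loadings = vt[:ncomp].T                # gene contributions to each PC
    sdev = s[:ncomp] / np.sqrt(x.shape[0] - 1)
    return components, loadings, sdev


rng = np.random.default_rng(0)
data = rng.normal(size=(30, 5))
components, loadings, sdev = pca_svd(data, ncomp=2)
print(components.shape, loadings.shape, sdev)
```

The standard deviations are the singular values divided by the square root of the number of cells minus one, which equals the standard deviation of each component column.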

adobo.dr.tsne(obj, run_on_PCA=True, name=None, perplexity=30, n_iter=2000, seed=None, verbose=False, **args)

Projects data to a two dimensional space using the tSNE algorithm.

Notes

It is recommended to perform this function on data in PCA space. This function calls sklearn.manifold.TSNE(), and any additional parameters will be passed to it.

Parameters
  • obj (adobo.data.dataset) – A dataset class object.

  • run_on_PCA (bool) – To run tSNE on PCA components or not. If False then runs on the entire normalized gene expression matrix. Default: True

  • name (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on all normalizations available.

  • perplexity (float) – From [1]: The perplexity is related to the number of nearest neighbors that is used in other manifold learning algorithms. Larger datasets usually require a larger perplexity. Consider selecting a value between 5 and 50. Different values can result in significantly different results. Default: 30

  • n_iter (int) – Number of iterations. Default: 2000

  • seed (int) – For reproducibility. Default: None

  • verbose (bool) – Be verbose. Default: False

References

1

van der Maaten, L.J.P.; Hinton, G.E. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9:2579-2605, 2008.

2

https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html

Returns

Return type

Nothing. Modifies the passed object.

adobo.dr.umap(obj, run_on_PCA=True, name=None, n_neighbors=15, distance='euclidean', n_epochs=None, learning_rate=1.0, min_dist=0.1, spread=1.0, seed=None, verbose=False, **args)

Projects data to a low-dimensional space using the Uniform Manifold Approximation and Projection (UMAP) algorithm

Notes

UMAP is a non-linear data reduction algorithm.

Parameters
  • obj (adobo.data.dataset) – A dataset class object.

  • run_on_PCA (bool) – To run UMAP on PCA components or not. If False then runs on the entire normalized gene expression matrix. Default: True

  • name (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on all normalizations available.

  • n_neighbors (int) – The size of local neighborhood (in terms of number of neighboring sample points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general values should be in the range 2 to 100. Default: 15

  • distance (str) – The metric to use to compute distances in high dimensional space. Default: ‘euclidean’

  • n_epochs (int) – The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings. If None is specified a value will be selected based on the size of the input dataset (200 for large datasets, 500 for small). Default: None

  • learning_rate (float) – The initial learning rate for the embedding optimization. Default: 1.0

  • min_dist (float) – The effective minimum distance between embedded points. Default: 0.1

  • spread (float) – The effective scale of embedded points. Default: 1.0

  • seed (int) – For reproducibility. Default: None

  • verbose (bool) – Be verbose. Default: False

References

1

McInnes L, Healy J, Melville J (2018) UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, https://arxiv.org/abs/1802.03426

2

https://github.com/lmcinnes/umap

3

https://umap-learn.readthedocs.io/en/latest/

Returns

Return type

Nothing. Modifies the passed object.

adobo.hvg module

Summary

Functions for detection of highly variable genes.

adobo.hvg.brennecke(data_norm, log, ercc=None, fdr=0.1, ngenes=1000, minBiolDisp=0.5, verbose=False)

Implements the method of Brennecke et al. (2013) to identify highly variable genes

Notes

Fits data using GLM with Fisher Scoring. GLM code copied from (credits to @madrury for this code): https://github.com/madrury/py-glm

Parameters
  • data_norm (pandas.DataFrame) – A pandas data frame containing normalized gene expression data.

  • log (bool) – If normalized data were log transformed or not.

  • ercc (pandas.DataFrame) – A pandas data frame containing normalized ercc spikes.

  • fdr (float) – False Discovery Rate considered significant.

  • minBiolDisp (float) – Minimum percentage of variance due to biological factors.

  • ngenes (int) – Number of top highly variable genes to return.

  • verbose (bool) – Be verbose or not.

References

1

Brennecke et al. (2013) Nature Methods https://doi.org/10.1038/nmeth.2645

Returns

A list containing highly variable genes.

Return type

list

adobo.hvg.chen2016(data_norm, log, fdr=0.1, ngenes=1000)

This function implements the approach from Chen (2016) to identify highly variable genes.

Notes

Expression counts should be normalized and on a log scale.

Parameters
  • data_norm (pandas.DataFrame) – A pandas data frame containing normalized gene expression data.

  • log (bool) – If normalized data were log transformed or not.

  • fdr (float) – False Discovery Rate considered significant.

  • ngenes (int) – Number of top highly variable genes to return.

References

1

https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-2897-6

2

https://github.com/hillas/scVEGs/blob/master/scVEGs.r

Returns

A list containing highly variable genes.

Return type

list

adobo.hvg.find_hvg(obj, method='seurat', normalization=None, ngenes=1000, fdr=0.1, use_combat=False, verbose=False)

Finding highly variable genes

Notes

A wrapper function around the individual HVG functions, which can also be called directly.

The method ‘brennecke’ should not be applied on ‘fqn’ normalized data.

Parameters
  • obj (adobo.data.dataset) – A dataset class object.

  • method ({‘seurat’, ‘brennecke’, ‘scran’, ‘chen2016’, ‘mm’}) – Specifies the method to be used.

  • normalization (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on all normalizations available.

  • ngenes (int) – Number of genes to return.

  • fdr (float) – False Discovery Rate threshold for significant genes applied to those methods that use it (brennecke, chen2016, mm). Note that the number of returned genes might be fewer than specified by ngenes because of FDR consideration.

  • use_combat (bool) – Use combat-adjusted data. Default: False

  • verbose (bool) – Be verbose or not.

Example

>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.normalize.norm(exp)
>>> ad.hvg.find_hvg(exp)

References

1

Yip et al. (2018) Briefings in Bioinformatics https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bby011/4898116

Returns

Return type

Nothing. Modifies the passed object.

adobo.hvg.mm(data_norm, log, fdr=0.1, ngenes=1000)

This function implements the approach from Andrews (2018).

Notes

Input should be normalized but not log-transformed.

Parameters
  • data_norm (pandas.DataFrame) – A pandas data frame containing normalized counts.

  • fdr (float) – False Discovery Rate considered significant.

  • ngenes (int) – Number of top highly variable genes to return.

References

1

https://doi.org/10.1093/bioinformatics/bty1044

2

https://github.com/tallulandrews/M3Drop

Returns

A list containing highly variable genes.

Return type

list

adobo.hvg.scran(data_norm, log, ngenes=1000, ercc=None)

This function implements the approach from the scran R package

Notes

Expression counts should be normalized and on a log scale.

Outline of the steps:

  1. Fit a polynomial regression model to the mean and variance of the technical genes.

  2. Decompose the total variance of the biological genes by subtracting the technical variance predicted by the fit.

  3. Sort the genes by biological variance.

Parameters
  • data_norm (pandas.DataFrame) – A pandas data frame containing normalized gene expression data.

  • log (bool) – If normalized data were log transformed or not.

  • ercc (pandas.DataFrame) – A pandas data frame containing normalized ercc spikes.

  • ngenes (int) – Number of top highly variable genes to return.

References

1

Lun ATL, McCarthy DJ, Marioni JC (2016). “A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor.” F1000Research, https://doi.org/10.12688/f1000research.9501.2

Returns

A list containing highly variable genes.

Return type

list
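
The three steps outlined in the Notes can be sketched as below. This is a simplified illustration, not the scran or adobo code: the function name, the polynomial degree of 2 and the genes x cells layout are assumptions.

```python
import numpy as np


def rank_by_biological_variance(bio, tech, ngenes=2, degree=2):
    """`bio` and `tech` are genes x cells arrays of log-normalized expression."""
    bio = np.asarray(bio, dtype=float)
    tech = np.asarray(tech, dtype=float)
    # 1. Fit variance as a polynomial function of the mean for technical genes.
    coef = np.polyfit(tech.mean(axis=1), tech.var(axis=1, ddof=1), deg=degree)
    # 2. Biological variance = total variance - predicted technical variance.
    technical = np.polyval(coef, bio.mean(axis=1))
    biological = bio.var(axis=1, ddof=1) - technical
    # 3. Sort genes by decreasing biological variance.
    return np.argsort(-biological)[:ngenes]


rng = np.random.default_rng(0)
# 20 spike-ins and 10 genes with only technical noise (sd 0.1) ...
tech = np.linspace(1, 5, 20)[:, None] + rng.normal(0, 0.1, size=(20, 100))
sd = np.full(10, 0.1)
sd[7] = 2.0  # ... except gene 7, which varies strongly between cells
bio = np.linspace(1.5, 4.5, 10)[:, None] + rng.normal(size=(10, 100)) * sd[:, None]
top = rank_by_biological_variance(bio, tech, ngenes=1)
print(top)
```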

adobo.hvg.seurat(data, ngenes=1000, num_bins=20)

Retrieves a list of highly variable genes using Seurat’s strategy

Notes

The function bins the genes according to average expression, then calculates the dispersion of each gene as its variance-to-mean ratio. Within each bin, the dispersions are converted to Z-scores. Genes are ranked by Z-score and the top ngenes (1000 by default) are selected. Input data should be normalized first.

Parameters
  • data (pandas.DataFrame) – A pandas data frame containing normalized gene expression data (rows=genes, columns=cells).

  • ngenes (int) – Number of top highly variable genes to return.

  • num_bins (int) – Number of bins to use.

References

1

https://cran.r-project.org/web/packages/Seurat/index.html

Returns

A list containing highly variable genes.

Return type

list
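
The binning strategy from the Notes can be sketched as follows. It is an illustration only, and Seurat's own implementation may differ in detail; the function name, the equal-width bins and the assumption that every gene has a positive mean are choices made for this example.

```python
import numpy as np


def seurat_like_hvg(data, ngenes=2, num_bins=20):
    """`data` is a genes x cells array of normalized expression (means > 0)."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=1)
    dispersion = data.var(axis=1, ddof=1) / mean  # variance-to-mean ratio
    # Assign every gene to one of `num_bins` equal-width bins of mean expression.
    edges = np.linspace(mean.min(), mean.max(), num_bins + 1)
    bin_of = np.clip(np.digitize(mean, edges) - 1, 0, num_bins - 1)
    z = np.zeros_like(dispersion)
    for b in np.unique(bin_of):
        members = bin_of == b
        sd = dispersion[members].std(ddof=1) if members.sum() > 1 else 0.0
        if sd > 0:
            z[members] = (dispersion[members] - dispersion[members].mean()) / sd
    # Rank genes by Z-score, highest first.
    return np.argsort(-z)[:ngenes]


# Ten genes with means 1..10 and a chosen variance-to-mean ratio each.
means = np.arange(1.0, 11.0)
disp = np.array([1.0, 1.1, 0.9, 1.05, 5.0, 1.0, 1.2, 0.8, 1.1, 0.9])
amp = np.sqrt(disp * means * 49 / 50)  # amplitude that yields that ratio
signs = np.tile([1.0, -1.0], 25)       # 50 cells, alternating around the mean
data = means[:, None] + amp[:, None] * signs[None, :]
top = seurat_like_hvg(data, ngenes=1, num_bins=2)
print(top)
```

Standardizing within bins keeps highly expressed genes from dominating the ranking only because of their expression level.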

adobo.normalize module

Summary

This module contains functions to normalize raw read counts.

adobo.normalize.ComBat(obj, normalization=None, meta_cells_var=None, mean_only=True, par_prior=True, verbose=False)

Adjust for batch effects in datasets where the batch covariate is known

Notes

ComBat is a classical method for batch correction and it has been shown to perform well on single cell data. The drawback of using ComBat is that all cells in a batch are used for estimating model parameters. This implementation follows the ComBat function in the R package SVA.

Commands should run in this order:

>>> ad.normalize.norm(exp)
>>> exp.add_meta_data(axis='cells', key='batch', data=batch_vector)
>>> ad.normalize.ComBat(exp, meta_cells_var='batch', verbose=True)
>>> ad.hvg.find_hvg(exp, use_combat=True)
>>> ad.dr.pca(exp, use_combat=True, verbose=True)

Parameters
  • obj (adobo.data.dataset) – A data class object.

  • normalization (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on all normalizations available.

  • meta_cells_var (str) – Meta data variable. Should be a column name in data.dataset.meta_cells.

  • mean_only (bool) – Mean only version of ComBat. Default: True

  • par_prior (bool) – True indicates parametric adjustments will be used, False indicates non-parametric adjustments will be used. Default: True

  • verbose (bool) – Be verbose or not. Default: False

References

1

Johnson et al. (2007) Biostatistics. Adjusting batch effects in microarray expression data using empirical Bayes methods.

2

Buttner et al. (2019) Nat Met. A test metric for assessing single-cell RNA-seq batch correction

Returns

Return type

Modifies the passed object.

adobo.normalize.clean_matrix(data, obj, remove_low_qual=True, remove_mito=True, meta=False)

adobo.normalize.clr(data, axis='genes')

Performs centered log ratio normalization similar to Seurat

Parameters
  • data (pandas.DataFrame) – A pandas data frame object containing raw read counts (rows=genes, columns=cells).

  • axis ({'genes', 'cells'}) – Normalize over genes or cells. Default: ‘genes’

References

1

Hafemeister et al. (2019) https://www.biorxiv.org/content/10.1101/576827v1

Example

>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.normalize.norm(exp, method='clr')
Returns

A normalized data matrix with same dimensions as before.

Return type

pandas.DataFrame
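
As an illustration of a centered log ratio, here is a minimal sketch that log-transforms the counts with a pseudocount of one and then subtracts the mean log value along the chosen axis. The exact formula used by adobo (and by Seurat) may differ; the function name clr_sketch and the interpretation of axis (with 'genes', each gene is centered across cells) are assumptions.

```python
import numpy as np
import pandas as pd

def clr_sketch(counts, axis="genes"):
    """Centered log ratio on a genes x cells data frame (illustrative only)."""
    logged = np.log1p(counts)                    # log(1 + x) keeps zeros finite
    if axis == "genes":
        # centre each gene (row) on its mean across cells
        return logged.sub(logged.mean(axis=1), axis=0)
    # centre each cell (column) on its mean across genes
    return logged.sub(logged.mean(axis=0), axis=1)

counts = pd.DataFrame({"cell1": [0, 4, 10], "cell2": [2, 0, 7]},
                      index=["geneA", "geneB", "geneC"])
clr_genes = clr_sketch(counts, axis="genes")
clr_cells = clr_sketch(counts, axis="cells")
```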

adobo.normalize.fqn(data)

Performs full quantile normalization (FQN)

Notes

FQN has been shown to perform well on single cell data and was a popular normalization scheme for microarray data. The present function does not handle ties well.

Parameters

data (pandas.DataFrame) – A pandas data frame object containing raw read counts (rows=genes, columns=cells).

References

1

Bolstad et al. (2003) Bioinformatics https://academic.oup.com/bioinformatics/article/19/2/185/372664

2

Cole et al. (2019) Cell Systems https://www.biorxiv.org/content/10.1101/235382v2

Example

>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.normalize.norm(exp, method='fqn')
Returns

A normalized data matrix with same dimensions as before.

Return type

pandas.DataFrame
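
The principle of full quantile normalization can be sketched as follows: sort every cell, average the sorted values rank by rank, and write those averages back according to each value's rank. This is an illustration under stated assumptions, not adobo's code; the name quantile_normalize is hypothetical and ties are broken by order of appearance, which is one simple (and imperfect) way to handle them.

```python
import numpy as np
import pandas as pd

def quantile_normalize(data):
    """Give every cell (column) the same empirical distribution."""
    # Mean of the k-th smallest value across cells, for every rank k.
    reference = np.sort(data.to_numpy(dtype=float), axis=0).mean(axis=1)
    # Zero-based rank of each value within its cell; ties broken by position.
    ranks = data.rank(axis=0, method="first").astype(int) - 1
    out = data.astype(float).copy()
    for col in data.columns:
        out[col] = reference[ranks[col].to_numpy()]
    return out

data = pd.DataFrame({"cell1": [5, 2, 3, 4],
                     "cell2": [4, 1, 4, 2],
                     "cell3": [3, 4, 6, 8]},
                    index=["g1", "g2", "g3", "g4"])
normalized = quantile_normalize(data)
```

After normalization, the sorted values of every cell are identical, which is the defining property of the method.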

adobo.normalize.norm(obj, method='standard', name=None, use_imputed=False, log=True, log_func=<ufunc 'log2'>, small_const=1, remove_low_qual=True, remove_mito=True, gene_lengths=None, scaling_factor=10000, axis='genes', ngenes=2000, nworkers='auto', retx=False, verbose=False)

Normalizes gene expression data

Notes

A wrapper function around the individual normalization functions, which can also be called directly.

Parameters
  • obj (adobo.data.dataset) – A dataset class object.

  • method ({‘standard’, ‘rpkm’, ‘fqn’, ‘clr’, ‘vsn’}) – Specifies the method to use. standard refers to the simplest normalization strategy involving scaling genes by total number of reads per cell. rpkm performs RPKM normalization and requires the gene_lengths parameter to be set. fqn performs a full-quantile normalization. clr performs centered log ratio normalization. vsn performs a variance stabilizing normalization. Default: standard

  • name (str) – A chosen name for the normalization. It is used for storing and retrieving this normalization for plotting later. If None or an empty string, then it is set to the value of method.

  • use_imputed (bool) – Use imputed data. If set to True, then adobo.preproc.impute() must have been run previously. Default: False

  • log (bool) – Perform log transformation. Default: True

  • log_func (numpy.func) – Logarithmic function to use. For example: np.log2, np.log1p, np.log10, etc. Default: np.log2

  • small_const (float) – A small constant to add to expression values to avoid log’ing genes with zero expression. Default: 1

  • remove_low_qual (bool) – Remove low quality cells and uninformative genes identified by prior steps. Default: True

  • remove_mito (bool) – Remove mitochondrial genes (if these have been detected with adobo.preproc.find_mitochondrial_genes). Default: True

  • gene_lengths (pandas.Series or str) – A pandas.Series containing the gene lengths in base pairs and gene names set as index. The names must match the gene names used in data (the order does not need to match and any symbols not found in the data will be discarded). Normally gene lengths should be the combined length of exons for every gene. If gene_lengths is a str then it is taken as a filename and loaded; first column is gene names and second column is the length, field separator is one space. gene_lengths needs to be set _only_ if method=’rpkm’. Default: None

  • scaling_factor (int) – Scaling factor used to multiply the scaled counts with. Only used for method=‘standard’. Default: 10000

  • axis ({‘genes’, ‘cells’}) – Only applicable when method=”clr”, defines the axis to normalize across. Default: ‘genes’

  • ngenes (int) – For method=’vsn’, number of genes to use when estimating parameters. Default: 2000

  • nworkers (int or {‘auto’}) – For method=’vsn’. If a string, then the only accepted value is ‘auto’, and the number of worker processes will be the total number of detected physical cores. If an integer then it specifies the number of worker processes. Default: ‘auto’

  • retx (bool) – Return the normalized data as well. Default: False

  • verbose (bool) – Be verbose or not. Default: False

References

1

Cole et al. (2019) Cell Systems https://www.biorxiv.org/content/10.1101/235382v2

Example

>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.normalize.norm(exp)
Returns

Return type

Nothing. Modifies the passed object.
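
The roles of log_func and small_const can be illustrated with a short sketch. It only shows the arithmetic of adding a constant before taking the logarithm; the toy values are made up and this is not a call into adobo.

```python
import numpy as np
import pandas as pd

# Toy scaled expression values (rows=genes, columns=cells).
scaled = pd.DataFrame({"cell1": [0.0, 3.0], "cell2": [1.0, 7.0]},
                      index=["geneA", "geneB"])

small_const = 1      # avoids log(0) for genes with zero expression
log_func = np.log2   # any NumPy logarithm could be substituted here

logged = log_func(scaled + small_const)
```

With small_const=1 and a base-2 logarithm, a zero count maps to zero, so unexpressed genes stay at zero after transformation.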

adobo.normalize.rpkm(data, gene_lengths)

Normalize expression values as RPKM

Notes

This method should be used if you need to adjust for gene length, such as in a SMART-Seq2 protocol.

Parameters
  • data (pandas.DataFrame) – A pandas data frame object containing raw read counts (rows=genes, columns=cells).

  • gene_lengths (pandas.Series or str) – Should contain the gene lengths in base pairs and gene names set as index. The names must match the gene names used in data. Normally gene lengths should be the combined length of exons for every gene. If gene_lengths is a str then it is taken as a file path and the file is loaded; the first column is gene names and the second column is the length, and the field separator is one space. An alternative format is a single column of combined exon lengths, where the total number of rows matches the number of rows in the raw read counts matrix and the rows are in the same order.

References

1

Conesa et al. (2016) Genome Biology https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0881-8

Returns

A normalized data matrix with same dimensions as before.

Return type

pandas.DataFrame
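
For reference, RPKM is reads per kilobase of transcript per million mapped reads. The sketch below applies that definition to a genes x cells data frame; it assumes the library size is the column sum of the supplied matrix, and the name rpkm_sketch is hypothetical rather than adobo's implementation.

```python
import pandas as pd

def rpkm_sketch(counts, gene_lengths):
    """RPKM = reads / (gene length in kb * library size in millions)."""
    lengths = gene_lengths.reindex(counts.index)    # align lengths by gene name
    depth = counts.sum(axis=0)                      # total reads per cell
    per_million = counts.div(depth, axis=1) * 1e6   # reads per million
    return per_million.div(lengths / 1e3, axis=0)   # per kilobase

counts = pd.DataFrame({"cell1": [100, 900], "cell2": [0, 500]},
                      index=["geneA", "geneB"])
# The order of the lengths does not need to match the count matrix.
gene_lengths = pd.Series({"geneB": 1000, "geneA": 2000})
rpkm_values = rpkm_sketch(counts, gene_lengths)
```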

adobo.normalize.standard(data, scaling_factor=10000)

Performs a standard normalization by scaling with the total read depth per cell and then multiplying with a scaling factor.

Parameters
  • data (pandas.DataFrame) – A pandas data frame object containing raw read counts (rows=genes, columns=cells).

  • scaling_factor (int) – Scaling factor used to multiply the scaled counts with. Default: 10000

References

1

Evans et al. (2018) Briefings in Bioinformatics https://academic.oup.com/bib/article/19/5/776/3056951

Example

>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.normalize.norm(exp, method='standard')
Returns

A normalized data matrix with same dimensions as before.

Return type

pandas.DataFrame
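
The description above corresponds to a short computation, sketched here with pandas. The helper name depth_normalize is hypothetical; this illustrates the arithmetic and is not necessarily how adobo implements it.

```python
import pandas as pd

def depth_normalize(counts, scaling_factor=10000):
    """Divide each cell by its total read depth, then multiply by scaling_factor."""
    depth = counts.sum(axis=0)                     # total reads per cell (column)
    return counts.div(depth, axis=1) * scaling_factor

counts = pd.DataFrame({"cell1": [10, 30, 60], "cell2": [5, 5, 0]},
                      index=["geneA", "geneB", "geneC"])
normalized = depth_normalize(counts)
```

After this step every cell sums to scaling_factor, so cells sequenced to different depths become comparable.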

adobo.normalize.vsn(data, min_cells=5, gmean_eps=1, ngenes=2000, nworkers='auto', verbose=False)

Performs variance stabilizing normalization based on a negative binomial regression model with regularized parameters

Notes

Use only with UMI counts. Adopts a subset of the functionality of vst in the R package sctransform.

Parameters
  • data (pandas.DataFrame) – A pandas data frame object containing raw read counts (rows=genes, columns=cells).

  • min_cells (int) – Minimum number of cells expressing a gene for the gene to be used. Default: 5

  • gmean_eps (float) – A small constant to avoid log(0)=-Inf. Default: 1

  • ngenes (int) – Number of genes to use when estimating parameters. Default: 2000

  • nworkers (int or {‘auto’}) – If a string, then the only accepted value is ‘auto’, and the number of worker processes will be the total number of detected physical cores. If an integer then it specifies the number of worker processes. Default: ‘auto’

  • verbose (bool) – Be verbose or not. Default: False

References

1

https://cran.r-project.org/web/packages/sctransform/index.html

2

https://www.biorxiv.org/content/10.1101/576827v1

Example

>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.normalize.norm(exp, method='vsn')
Returns

A data matrix with adjusted counts.

Return type

pandas.DataFrame

adobo.plotting module

Summary

Functions for plotting scRNA-seq data.

adobo.plotting.cell_viz(obj, reduction=None, normalization=(), clustering=(), metadata=(), genes=(), highlight=None, highlight_color=('black', 'red'), selection_mode=False, edges=False, cell_types=False, trajectory=None, filename=None, marker_size=0.8, font_size=8, colors='adobo', title=None, legend=True, legend_marker_scale=10, legend_position=(1, 1), min_cluster_size=10, figsize=(10, 10), margins=None, dark=False, aspect_ratio='equal', verbose=False, **args)

Generates a 2d scatter plot from an embedding

Parameters
  • obj (adobo.data.dataset) – A data class object

  • reduction ({‘tsne’, ‘umap’, ‘pca’, ‘force_graph’}) – The dimensional reduction to use. Default is to use the last one generated.

  • normalization (tuple) – A tuple of normalization to use. If it has the length zero, then the last generated will be used.

  • clustering (tuple) – Specifies the clustering outcomes to plot. If empty, then the last generated clustering is plotted.

  • metadata (tuple, optional) – Specifies the metadata variables to plot.

  • genes (tuple, optional) – Specifies genes to plot. Can also be a regular expression matching a single gene name.

  • highlight (int or str) – Highlight a cluster or a single cell. Integer if cluster and string if a cell.

  • highlight_color (tuple) – The colors to use when highlighting a cluster. Should be a tuple of length two. The first item is the color of all clusters other than the selected one, the second item is the color of the highlighted cluster.

  • selection_mode (bool) – Enables interactive selection of cells. Prints the IDs of the cells inside the rectangle. Default: False

  • edges (bool) – Draw edges (only applicable if reduction=’force_graph’). Default: False

  • cell_types (bool) – Print cell type predictions, applicable if adobo.bio.cell_type_predict() has been run. Default: False

  • trajectory (str, optional) – The trajectory to plot. For example ‘slingshot’. Default: None

  • filename (str, optional) – Name of an output file instead of showing on screen.

  • marker_size (float) – The size of the markers. Default: 0.8

  • font_size (float) – Font size. Default: 8

  • colors ({‘adobo’, ‘random’} or list) – Can be: (i) ‘adobo’ or ‘random’; or (ii) a list of colors with the same length as the number of factors. If colors is set to ‘adobo’, then colors are retrieved from adobo._constants.CLUSTER_COLORS_DEFAULT (but if the number of clusters exceeds 50, then random colors will be used). Default: ‘adobo’

  • title (str) – An optional title of the plot.

  • legend (bool) – Add legend or not. Default: True

  • legend_marker_scale (int) – Scale the markers in the legend. Default: 10

  • legend_position (tuple) – A tuple of length two describing the position of the legend. Default: (1,1)

  • min_cluster_size (int) – Can be used to prevent clusters below a certain number of cells from being plotted. Default: 10

  • figsize (tuple) – Figure size in inches. Default: (10, 10)

  • margins (dict) – Can be used to adjust margins. Should be a dict with one or more of the keys: ‘left’, ‘bottom’, ‘right’, ‘top’, ‘wspace’, ‘hspace’. Set verbose=True to figure out the present values. Default: None

  • dark (bool) – Make the background color black. Default: False

  • aspect_ratio ({‘equal’, ‘auto’}) – Set the aspect of the axis scaling, i.e. the ratio of y-unit to x-unit. Default: ‘equal’

  • verbose (bool) – Be verbose or not. Default: False

Returns

Return type

None

adobo.plotting.exp_genes(obj, normalization=None, clust_alg=None, cluster=None, min_cluster_size=10, violin=True, scale='width', fontsize=10, figsize=(10, 5), linewidth=0.5, filename=None, title=None, **args)

Compare number of expressed genes across clusters

Parameters
  • obj (adobo.data.dataset) – A data class object

  • normalization (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on the last normalization that was applied.

  • clust_alg (str) – Name of the clustering strategy. If empty or None, the last one will be used.

  • cluster (list or int) – List of cluster identifiers to plot. If a list, then expecting a list of cluster indices. An integer specifies only one cluster index. If None, then shows the expression across all clusters. Default: None

  • min_cluster_size (int) – Can be used to prevent clusters below a certain number of cells from being plotted. Default: 10

  • violin (bool) – Draws a violin plot (otherwise a box plot). Default: True

  • scale ({‘width’, ‘area’}) – If area, each violin will have the same area. If width, each violin will have the same width. Default: ‘width’

  • fontsize (int) – Specifies font size. Default: 10

  • figsize (tuple) – Figure size in inches. Default: (10, 5)

  • linewidth (float) – Border width. Default: 0.5

  • filename (str, optional) – Write to a file instead of showing the plot on screen.

  • title (str) – Title of the plot. Default: None

  • **args – Passed on into seaborn’s violinplot and boxplot functions

Example

>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.normalize.norm(exp, method='standard')
>>> ad.hvg.find_hvg(exp)
>>> ad.dr.pca(exp)
>>> ad.clustering.generate(exp, clust_alg='leiden')
>>> ad.plotting.exp_genes(exp)
Returns

Return type

Nothing

adobo.plotting.genes_violin(obj, normalization='', clust_alg=None, cluster=None, gene=None, rank_func=<function median>, top=10, violin=True, scale='width', fontsize=10, figsize=(10, 5), linewidth=0.5, filename=None, **args)

Plot individual genes using a violin plot (or box plot). Can be used to plot the top genes in the total dataset or the top genes in individual clusters. A specific gene can also be selected using the parameter gene.

Parameters
  • obj (adobo.data.dataset) – A data class object

  • normalization (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on the last normalization that was applied.

  • clust_alg (str) – Name of the clustering strategy. If empty or None, the last one will be used.

  • cluster (list or int) – List of cluster identifiers to plot. If a list, then expecting a list of cluster indices. An integer specifies only one cluster index. If None, then shows the expression across all clusters. Default: None

  • gene (str) – Compare a single gene across all clusters (can also be a regular expression, but it must match a single gene). If this is None, then the top is plotted based on the ranking function specified below. Default: None

  • rank_func (np.median) – Ranking function. numpy’s median is the default.

  • top (int) – Specifies the number of top scoring genes to include. Default: 10

  • violin (bool) – Draws a violin plot (otherwise a box plot). Default: True

  • scale ({‘width’, ‘area’}) – If area, each violin will have the same area. If width, each violin will have the same width. Default: ‘width’

  • fontsize (int) – Specifies font size. Default: 10

  • figsize (tuple) – Figure size in inches. Default: (10, 5)

  • linewidth (float) – Border width. Default: 0.5

  • filename (str, optional) – Write to a file instead of showing the plot on screen.

  • **args – Passed on into seaborn’s violinplot and boxplot functions

Example

>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.normalize.norm(exp, method='standard')
>>> ad.hvg.find_hvg(exp)
>>> ad.dr.pca(exp)
>>> ad.clustering.generate(exp, clust_alg='leiden')
>>>
>>> # top 10 genes in cluster 0
>>> ad.plotting.genes_violin(exp, top=10, cluster=0)
>>>
>>> # top 10 genes across all clusters
>>> ad.plotting.genes_violin(exp, top=10)
>>>
>>> # plotting one gene across all clusters
>>> ad.plotting.genes_violin(exp, gene='ENSG00000163220')
>>>
>>> # same, but using a box plot
>>> ad.plotting.genes_violin(exp, gene='ENSG00000163220', violin=False)
Returns

Return type

Nothing

adobo.plotting.jackstraw_barplot(obj, normalization=None, fontsize=12, figsize=(15, 6), filename=None, title=None, **args)

Make a barplot of jackstraw p-values for principal components

Parameters
  • obj (adobo.data.dataset) – A data class object

  • normalization (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on the last normalization that was applied.

  • fontsize (int) – Specifies font size. Default: 12

  • figsize (tuple) – Figure size in inches. Default: (15, 6)

  • linewidth (float) – Border width. Default: 0.5

  • filename (str, optional) – Write to a file instead of showing the plot on screen.

  • title (str) – Title of the plot. Default: None

  • **args – Passed on into seaborn’s violinplot and boxplot functions

Returns

Return type

Nothing

adobo.plotting.overall(obj, what='cells', how='histogram', bin_size=100, cut_off=None, color='#E69F00', title=None, filename=None, **args)

Generates a plot of read counts per cell or expressed genes per cell

Parameters
  • obj (adobo.data.dataset) – A data class object.

  • what ({‘cells’, ‘genes’}) – If ‘cells’ then plots the number of reads per cell. If ‘genes’, then plots the number of expressed genes per cell. Default: ‘cells’

  • how ({‘histogram’, ‘boxplot’, ‘barplot’, ‘violin’}) – Type of plot to generate. Default: ‘histogram’

  • bin_size (int) – If how is a histogram, then this is the bin size. Default: 100

  • cut_off (int) – Set a cut off for genes or reads by drawing a red line and print the number of cells over and under the cut off. Only valid if how=’histogram’. Default: None

  • color (str) – Color of the plot. Default: ‘#E69F00’

  • title (str) – Change the default title of the plot. Default: None

  • filename (str, optional) – Write plot to file instead of showing it on the screen. Default: None

Returns

Return type

None

adobo.plotting.overall_scatter(obj, color_kept='#E69F00', color_filtered='red', title=None, filename=None, **args)

Generates a scatter plot showing the total number of reads on one axis and the number of detected genes on the other axis

Parameters
  • obj (adobo.data.dataset) – A data class object.

  • color_kept (str) – Color of the plot. Default: ‘#E69F00’

  • color_filtered (str) – Color of the cells that have been filtered out. Default: ‘red’

  • title (str) – Title of the plot. Default: None

  • filename (str, optional) – Write plot to file instead of showing it on the screen. Default: None

Returns

Return type

None

adobo.plotting.pca_contributors(obj, normalization=None, how='heatmap', clust_alg=None, cluster=None, all_genes=False, dim=[0, 1, 2], top=20, color='#E69F00', fontsize=6, figsize=(10, 5), filename=None, verbose=False, **args)

Examine the top contributing genes to each PCA component. Optionally, one can examine the PCA components of a cell cluster instead.

Note

The function takes half the genes with top negative scores and the other half from genes with positive scores. Additional parameters are passed into matplotlib.pyplot.savefig().

Parameters
  • obj (adobo.data.dataset) – A data class object

  • normalization (str) – The name of the normalization to operate on. If empty or None, the last one generated is used. Default: None

  • how ({‘heatmap’, ‘barplot’}) – How to visualize, can be barplot or heatmap. If ‘barplot’, then shows the PCA scores. If ‘heatmap’, then visualizes the expression of genes with top PCA scores. Default: ‘heatmap’

  • clust_alg (str) – Name of the clustering strategy. If empty or None, the last one generated is used. Default: None

  • cluster (int) – Name of the cluster.

  • all_genes (bool) – If cluster is set, then indicates if PCA should be computed on all genes or only on the highly variable genes. Default: False

  • dim (list or int) – If list, then it specifies indices of components to plot. If integer, then it specifies the first components to plot. First component has index zero. Default: [0, 1, 2]

  • top (int) – Specifies the number of top scoring genes to include (i.e. will use this many positive/negative scoring genes). Default: 20

  • color (str) – Color of the bars. Default: ‘#E69F00’

  • fontsize (int) – Specifies font size. Default: 6

  • figsize (tuple) – Figure size in inches. Default: (10, 5)

  • filename (str, optional) – Write to a file instead of showing the plot on screen. File type is determined by the filename extension.

  • verbose (bool) – Be verbose or not. Default: False

Example

>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.normalize.norm(exp, method='standard')
>>> ad.hvg.find_hvg(exp)
>>> ad.dr.pca(exp)
>>> ad.plotting.pca_contributors(exp, dim=4)
>>> # decomposition of a specific cluster
>>> ad.clustering.generate(exp, clust_alg='leiden')
>>> ad.plotting.pca_contributors(exp, dim=4, cluster=0)
Returns

Return type

Nothing
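
The selection rule described in the note (half of the genes from the most negative scores, half from the most positive) can be sketched as below. The function name top_contributors and the example loadings are hypothetical, and adobo's own selection code may differ in details such as how an odd value of top is split.

```python
import numpy as np

def top_contributors(loadings, genes, top=20):
    """Return (most negative, most positive) genes for one component."""
    order = np.argsort(loadings)                        # ascending by score
    half = top // 2
    negative = [genes[i] for i in order[:half]]         # strongest negative scores
    positive = [genes[i] for i in order[::-1][:half]]   # strongest positive scores
    return negative, positive

genes = ["g0", "g1", "g2", "g3", "g4", "g5"]
loadings = np.array([0.5, -0.9, 0.1, -0.2, 0.8, 0.0])
negative, positive = top_contributors(loadings, genes, top=4)
```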

adobo.plotting.pca_elbow(obj, normalization=None, comp_max=100, all_genes=False, filename=None, font_size=8, figsize=(6, 4), color='#E69F00', title='PCA elbow plot', **args)

Generates a PCA elbow plot

Notes

Can be useful for determining the number of components to include. Here, PCA is computed using singular value decomposition.

Parameters
  • obj (adobo.data.dataset) – A data class object

  • normalization (str) – The name of the normalization to operate on. If empty or None, the last one generated is used. Default: None

  • comp_max (int) – Maximum number of components to include. Default: 100

  • all_genes (bool) – Run on all genes, i.e. not only highly variable genes. Default: False

  • filename (str, optional) – Name of an output file instead of showing on screen.

  • font_size (float) – Font size. Default: 8

  • figsize (tuple) – Figure size in inches. Default: (6, 4)

  • color (str) – Color of the line. Default: ‘#E69F00’

  • title (str) – A plot title.

Returns

Return type

Nothing.
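
The quantity shown in an elbow plot is typically the fraction of variance explained by each component. A minimal sketch of obtaining it from a singular value decomposition follows; the function name and the random toy matrix are assumptions, and adobo may scale or subset the data differently before decomposition.

```python
import numpy as np

def explained_variance_ratio(matrix, comp_max=100):
    """Fraction of variance per principal component of a cells x genes matrix."""
    centered = matrix - matrix.mean(axis=0)           # centre every gene
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2                   # proportional to PC variance
    return (variance / variance.sum())[:comp_max]

rng = np.random.default_rng(0)
# Toy data: 50 cells, 10 genes with increasing spread.
cells_by_genes = rng.normal(size=(50, 10)) * np.arange(1, 11)
ratios = explained_variance_ratio(cells_by_genes)
```

Plotting ratios against the component index gives the elbow curve; the component where the curve flattens is a common heuristic for the number of components to keep.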

adobo.plotting.tree(obj, normalization='', clust_alg=None, method='complete', cell_types=True, min_cluster_size=10, fontsize=8, figsize=(10, 5), filename=None, title=None, **args)

Generates a dendrogram of cluster relationships

Parameters
  • obj (adobo.data.dataset) – A data class object

  • normalization (str) – The name of the normalization to operate on. If this is empty or None then the function will be applied on the last normalization that was applied.

  • clust_alg (str) – Name of the clustering strategy. If empty or None, the last one will be used.

  • method ({‘complete’, ‘single’, ‘average’, ‘weighted’, ‘centroid’, ‘median’, ‘ward’}) – The linkage algorithm to use. Default: ‘complete’

  • cell_types (bool) – Add putative cell type annotations (if available). Default: True

  • min_cluster_size (int) – Can be used to prevent clusters below a certain number of cells from being plotted. Default: 10

  • fontsize (int) – Specifies font size. Default: 8

  • figsize (tuple) – Figure size in inches. Default: (10, 5)

  • filename (str) – Write to a file instead of showing the plot on screen. Default: None

  • title (str) – Plot title.

Example

>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.preproc.symbol_switch(exp, species='human')
>>> ad.normalize.norm(exp, method='standard')
>>> ad.hvg.find_hvg(exp)
>>> ad.dr.pca(exp)
>>> ad.clustering.generate(exp, clust_alg='leiden')
>>> ad.bio.cell_type_predict(exp, verbose=True)
>>> ad.plotting.tree(exp)
Returns

Return type

Nothing

adobo.preproc module

Summary

Functions for pre-processing scRNA-seq data.

adobo.preproc.find_ercc(obj, ercc_pattern='^ERCC[_-]\\S+$', verbose=False)

Flag ERCC spikes

Parameters
  • obj (adobo.data.dataset) – A data class object.

  • ercc_pattern (str, optional) – A regular expression matching ERCC gene symbols. Default: ‘^ERCC[_-]\S+$’

  • verbose (bool, optional) – Be verbose or not. Default: False

Returns

Number of detected ercc spikes.

Return type

int
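
The default pattern can be tried out with Python's re module. The gene names below are made up for illustration; the sketch only demonstrates what the regular expression matches, not how adobo flags the spikes internally.

```python
import re

# Default pattern from the signature: "ERCC", then "_" or "-", then non-whitespace.
ERCC_PATTERN = r"^ERCC[_-]\S+$"

genes = ["ERCC-00002", "ERCC_00130", "ERCC1", "Actb"]
spikes = [gene for gene in genes if re.match(ERCC_PATTERN, gene)]
n_spikes = len(spikes)
```

Note that a gene symbol such as ERCC1 is not matched, because the pattern requires an underscore or hyphen directly after the ERCC prefix.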

adobo.preproc.find_low_quality_cells(obj, rRNA_genes, sd_thres=3, seed=42, verbose=False)

Statistical detection of low quality cells using Mahalanobis distances

Notes

Mahalanobis distances are computed from five quality metrics. A robust estimate of covariance is used in the Mahalanobis function. Cells with Mahalanobis distances of three standard deviations from the mean are by default considered outliers. The five metrics are:

  1. log-transformed number of molecules detected

  2. the number of genes detected

  3. the percentage of reads mapping to ribosomal genes

  4. the percentage of reads mapping to mitochondrial genes

  5. ercc recovery (if available)

Parameters
  • obj (adobo.data.dataset) – A data class object.

  • rRNA_genes (list or str) – Either a list of rRNA genes or a string containing the path to a file containing the rRNA genes (one gene per line).

  • sd_thres (float) – Number of standard deviations to consider significant, i.e. cells whose Mahalanobis distance lies more than this many standard deviations from the mean are considered low quality. Set it higher to remove fewer cells. Default: 3

  • seed (float) – For the random number generator. Default: 42

  • verbose (bool) – Be verbose or not. Default: False

Returns

A list of low quality cells that were identified, and also modifies the passed object.

Return type

list
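
The outlier rule can be illustrated with a small NumPy sketch. It uses the ordinary sample covariance rather than a robust estimate, two synthetic metrics instead of five, and a hypothetical function name, so it is a simplification of what is described above and not adobo's implementation.

```python
import numpy as np

def mahalanobis_outliers(metrics, sd_thres=3.0):
    """Indices of rows whose Mahalanobis distance exceeds mean + sd_thres * sd."""
    centered = metrics - metrics.mean(axis=0)
    # Plain sample covariance; a robust estimator would resist outliers better.
    cov_inv = np.linalg.inv(np.cov(metrics, rowvar=False))
    dist = np.sqrt(np.einsum("ij,jk,ik->i", centered, cov_inv, centered))
    cutoff = dist.mean() + sd_thres * dist.std()
    return np.flatnonzero(dist > cutoff)

rng = np.random.default_rng(42)
metrics = rng.normal(size=(200, 2))              # 200 ordinary cells, 2 metrics
metrics = np.vstack([metrics, [20.0, 20.0]])     # one extreme cell at index 200
flagged = mahalanobis_outliers(metrics)
```

Raising sd_thres moves the cutoff outwards, so fewer cells are flagged.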

adobo.preproc.find_mitochondrial_genes(obj, mito_pattern='^mt-', genes=None, verbose=False)

Finds mitochondrial genes and adds the percentage of total expression that is mitochondrial to the cellular meta data

Parameters
  • obj (adobo.data.dataset) – A data class object.

  • mito_pattern (str) – A regular expression matching mitochondrial gene symbols. Default: ‘^mt-’

  • genes (list, optional) – Instead of using mito_pattern, specify a list of genes that are mitochondrial.

  • verbose (boolean) – Be verbose or not. Default: False

Example

>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.preproc.find_mitochondrial_genes(exp)
Returns

Number of mitochondrial genes detected.

Return type

int
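
What the function computes can be sketched with pandas and the re module. The helper name mito_percent and the toy count matrix are hypothetical; matching is case-sensitive here, which is an assumption about the behavior of the default pattern.

```python
import re
import pandas as pd

def mito_percent(counts, mito_pattern="^mt-"):
    """Return (number of mitochondrial genes, percent mitochondrial reads per cell)."""
    is_mito = [re.search(mito_pattern, gene) is not None for gene in counts.index]
    mito_reads = counts.loc[is_mito].sum(axis=0)       # mitochondrial reads per cell
    percent = 100 * mito_reads / counts.sum(axis=0)
    return sum(is_mito), percent

counts = pd.DataFrame({"c1": [10, 80, 10], "c2": [0, 50, 50]},
                      index=["mt-Nd1", "Actb", "mt-Co1"])
n_mito, percent = mito_percent(counts)
```

Because the default pattern is lower case, gene symbols written as MT-ND1 would need a different mito_pattern or an explicit genes list.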

adobo.preproc.impute(obj, filtered=True, res=0.5, drop_thre=0.5, nworkers='auto', verbose=True)

Impute dropouts using the method described in Li & Li (2018) Nature Communications

Notes

Dropouts are artifacts in scRNA-seq data. One method to alleviate the problem with dropouts is to perform imputation (i.e. replacing missing data points with predicted values).

The present method uses a different procedure for subpopulation identification as compared with the original paper.

Parameters
  • obj (adobo.data.dataset) – A data class object.

  • filtered (bool) – If data have been filtered using adobo.preproc.simple_filter(), run imputation on filtered data; otherwise runs on the entire raw read count matrix. Default: True

  • res (float) – Resolution parameter for the Leiden clustering, change to modify cluster resolution. Default: 0.5

  • drop_thre (float) – Drop threshold. Default: 0.5

  • nworkers (int or {‘auto’}) – If a string, then the only accepted value is ‘auto’, and the number of worker processes will be the total number of detected physical cores. If an integer then it specifies the number of worker processes. Default: ‘auto’

  • verbose (bool) – Be verbose or not. Default: True

References

1

Li & Li (2018) An accurate and robust imputation method scImpute for single-cell RNA-seq data https://www.nature.com/articles/s41467-018-03405-7

2

https://github.com/Vivianstats/scImpute

Returns

Return type

Modifies the passed object.

adobo.preproc.mad_outlier(obj, nmads=3, verbose=False)

Outlier detection based on median absolute deviation

Notes

Removes cells that lie more than nmads median absolute deviations (MADs) below the median of either of two quality metrics. The quality metrics are the log of the library size and the log of the number of detected genes. The principle is similar to Lun et al. The default is three MADs.

Parameters
  • obj (adobo.data.dataset) – A data class object.

  • nmads (int) – Number of median absolute deviations below the median for the cell to be considered an outlier. Default: 3

  • verbose (bool) – Be verbose or not. Default: False

References

1

Lun et al. (2016) F1000Res, A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5112579/

Returns

Return type

Modifies the passed object.
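
The rule in the notes can be sketched for a single quality metric as follows. The sketch uses the raw MAD without the consistency constant that some implementations apply, and the function name and library sizes are made up, so treat it as an illustration rather than adobo's code.

```python
import numpy as np

def mad_low_outliers(values, nmads=3):
    """Indices of values lying more than nmads MADs below the median (log scale)."""
    logged = np.log(np.asarray(values, dtype=float))
    median = np.median(logged)
    mad = np.median(np.abs(logged - median))     # raw median absolute deviation
    return np.flatnonzero(logged < median - nmads * mad).tolist()

library_sizes = [1000, 1100, 900, 1050, 950, 10]
flagged = mad_low_outliers(library_sizes)
```

Only the lower tail is tested, so cells with unusually large libraries are not flagged by this rule.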

adobo.preproc.reset_filters(obj)

Resets cell and gene filters

Parameters

obj (adobo.data.dataset) – A data class object.

Returns

Return type

Nothing. Modifies the passed object.

adobo.preproc.simple_filter(obj, what='cells', minreads=1000, maxreads=None, mingenes=None, maxgenes=None, min_exp=0.001, verbose=False)

Removes cells with too few reads or genes with very low expression

Notes

Default is to remove cells.

Parameters
  • obj (adobo.data.dataset) – A data class object.

  • what ({‘cells’, ‘genes’}) – Determines what should be filtered from the expression matrix. If ‘cells’, then cells are filtered. If ‘genes’, then genes are filtered. Default: ‘cells’

  • minreads (int, optional) – When filtering cells, defines the minimum number of reads per cell needed to keep the cell. Default: 1000

  • maxreads (int, optional) – When filtering cells, defines the maximum number of reads allowed to keep the cell. Useful for filtering out suspected doublets. Default: None

  • mingenes (float, int) – When filtering cells, defines the minimum number of genes that must be expressed in a cell to keep it. Default: None

  • maxgenes (float, int) – When filtering cells, defines the maximum number of genes that a cell is allowed to express to keep it. Default: None

  • min_exp (float, int) – Used to set a threshold for how to filter out genes. If integer, defines the minimum number of cells that must express a gene to keep the gene. If float, defines the minimum fraction of cells that must express the gene to keep the gene. Set to None to ignore this option. Default: 0.001

  • verbose (bool, optional) – Be verbose or not. Default: False

Example

>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.preproc.simple_filter(exp, what='cells', minreads=1500)
>>> ad.preproc.simple_filter(exp, what='genes')
Returns

Number of removed cells or genes.

Return type

int
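
The two filtering modes can be sketched on a plain count matrix. The helper names are hypothetical, only minreads and min_exp are covered, and whether the thresholds are inclusive is an assumption made for this illustration.

```python
import pandas as pd

def filter_cells(counts, minreads=1000):
    """Keep cells (columns) with at least minreads reads in total."""
    keep = counts.sum(axis=0) >= minreads
    return counts.loc[:, keep]

def filter_genes(counts, min_exp=0.001):
    """Keep genes expressed in enough cells (count if int, fraction if float)."""
    n_expressing = (counts > 0).sum(axis=1)
    if isinstance(min_exp, float):
        keep = n_expressing / counts.shape[1] >= min_exp
    else:
        keep = n_expressing >= min_exp
    return counts.loc[keep]

counts = pd.DataFrame({"c1": [600, 500, 0], "c2": [50, 0, 0], "c3": [900, 300, 0]},
                      index=["g1", "g2", "g3"])
kept_cells = filter_cells(counts, minreads=1000)
kept_genes = filter_genes(kept_cells, min_exp=0.001)
n_removed_cells = counts.shape[1] - kept_cells.shape[1]
```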

adobo.preproc.symbol_switch(obj, species)

Changes gene symbol format

Notes

If gene symbols are in the format ENS[0-9]+, this function changes gene identifiers to symbol_ENS[0-9]+.

Parameters
  • obj (adobo.data.dataset) – A data class object.

  • species ({‘human’, ‘mouse’}) – Species of the dataset.

Example

>>> import adobo as ad
>>> exp = ad.IO.load_from_file('pbmc8k.mat.gz', bundled=True)
>>> ad.preproc.symbol_switch(exp, species='human')
Returns

Return type

Modifies the passed object.

adobo.traj module

Summary

Functions for trajectory analysis.

adobo.traj.slingshot(obj, name=(), min_cluster_size=10, verbose=False)

Trajectory analysis on the cluster level following the strategy in the R package slingshot

Notes

Slingshot’s approach takes cells in a low dimensional space (UMAP is used below) and a clustering to generate a graph where vertices are clusters.

Only slingshot’s ‘getLineages’ method is used at the moment.

References

1

Street et al. (2018) BMC Genomics. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics

2

https://bioconductor.org/packages/release/bioc/html/slingshot.html

Parameters
  • obj (adobo.data.dataset) – A data class object.

  • name (tuple) – A tuple of normalization to use. If it has the length zero, then all available normalizations will be used.

  • min_cluster_size (int) – Minimum number of cells per cluster to include the cluster. Default: 10

  • verbose (bool, optional) – Be verbose or not. Default: False

Returns

Return type

Nothing. Modifies the passed object.
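
The cluster-level graph described in the notes can be sketched as a minimum spanning tree over cluster centroids. The sketch uses Euclidean distances between centroids and Prim's algorithm; slingshot itself uses a covariance-aware distance and additional lineage logic, so the function name and the distance choice here are simplifying assumptions.

```python
import numpy as np

def cluster_mst(embedding, labels):
    """Edges of a minimum spanning tree connecting cluster centroids."""
    clusters = np.unique(labels)
    centroids = np.array([embedding[labels == c].mean(axis=0) for c in clusters])
    # Pairwise Euclidean distances between centroids.
    dist = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    in_tree = [0]                 # Prim's algorithm, started at the first cluster
    edges = []
    while len(in_tree) < len(clusters):
        best = None
        for i in in_tree:
            for j in range(len(clusters)):
                if j not in in_tree and (best is None or dist[i, j] < dist[best]):
                    best = (i, j)
        edges.append((int(clusters[best[0]]), int(clusters[best[1]])))
        in_tree.append(best[1])
    return edges

embedding = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0],
                      [1.0, 0.0], [5.0, 0.0], [5.0, 0.0]])
labels = np.array([0, 0, 1, 1, 2, 2])
edges = cluster_mst(embedding, labels)
```

Each edge joins two clusters, and a path through the tree from a chosen root cluster to a leaf corresponds to one candidate lineage.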

Module contents