adobo.glm package

Submodules

adobo.glm.families module

Exponential families for GLMs

class adobo.glm.families.Bernoulli

Bases: adobo.glm.families.ExponentialFamily, adobo.glm.families.ExponentialFamilyMixin

A Bernoulli exponential family, used to fit a classical logistic model.

The GLM fit with this family has the following structure equation:

y | X ~ Bernoulli(p = logistic(X beta))

deviance(y, mu)
has_dispersion = False
initial_working_response(y)
initial_working_weights(y)
sample(mus, dispersion)
variance(mu)
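
As a rough illustration of the quantities a Bernoulli family works with, the sketch below writes out the logistic inverse link, the Bernoulli variance function, and the Bernoulli deviance in plain NumPy. The function names are hypothetical and this is not adobo's implementation.

```python
import numpy as np

def logistic(eta):
    """Inverse of the logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + np.exp(-eta))

def bernoulli_variance(mu):
    """Variance of a Bernoulli variable with mean mu."""
    return mu * (1.0 - mu)

def bernoulli_deviance(y, mu):
    """Twice the negative Bernoulli log-likelihood for 0/1 targets y."""
    return -2.0 * np.sum(y * np.log(mu) + (1.0 - y) * np.log(1.0 - mu))

eta = np.array([-2.0, 0.0, 2.0])   # linear predictor X beta
mu = logistic(eta)                 # fitted probabilities
y = np.array([0.0, 1.0, 1.0])
dev = bernoulli_deviance(y, mu)
```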
class adobo.glm.families.Exponential

Bases: adobo.glm.families.Gamma

An Exponential distribution exponential family.

The GLM fit with this family has the following structure equation:

y | X ~ Exponential(scale = exp(X beta))

The only difference between this family and the Gamma family is that the dispersion is fixed at 1.

has_dispersion = False
class adobo.glm.families.ExponentialFamily

Bases: object

An ExponentialFamily must implement at least four methods and define one attribute.

inv_link:

The inverse link function.

d_inv_link:

The derivative of the inverse link function.

variance:

The variance function linking the mean to the variance of the distribution.

deviance:

The deviance of the family. Used as a measure of model fit.

sample:

A sampler from the conditional distribution of the response.

abstract deviance(y, mu)
abstract sample(mus, dispersion)
abstract variance(mu)
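
The members listed above can be sketched for a Poisson-like family with a log link. This is a stand-alone illustration that does not subclass anything in adobo; the class name, the seed argument, and the signatures of inv_link and d_inv_link are assumptions, since only deviance(y, mu), sample(mus, dispersion), and variance(mu) are documented here.

```python
import numpy as np

class SketchPoissonFamily:
    """Hypothetical stand-alone family; not part of adobo."""

    has_dispersion = False

    def __init__(self, seed=0):
        self._rng = np.random.default_rng(seed)

    def inv_link(self, nu):
        # Log link, so the inverse link is the exponential.
        return np.exp(nu)

    def d_inv_link(self, nu):
        # Derivative of exp is exp.
        return np.exp(nu)

    def variance(self, mu):
        # For a Poisson distribution the variance equals the mean.
        return np.asarray(mu, dtype=float)

    def deviance(self, y, mu):
        y = np.asarray(y, dtype=float)
        mu = np.asarray(mu, dtype=float)
        # y * log(y / mu) is taken to be 0 when y == 0.
        ratio = np.where(y > 0, y / mu, 1.0)
        return 2.0 * np.sum(y * np.log(ratio) - (y - mu))

    def sample(self, mus, dispersion=None):
        # The dispersion argument is ignored because has_dispersion is False.
        return self._rng.poisson(mus)

family = SketchPoissonFamily()
perfect_fit = family.deviance([1, 2, 3], [1.0, 2.0, 3.0])
draws = family.sample(np.array([1.0, 5.0, 10.0]))
```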
class adobo.glm.families.ExponentialFamilyMixin

Bases: object

Implementations of methods common to all ExponentialFamilies.

penalized_deviance(y, mu, alpha, coef)
class adobo.glm.families.Gamma

Bases: adobo.glm.families.ExponentialFamily, adobo.glm.families.ExponentialFamilyMixin

A Gamma exponential family.

The GLM fit with this family has the following structure equation:

y | X ~ Gamma(shape = dispersion, scale = exp(X beta) / dispersion)

Here, dispersion is a nuisance parameter.

Note: In this family we use the logarithmic link function instead of the reciprocal link function. Although the reciprocal is the canonical link, the logarithmic link is more commonly used in Gamma regression.

deviance(y, mu)
has_dispersion = True
sample(mu, dispersion)
variance(mu)
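
To see why the structure equation above gives a conditional mean of exp(X beta), the following sketch draws from Gamma(shape = dispersion, scale = exp(X beta) / dispersion) with NumPy and compares the sample means to exp(X beta). The data and coefficient values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[1.0, 0.5], [1.0, -1.0]])   # first column is the intercept
beta = np.array([0.2, 0.3])
dispersion = 4.0

mu = np.exp(X @ beta)          # conditional means under the log link
shape = dispersion
scale = mu / dispersion        # shape * scale == mu

# One column of draws per row of X.
draws = rng.gamma(shape, scale, size=(20000, 2))
sample_mean = draws.mean(axis=0)
```

Under this parameterization the conditional variance is mu**2 / dispersion, so larger values of dispersion give tighter distributions around the mean.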
class adobo.glm.families.Gaussian

Bases: adobo.glm.families.ExponentialFamily, adobo.glm.families.ExponentialFamilyMixin

A Gaussian exponential family, used to fit a classical linear model.

The GLM fit with this family has the following structure equation:

y | X ~ Gaussian(mu = X beta, sigma = dispersion)

Here, sigma is a nuisance parameter.

deviance(y, mu)
has_dispersion = True
initial_working_response(y)
initial_working_weights(y)
sample(mus, dispersion)
variance(mu)
class adobo.glm.families.Poisson

Bases: adobo.glm.families.QuasiPoisson

A Poisson exponential family, used to fit a classical Poisson regression.

The GLM fit with this family has the following structure equation:

y | X ~ Poisson(mu = exp(X beta))

The Poisson model does not estimate a dispersion parameter; if overdispersion is present, the standard errors estimated in this model may be too small. If this is the case, consider fitting a QuasiPoisson model.

has_dispersion = False
class adobo.glm.families.QuasiPoisson

Bases: adobo.glm.families.ExponentialFamily, adobo.glm.families.ExponentialFamilyMixin

A QuasiPoisson exponential family, used to fit a possibly overdispersed Poisson regression.

The GLM fit with this family has the following structure equation:

y | X ~ Poisson(mu = exp(X beta))

The parameter estimates of this model are the same as those of a Poisson model, but a dispersion parameter is estimated, allowing for possibly larger standard errors when overdispersion is present.

deviance(y, mu)
has_dispersion = True
sample(mus, dispersion)
variance(mu)

adobo.glm.glm module

class adobo.glm.glm.GLM(family, alpha=0.0)

Bases: object

A generalized linear model.

GLMs are a generalization of the classical linear and logistic models to other conditional distributions of response y. A GLM is specified by a link function G and a family of conditional distributions dist, with the model specification given by

y | X ~ dist(theta = G(X * beta))

Here beta are the parameters fit in the model, with X * beta a matrix multiplication just like in linear regression. Above, theta is a parameter of the one-parameter family of distributions dist.

In this implementation, a specific GLM is specified with a family object of ExponentialFamily type, which contains the information about the conditional distribution of y, and its connection to X, needed to construct the model. See the documentation for ExponentialFamily for details.

The model is fit to data using the well-known Fisher scoring algorithm, which is a version of Newton’s method where the Hessian is replaced with its expectation with respect to the assumed distribution of y.
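
A minimal sketch of Fisher scoring for a Poisson regression with a log link is given below. It is not adobo's fitting code: the function name is hypothetical, it ignores offsets, weights, and the ridge penalty, and it stops on the step size rather than on the relative change in deviance.

```python
import numpy as np

def fisher_scoring_poisson(X, y, max_iter=25, tol=1e-10):
    """Fit Poisson regression coefficients by Fisher scoring (sketch)."""
    beta = np.zeros(X.shape[1])
    # Assumes the first column of X is the intercept column of ones.
    beta[0] = np.log(np.mean(y))
    for _ in range(max_iter):
        mu = np.exp(X @ beta)
        # Gradient of the log-likelihood with respect to beta.
        score = X.T @ (y - mu)
        # Expected information: X^T diag(mu) X for the log link.
        information = X.T @ (mu[:, None] * X)
        step = np.linalg.solve(information, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
X = np.column_stack([np.ones_like(x), x])
y = rng.poisson(np.exp(0.5 + 1.0 * x)).astype(float)
beta_hat = fisher_scoring_poisson(X, y)
```

At the solution the score vector is numerically zero, which is a convenient check that the iteration has converged.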

Parameters
  • family (ExponentialFamily object) – The exponential family used in the model.

  • alpha (float, non-negative) – The ridge regularization strength. If non-zero, the loss function minimized is a penalized deviance, where the penalty is alpha * np.sum(model.coef_**2).
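
The penalized loss described for alpha can be written out directly. The helper below is a hypothetical sketch that follows the formula as stated above, penalizing every entry of the coefficient vector.

```python
import numpy as np

def ridge_penalized_loss(deviance, coef, alpha):
    """Deviance plus the ridge penalty alpha * sum(coef**2)."""
    coef = np.asarray(coef, dtype=float)
    return deviance + alpha * np.sum(coef ** 2)

loss = ridge_penalized_loss(10.0, [1.0, 2.0], alpha=0.5)
```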

family

The exponential family used in the model.

Type

ExponentialFamily object

alpha

The regularization strength.

Type

float, non-negative

formula

An (optional) formula specifying the model. Used in the case that X is passed as a DataFrame. For documentation on model formulas, please see the patsy library documentation.

Type

str

X_info

Contains information about the model formula used to process the training data frame into a design matrix.

Type

patsy.design_info.DesignInfo object.

X_names

Names for the predictors.

Type

List[str]

y_names

Name for the target variable.

Type

str

coef_

The fit parameter estimates. None if the model has not yet been fit.

Type

array, shape (n_features, )

deviance_

The final deviance of the fit model on the training data.

Type

float

information_matrix_

The estimated information matrix. This information matrix is evaluated at the fit parameters.

Type

array, shape (n_features, n_features)

n

The number of samples used to fit the model, or the sum of the sample weights.

Type

integer, positive

p

The number of fit parameters in the model.

Type

integer, positive

Notes

Instead of supplying a fit_intercept argument, we assume the programmer has included a column of ones as the first column of X. The fit method will raise an exception if this is not the case.
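
One way to prepare a design matrix that satisfies this requirement is sketched below. Both helper names are hypothetical; adobo's own check lives in adobo.glm.utils and may differ.

```python
import numpy as np

def add_intercept_column(X):
    """Prepend a column of ones to a 2-D array of predictors."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([np.ones(X.shape[0]), X])

def first_column_is_ones(X):
    """Return True when the first column of X is identically one."""
    return bool(np.all(np.asarray(X)[:, 0] == 1.0))

raw = np.array([[2.0, 3.0], [4.0, 5.0], [6.0, 7.0]])
design = add_intercept_column(raw)
```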

clone()
property coef_covariance_matrix_
property coef_standard_error_
property dispersion_

Return an estimate of the dispersion parameter phi.
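
The documentation does not say which estimator is used for phi. One common choice is the Pearson estimate, the sum of squared Pearson residuals divided by the residual degrees of freedom n - p; the sketch below assumes that estimator and uses hypothetical names.

```python
import numpy as np

def pearson_dispersion(y, mu, variance, n_params):
    """Pearson estimate of the dispersion: sum((y - mu)**2 / V(mu)) / (n - p)."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    pearson_chi2 = np.sum((y - mu) ** 2 / variance(mu))
    return pearson_chi2 / (y.shape[0] - n_params)

# Poisson-type variance function V(mu) = mu, intercept-only model (p = 1).
phi = pearson_dispersion([1.0, 3.0], [2.0, 2.0], variance=lambda mu: mu, n_params=1)
```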

fit(X, y=None, formula=None, *, X_names=None, y_name=None, **kwargs)

Fit the GLM model to training data.

Fitting the model uses the well-known Fisher scoring algorithm.

Parameters
  • X (array, shape (n_samples, n_features) or pd.DataFrame) – Training data.

  • y (array, shape (n_samples, )) – Target values.

  • formula (str) – A formula specifying the model. Used in the case that X is passed as a DataFrame. For documentation on model formulas, please see the patsy library documentation.

  • warm_start (array, shape (n_features, )) – Initial values to use for the parameter estimates in the optimization, useful when fitting an entire regularization path of models. If not supplied, the initial intercept estimate will be the mean of the target array, and all other parameter estimates will be initialized to zero.

  • offset (array, shape (n_samples, )) –

    Offsets for samples. If provided, the model fit is

    E[Y|X] = family.inv_link(np.dot(X, coef) + offset)

    This is especially useful in models with exposures, as in Poisson regression.

  • sample_weights (array, shape (n_samples, )) – Sample weights used in the deviance minimized by the model. If provided, each term in the deviance being minimized is multiplied by its corresponding weight.

  • max_iter (positive integer) – The maximum number of iterations for the fitting algorithm.

  • tol (float, non-negative and less than one) – The convergence tolerance for the fitting algorithm. The relative change in deviance is compared to this tolerance to check for convergence.
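
A relative-change convergence test of the kind described for tol might look like the sketch below. The function name is hypothetical, and the exact rule used by adobo.glm.utils.has_converged is not documented here.

```python
def relative_change_converged(loss, loss_prev, tol):
    """True when |loss - loss_prev| is within tol times |loss_prev|."""
    # Written without a division so that loss_prev == 0 does not raise.
    return abs(loss - loss_prev) <= tol * abs(loss_prev)

small_change = relative_change_converged(100.0, 100.001, tol=1e-4)
large_change = relative_change_converged(100.0, 110.0, tol=1e-4)
```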

Returns

self – The fit model.

Return type

GLM object

property p_values_

Return an array of p-values for the fit coefficients. These p-values test the hypothesis that the given parameter is zero.

Note: We use the asymptotic normal approximation to the p-values for all models.
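
Under the asymptotic normal approximation, a two-sided p-value for a single coefficient can be computed from its estimate and standard error. The sketch below uses only the standard library; the function name is hypothetical.

```python
import math

def wald_p_value(coef, standard_error):
    """Two-sided p-value for H0: coefficient == 0, using a normal reference."""
    z = coef / standard_error
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))

p_null = wald_p_value(0.0, 1.0)
p_borderline = wald_p_value(1.96, 1.0)
```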

predict(X, offset=None)

Return predictions from a fit model.

Predictions are computed using the inverse link function in the family used to fit the model:

preds = family.inv_link(np.dot(X, self.coef_))

Note that in the case of binary models, predict does not make class assignments; it returns a probability of class membership.

Parameters
  • X (array, shape (n_samples, n_features)) – Data set.

  • offset (array, shape (n_samples, )) – Offsets to add on the linear scale when making predictions.
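
As an illustration of what adding an offset on the linear scale means, the sketch below computes Poisson-style predictions with a log link and a log-exposure offset. The coefficient values are made up and no adobo objects are involved.

```python
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0]])   # first column is the intercept
coef = np.array([-1.0, 0.5])             # pretend these were fit
exposure = np.array([1.0, 10.0])         # e.g. time at risk per row

offset = np.log(exposure)
rate = np.exp(X @ coef)                  # expected count per unit exposure
preds = np.exp(X @ coef + offset)        # expected count for each row
```

With a log link, adding log(exposure) on the linear scale multiplies the predicted rate by the exposure.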

Returns

preds – Model predictions.

Return type

array, shape (n_samples, )

score(X, y)

Return the deviance of a fit model on a given dataset.

Parameters
  • X (array, shape (n_samples, n_features)) – Data set.

  • y (array, shape (n_samples, )) – Labels for X.

Returns

deviance – Model deviance scored using supplied data and labels.

Return type

array, shape (n_samples, )

summary()

Print a summary of the model.

adobo.glm.glmnet module

class adobo.glm.glmnet.ElasticNet(lam, alpha, max_iter=200, tol=0.0010000000000000002)

Bases: object

Fit an elastic net model.

The elastic net is a regularized linear regression model incorporating both L1 and L2 penalty terms. It is fit by minimizing the penalized loss function:

sum((y - y_hat)**2)
  + lam * ((1 - alpha)/2 * sum(beta**2)
           + alpha * sum(abs(beta)))
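
Written as a function, the penalized loss is the residual sum of squares plus lam times a mix of the L2 and L1 penalties. The helper name below is hypothetical, and beta is taken to exclude the intercept, which is not penalized.

```python
import numpy as np

def elastic_net_loss(y, y_hat, beta, lam, alpha):
    """Residual sum of squares plus the elastic net penalty."""
    y, y_hat, beta = (np.asarray(a, dtype=float) for a in (y, y_hat, beta))
    rss = np.sum((y - y_hat) ** 2)
    l2_term = (1.0 - alpha) / 2.0 * np.sum(beta ** 2)
    l1_term = alpha * np.sum(np.abs(beta))
    return rss + lam * (l2_term + l1_term)

loss = elastic_net_loss([1.0, 2.0], [1.0, 1.0], [1.0, -2.0], lam=2.0, alpha=0.5)
```

Setting alpha = 1 gives a pure L1 (lasso) penalty and alpha = 0 gives a pure L2 (ridge) penalty.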

Parameters
  • lam (float) – The overall strength of the regularization.

  • alpha (float in the interval [0, 1]) – The relative strengths of the L1 and L2 regularization.

  • max_iter (positive integer) – The maximum number of coordinate descent cycles before breaking out of the descent loop.

  • tol (float) – The convergence tolerance. The actual convergence tolerance is applied to the absolute multiplicative change in the coefficient vector. When the coefficient vector changes by less than n_params * tol in one full coordinate descent cycle, the algorithm terminates.

intercept_

The fit intercept in the regression. This is stored separately, as the penalty terms are not applied to the intercept.

Type

float

n

The number of samples used to fit the model, or the sum of the sample weights.

Type

integer, positive

p

The number of fit parameters in the model.

Type

integer, positive

Additionally, the following private attributes are used by the fitting algorithm. They are stored permanently so they can be used as warm starts to other ElasticNet objects. This is used both when fitting an entire regularization path of models, and also when fitting GLMNet objects, which proceed by solving quadratic approximations to the GLMNet loss using the ElasticNet.

The arrays below (and many of the other arrays used internally during fitting) use a peculiar ordering. Instead of being arranged to match the order of the features in a training matrix X, they are arranged in the order that the predictors enter the model. This allows for efficient calculation of the update steps in ElasticNet.fit.

_active_coefs

The active set of coefficients. They are stored in the order that their associated features enter into the model, i.e. the j’th coefficient in this list is associated with the j’th feature to enter the model.

Type

array, shape (n_features,)

_active_coef_idx_list

The indices of the active coefficients into the column dimension of the training data. I.e. the j’th active coefficient is associated with the _active_coef_idx_list[j]’th column of X.

Type

array, shape (n_features,)

_j_to_active_map

Maps a column index of X to an index into _active_coefs. I.e., the position of the j’th column of X in the order in which coefficients enter into the model.

Type

dictionary, positive integer => positive integer

References

The implementation here is based on the discussion in Friedman, Hastie, and Tibshirani: Regularization Paths for Generalized Linear Models via Coordinate Descent (hereafter referenced as [FHT]).

property coef_

The coefficient estimates of a fit model.

This attribute returns the coefficient estimates in the same order as the associated columns in X.
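
The relationship between the entry-ordered private attributes and the column-ordered coefficients can be sketched with plain NumPy. The variable names mirror the attributes documented above, but the values are invented and this is not the class's actual code.

```python
import numpy as np

n_features = 5
active_coefs = np.array([0.7, -1.2])   # stored in order of entry into the model
active_coef_idx_list = [3, 0]          # column 3 entered first, then column 0

# Column index of X -> position in active_coefs.
j_to_active_map = {j: k for k, j in enumerate(active_coef_idx_list)}

# Scatter the active coefficients back into column order; inactive ones stay zero.
coef = np.zeros(n_features)
coef[active_coef_idx_list] = active_coefs
```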

fit(X, y, offset=None, sample_weights=None, warm_start=None)

Fit an elastic net with coordinate descent.
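
For orientation, a naive coordinate descent for the elastic net loss documented for this class is sketched below. It is a simplified stand-in, not the algorithm implemented here: it recomputes residuals at every step instead of using the cached dot products and active-set ordering, it ignores offsets and sample weights, it assumes no column of X is constant, and soft_threshold is a local helper rather than adobo.glm.utils.soft_threshold.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def elastic_net_cd(X, y, lam, alpha, max_iter=200, tol=1e-8):
    """Minimize sum((y - y_hat)**2) + lam * ((1 - alpha)/2 * sum(beta**2) + alpha * sum(abs(beta)))."""
    x_means, y_mean = X.mean(axis=0), y.mean()
    # Centering removes the unpenalized intercept from the problem.
    Xc, yc = X - x_means, y - y_mean
    beta = np.zeros(X.shape[1])
    xx = np.sum(Xc ** 2, axis=0)
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(X.shape[1]):
            # Residual with feature j's current contribution added back.
            r_j = yc - Xc @ beta + Xc[:, j] * beta[j]
            z = Xc[:, j] @ r_j
            new = soft_threshold(z, lam * alpha / 2.0) / (xx[j] + lam * (1.0 - alpha) / 2.0)
            max_change = max(max_change, abs(new - beta[j]))
            beta[j] = new
        if max_change < tol:
            break
    intercept = y_mean - x_means @ beta
    return intercept, beta

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 0.3 + X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
intercept_ols, beta_ols = elastic_net_cd(X, y, lam=0.0, alpha=0.5)
```

With lam = 0 the penalty vanishes and the result should agree with ordinary least squares, which gives a simple sanity check on the update rule.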

Parameters
  • X (array, shape (n_samples, n_features)) – Training data.

  • y (array, shape (n_samples, )) – Target values.

  • offset (array, shape (n_samples, )) –

    Offsets for samples. If provided, the model fit is

    E[Y|X] = np.dot(X, coef) + offset

  • sample_weights (array, shape (n_samples, )) – Sample weights used in the deviance minimized by the model. If provided, each term in the deviance being minimized is multiplied by its corresponding weight.

Returns

self – The fit model.

Return type

ElasticNet object

Notes

The following data structures are used internally by the fitting algorithm.

x_means: array, shape (n_features,)

The weighted column means of the training data X.

xy_dots: array, shape (n_features,)

The dot products of the columns in the training matrix X with the target y. Arranged in the same order as the columns of X.

offset_dots: array, shape (n_features,)

The weighted dot products of the columns of X with the offset vector.

xx_dots: array, shape (n_features,)

The weighted dot products of the columns of the training matrix X with themselves. I.e., each individual column with itself.

xtx_dots: array, shape (n_features, n_features)

The matrix of weighted dot products of the columns of the training data X with themselves; this includes all such dot products, not just those between columns and themselves. The columns of this matrix are permuted so that the dot products stored in column j of xtx_dots are associated with the j’th coefficient to enter the model. The rows are in order of the columns of X. This matrix is initialized to zero, and filled in lazily as coefficients enter the model, as the cross-column dot products are only needed for the features currently active in the model.

In addition, see the Attributes notes in the documentation of this class for information on how the coefficient estimates are managed internally.

predict(X)

Return predictions given an array X.

Parameters

X (array, shape (n_samples, n_features)) – Data.

Returns

preds – Predictions.

Return type

array, shape (n_samples, )

class adobo.glm.glmnet.GLMNet(family, lambdas, alpha, max_iter=10, tol=0.00010000000000000002)

Bases: object

fit(X, y)
make_working_response_and_weights(X, y, enet)

adobo.glm.simulation module

class adobo.glm.simulation.Simulation(glm)

Bases: object

Run resampling simulations of a glm object.

This object implements resampling strategies, for example the parametric and non-parametric bootstrap.

Parameters

glm – A GLM type object.

non_parametric_bootstrap(X, y, n_sim=100, offset=None)

Fit models to non-parametric bootstrap samples.

The non-parametric bootstrap operates by sampling data with replacement from the pair (X, y), and then fitting glm models to the resampled pairs.
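
The resampling step can be sketched independently of any GLM class. In the example below, rows of (X, y) are drawn with replacement and an ordinary least squares fit stands in for the glm fit; the generator name and the seed argument are hypothetical.

```python
import numpy as np

def resample_pairs(X, y, n_sim=100, seed=0):
    """Yield n_sim bootstrap copies of (X, y), rows drawn with replacement."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    for _ in range(n_sim):
        idx = rng.integers(0, n, size=n)
        yield X[idx], y[idx]

X = np.column_stack([np.ones(30), np.linspace(0.0, 1.0, 30)])
y = 1.0 + 2.0 * X[:, 1]   # noiseless, so every refit recovers (1, 2)

# Least squares is used here only as a stand-in for fitting a GLM.
coefs = np.array([
    np.linalg.lstsq(Xb, yb, rcond=None)[0]
    for Xb, yb in resample_pairs(X, y, n_sim=20)
])
```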

Parameters
  • X (array, shape (n_samples, n_features)) – Data.

  • y (array, shape (n_samples, )) – Targets.

  • n_sim (positive integer) – The number of times to resample the data.

  • offset (array, shape (n_samples, )) – Offsets to use in predictions feeding into the conditional distributions.

Returns

models – The list of fit models.

Return type

list of GLM objects

parametric_bootstrap(X, n_sim=100, offset=None)

Fit models to parametric bootstrap samples.

The parametric bootstrap operates by sampling data from the conditional distributions y | X for a fixed matrix of predictors X, and then fitting glm models to each pair (X, y).
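
A sketch of the same idea for a Gaussian model follows. The coefficients and sigma are treated as if they came from a fitted model, new targets are drawn from y | X with X held fixed, and least squares stands in for the glm refit; none of the names are adobo APIs.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), np.linspace(-1.0, 1.0, 200)])
coef = np.array([3.0, 1.5])   # pretend these came from a fitted model
sigma = 0.5                   # pretend dispersion estimate
n_sim = 50

mu = X @ coef
# Each row is one simulated target vector: shape (n_sim, n_samples).
y_sims = rng.normal(mu, sigma, size=(n_sim, mu.shape[0]))

# Refit on each simulated target; X is the same for every refit.
boot_coefs = np.array([np.linalg.lstsq(X, y_b, rcond=None)[0] for y_b in y_sims])
```

The spread of boot_coefs across rows approximates the sampling variability of the coefficient estimates under the assumed model.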

Parameters
  • X (array, shape (n_samples, n_features)) – Data.

  • n_sim (positive integer) – The number of times to sample from the conditional distributions y | X.

  • offset (array, shape (n_samples, )) – Offsets to use in predictions feeding into the conditional distributions.

Returns

models – The list of fit models.

Return type

list of GLM objects

sample(X, n_sim=100, offset=None)

Sample points from the conditional distribution y | X.

A fitted GLM model determines conditional distributions y | X. Given a matrix of predictors, this method samples y from each conditional distribution.

Parameters
  • X (array, shape (n_samples, n_features)) – Data.

  • n_sim (positive integer) – The number of times to sample from the conditional distributions y | X.

  • offset (array, shape (n_samples, )) – Offsets to use in the predictions feeding into the conditional distributions.

Returns

simulations – The sampled y values.

Return type

array, shape (n_sim, n_samples)

adobo.glm.utils module

adobo.glm.utils.check_commensurate(X, y)
adobo.glm.utils.check_intercept(X)
adobo.glm.utils.check_offset(y, offset)
adobo.glm.utils.check_sample_weights(y, sample_weights)
adobo.glm.utils.check_types(X, y, formula)
adobo.glm.utils.default_X_names(X)
adobo.glm.utils.default_y_name()
adobo.glm.utils.has_converged(loss, loss_prev, tol)
adobo.glm.utils.has_intercept_column(X)
adobo.glm.utils.has_same_length(v, w)
adobo.glm.utils.is_commensurate(X, y)
adobo.glm.utils.soft_threshold(z, gamma)
adobo.glm.utils.weighted_column_dots(X, weights)
adobo.glm.utils.weighted_dot(x0, x1, weights)
adobo.glm.utils.weighted_means(X, weights)

Module contents