adobo.glm package¶

Submodules¶

adobo.glm.families module¶

Exponential families for GLMs

class adobo.glm.families.Bernoulli¶

Bases: adobo.glm.families.ExponentialFamily, adobo.glm.families.ExponentialFamilyMixin

A Bernoulli exponential family, used to fit a classical logistic model.

The GLM fit with this family has the following structure equation:

y | X ~ Bernoulli(p = 1 / (1 + exp(-X beta)))
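As an illustration of the pieces a logistic family needs, the following is a minimal sketch of a logistic inverse link, its derivative, and the Bernoulli variance function. The function names are hypothetical stand-ins, not the package's implementation, and the sketch assumes the standard logit link of a classical logistic model.

```python
import math

def inv_logit(nu):
    """Map a linear predictor nu to a probability."""
    # Branch on the sign of nu so math.exp is never called on a large
    # positive argument (which would overflow).
    if nu >= 0:
        return 1.0 / (1.0 + math.exp(-nu))
    e = math.exp(nu)
    return e / (1.0 + e)

def d_inv_logit(nu):
    """Derivative of the logistic function with respect to nu."""
    p = inv_logit(nu)
    return p * (1.0 - p)

def bernoulli_variance(mu):
    """Variance of a Bernoulli response with mean mu."""
    return mu * (1.0 - mu)
```

The derivative and the variance function coincide when evaluated at matching points, which is what makes the logit link canonical for this family.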

d_inv_link(nu, mu)¶

deviance(y, mu)¶

has_dispersion = False¶

initial_working_response(y)¶

initial_working_weights(y)¶

inv_link(nu)¶

sample(mus, dispersion)¶

variance(mu)¶

class adobo.glm.families.Exponential¶

Bases: adobo.glm.families.Gamma

An Exponential distribution exponential family.

The GLM fit with this family has the following structure equation:

y | X ~ Exponential(scale = exp(X beta))

The only difference between this family and the Gamma family is that the dispersion is fixed at 1.

has_dispersion = False¶

class adobo.glm.families.ExponentialFamily¶

Bases: object

An ExponentialFamily must implement the five methods below and define one attribute, has_dispersion.

inv_link:: The inverse link function.

d_inv_link:: The derivative of the inverse link function.

variance:: The variance function linking the mean to the variance of the distribution.

deviance:: The deviance of the family. Used as a measure of model fit.

sample:: A sampler from the conditional distribution of the response.

abstract d_inv_link(nu, mu)¶

abstract deviance(y, mu)¶

abstract inv_link(nu)¶

abstract sample(mus, dispersion)¶

abstract variance(mu)¶

class adobo.glm.families.ExponentialFamilyMixin¶

Bases: object

Implementations of methods common to all ExponentialFamilies.

penalized_deviance(y, mu, alpha, coef)¶

class adobo.glm.families.Gamma¶

Bases: adobo.glm.families.ExponentialFamily, adobo.glm.families.ExponentialFamilyMixin

A Gamma exponential family.

The GLM fit with this family has the following structure equation:

y | X ~ Gamma(shape = dispersion, scale = exp(X beta) / dispersion)

Here, dispersion is a nuisance parameter.

Note: In this family we use the logarithmic link function instead of the reciprocal link function. Although the reciprocal is the canonical link, the logarithmic link is more commonly used in Gamma regression.
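To see how the stated parameterization ties the mean to the linear predictor, here is a small sketch using only the standard library. The helper `gamma_parameters` is hypothetical and is not part of the package; it only restates the structure equation above.

```python
import math
import random

def gamma_parameters(nu, dispersion):
    """Return (shape, scale) for a linear predictor nu under a log link."""
    mu = math.exp(nu)          # log link: the mean is exp(X beta)
    shape = dispersion
    scale = mu / dispersion
    return shape, scale

rng = random.Random(0)
shape, scale = gamma_parameters(nu=0.5, dispersion=4.0)

# random.gammavariate(alpha, beta) takes alpha as the shape and beta as
# the scale, so the distribution mean is alpha * beta.
draws = [rng.gammavariate(shape, scale) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
```

Because shape * scale equals exp(nu), the sample mean should land close to exp(0.5) regardless of the dispersion chosen.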

d_inv_link(nu, mu)¶

deviance(y, mu)¶

has_dispersion = True¶

identity_link(nu)¶

inv_link(nu)¶

sample(mu, dispersion)¶

variance(mu)¶

class adobo.glm.families.Gaussian¶

Bases: adobo.glm.families.ExponentialFamily, adobo.glm.families.ExponentialFamilyMixin

A Gaussian exponential family, used to fit a classical linear model.

The GLM fit with this family has the following structure equation:

y | X ~ Gaussian(mu = X beta, sigma = dispersion)

Here, sigma is a nuisance parameter.

d_inv_link(nu, mu)¶

deviance(y, mu)¶

has_dispersion = True¶

initial_working_response(y)¶

initial_working_weights(y)¶

inv_link(nu)¶

sample(mus, dispersion)¶

variance(mu)¶

class adobo.glm.families.Poisson¶

Bases: adobo.glm.families.QuasiPoisson

A Poisson exponential family, used to fit a classical Poisson regression.

The GLM fit with this family has the following structure equation:

y | X ~ Poisson(mu = exp(X beta))

The Poisson model does not estimate a dispersion parameter; if overdispersion is present, the standard errors estimated in this model may be too small. If this is the case, consider fitting a QuasiPoisson model.

has_dispersion = False¶

class adobo.glm.families.QuasiPoisson¶

Bases: adobo.glm.families.ExponentialFamily, adobo.glm.families.ExponentialFamilyMixin

A QuasiPoisson exponential family, used to fit a possibly overdispersed Poisson regression.

The GLM fit with this family has the following structure equation:

y | X ~ Poisson(mu = exp(X beta))

The parameter estimates of this model are the same as those of a Poisson model, but a dispersion parameter is also estimated, allowing for larger standard errors when overdispersion is present.
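This page does not say how the dispersion is estimated. One common choice, assumed here purely for illustration, is the Pearson chi-squared statistic divided by the residual degrees of freedom n - p. The function name is hypothetical.

```python
def pearson_dispersion(y, mu, n_params):
    """Pearson estimate of dispersion for a Poisson-type variance function."""
    # For the Poisson family the variance function is V(mu) = mu, so each
    # squared residual is scaled by its fitted mean.
    chi2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return chi2 / (len(y) - n_params)

y = [0, 2, 3, 7]
mu = [1.0, 2.0, 3.0, 4.0]
phi = pearson_dispersion(y, mu, n_params=1)
```

A value of phi well above 1 would suggest overdispersion relative to a plain Poisson model.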

d_inv_link(nu, mu)¶

deviance(y, mu)¶

has_dispersion = True¶

inv_link(nu)¶

sample(mus, dispersion)¶

variance(mu)¶

adobo.glm.glm module¶

class adobo.glm.glm.GLM(family, alpha=0.0)¶

Bases: object

A generalized linear model.

GLMs generalize the classical linear and logistic models to other conditional distributions of the response y. A GLM is specified by an inverse link function G and a family of conditional distributions dist, with the model specification given by

y | X ~ dist(theta = G(X * beta))

Here beta are the parameters fit in the model, with X * beta a matrix multiplication just like in linear regression. Above, theta is a parameter of the one parameter family of distributions dist.

In this implementation, a specific GLM is specified with a family object of ExponentialFamily type, which contains the information about the conditional distribution of y, and its connection to X, needed to construct the model. See the documentation for ExponentialFamily for details.

The model is fit to data using the well-known Fisher scoring algorithm, a version of Newton’s method in which the Hessian is replaced with its expectation with respect to the assumed distribution of Y.
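The idea behind Fisher scoring can be sketched for a two-parameter logistic model (intercept plus one slope) in plain Python. This is an illustrative toy, not the package's fitting code: the function names are hypothetical, the 2x2 system is solved by hand, and convergence is checked on the step size rather than on the relative change in deviance.

```python
import math

def inv_logit(nu):
    """Numerically stable logistic function."""
    if nu >= 0:
        return 1.0 / (1.0 + math.exp(-nu))
    e = math.exp(nu)
    return e / (1.0 + e)

def fisher_scoring_logistic(xs, ys, max_iter=25, tol=1e-10):
    """Fit y ~ Bernoulli(inv_logit(b0 + b1 * x)) by Fisher scoring."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        # Expected information matrix X^T W X (symmetric 2x2) and the
        # score vector X^T (y - p), accumulated one observation at a time.
        i00 = i01 = i11 = 0.0
        s0 = s1 = 0.0
        for x, y in zip(xs, ys):
            p = inv_logit(b0 + b1 * x)
            w = p * (1.0 - p)
            i00 += w
            i01 += w * x
            i11 += w * x * x
            s0 += y - p
            s1 += (y - p) * x
        # Solve information * delta = score with the 2x2 inverse formula.
        det = i00 * i11 - i01 * i01
        d0 = (i11 * s0 - i01 * s1) / det
        d1 = (i00 * s1 - i01 * s0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 1, 0, 1, 1]
b0, b1 = fisher_scoring_logistic(xs, ys)
```

At convergence the score vector is numerically zero, which is the defining property of the maximum likelihood estimate.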

Parameters

family (ExponentialFamily object) – The exponential family used in the model.
alpha (float, non-negative) – The ridge regularization strength. If non-zero, the loss function minimized is a penalized deviance, where the penalty is alpha * np.sum(model.coef_**2).
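The penalty described for alpha can be written out directly. The sketch below is a hypothetical helper that takes an already computed deviance value. It follows the stated formula literally, so every coefficient is included in the penalty, and it does not reproduce the signature of the package's own penalized_deviance method.

```python
def ridge_penalized(deviance, alpha, coef):
    """Add the ridge penalty alpha * sum(coef**2) to a deviance value."""
    return deviance + alpha * sum(b ** 2 for b in coef)

value = ridge_penalized(10.0, alpha=0.5, coef=[1.0, 2.0])
```

With alpha set to zero the penalized deviance reduces to the ordinary deviance.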

family¶

The exponential family used in the model.

Type: ExponentialFamily object

alpha¶

The regularization strength.

Type: float, non-negative

formula¶

An (optional) formula specifying the model. Used in the case that X is passed as a DataFrame. For documentation on model formulas, please see the patsy library documentation.

Type: str

X_info¶

Contains information about the model formula used to process the training data frame into a design matrix.

Type: patsy.design_info.DesignInfo object.

X_names¶

Names for the predictors.

Type: List[str]

y_names¶

Name for the target variable.

Type: str

coef_¶

The fit parameter estimates. None if the model has not yet been fit.

Type: array, shape (n_features, )

deviance_¶

The final deviance of the fit model on the training data.

Type: float

information_matrix_¶

The estimated information matrix. This information matrix is evaluated at the fit parameters.

Type: array, shape (n_features, n_features)

n¶

The number of samples used to fit the model, or the sum of the sample weights.

Type: integer, positive

p¶

The number of fit parameters in the model.

Type: integer, positive

Notes

Instead of supplying a fit_intercept argument, the model assumes the programmer has included a column of ones as the first column of X. The fit method will raise an exception if this is not the case.
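A design matrix with the required leading column of ones could be prepared as in the sketch below. Both helpers are hypothetical and operate on plain lists of rows; they are not the checks the package itself performs.

```python
def add_intercept_column(rows):
    """Return a copy of rows with a leading 1.0 in every row."""
    return [[1.0] + list(row) for row in rows]

def first_column_is_ones(rows):
    """Check that every row is non-empty and starts with 1."""
    return all(len(row) > 0 and row[0] == 1 for row in rows)

features = [[0.5, 2.0], [1.5, -1.0], [3.0, 0.0]]
X = add_intercept_column(features)
```

Here `first_column_is_ones(X)` is true, while the same check on `features` fails because its first column holds real predictor values.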

clone()¶

property coef_covariance_matrix_¶

property coef_standard_error_¶

property dispersion_¶: Return an estimate of the dispersion parameter phi.

fit(X, y=None, formula=None, *, X_names=None, y_name=None, **kwargs)¶

Fit the GLM model to training data.

Fitting the model uses the well known Fisher scoring algorithm.

Parameters

X (array, shape (n_samples, n_features) or pd.DataFrame) – Training data.
y (array, shape (n_samples, )) – Target values.
formula (str) – A formula specifying the model. Used in the case that X is passed as a DataFrame. For documentation on model formulas, please see the patsy library documentation.
warm_start (array, shape (n_features, )) – Initial values to use for the parameter estimates in the optimization, useful when fitting an entire regularization path of models. If not supplied, the initial intercept estimate will be the mean of the target array, and all other parameter estimates will be initialized to zero.
offset (array, shape (n_samples, )) –

Offsets for samples. If provided, the model fit is
E[Y|X] = family.inv_link(np.dot(X, coef) + offset)

This is especially useful in models with exposures, as in Poisson regression.
sample_weights (array, shape (n_samples, )) – Sample weights used in the deviance minimized by the model. If provided, each term in the deviance being minimized is multiplied by its corresponding weight.
max_iter (positive integer) – The maximum number of iterations for the fitting algorithm.
tol (float, non-negative and less than one) – The convergence tolerance for the fitting algorithm. The relative change in deviance is compared to this tolerance to check for convergence.
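To make the role of the offset concrete, the sketch below shows the usual exposure construction for a Poisson model with a log link: the offset is the logarithm of the exposure, so the expected count scales linearly with it. The function and its arguments are hypothetical and only illustrate the arithmetic.

```python
import math

def poisson_mean(x_row, coef, exposure=1.0):
    """Expected count for one row under a log link with a log-exposure offset."""
    offset = math.log(exposure)
    nu = sum(x * b for x, b in zip(x_row, coef)) + offset
    return math.exp(nu)

rate = poisson_mean([1.0, 0.5], [0.1, 0.2])
doubled = poisson_mean([1.0, 0.5], [0.1, 0.2], exposure=2.0)
```

Doubling the exposure doubles the expected count while leaving the coefficients untouched.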

Returns

self – The fit model.

Return type

GLM object

property p_values_¶

Return an array of p-values for the fit coefficients. These p-values test the hypothesis that the given parameter is zero.

Note: We use the asymptotic normal approximation to the p-values for all models.
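Under the asymptotic normal approximation, a two-sided p-value can be computed from a coefficient and its standard error as sketched below. The function name is hypothetical; the package's own computation may differ in detail.

```python
import math

def wald_p_value(coef, standard_error):
    """Two-sided p-value for the hypothesis that the coefficient is zero."""
    z = coef / standard_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))

p_null = wald_p_value(0.0, 1.0)
p_edge = wald_p_value(1.96, 1.0)
```

A z statistic of 1.96 gives a p-value of roughly 0.05, matching the familiar normal critical value.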

predict(X, offset=None)¶

Return predictions from a fit model.

Predictions are computed using the inverse link function in the family used to fit the model:

preds = family.inv_link(np.dot(X, self.coef_))

Note that in the case of binary models, predict does not make class assignments; it returns a probability of class membership.

Parameters

X (array, shape (n_samples, n_features)) – Data set.
offset (array, shape (n_samples, )) – Offsets to add on the linear scale when making predictions.

Returns

preds – Model predictions.

Return type

array, shape (n_samples, )

score(X, y)¶

Return the deviance of a fit model on a given dataset.

Parameters

X (array, shape (n_samples, n_features)) – Data set.
y (array, shape (n_samples, )) – Labels for X.

Returns

deviance – Model deviance scored using supplied data and labels.

Return type

array, shape (n_samples, )

summary()¶: Print a summary of the model.

adobo.glm.glmnet module¶

class adobo.glm.glmnet.ElasticNet(lam, alpha, max_iter=200, tol=0.0010000000000000002)¶

Bases: object

Fit an elastic net model.

The elastic net is a regularized linear regression model incorporating both L1 and L2 penalty terms. It is fit by minimizing the penalized loss function:

sum((y - y_hat)**2) + lam * ((1 - alpha)/2 * sum(beta**2) + alpha * sum(abs(beta)))
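The loss can be evaluated directly for a given coefficient vector. The sketch below assumes the penalty terms are added to the residual sum of squares and that the intercept is excluded from beta; the function name is hypothetical.

```python
def elastic_net_loss(y, y_hat, beta, lam, alpha):
    """Residual sum of squares plus the combined L2 and L1 penalty."""
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    l2 = sum(b ** 2 for b in beta)
    l1 = sum(abs(b) for b in beta)
    return rss + lam * ((1.0 - alpha) / 2.0 * l2 + alpha * l1)

loss = elastic_net_loss([1.0, 2.0], [1.0, 1.0], [1.0, -2.0], lam=2.0, alpha=0.5)
```

Setting alpha to 1 leaves only the L1 (lasso) penalty, and setting it to 0 leaves only the L2 (ridge) penalty.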

Parameters

lam (float) – The overall strength of the regularization.
alpha (float in the interval [0, 1]) – The relative strengths of the L1 and L2 regularization.
max_iter (positive integer) – The maximum number of coordinate descent cycles before breaking out of the descent loop.
tol (float) – The convergence tolerance. The actual convergence tolerance is applied to the absolute multiplicative change in the coefficient vector. When the coefficient vector changes by less than n_params * tol in one full coordinate descent cycle, the algorithm terminates.
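Coordinate descent for this kind of loss rests on the soft-thresholding operator. The sketch below shows the operator and the closed-form minimizer for a single predictor with no intercept, derived from the loss exactly as stated on this page. The scaling used inside the package may differ, and both functions are illustrative stand-ins.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma, returning 0.0 inside [-gamma, gamma]."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def univariate_elastic_net(x, y, lam, alpha):
    """Minimize sum((y - x*b)**2) + lam*((1-alpha)/2*b**2 + alpha*abs(b))."""
    xy = sum(xi * yi for xi, yi in zip(x, y))
    xx = sum(xi * xi for xi in x)
    # Setting the subgradient to zero gives
    #   b * (2*xx + lam*(1 - alpha)) = 2*xy - lam*alpha*sign(b).
    return soft_threshold(2.0 * xy, lam * alpha) / (2.0 * xx + lam * (1.0 - alpha))

b = univariate_elastic_net([1.0, 1.0], [2.0, 2.0], lam=2.0, alpha=1.0)
```

In a full coordinate descent cycle the same update would be applied to each coefficient in turn, with y replaced by the partial residual that excludes that coefficient's contribution.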

intercept_¶

The fit intercept in the regression. This is stored separately, as the penalty terms are not applied to the intercept.

Type: float

n¶

The number of samples used to fit the model, or the sum of the sample weights.

Type: integer, positive

p¶

The number of fit parameters in the model.

Type: integer, positive

Additionally, the following private attributes are used by the fitting algorithm. They are stored permanently so they can be used as warm starts to other ElasticNet objects. This is used both when fitting an entire regularization path of models, and also when fitting Glmnet objects, which proceed by solving quadratic approximations to the Glmnet loss using the ElasticNet.

The arrays below (and many of the other arrays used internally during fitting) use a peculiar ordering. Instead of being arranged to match the order of the features in a training matrix X, they are arranged in the order that the predictors enter the model. This allows for efficient calculation of the update steps in ElasticNet.fit.

_active_coefs¶

The active set of coefficients. They are stored in the order that their associated features enter into the model, i.e. the j’th coefficient in this list is associated with the j’th feature to enter the model.

Type: array, shape (n_features,)

_active_coef_idx_list¶

The indices of the active coefficients into the column dimension of the training data. I.e. the j’th active coefficient is associated with the _active_coef_idx_list[j]’th column of X.

Type: array, shape (n_features,)

_j_to_active_map¶

Maps a column index of X to an index into _active_coefs. I.e., the position of the j’th column of X in the order in which coefficients enter into the model.

Type: dictionary, positive integer => positive integer

References

The implementation here is based on the discussion in Friedman, Hastie, and Tibshirani: Regularization Paths for Generalized Linear Models via Coordinate Descent (hereafter referenced as [FHT]).

property coef_¶

The coefficient estimates of a fit model.

This attribute returns the coefficient estimates in the same order as the associated columns in X.
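The relationship between the entry-ordered private arrays and the column-ordered coefficients can be sketched as follows. The values are made up and the helper is hypothetical; it only illustrates the bookkeeping described for _active_coefs, _active_coef_idx_list and _j_to_active_map.

```python
# Feature 3 entered the model first, then feature 0 (illustrative values).
active_coef_idx_list = [3, 0]
active_coefs = [0.7, -1.2]

# Inverse lookup: column index of X -> position in active_coefs.
j_to_active_map = {j: pos for pos, j in enumerate(active_coef_idx_list)}

def coef_in_column_order(active_coefs, active_coef_idx_list, n_features):
    """Scatter entry-ordered coefficients back into column order."""
    coef = [0.0] * n_features          # inactive features stay at zero
    for position, j in enumerate(active_coef_idx_list):
        coef[j] = active_coefs[position]
    return coef

coef = coef_in_column_order(active_coefs, active_coef_idx_list, n_features=4)
```

Features that never enter the model keep a coefficient of zero in the column-ordered result.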

fit(X, y, offset=None, sample_weights=None, warm_start=None)¶

Fit an elastic net with coordinate descent.

Parameters

X (array, shape (n_samples, n_features)) – Training data.
y (array, shape (n_samples, )) – Target values.
offset (array, shape (n_samples, )) –

Offsets for samples. If provided, the model fit is
E[Y|X] = np.dot(X, coef) + offset
sample_weights (array, shape (n_samples, )) – Sample weights used in the deviance minimized by the model. If provided, each term in the deviance being minimized is multiplied by its corresponding weight.

Returns

self – The fit model.

Return type

ElasticNet object

Notes

The following data structures are used internally by the fitting algorithm.

x_means: array, shape (n_features,): The weighted column means of the training data X.
xy_dots: array, shape (n_features,): The dot products of the columns in the training matrix X with the target y. Arranged in the same order as the columns of X.
offset_dots: array, shape (n_features,): The weighted dot products of the columns of X with the offset vector.
xx_dots: array, shape (n_features,): The weighted dot products of the columns of the training matrix X with themselves. I.e., each individual column with itself.
xtx_dots: array, shape (n_features, n_features): The matrix of weighted dot products of the columns of the training data X with themselves; this includes all such dot products, not just those between columns and themselves. The columns of this matrix are permuted so that the dot products stored in column j of xtx_dots are associated with the j’th coefficient to enter the model. The rows are in order of the columns of X. This matrix is initialized to zero, and filled in lazily as coefficients enter the model, as the cross-column dot products are only needed for the features currently active in the model.

In addition, see the Attributes notes in the documentation of this class for information on how the coefficient estimates are managed internally.

predict(X)¶

Return predictions given an array X.

Parameters: X (array, shape (n_samples, n_features)) – Data.
Returns: preds – Predictions.
Return type: array, shape (n_samples, )

class adobo.glm.glmnet.GLMNet(family, lambdas, alpha, max_iter=10, tol=0.00010000000000000002)¶

Bases: object

fit(X, y)¶

make_working_response_and_weights(X, y, enet)¶

adobo.glm.simulation module¶

class adobo.glm.simulation.Simulation(glm)¶

Bases: object

Run resampling simulations of a glm object.

This object implements resampling strategies, for example the parametric and non-parametric bootstrap.

Parameters: glm – A GLM type object.

non_parametric_bootstrap(X, y, n_sim=100, offset=None)¶

Fit models to non-parametric bootstrap samples.

The non-parametric bootstrap operates by sampling data with replacement from the pair (X, y), and then fitting glm models to the resampled pairs.

Parameters

X (array, shape (n_samples, n_features)) – Data.
y (array, shape (n_samples, )) – Targets.
n_sim (positive integer) – The number of times to resample the data.
offset (array, shape (n_samples, )) – Offsets to use in predictions feeding into the conditional distributions.

Returns

models – The list of fit models.

Return type

list
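The resampling step of the non-parametric bootstrap can be sketched with the standard library alone. The helper below is hypothetical and stops short of fitting: in practice a fresh model would be fit to each resampled pair.

```python
import random

def bootstrap_resample(X, y, rng):
    """Draw len(y) rows with replacement, keeping each row paired with its target."""
    n = len(y)
    idx = [rng.randrange(n) for _ in range(n)]
    return [X[i] for i in idx], [y[i] for i in idx]

rng = random.Random(42)
X = [[1.0, 0.1], [1.0, 0.2], [1.0, 0.3], [1.0, 0.4]]
y = [10, 20, 30, 40]

# Three resampled (X, y) pairs, each the same size as the original data.
resamples = [bootstrap_resample(X, y, rng) for _ in range(3)]
```

Sampling row indices, rather than sampling X and y separately, is what keeps predictors and targets aligned.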

parametric_bootstrap(X, n_sim=100, offset=None)¶

Fit models to parametric bootstrap samples.

The parametric bootstrap operates by sampling data from the conditional distributions y | X for a fixed matrix of predictors X, and then fitting glm models to each pair (X, y).

Parameters

X (array, shape (n_samples, n_features)) – Data.
n_sim (positive integer) – The number of times to sample from the conditional distributions y | X.
offset (array, shape (n_samples, )) – Offsets to use in predictions feeding into the conditional distributions.

Returns

models – The list of fit models.

Return type

list

sample(X, n_sim=100, offset=None)¶

Sample points from the conditional distribution y | X.

A fitted GLM model determines conditional distributions y | X. Given a matrix of predictors, this method samples y from each conditional distribution.

Parameters

X (array, shape (n_samples, n_features)) – Data.
n_sim (positive integer) – The number of times to sample from the conditional distributions y | X.
offset (array, shape (n_samples, )) – Offsets to use in the predictions feeding into the conditional distributions.

Returns

simulations – The sampled y values.

Return type

array, shape (n_sim, n_samples)
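The shape of the returned simulations can be illustrated with a toy Gaussian sampler. The fitted means and sigma below are made-up values standing in for a fitted model's predictions and dispersion, and the function is hypothetical.

```python
import random

def sample_gaussian(mus, sigma, n_sim, rng):
    """Draw n_sim rows, each holding one draw per conditional mean."""
    return [[rng.gauss(mu, sigma) for mu in mus] for _ in range(n_sim)]

rng = random.Random(0)
mus = [0.0, 1.0, 2.0]      # stand-in for predictions on three rows of X
simulations = sample_gaussian(mus, sigma=0.5, n_sim=5, rng=rng)
```

Each of the five rows is one simulated response vector, so the result has one row per simulation and one column per row of X.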

adobo.glm.utils module¶

adobo.glm.utils.check_commensurate(X, y)¶

adobo.glm.utils.check_intercept(X)¶

adobo.glm.utils.check_offset(y, offset)¶

adobo.glm.utils.check_sample_weights(y, sample_weights)¶

adobo.glm.utils.check_types(X, y, formula)¶

adobo.glm.utils.default_X_names(X)¶

adobo.glm.utils.default_y_name()¶

adobo.glm.utils.has_converged(loss, loss_prev, tol)¶

adobo.glm.utils.has_intercept_column(X)¶

adobo.glm.utils.has_same_length(v, w)¶

adobo.glm.utils.is_commensurate(X, y)¶

adobo.glm.utils.soft_threshold(z, gamma)¶

adobo.glm.utils.weighted_column_dots(X, weights)¶

adobo.glm.utils.weighted_dot(x0, x1, weights)¶

adobo.glm.utils.weighted_means(X, weights)¶
