adobo.glm package¶
Submodules¶
adobo.glm.families module¶
Exponential families for GLMs
-
class
adobo.glm.families.
Bernoulli
¶ Bases:
adobo.glm.families.ExponentialFamily
,adobo.glm.families.ExponentialFamilyMixin
A Bernoulli exponential family, used to fit a classical logistic model.
The GLM fit with this family has the following structure equation:
y | X ~ Bernoulli(p = logistic(X beta))
-
d_inv_link
(nu, mu)¶
-
deviance
(y, mu)¶
-
has_dispersion
= False¶
-
initial_working_response
(y)¶
-
initial_working_weights
(y)¶
-
inv_link
(nu)¶
-
sample
(mus, dispersion)¶
-
variance
(mu)¶
-
-
class
adobo.glm.families.
Exponential
¶ Bases:
adobo.glm.families.Gamma
An Exponential distribution exponential family.
The GLM fit with this family has the following structure equation:
y | X ~ Exponential(scale = exp(X beta))
The only difference between this family and the Gamma family is that the dispersion is fixed at 1.
-
has_dispersion
= False¶
-
-
class
adobo.glm.families.
ExponentialFamily
¶ Bases:
object
An ExponentialFamily must implement at least five methods and define one attribute.
-
inv_link:
The inverse link function.
-
d_inv_link:
The derivative of the inverse link function.
-
variance:
The variance function linking the mean to the variance of the distribution.
-
deviance:
The deviance of the family. Used as a measure of model fit.
-
sample:
A sampler from the conditional distribution of the response.
-
abstract
d_inv_link
(nu, mu)¶
-
abstract
deviance
(y, mu)¶
-
abstract
inv_link
(nu)¶
-
abstract
sample
(mus, dispersion)¶
-
abstract
variance
(mu)¶
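As a rough illustration of the interface described above, the following sketch implements the five methods and the has_dispersion attribute for a Poisson-like family with a log link. The class name SketchPoissonFamily is hypothetical and does not subclass the real adobo.glm.families.ExponentialFamily; the formulas are the textbook Poisson ones, not necessarily those used by the package.

```python
import numpy as np


class SketchPoissonFamily:
    """Hypothetical stand-in for an ExponentialFamily subclass (log link)."""

    # Assumption: a plain Poisson model has no dispersion parameter to estimate.
    has_dispersion = False

    def inv_link(self, nu):
        # Map the linear predictor nu = X @ beta to the mean mu.
        return np.exp(nu)

    def d_inv_link(self, nu, mu):
        # The derivative of exp(nu) is exp(nu), which is already available as mu.
        return mu

    def variance(self, mu):
        # For a Poisson response the variance equals the mean.
        return mu

    def deviance(self, y, mu):
        # 2 * sum(y * log(y / mu) - (y - mu)), with y * log(y / mu) taken as 0 at y == 0.
        y = np.asarray(y, dtype=float)
        ratio = np.where(y > 0, y / mu, 1.0)
        return 2.0 * np.sum(y * np.log(ratio) - (y - mu))

    def sample(self, mus, dispersion):
        # dispersion is ignored because this sketch has no dispersion parameter.
        return np.random.poisson(mus)


family = SketchPoissonFamily()
nu = np.array([0.0, 1.0])
mu = family.inv_link(nu)
```

The deviance is zero when the response equals the fitted mean, which is a quick sanity check for any family written against this interface.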
-
-
class
adobo.glm.families.
ExponentialFamilyMixin
¶ Bases:
object
Implementations of methods common to all ExponentialFamilies.
-
penalized_deviance
(y, mu, alpha, coef)¶
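A minimal sketch of what a penalized deviance could look like, assuming the ridge penalty alpha * np.sum(coef**2) given in the GLM class documentation and a Gaussian deviance (sum of squared errors). The helper names below are hypothetical, and whether the real method excludes the intercept from the penalty is not stated in this documentation.

```python
import numpy as np


def gaussian_deviance(y, mu):
    # Sum of squared residuals, the deviance of a Gaussian family.
    return np.sum((y - mu) ** 2)


def penalized_deviance_sketch(y, mu, alpha, coef, deviance=gaussian_deviance):
    # Deviance plus the ridge penalty alpha * sum(coef**2).
    return deviance(y, mu) + alpha * np.sum(coef ** 2)


y = np.array([1.0, 2.0])
mu = np.array([1.0, 1.0])
coef = np.array([1.0, 2.0])
value = penalized_deviance_sketch(y, mu, alpha=0.5, coef=coef)
```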
-
-
class
adobo.glm.families.
Gamma
¶ Bases:
adobo.glm.families.ExponentialFamily
,adobo.glm.families.ExponentialFamilyMixin
A Gamma exponential family.
The GLM fit with this family has the following structure equation:
y | X ~ Gamma(shape = dispersion, scale = exp(X beta) / dispersion)
Here, dispersion is a nuisance parameter.
Note: In this family we use the logarithmic link function instead of the reciprocal link function. Although the reciprocal is the canonical link, the logarithmic link is more commonly used in Gamma regression.
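To see what the shape and scale parametrization above implies, the sketch below draws from a Gamma distribution with shape = dispersion and scale = mu / dispersion using NumPy, where mu stands in for exp(X beta) on a single row. Under this parametrization the mean is mu and the variance is mu**2 / dispersion. The numeric values are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 3.0          # plays the role of exp(X beta) for one observation
dispersion = 4.0  # nuisance parameter

# shape * scale = mu, and shape * scale**2 = mu**2 / dispersion.
draws = rng.gamma(shape=dispersion, scale=mu / dispersion, size=200_000)
sample_mean = draws.mean()
sample_var = draws.var()
```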
-
d_inv_link
(nu, mu)¶
-
deviance
(y, mu)¶
-
has_dispersion
= True¶
-
identity_link
(nu)¶
-
inv_link
(nu)¶
-
sample
(mu, dispersion)¶
-
variance
(mu)¶
-
-
class
adobo.glm.families.
Gaussian
¶ Bases:
adobo.glm.families.ExponentialFamily
,adobo.glm.families.ExponentialFamilyMixin
A Gaussian exponential family, used to fit a classical linear model.
The GLM fit with this family has the following structure equation:
y | X ~ Gaussian(mu = X beta, sigma = dispersion)
Here, sigma is a nuisance parameter.
-
d_inv_link
(nu, mu)¶
-
deviance
(y, mu)¶
-
has_dispersion
= True¶
-
initial_working_response
(y)¶
-
initial_working_weights
(y)¶
-
inv_link
(nu)¶
-
sample
(mus, dispersion)¶
-
variance
(mu)¶
-
-
class
adobo.glm.families.
Poisson
¶ Bases:
adobo.glm.families.QuasiPoisson
A Poisson exponential family, used to fit a classical Poisson regression.
The GLM fit with this family has the following structure equation:
y | X ~ Poisson(mu = exp(X beta))
The Poisson model does not estimate a dispersion parameter; if overdispersion is present, the standard errors estimated in this model may be too small. If this is the case, consider fitting a QuasiPoisson model.
-
has_dispersion
= False¶
-
-
class
adobo.glm.families.
QuasiPoisson
¶ Bases:
adobo.glm.families.ExponentialFamily
,adobo.glm.families.ExponentialFamilyMixin
A QuasiPoisson exponential family, used to fit a possibly overdispersed Poisson regression.
The GLM fit with this family has the following structure equation:
y | X ~ Poisson(mu = exp(X beta))
The parameter estimates of this model are the same as those of a Poisson model, but a dispersion parameter is estimated, allowing for possibly larger standard errors when overdispersion is present.
-
d_inv_link
(nu, mu)¶
-
deviance
(y, mu)¶
-
has_dispersion
= True¶
-
inv_link
(nu)¶
-
sample
(mus, dispersion)¶
-
variance
(mu)¶
-
adobo.glm.glm module¶
-
class
adobo.glm.glm.
GLM
(family, alpha=0.0)¶ Bases:
object
A generalized linear model.
GLMs are a generalization of the classical linear and logistic models to other conditional distributions of response y. A GLM is specified by a link function G and a family of conditional distributions dist, with the model specification given by
y | X ~ dist(theta = G(X * beta))
Here beta are the parameters fit in the model, with X * beta a matrix multiplication just like in linear regression. Above, theta is a parameter of the one parameter family of distributions dist.
In this implementation, a specific GLM is specified with a family object of ExponentialFamily type, which contains the information about the conditional distribution of y, and its connection to X, needed to construct the model. See the documentation for ExponentialFamily for details.
The model is fit to data using the well-known Fisher scoring algorithm, which is a version of Newton’s method where the Hessian is replaced with its expectation with respect to the assumed distribution of Y.
- Parameters
family (ExponentialFamily object) – The exponential family used in the model.
alpha (float, non-negative) – The ridge regularization strength. If non-zero, the loss function minimized is a penalized deviance, where the penalty is alpha * np.sum(model.coef_**2).
-
family
¶ The exponential family used in the model.
- Type
ExponentialFamily object
-
alpha
¶ The regularization strength.
- Type
float, non-negative
-
formula
¶ An (optional) formula specifying the model. Used in the case that X is passed as a DataFrame. For documentation on model formulas, please see the patsy library documentation.
- Type
str
-
X_info
¶ Contains information about the model formula used to process the training data frame into a design matrix.
- Type
patsy.design_info.DesignInfo object.
-
X_names
¶ Names for the predictors.
- Type
List[str]
-
y_names
¶ Name for the target variable.
- Type
str
-
coef_
¶ The fit parameter estimates. None if the model has not yet been fit.
- Type
array, shape (n_features, )
-
deviance_
¶ The final deviance of the fit model on the training data.
- Type
float
-
information_matrix_
¶ The estimated information matrix. This information matrix is evaluated at the fit parameters.
- Type
array, shape (n_features, n_features)
-
n
¶ The number of samples used to fit the model, or the sum of the sample weights.
- Type
integer, positive
-
p
¶ The number of fit parameters in the model.
- Type
integer, positive
Notes
Instead of supplying a fit_intercept argument, we assume the programmer has included a column of ones as the first column of X. The fit method will throw an exception if this is not the case.
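One way to satisfy the intercept requirement is to prepend a column of ones before fitting. The sketch below uses only NumPy; the has_intercept check is an illustrative assumption about what such a validation might look like, not the package's own check.

```python
import numpy as np

X_raw = np.array([[0.5, 1.2],
                  [1.5, -0.3],
                  [2.0, 0.8]])

# Prepend the intercept column so that it is the first column of X.
X = np.column_stack([np.ones(X_raw.shape[0]), X_raw])

# Illustrative check that the first column is all ones.
has_intercept = bool(np.all(X[:, 0] == 1.0))
```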
-
clone
()¶
-
property
coef_covariance_matrix_
¶
-
property
coef_standard_error_
¶
-
property
dispersion_
¶ Return an estimate of the dispersion parameter phi.
-
fit
(X, y=None, formula=None, *, X_names=None, y_name=None, **kwargs)¶ Fit the GLM model to training data.
Fitting the model uses the well known Fisher scoring algorithm.
- Parameters
X (array, shape (n_samples, n_features) or pd.DataFrame) – Training data.
y (array, shape (n_samples, )) – Target values.
formula (str) – A formula specifying the model. Used in the case that X is passed as a DataFrame. For documentation on model formulas, please see the patsy library documentation.
warm_start (array, shape (n_features, )) – Initial values to use for the parameter estimates in the optimization, useful when fitting an entire regularization path of models. If not supplied, the initial intercept estimate will be the mean of the target array, and all other parameter estimates will be initialized to zero.
offset (array, shape (n_samples, )) – Offsets for samples. If provided, the model fit is
E[Y|X] = family.inv_link(np.dot(X, coef) + offset)
This is especially useful in models with exposures, as in Poisson regression.
sample_weights (array, shape (n_samples, )) – Sample weights used in the deviance minimized by the model. If provided, each term in the deviance being minimized is multiplied by its corresponding weight.
max_iter (positive integer) – The maximum number of iterations for the fitting algorithm.
tol (float, non-negative and less than one) – The convergence tolerance for the fitting algorithm. The relative change in deviance is compared to this tolerance to check for convergence.
- Returns
self – The fit model.
- Return type
GLM object
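For intuition, here is a self-contained sketch of Fisher scoring for a Poisson model with a log link, written with NumPy only. It is not the package's implementation: the function name is hypothetical, the intercept is started at the log of the mean response, and convergence is checked on the step size rather than on the relative change in deviance described above.

```python
import numpy as np


def fisher_scoring_poisson(X, y, max_iter=100, tol=1e-8):
    """Illustrative Fisher scoring for Poisson regression with a log link."""
    coef = np.zeros(X.shape[1])
    coef[0] = np.log(np.mean(y))  # assumption: first column of X is the intercept
    for _ in range(max_iter):
        mu = np.exp(X @ coef)
        score = X.T @ (y - mu)                  # gradient of the log-likelihood
        information = X.T @ (X * mu[:, None])   # expected (Fisher) information
        step = np.linalg.solve(information, score)
        coef = coef + step
        if np.max(np.abs(step)) < tol:
            break
    return coef


rng = np.random.default_rng(0)
X = np.column_stack([np.ones(2000), rng.normal(size=2000)])
true_coef = np.array([0.5, 0.3])
y = rng.poisson(np.exp(X @ true_coef))
coef_hat = fisher_scoring_poisson(X, y)
```

For the log link the expected and observed information coincide, so in this particular case Fisher scoring is the same as Newton's method.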
-
property
p_values_
¶ Return an array of p-values for the fit coefficients. These p-values test the hypothesis that the given parameter is zero.
Note: We use the asymptotic normal approximation to the p-values for all models.
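The asymptotic normal approximation can be sketched as follows: divide each coefficient by its standard error to get a z-statistic, then take the two-sided normal tail probability. The function name and the example numbers are hypothetical, and the package may compute these values differently.

```python
import math

import numpy as np


def normal_p_values(coef, standard_errors):
    # z-statistics for the hypothesis that each coefficient is zero.
    z = np.asarray(coef, dtype=float) / np.asarray(standard_errors, dtype=float)
    # Two-sided tail probability: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return np.array([math.erfc(abs(zi) / math.sqrt(2.0)) for zi in z])


p_values = normal_p_values(coef=[0.0, 1.96], standard_errors=[1.0, 1.0])
```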
-
predict
(X, offset=None)¶ Return predictions from a fit model.
Predictions are computed using the inverse link function in the family used to fit the model:
preds = family.inv_link(np.dot(X, self.coef_))
Note that in the case of binary models, predict does not make class assignments; it returns a probability of class membership.
- Parameters
X (array, shape (n_samples, n_features)) – Data set.
offset (array, shape (n_samples, )) – Offsets to add on the linear scale when making predictions.
- Returns
preds – Model predictions.
- Return type
array, shape (n_samples, )
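As an illustration of how an offset enters a prediction, the sketch below assumes a log-link family (such as Poisson), where the inverse link is the exponential. With offset = log(exposure), the prediction scales linearly with the exposure. All values are made up for the example.

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0]])
coef = np.array([0.1, 0.7])
exposure = np.array([10.0, 2.0])

# The offset is added on the linear scale, before the inverse link is applied.
offset = np.log(exposure)
preds = np.exp(X @ coef + offset)
```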
-
score
(X, y)¶ Return the deviance of a fit model on a given dataset.
- Parameters
X (array, shape (n_samples, n_features)) – Data set.
y (array, shape (n_samples, )) – Labels for X.
- Returns
deviance – Model deviance scored using supplied data and labels.
- Return type
float
-
summary
()¶ Print a summary of the model.
adobo.glm.glmnet module¶
-
class
adobo.glm.glmnet.
ElasticNet
(lam, alpha, max_iter=200, tol=0.0010000000000000002)¶ Bases:
object
Fit an elastic net model.
The elastic net is a regularized linear regression model incorporating both L1 and L2 penalty terms. It is fit by minimizing the penalized loss function:
sum((y - y_hat)**2) + lam * ((1 - alpha)/2 * sum(beta**2) + alpha * sum(abs(beta)))
- Parameters
lam (float) – The overall strength of the regularization.
alpha (float in the interval [0, 1]) – The relative strengths of the L1 and L2 regularization.
max_iter (positive integer) – The maximum number of coordinate descent cycles before breaking out of the descent loop.
tol (float) – The convergence tolerance. The actual convergence tolerance is applied to the absolute multiplicative change in the coefficient vector. When the coefficient vector changes by less than n_params * tol in one full coordinate descent cycle, the algorithm terminates.
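The penalized loss described above can be written out directly. This is a sketch of the formula as documented, with a hypothetical function name; the intercept is passed separately and left unpenalized, and any scaling constants used internally by the package are not reproduced here.

```python
import numpy as np


def elastic_net_loss(X, y, intercept, beta, lam, alpha):
    y_hat = intercept + X @ beta
    squared_error = np.sum((y - y_hat) ** 2)
    l2_penalty = (1.0 - alpha) / 2.0 * np.sum(beta ** 2)
    l1_penalty = alpha * np.sum(np.abs(beta))
    return squared_error + lam * (l2_penalty + l1_penalty)


X = np.eye(2)
y = np.array([1.0, 2.0])
beta = np.array([1.0, -1.0])
loss = elastic_net_loss(X, y, intercept=0.0, beta=beta, lam=2.0, alpha=0.5)
```

With alpha = 1 the penalty is pure L1 (lasso), and with alpha = 0 it is pure L2 (ridge).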
-
intercept_
¶ The fit intercept in the regression. This is stored separately, as the penalty terms are not applied to the intercept.
- Type
float
-
n
¶ The number of samples used to fit the model, or the sum of the sample weights.
- Type
integer, positive
-
p
¶ The number of fit parameters in the model.
- Type
integer, positive
Additionally, the following private attributes are used by the fitting algorithm. They are stored permanently so they can be used as warm starts to other ElasticNet objects. This is used both when fitting an entire regularization path of models, and also when fitting Glmnet objects, which proceed by solving quadratic approximations to the Glmnet loss using the ElasticNet.
The arrays below (and many of the other arrays used internally during fitting) use a peculiar ordering. Instead of being arranged to match the order of the features in a training matrix X, they are arranged in the order that the predictors enter the model. This allows for efficient calculation of the update steps in ElasticNet.fit.
-
_active_coefs
¶ The active set of coefficients. They are stored in the order that their associated features enter into the model, i.e. the j’th coefficient in this list is associated with the j’th feature to enter the model.
- Type
array, shape (n_features,)
-
_active_coef_idx_list
¶ The indices of the active coefficients into the column dimension of the training data. I.e. the j’th active coefficient is associated with the _active_coef_idx_list[j]’th column of X.
- Type
array, shape (n_features,)
-
_j_to_active_map
¶ Maps a column index of X to an index into _active_coefs. I.e., the position of the j’th column of X in the order in which coefficients enter into the model.
- Type
dictionary, positive integer => positive integer
References
The implementation here is based on the discussion in Friedman, Hastie, and Tibshirani: Regularization Paths for Generalized Linear Models via Coordinate Descent (hereafter referenced as [FHT]).
-
property
coef_
¶ The coefficient estimates of a fit model.
This attribute returns the coefficient estimates in the same order as the associated columns in X.
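The relationship between the entry-ordered private arrays and the column-ordered coef_ can be illustrated with plain NumPy. The variable names mirror the private attributes but are local to this sketch, and only the active entries are stored here, which is a simplifying assumption about the internal layout.

```python
import numpy as np

n_features = 4
active_coefs = np.array([2.5, -1.0])     # coefficients in order of entry
active_coef_idx_list = np.array([3, 0])  # column 3 entered first, then column 0

# Map a column index of X to its position in the entry order.
j_to_active_map = {int(j): k for k, j in enumerate(active_coef_idx_list)}

# Scatter the active coefficients back into the column order of X.
coef = np.zeros(n_features)
coef[active_coef_idx_list] = active_coefs
```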
-
fit
(X, y, offset=None, sample_weights=None, warm_start=None)¶ Fit an elastic net with coordinate descent.
- Parameters
X (array, shape (n_samples, n_features)) – Training data.
y (array, shape (n_samples, )) – Target values.
offset (array, shape (n_samples, )) – Offsets for samples. If provided, the model fit is
E[Y|X] = np.dot(X, coef) + offset
sample_weights (array, shape (n_samples, )) – Sample weights used in the deviance minimized by the model. If provided, each term in the deviance being minimized is multiplied by its corresponding weight.
- Returns
self – The fit model.
- Return type
ElasticNet object
Notes
The following data structures are used internally by the fitting algorithm.
- x_means: array, shape (n_features,)
The weighted column means of the training data X.
- xy_dots: array, shape (n_features,)
The dot products of the columns in the training matrix X with the target y. Arranged in the same order as the columns of X.
- offset_dots: array, shape (n_features,)
The weighted dot products of the columns of X with the offset vector.
- xx_dots: array, shape (n_features,)
The weighted dot products of the columns of the training matrix X with themselves. I.e., each individual column with itself.
- xtx_dots: array, shape (n_features, n_features)
The matrix of weighted dot products of the columns of the training data X with themselves; this includes all such dot products, not just those between columns and themselves. The columns of this matrix are permuted so that the dot products stored in column j of xtx_dots are associated with the j’th coefficient to enter the model. The rows are in order of the columns of X. This matrix is initialized to zero, and filled in lazily as coefficients enter the model, as the cross-column dot products are only needed for the features currently active in the model.
In addition, see the Attributes notes in the documentation of this class for information on how the coefficient estimates are managed internally.
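A rough sketch of how quantities like x_means and xx_dots could be computed with NumPy follows. The helper names carry a _sketch suffix to make clear they are hypothetical re-implementations, not the functions in adobo.glm.utils, and the exact weighting and normalization used by the package are assumptions.

```python
import numpy as np


def weighted_means_sketch(X, weights):
    # Weighted mean of each column of X.
    return (weights @ X) / np.sum(weights)


def weighted_column_dots_sketch(X, weights):
    # Weighted dot product of each column of X with itself.
    return weights @ (X * X)


def weighted_dot_sketch(x0, x1, weights):
    # Weighted dot product of two vectors.
    return np.sum(weights * x0 * x1)


X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
weights = np.array([1.0, 3.0])

x_means = weighted_means_sketch(X, weights)
xx_dots = weighted_column_dots_sketch(X, weights)
cross_dot = weighted_dot_sketch(X[:, 0], X[:, 1], weights)
```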
-
predict
(X)¶ Return predictions given an array X.
- Parameters
X (array, shape (n_samples, n_features)) – Data.
- Returns
preds – Predictions.
- Return type
array, shape (n_samples, )
adobo.glm.simulation module¶
-
class
adobo.glm.simulation.
Simulation
(glm)¶ Bases:
object
Run resampling simulations of a glm object.
This object implements resampling strategies, for example the parametric and non-parametric bootstrap.
- Parameters
glm – A GLM type object.
-
non_parametric_bootstrap
(X, y, n_sim=100, offset=None)¶ Fit models to non-parametric bootstrap samples.
The non-parametric bootstrap operates by sampling data with replacement from the pair (X, y), and then fitting glm models to the resampled pairs.
- Parameters
X (array, shape (n_samples, n_features)) – Data.
y (array, shape (n_samples, )) – Targets.
n_sim (positive integer) – The number of times to resample the data.
offset (array, shape (n_samples, )) – Offsets to use in predictions feeding into the conditional distributions.
- Returns
The list of fit models.
- Return type
models
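The resampling step can be sketched independently of any particular model. In the example below, least squares stands in for the GLM fit, and both function names are hypothetical; the point is only to show rows of (X, y) being drawn together, with replacement.

```python
import numpy as np


def non_parametric_bootstrap_sketch(X, y, fit, n_sim=100, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    models = []
    for _ in range(n_sim):
        # Draw row indices with replacement, keeping each (x, y) pair together.
        idx = rng.integers(0, n, size=n)
        models.append(fit(X[idx], y[idx]))
    return models


def least_squares_fit(X, y):
    # Stand-in for a model fit; returns the coefficient vector.
    return np.linalg.lstsq(X, y, rcond=None)[0]


rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=200)
models = non_parametric_bootstrap_sketch(X, y, least_squares_fit, n_sim=50, rng=rng)
coefs = np.array(models)
```

The spread of coefs across the resamples gives a bootstrap estimate of the sampling variability of each coefficient.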
-
parametric_bootstrap
(X, n_sim=100, offset=None)¶ Fit models to parametric bootstrap samples.
The parametric bootstrap operates by sampling data from the conditional distributions y | X for a fixed matrix of predictors X, and then fitting glm models to each pair (X, y).
- Parameters
X (array, shape (n_samples, n_features)) – Data.
n_sim (positive integer) – The number of times to sample from the conditional distributions y | X.
offset (array, shape (n_samples, )) – Offsets to use in predictions feeding into the conditional distributions.
- Returns
The list of fit models.
- Return type
models
-
sample
(X, n_sim=100, offset=None)¶ Sample points from the conditional distribution y | X.
A fitted GLM model determines conditional distributions y | X. Given a matrix of predictors, this method samples y from each conditional distribution.
- Parameters
X (array, shape (n_samples, n_features)) – Data.
n_sim (positive integer) – The number of times to sample from the conditional distributions y | X.
offset (array, shape (n_samples, )) – Offsets to use in the predictions feeding into the conditional distributions.
- Returns
simulations – The sampled y values.
- Return type
array, shape (n_sim, n_samples)
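To make the output shape concrete, the sketch below samples from Poisson conditional distributions whose means come from a log-link linear predictor. The coefficients are made up, and NumPy's generator is used in place of the family's own sampler.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5), np.linspace(0.0, 1.0, 5)])
coef = np.array([0.2, 0.5])  # hypothetical fitted coefficients

# One conditional mean per row of X.
mus = np.exp(X @ coef)

# Each row of simulations is one simulated response vector y.
n_sim = 100
simulations = rng.poisson(mus, size=(n_sim, mus.shape[0]))
```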
adobo.glm.utils module¶
-
adobo.glm.utils.
check_commensurate
(X, y)¶
-
adobo.glm.utils.
check_intercept
(X)¶
-
adobo.glm.utils.
check_offset
(y, offset)¶
-
adobo.glm.utils.
check_sample_weights
(y, sample_weights)¶
-
adobo.glm.utils.
check_types
(X, y, formula)¶
-
adobo.glm.utils.
default_X_names
(X)¶
-
adobo.glm.utils.
default_y_name
()¶
-
adobo.glm.utils.
has_converged
(loss, loss_prev, tol)¶
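A convergence check based on the relative change in the loss, as described for the tol parameter of GLM.fit, might look like the following. This is a guess at the behavior, written under a different name; the guard against a zero previous loss is an addition of this sketch.

```python
def has_converged_sketch(loss, loss_prev, tol):
    # Relative change in the loss between two consecutive iterations.
    denominator = max(abs(loss_prev), 1e-300)  # avoid dividing by zero
    return abs(loss - loss_prev) / denominator < tol


nearly_equal = has_converged_sketch(100.0, 100.00001, tol=1e-6)
still_moving = has_converged_sketch(90.0, 100.0, tol=1e-3)
```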
-
adobo.glm.utils.
has_intercept_column
(X)¶
-
adobo.glm.utils.
has_same_length
(v, w)¶
-
adobo.glm.utils.
is_commensurate
(X, y)¶
-
adobo.glm.utils.
soft_threshold
(z, gamma)¶
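The soft-thresholding operator used in coordinate descent for L1-penalized problems is commonly defined as sign(z) * max(|z| - gamma, 0). The sketch below implements that standard definition under a hypothetical name; it is assumed, not confirmed, that the package function behaves the same way.

```python
import numpy as np


def soft_threshold_sketch(z, gamma):
    # Shrink z toward zero by gamma, and set it to zero if |z| <= gamma.
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)


shrunk = soft_threshold_sketch(np.array([3.0, -0.5, -2.0]), 1.0)
```

Values whose magnitude is below gamma are set exactly to zero, which is what lets the elastic net drop features from the model.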
-
adobo.glm.utils.
weighted_column_dots
(X, weights)¶
-
adobo.glm.utils.
weighted_dot
(x0, x1, weights)¶
-
adobo.glm.utils.
weighted_means
(X, weights)¶