API
The AbstractMethod defines the interface of all quantification and unfolding algorithms. Most importantly, this interface consists of their fit and predict methods.
- class qunfold.AbstractMethod
Abstract base class for quantification methods.
- abstract fit(X, y, n_classes=None)
Fit this quantifier to data.
- Parameters:
X – The feature matrix to which this quantifier will be fitted.
y – The labels to which this quantifier will be fitted.
n_classes (optional) – The number of expected classes. Defaults to None.
- Returns:
This fitted quantifier itself.
- abstract predict(X)
Predict the class prevalences in a data set.
- Parameters:
X – The feature matrix for which this quantifier will make a prediction.
- Returns:
A numpy array of class prevalences.
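Examples
Here, we sketch the typical fit-predict workflow, using an ACC instance (see Popular algorithms below) as an arbitrary choice of quantifier; X_trn, y_trn, and X_tst are placeholders for your own data.
>>> from sklearn.ensemble import RandomForestClassifier
>>> from qunfold import ACC
>>> acc = ACC(RandomForestClassifier(oob_score=True)) # any AbstractMethod works alike
>>> acc.fit(X_trn, y_trn) # placeholders: your labeled training data
>>> p_hat = acc.predict(X_tst) # a numpy array of class prevalences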
Instances of many Popular algorithms for quantification and unfolding are created through the corresponding constructors. However, you can also define your own quantification method as a LinearMethod that combines an arbitrary choice of Losses, Regularizers, and Data representations.
- class qunfold.LinearMethod(loss: AbstractLoss, representation: AbstractRepresentation, solver: str = 'trust-ncg', solver_options: Dict[str, Any] = <factory>, seed: int | None = None)
A generic quantification / unfolding method that predicts class prevalences by solving a system of linear equations.
This class represents any method that consists of a loss function, a data representation, and a regularization term. In this implementation, any regularized loss is minimized through unconstrained second-order minimization. Valid probability estimates are ensured through a soft-max trick by Bunse (2022).
- Parameters:
loss – An instance from qunfold.methods.linear.losses.
representation – An instance from qunfold.methods.linear.representations.
solver (optional) – The method argument in scipy.optimize.minimize. Defaults to “trust-ncg”.
solver_options (optional) – The options argument in scipy.optimize.minimize. Defaults to {“gtol”: 1e-8, “maxiter”: 1000}.
seed (optional) – A random number generator seed from which a numpy RandomState is created. Defaults to None.
Examples
Here, we create the ordinal variant of ACC (Bunse et al., 2023). This variant combines the original data representation of ACC with the original loss of ACC, the latter of which is regularized towards smooth solutions.
>>> LinearMethod(
>>>     TikhonovRegularized(LeastSquaresLoss(), 0.01),
>>>     ClassRepresentation(RandomForestClassifier(oob_score=True))
>>> )
- fit(X, y, n_classes=None)
Fit this quantifier to data.
- Parameters:
X – The feature matrix to which this quantifier will be fitted.
y – The labels to which this quantifier will be fitted.
n_classes (optional) – The number of expected classes. Defaults to None.
- Returns:
This fitted quantifier itself.
- predict(X)
Predict the class prevalences in a data set.
- Parameters:
X – The feature matrix for which this quantifier will make a prediction.
- Returns:
A numpy array of class prevalences.
- solve(q, M, N=None)
Solve the linear system of equations q=M*p for p.
- Parameters:
q – A numpy array.
M – A numpy matrix.
N – The number of data items that q represents. For some losses, this argument is optional.
- Returns:
The solution vector p.
Popular algorithms
We categorize existing, well-known quantification and unfolding algorithms into Classify and count methods, Distribution matching methods, and Unfolding methods. Each of these methods consists of a fixed combination of Losses, Regularizers, and Data representations.
Classify and count
- class qunfold.ACC(classifier, fit_classifier=True, solver='trust-ncg', solver_options={'gtol': 1e-08, 'maxiter': 1000}, seed=None)
Adjusted Classify & Count by Forman (2008).
This subclass of LinearMethod is instantiated with a LeastSquaresLoss and a ClassRepresentation.
- Parameters:
classifier – A classifier that implements the API of scikit-learn.
fit_classifier (optional) – Whether to fit the classifier when this quantifier is fitted. Defaults to True.
**kwargs – Keyword arguments accepted by LinearMethod.
- class qunfold.PACC(classifier, fit_classifier=True, solver='trust-ncg', solver_options={'gtol': 1e-08, 'maxiter': 1000}, seed=None)
Probabilistic Adjusted Classify & Count by Bella et al. (2010).
This subclass of LinearMethod is instantiated with a LeastSquaresLoss and a ClassRepresentation.
- Parameters:
classifier – A classifier that implements the API of scikit-learn.
fit_classifier (optional) – Whether to fit the classifier when this quantifier is fitted. Defaults to True.
**kwargs – Keyword arguments accepted by LinearMethod.
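Examples
Here, we sketch the instantiation of both methods; the random forest with oob_score=True is an arbitrary choice of classifier.
>>> from sklearn.ensemble import RandomForestClassifier
>>> from qunfold import ACC, PACC
>>> acc = ACC(RandomForestClassifier(oob_score=True)) # crisp predictions
>>> pacc = PACC(RandomForestClassifier(oob_score=True)) # posterior predictions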
Distribution matching
- class qunfold.EDx(metric='euclidean', solver='trust-ncg', solver_options={'gtol': 1e-08, 'maxiter': 1000}, seed=None)
The energy distance-based EDx method by Kawakubo et al. (2016).
This subclass of LinearMethod is instantiated with an EnergyLoss and a DistanceRepresentation.
- Parameters:
metric (optional) – The metric with which the distance between data items is measured. Can take any value that is accepted by scipy.spatial.distance.cdist. Defaults to “euclidean”.
**kwargs – Keyword arguments accepted by LinearMethod.
- class qunfold.EDy(classifier, metric='euclidean', fit_classifier=True, solver='trust-ncg', solver_options={'gtol': 1e-08, 'maxiter': 1000}, seed=None)
The energy distance-based EDy method by Castaño et al. (2022).
This subclass of LinearMethod is instantiated with an EnergyLoss and a DistanceRepresentation, the latter of which uses a ClassRepresentation as a preprocessor.
- Parameters:
classifier – A classifier that implements the API of scikit-learn.
metric (optional) – The metric with which the distance between data items is measured. Can take any value that is accepted by scipy.spatial.distance.cdist. Defaults to “euclidean”.
fit_classifier (optional) – Whether to fit the classifier when this quantifier is fitted. Defaults to True.
**kwargs – Keyword arguments accepted by LinearMethod.
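Examples
Here, we sketch the instantiation of both energy distance-based methods; the metric and the classifier are arbitrary example choices.
>>> from sklearn.ensemble import RandomForestClassifier
>>> from qunfold import EDx, EDy
>>> edx = EDx(metric="cityblock") # distances between raw feature vectors
>>> edy = EDy(RandomForestClassifier(oob_score=True)) # distances between classifier outputs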
- class qunfold.HDx(n_bins, solver='trust-ncg', solver_options={'gtol': 1e-08, 'maxiter': 1000}, seed=None)
The Hellinger distance-based HDx method by González-Castro et al. (2013).
This subclass of LinearMethod is instantiated with a HellingerSurrogateLoss and a HistogramRepresentation.
- Parameters:
n_bins – The number of bins in each feature.
**kwargs – Keyword arguments accepted by LinearMethod.
- class qunfold.HDy(classifier, n_bins, *, fit_classifier=True, solver='trust-ncg', solver_options={'gtol': 1e-08, 'maxiter': 1000}, seed=None)
The Hellinger distance-based HDy method by González-Castro et al. (2013).
This subclass of LinearMethod is instantiated with a HellingerSurrogateLoss and a HistogramRepresentation, the latter of which uses a ClassRepresentation as a preprocessor.
- Parameters:
classifier – A classifier that implements the API of scikit-learn.
n_bins – The number of bins in each class.
fit_classifier (optional) – Whether to fit the classifier when this quantifier is fitted. Defaults to True.
**kwargs – Keyword arguments accepted by LinearMethod.
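Examples
Here, we sketch the instantiation of both Hellinger distance-based methods with 10 bins, an arbitrary example value.
>>> from sklearn.ensemble import RandomForestClassifier
>>> from qunfold import HDx, HDy
>>> hdx = HDx(10) # 10 bins per feature
>>> hdy = HDy(RandomForestClassifier(oob_score=True), 10) # 10 bins per class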
- class qunfold.KMM(kernel='energy', sigma=1, n_rff=1000, solver='trust-ncg', solver_options={'gtol': 1e-08, 'maxiter': 1000}, seed=None)
The kernel-based KMM method with random Fourier features by Dussap et al. (2023).
This subclass of LinearMethod is instantiated with a LeastSquaresLoss and an instance of a KernelRepresentation sub-class that corresponds to the kernel argument.
- Parameters:
kernel (optional) – Which kernel to use. Can be a callable with the signature (X[y==i], X[y==j]) -> scalar or one of “energy”, “gaussian”, “laplacian” and “rff”. Defaults to “energy”.
sigma (optional) – A smoothing parameter that is used if kernel in [“gaussian”, “laplacian”, “rff”]. Defaults to 1.
n_rff (optional) – The number of random Fourier features if kernel == “rff”. Defaults to 1000.
**kwargs – Keyword arguments accepted by LinearMethod. The seed argument also controls the randomness of the random Fourier features if kernel == “rff”.
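Examples
Here, we sketch the default energy kernel and an RFF variant; the values of sigma, n_rff, and seed are arbitrary example values.
>>> from qunfold import KMM
>>> kmm_energy = KMM() # the default energy kernel
>>> kmm_rff = KMM(kernel="rff", sigma=0.5, n_rff=2000, seed=42) # random Fourier features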
Unfolding
- class qunfold.RUN(representation, *, tau=0.0, solver='trust-ncg', solver_options={'gtol': 1e-08, 'maxiter': 1000}, seed=None)
Regularized Unfolding by Blobel (1985).
This subclass of LinearMethod is instantiated with a TikhonovRegularized(BlobelLoss).
- Parameters:
representation – An instance from qunfold.methods.linear.representations.
tau (optional) – The regularization strength. Defaults to 0.
**kwargs – Keyword arguments accepted by LinearMethod.
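Examples
Here, we sketch a RUN instance over a classification-based representation; the classifier and the value of tau are arbitrary example choices.
>>> from sklearn.ensemble import RandomForestClassifier
>>> from qunfold import RUN, ClassRepresentation
>>> run = RUN(ClassRepresentation(RandomForestClassifier(oob_score=True)), tau=1e-3)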
Methods beyond systems of linear equations
Not all quantification algorithms make predictions by solving systems of linear equations. Instead, the following methods maximize the likelihood of the prediction directly.
- class qunfold.LikelihoodMaximizer(classifier: Any, solver: str = 'trust-ncg', solver_options: Dict[str, Any] = <factory>, tau_0: float = 0.0, tau_1: float = 0.0, fit_classifier: bool = True, seed: int | None = None)
The maximum likelihood method, as studied by Alexandari et al. (2020).
This method is proven to be asymptotically equivalent to the ExpectationMaximizer by Saerens et al. (2002).
- Parameters:
classifier – A classifier that implements the API of scikit-learn.
solver (optional) – The method argument in scipy.optimize.minimize. Defaults to “trust-ncg”.
solver_options (optional) – The options argument in scipy.optimize.minimize. Defaults to {“gtol”: 1e-8, “maxiter”: 1000}.
tau_0 (optional) – The regularization strength for penalizing deviations from uniform predictions. Defaults to 0.
tau_1 (optional) – The regularization strength for penalizing deviations from non-ordinal predictions. Defaults to 0.
fit_classifier (optional) – Whether to fit the classifier when this quantifier is fitted. Defaults to True.
seed (optional) – A random number generator seed from which a numpy RandomState is created. Defaults to None.
- class qunfold.ExpectationMaximizer(classifier: Any, max_iter: int = 100, tol: float = 1e-06, fit_classifier: bool = True)
The expectation maximization-based method by Saerens et al. (2002).
This method is proven to be asymptotically equivalent to the LikelihoodMaximizer by Alexandari et al. (2020).
- Parameters:
classifier – A classifier that implements the API of scikit-learn.
max_iter (optional) – The maximum number of iterations. Defaults to 100, which is hardly ever reached with the default value of tol.
tol (optional) – The convergence tolerance for the L2 norm between iterations. Defaults to 1e-6, the float32 resolution.
fit_classifier (optional) – Whether to fit the classifier when this quantifier is fitted. Defaults to True.
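Examples
Here, we sketch the instantiation of both likelihood-based methods; the classifier and the value of tau_0 are arbitrary example choices.
>>> from sklearn.ensemble import RandomForestClassifier
>>> from qunfold import LikelihoodMaximizer, ExpectationMaximizer
>>> mle = LikelihoodMaximizer(RandomForestClassifier(oob_score=True), tau_0=0.01)
>>> sld = ExpectationMaximizer(RandomForestClassifier(oob_score=True))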
Losses
- class qunfold.LeastSquaresLoss
The loss function of ACC (Forman, 2008), PACC (Bella et al., 2010), and ReadMe (Hopkins & King, 2010).
This loss function computes the sum of squares of element-wise errors between q and M*p.
- class qunfold.EnergyLoss
The loss function of EDx (Kawakubo et al., 2016) and EDy (Castaño et al., 2022).
This loss function represents the Energy Distance between two samples.
- class qunfold.HellingerSurrogateLoss
The loss function of HDx and HDy (González-Castro et al., 2013).
This loss function computes the average of the squared Hellinger distances between feature-wise (or class-wise) histograms. Note that the original HDx and HDy by González-Castro et al. (2013) use the regular Hellinger distance instead of its square. The regular distance is problematic because it is not always twice differentiable and, hence, complicates numerical optimization.
- class qunfold.BlobelLoss
The loss function of RUN (Blobel, 1985).
This loss function models a likelihood function under the assumption of independent Poisson-distributed elements of q with Poisson rates M*p.
- class qunfold.CombinedLoss(*losses, weights=None)
The weighted sum of multiple losses.
- Parameters:
*losses – An arbitrary number of losses to be added together.
weights (optional) – An array of weights by which the losses are scaled.
Hint
You can use the CombinedLoss to create arbitrary, weighted sums of losses and regularizers.
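Examples
Here, we sketch a custom method that adds Tikhonov regularization to a least-squares loss through CombinedLoss; the choice of representation and the weight 0.01 are arbitrary example choices.
>>> from qunfold import CombinedLoss, DistanceRepresentation, LeastSquaresLoss, LinearMethod, TikhonovRegularization
>>> LinearMethod(
>>>     CombinedLoss(LeastSquaresLoss(), TikhonovRegularization(), weights=[1, 0.01]),
>>>     DistanceRepresentation()
>>> )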
Regularizers
- qunfold.TikhonovRegularized(loss, tau=0.0)
Add TikhonovRegularization (Blobel, 1985) to any loss.
Calling this function is equivalent to calling
>>> CombinedLoss(loss, TikhonovRegularization(), weights=[1, tau])
- Parameters:
loss – An instance from qunfold.losses.
tau (optional) – The regularization strength. Defaults to 0.
- Returns:
An instance of CombinedLoss.
Examples
The regularized loss of RUN (Blobel, 1985) is:
>>> TikhonovRegularized(BlobelLoss(), tau)
- class qunfold.TikhonovRegularization
Tikhonov regularization, as proposed by Blobel (1985).
This regularization promotes smooth solutions. This behavior is often required in ordinal quantification and in unfolding problems.
Data representations
- class qunfold.ClassRepresentation(classifier: Any, is_probabilistic: bool = False, fit_classifier: bool = True)
A classification-based data representation.
This representation can either be probabilistic (using the posterior predictions of a classifier) or crisp (using the class predictions of a classifier). It is used in ACC, PACC, CC, PCC, and SLD.
- Parameters:
classifier – A classifier that implements the API of scikit-learn.
is_probabilistic (optional) – Whether probabilistic or crisp predictions of the classifier are used to represent the data. Defaults to False.
fit_classifier (optional) – Whether to fit the classifier when this quantifier is fitted. Defaults to True.
- class qunfold.DistanceRepresentation(metric: str = 'euclidean', preprocessor: AbstractRepresentation | None = None)
A distance-based data representation, as it is used in EDx and EDy.
- Parameters:
metric (optional) – The metric with which the distance between data items is measured. Can take any value that is accepted by scipy.spatial.distance.cdist. Defaults to “euclidean”.
preprocessor (optional) – Another AbstractRepresentation that is called before this representation. Defaults to None.
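Examples
Here, we sketch the chaining of representations through the preprocessor argument, mirroring the description of EDy above; that is_probabilistic=True matches EDy exactly is an assumption of this sketch.
>>> from sklearn.ensemble import RandomForestClassifier
>>> from qunfold import ClassRepresentation, DistanceRepresentation
>>> DistanceRepresentation(
>>>     preprocessor = ClassRepresentation(RandomForestClassifier(oob_score=True), is_probabilistic=True)
>>> )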
- class qunfold.HistogramRepresentation(n_bins: int, preprocessor: AbstractRepresentation | None = None, unit_scale: bool = True)
A histogram-based data representation, as it is used in HDx and HDy.
- Parameters:
n_bins – The number of bins in each feature.
preprocessor (optional) – Another AbstractRepresentation that is called before this representation. Defaults to None.
unit_scale (optional) – Whether or not to scale each output to a sum of one. A value of False indicates that the sum of each output is the number of features. Defaults to True.
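Examples
Here, we sketch an HDx-style method composed from its documented parts; per the description of HDx above, this composition should behave like HDx(10).
>>> from qunfold import HellingerSurrogateLoss, HistogramRepresentation, LinearMethod
>>> LinearMethod(HellingerSurrogateLoss(), HistogramRepresentation(10))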
- class qunfold.EnergyKernelRepresentation(preprocessor: AbstractRepresentation | None = None)
A kernel-based data representation, as it is used in KMM, that uses the energy kernel:
k(x_1, x_2) = ||x_1|| + ||x_2|| - ||x_1 - x_2||
Note
The methods of this representation do not support setting average=False.
- Parameters:
preprocessor (optional) – Another AbstractRepresentation that is called before this representation. Defaults to None.
- class qunfold.GaussianKernelRepresentation(sigma: float = 1.0, preprocessor: AbstractRepresentation | None = None)
A kernel-based data representation, as it is used in KMM, that uses the Gaussian kernel:
k(x, y) = exp(-||x - y||^2 / (2σ^2))
- Parameters:
sigma (optional) – A smoothing parameter of the kernel function. Defaults to 1.
preprocessor (optional) – Another AbstractRepresentation that is called before this representation. Defaults to None.
- class qunfold.LaplacianKernelRepresentation(sigma=1.0)
A kernel-based data representation, as it is used in KMM, that uses the Laplacian kernel.
- Parameters:
sigma (optional) – A smoothing parameter of the kernel function. Defaults to 1.
- class qunfold.GaussianRFFKernelRepresentation(sigma: float = 1.0, n_rff: int = 1000, preprocessor: AbstractRepresentation | None = None, seed: int | None = None)
An efficient approximation of the GaussianKernelRepresentation, as it is used in KMM, using random Fourier features.
- Parameters:
sigma (optional) – A smoothing parameter of the kernel function. Defaults to 1.
n_rff (optional) – The number of random Fourier features. Defaults to 1000.
preprocessor (optional) – Another AbstractRepresentation that is called before this representation. Defaults to None.
seed (optional) – Controls the randomness of the random Fourier features. Defaults to None.
- class qunfold.OriginalRepresentation
A dummy representation that simply returns the data as it is.
Utilities
The following classes provide functionalities that go beyond the composition of quantification methods.
QuaPy
The qunfold.quapy module allows you to wrap any quantification method for use in QuaPy.
- class qunfold.quapy.QuaPyWrapper(_method: AbstractMethod)
A thin wrapper for using qunfold methods in QuaPy.
- Parameters:
_method – An instance of qunfold.methods.AbstractMethod to wrap.
Examples
Here, we wrap an instance of ACC to perform a grid search with QuaPy.
>>> qunfold_method = QuaPyWrapper(ACC(RandomForestClassifier(oob_score=True)))
>>> quapy.model_selection.GridSearchQ(
>>>     model = qunfold_method,
>>>     param_grid = { # try both splitting criteria
>>>         "representation__classifier__estimator__criterion": ["gini", "entropy"],
>>>     },
>>>     # ...
>>> )
Cross-validated training
The qunfold.sklearn module allows you to train classification-based quantification methods through cross-validation. Importing this module requires scikit-learn to be installed.
- class qunfold.sklearn.CVClassifier(estimator, n_estimators=5, random_state=None)
An ensemble of classifiers that are trained from cross-validation folds.
All objects of this type have a fixed attribute oob_score = True and, when trained, a fitted attribute self.oob_decision_function_, just like scikit-learn bagging classifiers.
- Parameters:
estimator – A classifier that implements the API of scikit-learn.
n_estimators (optional) – The number of stratified cross-validation folds. Defaults to 5.
random_state (optional) – The random state for stratification. Defaults to None.
Examples
Here, we create an instance of ACC that trains a logistic regression classifier with 10 cross-validation folds.
>>> ACC(CVClassifier(LogisticRegression(), 10))
Hint
If you use a bagging classifier (like random forests) with oob_score=True, you do not need to use cross-validation. Instead, the quantification method is then trained on the out-of-bag predictions of the bagging classifier.
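For illustration, a sketch of this out-of-bag alternative, using scikit-learn's BaggingClassifier as an arbitrary example of a bagging ensemble:
>>> from sklearn.ensemble import BaggingClassifier
>>> from sklearn.tree import DecisionTreeClassifier
>>> from qunfold import PACC
>>> PACC(BaggingClassifier(DecisionTreeClassifier(), oob_score=True))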