API

The AbstractMethod class defines the interface of all quantification and unfolding algorithms. Most importantly, this interface consists of the fit and predict methods.

class qunfold.AbstractMethod

Abstract base class for quantification methods.

abstract fit(X, y, n_classes=None)

Fit this quantifier to data.

Parameters:
  • X – The feature matrix to which this quantifier will be fitted.

  • y – The labels to which this quantifier will be fitted.

  • n_classes (optional) – The number of expected classes. Defaults to None.

Returns:

This fitted quantifier itself.

abstract predict(X)

Predict the class prevalences in a data set.

Parameters:

X – The feature matrix for which this quantifier will make a prediction.

Returns:

A numpy array of class prevalences.
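
Examples

For illustration, a minimal fit/predict round trip could look as follows; the data is synthetic and ACC with a random forest is just one possible choice of method.

>>> import numpy as np
>>> from sklearn.ensemble import RandomForestClassifier
>>> from qunfold import ACC
>>> X_trn = np.random.rand(1000, 8)        # synthetic training features
>>> y_trn = np.random.randint(0, 3, 1000)  # synthetic labels of 3 classes
>>> X_tst = np.random.rand(500, 8)         # synthetic test features
>>> acc = ACC(RandomForestClassifier(oob_score=True))
>>> acc.fit(X_trn, y_trn)                  # returns the fitted quantifier itself
>>> p_hat = acc.predict(X_tst)             # a numpy array of class prevalences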

Instances of many popular algorithms for quantification and unfolding are created through the corresponding constructors. However, you can also define your own quantification method as a LinearMethod that combines an arbitrary choice of Losses, Regularizers, and Data representations.

class qunfold.LinearMethod(loss: qunfold.methods.linear.losses.AbstractLoss, representation: qunfold.methods.linear.representations.AbstractRepresentation, solver: str = 'trust-ncg', solver_options: Dict[str, Any] = <factory>, seed: int | None = None)

A generic quantification / unfolding method that predicts class prevalences by solving a system of linear equations.

This class represents any method that consists of a loss function, a data representation, and, optionally, a regularization term. In this implementation, the (possibly regularized) loss is minimized through unconstrained second-order minimization. Valid probability estimates are ensured through the soft-max trick of Bunse (2022).

Parameters:
  • loss – An instance from qunfold.methods.linear.losses.

  • representation – An instance from qunfold.methods.linear.representations.

  • solver (optional) – The method argument in scipy.optimize.minimize. Defaults to “trust-ncg”.

  • solver_options (optional) – The options argument in scipy.optimize.minimize. Defaults to {“gtol”: 1e-8, “maxiter”: 1000}.

  • seed (optional) – A random number generator seed from which a numpy RandomState is created. Defaults to None.

Examples

Here, we create the ordinal variant of ACC (Bunse et al., 2023). This variant combines the original data representation of ACC with the original loss of ACC, where the loss is regularized towards smooth solutions.

>>> from sklearn.ensemble import RandomForestClassifier
>>> from qunfold import ClassRepresentation, LeastSquaresLoss, LinearMethod, TikhonovRegularized
>>> LinearMethod(
>>>     TikhonovRegularized(LeastSquaresLoss(), 0.01),
>>>     ClassRepresentation(RandomForestClassifier(oob_score=True))
>>> )

fit(X, y, n_classes=None)

Fit this quantifier to data.

Parameters:
  • X – The feature matrix to which this quantifier will be fitted.

  • y – The labels to which this quantifier will be fitted.

  • n_classes (optional) – The number of expected classes. Defaults to None.

Returns:

This fitted quantifier itself.

predict(X)

Predict the class prevalences in a data set.

Parameters:

X – The feature matrix for which this quantifier will make a prediction.

Returns:

A numpy array of class prevalences.

solve(q, M, N=None)

Solve the linear system of equations q=M*p for p.

Parameters:
  • q – A numpy array.

  • M – A numpy matrix.

  • N – The number of data items that q represents. For some losses, this argument is optional.

Returns:

The solution vector p.
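
Examples

To illustrate the semantics of solve, the following sketch sets up a synthetic system q = M*p and recovers p. It assumes that solve can be called on a freshly constructed instance, without a preceding fit; the matrix M and the sample size N are made up.

>>> import numpy as np
>>> from qunfold import LeastSquaresLoss, LinearMethod, OriginalRepresentation
>>> M = np.array([[0.8, 0.1], [0.2, 0.9]])  # a made-up transfer matrix
>>> q = M @ np.array([0.3, 0.7])            # q = M*p for p = [0.3, 0.7]
>>> method = LinearMethod(LeastSquaresLoss(), OriginalRepresentation())
>>> method.solve(q, M, N=1000)              # approximately [0.3, 0.7]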

Losses

class qunfold.LeastSquaresLoss

The loss function of ACC (Forman, 2008), PACC (Bella et al., 2019), and ReadMe (Hopkins & King, 2010).

This loss function computes the sum of squares of element-wise errors between q and M*p.

class qunfold.EnergyLoss

The loss function of EDx (Kawakubo et al., 2016) and EDy (Castaño et al., 2022).

This loss function represents the Energy Distance between two samples.

class qunfold.HellingerSurrogateLoss

The loss function of HDx and HDy (González-Castro et al., 2013).

This loss function computes the average of the squared Hellinger distances between feature-wise (or class-wise) histograms. Note that the original HDx and HDy by González-Castro et al. (2013) use the regular Hellinger distance instead of its square. The regular distance is problematic because it is not always twice differentiable and, hence, complicates numerical optimization.

class qunfold.BlobelLoss

The loss function of RUN (Blobel, 1985).

This loss function models a likelihood function under the assumption of independent Poisson-distributed elements of q with Poisson rates M*p.

class qunfold.CombinedLoss(*losses, weights=None)

The weighted sum of multiple losses.

Parameters:
  • *losses – An arbitrary number of losses to be added together.

  • weights (optional) – An array of weights with which the losses are scaled.

Hint

You can use the CombinedLoss to create arbitrary, weighted sums of losses and regularizers.
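
For instance, the regularized loss of RUN below could equally be written as a weighted sum by hand; the weight 0.01 is an arbitrary choice.

>>> from qunfold import BlobelLoss, CombinedLoss, TikhonovRegularization
>>> loss = CombinedLoss(BlobelLoss(), TikhonovRegularization(), weights=[1, 0.01])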

Regularizers

qunfold.TikhonovRegularized(loss, tau=0.0)

Add TikhonovRegularization (Blobel, 1985) to any loss.

Calling this function is equivalent to calling

>>> CombinedLoss(loss, TikhonovRegularization(), weights=[1, tau])

Parameters:
  • loss – An instance from qunfold.losses.

  • tau (optional) – The regularization strength. Defaults to 0.

Returns:

An instance of CombinedLoss.

Examples

The regularized loss of RUN (Blobel, 1985) is:

>>> TikhonovRegularized(BlobelLoss(), tau)

class qunfold.TikhonovRegularization

Tikhonov regularization, as proposed by Blobel (1985).

This regularization promotes smooth solutions. This behavior is often required in ordinal quantification and in unfolding problems.

Data representations

class qunfold.ClassRepresentation(classifier: Any, is_probabilistic: bool = False, fit_classifier: bool = True)

A classification-based data representation.

This representation can either be probabilistic (using the posterior predictions of a classifier) or crisp (using the class predictions of a classifier). It is used in ACC, PACC, CC, PCC, and SLD.

Parameters:
  • classifier – A classifier that implements the API of scikit-learn.

  • is_probabilistic (optional) – Whether probabilistic or crisp predictions of the classifier are used to represent the data. Defaults to False.

  • fit_classifier (optional) – Whether to fit the classifier when this quantifier is fitted. Defaults to True.
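
Examples

As an illustration, a PACC-style probabilistic representation could be composed as follows; the logistic regression and its cross-validated training (see CVClassifier below) are placeholder choices.

>>> from sklearn.linear_model import LogisticRegression
>>> from qunfold import ClassRepresentation
>>> from qunfold.sklearn import CVClassifier
>>> ClassRepresentation(
>>>     CVClassifier(LogisticRegression()),
>>>     is_probabilistic=True
>>> )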

class qunfold.DistanceRepresentation(metric: str = 'euclidean', preprocessor: AbstractRepresentation | None = None)

A distance-based data representation, as it is used in EDx and EDy.

Parameters:
  • metric (optional) – The metric with which the distance between data items is measured. Can take any value that is accepted by scipy.spatial.distance.cdist. Defaults to “euclidean”.

  • preprocessor (optional) – Another AbstractRepresentation that is called before this representation. Defaults to None.

class qunfold.HistogramRepresentation(n_bins: int, preprocessor: AbstractRepresentation | None = None, unit_scale: bool = True)

A histogram-based data representation, as it is used in HDx and HDy.

Parameters:
  • n_bins – The number of bins in each feature.

  • preprocessor (optional) – Another AbstractRepresentation that is called before this representation. Defaults to None.

  • unit_scale (optional) – Whether or not to scale each output to a sum of one. A value of False indicates that the sum of each output is the number of features. Defaults to True.
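
Examples

For example, an HDx-style method could be sketched by combining this representation with the HellingerSurrogateLoss; the choice of 10 bins is arbitrary.

>>> from qunfold import HellingerSurrogateLoss, HistogramRepresentation, LinearMethod
>>> LinearMethod(HellingerSurrogateLoss(), HistogramRepresentation(n_bins=10))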

class qunfold.EnergyKernelRepresentation(preprocessor: AbstractRepresentation | None = None)

A kernel-based data representation, as it is used in KMM, that uses the energy kernel:

k(x_1, x_2) = ||x_1|| + ||x_2|| - ||x_1 - x_2||

Note

The methods of this representation do not support setting average=False.

Parameters:

preprocessor (optional) – Another AbstractRepresentation that is called before this representation. Defaults to None.

class qunfold.GaussianKernelRepresentation(sigma: float = 1.0, preprocessor: AbstractRepresentation | None = None)

A kernel-based data representation, as it is used in KMM, that uses the Gaussian kernel:

k(x, y) = exp(-||x - y||^2 / (2σ^2))

Parameters:
  • sigma (optional) – A smoothing parameter of the kernel function. Defaults to 1.

  • preprocessor (optional) – Another AbstractRepresentation that is called before this representation. Defaults to None.

class qunfold.LaplacianKernelRepresentation(sigma=1.0)

A kernel-based data representation, as it is used in KMM, that uses the Laplacian kernel.

Parameters:

sigma (optional) – A smoothing parameter of the kernel function. Defaults to 1.

class qunfold.GaussianRFFKernelRepresentation(sigma: float = 1.0, n_rff: int = 1000, preprocessor: AbstractRepresentation | None = None, seed: int | None = None)

An efficient approximation of the GaussianKernelRepresentation, as it is used in KMM, through random Fourier features.

Parameters:
  • sigma (optional) – A smoothing parameter of the kernel function. Defaults to 1.

  • n_rff (optional) – The number of random Fourier features. Defaults to 1000.

  • preprocessor (optional) – Another AbstractRepresentation that is called before this representation. Defaults to None.

  • seed (optional) – Controls the randomness of the random Fourier features. Defaults to None.
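
Examples

As a sketch, a KMM-style method with random Fourier features could be composed as follows. Note that the least-squares loss is an assumption of this sketch; refer to the KMM constructor for the canonical composition.

>>> from qunfold import GaussianRFFKernelRepresentation, LeastSquaresLoss, LinearMethod
>>> LinearMethod(
>>>     LeastSquaresLoss(),
>>>     GaussianRFFKernelRepresentation(sigma=1.0, n_rff=1000, seed=25)
>>> )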

class qunfold.OriginalRepresentation

A dummy representation that simply returns the data as it is.

Utilities

The following classes provide functionalities that go beyond the composition of quantification methods.

QuaPy

The qunfold.quapy module allows you to wrap any quantification method for use in QuaPy.

class qunfold.quapy.QuaPyWrapper(_method: AbstractMethod)

A thin wrapper for using qunfold methods in QuaPy.

Parameters:

_method – An instance of qunfold.methods.AbstractMethod to wrap.

Examples

Here, we wrap an instance of ACC to perform a grid search with QuaPy.

>>> qunfold_method = QuaPyWrapper(ACC(RandomForestClassifier(oob_score=True)))
>>> quapy.model_selection.GridSearchQ(
>>>     model = qunfold_method,
>>>     param_grid = { # try both splitting criteria
>>>         "representation__classifier__estimator__criterion": ["gini", "entropy"],
>>>     },
>>>     # ...
>>> )

Cross-validated training

The qunfold.sklearn module allows you to train classification-based quantification methods through cross-validation. Importing this module requires scikit-learn to be installed.

class qunfold.sklearn.CVClassifier(estimator, n_estimators=5, random_state=None)

An ensemble of classifiers that are trained from cross-validation folds.

All objects of this type have a fixed attribute oob_score = True and, when trained, a fitted attribute self.oob_decision_function_, just like scikit-learn bagging classifiers.

Parameters:
  • estimator – A classifier that implements the API of scikit-learn.

  • n_estimators (optional) – The number of stratified cross-validation folds. Defaults to 5.

  • random_state (optional) – The random state for stratification. Defaults to None.

Examples

Here, we create an instance of ACC that trains a logistic regression classifier with 10 cross-validation folds.

>>> ACC(CVClassifier(LogisticRegression(), 10))

Hint

If you use a bagging classifier (like random forests) with oob_score=True, you do not need cross-validated training. Instead, the quantification method is trained on the out-of-bag predictions of the bagging classifier.
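
For instance, this sketch trains ACC on the out-of-bag predictions of a random forest, without any cross-validation wrapper:

>>> from sklearn.ensemble import RandomForestClassifier
>>> from qunfold import ACC
>>> ACC(RandomForestClassifier(oob_score=True))  # no CVClassifier needed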