API Reference

Pinard 1.0 - 06/2023

Augmentation API

class pinard.augmentation.Augmenter(apply_on='samples', random_state=None, *, copy=True)

Bases: TransformerMixin, BaseEstimator

Base class for data augmentation transformers.

abstract augment(X, apply_on='samples')

Perform data augmentation.

Parameters:
  • X (array-like) – Input data to augment.

  • apply_on (str) – The level at which augmentation is applied. Can be one of ‘samples’, ‘features’, ‘subsets’, or ‘global’. Defaults to ‘samples’.

Returns:

Augmented data.

Return type:

array-like

fit(X, y=None)

Fit to data.

Parameters:
  • X (array-like) – Input data to fit.

  • y (array-like or None) – Target variable (unused).

Returns:

self – Returns the instance itself.

Return type:

object

fit_transform(X, y=None, **fit_params)

Fit to data and transform it.

Parameters:
  • X (array-like) – Input data to fit and transform.

  • y (array-like or None) – Target variable (unused).

  • **fit_params (dict) – Additional fitting parameters (unused).

Returns:

Transformed data.

Return type:

array-like

transform(X)

Transform the input data by applying data augmentation.

Parameters:

X (array-like) – Input data to transform.

Returns:

Transformed data after augmentation.

Return type:

array-like

class pinard.augmentation.IdentityAugmenter(apply_on='samples', random_state=None, *, copy=True)

Bases: Augmenter

An augmenter that returns the input data without any changes.

augment(X, _)

Perform identity augmentation.

Parameters:
  • X (array-like) – Input data to augment.

  • _ (str) – Placeholder for unused parameter.

Returns:

Augmented data (same as input data).

Return type:

array-like

class pinard.augmentation.Random_X_Operation(apply_on='features', random_state=None, *, copy=True, operator_func=<built-in function mul>, operator_range=(0.97, 1.03))

Bases: Augmenter

Augmenter that applies a random operation to the data.

Parameters:
  • apply_on (str, optional) – Apply augmentation on “features” or “samples” data. Default is “features”.

  • random_state (int or None, optional) – Random seed for reproducibility. Default is None.

  • copy (bool, optional) – If True, creates a copy of the input data. Default is True.

  • operator_func (function, optional) – Operator function to be applied. Default is operator.mul.

  • operator_range (tuple, optional) – Range for generating random values for the operator. Default is (0.97, 1.03).

augment(X, apply_on='samples')

Augment the data by applying random operation.

Parameters:
  • X (ndarray) – Input data to be augmented.

  • apply_on (str, optional) – Apply augmentation on “features” or “samples” data. Default is “samples”.

Returns:

Augmented data.

Return type:

ndarray
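As a rough illustration of the idea, the sketch below draws one random operand per feature from operator_range and combines it with the data through operator_func. The helper name random_x_operation and the choice of one draw per feature are assumptions made for this example; it is not the library's implementation.

```python
import operator

import numpy as np


def random_x_operation(X, operator_func=operator.mul,
                       operator_range=(0.97, 1.03), random_state=None):
    """Hypothetical helper: apply operator_func with one random operand per feature."""
    rng = np.random.default_rng(random_state)
    X = np.asarray(X, dtype=float)
    low, high = operator_range
    # One random operand per column, broadcast over all rows.
    operands = rng.uniform(low, high, size=X.shape[1])
    return operator_func(X, operands)


X = np.ones((4, 6))
X_aug = random_x_operation(X, random_state=0)
print(X_aug.shape)
```

With the default operator.mul and a range close to 1, each feature is rescaled by at most a few percent, which keeps the augmented spectra close to the originals.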

class pinard.augmentation.Rotate_Translate(apply_on='samples', random_state=None, *, copy=True, p_range=2, y_factor=3)

Bases: Augmenter

Augmenter that rotates and translates the signal.

Parameters:
  • apply_on (str, optional) – Apply augmentation on “samples” or “global” data. Default is “samples”.

  • random_state (int or None, optional) – Random seed for reproducibility. Default is None.

  • copy (bool, optional) – If True, creates a copy of the input data. Default is True.

  • p_range (int, optional) – Range for generating random slope values. Default is 2.

  • y_factor (int, optional) – Scaling factor for the initial value. Default is 3.

augment(X, apply_on='samples')

Augment the data by rotating and translating the signal.

Parameters:
  • X (ndarray) – Input data to be augmented.

  • apply_on (str, optional) – Apply augmentation on “samples” or “global” data. Default is “samples”.

Returns:

Augmented data.

Return type:

ndarray

class pinard.augmentation.Spline_Curve_Simplification(apply_on='samples', random_state=None, *, copy=True, spline_points=None, uniform=False)

Bases: Augmenter

Class to simplify a 1D signal using B-spline interpolation along the curve.

Parameters:
  • X (ndarray) – Input data.

  • apply_on (str, optional) – Apply augmentation on “samples” or “global” (default: “samples”).

  • spline_points (int, optional) – Number of spline points for simplification. Default is None: the length of the sample / 4.

  • uniform (bool, optional) – If True, the spline points are uniformly spaced. Default is False.

augment(X, apply_on='samples')

Select regularly spaced points on the x-axis and adjust a spline.

Parameters:
  • X (ndarray) – Input data.

  • apply_on (str, optional) – Apply augmentation on “samples” or “features” (default: “samples”).

Returns:

Augmented data.

Return type:

ndarray

class pinard.augmentation.Spline_Smoothing(apply_on='samples', random_state=None, *, copy=True)

Bases: Augmenter

Class to apply a smoothing spline to a 1D signal.

Parameters:
  • X (ndarray) – Input data.

  • apply_on (str, optional) – Apply augmentation on “samples” or “global” (default: “samples”).

augment(X, apply_on='samples')

Apply a smoothing spline to the data.

Parameters:
  • X (ndarray) – Input data.

  • apply_on (str, optional) – Apply augmentation on “samples” or “global” (default: “samples”).

Returns:

Augmented data.

Return type:

ndarray

class pinard.augmentation.Spline_X_Perturbations(apply_on='samples', random_state=None, *, copy=True, spline_degree=3, perturbation_density=0.05, perturbation_range=(-10, 10))

Bases: Augmenter

Class to apply a perturbation to a 1D signal using B-spline interpolation.

Parameters:
  • X (ndarray) – Input data.

  • apply_on (str, optional) – Apply augmentation on “samples” or “global” (default: “samples”).

  • spline_degree (int, optional) – Degree of the spline. Default is 3 (cubic).

  • perturbation_density (float, optional) – Density of perturbation points relative to data size. Default is 0.05.

  • perturbation_range (tuple, optional) – Range of perturbation values (min, max). Default is (-10, 10).

augment(X, apply_on='samples')

Augment the data with a perturbation using B-spline interpolation.

Parameters:
  • X (ndarray) – Input data to be augmented.

  • apply_on (str, optional) – Apply augmentation on “samples” or “global” data. Default is “samples”.

Returns:

Augmented data.

Return type:

ndarray

class pinard.augmentation.Spline_X_Simplification(apply_on='samples', random_state=None, *, copy=True, spline_points=None, uniform=False)

Bases: Augmenter

Class to simplify a 1D signal using B-spline interpolation along the x-axis.

Parameters:
  • X (ndarray) – Input data.

  • apply_on (str, optional) – Apply augmentation on “samples” or “global” (default: “samples”).

  • spline_points (int, optional) – Number of spline points for simplification. Default is None: the length of the sample / 4.

  • uniform (bool, optional) – If True, the spline points are uniformly spaced. Default is False.

augment(X, apply_on='samples')

Select randomly spaced points along the x-axis and adjust a spline.

Parameters:
  • X (ndarray) – Input data.

  • apply_on (str, optional) – Apply augmentation on “samples” or “global” (default: “samples”).

Returns:

Augmented data.

Return type:

ndarray

class pinard.augmentation.Spline_Y_Perturbations(apply_on='samples', random_state=None, *, copy=True, spline_points=None, perturbation_intensity=0.005)

Bases: Augmenter

Augment the data with a perturbation on the y-axis using B-spline interpolation.

Parameters:
  • X (ndarray) – Input data.

  • apply_on (str, optional) – Apply augmentation on “samples” or “global” (default: “samples”).

  • spline_points (int, optional) – Number of spline points. Default is None.

  • perturbation_intensity (float, optional) – Intensity of the perturbation. Default is 0.005.

augment(X, apply_on='samples')

Augment the data with a perturbation on the y-axis using B-spline interpolation.

Parameters:
  • X (ndarray) – Input data to be augmented.

  • apply_on (str, optional) – Apply augmentation on “samples” or “global” data. Default is “samples”.

Returns:

Augmented data.

Return type:

ndarray

Preprocessing API

The pinard.preprocessing module provides transformers (Savitzky-Golay, baseline, Haar, Gaussian, etc.) for preprocessing NIR spectra.

class pinard.preprocessing.Baseline(*, copy=True)

Bases: TransformerMixin, BaseEstimator

Removes baseline (mean) from each spectrum.

Parameters:

copy (bool, optional) – Flag to indicate whether to make a copy of the object, by default True.

fit(X, y=None)

Compute the minimum and maximum to be used for later scaling.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – The data used to compute the per-feature minimum and maximum used for later scaling along the features axis.

  • y (None) – Ignored.

Returns:

self – Fitted Baseline object.

Return type:

object

inverse_transform(X, y=None)

partial_fit(X, y=None)

transform(X, y=None)

class pinard.preprocessing.Derivate(order=1, delta=1, copy=True)

Bases: TransformerMixin, BaseEstimator

fit(X, y=None)

transform(X, copy=None)

class pinard.preprocessing.Detrend(bp=0, *, copy=True)

Bases: TransformerMixin, BaseEstimator

Perform spectral detrending to remove linear trend from data.

Parameters:
  • bp (int, optional) – Breakpoints for piecewise linear detrending. Default is 0.

  • copy (bool, optional) – Whether to make a copy of the input data. Default is True.

fit(X, y=None)

Fit the transformer to the data.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – The input data.

  • y (None) – Ignored.

Returns:

self – Returns self.

Return type:

object

transform(X, copy=None)

Transform the data by removing linear trend.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – The input data.

  • copy (bool or None, optional) – Whether to make a copy of the input data. If None, self.copy is used. Default is None.

Returns:

The transformed data.

Return type:

numpy.ndarray
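To make "removing linear trend" concrete, here is a minimal NumPy sketch that fits a least-squares line to each spectrum and subtracts it. The helper linear_detrend is hypothetical, it ignores the bp breakpoints, and it is not taken from the library.

```python
import numpy as np


def linear_detrend(spectra):
    """Illustrative only: subtract a least-squares line from each row."""
    spectra = np.asarray(spectra, dtype=float)
    x = np.arange(spectra.shape[1])
    # polyfit accepts a 2D y of shape (n_points, n_series), hence the transpose.
    slope, intercept = np.polyfit(x, spectra.T, deg=1)
    trend = np.outer(slope, x) + intercept[:, None]
    return spectra - trend


ramp = np.array([[0.0, 1.0, 2.0, 3.0], [5.0, 7.0, 9.0, 11.0]])
detrended = linear_detrend(ramp)
print(np.round(detrended, 6))
```

Both rows are pure ramps, so the detrended result is zero up to floating-point error.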

class pinard.preprocessing.Gaussian(order=2, sigma=1, *, copy=True)

Bases: TransformerMixin, BaseEstimator

fit(X, y=None)

Fit the Gaussian filter.

Parameters:
  • X (numpy.ndarray) – Input data.

  • y (None) – Ignored.

Returns:

self – Returns the instance itself.

Return type:

object

transform(X, copy=None)

Transform the input data using the Gaussian filter.

Parameters:
  • X (numpy.ndarray) – Input data.

  • copy (bool, default=None) – Whether to make a copy of the input data.

Returns:

Transformed data.

Return type:

numpy.ndarray

class pinard.preprocessing.Haar(*, copy: bool = True)

Bases: Wavelet

Shortcut to the Wavelet haar transform.

pinard.preprocessing.IdentityTransformer

alias of FunctionTransformer

class pinard.preprocessing.MultiplicativeScatterCorrection(scale=True, *, copy=True)

Bases: TransformerMixin, BaseEstimator

fit(X, y=None)

inverse_transform(X)

partial_fit(X, y=None)

transform(X)

class pinard.preprocessing.Normalize(feature_range=(-1, 1), *, copy=True)

Bases: TransformerMixin, BaseEstimator

Normalize spectrum using either a custom range or linalg normalization.

Parameters:
  • feature_range (tuple (min, max), default=(-1, 1)) – Desired range of transformed data. If both min and max equal -1, linalg normalization is applied; otherwise, normalization to the user-defined range is applied.

  • copy (bool, default=True) – Set to False to perform inplace row normalization and avoid a copy (if the input is already a numpy array).

fit(X, y=None)

Fit the Normalize transformer on the training data.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – The training data.

  • y (None) – Ignored variable.

Returns:

self – Returns the instance itself.

Return type:

object

inverse_transform(X)

Transform the normalized data back to the original representation.

Parameters:

X (array-like of shape (n_samples, n_features)) – The normalized data to be transformed back.

Returns:

X – The inverse transformed data.

Return type:

ndarray of shape (n_samples, n_features)

partial_fit(X, y=None)

Perform incremental fit on the training data.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – The training data.

  • y (None) – Ignored variable.

Returns:

self – Returns the instance itself.

Return type:

object

transform(X)

Transform the input data.

Parameters:

X (array-like of shape (n_samples, n_features)) – The input data to be transformed.

Returns:

X – The transformed data.

Return type:

ndarray of shape (n_samples, n_features)

pinard.preprocessing.RobustNormalVariate

alias of RobustScaler

class pinard.preprocessing.SavitzkyGolay(window_length: int = 11, polyorder: int = 3, deriv: int = 0, delta: float = 1.0, *, copy: bool = True)

Bases: TransformerMixin, BaseEstimator

A class for smoothing and differentiating data using the Savitzky-Golay filter.

Parameters:
  • window_length (int, optional (default=11)) – The length of the window used for smoothing.

  • polyorder (int, optional (default=3)) – The order of the polynomial used for fitting the samples within the window.

  • deriv (int, optional (default=0)) – The order of the derivative to compute.

  • delta (float, optional (default=1.0)) – The sampling distance of the data.

  • copy (bool, optional (default=True)) – Whether to copy the input data.

Methods:

fit(X, y=None)

Fits the transformer to the data X.

transform(X, copy=None)

Applies the Savitzky-Golay filter to the data X.

fit(X, y=None)

Verify the X data compliance with Savitzky-Golay filter.

Parameters:
  • X (array-like) – The data to transform.

  • y (None) – Ignored.

Raises:

ValueError – If the input X is a sparse matrix.

Returns:

The fitted object.

Return type:

SavitzkyGolay

transform(X, copy=None)

Apply the Savitzky-Golay filter to the data X.

Parameters:
  • X (array-like) – The data to transform.

  • copy (bool or None, optional) – Whether to copy the input data.

Returns:

The transformed data.

Return type:

numpy.ndarray

class pinard.preprocessing.SimpleScale(copy=True)

Bases: TransformerMixin, BaseEstimator

fit(X, y=None)

inverse_transform(X)

partial_fit(X, y=None)

transform(X)

pinard.preprocessing.StandardNormalVariate

alias of StandardScaler

class pinard.preprocessing.Wavelet(wavelet: str = 'haar', mode: str = 'periodization', *, copy: bool = True)

Bases: TransformerMixin, BaseEstimator

Single level Discrete Wavelet Transform.

Performs a discrete wavelet transform on data, using a wavelet function.

Parameters:
  • wavelet (Wavelet object or name, default='haar') – Wavelet to use: [‘Haar’, ‘Daubechies’, ‘Symlets’, ‘Coiflets’, ‘Biorthogonal’, ‘Reverse biorthogonal’, ‘Discrete Meyer (FIR Approximation)’…]

  • mode (str, optional, default='periodization') – Signal extension mode.

fit(X, y=None)

Verify the X data compliance with wavelet transform.

Parameters:
  • X (array-like, spectra) – The data to transform.

  • y (None) – Ignored.

Raises:

ValueError – If the input X is a sparse matrix.

Returns:

The fitted object.

Return type:

Wavelet

transform(X, copy=None)

Apply wavelet transform to the data X.

Parameters:
  • X (array-like) – The data to transform.

  • copy (bool or None, optional) – Whether to copy the input data.

Returns:

The transformed data.

Return type:

numpy.ndarray
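For readers unfamiliar with a single-level discrete wavelet transform, the following NumPy sketch computes one level of the Haar transform by hand. It only illustrates the transform itself; the helper haar_dwt_single_level is hypothetical, and which coefficients the library returns or how it resamples them is not covered here.

```python
import numpy as np


def haar_dwt_single_level(signal):
    """Illustrative only: one level of the Haar transform on an even-length 1D signal."""
    signal = np.asarray(signal, dtype=float)
    if signal.size % 2:
        raise ValueError("this sketch expects an even number of points")
    even, odd = signal[0::2], signal[1::2]
    approx = (even + odd) / np.sqrt(2.0)   # low-pass: scaled pairwise sums
    detail = (even - odd) / np.sqrt(2.0)   # high-pass: scaled pairwise differences
    return approx, detail


approx, detail = haar_dwt_single_level([1.0, 2.0, 3.0, 4.0])
print(approx, detail)
```

Each output has half the length of the input, and the two outputs together carry the same energy as the original signal.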

pinard.preprocessing.baseline(spectra)

Removes baseline (mean) from each spectrum.

Parameters:

spectra (numpy.ndarray) – NIRS data matrix.

Returns:

Mean-centered NIRS data matrix.

Return type:

numpy.ndarray

pinard.preprocessing.derivate(spectra, order=1, delta=1)

Computes Nth order derivatives with the desired spacing using numpy.gradient.

Parameters:
  • spectra (numpy.ndarray) – NIRS data matrix.

  • order (float, optional) – Order of the derivation, by default 1.

  • delta (int, optional) – Delta of the derivative (in samples), by default 1.

Returns:

spectra – Derived NIR spectra.

Return type:

numpy.ndarray
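Since the description mentions numpy.gradient, a minimal sketch of an Nth-order derivative built on it might look like the following. The helper derivate_sketch is hypothetical, and differentiating along the feature (wavelength) axis is an assumption of this example.

```python
import numpy as np


def derivate_sketch(spectra, order=1, delta=1):
    """Illustrative only: apply numpy.gradient `order` times along the feature axis."""
    out = np.asarray(spectra, dtype=float)
    for _ in range(int(order)):
        out = np.gradient(out, delta, axis=1)
    return out


spectra = np.array([[0.0, 2.0, 4.0, 6.0], [1.0, 4.0, 9.0, 16.0]])
first = derivate_sketch(spectra, order=1)
print(first)
```

numpy.gradient uses central differences in the interior and one-sided differences at the edges, so the output keeps the same shape as the input.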

pinard.preprocessing.detrend(spectra, bp=0)

Perform spectral detrending to remove linear trend from data.

Parameters:
  • spectra (numpy.ndarray) – NIRS data matrix.

  • bp (list, optional) – A sequence of break points. If given, an individual linear fit is performed for each part of data between two break points. Break points are specified as indices into data. Default is 0.

Returns:

Detrended NIR spectra.

Return type:

numpy.ndarray

pinard.preprocessing.gaussian(spectra, order=2, sigma=1)

Applies a 1D Gaussian filter using scipy.ndimage.gaussian_filter1d.

Parameters:
  • spectra (numpy.ndarray) – NIRS data matrix.

  • order (float, optional) – Order of the derivation.

  • sigma (int, optional) – Sigma of the gaussian.

Returns:

Gaussian NIR spectra.

Return type:

numpy.ndarray

pinard.preprocessing.msc(spectra, scaled=True)

Performs multiplicative scatter correction to the mean.

Parameters:
  • spectra (numpy.ndarray) – NIRS data matrix.

  • scaled (bool) – Whether to scale the data. Defaults to True.

Returns:

Scatter-corrected NIR spectra.

Return type:

numpy.ndarray
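The textbook form of multiplicative scatter correction regresses each spectrum on the mean spectrum and then removes the fitted offset and slope. The sketch below follows that textbook form; the helper msc_sketch is hypothetical, it does not model the scaled option, and it may differ from the library's code.

```python
import numpy as np


def msc_sketch(spectra):
    """Illustrative only: regress each spectrum on the mean spectrum and undo the fit."""
    spectra = np.asarray(spectra, dtype=float)
    reference = spectra.mean(axis=0)
    corrected = np.empty_like(spectra)
    for i, row in enumerate(spectra):
        # Fit row ~ slope * reference + intercept.
        slope, intercept = np.polyfit(reference, row, deg=1)
        corrected[i] = (row - intercept) / slope
    return corrected


base = np.array([0.1, 0.4, 0.9, 0.5, 0.2])
spectra = np.vstack([1.0 * base + 0.0, 1.5 * base + 0.2, 0.8 * base - 0.1])
corrected = msc_sketch(spectra)
print(np.round(corrected, 4))
```

In this toy input every row is an affine copy of the same base curve, so all corrected rows collapse onto the mean spectrum.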

pinard.preprocessing.norml(spectra, feature_range=(-1, 1))

Perform spectral normalization with user-defined limits.

Parameters:
  • spectra (numpy.ndarray) – NIRS data matrix.

  • feature_range (tuple (min, max), default=(-1, 1)) – Desired range of transformed data. If both min and max equal -1, linalg normalization is applied; otherwise, normalization to the user-defined bounds is applied.

Returns:

spectra – Normalized NIR spectra.

Return type:

numpy.ndarray

pinard.preprocessing.savgol(spectra: ndarray, window_length: int = 11, polyorder: int = 3, deriv: int = 0, delta: float = 1.0) → ndarray

Perform Savitzky–Golay filtering on the data (also calculates derivatives). This function is a wrapper for scipy.signal.savgol_filter.

Parameters:
  • spectra (numpy.ndarray) – NIRS data matrix.

  • window_length (int) – Size of the filter window in samples (default 11).

  • polyorder (int) – Order of the polynomial estimation (default 3).

  • deriv (int) – Order of the derivation (default 0).

  • delta (float) – Sampling distance of the data.

Returns:

NIRS data smoothed with Savitzky-Golay filtering.

Return type:

numpy.ndarray
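Because this function is described as a wrapper around scipy.signal.savgol_filter, the underlying call can be tried directly. The example below uses SciPy only and assumes that spectra are stored one per row, so the filter runs along axis=1; the test data are made up for illustration.

```python
import numpy as np
from scipy.signal import savgol_filter

x = np.linspace(-1.0, 1.0, 51)
cubic = 2.0 * x**3 - x                      # noise-free cubic "spectrum"
rng = np.random.default_rng(0)
spectra = np.vstack([cubic, cubic + rng.normal(0.0, 0.05, x.size)])

# Smoothing only (deriv=0), then the first derivative with the true sample spacing.
smoothed = savgol_filter(spectra, window_length=11, polyorder=3, axis=1)
slope = savgol_filter(spectra, window_length=11, polyorder=3,
                      deriv=1, delta=x[1] - x[0], axis=1)
print(smoothed.shape, slope.shape)
```

A cubic passes through a polyorder=3 filter unchanged, which makes the first row a convenient sanity check; the noisy second row shows the actual smoothing effect.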

pinard.preprocessing.spl_norml(spectra)

Perform simple spectral normalization.

Parameters:

spectra (numpy.ndarray) – NIRS data matrix.

Returns:

spectra – Normalized NIR spectra.

Return type:

numpy.ndarray

pinard.preprocessing.wavelet_transform(spectra: ndarray, wavelet: str, mode: str = 'periodization') → ndarray

Computes the wavelet transform using PyWavelets.

Parameters:
  • spectra (numpy.ndarray) – NIRS data matrix.

  • wavelet (str) – wavelet family transformation.

  • mode (str) – signal extension mode.

Returns:

Wavelet-transformed and resampled spectra.

Return type:

numpy.ndarray

Model Selection API

pinard.model_selection.kbins_stratified_sampling(data, y, test_size, random_state=None, n_bins=10, strategy='uniform', encode='ordinal')

Perform stratified sampling using KBins discretization.

Parameters:
  • data (array-like of shape (n_samples, n_features)) – The input data.

  • y (array-like of shape (n_samples,)) – The target variable.

  • test_size (float or int) – If float, represents the proportion of the dataset to include in the test split. If int, represents the absolute number of samples to include in the test split.

  • random_state (int, RandomState instance or None, optional (default=None)) – Controls the random seed used to shuffle the data.

  • n_bins (int, optional (default=10)) – The number of bins to use for discretization.

  • strategy ({'uniform', 'quantile', 'kmeans'}, optional (default='uniform')) – The strategy used to define the widths of the bins.

  • encode ({'ordinal', 'onehot', 'onehot-dense'}, optional (default='ordinal')) – The encoding scheme used to encode the transformed result.

Returns:

  • train_index (ndarray) – The indices of the training samples.

  • test_index (ndarray) – The indices of the test samples.
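The idea of stratifying a continuous target through binning can be sketched with NumPy alone. The helper kbins_stratified_split below is hypothetical: it covers only the 'uniform' strategy with ordinal bin labels and a float test_size, and it is not the library's implementation.

```python
import numpy as np


def kbins_stratified_split(y, test_size=0.2, n_bins=10, random_state=None):
    """Illustrative only: uniform-width binning of y, then a per-bin random split."""
    rng = np.random.default_rng(random_state)
    y = np.asarray(y, dtype=float)
    # Uniform strategy: equal-width bin edges between min(y) and max(y).
    edges = np.linspace(y.min(), y.max(), n_bins + 1)
    bins = np.digitize(y, edges[1:-1])      # ordinal bin label in [0, n_bins - 1]
    train, test = [], []
    for label in np.unique(bins):
        members = rng.permutation(np.flatnonzero(bins == label))
        n_test = int(round(test_size * members.size))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return np.sort(np.array(train)), np.sort(np.array(test))


y = np.linspace(0.0, 1.0, 100)
train_index, test_index = kbins_stratified_split(y, test_size=0.2, n_bins=5, random_state=0)
print(len(train_index), len(test_index))
```

Splitting inside each bin keeps the distribution of y similar in the train and test sets, which a plain random split does not guarantee for small datasets.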

pinard.model_selection.kmean_sampling(data, test_size, *, random_state=None, pca_components=None, metric='euclidean')

Perform sampling using K-means clustering.

Parameters:
  • data (array-like of shape (n_samples, n_features)) – The input data.

  • test_size (float or int) – If float, represents the proportion of the dataset to include in the test split. If int, represents the absolute number of samples to include in the test split.

  • random_state (int, RandomState instance or None, optional (default=None)) – Controls the random seed used to initialize the centroids.

  • pca_components (int or None, optional (default=None)) – The number of principal components to use for dimensionality reduction using PCA.

  • metric (str, optional (default='euclidean')) – The distance metric to use. Possible values are: ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’, ‘kulczynski1’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’.

Returns:

  • train_index (ndarray) – The indices of the training samples.

  • test_index (ndarray) – The indices of the test samples.

pinard.model_selection.ks_sampling(data, test_size, *, random_state=None, pca_components=None, metric='euclidean')

Samples data using the Kennard Stone method.

Parameters:
  • test_size (float/int) – Size of the test set.

  • data (DataFrame) – Dataset used to get a train set and a test set.

  • pca_components (int/float, default=None) – Value to perform PCA.

  • metric (str, default="euclidean") – The distance metric to use, by default ‘euclidean’. See scipy.spatial.distance.cdist for more information.

Returns:

(List of int, List of int) Index of selected spectra as train data, index is zero-based. Index of remaining spectra as test data, index is zero-based.

Return type:

tuple

Raises:

ValueError – If train sample size is not at least 2.

Example

>>> index_train, index_test = ks_sampling(data, 0.2, metric="euclidean")
>>> print(index_test[0:4])
[22, 23, 33, 66]

References

Kennard, R. W., & Stone, L. A. (1969). Computer aided design of experiments. Technometrics, 11(1), 137-148. (https://www.jstor.org/stable/1266770)
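For orientation, the Kennard-Stone selection rule can be written in a few lines of NumPy: start from the two most distant samples, then repeatedly add the sample whose nearest selected neighbour is farthest away. The helper kennard_stone_sketch is hypothetical, uses Euclidean distance only, and skips the PCA step; it is a sketch of the published algorithm, not the library's code.

```python
import numpy as np


def kennard_stone_sketch(data, n_train):
    """Illustrative only: return (train_index, test_index) as lists of ints."""
    data = np.asarray(data, dtype=float)
    if n_train < 2:
        raise ValueError("train sample size must be at least 2")
    # Pairwise Euclidean distances between all samples.
    diff = data[:, None, :] - data[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Start from the two samples that are farthest apart.
    first, second = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(first), int(second)]
    remaining = [i for i in range(len(data)) if i not in selected]
    while len(selected) < n_train:
        # For each candidate, distance to its nearest already-selected sample.
        nearest = dist[np.ix_(remaining, selected)].min(axis=1)
        chosen = remaining[int(np.argmax(nearest))]
        selected.append(chosen)
        remaining.remove(chosen)
    return selected, remaining


points = np.array([[0.0], [1.0], [2.0], [5.0], [10.0]])
train_index, test_index = kennard_stone_sketch(points, n_train=3)
print(train_index, test_index)
```

On these five points the two extremes (0 and 10) are picked first, then the point at 5, which is the candidate farthest from both.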

pinard.model_selection.shuffle_sampling(data, test_size, *, random_state=None)

Performs random shuffling of the data and splits it into train and test sets.

Parameters:
  • data (array-like) – The input data samples.

  • test_size (float) – The proportion of the data to be used as the test set.

  • random_state (int, default=None) – Seed value for random number generation.

Returns:

  • train_index (ndarray) – The indices of the samples in the train set.

  • test_index (ndarray) – The indices of the samples in the test set.
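A shuffle-based split of this kind reduces to permuting the sample indices and cutting the permutation in two. The helper shuffle_split_sketch below is a hypothetical NumPy illustration that works on a sample count rather than on the data itself.

```python
import numpy as np


def shuffle_split_sketch(n_samples, test_size=0.2, random_state=None):
    """Illustrative only: permute sample indices and cut off the test portion."""
    rng = np.random.default_rng(random_state)
    order = rng.permutation(n_samples)
    n_test = int(round(test_size * n_samples))
    return order[n_test:], order[:n_test]


train_index, test_index = shuffle_split_sketch(50, test_size=0.2, random_state=42)
print(len(train_index), len(test_index))
```

Passing the same random_state reproduces the same split, which is the role the seed parameter plays in the function documented above.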

pinard.model_selection.spxy_sampling(data, y, test_size, *, random_state=None, pca_components=None, metric='euclidean')

Samples data using the SPXY method.

Parameters:
  • test_size (float/int) – Size of the test set.

  • data (DataFrame) – Features used to get a train set and a test set.

  • y (DataFrame) – Labels used to get a train set and a test set.

  • pca_components (int/float, default=None) – Value to perform PCA.

  • metric (str, default="euclidean") – The distance metric to use, by default ‘euclidean’. See scipy.spatial.distance.cdist for more information.

Returns:

(List of int, List of int) Index of selected spectra as train data, index is zero-based. Index of remaining spectra as test data, index is zero-based.

Return type:

tuple

Raises:

ValueError – If train sample size is not at least 2. If y data is not provided.

Example

>>> index_train, index_test = spxy_sampling(data, y, 0.2, metric="euclidean")
>>> print(index_test[0:4])
[6, 22, 33, 39]

References

Galvao et al. (2005). A method for calibration and validation subset partitioning. Talanta, 67(4), 736-740. (https://www.sciencedirect.com/science/article/pii/S003991400500192X)

Li, Wenze, et al. “HSPXY: A hybrid‐correlation and diversity‐distances based data partition method.” Journal of Chemometrics 33.4 (2019): e3109.
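In the SPXY method described by Galvao et al., the Kennard-Stone selection runs on a joint distance that adds the pairwise distance in X and the pairwise distance in y, each divided by its maximum. The NumPy sketch below computes only that joint distance matrix; the helper spxy_distance is hypothetical and uses Euclidean distance.

```python
import numpy as np


def spxy_distance(data, y):
    """Illustrative only: sum of max-normalised pairwise distances in X and in y."""
    data = np.asarray(data, dtype=float)
    y = np.asarray(y, dtype=float).reshape(len(data), -1)
    dx = np.sqrt(((data[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1))
    dy = np.sqrt(((y[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1))
    # Each term is scaled to [0, 1] so that X and y contribute equally.
    return dx / dx.max() + dy / dy.max()


data = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
y = np.array([1.0, 1.5, 3.0])
joint = spxy_distance(data, y)
print(joint)
```

The resulting matrix can take the place of the plain X distance in a Kennard-Stone style selection, so that samples are spread out in both the spectral and the target space.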

pinard.model_selection.systematic_circular_sampling(data, y, test_size, random_state)

Performs non-random sampling based on the systematic circular sampling method. The starting point and the number of rotations are randomly determined.

Parameters:
  • test_size (int/float) – The number of samples to be selected, expressed as either a count or a proportion.

  • data (DataFrame) – The DataFrame containing the samples.

  • random_state (int, default=None) – Seed value for result reproducibility.

Returns:

  • train_index (ndarray) – The indices of the samples in the train set.

  • test_index (ndarray) – The indices of the samples in the test set.

Example

>>> index_train, index_test = systematic_circular_sampling(data, y, 0.2, 1)
>>> print(sorted(index_test))
[3, 8, ..., 53, 58, ..., 101, 106]

pinard.model_selection.train_test_split_idx(*x, y=None, test_size=None, method='random', random_state=None, metric='euclidean', pca_components=None, n_bins=10, train_size=None)

Split the data into training and test sets based on the specified method.

Parameters:
  • *x (array-like) – Input arrays to be split. Can be one or more arrays.

  • y (array-like, optional) – The target variable.

  • test_size (float or int, optional) – If float, represents the proportion of the dataset to include in the test split. If int, represents the absolute number of samples to include in the test split. Defaults to None.

  • method ({'random', 'stratified', 'k_mean', 'kennard_stone', 'spxy', 'circular', 'SPlit'}, optional) – The method used for splitting the data. Defaults to ‘random’.

  • random_state (int, RandomState instance or None, optional) – Determines random number generation for dataset shuffling. Pass an int for reproducible output across multiple function calls. Defaults to None.

  • metric (str, optional) – The distance metric to use. If a string, the distance function can be one of the following: ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’, ‘kulczynski1’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’. Defaults to ‘euclidean’.

  • pca_components (int or None, optional) – The number of components to keep in PCA transformation. If None, no PCA transformation is applied. Defaults to None.

  • n_bins (int, optional) – The number of bins for stratified sampling. Defaults to 10.

  • train_size (float or int, optional) – If float, represents the proportion of the dataset to include in the train split. If int, represents the absolute number of samples to include in the train split. Defaults to None.

Returns:

  • train_index (ndarray) – The indices of the samples in the training set.

  • test_index (ndarray) – The indices of the samples in the test set.

Raises:
  • ValueError – If method is not one of the supported methods.

  • ModuleNotFoundError – If the ‘tweening’ package is not found when using ‘SPlit’ method.

Notes

The ‘SPlit’ method requires the ‘tweening’ package to be installed. See https://github.com/GBeurier/pinard for more information.

Scikit-learn API

The pinard.sklearn module includes various tools to extend or adapt sklearn features.

class pinard.sklearn.FeatureAugmentation(transformer_list, *, n_jobs=None, transformer_weights=None, verbose=False)

Bases: FeatureUnion

Stacks results of multiple transformer objects in a new axis.

This estimator extends sklearn.pipeline.FeatureUnion and applies a list of transformer objects in parallel to the input data. Then it stacks the results. This is useful to combine several feature extraction mechanisms into a single transformer.

fit_transform(X, y=None, **fit_params)

Fit all transformers, transform the data and concatenate results.

Parameters:
  • X (iterable or array-like, depending on transformers) – Input data to be transformed.

  • y (array-like of shape (n_samples, n_outputs), default=None) – Targets for supervised learning.

  • **fit_params (dict, default=None) – Parameters to pass to the fit method of the estimator.

Returns:

X_t – The hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers.

Return type:

array-like or sparse matrix of shape (n_samples, sum_n_components)

steps: List[Any]

transform(X, **params)

Transform X separately by each transformer, concatenate results.

Parameters:

X (iterable or array-like, depending on transformers) – Input data to be transformed.

Returns:

X_t – The hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers.

Return type:

array-like or sparse matrix of shape (n_samples, sum_n_components)
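The difference between concatenating transformer outputs along the feature axis and stacking them on a new axis is easiest to see on array shapes. The NumPy snippet below uses two stand-in arrays in place of real transformer outputs; placing the new axis at position 1 is an assumption of this example, not a statement about the library's output layout.

```python
import numpy as np

X = np.arange(12, dtype=float).reshape(3, 4)      # 3 samples, 4 features
outputs = [X, X - X.mean(axis=1, keepdims=True)]  # stand-ins for two transformer outputs

concatenated = np.hstack(outputs)                 # FeatureUnion-style: wider feature axis
stacked = np.stack(outputs, axis=1)               # assumed layout: one slice per transformer

print(concatenated.shape, stacked.shape)
```

Concatenation yields shape (3, 8), while stacking yields (3, 2, 4) and keeps each transformer's output addressable as its own slice.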

class pinard.sklearn.SampleAugmentation(transformer_list, n_jobs=None, transformer_weights=None, verbose=False)

Bases: FeatureUnion

Applies multiple feature extraction mechanisms to the same input data and concatenates the results.

Inherits from the FeatureUnion class of the sklearn.pipeline module.

Parameters:
  • transformer_list (list of (str, transformer) or (int, str, transformer) tuples) – List of transformer tuples to be applied to the data. Each tuple contains either two or three elements. If the tuple has two elements, it is a (str, transformer) tuple where the first element is the name of the transformer and the second element is the transformer object. If the tuple has three elements, it is an (int, str, transformer) tuple where the first element is the count of augmentations for that transformer, the second element is the name of the transformer and the third element is the transformer object.

  • n_jobs (int or None, optional (default=None)) – The number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

  • transformer_weights (dict or None, optional (default=None)) – Multiplicative weights for features per transformer. Keys are transformer names, values are weights.

  • verbose (bool, optional (default=False)) – If True, the time elapsed while fitting each transformer will be printed as it is completed.

transformer_list

List of (name, trans) tuples specifying the transformer objects to be applied to the data.

Type:

list of (str, transformer) tuples

steps: List[Any]

transform(X, y=None, **transform_params)

Transform X separately by each transformer, concatenate results.

Parameters:

X (iterable or array-like, depending on transformers) – Input data to be transformed.

Returns:

X_t – The hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers.

Return type:

array-like or sparse matrix of shape (n_samples, sum_n_components)