Preprocessing

MDLPDiscretizer

class skmine.preprocessing.MDLPDiscretizer(random_state=None, n_jobs=1)

Implementation of “Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning”.

Given class labels y, MDLPDiscretizer discretizes the continuous variables in X by choosing cut points that minimize the class entropy within each interval.
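To make the entropy-minimization idea concrete, here is a minimal pure-Python sketch of one binary split and of the acceptance test from the Fayyad and Irani paper. It is not skmine's implementation: the helper names `class_entropy`, `best_cut` and `mdlp_accepts` are hypothetical, and taking candidate cuts at midpoints between consecutive distinct values is an assumption made for illustration.

```python
import math
from collections import Counter


def class_entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cut(values, labels):
    """Return (cut, weighted_entropy) for the best binary split of one feature.

    Hypothetical helper: candidate cuts are midpoints between consecutive
    distinct sorted values. Returns (None, entropy) if no split improves it.
    """
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    xs = [v for v, _ in pairs]
    ys = [l for _, l in pairs]
    n = len(xs)
    best = (None, class_entropy(ys))
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue  # identical values cannot be separated by a cut
        weighted = (i / n) * class_entropy(ys[:i]) \
            + ((n - i) / n) * class_entropy(ys[i:])
        if weighted < best[1]:
            best = ((xs[i - 1] + xs[i]) / 2, weighted)
    return best


def mdlp_accepts(left, right):
    """MDLP stopping criterion (Fayyad & Irani, 1993) for one binary split.

    The split is kept only if the information gain exceeds
    (log2(N - 1) + delta) / N.
    """
    whole = list(left) + list(right)
    n = len(whole)
    ent, ent_l, ent_r = (class_entropy(s) for s in (whole, left, right))
    gain = ent - (len(left) / n) * ent_l - (len(right) / n) * ent_r
    k, k_l, k_r = len(set(whole)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (k * ent - k_l * ent_l - k_r * ent_r)
    return gain > (math.log2(n - 1) + delta) / n
```

Applied recursively to each resulting interval until `mdlp_accepts` rejects the best split, this procedure would yield several cut points per feature.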

Parameters

random_state (int, RandomState instance, default=None) – Random state used to shuffle the data. It can affect the outcome, leading to slightly different cut points if a variable contains samples with the same value but different labels.

Variables

cut_points_ (dict) – A mapping between columns and their respective cut points. If fitted on a pandas DataFrame, keys will be the DataFrame column names.

References

Usama M. Fayyad, Keki B. Irani “Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning”, 1993

Examples

>>> from skmine.preprocessing import MDLPDiscretizer
>>> from sklearn.datasets import load_iris  
>>> iris = load_iris()                      
>>> X, y = iris.data, iris.target           
>>> disc = MDLPDiscretizer()                
>>> disc.fit(X, y)                          
>>> disc.cut_points_                        
{0: array([5.5, 6.2]), 1: array([2.9, 3.3]), 2: array([2.45, 4.9 ]), 3: array([0.8, 1.7])}

fit(X, y)

Fit the MDLP discretizer on an input matrix X, given a label vector y.

Parameters
  • X (np.ndarray or pd.DataFrame of shape (n_samples, n_features)) – The input matrix containing features. A set of cut points will be assigned to each feature.

  • y (np.ndarray or pd.Series of shape (n_samples,)) – The label vector used to discretize X.

discover()

User-friendly view of the cut points.

transform(X, y=None)

Discretize the input matrix X.

This applies the cut points to their respective columns.
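As an illustration of applying cut points column by column, the following sketch maps each value to the index of the interval it falls into. It is only an assumption about what such a step can look like: the helper `apply_cut_points` is hypothetical, and the integer bin encoding and the convention that a value equal to a cut point goes to the upper interval are choices made here, not documented behaviour of `transform`.

```python
from bisect import bisect_right


def apply_cut_points(rows, cut_points):
    """Replace each value by the index of its interval.

    rows: list of samples, each a list of feature values.
    cut_points: dict mapping a column index to a sorted list of cut points.
    Columns without an entry in cut_points are left unchanged.
    """
    out = []
    for row in rows:
        new_row = list(row)
        for col, cuts in cut_points.items():
            # bisect_right counts how many cut points are <= the value,
            # which is the index of the interval containing it
            new_row[col] = bisect_right(cuts, row[col])
        out.append(new_row)
    return out


# Cut points for the first two iris features, copied from the example above
cuts = {0: [5.5, 6.2], 1: [2.9, 3.3]}
binned = apply_cut_points([[5.1, 3.5], [6.0, 3.0], [7.0, 2.0]], cuts)
```

With two cut points per column, every value lands in one of three bins, indexed 0 to 2.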

fit_transform(X, y=None)

Fit on X and y, then transform X.