`transparentai.datasets`

Variable submodule

transparentai.datasets.variable.variable.describe_number(arr)

Descriptive statistics about a number array.

Returned statistics:

Count of valid values
Count of missing values
Mean
Mode
Min
Quantile 25%
Median
Quantile 75%
Max

Parameters:	arr (array like) – Array of values to compute descriptive statistics from
Raises:
TypeError – arr is not array-like
TypeError – arr is not a number array
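The statistics listed above can be reproduced with the Python standard library. The sketch below is illustrative only and is not transparentai's implementation: the function name, the dictionary keys, the choice to treat None and NaN as missing, and the quantile interpolation method are all assumptions.

```python
import math
import statistics


def describe_number_sketch(values):
    """Illustrative only: not transparentai's implementation."""
    # Treat None and NaN as missing (an assumption of this sketch).
    valid = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    # Quartiles with linear interpolation over the observed range.
    q1, median, q3 = statistics.quantiles(valid, n=4, method="inclusive")
    return {
        "valid values": len(valid),
        "missing values": len(values) - len(valid),
        "mean": statistics.mean(valid),
        "mode": statistics.mode(valid),
        "min": min(valid),
        "quantile 25%": q1,
        "median": median,
        "quantile 75%": q3,
        "max": max(valid),
    }


stats = describe_number_sketch([1, 2, 2, 3, None, 4, float("nan")])
```

Different quantile conventions give slightly different quartiles on small arrays, so values from the library may not match this sketch exactly.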

transparentai.datasets.variable.variable.describe_datetime(arr, format='%Y-%m-%d')

Descriptive statistics about a datetime array.

Returned statistics:

Count of valid values
Count of missing values
Count of unique values
Most common value
Min
Mean
Max

Parameters:
arr (array like) – Array of values to compute descriptive statistics from
format (str) – String format of the datetime values
Raises:
TypeError – arr is not array-like
TypeError – arr is not a datetime array
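To show how the format parameter and a mean datetime fit together, here is a minimal standard-library sketch. It is not the library's code: the function name, the dictionary keys, and the use of None for missing values are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def describe_datetime_sketch(values, format="%Y-%m-%d"):
    """Illustrative only: not transparentai's implementation."""
    # None marks a missing value (an assumption of this sketch).
    parsed = [datetime.strptime(v, format) for v in values if v is not None]
    earliest = min(parsed)
    # Average the offsets from the earliest value to get a mean datetime.
    mean_offset = sum((d - earliest).total_seconds() for d in parsed) / len(parsed)
    return {
        "valid values": len(parsed),
        "missing values": len(values) - len(parsed),
        "unique values": len(set(parsed)),
        "most common": Counter(parsed).most_common(1)[0][0],
        "min": earliest,
        "mean": earliest + timedelta(seconds=mean_offset),
        "max": max(parsed),
    }


stats = describe_datetime_sketch(
    ["2020-01-01", "2020-01-03", None, "2020-01-03", "2020-01-05"]
)
```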

transparentai.datasets.variable.variable.describe_object(arr)

Descriptive statistics about an object array.

Returned statistics:

Count of valid values
Count of missing values
Count of unique values
Most common value

Parameters:	arr (array like) – Array of values to compute descriptive statistics from
Raises:
TypeError – arr is not array-like
TypeError – arr is not an object array

transparentai.datasets.variable.variable.describe(arr)

Descriptive statistics about an array. Depending on the detected dtype (number, date, object) it returns specific stats.

Common statistics for all dtypes (using describe_common):

Count of valid values
Count of missing values

Number statistics (using describe_number):

Mean
Mode
Min
Quantile 25%
Median
Quantile 75%
Max

Datetime statistics (using describe_datetime):

Count of unique values
Most common value
Min
Mean
Max

Object statistics (using describe_object):

Count of unique values
Most common value

Parameters:	arr (array like) – Array of values to compute descriptive statistics from
Returns:	Dictionary with descriptive statistics
Return type:	dict
Raises:	TypeError – arr is not array-like
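The dispatch on detected dtype can be pictured with a small sketch. How transparentai actually detects the dtype is not stated here, so the helper below is hypothetical: detect_kind, its return labels, and the rule of ignoring None values are assumptions.

```python
from datetime import date, datetime
from numbers import Number


def detect_kind(values):
    """Hypothetical helper: classify an array as number, date or object."""
    # Ignore missing values when inspecting types (an assumption).
    valid = [v for v in values if v is not None]
    if valid and all(
        isinstance(v, Number) and not isinstance(v, bool) for v in valid
    ):
        return "number"
    if valid and all(isinstance(v, (date, datetime)) for v in valid):
        return "date"
    # Anything else, including mixed content, falls back to object.
    return "object"


kinds = [
    detect_kind([1, 2.5, None]),
    detect_kind([date(2020, 1, 1), date(2020, 1, 2)]),
    detect_kind(["a", 1]),
]
```

A describe-style function would then add the statistics for the detected kind on top of the common valid and missing counts.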

transparentai.datasets.variable.correlation.compute_correlation(df, nrows=None, max_cat_val=100)

Computes a separate correlation matrix for each of three cases and merges them:

numerical to numerical (using Pearson coeff)
categorical to categorical (using Cramers V & Chi square)
numerical to categorical (discrete) (using Point Biserial)

/!\ ==== Caution ==== /!\

This matrix has a flaw: cramers_v_corr is scaled from 0 to 1, whereas the other coefficients range from -1 to 1. Keep this in mind when comparing cells of the merged matrix.

Pearson coeff Wikipedia definition :

In statistics, the Pearson correlation coefficient, also referred to as Pearson’s r, the Pearson product-moment correlation coefficient (PPMCC) or the bivariate correlation, is a statistic that measures linear correlation between two variables X and Y. It has a value between +1 and −1, where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation (that the value lies between -1 and 1 is a consequence of the Cauchy–Schwarz inequality). It is widely used in the sciences.
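As a small illustration of the range described above, the following standard-library sketch computes Pearson's r directly from its definition. The function name pearson_r is an assumption for this example; it is not claimed to be how the library computes the coefficient.

```python
import math


def pearson_r(x, y):
    """Pearson's r from its definition (illustrative sketch)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of co-deviations and the two deviation norms.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])    # perfectly increasing line
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])  # perfectly decreasing line
```

The two calls give values at the positive and negative ends of the range, which is the sign information that Cramér's V cannot carry.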

Cramers V Wikipedia definition :

In statistics, Cramér’s V (sometimes referred to as Cramér’s phi and denoted as φc) is a measure of association between two nominal variables, giving a value between 0 and +1 (inclusive). It is based on Pearson’s chi-squared statistic and was published by Harald Cramér in 1946.

Point Biserial Wikipedia definition :

The point biserial correlation coefficient (rpb) is a correlation coefficient used when one variable (e.g. Y) is dichotomous; Y can either be “naturally” dichotomous, like whether a coin lands heads or tails, or an artificially dichotomized variable. In most situations it is not advisable to dichotomize variables artificially. When a new variable is artificially dichotomized the new dichotomous variable may be conceptualized as having an underlying continuity. If this is the case, a biserial correlation would be the more appropriate calculation.

Parameters:
df (pd.DataFrame) – pandas DataFrame with the values to compute correlations from
nrows (None, int or float (default None)) – If not None, reduce the data to a sample: nrows rows if int, or len(df) * nrows rows if float
max_cat_val (int or None (default 100)) – Maximum number of unique values in a categorical feature; a feature with more distinct values than this is ignored
Returns:	Correlation matrix computed with Pearson coeff for numerical features to numerical features, Cramers V for categorical features to categorical features and Point Biserial for categorical features to numerical features
Return type:	pd.DataFrame
Raises:	TypeError: – Must provide a pandas DataFrame representing the data
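The nrows rule can be read as the following sketch. It only restates the documented behaviour: resolve_sample_size is a hypothetical helper, and both the truncation of the float case to an integer and the cap of an int nrows at the number of rows are assumptions.

```python
def resolve_sample_size(n_total, nrows=None):
    """Hypothetical helper: number of rows kept for a given nrows value."""
    if nrows is None:
        # No sampling: use every row.
        return n_total
    if isinstance(nrows, float):
        # A float is read as a fraction of the total row count.
        return int(n_total * nrows)
    # An int is read as an absolute row count (capped here by assumption).
    return min(nrows, n_total)


sizes = [
    resolve_sample_size(1000),
    resolve_sample_size(1000, 200),
    resolve_sample_size(1000, 0.25),
]
```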

transparentai.datasets.variable.correlation.compute_cramers_v_corr(df)

Computes Cramers V correlation for a dataframe.

Cramers V Wikipedia definition :

In statistics, Cramér’s V (sometimes referred to as Cramér’s phi and denoted as φc) is a measure of association between two nominal variables, giving a value between 0 and +1 (inclusive). It is based on Pearson’s chi-squared statistic and was published by Harald Cramér in 1946.

Parameters:	df (pd.DataFrame) – pandas Dataframe with values to compute Cramers V correlation
Returns:	Correlation matrix computed for Cramers V coeff
Return type:	pd.DataFrame
Raises:	TypeError: – Must provide a pandas DataFrame representing the data

transparentai.datasets.variable.correlation.compute_pointbiserialr_corr(df, cat_feats=None, num_feats=None)

Computes Point Biserial correlation for a dataframe.

Point Biserial Wikipedia definition :

The point biserial correlation coefficient (rpb) is a correlation coefficient used when one variable (e.g. Y) is dichotomous; Y can either be “naturally” dichotomous, like whether a coin lands heads or tails, or an artificially dichotomized variable. In most situations it is not advisable to dichotomize variables artificially. When a new variable is artificially dichotomized the new dichotomous variable may be conceptualized as having an underlying continuity. If this is the case, a biserial correlation would be the more appropriate calculation.

Parameters:	df (pd.DataFrame) – pandas Dataframe with values to compute Point Biserial correlation
Returns:	Correlation matrix computed for Point Biserial coeff
Return type:	pd.DataFrame
Raises:
TypeError – Must provide a pandas DataFrame representing the data
ValueError – cat_feats and num_feats must both be set or both be None
TypeError – cat_feats must be a list
TypeError – num_feats must be a list
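For a single pair of variables, the point biserial coefficient can be sketched with the standard library as below. This is the textbook formula for one 0/1 variable and one numerical variable, not the library's matrix computation; the function name and the 0/1 encoding are assumptions.

```python
import math


def point_biserial(binary, values):
    """Textbook point biserial r for a 0/1 variable (illustrative sketch)."""
    group1 = [v for b, v in zip(binary, values) if b == 1]
    group0 = [v for b, v in zip(binary, values) if b == 0]
    n1, n0 = len(group1), len(group0)
    n = n1 + n0
    mean_all = sum(values) / n
    # Population standard deviation of the numerical variable.
    s_n = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    mean1 = sum(group1) / n1
    mean0 = sum(group0) / n0
    return (mean1 - mean0) / s_n * math.sqrt(n1 * n0 / n ** 2)


r_pb = point_biserial([0, 0, 1, 1], [1, 2, 3, 4])
r_pb_flipped = point_biserial([1, 1, 0, 0], [1, 2, 3, 4])
```

Swapping the two labels flips the sign, which shows that this coefficient ranges from -1 to 1.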

transparentai.datasets.variable.correlation.cramers_v(x, y)

Returns the Cramér's V value of two categorical variables, computed from the chi-square statistic. This correlation metric lies between 0 and 1.

Source code found in this article: https://towardsdatascience.com/the-search-for-categorical-correlation-a1cf7f1888c9

Parameters:
x (array like) – first categorical variable
y (array like) – second categorical variable
Returns:	Cramer V value
Return type:	float
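A minimal standard-library sketch of Cramér's V follows. It uses the plain, uncorrected formula V = sqrt(chi2 / (n * min(r - 1, c - 1))); the linked article and the library may apply a bias correction, so results can differ. The function name is an assumption, and the sketch expects each variable to have at least two categories.

```python
import math
from collections import Counter


def cramers_v_sketch(x, y):
    """Uncorrected Cramér's V (illustrative; not the library's code)."""
    n = len(x)
    observed = Counter(zip(x, y))  # contingency table cell counts
    row_totals = Counter(x)
    col_totals = Counter(y)
    chi2 = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            chi2 += (observed[(r, c)] - expected) ** 2 / expected
    # Requires at least two categories per variable, otherwise k is 0.
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))


v_perfect = cramers_v_sketch(["a", "a", "b", "b"], ["u", "u", "v", "v"])
v_none = cramers_v_sketch(["a", "a", "b", "b"], ["u", "v", "u", "v"])
```

Because the chi-square statistic is never negative, the result cannot drop below 0, unlike Pearson or point biserial coefficients.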

transparentai.datasets.variable.correlation.merge_corr_df(df_list)

Merges the correlation matrices from the compute_correlation() function into one. Requires three DataFrames: pearson_corr, cramers_v_corr and pbs_corr.

This matrix has a flaw: cramers_v_corr is scaled from 0 to 1, whereas the other coefficients range from -1 to 1. Keep this in mind when comparing cells of the merged matrix.

Parameters:	df_list (list) – List of correlation matrices
Returns:	Merged dataframe of correlation matrices
Return type:	pd.DataFrame

Preformatted datasets

transparentai.datasets.datasets.load_adult()

Load the Adult dataset. Source: https://archive.ics.uci.edu/ml/datasets/Adult

transparentai.datasets.datasets.load_boston()

Load the Boston dataset. Source: https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

transparentai.datasets.datasets.load_iris()

Load the Iris dataset. Source: http://archive.ics.uci.edu/ml/datasets/Iris/
