transparentai.datasets

Variable submodule

transparentai.datasets.variable.variable.describe_number(arr)

Descriptive statistics about a number array.

Returned statistics:

  • Count of valid values
  • Count of missing values
  • Mean
  • Mode
  • Min
  • Quantile 25%
  • Median
  • Quantile 75%
  • Max
Parameters:

arr (array like) – Array of values to compute descriptive statistics from

Raises:
  • TypeError: – arr is not an array like
  • TypeError: – arr is not a number array
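
As an illustration of the statistics listed above, here is a minimal sketch built on Python's standard `statistics` module. It is not the library's implementation: the helper name `describe_number_sketch`, the dictionary keys, and the choice to treat `None` and NaN as missing are all assumptions made for the example.

```python
import math
import statistics


def describe_number_sketch(values):
    """Compute the statistics listed above for a sequence of numbers.

    Hypothetical helper: None and NaN are treated as missing (an assumption).
    """
    valid = [v for v in values
             if v is not None and not (isinstance(v, float) and math.isnan(v))]
    # Inclusive quartiles interpolate linearly between the sorted data points.
    q25, q50, q75 = statistics.quantiles(valid, n=4, method="inclusive")
    return {
        "valid values": len(valid),
        "missing values": len(values) - len(valid),
        "mean": statistics.mean(valid),
        "mode": statistics.mode(valid),
        "min": min(valid),
        "quantile 25%": q25,
        "median": q50,
        "quantile 75%": q75,
        "max": max(valid),
    }


stats = describe_number_sketch([1, 2, 2, None, 5, float("nan"), 4])
```

Here two of the seven entries are counted as missing, and the remaining five values drive every other statistic.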
transparentai.datasets.variable.variable.describe_datetime(arr, format='%Y-%m-%d')

Descriptive statistics about a datetime array.

Returned statistics:

  • Count of valid values
  • Count of missing values
  • Count of unique values
  • Most common value
  • Min
  • Mean
  • Max
Parameters:
  • arr (array like) – Array of values to compute descriptive statistics from
  • format (str) – String format for datetime value
Raises:
  • TypeError: – arr is not an array like
  • TypeError: – arr is not a datetime array
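
The role of the format parameter and the meaning of a "mean" datetime can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `describe_datetime_sketch` and its dictionary keys are hypothetical, `None` is assumed to mark a missing value, and the mean is taken as the earliest date plus the average offset from it.

```python
from collections import Counter
from datetime import datetime, timedelta


def describe_datetime_sketch(values, format="%Y-%m-%d"):
    """Parse strings with `format`, then summarize them (hypothetical helper)."""
    # Assumption: None marks a missing value.
    valid = [datetime.strptime(v, format) for v in values if v is not None]
    first = min(valid)
    # Average the offsets from the earliest date to get the mean datetime.
    offsets = [d - first for d in valid]
    mean = first + sum(offsets, timedelta()) / len(valid)
    return {
        "valid values": len(valid),
        "missing values": len(values) - len(valid),
        "unique values": len(set(valid)),
        "most common": Counter(valid).most_common(1)[0][0],
        "min": first,
        "mean": mean,
        "max": max(valid),
    }


dates = ["2020-01-01", "2020-01-03", None, "2020-01-03", "2020-01-05"]
summary = describe_datetime_sketch(dates)
```

With these inputs the offsets from the earliest date are 0, 2, 2 and 4 days, so the mean falls on 2020-01-03.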
transparentai.datasets.variable.variable.describe_object(arr)

Descriptive statistics about an object array.

Returned statistics:

  • Count of valid values
  • Count of missing values
  • Count of unique values
  • Most common value
Parameters:

arr (array like) – Array of values to compute descriptive statistics from

Raises:
  • TypeError: – arr is not an array like
  • TypeError: – arr is not an object array
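
For object (for example string) values, the four statistics above reduce to simple counting. A minimal sketch, assuming `None` marks a missing value; the helper name and dictionary keys are hypothetical and not part of the library:

```python
from collections import Counter


def describe_object_sketch(values):
    """Count valid, missing, unique and most common values (hypothetical helper)."""
    valid = [v for v in values if v is not None]  # assumption: None is missing
    counts = Counter(valid)
    return {
        "valid values": len(valid),
        "missing values": len(values) - len(valid),
        "unique values": len(counts),
        "most common": counts.most_common(1)[0][0],
    }


result = describe_object_sketch(["red", "blue", None, "red", "green"])
```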
transparentai.datasets.variable.variable.describe(arr)

Descriptive statistics about an array. Depending on the detected dtype (number, datetime, object), it returns dtype-specific statistics.

Common statistics for all dtypes (using describe_common):

  • Count of valid values
  • Count of missing values

Number statistics (using describe_number):

  • Mean
  • Mode
  • Min
  • Quantile 25%
  • Median
  • Quantile 75%
  • Max

Datetime statistics (using describe_datetime):

  • Count of unique values
  • Most common value
  • Min
  • Mean
  • Max

Object statistics (using describe_object):

  • Count of unique values
  • Most common value
Parameters: arr (array like) – Array of values to compute descriptive statistics from
Returns: Dictionary with descriptive statistics
Return type: dict
Raises: TypeError: – arr is not an array like
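
The dispatch on detected dtype described above could look roughly like the sketch below. How the library actually detects the dtype is not stated here, so the rules used (all valid values numeric, all valid values dates, otherwise object) and the name `detect_kind` are assumptions for illustration only.

```python
from datetime import date, datetime
from numbers import Number


def detect_kind(values):
    """Classify a sequence as 'number', 'datetime' or 'object' (hypothetical)."""
    valid = [v for v in values if v is not None]  # assumption: None is missing
    # bool is a subclass of int, so exclude it from the numeric case.
    if valid and all(isinstance(v, Number) and not isinstance(v, bool)
                     for v in valid):
        return "number"
    if valid and all(isinstance(v, (date, datetime)) for v in valid):
        return "datetime"
    return "object"


kinds = [
    detect_kind([1, 2.5, None]),
    detect_kind([datetime(2020, 1, 1), None]),
    detect_kind(["a", 1]),
]
```

A caller would then add the number, datetime or object statistics to the common counts according to the detected kind.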
transparentai.datasets.variable.correlation.compute_correlation(df, nrows=None, max_cat_val=100)

Computes a correlation matrix for each of three cases and merges them:

  • numerical to numerical (using the Pearson coefficient)
  • categorical to categorical (using Cramér's V, based on the chi-squared statistic)
  • numerical to categorical (discrete) (using Point Biserial)
/!\ ==== Caution ==== /!\

This matrix has a flaw: cramers_v_corr is scaled from 0 to 1, whereas the other coefficients range from -1 to 1. Keep this in mind when comparing values across the merged matrix.

Pearson coefficient (Wikipedia definition):

In statistics, the Pearson correlation coefficient, also referred to as Pearson’s r, the Pearson product-moment correlation coefficient (PPMCC) or the bivariate correlation, is a statistic that measures linear correlation between two variables X and Y. It has a value between +1 and −1, where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation (that the value lies between -1 and 1 is a consequence of the Cauchy–Schwarz inequality). It is widely used in the sciences.
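
To make the definition concrete, here is a minimal pure-Python sketch of Pearson's r. The function name `pearson_r` is hypothetical; the library may well rely on pandas or SciPy instead.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])    # perfectly increasing
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])  # perfectly decreasing
```

The two calls illustrate the extremes of the range: a perfectly increasing linear relation and a perfectly decreasing one.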

Cramér's V (Wikipedia definition):

In statistics, Cramér’s V (sometimes referred to as Cramér’s phi and denoted as φc) is a measure of association between two nominal variables, giving a value between 0 and +1 (inclusive). It is based on Pearson’s chi-squared statistic and was published by Harald Cramér in 1946.

Point Biserial (Wikipedia definition):

The point biserial correlation coefficient (r_pb) is a correlation coefficient used when one variable (e.g. Y) is dichotomous; Y can either be “naturally” dichotomous, like whether a coin lands heads or tails, or an artificially dichotomized variable. In most situations it is not advisable to dichotomize variables artificially. When a new variable is artificially dichotomized the new dichotomous variable may be conceptualized as having an underlying continuity. If this is the case, a biserial correlation would be the more appropriate calculation.

Parameters:
  • df (pd.DataFrame) – pandas DataFrame containing the values to correlate
  • nrows (None or int or float (default None)) – If not None, the data is reduced to a sample: nrows rows if nrows is an int, or len(df) * nrows rows if nrows is a float
  • max_cat_val (int or None (default 100)) – Maximum number of unique values allowed in a categorical feature; a feature with more distinct values is ignored
Returns:

Correlation matrix computed with Pearson coeff for numerical features to numerical features, Cramers V for categorical features to categorical features and Point Biserial for categorical features to numerical features

Return type:

pd.DataFrame

Raises:

TypeError: – Must provide a pandas DataFrame representing the data
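
The nrows and max_cat_val parameters described above can be read as in the following sketch. Both helpers are hypothetical and only mirror the documented behaviour. Truncating the float case with `int()` and capping an int nrows at the number of available rows are assumptions, since the rounding and capping rules are not stated.

```python
def resolve_sample_size(n_total, nrows=None):
    """Number of rows to keep for a given nrows setting (hypothetical helper)."""
    if nrows is None:
        return n_total
    if isinstance(nrows, float):
        # Assumption: the fractional result is truncated to an integer.
        return int(n_total * nrows)
    # Assumption: an int nrows larger than the data keeps every row.
    return min(nrows, n_total)


def keep_categorical(column, max_cat_val=100):
    """True if the column has at most max_cat_val distinct values."""
    return max_cat_val is None or len(set(column)) <= max_cat_val


sizes = [
    resolve_sample_size(1000),
    resolve_sample_size(1000, 200),
    resolve_sample_size(1000, 0.25),
]
```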

transparentai.datasets.variable.correlation.compute_cramers_v_corr(df)

Computes Cramers V correlation for a dataframe.

Cramér's V (Wikipedia definition):

In statistics, Cramér’s V (sometimes referred to as Cramér’s phi and denoted as φc) is a measure of association between two nominal variables, giving a value between 0 and +1 (inclusive). It is based on Pearson’s chi-squared statistic and was published by Harald Cramér in 1946.

Parameters: df (pd.DataFrame) – pandas DataFrame with values to compute Cramér's V correlation
Returns: Correlation matrix computed with the Cramér's V coefficient
Return type: pd.DataFrame
Raises: TypeError: – Must provide a pandas DataFrame representing the data
transparentai.datasets.variable.correlation.compute_pointbiserialr_corr(df, cat_feats=None, num_feats=None)

Computes Point Biserial correlation for a dataframe.

Point Biserial (Wikipedia definition):

The point biserial correlation coefficient (r_pb) is a correlation coefficient used when one variable (e.g. Y) is dichotomous; Y can either be “naturally” dichotomous, like whether a coin lands heads or tails, or an artificially dichotomized variable. In most situations it is not advisable to dichotomize variables artificially. When a new variable is artificially dichotomized the new dichotomous variable may be conceptualized as having an underlying continuity. If this is the case, a biserial correlation would be the more appropriate calculation.

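Parameters:
  • df (pd.DataFrame) – pandas DataFrame with values to compute Point Biserial correlation
  • cat_feats (list or None (default None)) – Categorical features to use
  • num_feats (list or None (default None)) – Numerical features to use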

Returns:

Correlation matrix computed for Point Biserial coeff

Return type:

pd.DataFrame

Raises:
  • TypeError: – Must provide a pandas DataFrame representing the data
  • ValueError: – cat_feats and num_feats must both be set or both be None
  • TypeError: – cat_feats must be a list
  • TypeError: – num_feats must be a list
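
For a single pair of variables, the point biserial coefficient can be sketched with the textbook formula r_pb = (M1 - M0) / s_n * sqrt(n1 * n0 / n^2), where s_n is the population standard deviation of all values. This example assumes the categorical variable is already coded as 0/1; the function name `point_biserial` is hypothetical, and how the library handles categories with more than two levels is not covered here.

```python
import math
import statistics


def point_biserial(groups, values):
    """Point biserial r between a 0/1 variable and a numeric variable."""
    ones = [v for g, v in zip(groups, values) if g == 1]
    zeros = [v for g, v in zip(groups, values) if g == 0]
    n = len(values)
    # Population standard deviation of all values (divides by n).
    s_n = statistics.pstdev(values)
    mean_diff = statistics.mean(ones) - statistics.mean(zeros)
    return mean_diff / s_n * math.sqrt(len(ones) * len(zeros) / n ** 2)


r_pb = point_biserial([0, 0, 1, 1], [1, 2, 3, 4])
```

The result equals the Pearson coefficient computed on the same 0/1 coding, which is why this coefficient also ranges from -1 to 1.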
transparentai.datasets.variable.correlation.cramers_v(x, y)

Returns the Cramér's V value of two categorical variables, computed from the chi-squared statistic. This association measure ranges from 0 to 1.

Source code found in this article: https://towardsdatascience.com/the-search-for-categorical-correlation-a1cf7f1888c9

Parameters:
  • x (array like) – first categorical variable
  • y (array like) – second categorical variable
Returns:

Cramér's V value

Return type:

float
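
As a rough illustration of the computation, the sketch below implements the plain (uncorrected) formula V = sqrt(chi2 / (n * min(r - 1, c - 1))) in pure Python. It is not the library's code: the name `cramers_v_sketch` is hypothetical, and the implementation in the referenced article may apply a bias correction that is omitted here.

```python
import math
from collections import Counter


def cramers_v_sketch(x, y):
    """Uncorrected Cramér's V of two equal-length categorical sequences."""
    n = len(x)
    observed = Counter(zip(x, y))  # contingency table as (row, col) -> count
    row_totals = Counter(x)
    col_totals = Counter(y)
    chi2 = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            chi2 += (observed[(r, c)] - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    if k == 0:
        return 0.0  # a constant variable carries no association
    return math.sqrt(chi2 / (n * k))


v_perfect = cramers_v_sketch(list("aabb"), list("uuvv"))
v_none = cramers_v_sketch(list("aabb"), list("uvuv"))
```

The first call pairs every "a" with "u" and every "b" with "v" (perfect association); the second spreads the categories evenly (no association).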

transparentai.datasets.variable.correlation.merge_corr_df(df_list)

Merges the correlation matrices from the compute_correlation() function into one. Requires three DataFrames: pearson_corr, cramers_v_corr and pbs_corr.

This matrix has a flaw: cramers_v_corr is scaled from 0 to 1, whereas the other coefficients range from -1 to 1. Keep this in mind when comparing values across the merged matrix.

Parameters: df_list (list) – List of correlation matrices
Returns: Merged DataFrame of correlation matrices
Return type: pd.DataFrame

Preformatted datasets

transparentai.datasets.datasets.load_adult()

Loads the Adult dataset. Source: https://archive.ics.uci.edu/ml/datasets/Adult

transparentai.datasets.datasets.load_boston()

Loads the Boston dataset. Source: https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

transparentai.datasets.datasets.load_iris()

Loads the Iris dataset. Source: http://archive.ics.uci.edu/ml/datasets/Iris/