transparentai.datasets
Variable submodule
-
transparentai.datasets.variable.variable.
describe_number
(arr) Descriptive statistics about a numeric array.
Returned statistics:
- Count of valid values
- Count of missing values
- Mean
- Mode
- Min
- Quantile 25%
- Median
- Quantile 75%
- Max
Parameters: arr (array like) – Array of values to get descriptive statistics from
Raises: - TypeError: – arr is not an array like
- TypeError: – arr is not a number array
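As an illustration of the statistics listed above, here is a minimal standard-library sketch. The function name describe_number_sketch, the dictionary keys, the treatment of None and NaN as missing, and the quantile interpolation method are all assumptions; the library's own implementation and output format may differ.

```python
import math
import statistics


def describe_number_sketch(values):
    """Hypothetical stand-in that computes the statistics listed above."""
    def is_missing(v):
        # Assumption: None and NaN both count as missing values.
        return v is None or (isinstance(v, float) and math.isnan(v))

    valid = [v for v in values if not is_missing(v)]
    # Quartiles with linear interpolation between data points.
    q25, median, q75 = statistics.quantiles(valid, n=4, method="inclusive")
    return {
        "valid values": len(valid),
        "missing values": len(values) - len(valid),
        "mean": statistics.mean(valid),
        "mode": statistics.mode(valid),
        "min": min(valid),
        "quantile 25%": q25,
        "median": median,
        "quantile 75%": q75,
        "max": max(valid),
    }


stats = describe_number_sketch([1, 2, 2, 3, 4, None, float("nan")])
```

The sketch needs at least two valid values, because statistics.quantiles raises an error on shorter inputs.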
-
transparentai.datasets.variable.variable.
describe_datetime
(arr, format='%Y-%m-%d') Descriptive statistics about a datetime array.
Returned statistics:
- Count of valid values
- Count of missing values
- Count of unique values
- Most common value
- Min
- Mean
- Max
Parameters: - arr (array like) – Array of values to get descriptive statistics from
- format (str) – String format for datetime value
Raises: - TypeError: – arr is not an array like
- TypeError: – arr is not a datetime array
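The datetime statistics can be sketched in the same way. The following is a hedged illustration using only the standard library; describe_datetime_sketch and its dictionary keys are hypothetical names, and it assumes the values arrive as strings matching format, with None marking a missing value. The mean is taken as the average offset from the earliest date.

```python
from collections import Counter
from datetime import datetime, timedelta


def describe_datetime_sketch(values, format="%Y-%m-%d"):
    """Hypothetical stand-in that computes the statistics listed above."""
    # Assumption: None marks a missing value; everything else is a string.
    parsed = [datetime.strptime(v, format) for v in values if v is not None]
    first = min(parsed)
    # Mean datetime = earliest value + average offset from it.
    total_offset = sum((d - first for d in parsed), timedelta())
    most_common, _ = Counter(parsed).most_common(1)[0]
    return {
        "valid values": len(parsed),
        "missing values": len(values) - len(parsed),
        "unique values": len(set(parsed)),
        "most common value": most_common,
        "min": first,
        "mean": first + total_offset / len(parsed),
        "max": max(parsed),
    }


dt_stats = describe_datetime_sketch(
    ["2020-01-01", "2020-01-03", "2020-01-03", None, "2020-01-09"]
)
```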
-
transparentai.datasets.variable.variable.
describe_object
(arr) Descriptive statistics about an object array.
Returned statistics:
- Count of valid values
- Count of missing values
- Count of unique values
- Most common value
Parameters: arr (array like) – Array of values to get descriptive statistics from
Raises: - TypeError: – arr is not an array like
- TypeError: – arr is not an object array
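For object arrays, the four statistics reduce to simple counting. A minimal sketch, assuming None marks a missing value (describe_object_sketch and the dictionary keys are hypothetical, not the library's API):

```python
from collections import Counter


def describe_object_sketch(values):
    """Hypothetical stand-in that computes the statistics listed above."""
    # Assumption: None marks a missing value.
    valid = [v for v in values if v is not None]
    counts = Counter(valid)
    most_common, _ = counts.most_common(1)[0]
    return {
        "valid values": len(valid),
        "missing values": len(values) - len(valid),
        "unique values": len(counts),
        "most common value": most_common,
    }


obj_stats = describe_object_sketch(["red", "blue", "red", None, "green"])
```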
-
transparentai.datasets.variable.variable.
describe
(arr) Descriptive statistics about an array. Depending on the detected dtype (number, datetime, object), it returns dtype-specific statistics.
Common statistics for all dtypes (using describe_common):
- Count of valid values
- Count of missing values
Number statistics (using describe_number):
- Mean
- Mode
- Min
- Quantile 25%
- Median
- Quantile 75%
- Max
Datetime statistics (using describe_datetime):
- Count of unique values
- Most common value
- Min
- Mean
- Max
Object statistics (using describe_object):
- Count of unique values
- Most common value
Parameters: arr (array like) – Array of values to get descriptive statistics from
Returns: Dictionary with descriptive statistics
Return type: dict
Raises: TypeError: – arr is not an array like
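The dispatch on detected dtype can be pictured with the following sketch. It is only an assumption about how such a dispatcher might look: detect_kind and describe_sketch are hypothetical names, detection here relies on Python types rather than array dtypes, and the per-kind statistics are abbreviated.

```python
from collections import Counter
from datetime import date
from numbers import Number


def detect_kind(valid):
    """Classify already-filtered values as 'number', 'datetime' or 'object'."""
    if valid and all(isinstance(v, Number) and not isinstance(v, bool) for v in valid):
        return "number"
    # datetime is a subclass of date, so this covers both.
    if valid and all(isinstance(v, date) for v in valid):
        return "datetime"
    return "object"


def describe_sketch(values):
    """Common statistics first, then a few kind-specific ones."""
    valid = [v for v in values if v is not None]  # assumption: None = missing
    stats = {"valid values": len(valid), "missing values": len(values) - len(valid)}
    kind = detect_kind(valid)
    if kind == "number":
        stats.update(mean=sum(valid) / len(valid), min=min(valid), max=max(valid))
    else:
        counts = Counter(valid)
        stats["unique values"] = len(counts)
        stats["most common value"] = counts.most_common(1)[0][0] if valid else None
        if kind == "datetime":
            stats.update(min=min(valid), max=max(valid))
    return kind, stats


kind_num, stats_num = describe_sketch([1, 2, None, 3])
kind_obj, stats_obj = describe_sketch(["a", "b", "a"])
kind_dt, stats_dt = describe_sketch([date(2020, 1, 1), date(2020, 1, 2)])
```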
-
transparentai.datasets.variable.correlation.
compute_correlation
(df, nrows=None, max_cat_val=100) Computes a correlation matrix for each of three cases and merges them:
- numerical to numerical (using Pearson coeff)
- categorical to categorical (using Cramers V & Chi square)
- numerical to categorical (discrete) (using Point Biserial)
Caveat: the merged matrix mixes scales. cramers_v_corr ranges from 0 to 1, whereas the other two matrices range from -1 to 1. Keep this in mind when comparing values.
Pearson coefficient (Wikipedia definition):
In statistics, the Pearson correlation coefficient, also referred to as Pearson’s r, the Pearson product-moment correlation coefficient (PPMCC) or the bivariate correlation, is a statistic that measures linear correlation between two variables X and Y. It has a value between +1 and −1, where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation (that the value lies between -1 and 1 is a consequence of the Cauchy–Schwarz inequality). It is widely used in the sciences.
Cramér's V (Wikipedia definition):
In statistics, Cramér’s V (sometimes referred to as Cramér’s phi and denoted as φc) is a measure of association between two nominal variables, giving a value between 0 and +1 (inclusive). It is based on Pearson’s chi-squared statistic and was published by Harald Cramér in 1946.
Point Biserial (Wikipedia definition):
The point biserial correlation coefficient (rpb) is a correlation coefficient used when one variable (e.g. Y) is dichotomous; Y can either be “naturally” dichotomous, like whether a coin lands heads or tails, or an artificially dichotomized variable. In most situations it is not advisable to dichotomize variables artificially. When a new variable is artificially dichotomized the new dichotomous variable may be conceptualized as having an underlying continuity. If this is the case, a biserial correlation would be the more appropriate calculation.
Parameters: - df (pd.DataFrame) – pandas DataFrame with the values to compute correlations on
- nrows (None or int or float (default None)) – If not None, the data is reduced to a sample: nrows rows if int, or len(df) * nrows rows if float
- max_cat_val (int or None (default 100)) – Maximum number of unique values in a categorical feature; features with more distinct values than this are ignored
Returns: Correlation matrix computed with Pearson coeff for numerical features to numerical features, Cramers V for categorical features to categorical features and Point Biserial for categorical features to numerical features
Return type: pd.DataFrame
Raises: TypeError: – Must provide a pandas DataFrame representing the data
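To make the numerical-to-numerical case concrete, here is a minimal plain-Python sketch of the Pearson coefficient. The helper pearson_r and the sample data are hypothetical; this is not the library's implementation, which operates on a DataFrame.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


age = [22, 35, 47, 51, 63]
r_up = pearson_r(age, [2 * a + 1 for a in age])  # increasing linear relation
r_down = pearson_r(age, [-a for a in age])       # decreasing linear relation
```

A perfectly increasing linear relation gives a value at the top of the -1 to 1 range and a perfectly decreasing one gives a value at the bottom, which is why these cells are not directly comparable with Cramér's V cells in the merged matrix.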
-
transparentai.datasets.variable.correlation.
compute_cramers_v_corr
(df) Computes Cramér's V correlation for a DataFrame.
Cramér's V (Wikipedia definition):
In statistics, Cramér’s V (sometimes referred to as Cramér’s phi and denoted as φc) is a measure of association between two nominal variables, giving a value between 0 and +1 (inclusive). It is based on Pearson’s chi-squared statistic and was published by Harald Cramér in 1946.
Parameters: df (pd.DataFrame) – pandas DataFrame with the values to compute Cramér's V correlation on
Returns: Correlation matrix computed with the Cramér's V coefficient
Return type: pd.DataFrame
Raises: TypeError: – Must provide a pandas DataFrame representing the data
-
transparentai.datasets.variable.correlation.
compute_pointbiserialr_corr
(df, cat_feats=None, num_feats=None) Computes Point Biserial correlation for a DataFrame.
Point Biserial (Wikipedia definition):
The point biserial correlation coefficient (rpb) is a correlation coefficient used when one variable (e.g. Y) is dichotomous; Y can either be “naturally” dichotomous, like whether a coin lands heads or tails, or an artificially dichotomized variable. In most situations it is not advisable to dichotomize variables artificially. When a new variable is artificially dichotomized the new dichotomous variable may be conceptualized as having an underlying continuity. If this is the case, a biserial correlation would be the more appropriate calculation.
Parameters: df (pd.DataFrame) – pandas DataFrame with the values to compute Point Biserial correlation on
Returns: Correlation matrix computed for Point Biserial coeff
Return type: pd.DataFrame
Raises: - TypeError: – Must provide a pandas DataFrame representing the data
- ValueError: – cat_feats and num_feats must both be set or both be None
- TypeError: – cat_feats must be a list
- TypeError: – num_feats must be a list
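The point biserial coefficient itself can be sketched in plain Python for a single pair of variables. This is a hedged illustration of the textbook formula, not the library's code; point_biserial is a hypothetical helper, and it assumes the dichotomous variable is already coded as 0 and 1.

```python
import math


def point_biserial(groups, values):
    """Point biserial correlation between 0/1 labels and numeric values."""
    ones = [v for g, v in zip(groups, values) if g == 1]
    zeros = [v for g, v in zip(groups, values) if g == 0]
    n1, n0, n = len(ones), len(zeros), len(values)
    mean_all = sum(values) / n
    # Population standard deviation of all values (divides by n).
    s_n = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    mean_diff = sum(ones) / n1 - sum(zeros) / n0
    return mean_diff / s_n * math.sqrt(n1 * n0 / n ** 2)


labels = [0, 0, 0, 1, 1, 1]
scores = [1, 2, 3, 4, 5, 6]
r_pb = point_biserial(labels, scores)
r_pb_flipped = point_biserial([1 - g for g in labels], scores)
```

Swapping the two labels flips the sign of the coefficient, so the sign only has meaning relative to how the categories were coded.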
-
transparentai.datasets.variable.correlation.
cramers_v
(x, y) Returns the Cramér's V value of two categorical variables, computed from the chi-square statistic. This correlation metric is between 0 and 1.
Source code found in this article: https://towardsdatascience.com/the-search-for-categorical-correlation-a1cf7f1888c9
Parameters: - x (array like) – first categorical variable
- y (array like) – second categorical variable
Returns: Cramer V value
Return type:
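As a rough illustration of how Cramér's V follows from the chi-square statistic, here is a plain-Python sketch of the uncorrected formula V = sqrt(chi2 / (n * (min(r, k) - 1))), where r and k are the numbers of categories. The name cramers_v_sketch is hypothetical, and the referenced implementation may differ, for example by applying a bias correction that this sketch omits.

```python
import math
from collections import Counter


def cramers_v_sketch(x, y):
    """Uncorrected Cramér's V of two equal-length categorical sequences."""
    n = len(x)
    observed = Counter(zip(x, y))  # contingency table as (x, y) -> count
    row_totals = Counter(x)
    col_totals = Counter(y)
    chi2 = 0.0
    for a in row_totals:
        for b in col_totals:
            expected = row_totals[a] * col_totals[b] / n
            chi2 += (observed[(a, b)] - expected) ** 2 / expected
    # Fails with a division by zero if either variable has one category.
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))


v_dependent = cramers_v_sketch(["a", "a", "b", "b"], ["u", "u", "v", "v"])
v_independent = cramers_v_sketch(["a", "a", "b", "b"], ["u", "v", "u", "v"])
```

The first pair is perfectly associated and the second shows no association, so the two results sit at opposite ends of the 0 to 1 range.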
-
transparentai.datasets.variable.correlation.
merge_corr_df
(df_list) Merges the correlation matrices from the compute_correlation() function into one. Needs 3 DataFrames: pearson_corr, cramers_v_corr and pbs_corr.
Caveat: the merged matrix mixes scales. cramers_v_corr ranges from 0 to 1, whereas the other two matrices range from -1 to 1. Keep this in mind when comparing values.
Parameters: df_list (list) – List of correlation matrices
Returns: Merged DataFrame of correlation matrices
Return type: pd.DataFrame
Preformatted datasets
-
transparentai.datasets.datasets.
load_adult
() Load the Adult dataset. Source: https://archive.ics.uci.edu/ml/datasets/Adult
-
transparentai.datasets.datasets.
load_boston
() Load the Boston dataset. Source: https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
-
transparentai.datasets.datasets.
load_iris
() Load the Iris dataset. Source: http://archive.ics.uci.edu/ml/datasets/Iris/