How do you assess data quality using Python?
There is an amazing open-source Python library, ydata_quality, which assesses data quality throughout the multiple stages of a data pipeline's development.
Once you have a dataset available, running DataQuality(df=my_df).evaluate() provides a comprehensive overview of the details and intricacies of the data, through the perspective of the multiple modules available in the package.
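For example, following the pattern in the library's README (the CSV path here is just a placeholder), a full run looks roughly like this:

import pandas as pd
from ydata_quality import DataQuality

# Load any tabular dataset into a pandas DataFrame
# ('my_dataset.csv' is a placeholder path).
df = pd.read_csv('my_dataset.csv')

# DataQuality is the main entry point wrapping all quality modules.
dq = DataQuality(df=df)

# Runs every available check and prints a summary of the
# warnings raised, grouped by priority.
results = dq.evaluate()

Individual warnings can then be drilled into with dq.get_warnings(), optionally filtered by the test that raised them, as described in the README.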
The library is also organised into the following individual modules, each of which can be run on its own.
1. Bias & Fairness
Checks for bias and fairness in the dataset.
- Bias: a systematic, non-negligible differentiated treatment towards a specific sub-group of individuals.
- Fairness: the absence of differentiated treatment (assistive or punitive) based on sensitive attributes. Fairness can also be thought of as the absence of any unjustified basis for differentiated treatment.
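If you want to run this module standalone, a minimal sketch looks like the one below. It assumes the BiasFairness class from ydata_quality.bias_fairness as shown in the library's tutorials; the file path, sensitive columns and label column are made-up placeholders.

import pandas as pd
from ydata_quality.bias_fairness import BiasFairness

df = pd.read_csv('my_dataset.csv')  # placeholder path

# Declare the sensitive attributes and the decision/label column
# (all column names here are hypothetical).
bf = BiasFairness(df=df, sensitive_features=['gender', 'race'], label='approved')

# Runs the bias & fairness checks, e.g. looking for features that act
# as proxies for sensitive attributes and for performance disparities
# across sub-groups.
results = bf.evaluate()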
2. Data Expectations
To define an expectation about data is to develop a unit test that asserts a certain property of the data and provides an actionable output on any deviation.
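Per its documentation, the library's Data Expectations module builds on Great Expectations validation results; rather than guessing at that API, here is a generic, hypothetical illustration of the "expectation as unit test" idea in plain pandas:

import pandas as pd

def expect_column_in_range(df: pd.DataFrame, column: str,
                           low: float, high: float) -> None:
    """A data 'unit test': assert a property of the data and
    report actionable details whenever the expectation is violated."""
    violations = df[(df[column] < low) | (df[column] > high)]
    assert violations.empty, (
        f"{len(violations)} rows have '{column}' outside "
        f"[{low}, {high}]; first offending index: {violations.index[0]}"
    )

df = pd.DataFrame({'age': [25, 40, -3, 130]})
expect_column_in_range(df, 'age', 0, 120)  # fails with an actionable message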
3. Data Relations
Checks the correlation and causality between variables in the dataset. The library implements these checks by leveraging techniques such as correlation methods, full order partial correlations (i.e. controlling for the effect of all covariates) and the Variance Inflation Factor, which is used to assess the multicollinearity of numerical covariates.
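To make the multicollinearity check concrete, the sketch below computes the Variance Inflation Factor with statsmodels on synthetic data. This illustrates the technique itself, not ydata_quality's internal implementation, and all column names are invented:

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    'x1': x1,
    'x2': x1 * 2 + rng.normal(scale=0.1, size=200),  # nearly collinear with x1
    'x3': rng.normal(size=200),
})

# VIF is computed for each covariate against all the others;
# values above roughly 5-10 usually flag multicollinearity.
X = add_constant(df)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=df.columns,
)
print(vif)  # x1 and x2 get very large VIFs, x3 stays close to 1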
4. Drift Analysis
Data drift is a broad term for differences between the data observed by a model at training time and at prediction time. The library uses different methods, such as Maximum Mean Discrepancy, the Kolmogorov-Smirnov test, the Chi-squared test and Bonferroni correction, depending on the type of data.
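To see what these tests look like in practice, the sketch below applies the two-sample Kolmogorov-Smirnov test to a numerical feature and the Chi-squared test to a categorical one with scipy, using a Bonferroni-corrected threshold. It demonstrates the underlying statistics on synthetic data, not the library's own drift API:

import numpy as np
from scipy.stats import ks_2samp, chi2_contingency

rng = np.random.default_rng(1)

# Numerical feature: training vs. serving samples
# (synthetic, with a deliberate mean shift at serving time).
train_num = rng.normal(loc=0.0, size=500)
serve_num = rng.normal(loc=0.5, size=500)
ks_stat, ks_p = ks_2samp(train_num, serve_num)

# Categorical feature: compare category counts via a contingency table.
train_cat = rng.choice(['a', 'b', 'c'], p=[0.5, 0.3, 0.2], size=500)
serve_cat = rng.choice(['a', 'b', 'c'], p=[0.2, 0.3, 0.5], size=500)
table = [
    [np.sum(train_cat == c) for c in ['a', 'b', 'c']],
    [np.sum(serve_cat == c) for c in ['a', 'b', 'c']],
]
chi2_stat, chi2_p, _, _ = chi2_contingency(table)

# Bonferroni correction: divide the significance level by the number
# of features tested to control the family-wise error rate.
n_tests = 2
alpha = 0.05 / n_tests
for name, p in [('numerical (KS)', ks_p), ('categorical (chi2)', chi2_p)]:
    print(f"{name}: p={p:.4f} -> drift detected: {p < alpha}")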
5. Duplicates
Functionality to detect duplicate values.
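Run standalone, this module can be sketched as follows, assuming the DuplicateChecker class from ydata_quality.duplicates as in the library's examples; the path and entity column are placeholders:

import pandas as pd
from ydata_quality.duplicates import DuplicateChecker

df = pd.read_csv('my_dataset.csv')  # placeholder path

# Checks for exact duplicate rows and duplicate columns; the optional
# entities argument (hypothetical column name here) also flags
# repeated values within an entity.
dc = DuplicateChecker(df=df, entities=['customer_id'])
results = dc.evaluate()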
6. Labelling: Categoricals and Numericals
Detects quality issues such as class imbalance, missing labels and outliers in the dataset, based on the type of the labels.
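A minimal standalone sketch, assuming the LabelInspector class from ydata_quality.labelling; the path and label column are placeholders:

import pandas as pd
from ydata_quality.labelling import LabelInspector

df = pd.read_csv('my_dataset.csv')  # placeholder path

# Point the inspector at the label column (hypothetical name);
# the checks run differ for categorical vs. numerical labels.
li = LabelInspector(df=df, label='target')
results = li.evaluate()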
7. Missings
Detects null/missing information in your dataset.
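Sketched standalone, assuming the MissingsProfiler class from ydata_quality.missings; again, the path is a placeholder:

import pandas as pd
from ydata_quality.missings import MissingsProfiler

df = pd.read_csv('my_dataset.csv')  # placeholder path

# Profiles null/missing values across the dataset, e.g. counts
# per column and patterns in how values go missing.
mp = MissingsProfiler(df=df)
results = mp.evaluate()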
8. Erroneous Data
Functionality to detect erroneous data values in the data frame.
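And a last sketch, assuming the ErroneousDataIdentifier class from ydata_quality.erroneous_data; the path is a placeholder:

import pandas as pd
from ydata_quality.erroneous_data import ErroneousDataIdentifier

df = pd.read_csv('my_dataset.csv')  # placeholder path

# Looks for erroneous/placeholder values (e.g. 'N/A', '?')
# used as stand-ins for real data.
edi = ErroneousDataIdentifier(df=df)
results = edi.evaluate()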