How do you assess data quality using Python?
There is an amazing open-source Python library, ydata_quality, which assesses data quality throughout the multiple stages of a data pipeline's development.
Once you have a dataset available, running DataQuality(df=my_df).evaluate() provides a comprehensive overview of the details and intricacies of the data, through the perspective of the multiple modules available in the package.
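For example, following the pattern in the library's README (the CSV path here is just a placeholder), a full run looks roughly like this:

import pandas as pd
from ydata_quality import DataQuality

# Load any tabular dataset into a pandas DataFrame
# ('my_dataset.csv' is a placeholder path).
df = pd.read_csv('my_dataset.csv')

# DataQuality is the main entry point wrapping all quality modules.
dq = DataQuality(df=df)

# Runs every available check and prints a summary of the
# warnings raised, grouped by priority.
results = dq.evaluate()

Individual warnings can then be drilled into with dq.get_warnings(), optionally filtered by the test that raised them, as described in the README.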
The library is also organised into the following individual modules, each of which can be run on its own.
1. Bias & Fairness
Checks for bias and fairness in the dataset.
- Bias: a systematic, non-negligible differentiated treatment towards a specific sub-group of individuals.
- Fairness: the absence of differentiated treatment (assistive or punitive) based on sensitive attributes. Fairness can also be thought of as the absence of any unjustified basis for differentiated treatment.
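If you want to run this module standalone, a minimal sketch looks like the one below. It assumes the BiasFairness class from ydata_quality.bias_fairness as shown in the library's tutorials; the file path, sensitive columns and label column are made-up placeholders.

import pandas as pd
from ydata_quality.bias_fairness import BiasFairness

df = pd.read_csv('my_dataset.csv')  # placeholder path

# Declare the sensitive attributes and the decision/label column
# (all column names here are hypothetical).
bf = BiasFairness(df=df, sensitive_features=['gender', 'race'], label='approved')

# Runs the bias & fairness checks, e.g. looking for features that act
# as proxies for sensitive attributes and for performance disparities
# across sub-groups.
results = bf.evaluate()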
2. Data Expectations
To define an expectation about data is to develop a unit test that asserts a certain property of the data and provides an actionable output on any deviation.
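Per its documentation, the library's Data Expectations module builds on Great Expectations validation results; rather than guessing at that API, here is a generic, hypothetical illustration of the "expectation as unit test" idea in plain pandas:

import pandas as pd

def expect_column_in_range(df: pd.DataFrame, column: str,
                           low: float, high: float) -> None:
    """A data 'unit test': assert a property of the data and
    report actionable details whenever the expectation is violated."""
    violations = df[(df[column] < low) | (df[column] > high)]
    assert violations.empty, (
        f"{len(violations)} rows have '{column}' outside "
        f"[{low}, {high}]; first offending index: {violations.index[0]}"
    )

df = pd.DataFrame({'age': [25, 40, -3, 130]})
expect_column_in_range(df, 'age', 0, 120)  # fails with an actionable message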
3. Data Relations
Checks the correlation and causality between variables in the dataset. The library implements these checks by leveraging techniques such as correlation methods, full order partial correlations (i.e. controlling for the effect of all covariates) and the Variance Inflation Factor, which is used to assess the multicollinearity of numerical covariates.
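To make the multicollinearity check concrete, the sketch below computes the Variance Inflation Factor with statsmodels on synthetic data. This illustrates the technique itself, not ydata_quality's internal implementation, and all column names are invented:

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    'x1': x1,
    'x2': x1 * 2 + rng.normal(scale=0.1, size=200),  # nearly collinear with x1
    'x3': rng.normal(size=200),
})

# VIF is computed for each covariate against all the others;
# values above roughly 5-10 usually flag multicollinearity.
X = add_constant(df)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=df.columns,
)
print(vif)  # x1 and x2 get very large VIFs, x3 stays close to 1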
4. Drift Analysis
Data drift is a broad term for differences between the data observed by a model at training time and at prediction time. The library uses different methods, such as Maximum Mean Discrepancy, the Kolmogorov-Smirnov test, the Chi-squared test and Bonferroni correction, depending on the type of data.
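To see what these tests look like in practice, the sketch below applies the two-sample Kolmogorov-Smirnov test to a numerical feature and the Chi-squared test to a categorical one with scipy, using a Bonferroni-corrected threshold. It demonstrates the underlying statistics on synthetic data, not the library's own drift API:

import numpy as np
from scipy.stats import ks_2samp, chi2_contingency

rng = np.random.default_rng(1)

# Numerical feature: training vs. serving samples
# (synthetic, with a deliberate mean shift at serving time).
train_num = rng.normal(loc=0.0, size=500)
serve_num = rng.normal(loc=0.5, size=500)
ks_stat, ks_p = ks_2samp(train_num, serve_num)

# Categorical feature: compare category counts via a contingency table.
train_cat = rng.choice(['a', 'b', 'c'], p=[0.5, 0.3, 0.2], size=500)
serve_cat = rng.choice(['a', 'b', 'c'], p=[0.2, 0.3, 0.5], size=500)
table = [
    [np.sum(train_cat == c) for c in ['a', 'b', 'c']],
    [np.sum(serve_cat == c) for c in ['a', 'b', 'c']],
]
chi2_stat, chi2_p, _, _ = chi2_contingency(table)

# Bonferroni correction: divide the significance level by the number
# of features tested to control the family-wise error rate.
n_tests = 2
alpha = 0.05 / n_tests
for name, p in [('numerical (KS)', ks_p), ('categorical (chi2)', chi2_p)]:
    print(f"{name}: p={p:.4f} -> drift detected: {p < alpha}")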
5. Duplicates
Functionality to detect duplicate values.
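Run standalone, this module can be sketched as follows, assuming the DuplicateChecker class from ydata_quality.duplicates as in the library's examples; the path and entity column are placeholders:

import pandas as pd
from ydata_quality.duplicates import DuplicateChecker

df = pd.read_csv('my_dataset.csv')  # placeholder path

# Checks for exact duplicate rows and duplicate columns; the optional
# entities argument (hypothetical column name here) also flags
# repeated values within an entity.
dc = DuplicateChecker(df=df, entities=['customer_id'])
results = dc.evaluate()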
6. Labelling: Categoricals and Numericals
Detects quality issues such as class imbalance, missing labels and outliers in the dataset, based on the type of the labels.
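A minimal standalone sketch, assuming the LabelInspector class from ydata_quality.labelling; the path and label column are placeholders:

import pandas as pd
from ydata_quality.labelling import LabelInspector

df = pd.read_csv('my_dataset.csv')  # placeholder path

# Point the inspector at the label column (hypothetical name);
# the checks run differ for categorical vs. numerical labels.
li = LabelInspector(df=df, label='target')
results = li.evaluate()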
7. Missings
Detects null/missing information in your dataset.
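Sketched standalone, assuming the MissingsProfiler class from ydata_quality.missings; again, the path is a placeholder:

import pandas as pd
from ydata_quality.missings import MissingsProfiler

df = pd.read_csv('my_dataset.csv')  # placeholder path

# Profiles null/missing values across the dataset, e.g. counts
# per column and patterns in how values go missing.
mp = MissingsProfiler(df=df)
results = mp.evaluate()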
8. Erroneous Data
Functionality to detect erroneous data values in the data frame.
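And a last sketch, assuming the ErroneousDataIdentifier class from ydata_quality.erroneous_data; the path is a placeholder:

import pandas as pd
from ydata_quality.erroneous_data import ErroneousDataIdentifier

df = pd.read_csv('my_dataset.csv')  # placeholder path

# Looks for erroneous/placeholder values (e.g. 'N/A', '?')
# used as stand-ins for real data.
edi = ErroneousDataIdentifier(df=df)
results = edi.evaluate()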