Example usage

Here we will demonstrate how to use the datpro package to summarize data, detect anomalies, and create visualizations for a dataset.

Imports

import datpro as dp
import pandas as pd
import numpy as np
import altair as alt
from itertools import combinations

Load example dataset

We’ll use a sample dataset to demonstrate the functionality of the datpro package. The dataset contains demographic and transactional data, with the goal of predicting Income from the other features: Age, Gender, Spending_Score, and Region.

df = pd.read_csv('../data/example_data.csv')
df
      Age        Income  Spending_Score  Gender  Region
0      66           NaN       26.373678    Male   South
1      65  66369.651809       20.906870  Female   South
2      59  70764.092278       47.990597    Male    West
3      64  41432.315153       31.120625  Female   North
4      53  52963.994070       12.016596  Female    East
...   ...           ...             ...     ...     ...
1005   18  62455.037248       22.795113  Female   North
1006   24  35361.901205       18.846863    Male   South
1007   51  56554.072546       17.076530  Female   South
1008   63  52799.136847       42.219961    Male    East
1009   42  52727.993826       27.395330    Male    East

1010 rows × 5 columns

In this dataset:

  • Age is the age of the individual.

  • Income is the annual income (our target variable for prediction).

  • Spending_Score quantifies spending behavior.

  • Gender specifies the gender of the individual.

  • Region indicates the geographical region.

If you’d like to follow along with the same dataset, you can download our example CSV file here.

Summarize data

The summarize_data() function summarizes the numeric columns of a dataset by calculating their minimum, 25th percentile (Q1), median (50th percentile), 75th percentile (Q3), and maximum values.

dp.summarize_data(df)
                        min           25%           50%           75%            max
Age               18.000000     31.000000     44.000000     56.000000      69.000000
Income          6556.169327  40915.394217  51146.204619  60893.485307  443001.985244
Spending_Score     0.536808     16.880278     26.670824     38.786205      75.010095
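To cross-check these numbers, the same five statistics can be reproduced with plain pandas. The following is a minimal sketch on a small made-up frame, not datpro's actual implementation; five_number_summary is a hypothetical helper name, and skipping NaN values is an assumption based on pandas' default behavior.

```python
import pandas as pd


def five_number_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return min, quartiles, and max for each numeric column (NaNs are skipped)."""
    numeric = df.select_dtypes(include="number")
    # describe() returns count/mean/std/min/25%/50%/75%/max as rows;
    # keep the five statistics of interest and put one column per row.
    return numeric.describe().loc[["min", "25%", "50%", "75%", "max"]].T


# Made-up data for illustration only (not the example CSV)
toy = pd.DataFrame({
    "Age": [20, 30, 40, 50, 60],
    "Income": [10.0, 20.0, None, 40.0, 50.0],
    "Region": ["N", "S", "E", "W", "N"],
})
summary = five_number_summary(toy)
print(summary)
```

The non-numeric Region column is dropped, which is consistent with the output above listing only Age, Income, and Spending_Score.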

Detect Anomalies

To detect missing values, outliers, and duplicates, use the detect_anomalies() function. This function allows you to analyze a dataset and identify potential issues that may impact data quality and analysis results. By specifying a particular anomaly type, you can focus on specific data integrity concerns.

Detect all anomalies

dp.detect_anomalies(df)
{'missing_values': {'Income': {'missing_count': 50,
   'missing_percentage': np.float64(4.95)},
  'Spending_Score': {'missing_count': 30,
   'missing_percentage': np.float64(2.97)}},
 'outliers': {'Income': {'outlier_count': 24, 'outlier_percentage': 2.38},
  'Spending_Score': {'outlier_count': 4, 'outlier_percentage': 0.4}},
 'duplicates': {'duplicate_count': np.int64(10),
  'duplicate_percentage': np.float64(0.99)}}
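The percentages in this report are consistent with dividing each count by the total number of rows (for example, 50 / 1010 ≈ 4.95%). A rough pandas-only sketch of a report with the same shape is shown here. It is not datpro's implementation: count_anomalies is a hypothetical name, and the 1.5 × IQR outlier rule is an assumption, since this page does not state which outlier method the package uses.

```python
import pandas as pd


def count_anomalies(df: pd.DataFrame) -> dict:
    """Count missing values, IQR outliers, and duplicate rows (illustrative only)."""
    n = len(df)
    report = {"missing_values": {}, "outliers": {}, "duplicates": {}}

    # Missing values: one entry per column that has at least one NaN
    for col, count in df.isna().sum().items():
        if count > 0:
            report["missing_values"][col] = {
                "missing_count": int(count),
                "missing_percentage": round(100 * int(count) / n, 2),
            }

    # Outliers: ASSUMPTION, values more than 1.5 * IQR outside the quartiles
    for col in df.select_dtypes(include="number").columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
        if mask.sum() > 0:
            report["outliers"][col] = {
                "outlier_count": int(mask.sum()),
                "outlier_percentage": round(100 * int(mask.sum()) / n, 2),
            }

    # Duplicates: rows identical to an earlier row
    dup = int(df.duplicated().sum())
    report["duplicates"] = {
        "duplicate_count": dup,
        "duplicate_percentage": round(100 * dup / n, 2),
    }
    return report


# Made-up data for illustration only
toy = pd.DataFrame({
    "Income": [50.0, 52.0, 51.0, 49.0, 500.0, None, 50.0, 52.0],
    "Region": ["N", "S", "E", "W", "N", "S", "N", "S"],
})
report = count_anomalies(toy)
print(report)
```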

Detect Specific Anomalies

You can specify an anomaly type to check only for particular data issues:

  • Missing Values: dp.detect_anomalies(df, anomaly_type='missing_values')

  • Outliers: dp.detect_anomalies(df, anomaly_type='outliers')

  • Duplicates: dp.detect_anomalies(df, anomaly_type='duplicates')

For example, if you only want to check for missing values:

# Detect only missing values
dp.detect_anomalies(df, anomaly_type='missing_values')
{'missing_values': {'Income': {'missing_count': 50,
   'missing_percentage': np.float64(4.95)},
  'Spending_Score': {'missing_count': 30,
   'missing_percentage': np.float64(2.97)}}}

The results from detect_anomalies() complement those of summarize_data() by identifying specific quality issues that require attention. For instance, the missing-value counts can guide imputation strategies, while outliers and duplicates may reduce model accuracy if not properly addressed. Addressing these issues early helps make downstream analysis and modeling more robust and reliable.
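As one possible follow-up to an anomaly report, duplicates can be dropped and missing numeric values imputed with plain pandas. This is only a sketch: basic_clean is a hypothetical helper, not part of datpro, and median imputation is an assumption chosen here because the median is less sensitive to outliers than the mean. Whether it suits your data depends on your analysis goals.

```python
import pandas as pd


def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows, then fill numeric NaNs with the column median."""
    cleaned = df.drop_duplicates().copy()
    numeric_cols = cleaned.select_dtypes(include="number").columns
    # fillna with a Series of medians fills each column with its own median
    cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].median())
    return cleaned.reset_index(drop=True)


# Made-up data for illustration only
toy = pd.DataFrame({
    "Age": [20, 20, 30, 40],
    "Income": [100.0, 100.0, None, 300.0],
    "Region": ["N", "N", "S", "E"],
})
cleaned = basic_clean(toy)
print(cleaned)
```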

Plotify

plotify() is a function that automatically generates various visualizations for a given Pandas DataFrame. It supports different plot types, including histograms, scatter plots, correlation heatmaps, box plots, and stacked bar charts. The function is designed to handle both numeric and categorical data, making it useful for exploratory data analysis.

Function Signature

def plotify(df: pd.DataFrame, plot_types: list = None) -> dict:

Parameters

  • df (pandas.DataFrame): The dataset for which plots need to be generated. Must be a non-empty DataFrame.

  • plot_types (list, optional): A list of plot types to generate. If None, all supported plots will be generated.

Returns

  • A dictionary where keys represent the type of plots generated, and values are the corresponding plot objects.

Raises

  • ValueError: If an empty DataFrame is provided.

  • TypeError: If the input is not a Pandas DataFrame.
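The two documented errors can be illustrated with a stand-in validator. The sketch below does not call datpro; check_plot_input is a hypothetical function that mirrors the documented behavior, and both the error messages and the order of the checks are assumptions.

```python
import pandas as pd


def check_plot_input(df):
    """Hypothetical validation mirroring the errors documented for plotify()."""
    if not isinstance(df, pd.DataFrame):
        raise TypeError("Input must be a pandas DataFrame.")
    if df.empty:
        raise ValueError("DataFrame must not be empty.")
    return df


# A list triggers TypeError; an empty DataFrame triggers ValueError
messages = []
for bad in ([1, 2, 3], pd.DataFrame()):
    try:
        check_plot_input(bad)
    except (TypeError, ValueError) as err:
        messages.append(f"{type(err).__name__}: {err}")
print(messages)
```

In practice, wrapping a plotify() call in the same try/except pattern lets a pipeline report bad input instead of stopping on it.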

Example Usage

To generate all supported plots for a dataset, call plotify() with only the DataFrame:

plots = dp.plotify(df)
plots
{'histogram_Age': alt.Chart(...),
 'density_Age': alt.Chart(...),
 'histogram_Income': alt.Chart(...),
 'density_Income': alt.Chart(...),
 'histogram_Spending_Score': alt.Chart(...),
 'density_Spending_Score': alt.Chart(...),
 'bar_Gender': alt.Chart(...),
 'bar_Region': alt.Chart(...),
 'scatter_Age_Income': alt.Chart(...),
 'scatter_Age_Spending_Score': alt.Chart(...),
 'scatter_Income_Spending_Score': alt.Chart(...),
 'correlation_heatmap': alt.Chart(...),
 'box_Age_Gender': alt.Chart(...),
 'box_Age_Region': alt.Chart(...),
 'box_Income_Gender': alt.Chart(...),
 'box_Income_Region': alt.Chart(...),
 'box_Spending_Score_Gender': alt.Chart(...),
 'box_Spending_Score_Region': alt.Chart(...),
 'stacked_bar_Gender_Region': alt.Chart(...)}

This generates:

  • Histograms and density plots for numeric columns like Age, Income, and Spending_Score.

  • Bar charts for categorical columns like Gender and Region.

  • Scatter plots for pairwise numeric columns.

  • A correlation heatmap for numeric columns.

  • Box plots comparing numeric columns with categorical columns.

  • Stacked bar charts for pairwise categorical columns.
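The dictionary keys shown above follow a regular pattern: one entry per numeric column, per categorical column, or per pair of columns. The sketch below reconstructs that pattern with itertools.combinations. It is inferred from the example output only, not taken from datpro's source; expected_plot_keys is a hypothetical helper, and treating every non-numeric column as categorical is an assumption.

```python
from itertools import combinations

import pandas as pd


def expected_plot_keys(df: pd.DataFrame) -> list:
    """List the plot keys one would expect, based on the naming seen in the output."""
    numeric = list(df.select_dtypes(include="number").columns)
    categorical = list(df.select_dtypes(exclude="number").columns)

    keys = []
    for col in numeric:
        keys += [f"histogram_{col}", f"density_{col}"]
    keys += [f"bar_{col}" for col in categorical]
    # Each unordered pair of numeric columns appears once
    keys += [f"scatter_{a}_{b}" for a, b in combinations(numeric, 2)]
    keys.append("correlation_heatmap")
    # Every numeric column is paired with every categorical column
    keys += [f"box_{num}_{cat}" for num in numeric for cat in categorical]
    keys += [f"stacked_bar_{a}_{b}" for a, b in combinations(categorical, 2)]
    return keys


# One made-up row with the same column names as the example dataset
toy = pd.DataFrame({
    "Age": [25],
    "Income": [50000.0],
    "Spending_Score": [30.0],
    "Gender": ["Male"],
    "Region": ["East"],
})
keys = expected_plot_keys(toy)
print(len(keys))
```

With three numeric and two categorical columns this yields 19 keys, matching the number of entries in the dictionary above.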

To display an individual plot, look it up by its key and call show():

plots['histogram_Age'].show()

Specific Plot Types

To generate only specific plot types, list them in the plot_types parameter. For example:

# Generate density and bar plots only
plot_density_bar = dp.plotify(df, plot_types=['density', 'bar'])
plot_density_bar
{'density_Age': alt.Chart(...),
 'density_Income': alt.Chart(...),
 'density_Spending_Score': alt.Chart(...),
 'bar_Gender': alt.Chart(...),
 'bar_Region': alt.Chart(...)}
plot_density_bar['density_Spending_Score']
plot_density_bar['bar_Region']
# Generate correlation heatmap
plot_correlation = dp.plotify(df, plot_types=['correlation'])
plot_correlation
{'correlation_heatmap': alt.Chart(...)}
plot_correlation['correlation_heatmap'].show()

This generates:

  • Density plots for numeric columns like Age, Income, and Spending_Score.

  • Bar charts for categorical columns like Gender and Region.

  • A correlation heatmap for numeric columns.

Saving Plots to a Directory

plots_save = dp.plotify(df, save=True, save_path="example_plots", file_prefix="analysis")

This will save the plots in the example_plots directory with filenames starting with analysis_.

plotify() automatically handles missing values by ignoring them in the visualizations. For instance, density plots and histograms will exclude NaN values. Outliers are included in the visualizations, offering insights into their impact on the data distribution.

Conclusion

The datpro package provides a modular way to explore and profile a dataset: summarize_data() for summary statistics, detect_anomalies() for data quality checks, and plotify() for visualizations. Additional cleaning steps, such as handling missing values or outliers, may still be needed depending on your analysis goals.

Feel free to replace the example dataset with your own data and adjust the function calls as needed.