Example usage

This walkthrough shows how to use the datpro package to summarize a dataset, detect anomalies, and create visualizations.

Imports

import datpro as dp
import pandas as pd
import numpy as np
import altair as alt
from itertools import combinations

Load example dataset

We’ll use a sample dataset to demonstrate the functionality of the datpro package. The dataset contains demographic and transactional data, and the goal is to predict Income from the other features: Age, Gender, Spending_Score, and Region.

df = pd.read_csv('../data/example_data.csv')
df

	Age	Income	Spending_Score	Gender	Region
0	66	NaN	26.373678	Male	South
1	65	66369.651809	20.906870	Female	South
2	59	70764.092278	47.990597	Male	West
3	64	41432.315153	31.120625	Female	North
4	53	52963.994070	12.016596	Female	East
...	...	...	...	...	...
1005	18	62455.037248	22.795113	Female	North
1006	24	35361.901205	18.846863	Male	South
1007	51	56554.072546	17.076530	Female	South
1008	63	52799.136847	42.219961	Male	East
1009	42	52727.993826	27.395330	Male	East

1010 rows × 5 columns

In this dataset:

Age is the age of the individual.
Income is the annual income (our target variable for prediction).
Spending_Score quantifies spending behavior.
Gender specifies the gender of the individual.
Region indicates the geographical region.

If you’d like to follow along with the same dataset, you can download our example CSV file here.

Summarize data

To summarize the numeric columns in our dataset, use the summarize_data() function. It calculates the minimum, 25th percentile (Q1), median (50th percentile), 75th percentile (Q3), and maximum of each numeric column.

dp.summarize_data(df)

	min	25%	50%	75%	max
Age	18.000000	31.000000	44.000000	56.000000	69.000000
Income	6556.169327	40915.394217	51146.204619	60893.485307	443001.985244
Spending_Score	0.536808	16.880278	26.670824	38.786205	75.010095
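
For comparison, the same five statistics can be computed with plain pandas. The following is a minimal sketch on a small made-up frame (the toy values and the variable names are assumptions for illustration); it is not necessarily how summarize_data() is implemented internally.

```python
import numpy as np
import pandas as pd

# Small stand-in frame reusing column names from the example dataset
# (the values are made up for illustration).
toy = pd.DataFrame({
    "Age": [18, 31, 44, 56, 69],
    "Income": [30000.0, 42000.0, np.nan, 58000.0, 61000.0],
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
})

# Keep numeric columns only, then compute min, quartiles, and max per column.
# quantile() skips NaN values by default.
numeric = toy.select_dtypes(include="number")
summary = numeric.quantile([0, 0.25, 0.5, 0.75, 1]).T
summary.columns = ["min", "25%", "50%", "75%", "max"]
print(summary)
```

The non-numeric Gender column is dropped before the quantiles are computed, which matches the numeric-only table shown above.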

Detect Anomalies

To detect missing values, outliers, and duplicates, use the detect_anomalies() function. This function allows you to analyze a dataset and identify potential issues that may impact data quality and analysis results. By specifying a particular anomaly type, you can focus on specific data integrity concerns.

Detect all anomalies

dp.detect_anomalies(df)

{'missing_values': {'Income': {'missing_count': 50,
   'missing_percentage': np.float64(4.95)},
  'Spending_Score': {'missing_count': 30,
   'missing_percentage': np.float64(2.97)}},
 'outliers': {'Income': {'outlier_count': 24, 'outlier_percentage': 2.38},
  'Spending_Score': {'outlier_count': 4, 'outlier_percentage': 0.4}},
 'duplicates': {'duplicate_count': np.int64(10),
  'duplicate_percentage': np.float64(0.99)}}
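
To make the three anomaly types concrete, here is a rough pandas sketch of how such counts could be obtained. It is an illustration only: the toy data is made up, and the 1.5 × IQR outlier rule is an assumption, since this walkthrough does not state which rule detect_anomalies() applies.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value, one extreme value, and one repeated row.
toy = pd.DataFrame({
    "Income": [40000.0, 42000.0, 41000.0, np.nan, 43000.0, 400000.0, 42000.0],
    "Region": ["East", "West", "North", "South", "East", "West", "West"],
})

# Missing values: count and percentage per column.
missing_count = toy.isna().sum()
missing_pct = (missing_count / len(toy) * 100).round(2)

# Outliers: values beyond 1.5 * IQR from the quartiles (assumed rule).
income = toy["Income"].dropna()
q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (income < q1 - 1.5 * iqr) | (income > q3 + 1.5 * iqr)

# Duplicates: rows that exactly repeat an earlier row.
duplicate_count = int(toy.duplicated().sum())

print(missing_count["Income"], int(is_outlier.sum()), duplicate_count)
```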

Detect Specific Anomalies

You can specify an anomaly type to check only for particular data issues:

Missing Values: dp.detect_anomalies(df, anomaly_type='missing_values')
Outliers: dp.detect_anomalies(df, anomaly_type='outliers')
Duplicates: dp.detect_anomalies(df, anomaly_type='duplicates')

For example, if you only want to check for missing values:

# Detect only missing values
dp.detect_anomalies(df, anomaly_type='missing_values')

{'missing_values': {'Income': {'missing_count': 50,
   'missing_percentage': np.float64(4.95)},
  'Spending_Score': {'missing_count': 30,
   'missing_percentage': np.float64(2.97)}}}

The results from detect_anomalies() complement those of summarize_data() by identifying specific quality issues that require attention. For instance, anomalies such as missing data can guide imputation strategies, while outliers and duplicates may impact model accuracy if not properly addressed. Addressing these issues early ensures a more robust and reliable downstream analysis and modeling process.
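
As one possible follow-up, the sketch below drops duplicate rows and fills missing Income values with the column median using plain pandas. Both choices are assumptions made for illustration; datpro does not prescribe them, and the right strategy depends on your data.

```python
import numpy as np
import pandas as pd

# Made-up data: rows 1 and 2 are exact duplicates, row 3 has a missing Income.
toy = pd.DataFrame({
    "Age": [25, 40, 40, 31, 52],
    "Income": [50000.0, 58000.0, 58000.0, np.nan, 62000.0],
})

# Drop exact duplicate rows first so they do not skew the imputation value.
cleaned = toy.drop_duplicates().reset_index(drop=True)

# Fill remaining gaps with the column median, which is less sensitive
# to outliers than the mean.
cleaned["Income"] = cleaned["Income"].fillna(cleaned["Income"].median())
print(cleaned)
```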

Plotify

plotify() is a function that automatically generates various visualizations for a given Pandas DataFrame. It supports different plot types, including histograms, scatter plots, correlation heatmaps, box plots, and stacked bar charts. The function is designed to handle both numeric and categorical data, making it useful for exploratory data analysis.
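
Choosing a plot type per column requires knowing which columns are numeric and which are categorical. A common pandas idiom for that split is sketched below; whether plotify() uses exactly this approach is an assumption, and the toy frame is made up.

```python
import pandas as pd

# Made-up frame reusing column names from the example dataset.
toy = pd.DataFrame({
    "Age": [66, 65, 59],
    "Spending_Score": [26.4, 20.9, 48.0],
    "Gender": ["Male", "Female", "Male"],
    "Region": ["South", "South", "West"],
})

# Numeric columns suit histograms, density plots, and scatter plots;
# the remaining columns are treated as categorical here.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```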

Function Signature

def plotify(df: pd.DataFrame, plot_types: list = None) -> dict:

Parameters

df (pandas.DataFrame): The dataset for which plots need to be generated. Must be a non-empty DataFrame.
plot_types (list, optional): A list of plot types to generate. If None, all supported plots will be generated.

Returns

A dictionary where keys represent the type of plots generated, and values are the corresponding plot objects.

Raises

ValueError: If an empty DataFrame is provided.
TypeError: If the input is not a Pandas DataFrame.
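
The two documented error conditions can be illustrated with a small stand-in. The helper check_plot_input below is hypothetical and not part of datpro; it only mirrors the checks described above so that the behaviour can be tried without the package.

```python
import pandas as pd


def check_plot_input(df):
    """Hypothetical helper (not part of datpro) mirroring the documented checks."""
    if not isinstance(df, pd.DataFrame):
        raise TypeError("Input must be a pandas DataFrame.")
    if df.empty:
        raise ValueError("Input DataFrame must not be empty.")
    return df


# Each bad input triggers the matching exception type.
for bad_input in ([1, 2, 3], pd.DataFrame()):
    try:
        check_plot_input(bad_input)
    except (TypeError, ValueError) as err:
        print(f"{type(err).__name__}: {err}")
```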

Example Usage

To generate every supported plot for a dataset, call plotify() without the plot_types argument:

plots = dp.plotify(df)
plots

{'histogram_Age': alt.Chart(...),
 'density_Age': alt.Chart(...),
 'histogram_Income': alt.Chart(...),
 'density_Income': alt.Chart(...),
 'histogram_Spending_Score': alt.Chart(...),
 'density_Spending_Score': alt.Chart(...),
 'bar_Gender': alt.Chart(...),
 'bar_Region': alt.Chart(...),
 'scatter_Age_Income': alt.Chart(...),
 'scatter_Age_Spending_Score': alt.Chart(...),
 'scatter_Income_Spending_Score': alt.Chart(...),
 'correlation_heatmap': alt.Chart(...),
 'box_Age_Gender': alt.Chart(...),
 'box_Age_Region': alt.Chart(...),
 'box_Income_Gender': alt.Chart(...),
 'box_Income_Region': alt.Chart(...),
 'box_Spending_Score_Gender': alt.Chart(...),
 'box_Spending_Score_Region': alt.Chart(...),
 'stacked_bar_Gender_Region': alt.Chart(...)}

This generates:

Histograms and density plots for numeric columns like Age, Income, and Spending_Score.
Bar charts for categorical columns like Gender and Region.
Scatter plots for pairwise numeric columns.
A correlation heatmap for numeric columns.
Box plots comparing numeric columns with categorical columns.
Stacked bar charts for pairwise categorical columns.
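
The pairwise keys in the output above follow a regular pattern, which helps when looking up a specific chart. The sketch below rebuilds those key names with itertools; it is inferred from the printed dictionary and is not taken from the datpro source.

```python
from itertools import combinations, product

numeric_cols = ["Age", "Income", "Spending_Score"]
categorical_cols = ["Gender", "Region"]

# One scatter plot per unordered pair of numeric columns.
scatter_keys = [f"scatter_{a}_{b}" for a, b in combinations(numeric_cols, 2)]

# One box plot per (numeric, categorical) combination.
box_keys = [f"box_{n}_{c}" for n, c in product(numeric_cols, categorical_cols)]

# One stacked bar chart per unordered pair of categorical columns.
stacked_keys = [f"stacked_bar_{a}_{b}" for a, b in combinations(categorical_cols, 2)]

print(scatter_keys)
print(box_keys)
print(stacked_keys)
```

With three numeric and two categorical columns this gives three scatter keys, six box keys, and one stacked bar key, matching the dictionary shown above.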

To display a single plot, look it up by its key:

plots['histogram_Age'].show()

Specific Plot Types

To visualize specific plot types, specify them in the plot_types parameter. For example:

# Generate density and bar plots only
plot_density_bar = dp.plotify(df, plot_types=['density', 'bar'])
plot_density_bar

{'density_Age': alt.Chart(...),
 'density_Income': alt.Chart(...),
 'density_Spending_Score': alt.Chart(...),
 'bar_Gender': alt.Chart(...),
 'bar_Region': alt.Chart(...)}

plot_density_bar['density_Spending_Score']

plot_density_bar['bar_Region']

# Generate correlation heatmap
plot_correlation = dp.plotify(df, plot_types=['correlation'])
plot_correlation

{'correlation_heatmap': alt.Chart(...)}

plot_correlation['correlation_heatmap'].show()

Together, these two calls generate:

Density plots for numeric columns like Age, Income, and Spending_Score.
Bar charts for categorical columns like Gender and Region.
A correlation heatmap for numeric columns.

Saving Plots to a Directory

plots_save = dp.plotify(df, save=True, save_path="example_plots", file_prefix="analysis")

This will save the plots in the example_plots directory with filenames starting with analysis_.

plotify() automatically handles missing values by ignoring them in the visualizations. For instance, density plots and histograms will exclude NaN values. Outliers are included in the visualizations, offering insights into their impact on the data distribution.
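
The effect of both behaviours can be seen in a small numeric sketch: missing values are dropped before binning, while an extreme value stays in and stretches the bin range. The data is made up, and NumPy binning is used here only as a stand-in for what a histogram displays; it is not how plotify() builds its charts.

```python
import numpy as np
import pandas as pd

# Made-up incomes with two missing values and one extreme value.
income = pd.Series([40000.0, np.nan, 52000.0, 61000.0, np.nan, 450000.0])

# Drop NaN values first; they carry no position on the axis.
observed = income.dropna()

# Bin the remaining values; the outlier stays in and widens the range.
counts, edges = np.histogram(observed, bins=3)

print(len(income), len(observed))
print(counts.tolist())
```

Three of the four observed values fall in the first bin and the outlier sits alone in the last one, which is the kind of skew a histogram or density plot makes visible.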

Conclusion

The datpro package provides a modular and efficient way to explore and profile your dataset. This walkthrough covered summarize_data(), detect_anomalies(), and plotify(); depending on your analysis goals, you may still need additional cleaning steps, such as handling missing values or outliers.

Feel free to replace the example dataset with your own data and adjust the function calls as needed.