Example usage
Here we will demonstrate how to use the datpro package to summarize data, detect anomalies, and create visualizations for a dataset.
Imports
import datpro as dp
import pandas as pd
import numpy as np
import altair as alt
from itertools import combinations
Load example dataset
We’ll use a sample dataset to demonstrate the functionality of the datpro package. The dataset contains demographic and transactional data, with the goal of predicting Income from the other features: Age, Gender, Spending_Score, and Region.
df = pd.read_csv('../data/example_data.csv')
df
| | Age | Income | Spending_Score | Gender | Region |
|---|---|---|---|---|---|
| 0 | 66 | NaN | 26.373678 | Male | South |
| 1 | 65 | 66369.651809 | 20.906870 | Female | South |
| 2 | 59 | 70764.092278 | 47.990597 | Male | West |
| 3 | 64 | 41432.315153 | 31.120625 | Female | North |
| 4 | 53 | 52963.994070 | 12.016596 | Female | East |
| ... | ... | ... | ... | ... | ... |
| 1005 | 18 | 62455.037248 | 22.795113 | Female | North |
| 1006 | 24 | 35361.901205 | 18.846863 | Male | South |
| 1007 | 51 | 56554.072546 | 17.076530 | Female | South |
| 1008 | 63 | 52799.136847 | 42.219961 | Male | East |
| 1009 | 42 | 52727.993826 | 27.395330 | Male | East |
1010 rows × 5 columns
In this dataset:
Age is the age of the individual.
Income is the annual income (our target variable for prediction).
Spending_Score quantifies spending behavior.
Gender specifies the gender of the individual.
Region indicates the geographical region.
If you’d like to follow along with the same dataset, you can download our example CSV file here.
Summarize data
Use summarize_data() to summarize the numeric columns in the dataset. It reports the minimum, 25th percentile (Q1), median (50th percentile), 75th percentile (Q3), and maximum of each column.
dp.summarize_data(df)
| | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|
| Age | 18.000000 | 31.000000 | 44.000000 | 56.000000 | 69.000000 |
| Income | 6556.169327 | 40915.394217 | 51146.204619 | 60893.485307 | 443001.985244 |
| Spending_Score | 0.536808 | 16.880278 | 26.670824 | 38.786205 | 75.010095 |
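For comparison, a similar five-number summary can be produced with plain pandas. This is a minimal sketch, not datpro's implementation: the five_number_summary helper and the toy DataFrame are hypothetical and exist only for illustration.

```python
import pandas as pd


def five_number_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: min, quartiles, and max for each numeric column."""
    numeric = df.select_dtypes(include="number")
    # describe() returns the statistics as rows; keep five of them and
    # transpose so that each original column becomes one row.
    return numeric.describe().loc[["min", "25%", "50%", "75%", "max"]].T


# Toy data (made up for this sketch); NaN values are skipped by describe().
toy = pd.DataFrame(
    {
        "Age": [20, 30, 40, 50, 60],
        "Gender": ["M", "F", "M", "F", "M"],
        "Income": [10.0, 20.0, None, 40.0, 50.0],
    }
)

summary = five_number_summary(toy)
print(summary)
```

Non-numeric columns such as Gender are dropped by select_dtypes before the statistics are computed.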
Detect Anomalies
To detect missing values, outliers, and duplicates, use the detect_anomalies() function. This function allows you to analyze a dataset and identify potential issues that may impact data quality and analysis results. By specifying a particular anomaly type, you can focus on specific data integrity concerns.
Detect all anomalies
dp.detect_anomalies(df)
{'missing_values': {'Income': {'missing_count': 50,
'missing_percentage': np.float64(4.95)},
'Spending_Score': {'missing_count': 30,
'missing_percentage': np.float64(2.97)}},
'outliers': {'Income': {'outlier_count': 24, 'outlier_percentage': 2.38},
'Spending_Score': {'outlier_count': 4, 'outlier_percentage': 0.4}},
'duplicates': {'duplicate_count': np.int64(10),
'duplicate_percentage': np.float64(0.99)}}
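To make the structure of this report concrete, the sketch below builds a similar dictionary with plain pandas. It is not datpro's implementation: sketch_detect is a hypothetical helper, the toy data is made up, and the 1.5 × IQR outlier rule is an assumption, since this page does not state which rule detect_anomalies() uses.

```python
import pandas as pd


def sketch_detect(df: pd.DataFrame) -> dict:
    """Hypothetical stand-in: count missing values, IQR outliers, duplicates."""
    n_rows = len(df)
    report = {"missing_values": {}, "outliers": {}}

    # Missing values: report only the columns that contain any.
    for col in df.columns:
        missing = int(df[col].isna().sum())
        if missing:
            report["missing_values"][col] = {
                "missing_count": missing,
                "missing_percentage": round(100 * missing / n_rows, 2),
            }

    # Outliers: ASSUMED rule, values beyond 1.5 * IQR from the quartiles.
    for col in df.select_dtypes(include="number").columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
        count = int(mask.sum())
        if count:
            report["outliers"][col] = {
                "outlier_count": count,
                "outlier_percentage": round(100 * count / n_rows, 2),
            }

    # Duplicates: rows identical to an earlier row.
    dup = int(df.duplicated().sum())
    report["duplicates"] = {
        "duplicate_count": dup,
        "duplicate_percentage": round(100 * dup / n_rows, 2),
    }
    return report


# Toy data (made up): one missing value, one extreme value, one repeated row.
toy = pd.DataFrame(
    {
        "Income": [50.0, 52.0, 51.0, 49.0, 500.0, None, 50.0],
        "Region": ["N", "S", "E", "W", "N", "S", "N"],
    }
)
report = sketch_detect(toy)
print(report)
```

Percentages here are taken relative to the total number of rows, which is consistent with the output above (for example, 50 missing values out of 1010 rows is 4.95%).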
Detect Specific Anomalies
You can specify an anomaly type to check only for particular data issues:
Missing Values:
dp.detect_anomalies(df, anomaly_type='missing_values')
Outliers:
dp.detect_anomalies(df, anomaly_type='outliers')
Duplicates:
dp.detect_anomalies(df, anomaly_type='duplicates')
For example, if you only want to check for missing values:
# Detect only missing values
dp.detect_anomalies(df, anomaly_type='missing_values')
{'missing_values': {'Income': {'missing_count': 50,
'missing_percentage': np.float64(4.95)},
'Spending_Score': {'missing_count': 30,
'missing_percentage': np.float64(2.97)}}}
The results from detect_anomalies() complement those of summarize_data() by identifying specific quality issues that require attention. For instance, anomalies such as missing data can guide imputation strategies, while outliers and duplicates may impact model accuracy if not properly addressed. Addressing these issues early ensures a more robust and reliable downstream analysis and modeling process.
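As one possible follow-up to such a report, the sketch below drops duplicate rows and imputes missing values with plain pandas. The toy data is made up, and median imputation is only an illustrative choice, not a recommendation from datpro; the right strategy depends on the dataset.

```python
import pandas as pd

# Toy data (made up): one missing Income and one repeated row.
toy = pd.DataFrame(
    {
        "Income": [50.0, 52.0, None, 49.0, 50.0],
        "Region": ["N", "S", "E", "W", "N"],
    }
)

# 1. Drop exact duplicate rows, keeping the first occurrence.
cleaned = toy.drop_duplicates().reset_index(drop=True)

# 2. Fill missing Income values with the column median (an assumed strategy).
median_income = cleaned["Income"].median()
cleaned["Income"] = cleaned["Income"].fillna(median_income)

print(cleaned)
```

Removing duplicates before computing the median keeps repeated rows from skewing the imputed value.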
Plotify
plotify() is a function that automatically generates various visualizations for a given Pandas DataFrame. It supports different plot types, including histograms, scatter plots, correlation heatmaps, box plots, and stacked bar charts. The function is designed to handle both numeric and categorical data, making it useful for exploratory data analysis.
Function Signature
def plotify(df: pd.DataFrame, plot_types: list = None) -> dict:
Parameters
df (pandas.DataFrame): The dataset for which plots need to be generated. Must be a non-empty DataFrame.
plot_types (list, optional): A list of plot types to generate. If None, all supported plots will be generated.
Returns
A dictionary where keys represent the type of plots generated, and values are the corresponding plot objects.
Raises
ValueError: If an empty DataFrame is provided.
TypeError: If the input is not a Pandas DataFrame.
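The documented error behavior can be illustrated with a small validation sketch. The check_plot_input helper below is hypothetical and only mirrors the contract described above; it is not taken from datpro's source, and the error messages are made up.

```python
import pandas as pd


def check_plot_input(df) -> None:
    """Hypothetical helper mirroring the documented TypeError/ValueError rules."""
    if not isinstance(df, pd.DataFrame):
        raise TypeError("Input must be a pandas DataFrame.")
    if df.empty:
        raise ValueError("Input DataFrame must not be empty.")


# A list is not a DataFrame; an empty DataFrame has no data to plot.
raised = []
for bad_input in ([1, 2, 3], pd.DataFrame()):
    try:
        check_plot_input(bad_input)
    except (TypeError, ValueError) as err:
        raised.append(type(err).__name__)
        print(f"{type(err).__name__}: {err}")
```

Checking the type before checking for emptiness avoids calling DataFrame attributes on an object that is not a DataFrame.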
Example Usage
To generate all supported plots for a dataset, call plotify() with only the DataFrame:
plots = dp.plotify(df)
plots
{'histogram_Age': alt.Chart(...),
'density_Age': alt.Chart(...),
'histogram_Income': alt.Chart(...),
'density_Income': alt.Chart(...),
'histogram_Spending_Score': alt.Chart(...),
'density_Spending_Score': alt.Chart(...),
'bar_Gender': alt.Chart(...),
'bar_Region': alt.Chart(...),
'scatter_Age_Income': alt.Chart(...),
'scatter_Age_Spending_Score': alt.Chart(...),
'scatter_Income_Spending_Score': alt.Chart(...),
'correlation_heatmap': alt.Chart(...),
'box_Age_Gender': alt.Chart(...),
'box_Age_Region': alt.Chart(...),
'box_Income_Gender': alt.Chart(...),
'box_Income_Region': alt.Chart(...),
'box_Spending_Score_Gender': alt.Chart(...),
'box_Spending_Score_Region': alt.Chart(...),
'stacked_bar_Gender_Region': alt.Chart(...)}
This generates:
Histograms and density plots for numeric columns like Age, Income, and Spending_Score.
Bar charts for categorical columns like Gender and Region.
Scatter plots for pairwise numeric columns.
A correlation heatmap for numeric columns.
Box plots comparing numeric columns with categorical columns.
Stacked bar charts for pairwise categorical columns.
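The dictionary keys above follow a regular naming pattern, which the sketch below reproduces without creating any charts. The expected_plot_keys helper is hypothetical, and the pattern is inferred from the example output rather than from datpro's source, so treat it as an assumption.

```python
from itertools import combinations, product

import pandas as pd


def expected_plot_keys(df: pd.DataFrame) -> list:
    """Hypothetical helper: list the keys that the naming pattern above implies."""
    numeric = list(df.select_dtypes(include="number").columns)
    categorical = list(df.select_dtypes(exclude="number").columns)

    keys = []
    # One histogram and one density plot per numeric column.
    for col in numeric:
        keys += [f"histogram_{col}", f"density_{col}"]
    # One bar chart per categorical column.
    keys += [f"bar_{col}" for col in categorical]
    # One scatter plot per unordered pair of numeric columns.
    keys += [f"scatter_{a}_{b}" for a, b in combinations(numeric, 2)]
    # A single heatmap covering all numeric columns.
    keys.append("correlation_heatmap")
    # One box plot per (numeric, categorical) pair.
    keys += [f"box_{n}_{c}" for n, c in product(numeric, categorical)]
    # One stacked bar chart per unordered pair of categorical columns.
    keys += [f"stacked_bar_{a}_{b}" for a, b in combinations(categorical, 2)]
    return keys


# Toy frame with the same column names as the example dataset (values made up).
toy = pd.DataFrame(
    {
        "Age": [25, 40],
        "Income": [50000.0, 62000.0],
        "Spending_Score": [30.5, 12.0],
        "Gender": ["Male", "Female"],
        "Region": ["South", "West"],
    }
)
keys = expected_plot_keys(toy)
print(len(keys), keys)
```

With three numeric and two categorical columns, the pattern yields 19 keys, the same number as in the output above.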
To visualize the plots:
plots['histogram_Age'].show()
Specific Plot Types
To visualize specific plot types, specify them in the plot_types parameter. For example:
# Generate density and bar plots only
plot_density_bar = dp.plotify(df, plot_types=['density', 'bar'])
plot_density_bar
{'density_Age': alt.Chart(...),
'density_Income': alt.Chart(...),
'density_Spending_Score': alt.Chart(...),
'bar_Gender': alt.Chart(...),
'bar_Region': alt.Chart(...)}
plot_density_bar['density_Spending_Score']
plot_density_bar['bar_Region']
# Generate correlation heatmap
plot_correlation = dp.plotify(df, plot_types=['correlation'])
plot_correlation
{'correlation_heatmap': alt.Chart(...)}
plot_correlation['correlation_heatmap'].show()
Together, these two calls generate:
Density plots for numeric columns like Age, Income, and Spending_Score.
Bar charts for categorical columns like Gender and Region.
A correlation heatmap for numeric columns.
Saving Plots to a Directory
plots_save = dp.plotify(df, save=True, save_path="example_plots", file_prefix="analysis")
This will save the plots in the example_plots directory with filenames starting with analysis_.
plotify() automatically handles missing values by ignoring them in the visualizations. For instance, density plots and histograms will exclude NaN values. Outliers are included in the visualizations, offering insights into their impact on the data distribution.
Conclusion
The datpro package provides a modular and efficient way to explore and profile your dataset. This walkthrough covered summarizing data, detecting anomalies, and generating plots; additional cleaning steps, such as handling missing values or outliers, may still be needed depending on your analysis goals.
Feel free to replace the example dataset with your own data and adjust the function calls as needed.