{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Example usage\n", "\n", "Here we will demonstrate how to use the `datpro` package to summarize data, detect anomalies, and create visualizations for a dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import datpro as dp\n", "import pandas as pd\n", "import numpy as np\n", "import altair as alt\n", "from itertools import combinations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load example dataset\n", "We'll use a sample dataset to demonstrate the functionalities of the `datpro` package. The dataset contains demographic and transactional data, with the goal of predicting income based on other features such as age, gender, spending_score, and region.\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Age Income Spending_Score Gender Region
0 66 NaN 26.373678 Male South
1 65 66369.651809 20.906870 Female South
2 59 70764.092278 47.990597 Male West
3 64 41432.315153 31.120625 Female North
4 53 52963.994070 12.016596 Female East
... ... ... ... ... ...
1005 18 62455.037248 22.795113 Female North
1006 24 35361.901205 18.846863 Male South
1007 51 56554.072546 17.076530 Female South
1008 63 52799.136847 42.219961 Male East
1009 42 52727.993826 27.395330 Male East
\n", "

1010 rows × 5 columns

\n", "
" ], "text/plain": [ " Age Income Spending_Score Gender Region\n", "0 66 NaN 26.373678 Male South\n", "1 65 66369.651809 20.906870 Female South\n", "2 59 70764.092278 47.990597 Male West\n", "3 64 41432.315153 31.120625 Female North\n", "4 53 52963.994070 12.016596 Female East\n", "... ... ... ... ... ...\n", "1005 18 62455.037248 22.795113 Female North\n", "1006 24 35361.901205 18.846863 Male South\n", "1007 51 56554.072546 17.076530 Female South\n", "1008 63 52799.136847 42.219961 Male East\n", "1009 42 52727.993826 27.395330 Male East\n", "\n", "[1010 rows x 5 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('../data/example_data.csv')\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this dataset:\n", "\n", "- `Age` is the age of the individual.\n", "\n", "- `Income` is the annual income (our target variable for prediction).\n", "\n", "- `Spending_Score` quantifies spending behavior.\n", "\n", "- `Gender` specifies the gender of the individual.\n", "\n", "- `Region` indicates the geographical region.\n", "\n", "If you'd like to follow along with the same dataset, you can download our example CSV file [here](https://github.com/UBC-MDS/dataprofiler_group-30/blob/main/data/example_data.csv)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summarize data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`summarize_data()` summarizes the numeric columns in the dataset by calculating their minimum, 25th percentile (Q1), median (50th percentile), 75th percentile (Q3), and maximum values." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
min 25% 50% 75% max
Age 18.000000 31.000000 44.000000 56.000000 69.000000
Income 6556.169327 40915.394217 51146.204619 60893.485307 443001.985244
Spending_Score 0.536808 16.880278 26.670824 38.786205 75.010095
\n", "
" ], "text/plain": [ " min 25% 50% 75% \\\n", "Age 18.000000 31.000000 44.000000 56.000000 \n", "Income 6556.169327 40915.394217 51146.204619 60893.485307 \n", "Spending_Score 0.536808 16.880278 26.670824 38.786205 \n", "\n", " max \n", "Age 69.000000 \n", "Income 443001.985244 \n", "Spending_Score 75.010095 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dp.summarize_data(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Detect Anomalies\n", "To detect missing values, outliers, and duplicates, use the `detect_anomalies()` function. This function allows you to analyze a dataset and identify potential issues that may impact data quality and analysis results. By specifying a particular anomaly type, you can focus on specific data integrity concerns." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Detect all anomalies" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'missing_values': {'Income': {'missing_count': 50,\n", " 'missing_percentage': np.float64(4.95)},\n", " 'Spending_Score': {'missing_count': 30,\n", " 'missing_percentage': np.float64(2.97)}},\n", " 'outliers': {'Income': {'outlier_count': 24, 'outlier_percentage': 2.38},\n", " 'Spending_Score': {'outlier_count': 4, 'outlier_percentage': 0.4}},\n", " 'duplicates': {'duplicate_count': np.int64(10),\n", " 'duplicate_percentage': np.float64(0.99)}}" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dp.detect_anomalies(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Detect Specific Anomalies\n", "You can specify an anomaly type to check only for particular data issues:\n", "\n", "- **Missing Values:** `dp.detect_anomalies(df, anomaly_type='missing_values')`\n", "- **Outliers:** `dp.detect_anomalies(df, anomaly_type='outliers')`\n", "- **Duplicates:** `dp.detect_anomalies(df, 
anomaly_type='duplicates')`\n", "\n", "For example, if you only want to check for missing values:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'missing_values': {'Income': {'missing_count': 50,\n", " 'missing_percentage': np.float64(4.95)},\n", " 'Spending_Score': {'missing_count': 30,\n", " 'missing_percentage': np.float64(2.97)}}}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Detect only missing values\n", "dp.detect_anomalies(df, anomaly_type='missing_values')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results from `detect_anomalies()` complement those of `summarize_data()` by identifying specific quality issues that require attention. For instance, anomalies such as missing data can guide imputation strategies, while outliers and duplicates may impact model accuracy if not properly addressed. Addressing these issues early ensures a more robust and reliable downstream analysis and modeling process." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plotify" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`plotify()` is a function that automatically generates various visualizations for a given Pandas DataFrame. It supports different plot types, including histograms, scatter plots, correlation heatmaps, box plots, and stacked bar charts. The function is designed to handle both numeric and categorical data, making it useful for exploratory data analysis." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Function Signature\n", "```python\n", "def plotify(df: pd.DataFrame, plot_types: list = None) -> dict:\n", "```\n", "**Parameters**\n", "\n", "- df (pandas.DataFrame): The dataset for which plots need to be generated. Must be a non-empty DataFrame.\n", "- plot_types (list, optional): A list of plot types to generate. 
If None, all supported plots will be generated.\n", "\n", "**Returns**\n", "\n", "- A dictionary whose keys name each generated plot (for example, `histogram_Age`) and whose values are the corresponding plot objects.\n", "\n", "**Raises**\n", "\n", "- ValueError: If an empty DataFrame is provided.\n", "- TypeError: If the input is not a Pandas DataFrame." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example Usage" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To generate all the plots for a dataset, call `plotify()` with only the DataFrame:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'histogram_Age': alt.Chart(...),\n", " 'density_Age': alt.Chart(...),\n", " 'histogram_Income': alt.Chart(...),\n", " 'density_Income': alt.Chart(...),\n", " 'histogram_Spending_Score': alt.Chart(...),\n", " 'density_Spending_Score': alt.Chart(...),\n", " 'bar_Gender': alt.Chart(...),\n", " 'bar_Region': alt.Chart(...),\n", " 'scatter_Age_Income': alt.Chart(...),\n", " 'scatter_Age_Spending_Score': alt.Chart(...),\n", " 'scatter_Income_Spending_Score': alt.Chart(...),\n", " 'correlation_heatmap': alt.Chart(...),\n", " 'box_Age_Gender': alt.Chart(...),\n", " 'box_Age_Region': alt.Chart(...),\n", " 'box_Income_Gender': alt.Chart(...),\n", " 'box_Income_Region': alt.Chart(...),\n", " 'box_Spending_Score_Gender': alt.Chart(...),\n", " 'box_Spending_Score_Region': alt.Chart(...),\n", " 'stacked_bar_Gender_Region': alt.Chart(...)}" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "plots = dp.plotify(df)\n", "plots" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This generates:\n", "\n", "- Histograms and density plots for numeric columns like Age, Income, and Spending_Score.\n", "- Bar charts for categorical columns like Gender and Region.\n", "- Scatter plots for pairwise numeric columns.\n", "- A correlation heatmap for numeric columns.\n", "- Box plots comparing numeric columns with 
categorical columns.\n", "- Stacked bar charts for pairwise categorical columns." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To visualize the plots:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plots['histogram_Age'].show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Specific Plot Types" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To generate only specific plot types, pass them in the `plot_types` parameter. For example:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'density_Age': alt.Chart(...),\n", " 'density_Income': alt.Chart(...),\n", " 'density_Spending_Score': alt.Chart(...),\n", " 'bar_Gender': alt.Chart(...),\n", " 'bar_Region': alt.Chart(...)}" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Generate density and bar plots only\n", "plot_density_bar = dp.plotify(df, plot_types=['density', 'bar'])\n", "plot_density_bar" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "plot_density_bar['density_Spending_Score']" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "plot_density_bar['bar_Region']" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'correlation_heatmap': alt.Chart(...)}" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Generate correlation heatmap\n", "plot_correlation = dp.plotify(df, plot_types=['correlation'])\n", "plot_correlation" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plot_correlation['correlation_heatmap'].show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The calls above generate:\n", "\n", "- Density plots for numeric columns like Age, Income, and Spending_Score.\n", "- Bar charts for categorical columns like Gender and Region.\n", "- A correlation heatmap for numeric columns." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Saving Plots to a Directory" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "plots_save = dp.plotify(df, save=True, save_path=\"example_plots\", file_prefix=\"analysis\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This will save the plots in the `example_plots` directory with filenames starting with `analysis_`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`plotify()` automatically handles missing values by ignoring them in the visualizations. For instance, density plots and histograms will exclude NaN values. Outliers are included in the visualizations, so their effect on the data distribution remains visible." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "The `datpro` package provides a modular and efficient way to explore and profile your dataset. This walkthrough covered its main functions; additional cleaning steps, such as handling missing values or outliers, may still be needed depending on your analysis goals.\n", "\n", "Feel free to replace the example dataset with your own data and adjust the function calls as needed." 
] } ], "metadata": { "kernelspec": { "display_name": "dataprofiler", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.0" } }, "nbformat": 4, "nbformat_minor": 4 }