datpro

Submodules

Attributes

__version__

Functions

detect_anomalies(→ Dict[str, Union[Dict[str, Dict[str, ...)

Detect anomalies in a dataframe, including missing values, outliers, and duplicates.

plotify(→ Dict[str, altair.Chart])

Visualize a DataFrame by generating specified plots based on column datatypes.

summarize_data(→ pandas.DataFrame)

Summarizes numeric columns in a given DataFrame by calculating key statistical metrics.

Package Contents

datpro.__version__
datpro.detect_anomalies(df: pandas.DataFrame, anomaly_type: str | None = None) → Dict[str, Dict[str, Dict[str, int | float]] | str]

Detect anomalies in a dataframe, including missing values, outliers, and duplicates.

Parameters:
  • df (pandas.DataFrame) – The input dataframe to analyze.

  • anomaly_type (str, optional) – Specify which anomaly to check (‘missing_values’, ‘outliers’, or ‘duplicates’). If None, all anomaly types will be checked.

Returns:

A dictionary containing detected anomalies based on the specified anomaly_type.

Return type:

dict

Example

>>> import numpy as np
>>> import pandas as pd
>>> from datpro import detect_anomalies
>>> data = {'A': [1, 2, np.nan, 4], 'B': [100, 200, 300, 400], 'C': [1, 1, 1, 100]}
>>> df = pd.DataFrame(data)
>>> detect_anomalies(df, anomaly_type='missing_values')
{'missing_values': {'A': {'missing_count': 1, 'missing_percentage': 25.0}}}
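
The three anomaly types named above can be illustrated with plain pandas. This is a minimal sketch, not datpro's implementation: the helper names `missing_value_report` and `iqr_outlier_counts` are hypothetical, and the 1.5 × IQR outlier rule is an assumption, since the reference does not state which rule the package applies.

```python
import numpy as np
import pandas as pd


def missing_value_report(df: pd.DataFrame) -> dict:
    """Hypothetical helper: count and percentage of NaNs per column."""
    report = {}
    for col in df.columns:
        count = int(df[col].isna().sum())
        if count:  # only report columns that actually contain gaps
            report[col] = {
                "missing_count": count,
                "missing_percentage": round(100 * count / len(df), 2),
            }
    return report


def iqr_outlier_counts(df: pd.DataFrame) -> dict:
    """Hypothetical helper: count values outside 1.5 * IQR (assumed rule)."""
    counts = {}
    for col in df.select_dtypes(include="number").columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        # Comparisons with NaN are False, so missing values are never flagged.
        mask = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
        counts[col] = int(mask.sum())
    return counts


df = pd.DataFrame(
    {"A": [1, 2, np.nan, 4], "B": [100, 200, 300, 400], "C": [1, 1, 1, 100]}
)

missing = missing_value_report(df)
outliers = iqr_outlier_counts(df)
duplicate_rows = int(df.duplicated().sum())  # fully repeated rows
```

The shape of `missing` mirrors the 'missing_values' entry in the example output above; the layout that datpro uses for outliers and duplicates is not documented here, so the other two results are kept as simple counts.
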
datpro.plotify(df: pandas.DataFrame, plot_types: List[str] | None = None, save: bool = False, save_path: str = 'plots', file_prefix: str = 'plot') → Dict[str, altair.Chart]

Visualize a DataFrame by generating specified plots based on column datatypes.

Parameters:
  • df (pandas.DataFrame) – The DataFrame containing the data to be visualized.

  • plot_types (list of str, optional) – A list of plot types to generate. If None, all plot types are generated. Available options:

      - ‘histogram’: histograms for numeric columns.
      - ‘density’: density plots for numeric columns.
      - ‘bar’: bar charts for categorical columns.
      - ‘scatter’: scatter plots for pairs of numeric columns.
      - ‘correlation’: a correlation heatmap for numeric columns.
      - ‘box’: box plots for numeric vs. categorical columns.
      - ‘stacked_bar’: stacked bar charts for pairs of categorical columns.

  • save (bool, optional) – If True, saves the plots to the specified path. Default is False.

  • save_path (str, optional) – The directory where plots should be saved. Default is ‘plots’.

  • file_prefix (str, optional) – The prefix for saved plot filenames. Default is ‘plot’.

Returns:

A dictionary where keys are plot names and values are Altair Chart objects.

Return type:

dict

Raises:
  • TypeError – If the input is not a pandas DataFrame.

  • ValueError – If the input DataFrame is empty.

Notes

  • Numeric columns are those with dtype ‘int64’ or ‘float64’.

  • Categorical columns are those of types ‘object’, ‘category’, and ‘bool’.
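
The dtype split described in these notes can be reproduced with `DataFrame.select_dtypes`. The sketch below only illustrates the classification; `split_columns` and the two constant names are hypothetical and are not part of datpro.

```python
import pandas as pd

# Dtype groups as listed in the notes above.
NUMERIC_DTYPES = ["int64", "float64"]
CATEGORICAL_DTYPES = ["object", "category", "bool"]


def split_columns(df: pd.DataFrame) -> tuple:
    """Hypothetical helper: return (numeric, categorical) column names."""
    numeric = list(df.select_dtypes(include=NUMERIC_DTYPES).columns)
    categorical = list(df.select_dtypes(include=CATEGORICAL_DTYPES).columns)
    return numeric, categorical


df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4],
        # dtype is set explicitly so the column is 'object' on any pandas version
        "B": pd.Series(["x", "y", "x", "y"], dtype="object"),
        "flag": [True, False, True, False],
        "score": [0.5, 1.5, 2.5, 3.5],
    }
)

numeric_cols, categorical_cols = split_columns(df)
```

Under this split, columns with other dtypes (for example ‘int32’ or ‘datetime64[ns]’) fall into neither group, so casting them first may be necessary before plotting.
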

Examples

>>> import pandas as pd
>>> from datpro import plotify
>>> df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': ['x', 'y', 'x', 'y']})
>>> charts = plotify(df, plot_types=['histogram', 'bar'])
>>> charts['histogram_A'].show()
datpro.summarize_data(df: pandas.DataFrame) → pandas.DataFrame

Summarizes numeric columns in a given DataFrame by calculating key statistical metrics.

This function automatically detects numeric columns in the provided DataFrame and returns a summary DataFrame containing the minimum, 25th percentile (Q1), median (50th percentile), 75th percentile (Q3), and maximum values for each numeric column.

Parameters:

df (pandas.DataFrame) – The input DataFrame containing data to be summarized.

Returns:

A DataFrame where each row corresponds to a numeric column in the input DataFrame, and the columns represent the calculated statistics: min, 25%, 50% (median), 75%, and max.

Return type:

pandas.DataFrame

Example

>>> import numpy as np
>>> import pandas as pd
>>> from datpro import summarize_data
>>> data = {
...     "A": [1, 2, np.nan, 4],
...     "B": [100, 200, 300, 400],
...     "C": [1, 1, 1, 100]
... }
>>> df = pd.DataFrame(data)
>>> summarize_data(df)
     min    25%    50%     75%    max
A    1.0    1.5    2.0    3.00    4.0
B  100.0  175.0  250.0  325.00  400.0
C    1.0    1.0    1.0   25.75  100.0
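
For readers who want to see how such a summary can be assembled, the following is a minimal sketch in plain pandas. It assumes pandas' default linear interpolation for quantiles and is not taken from datpro's source; `five_number_summary` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def five_number_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: min, quartiles, and max per numeric column."""
    numeric = df.select_dtypes(include="number")
    # quantile() skips NaN values; q=0 is the minimum and q=1 the maximum.
    summary = numeric.quantile([0, 0.25, 0.5, 0.75, 1.0]).T
    summary.columns = ["min", "25%", "50%", "75%", "max"]
    return summary


df = pd.DataFrame(
    {"A": [1, 2, np.nan, 4], "B": [100, 200, 300, 400], "C": [1, 1, 1, 100]}
)
summary = five_number_summary(df)
```

Transposing the quantile result puts one input column on each row, which matches the layout described under Returns.
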