datpro

Submodules

datpro.datpro

Attributes

__version__

Functions

`detect_anomalies`(→ Dict[str, Union[Dict[str, Dict[str, ...)	Detect anomalies in a dataframe, including missing values, outliers, and duplicates.
`plotify`(→ Dict[str, altair.Chart])	Visualize a DataFrame by generating specified plots based on column datatypes.
`summarize_data`(→ pandas.DataFrame)	Summarize numeric columns in a given DataFrame by calculating key statistical metrics.

Package Contents

datpro.__version__

datpro.detect_anomalies(df: pandas.DataFrame, anomaly_type: str | None = None) → Dict[str, Dict[str, Dict[str, int | float]] | str]

Detect anomalies in a dataframe, including missing values, outliers, and duplicates.

Parameters:

df (pandas.DataFrame) – The input dataframe to analyze.
anomaly_type (str, optional) – Specify which anomaly to check (‘missing_values’, ‘outliers’, or ‘duplicates’). If None, all anomaly types will be checked.

Returns:

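A dictionary keyed by anomaly type, limited to the requested anomaly_type when one is given. Per the type signature, each value is either a nested per-column dictionary of counts and percentages or a string.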

Return type:

dict

Example

>>> import numpy as np
>>> import pandas as pd
>>> from datpro import detect_anomalies
>>> data = {'A': [1, 2, np.nan, 4], 'B': [100, 200, 300, 400], 'C': [1, 1, 1, 100]}
>>> df = pd.DataFrame(data)
>>> detect_anomalies(df, anomaly_type='missing_values')
{'missing_values': {'A': {'missing_count': 1, 'missing_percentage': 25.0}}}
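The shape of the result above can be reproduced with plain pandas. The following is a minimal sketch, not the library's implementation: `missing_value_report` is a hypothetical helper, and the outlier rule is left out because this reference does not state which one `detect_anomalies` applies.

```python
import numpy as np
import pandas as pd


def missing_value_report(df: pd.DataFrame) -> dict:
    """Hypothetical helper: per-column missing counts and percentages."""
    report = {}
    for column in df.columns:
        missing_count = int(df[column].isna().sum())
        if missing_count:  # only report columns that actually have gaps
            report[column] = {
                "missing_count": missing_count,
                "missing_percentage": missing_count / len(df) * 100,
            }
    return report


df = pd.DataFrame(
    {"A": [1, 2, np.nan, 4], "B": [100, 200, 300, 400], "C": [1, 1, 1, 100]}
)

missing = missing_value_report(df)
# Rows that exactly repeat an earlier row count as duplicates here (an assumption).
duplicate_rows = int(df.duplicated().sum())

print(missing)
print(duplicate_rows)
```

Only column A appears in `missing`, because B and C have no gaps; this matches the per-column layout shown in the example output.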

datpro.plotify(df: pandas.DataFrame, plot_types: List[str] | None = None, save: bool = False, save_path: str = 'plots', file_prefix: str = 'plot') → Dict[str, altair.Chart]

Visualize a DataFrame by generating specified plots based on column datatypes.

Parameters:

df (pandas.DataFrame) – The DataFrame containing the data to be visualized.
plot_types (list of str, optional) – A list of plot types to generate. If None, all plot types are generated. Available options:
- ‘histogram’: histogram for numeric columns.
- ‘density’: density plot for numeric columns.
- ‘bar’: bar chart for categorical columns.
- ‘scatter’: scatter plots for pairwise numeric columns.
- ‘correlation’: correlation heatmap for numeric columns.
- ‘box’: box plots for numeric vs categorical columns.
- ‘stacked_bar’: stacked bar charts for pairwise categorical columns.
save (bool, optional) – If True, saves the plots to the specified path. Default is False.
save_path (str, optional) – The directory where plots should be saved. Default is ‘plots’.
file_prefix (str, optional) – The prefix for saved plot filenames. Default is ‘plot’.

Returns:

A dictionary where keys are plot names and values are Altair Chart objects.

Return type:

dict

Raises:

TypeError – If the input is not a pandas DataFrame.
ValueError – If the input DataFrame is empty.

Notes

Numeric columns are those of types ‘int64’ and ‘float64’.
Categorical columns are those of types ‘object’, ‘category’, and ‘bool’.
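The dtype rules in these notes can be checked on a DataFrame before calling plotify. This is a rough sketch of such a check, assuming the split is made on the dtype name alone; the constants `NUMERIC_DTYPES` and `CATEGORICAL_DTYPES` are illustrative names, not part of datpro.

```python
import pandas as pd

# Dtype names taken from the notes above; the constant names are hypothetical.
NUMERIC_DTYPES = {"int64", "float64"}
CATEGORICAL_DTYPES = {"object", "category", "bool"}

df = pd.DataFrame(
    {
        "A": pd.Series([1, 2, 3, 4], dtype="int64"),
        "B": pd.Series(["x", "y", "x", "y"], dtype="object"),
        "C": pd.Series([0.5, 1.5, 2.5, 3.5], dtype="float64"),
        "D": pd.Series([True, False, True, False], dtype="bool"),
        "E": pd.Series(["lo", "hi", "lo", "hi"], dtype="category"),
    }
)

# Classify each column by the string name of its dtype.
numeric_cols = [c for c in df.columns if str(df[c].dtype) in NUMERIC_DTYPES]
categorical_cols = [c for c in df.columns if str(df[c].dtype) in CATEGORICAL_DTYPES]

print(numeric_cols)
print(categorical_cols)
```

A column with any other dtype (for example ‘int32’ or ‘datetime64[ns]’) falls into neither list under these rules, so it may be worth casting such columns first.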

Examples

>>> import pandas as pd
>>> from datpro import plotify
>>> df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': ['x', 'y', 'x', 'y']})
>>> charts = plotify(df, plot_types=['histogram', 'bar'])
>>> charts['histogram_A'].show()

datpro.summarize_data(df: pandas.DataFrame) → pandas.DataFrame

Summarize numeric columns in a given DataFrame by calculating key statistical metrics.

This function automatically detects numeric columns in the provided DataFrame and returns a summary DataFrame containing the minimum, 25th percentile (Q1), median (50th percentile), 75th percentile (Q3), and maximum values for each numeric column.

Parameters:

df (pandas.DataFrame) – The input DataFrame containing data to be summarized.

Returns:

A DataFrame where each row corresponds to a numeric column in the input DataFrame, and the columns represent the calculated statistics: min, 25%, 50% (median), 75%, and max.

Return type:

pandas.DataFrame

Example

>>> import pandas as pd
>>> import numpy as np
>>> from datpro import summarize_data
>>> data = {
...     "A": [1, 2, np.nan, 4],
...     "B": [100, 200, 300, 400],
...     "C": [1, 1, 1, 100]
... }
>>> df = pd.DataFrame(data)
>>> summarize_data(df)
     min    25%    50%     75%    max
A    1.0    1.5    2.0    3.00    4.0
B  100.0  175.0  250.0  325.00  400.0
C    1.0    1.0    1.0   25.75  100.0
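The five statistics can also be computed directly with pandas, which helps when checking a result by hand. The sketch below is an illustration under stated assumptions, not the function's source: it assumes pandas' default linear interpolation for quantiles and that missing values are skipped, both of which are consistent with rows A and B above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, np.nan, 4], "B": [100, 200, 300, 400]})

# Keep numeric columns only; NaN values are ignored by min, max and quantile.
numeric = df.select_dtypes(include="number")

summary = pd.DataFrame(
    {
        "min": numeric.min(),
        "25%": numeric.quantile(0.25),
        "50%": numeric.quantile(0.50),
        "75%": numeric.quantile(0.75),
        "max": numeric.max(),
    }
)

print(summary)
```

Each row of `summary` corresponds to one numeric input column, mirroring the layout described under Returns.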