datpro
Submodules
Attributes
Functions
detect_anomalies | Detect anomalies in a dataframe, including missing values, outliers, and duplicates.
plotify | Visualize a DataFrame by generating specified plots based on column datatypes.
summarize_data | Summarizes numeric columns in a given DataFrame by calculating key statistical metrics.
Package Contents
- datpro.__version__
- datpro.detect_anomalies(df: pandas.DataFrame, anomaly_type: str | None = None) → Dict[str, Dict[str, Dict[str, int | float]] | str]
Detect anomalies in a dataframe, including missing values, outliers, and duplicates.
- Parameters:
df (pandas.DataFrame) – The input dataframe to analyze.
anomaly_type (str, optional) – The anomaly type to check: ‘missing_values’, ‘outliers’, or ‘duplicates’. If None, all anomaly types are checked.
- Returns:
A dictionary containing detected anomalies based on the specified anomaly_type.
- Return type:
dict
Example
>>> import numpy as np
>>> import pandas as pd
>>> from datpro import detect_anomalies
>>> data = {'A': [1, 2, np.nan, 4], 'B': [100, 200, 300, 400], 'C': [1, 1, 1, 100]}
>>> df = pd.DataFrame(data)
>>> detect_anomalies(df, anomaly_type='missing_values')
{'missing_values': {'A': {'missing_count': 1, 'missing_percentage': 25.0}}}
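The return shape in the example above can be reproduced with plain pandas. The following is a minimal sketch, not datpro's actual implementation: the helper name `missing_value_report` is hypothetical, and it assumes that only columns containing at least one missing value are reported, as the example output suggests.

```python
import numpy as np
import pandas as pd


def missing_value_report(df: pd.DataFrame) -> dict:
    """Hypothetical helper: count missing values per column."""
    report = {}
    for column in df.columns:
        missing = int(df[column].isna().sum())
        if missing == 0:
            continue  # assumption: columns without missing values are omitted
        report[column] = {
            "missing_count": missing,
            "missing_percentage": round(100 * missing / len(df), 2),
        }
    return {"missing_values": report}


df = pd.DataFrame(
    {"A": [1, 2, np.nan, 4], "B": [100, 200, 300, 400], "C": [1, 1, 1, 100]}
)
result = missing_value_report(df)
print(result)
```

Only column A appears in the result because B and C contain no missing values.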
- datpro.plotify(df: pandas.DataFrame, plot_types: List[str] | None = None, save: bool = False, save_path: str = 'plots', file_prefix: str = 'plot') → Dict[str, altair.Chart]
Visualize a DataFrame by generating specified plots based on column datatypes.
- Parameters:
df (pandas.DataFrame) – The DataFrame containing the data to be visualized.
plot_types (list of str, optional) – A list of plot types to generate. If None, all plot types are generated. Available options:
- ‘histogram’ : Plot a histogram for numeric columns.
- ‘density’ : Plot a density plot for numeric columns.
- ‘bar’ : Plot a bar chart for categorical columns.
- ‘scatter’ : Plot scatter plots for pairwise numeric columns.
- ‘correlation’ : Plot a correlation heatmap for numeric columns.
- ‘box’ : Plot box plots for numeric vs categorical columns.
- ‘stacked_bar’ : Plot stacked bar charts for pairwise categorical columns.
save (bool, optional) – If True, saves the plots to the specified path. Default is False.
save_path (str, optional) – The directory where plots should be saved. Default is ‘plots’.
file_prefix (str, optional) – The prefix for saved plot filenames. Default is ‘plot’.
- Returns:
A dictionary where keys are plot names and values are Altair Chart objects.
- Return type:
dict
- Raises:
TypeError – If the input is not a pandas DataFrame.
ValueError – If the input DataFrame is empty.
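The two documented exceptions correspond to a simple input check. The sketch below is an illustration only: `validate_frame` and `raised` are hypothetical helpers, and the error messages are assumptions rather than datpro's actual wording.

```python
import pandas as pd


def validate_frame(df) -> None:
    """Hypothetical check mirroring the documented exceptions."""
    if not isinstance(df, pd.DataFrame):
        raise TypeError("Input must be a pandas DataFrame.")  # assumed message
    if df.empty:
        raise ValueError("Input DataFrame is empty.")  # assumed message


def raised(exc_type, value) -> bool:
    """Return True if validating `value` raises `exc_type`."""
    try:
        validate_frame(value)
    except exc_type:
        return True
    return False


checks = {
    "list_input": raised(TypeError, [1, 2, 3]),
    "empty_frame": raised(ValueError, pd.DataFrame()),
    "valid_frame": raised(Exception, pd.DataFrame({"A": [1]})),
}
print(checks)
```

Callers who build DataFrames from user input may want to catch both exception types around a `plotify` call.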
Notes
Numeric columns are those with dtype ‘int64’ or ‘float64’.
Categorical columns are those with dtype ‘object’, ‘category’, or ‘bool’.
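To see which columns fall into each group, the dtype rule above can be applied directly. This is a minimal sketch under a literal reading of the note; `split_columns` and the two constant names are hypothetical and not part of datpro.

```python
import pandas as pd

# Dtype names taken from the note above.
NUMERIC_DTYPES = {"int64", "float64"}
CATEGORICAL_DTYPES = {"object", "category", "bool"}


def split_columns(df: pd.DataFrame):
    """Hypothetical helper: partition column names by dtype name."""
    numeric = [c for c, dt in df.dtypes.items() if str(dt) in NUMERIC_DTYPES]
    categorical = [c for c, dt in df.dtypes.items() if str(dt) in CATEGORICAL_DTYPES]
    return numeric, categorical


df = pd.DataFrame(
    {
        "A": pd.Series([1, 2, 3, 4], dtype="int64"),
        "B": pd.Series(["x", "y", "x", "y"], dtype="object"),
        "C": pd.Series([0.5, 1.5, 2.5, 3.5], dtype="float64"),
        "D": pd.Series([True, False, True, False], dtype="bool"),
        "E": pd.Series([1, 2, 3, 4], dtype="int32"),
    }
)
numeric, categorical = split_columns(df)
print(numeric, categorical)
```

Under this literal reading, column E (dtype int32) lands in neither group, so casting such columns to int64 or float64 before plotting may be worthwhile.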
Examples
>>> import pandas as pd
>>> from datpro import plotify
>>> df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': ['x', 'y', 'x', 'y']})
>>> charts = plotify(df, plot_types=['histogram', 'bar'])
>>> charts['histogram_A'].show()
- datpro.summarize_data(df: pandas.DataFrame) → pandas.DataFrame
Summarizes numeric columns in a given DataFrame by calculating key statistical metrics.
This function automatically detects numeric columns in the provided DataFrame and returns a summary DataFrame containing the minimum, 25th percentile (Q1), median (50th percentile), 75th percentile (Q3), and maximum values for each numeric column.
- Parameters:
df (pandas.DataFrame) – The input DataFrame containing data to be summarized.
- Returns:
A DataFrame where each row corresponds to a numeric column in the input DataFrame, and the columns represent the calculated statistics: min, 25%, 50% (median), 75%, and max.
- Return type:
pandas.DataFrame
Example
>>> import pandas as pd
>>> import numpy as np
>>> from datpro import summarize_data
>>> data = {
...     "A": [1, 2, np.nan, 4],
...     "B": [100, 200, 300, 400],
...     "C": [1, 1, 1, 100]
... }
>>> df = pd.DataFrame(data)
>>> summarize_data(df)
     min    25%    50%    75%    max
A    1.0    1.5    2.0    3.0    4.0
B  100.0  175.0  250.0  325.0  400.0
C    1.0    1.0    1.0   50.5  100.0
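A five-number summary of this shape can also be produced with pandas alone. The sketch below is not datpro's implementation: `five_number_summary` is a hypothetical name, and it assumes pandas' default linear interpolation for quantiles, which may differ from the rule datpro uses. Other interpolation rules can give different quartiles on small samples.

```python
import numpy as np
import pandas as pd


def five_number_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: min, quartiles, and max per numeric column."""
    numeric = df.select_dtypes(include="number")  # drop non-numeric columns
    # describe() skips NaN; transpose so each row is one input column.
    return numeric.describe().T[["min", "25%", "50%", "75%", "max"]]


df = pd.DataFrame(
    {
        "A": [1, 2, np.nan, 4],
        "B": [100, 200, 300, 400],
        "label": ["w", "x", "y", "z"],
    }
)
summary = five_number_summary(df)
print(summary)
```

The non-numeric `label` column is excluded, and the missing value in A is ignored when the quantiles are computed.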