datpro
======

.. py:module:: datpro


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/datpro/datpro/index


Attributes
----------

.. autoapisummary::

   datpro.__version__


Functions
---------

.. autoapisummary::

   datpro.detect_anomalies
   datpro.plotify
   datpro.summarize_data


Package Contents
----------------

.. py:data:: __version__

.. py:function:: detect_anomalies(df: pandas.DataFrame, anomaly_type: Optional[str] = None) -> Dict[str, Union[Dict[str, Dict[str, Union[int, float]]], str]]

   Detect anomalies in a dataframe, including missing values, outliers, and duplicates.

   :param df: The input dataframe to analyze.
   :type df: pandas.DataFrame
   :param anomaly_type: Specify which anomaly to check ('missing_values', 'outliers', or 'duplicates').
                        If None, all anomaly types will be checked.
   :type anomaly_type: str, optional

   :returns: A dictionary containing detected anomalies based on the specified anomaly_type.
   :rtype: dict

   .. rubric:: Example

   >>> import numpy as np
   >>> import pandas as pd
   >>> data = {'A': [1, 2, np.nan, 4], 'B': [100, 200, 300, 400], 'C': [1, 1, 1, 100]}
   >>> df = pd.DataFrame(data)
   >>> detect_anomalies(df, anomaly_type='missing_values')
   {'missing_values': {'A': {'missing_count': 1, 'missing_percentage': 25.0}}}


.. py:function:: plotify(df: pandas.DataFrame, plot_types: Optional[List[str]] = None, save: bool = False, save_path: str = 'plots', file_prefix: str = 'plot') -> Dict[str, altair.Chart]

   Visualize a DataFrame by generating specified plots based on column datatypes.

   :param df: The DataFrame containing the data to be visualized.
   :type df: pandas.DataFrame
   :param plot_types: A list of plot types to generate. Available options include:

      - 'histogram' : Plot a histogram for numeric columns.
      - 'density' : Plot a density plot for numeric columns.
      - 'bar' : Plot a bar chart for categorical columns.
      - 'scatter' : Plot scatter plots for pairwise numeric columns.
      - 'correlation' : Plot a correlation heatmap for numeric columns.
      - 'box' : Plot box plots for numeric vs categorical columns.
      - 'stacked_bar' : Plot stacked bar charts for pairwise categorical columns.

      If None, all plot types are generated by default.
   :type plot_types: list of str, optional
   :param save: If True, saves the plots to the specified path. Default is False.
   :type save: bool, optional
   :param save_path: The directory where plots should be saved. Default is 'plots'.
   :type save_path: str, optional
   :param file_prefix: The prefix for saved plot filenames. Default is 'plot'.
   :type file_prefix: str, optional

   :returns: A dictionary where keys are plot names and values are Altair Chart objects.
   :rtype: dict

   :raises TypeError: If the input is not a pandas DataFrame.
   :raises ValueError: If the input DataFrame is empty.

   .. rubric:: Notes

   - Numeric columns are those of types 'int64', 'float64'.
   - Categorical columns are those of types 'object', 'category', and 'bool'.

   .. rubric:: Examples

   >>> import pandas as pd
   >>> df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': ['x', 'y', 'x', 'y']})
   >>> charts = plotify(df, plot_types=['histogram', 'bar'])
   >>> charts['histogram_A'].show()


.. py:function:: summarize_data(df: pandas.DataFrame) -> pandas.DataFrame

   Summarizes numeric columns in a given DataFrame by calculating key statistical metrics.

   This function automatically detects numeric columns in the provided DataFrame
   and returns a summary DataFrame containing the minimum, 25th percentile (Q1),
   median (50th percentile), 75th percentile (Q3), and maximum values for each numeric column.

   :param df: The input DataFrame containing data to be summarized.
   :type df: pandas.DataFrame

   :returns: A DataFrame where each row corresponds to a numeric column in the input DataFrame,
             and the columns represent the calculated statistics: min, 25%, 50% (median), 75%, and max.
   :rtype: pandas.DataFrame

   .. rubric:: Example

   >>> import pandas as pd
   >>> import numpy as np
   >>> data = {
   ...     "A": [1, 2, np.nan, 4],
   ...     "B": [100, 200, 300, 400],
   ...     "C": [1, 1, 1, 100]
   ... }
   >>> df = pd.DataFrame(data)
   >>> summarize_data(df)
        min    25%    50%     75%    max
   A    1.0    1.5    2.0    3.00    4.0
   B  100.0  175.0  250.0  325.00  400.0
   C    1.0    1.0    1.0   25.75  100.0
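
The five statistics described for ``summarize_data`` can be reproduced with plain pandas, which may help when checking results by hand. The sketch below is an illustration only and is not taken from the datpro source: ``five_number_summary`` is a hypothetical helper name, and it assumes linear interpolation between data points (the pandas default for ``DataFrame.quantile``).

```python
import pandas as pd


def five_number_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper (not part of datpro): one row per numeric column."""
    # Keep only numeric columns; strings and categoricals are dropped.
    numeric = df.select_dtypes(include="number")
    # quantile() ignores NaN values and interpolates linearly by default.
    # The 0.0 and 1.0 quantiles are the minimum and maximum.
    summary = numeric.quantile([0.0, 0.25, 0.5, 0.75, 1.0]).T
    summary.columns = ["min", "25%", "50%", "75%", "max"]
    return summary


df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "label": list("abcde")})
result = five_number_summary(df)
print(result)
```

Only the numeric column ``x`` appears in the result; the ``label`` column is skipped because it is not numeric.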