stelardataprofiler package

Submodules

stelardataprofiler.main module

stelardataprofiler.main.main()

stelardataprofiler.profiler module

stelardataprofiler.profiler.prepare_mapping(config: dict) → None

This method prepares the suitable mapping for subsequent generation of the RDF graph, if “rdf” and “serialization” options are specified in config.

Parameters:

config (dict) – a dictionary with all configuration settings.

Returns:

None.

Return type:

None

stelardataprofiler.profiler.run_profile(config: dict) → None

This method executes the specified profiler and writes the resulting profile dictionary based on a configuration dictionary.

Parameters:

config (dict) – a dictionary with all configuration settings.

Returns:

None.

Return type:

None

stelardataprofiler.utils module

stelardataprofiler.utils.read_config(json_file: str) → dict

This method reads configuration settings from a json file. Configuration includes all parameters for input/output.

Parameters:

json_file (str) – path to .json file that contains the configuration parameters.

Returns:

A dictionary with all configuration settings.

Return type:

dict

stelardataprofiler.utils.write_to_json(output_dict: dict, output_file: str | Path) → None

Write the profile dictionary to a file.

Parameters:
  • output_dict (dict) – the profile dictionary that will be written.

  • output_file (Union[str, Path]) – The name or the path of the file to generate including the extension (.json).

Returns:

None.

Return type:

None

Module contents

stelardataprofiler.prepare_mapping(config: dict) → None

This method prepares the suitable mapping for subsequent generation of the RDF graph, if “rdf” and “serialization” options are specified in config.

Parameters:

config (dict) – a dictionary with all configuration settings.

Returns:

None.

Return type:

None

stelardataprofiler.profile_hierarchical(my_file_path: str) → dict

This method profiles the JSON file at the given path and generates a profiling dictionary.

Parameters:

my_file_path (str) – the path to a json file.

Returns:

A dict which contains the results of the profiler for the json.

Return type:

dict

stelardataprofiler.profile_hierarchical_with_config(config: dict) → None

This method performs profiling on hierarchical data and writes the resulting profile dictionary based on a configuration dictionary.

Parameters:

config (dict) – a dictionary with all configuration settings.

Returns:

None.

Return type:

None

stelardataprofiler.profile_raster(my_path: str | List[str]) → dict

This method performs profiling and generates a profiling dictionary for either a single image or many images.

Parameters:

my_path (Union[str, List[str]]) – either the path to an image file or a list of paths to image files.

Returns:

A dict which contains the results of the profiler for the image or images.

Return type:

dict

stelardataprofiler.profile_raster_with_config(config: dict) → None

This method performs profiling on raster data and writes the resulting profile dictionary based on a configuration dictionary.

Parameters:

config (dict) – a dictionary with all configuration settings.

Returns:

None.

Return type:

None

stelardataprofiler.profile_rdfGraph(my_file_path: str, parse_format: str = 'application/rdf+xml')

This method profiles the RDF file at the given path and generates a profiling dictionary.

Parameters:
  • my_file_path (str) – the path to an RDF file.

  • parse_format (str, optional) – the format of the RDF file; defaults to ‘application/rdf+xml’. See the rdflib documentation for the available formats, e.g. ‘turtle’, ‘application/rdf+xml’, ‘n3’, ‘nt’.

Returns:

A dict which contains the results of the profiler for the rdf.

Return type:

dict
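Because parse_format defaults to ‘application/rdf+xml’, a file in any other serialization needs the format passed explicitly. The following is a minimal sketch of that call for a Turtle file, assuming the package is installed. The example.org IRIs, the file name, and the try_profile_rdf helper are hypothetical; only profile_rdfGraph and its two parameters come from the signature above.

```python
import tempfile
from pathlib import Path

# A two-triple Turtle document; the example.org IRIs are placeholders.
TURTLE = """\
@prefix ex: <http://example.org/> .

ex:alice ex:knows ex:bob .
ex:bob ex:knows ex:carol .
"""


def try_profile_rdf(path: Path, parse_format: str):
    """Call profile_rdfGraph if the package is importable, else return None."""
    try:
        from stelardataprofiler import profile_rdfGraph
    except ImportError:
        return None  # package not installed
    return profile_rdfGraph(str(path), parse_format=parse_format)


with tempfile.TemporaryDirectory() as tmp:
    ttl_path = Path(tmp) / "graph.ttl"
    ttl_path.write_text(TURTLE, encoding="utf-8")
    # Lines that state a triple in the sample document.
    triple_lines = [line for line in TURTLE.splitlines() if line.startswith("ex:")]
    # The file is Turtle, not RDF/XML, so the default format would not apply.
    report = try_profile_rdf(ttl_path, parse_format="turtle")
```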

stelardataprofiler.profile_rdfGraph_with_config(config: dict) → None

This method performs profiling on rdfGraph data and writes the resulting profile dictionary based on a configuration dictionary.

Parameters:

config (dict) – a dictionary with all configuration settings.

Returns:

None.

Return type:

None

stelardataprofiler.profile_tabular(input_path: str | DataFrame | GeoDataFrame, header: str | int = 0, sep: str = ',', light_mode: bool = False, crs: str = 'EPSG:4326', num_cat_perc_threshold: float = 0.5, max_freq_distr=10, eps_distance=1000, extra_geometry_columns: list | None = None, types_dict=None) → dict

Profiles a tabular DataFrame (or file path) and returns a profiling report as a dictionary.

Parameters:
  • input_path (str or pandas.DataFrame or geopandas.GeoDataFrame) – Path to input file or a DataFrame/GeoDataFrame to profile.

  • header (int or str) – Row to use as column names (passed to the pandas reader).

  • sep (str) – Separator to use for reading CSV files.

  • light_mode (bool) – If True, skip expensive computations.

  • crs (str) – Coordinate reference system for geometry data.

  • num_cat_perc_threshold (float) – Threshold for treating numeric as categorical.

  • max_freq_distr (int) – Top-K most frequent values to be displayed in the frequency distribution.

  • eps_distance (int) – Distance tolerance for geometry heatmap calculations.

  • extra_geometry_columns (list or None) – Additional geometry columns to consider.

  • types_dict (dict or None) – Pre-computed types dictionary to use instead of detecting.

Returns:

Profiling report dictionary.

Return type:

dict
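A minimal sketch of a call with a file path, assuming the package is installed. The sample data, the file name, and the write_sample_csv and try_profile_tabular helpers are hypothetical; the keyword names (header, sep, light_mode, max_freq_distr) are taken from the signature above. The structure of the returned report is not documented here, so the sketch only lists its top-level keys.

```python
import csv
import tempfile
from pathlib import Path


def write_sample_csv(directory: Path) -> Path:
    """Write a tiny semicolon-separated CSV (made-up data) and return its path."""
    path = directory / "stations.csv"
    rows = [
        ["station", "temperature", "region"],
        ["A", "21.5", "north"],
        ["B", "19.0", "south"],
        ["C", "23.1", "north"],
    ]
    with path.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle, delimiter=";").writerows(rows)
    return path


def try_profile_tabular(path: Path, **options):
    """Call profile_tabular if the package is importable, else return None."""
    try:
        from stelardataprofiler import profile_tabular
    except ImportError:
        return None  # package not installed
    return profile_tabular(str(path), **options)


# Keyword names come from the documented signature; the values are examples.
options = {"header": 0, "sep": ";", "light_mode": True, "max_freq_distr": 5}

with tempfile.TemporaryDirectory() as tmp:
    csv_path = write_sample_csv(Path(tmp))
    header_line = csv_path.read_text(encoding="utf-8").splitlines()[0]
    report = try_profile_tabular(csv_path, **options)
    if report is not None:
        print(sorted(report))  # top-level keys of the profiling report
```

Since sep defaults to ',', a file with any other delimiter needs sep set explicitly, as the sample does with ';'.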

stelardataprofiler.profile_tabular_with_config(config: dict) → None

This method performs profiling on tabular and/or vector data and writes the resulting profile dictionary based on a configuration dictionary.

Parameters:

config (dict) – a dictionary with all configuration settings.

Returns:

None.

Return type:

None

stelardataprofiler.profile_text(my_path: str | List[str])

This method performs profiling and generates a profiling dictionary for either a single text or many texts.

Parameters:

my_path (Union[str, List[str]]) – either the path to a text file or a list of paths to text files.

Returns:

A dict which contains the results of the profiler for the text or texts.

Return type:

dict

stelardataprofiler.profile_text_with_config(config: dict) → None

This method performs profiling on text data and writes the resulting profile dictionary based on a configuration dictionary.

Parameters:

config (dict) – a dictionary with all configuration settings.

Returns:

None.

Return type:

None

stelardataprofiler.profile_timeseries(input_path: str | DataFrame, ts_mode_datetime_col: str = 'date', header: str | int = 0, sep: str = ',', light_mode: bool = False, num_cat_perc_threshold: float = 0.5, max_freq_distr=10, types_dict=None) → dict

Profiles a timeseries DataFrame or file and returns a profiling report dictionary.

Parameters:
  • input_path (str or pandas.DataFrame) – Path to input file or DataFrame to profile.

  • ts_mode_datetime_col (str) – Column name for datetime index.

  • header (int or str) – Row to use as column names.

  • sep (str) – Field separator for CSV.

  • light_mode (bool) – If True, skip expensive computations.

  • num_cat_perc_threshold (float) – Threshold for treating numeric as categorical.

  • max_freq_distr (int) – Top-K most frequent values to be displayed in the frequency distribution.

  • types_dict (dict or None) – Pre-computed types dictionary to use instead of detecting.

Returns:

Profiling report dictionary.

Return type:

dict
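A minimal sketch of a timeseries call, assuming the package is installed and that ISO-formatted dates in the datetime column are acceptable input. The sample series, the file name, and both helpers are hypothetical. The datetime column is named "date" here to match the documented default of ts_mode_datetime_col, but it is passed explicitly to make the dependency visible.

```python
import csv
import tempfile
from datetime import date, timedelta
from pathlib import Path


def write_sample_series(directory: Path, days: int = 30) -> Path:
    """Write a daily series (made-up values) with a 'date' column; return its path."""
    path = directory / "readings.csv"
    start = date(2024, 1, 1)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["date", "value"])
        for offset in range(days):
            day = start + timedelta(days=offset)
            writer.writerow([day.isoformat(), 10 + offset])
    return path


def try_profile_timeseries(path: Path):
    """Call profile_timeseries if the package is importable, else return None."""
    try:
        from stelardataprofiler import profile_timeseries
    except ImportError:
        return None  # package not installed
    return profile_timeseries(str(path), ts_mode_datetime_col="date", sep=",")


with tempfile.TemporaryDirectory() as tmp:
    series_path = write_sample_series(Path(tmp))
    # One header row plus one row per day.
    line_count = len(series_path.read_text(encoding="utf-8").splitlines())
    report = try_profile_timeseries(series_path)
```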

stelardataprofiler.profile_timeseries_with_config(config: dict) → None

This method performs profiling on timeseries data and writes the resulting profile dictionary based on a configuration dictionary.

Parameters:

config (dict) – a dictionary with all configuration settings.

Returns:

None.

Return type:

None

stelardataprofiler.profile_vista_rasters(rhd_datapath: str, ras_datapath: str)

This method profiles the ras file at ras_datapath, using the contents of the rhd file at rhd_datapath, and generates a profiling dictionary.

Parameters:
  • rhd_datapath (str) – the path to an rhd file.

  • ras_datapath (str) – the path to a ras file.

Returns:

A dict which contains the results of the profiler for the ras.

Return type:

dict

stelardataprofiler.profile_vista_rasters_with_config(config: dict) → None

This method performs profiling on ras data and writes the resulting profile dictionary based on a configuration dictionary.

Parameters:

config (dict) – a dictionary with all configuration settings.

Returns:

None.

Return type:

None

stelardataprofiler.read_config(json_file: str) → dict

This method reads configuration settings from a json file. Configuration includes all parameters for input/output.

Parameters:

json_file (str) – path to .json file that contains the configuration parameters.

Returns:

A dictionary with all configuration settings.

Return type:

dict

stelardataprofiler.run_profile(config: dict) → None

This method executes the specified profiler and writes the resulting profile dictionary based on a configuration dictionary.

Parameters:

config (dict) – a dictionary with all configuration settings.

Returns:

None.

Return type:

None

stelardataprofiler.type_detection(input_path: str | DataFrame | GeoDataFrame, header: str | int = 0, sep: str = ',', ts_mode: bool = False, ts_mode_datetime_col: str | None = None, crs: str = 'EPSG:4326', num_cat_perc_threshold: float = 0.5, max_freq_distr=10, eps_distance: int = 1000, extra_geometry_columns: list | None = None, **kwargs) → dict

Detects data types for each column in tabular or timeseries data.

Parameters:
  • input_path (str or pandas.DataFrame or geopandas.GeoDataFrame) – Path or DataFrame/GeoDataFrame to inspect.

  • header (int or str) – Header row index or name.

  • sep (str) – Separator character.

  • ts_mode (bool) – Whether to treat data as timeseries.

  • ts_mode_datetime_col (str or None) – Datetime column for timeseries.

  • crs (str) – Coordinate reference system for geometry data.

  • num_cat_perc_threshold (float) – Threshold for treating numeric as categorical.

  • max_freq_distr (int) – Top-K most frequent values to be displayed in the frequency distribution.

  • eps_distance (int) – Distance tolerance for geometry heatmap calculations.

  • extra_geometry_columns (list or None) – Additional geometry columns to consider.

Returns:

Types dictionary mapping column to detected type info.

Return type:

dict
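The types_dict parameter of profile_tabular and profile_timeseries accepts a pre-computed types dictionary, which suggests a two-step workflow: detect types once, then profile with them. The sketch below assumes that the dictionary returned by type_detection is in the shape types_dict expects; the source does not state this explicitly. The profile_with_precomputed_types helper and the data.csv file name are hypothetical, and the two library functions are passed in as arguments so the helper can be exercised without the package installed.

```python
from pathlib import Path


def profile_with_precomputed_types(input_path, detect, profile, sep=","):
    """Detect column types once, then hand them to the profiler."""
    types = detect(input_path, sep=sep)
    # Assumption: the dictionary returned by type_detection is in the shape
    # that the types_dict parameter of the profiler expects.
    return profile(input_path, sep=sep, types_dict=types)


try:
    from stelardataprofiler import profile_tabular, type_detection
except ImportError:
    profile_tabular = type_detection = None  # package not installed

sample = Path("data.csv")  # hypothetical input file
if type_detection is not None and sample.exists():
    report = profile_with_precomputed_types(
        str(sample), type_detection, profile_tabular
    )
```

Splitting the steps this way leaves room to inspect the detected types before the more expensive profiling run.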

stelardataprofiler.type_detection_with_config(config: dict) → None

This method performs type detection on tabular or timeseries data and writes the resulting type detection dictionary based on a configuration dictionary.

Parameters:

config (dict) – a dictionary with all configuration settings.

Returns:

None.

Return type:

None

stelardataprofiler.write_to_json(output_dict: dict, output_file: str | Path) → None

Write the profile dictionary to a file.

Parameters:
  • output_dict (dict) – the profile dictionary that will be written.

  • output_file (Union[str, Path]) – The name or the path of the file to generate including the extension (.json).

Returns:

None.

Return type:

None