src.asqi.datasets

Functions

validate_dataset_features(→ None)

Validate that the loaded dataset has all required features and types.

load_hf_dataset(→ datasets.Dataset)

Load a HuggingFace dataset using the provided loader parameters.

verify_txt_file(→ str)

Verify that the provided file path points to a valid .txt file.

verify_pdf_file(→ str)

Verify that the provided file path points to a valid .pdf file.

Module Contents

src.asqi.datasets.validate_dataset_features(dataset: datasets.Dataset, expected_features: Sequence[asqi.schemas.DatasetFeature | asqi.schemas.HFFeature], dataset_name: str = 'dataset') → None

Validate that the loaded dataset has all required features and types.

Args:

dataset: The loaded HuggingFace dataset to validate.

expected_features: List of feature definitions from InputDataset or OutputDataset.

dataset_name: Name of the dataset for error messages (default: “dataset”).

Raises:

ValueError: If required features are missing or types don’t match.

Note:

Validates both feature existence and types (dtypes for scalars, feature types for Image/Audio/Video/List/ClassLabel). Complex nested structures (DictFeature fields) are not fully validated.
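The two checks described above (feature existence, then type match) can be sketched with plain Python types. This is an illustrative stand-in, not the asqi implementation: the helper name check_features is hypothetical, and each feature is reduced to a (name, dtype) pair, whereas the real function takes a datasets.Dataset and DatasetFeature/HFFeature definitions.

```python
from typing import Mapping, Sequence


def check_features(
    actual: Mapping[str, str],
    expected: Sequence[tuple[str, str]],
    dataset_name: str = "dataset",
) -> None:
    """Hypothetical sketch: raise ValueError on missing or mistyped columns.

    `actual` maps column name -> dtype string; `expected` lists
    (name, dtype) pairs. Both shapes are assumptions for illustration.
    """
    # Existence check: every expected column must be present.
    missing = [name for name, _ in expected if name not in actual]
    if missing:
        raise ValueError(f"{dataset_name}: missing features {missing}")

    # Type check: the dtype of each present column must match.
    mismatched = [
        (name, dtype, actual[name])
        for name, dtype in expected
        if actual[name] != dtype
    ]
    if mismatched:
        raise ValueError(f"{dataset_name}: type mismatches {mismatched}")


# Passes silently: the expected column exists with the expected dtype.
check_features({"text": "string", "label": "int64"}, [("text", "string")])
```

Including dataset_name in the message mirrors the documented purpose of that argument: when several datasets are validated in one run, the error identifies which one failed.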

src.asqi.datasets.load_hf_dataset(dataset_config: dict | asqi.schemas.HFDatasetDefinition, input_mount_path: pathlib.Path | None = None, expected_features: Sequence[asqi.schemas.DatasetFeature | asqi.schemas.HFFeature] | None = None, dataset_name: str = 'dataset') → datasets.Dataset

Load a HuggingFace dataset using the provided loader parameters.

Args:

dataset_config: Configuration for loading the HuggingFace dataset.

input_mount_path: Optional path to prepend to relative data_files/data_dir paths. Typically used in containers to resolve paths relative to the input mount point. Absolute paths in the dataset config are not modified.

expected_features: Optional list of features to validate after loading. If provided, validates both feature existence and types (dtypes for scalars, feature types for Image/Audio/Video/List/ClassLabel).

dataset_name: Name of the dataset for validation error messages (default: “dataset”). Only used when expected_features is provided.

Returns:

Dataset: Loaded HuggingFace dataset.

Raises:

ValidationError: If dataset_config dict fails Pydantic validation.

ValueError: If expected_features is provided and validation fails.

Security Note:

This function uses local file loaders (json, csv, parquet, etc.) selected via builder_name, which is constrained by Literal types in DatasetLoaderParams. The revision parameter is provided for forward compatibility with HF Hub datasets, but current usage is limited to local files.
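The path rule documented for input_mount_path (relative paths get the mount point prepended, absolute paths are left alone) can be sketched as follows. The helper resolve_data_path is hypothetical and is not part of asqi. It assumes POSIX-style container paths and takes the mount point as a string, whereas load_hf_dataset accepts a pathlib.Path.

```python
from pathlib import PurePosixPath
from typing import Optional


def resolve_data_path(path: str, input_mount_path: Optional[str] = None) -> str:
    """Hypothetical sketch of the documented rule, assuming POSIX paths."""
    candidate = PurePosixPath(path)
    # No mount point given, or the path is already absolute: leave it as-is.
    if input_mount_path is None or candidate.is_absolute():
        return str(candidate)
    # Relative path: resolve it against the mount point.
    return str(PurePosixPath(input_mount_path) / candidate)


relative = resolve_data_path("data/train.json", "/input")    # "/input/data/train.json"
absolute = resolve_data_path("/abs/train.json", "/input")    # "/abs/train.json"
```

Leaving absolute paths untouched means a dataset config written for a specific host layout keeps working, while relative paths stay portable across containers with different mount points.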

src.asqi.datasets.verify_txt_file(file_path: str) → str

Verify that the provided file path points to a valid .txt file.

Args:

file_path (str): Path to the .txt file.

Returns:

str: The validated file path.

Raises:

ValueError: If the file is not a .txt file.

src.asqi.datasets.verify_pdf_file(file_path: str) → str

Verify that the provided file path points to a valid .pdf file.

Args:

file_path (str): Path to the .pdf file.

Returns:

str: The validated file path.

Raises:

ValueError: If the file is not a .pdf file.
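Both verify_txt_file and verify_pdf_file share one documented contract: return the path unchanged when the extension matches, raise ValueError otherwise. A minimal sketch of that contract is shown here. The helper verify_extension is hypothetical, the case-insensitive comparison is an assumption, and whether the real functions also check that the file exists is not stated in this reference.

```python
from pathlib import Path


def verify_extension(file_path: str, extension: str) -> str:
    """Hypothetical sketch: return file_path if it ends with `extension`.

    Assumption: the comparison ignores case, so "REPORT.PDF" is accepted.
    Only the suffix is inspected; the file itself is never opened.
    """
    if Path(file_path).suffix.lower() != extension.lower():
        raise ValueError(f"Expected a {extension} file, got: {file_path}")
    return file_path


checked_txt = verify_extension("notes.txt", ".txt")
checked_pdf = verify_extension("REPORT.PDF", ".pdf")
```

Returning the validated path rather than None lets such a function be used directly as an argument-parsing callback or chained into a loader call.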