src.asqi.datasets
Functions
| validate_dataset_features | Validate that the loaded dataset has all required features and types. |
| load_hf_dataset | Load a HuggingFace dataset using the provided loader parameters. |
| verify_txt_file | Verify that the provided file path points to a valid .txt file. |
| verify_pdf_file | Verify that the provided file path points to a valid .pdf file. |
Module Contents
- src.asqi.datasets.validate_dataset_features(dataset: datasets.Dataset, expected_features: Sequence[asqi.schemas.DatasetFeature | asqi.schemas.HFFeature], dataset_name: str = 'dataset') -> None
Validate that the loaded dataset has all required features and types.
- Args:
dataset: The loaded HuggingFace dataset to validate.
expected_features: List of feature definitions from InputDataset or OutputDataset.
dataset_name: Name of the dataset for error messages (default: “dataset”).
- Raises:
ValueError: If required features are missing or types don’t match.
- Note:
Validates both feature existence and types (dtypes for scalars, feature types for Image/Audio/Video/List/ClassLabel). Complex nested structures (DictFeature fields) are not fully validated.
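The existence-and-type check described above can be illustrated with a minimal, dependency-free sketch. It is not the asqi implementation: `check_features`, the plain dict of column name to dtype string, and the `(name, dtype)` tuples are hypothetical stand-ins for the dataset's features and the schema objects, chosen so the example runs without the `datasets` package.

```python
from typing import Mapping, Sequence, Tuple


def check_features(
    actual: Mapping[str, str],
    expected: Sequence[Tuple[str, str]],
    dataset_name: str = "dataset",
) -> None:
    """Raise ValueError if an expected column is missing or has another dtype."""
    # Columns that are expected but absent from the dataset.
    missing = [name for name, _ in expected if name not in actual]
    # Columns that exist but whose dtype differs from the expected one.
    mismatched = [
        f"{name}: expected {dtype}, got {actual[name]}"
        for name, dtype in expected
        if name in actual and actual[name] != dtype
    ]
    if missing or mismatched:
        problems = []
        if missing:
            problems.append(f"missing features {missing}")
        if mismatched:
            problems.append(f"type mismatches {mismatched}")
        # dataset_name only serves to make the error message easier to trace.
        raise ValueError(f"{dataset_name}: " + "; ".join(problems))


# Hypothetical column layout standing in for a loaded dataset's features.
columns = {"prompt": "string", "label": "int64"}
check_features(columns, [("prompt", "string"), ("label", "int64")])  # passes silently

try:
    check_features(columns, [("prompt", "string"), ("score", "float32")], "eval_set")
except ValueError as err:
    error_message = str(err)
```

Collecting every problem before raising, rather than failing on the first one, is a design choice of this sketch; the source does not say whether the real function reports all issues at once.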
- src.asqi.datasets.load_hf_dataset(dataset_config: dict | asqi.schemas.HFDatasetDefinition, input_mount_path: pathlib.Path | None = None, expected_features: Sequence[asqi.schemas.DatasetFeature | asqi.schemas.HFFeature] | None = None, dataset_name: str = 'dataset') -> datasets.Dataset
Load a HuggingFace dataset using the provided loader parameters.
- Args:
dataset_config: Configuration for loading the HuggingFace dataset.
input_mount_path: Optional path to prepend to relative data_files/data_dir paths. Typically used in containers to resolve paths relative to the input mount point. Absolute paths in the dataset config are not modified.
expected_features: Optional list of features to validate after loading. If provided, validates both feature existence and types (dtypes for scalars, feature types for Image/Audio/Video/List/ClassLabel).
dataset_name: Name of the dataset for validation error messages (default: “dataset”). Only used when expected_features is provided.
- Returns:
Dataset: Loaded HuggingFace dataset.
- Raises:
ValidationError: If dataset_config dict fails Pydantic validation.
ValueError: If expected_features is provided and validation fails.
- Security Note:
This function uses local file loaders (json, csv, parquet, etc.), selected via builder_name, which is constrained by Literal types in DatasetLoaderParams. The revision parameter is provided for forward compatibility with HF Hub datasets, but current usage is limited to local files.
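The input_mount_path rule (prepend to relative paths, leave absolute paths untouched) can be sketched as follows. This is an illustration of the documented behavior, not the library's code: `resolve_data_path` is a hypothetical helper, and `PurePosixPath` is used instead of `pathlib.Path` only so the result is the same on every platform.

```python
from pathlib import PurePosixPath
from typing import Optional


def resolve_data_path(
    path: str, input_mount_path: Optional[PurePosixPath] = None
) -> PurePosixPath:
    """Prepend the mount path to relative paths; return absolute paths unchanged."""
    candidate = PurePosixPath(path)
    # No mount point given, or the path is already absolute: nothing to do.
    if input_mount_path is None or candidate.is_absolute():
        return candidate
    return input_mount_path / candidate


mount = PurePosixPath("/input")  # hypothetical container mount point
relative = resolve_data_path("data/train.json", mount)
absolute = resolve_data_path("/datasets/train.json", mount)
unmounted = resolve_data_path("data/train.json")
```

Here `relative` becomes `/input/data/train.json`, while `absolute` and `unmounted` are returned as given.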
- src.asqi.datasets.verify_txt_file(file_path: str) -> str
Verify that the provided file path points to a valid .txt file.
- Args:
file_path (str): Path to the .txt file.
- Returns:
str: The validated file path.
- Raises:
ValueError: If the file is not a .txt file.
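A validator with this contract (return the path when it is acceptable, raise ValueError otherwise) might look like the sketch below. `check_suffix` is a hypothetical stand-in, not the asqi function. The source does not state whether the real check is case-sensitive or whether it also requires the file to exist; this version compares the suffix case-insensitively and never touches the filesystem.

```python
from pathlib import Path


def check_suffix(file_path: str, suffix: str) -> str:
    """Return file_path unchanged if it ends with suffix, else raise ValueError."""
    # Assumption: only the extension is inspected, ignoring letter case.
    if Path(file_path).suffix.lower() != suffix:
        raise ValueError(f"Expected a {suffix} file, got: {file_path}")
    return file_path


validated = check_suffix("notes/prompts.txt", ".txt")

try:
    check_suffix("notes/prompts.csv", ".txt")
except ValueError as err:
    rejection = str(err)
```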
- src.asqi.datasets.verify_pdf_file(file_path: str) -> str
Verify that the provided file path points to a valid .pdf file.
- Args:
file_path (str): Path to the .pdf file.
- Returns:
str: The validated file path.
- Raises:
ValueError: If the file is not a .pdf file.
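Because the function returns the validated path and signals failure with ValueError, a callable of this shape can serve as an argument converter in a command-line parser. The sketch below shows that pattern with the standard library's argparse, which turns a ValueError raised by a `type=` callable into a usage error. The source does not say how asqi itself invokes the function; `pdf_path`, `build_parser`, and the `--report` option are hypothetical names used only for illustration.

```python
import argparse
from pathlib import Path


def pdf_path(file_path: str) -> str:
    """Hypothetical stand-in with the same contract as verify_pdf_file."""
    if Path(file_path).suffix.lower() != ".pdf":
        raise ValueError(f"Not a .pdf file: {file_path}")
    return file_path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="example")
    # argparse calls pdf_path on the raw string; a ValueError becomes a usage error.
    parser.add_argument("--report", type=pdf_path, required=True)
    return parser


args = build_parser().parse_args(["--report", "summary.pdf"])
report = args.report
```

Passing a path such as `summary.txt` instead makes argparse print a usage message and exit with status 2, so the rest of the program never sees an invalid path.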