TextSimilarity usecases

class previsionio.text_similarity.DescriptionsColumnConfig(content_column, id_column)

Bases: previsionio.usecase_config.UsecaseConfig

Column configuration for the description dataset of a usecase: this object defines the role of specific columns in that dataset.

Parameters:
  • content_column (str, required) – Name of the column containing the text descriptions in the description dataset.
  • id_column (str, optional) – Name of the id column in the description dataset.

class previsionio.text_similarity.ModelEmbedding

Bases: enum.Enum

Embedding models for Text Similarity

TFIDF = 'tf_idf'

Term Frequency - Inverse Document Frequency

Transformer = 'transformer'

Transformer

TransformerFineTuned = 'transformer_fine_tuned'

Fine-tuned Transformer
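To make the TFIDF option above more concrete, here is a minimal, self-contained sketch of TF-IDF weighting in plain Python. It is not previsionio code and makes no claim about how the platform computes its embeddings: the function name tf_idf_vectors, the whitespace tokenizer, and the smoothed IDF formula are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Return one {term: weight} dict per document (illustrative only)."""
    # Naive tokenization: lowercase and split on whitespace (assumption).
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vector = {}
        for term, count in counts.items():
            tf = count / total
            # Smoothed IDF (assumption): rarer terms get a larger weight.
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            vector[term] = tf * idf
        vectors.append(vector)
    return vectors


# Hypothetical text descriptions.
descriptions = [
    "red cotton shirt",
    "blue cotton shirt",
    "red leather shoes",
]
vectors = tf_idf_vectors(descriptions)
```

In the third description, "leather" occurs in a single document and therefore receives a higher weight than "red", which occurs in two.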

class previsionio.text_similarity.ModelsParameters(model_embedding: previsionio.text_similarity.ModelEmbedding = <ModelEmbedding.TFIDF: 'tf_idf'>, preprocessing: previsionio.text_similarity.Preprocessing = <previsionio.text_similarity.Preprocessing object>, models: List[previsionio.text_similarity.TextSimilarityModels] = [<TextSimilarityModels.BruteForce: 'brute_force'>])

Bases: previsionio.usecase_config.UsecaseConfig

Training configuration for a text similarity usecase: the embedding model, the text preprocessing to apply, and the search models to use.

Parameters:
  • preprocessing (Preprocessing, optional) –

    Text preprocessing options to apply (used only with the “tf_idf” embedding model):

    • word_stemming: defaults to “yes”.
    • ignore_stop_word: defaults to “auto”, in which case the choice depends on whether the text descriptions contain full sentences.
    • ignore_punctuation: defaults to “no”.
  • model_embedding (ModelEmbedding, optional) – Name of the embedding model to use (one of “tf_idf”, “transformer”, “transformer_fine_tuned”).
  • models (list(TextSimilarityModels), optional) – Names of the search models to use (among “brute_force”, “cluster_pruning”, “ivfopq”, “hkm”, “lsh”).

class previsionio.text_similarity.QueriesColumnConfig(queries_dataset_content_column, queries_dataset_matching_id_description_column, queries_dataset_id_column=None)

Bases: previsionio.usecase_config.UsecaseConfig

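Queries column configuration for starting a usecase: this object defines the role of specific columns in the queries dataset.

Parameters:
  • queries_dataset_content_column (str, required) – Name of the column containing the text queries in the queries dataset.
  • queries_dataset_matching_id_description_column (str, required) – Name of the column in the queries dataset that holds the id of the matching description.
  • queries_dataset_id_column (str, optional) – Name of the id column in the queries dataset (default: None).

class previsionio.text_similarity.TextSimilarity(**usecase_info)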

Bases: previsionio.usecase_version.BaseUsecaseVersion

A text similarity usecase version

model_class

alias of previsionio.model.TextSimilarityModel

new_version(description: str = None, dataset: previsionio.dataset.Dataset = None, description_column_config: previsionio.text_similarity.DescriptionsColumnConfig = None, metric: previsionio.metrics.TextSimilarity = None, top_k: int = None, lang: previsionio.text_similarity.TextSimilarityLang = <TextSimilarityLang.Auto: 'auto'>, queries_dataset: previsionio.dataset.Dataset = None, queries_column_config: Optional[previsionio.text_similarity.QueriesColumnConfig] = None, models_parameters: previsionio.text_similarity.ListModelsParameters = None, **kwargs) → previsionio.text_similarity.TextSimilarity

Start a text similarity usecase training to create a new version of the usecase (on the platform): the training configuration is copied from the current version and then overridden by the parameters that are given.
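The copy-then-override behaviour can be pictured with plain dictionaries. This is a minimal sketch, not the library's implementation: the helper name merge_version_params, the sample keys, and the rule that a value of None means "keep the current setting" are assumptions inferred from the None defaults in the signature.

```python
def merge_version_params(current, **overrides):
    """Copy the current settings, then apply only the overrides that were given."""
    merged = dict(current)  # copy, so the current version's settings stay untouched
    for key, value in overrides.items():
        if value is not None:  # assumption: None means "not provided"
            merged[key] = value
    return merged


# Hypothetical settings of the current usecase version.
current_params = {"top_k": 10, "lang": "auto", "description": None}

new_params = merge_version_params(
    current_params, top_k=5, description="smaller top_k"
)
```

Here new_params keeps lang from the current version, while top_k and description take the newly supplied values.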

Parameters:
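  • description (str, optional) – Additional description of the version.
  • dataset (Dataset, DatasetImages, optional) – Reference to the dataset object to use as the training dataset.
  • description_column_config (DescriptionsColumnConfig, optional) – Column configuration for the description dataset (see the documentation of the DescriptionsColumnConfig resource for more details).
  • metric (metrics.TextSimilarity, optional) – Specific metric to use for the usecase (default: None).
  • top_k (int, optional) – Number of most similar descriptions to retrieve for each query (default: None).
  • lang (TextSimilarityLang, optional) – Language of the text (default: TextSimilarityLang.Auto).
  • queries_dataset (Dataset, optional) – Reference to a dataset object to use as the queries dataset (default: None).
  • queries_column_config (QueriesColumnConfig, optional) – Column configuration for the queries dataset (default: None).
  • models_parameters (ListModelsParameters, optional) – Configuration of the embedding and search models to use (default: None).
  • holdout_dataset (Dataset, optional) – Reference to a dataset object to use as a holdout dataset (default: None)
  • training_config (TrainingConfig, optional) – Specific training configuration (see the documentation of the TrainingConfig resource for more details on all the parameters)

Returns: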

Newly created text similarity usecase version object (new version)

Return type:

TextSimilarity
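As an illustration of how the description dataset, the queries dataset, and their column configurations relate, the sketch below uses plain Python lists of dictionaries rather than previsionio objects. The column names (item_id, item_text, query_id, query_text, expected_item_id) and the sample rows are hypothetical, and reading the matching-id column as "the id of the description each query should match" is an assumption based on the argument name. The dictionary keys mirror the constructor arguments of DescriptionsColumnConfig and QueriesColumnConfig.

```python
# Description dataset: one row per item that can be retrieved.
descriptions = [
    {"item_id": "A1", "item_text": "red cotton shirt"},
    {"item_id": "B2", "item_text": "black leather shoes"},
]

# Queries dataset: one row per query, with the id of the description it should match.
queries = [
    {"query_id": "q1", "query_text": "shirt in red", "expected_item_id": "A1"},
    {"query_id": "q2", "query_text": "leather shoes", "expected_item_id": "B2"},
]

# Column names keyed by the DescriptionsColumnConfig argument they would be passed to.
description_columns = {
    "content_column": "item_text",
    "id_column": "item_id",
}
# Column names keyed by the QueriesColumnConfig argument they would be passed to.
query_columns = {
    "queries_dataset_content_column": "query_text",
    "queries_dataset_matching_id_description_column": "expected_item_id",
    "queries_dataset_id_column": "query_id",
}

# Resolve each query's expected description through the matching id column.
by_id = {row[description_columns["id_column"]]: row for row in descriptions}
matching_key = query_columns["queries_dataset_matching_id_description_column"]
query_id_key = query_columns["queries_dataset_id_column"]
content_key = description_columns["content_column"]
expected = {q[query_id_key]: by_id[q[matching_key]][content_key] for q in queries}
```

The resulting expected mapping links each query id to the description text it is supposed to retrieve, which is the ground truth a similarity metric would be evaluated against.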

class previsionio.text_similarity.TextSimilarityLang

Bases: enum.Enum

Languages available for a text similarity usecase. The default used by new_version is Auto ('auto').

class previsionio.text_similarity.TextSimilarityModels

Bases: enum.Enum

Similarity search models for Text Similarity

BruteForce = 'brute_force'

Brute force search

ClusterPruning = 'cluster_pruning'

Cluster Pruning

HKM = 'hkm'

Hierarchical K-Means

IVFOPQ = 'ivfopq'

InVerted File system and Optimized Product Quantization

LSH = 'lsh'

Locality Sensitive Hashing
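For orientation, the simplest of these strategies, brute-force search, can be sketched in a few lines: score the query against every description vector and keep the top_k best. This is a generic illustration with hypothetical names (cosine_similarity, brute_force_search) and toy vectors, not the platform's implementation. The other techniques listed above are generally approximate methods that avoid scoring every candidate.

```python
import heapq
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def brute_force_search(query, embeddings, top_k=2):
    """Score the query against every stored vector and return the top_k ids."""
    scored = (
        (cosine_similarity(query, vec), item_id)
        for item_id, vec in embeddings.items()
    )
    # nlargest keeps the top_k highest-scoring (score, id) pairs, best first.
    return [item_id for _, item_id in heapq.nlargest(top_k, scored)]


# Toy embeddings keyed by description id (hypothetical values).
embeddings = {
    "A1": [1.0, 0.0, 0.0],
    "B2": [0.0, 1.0, 0.0],
    "C3": [0.9, 0.1, 0.0],
}
results = brute_force_search([1.0, 0.0, 0.0], embeddings, top_k=2)
```

Because every candidate is scored, the cost grows linearly with the number of descriptions, which is the motivation for the pruning, quantization, and hashing approaches on larger datasets.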