TextSimilarity experiments

class previsionio.text_similarity.DescriptionsColumnConfig(content_column, id_column)

Bases: previsionio.experiment_config.ExperimentConfig

Description Column configuration for starting an experiment: this object defines the role of specific columns in the dataset.

Parameters:
  • content_column (str, required) – Name of the column containing the text descriptions in the description dataset.
  • id_column (str, optional) – Name of the id column in the description dataset.
class previsionio.text_similarity.ModelEmbedding

Bases: enum.Enum

Embedding models for Text Similarity

TFIDF = 'tf_idf'

Term Frequency - Inverse Document Frequency

Transformer = 'transformer'

Transformer

TransformerFineTuned = 'transformer_fine_tuned'

Fine-tuned Transformer
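
Since each member carries a string value, a member can be recovered from the value used in a configuration. The sketch below is a minimal illustration using a local stand-in enum that mirrors the three documented members; it does not import previsionio, and `parse_embedding` is a hypothetical helper, not part of the library.

```python
from enum import Enum


class ModelEmbedding(Enum):
    """Local stand-in mirroring the documented members (not the library class)."""
    TFIDF = 'tf_idf'
    Transformer = 'transformer'
    TransformerFineTuned = 'transformer_fine_tuned'


def parse_embedding(name: str) -> ModelEmbedding:
    """Hypothetical helper: map a string value to the matching enum member."""
    try:
        # Standard Enum lookup by value.
        return ModelEmbedding(name)
    except ValueError:
        allowed = ', '.join(member.value for member in ModelEmbedding)
        raise ValueError(
            f"unknown embedding {name!r}; expected one of: {allowed}"
        ) from None
```

Calling `parse_embedding('tf_idf')` returns the `TFIDF` member, while an unknown value raises a `ValueError` listing the accepted values.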

class previsionio.text_similarity.ModelsParameters(model_embedding: previsionio.text_similarity.ModelEmbedding = <ModelEmbedding.TFIDF: 'tf_idf'>, preprocessing: previsionio.text_similarity.Preprocessing = <previsionio.text_similarity.Preprocessing object>, models: List[previsionio.text_similarity.TextSimilarityModels] = [<TextSimilarityModels.BruteForce: 'brute_force'>])

Bases: previsionio.experiment_config.ExperimentConfig

Training configuration that holds the relevant data for an experiment description: the wanted feature engineering, the selected models, the training speed…

Parameters:
  • preprocessing (Preprocessing, optional) –

    Dictionary of the text preprocessing steps to apply (only for the “tf_idf” embedding model):

    • word_stemming: defaults to “yes”.
    • ignore_stop_word: defaults to “auto”; the choice depends on whether the text descriptions contain full sentences.
    • ignore_punctuation: defaults to “no”.
  • model_embedding (ModelEmbedding, optional) – Name of the embedding model to be used (among: “tf_idf”, “transformer”, “transformer_fine_tuned”).
  • models (list(TextSimilarityModels), optional) – Names of the searching models to be used (among: “brute_force”, “cluster_pruning”, “ivf_opq”, “hkm”, “lsh”).
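
The constraints listed above (allowed embedding names, allowed searching models, preprocessing defaults that only apply to “tf_idf”) can be sketched as a small validator working on plain strings and dicts. This is an illustrative sketch only: `check_models_parameters` is a hypothetical helper, it does not call previsionio, and rejecting a preprocessing dictionary for non-“tf_idf” embeddings is a choice made here, not documented library behavior.

```python
EMBEDDINGS = ('tf_idf', 'transformer', 'transformer_fine_tuned')
SEARCH_MODELS = ('brute_force', 'cluster_pruning', 'ivf_opq', 'hkm', 'lsh')
DEFAULT_PREPROCESSING = {
    'word_stemming': 'yes',
    'ignore_stop_word': 'auto',
    'ignore_punctuation': 'no',
}


def check_models_parameters(model_embedding='tf_idf', preprocessing=None,
                            models=('brute_force',)):
    """Hypothetical validator: return a plain dict describing one configuration."""
    if model_embedding not in EMBEDDINGS:
        raise ValueError(f"unknown embedding model: {model_embedding!r}")
    unknown = [name for name in models if name not in SEARCH_MODELS]
    if unknown:
        raise ValueError(f"unknown searching models: {unknown}")

    config = {'model_embedding': model_embedding, 'models': list(models)}
    if model_embedding == 'tf_idf':
        # Preprocessing only applies to tf_idf: fill in the documented defaults.
        config['preprocessing'] = {**DEFAULT_PREPROCESSING, **(preprocessing or {})}
    elif preprocessing:
        # Choice made in this sketch: refuse settings that would be unused.
        raise ValueError("preprocessing is only used with the 'tf_idf' embedding model")
    return config
```

For instance, `check_models_parameters('transformer', models=['lsh', 'hkm'])` yields a dict without a `preprocessing` entry.
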
class previsionio.text_similarity.QueriesColumnConfig(queries_dataset_content_column, queries_dataset_matching_id_description_column, queries_dataset_id_column=None)

Bases: previsionio.experiment_config.ExperimentConfig

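Queries column configuration for starting an experiment: this object defines the role of specific columns in the queries dataset.

Parameters:
  • queries_dataset_content_column (str, required) – Name of the column containing the text queries in the queries dataset.
  • queries_dataset_matching_id_description_column (str, required) – Name of the column in the queries dataset holding the id of the matching description.
  • queries_dataset_id_column (str, optional) – Name of the id column in the queries dataset (default: None).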
class previsionio.text_similarity.TextSimilarity(**experiment_version_info)

Bases: previsionio.experiment_version.BaseExperimentVersion

A text similarity experiment version

best_model

Get the model with the best predictive performance over all models (including Blend models), where the best performance corresponds to a minimal loss.

Returns:Model with the best performance in the experiment, or None if no model matched the search filter.
Return type:(Model, None)
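
The selection rule described here (smallest loss wins, None when there is nothing to choose from) can be sketched in a few lines. The `(name, loss)` tuples and the `pick_best` helper are assumptions made for illustration; the library works on Model objects and its internal logic is not shown in this reference.

```python
from typing import Iterable, Optional, Tuple


def pick_best(models: Iterable[Tuple[str, float]]) -> Optional[Tuple[str, float]]:
    """Return the (name, loss) pair with the smallest loss, or None if empty."""
    # `default=None` covers the "no model matched" case.
    return min(models, key=lambda model: model[1], default=None)
```

With `[('brute_force', 0.42), ('lsh', 0.57)]` the helper returns `('brute_force', 0.42)`; with an empty list it returns `None`.
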
dataset

Get the Dataset object corresponding to the training dataset of this experiment version.

Returns:Associated training dataset
Return type:Dataset
delete()

Delete an experiment version from the current [client] workspace.

Raises:
  • PrevisionException – If the experiment version does not exist
  • requests.exceptions.ConnectionError – Error processing the request
delete_prediction(prediction_id: str)

Delete a prediction in the list for the current experiment from the current [client] workspace.

Parameters:prediction_id (str) – Unique id of the prediction to delete
Returns:Deletion process results
Return type:dict
delete_predictions()

Delete all predictions in the list for the current experiment from the current [client] workspace.

Returns:Deletion process results
Return type:dict
done

Get a flag indicating whether or not the experiment is currently done.

Returns:done status
Return type:bool
fastest_model

Get the model that predicts with the lowest response time.

Returns:Model corresponding to the fastest model
Return type:Model
classmethod from_id(_id: str) → previsionio.text_similarity.TextSimilarity

Get a text-similarity experiment version from the platform by its unique id.

Parameters:_id (str) – Unique id of the experiment version to retrieve
Returns:Fetched experiment version
Return type:TextSimilarity
Raises:PrevisionException – Any error while fetching data from the platform or parsing result
get_holdout_predictions(full: bool = False)

Retrieves the list of holdout predictions for the current experiment from client workspace (with the full predictions object if necessary).

Parameters:full (bool) – If true, return full holdout prediction objects (else only metadata)

get_predictions(full: bool = False)

Retrieves the list of predictions for the current experiment from client workspace (with the full predictions object if necessary).

Parameters:full (bool) – If true, return full prediction objects (else only metadata)

model_class

alias of previsionio.model.TextSimilarityModel

models

Get the list of models generated for the current experiment version. Only the models that are done training are retrieved.

Returns:List of models found by the platform for the experiment
Return type:list(Model)
new_version(dataset: previsionio.dataset.Dataset = None, description_column_config: previsionio.text_similarity.DescriptionsColumnConfig = None, metric: previsionio.metrics.TextSimilarity = None, top_k: int = None, lang: previsionio.text_similarity.TextSimilarityLang = None, queries_dataset: previsionio.dataset.Dataset = None, queries_column_config: Optional[previsionio.text_similarity.QueriesColumnConfig] = None, models_parameters: previsionio.text_similarity.ListModelsParameters = None, description: str = None) → previsionio.text_similarity.TextSimilarity

Start a new text-similarity experiment version training from this version (on the platform). The training parameters are copied from the current version and then overridden for those provided.

Parameters:
  • dataset (Dataset) – Reference to the dataset object to use as training dataset
  • description_column_config (DescriptionsColumnConfig) – Description column configuration (see the documentation of the DescriptionsColumnConfig resource for more details on each possible column type)
  • metric (metrics.TextSimilarity, optional) – Specific metric to use for the experiment
  • top_k (int, optional) – top_k
  • lang (TextSimilarityLang, optional) – Language of the training dataset
  • queries_dataset (Dataset, optional) – Reference to a dataset object to use as a queries dataset
  • queries_column_config (QueriesColumnConfig) – Queries column configuration (see the documentation of the QueriesColumnConfig resource for more details on each possible column type)
  • models_parameters (ListModelsParameters) – Specific training configuration (see the documentation of the ListModelsParameters resource for more details on all the parameters)
  • description (str, optional) – The description of this experiment version (default: None)
Returns:Newly created text-similarity experiment version object (new version)
Return type:TextSimilarity
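
The "copy, then override" behavior described for new_version can be pictured as a dictionary merge. The sketch below assumes that an argument left at None means "keep the value of the current version", which is consistent with every parameter defaulting to None; `merge_version_params` is a hypothetical helper and not the library's code.

```python
def merge_version_params(current: dict, **overrides) -> dict:
    """Copy `current`, then replace only the entries whose override is not None."""
    merged = dict(current)  # shallow copy, so `current` is left untouched
    merged.update({key: value for key, value in overrides.items() if value is not None})
    return merged
```

For example, merging `{'top_k': 10, 'lang': 'en'}` with `top_k=5, lang=None` gives `{'top_k': 5, 'lang': 'en'}`.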

print_info()

Print all info on the experiment.

queries_dataset

Get the Dataset object corresponding to the queries dataset of this experiment version.

Returns:Associated queries dataset
Return type:Dataset
running

Get a flag indicating whether or not the experiment is currently running.

Returns:Running status
Return type:bool
schema

Get the data schema of the experiment.

Returns:Experiment schema
Return type:dict
score

Get the current score of the experiment (i.e. the score of the model that is currently considered the best performance-wise for this experiment).

Returns:Experiment score (or infinity if not available).
Return type:float
status

Get a flag indicating whether or not the experiment is currently running.

Returns:Running status
Return type:bool
stop()

Stop an experiment (stopping all nodes currently in progress).

train_dataset

Get the Dataset object corresponding to the training dataset of the experiment.

Returns:Associated training dataset
Return type:Dataset
update_status()

Get an update on the status of a resource.

Parameters:specific_url (str, optional) – Specific (already parametrized) url to fetch the resource from (otherwise the url is built from the resource type and unique _id)
Returns:Updated status info
Return type:dict
wait_until(condition, raise_on_error: bool = True, timeout: float = 3600.0)

Wait until condition is fulfilled, then break.

Parameters:
  • condition (func: (BaseExperimentVersion) -> bool) – Function to use to check the break condition
  • raise_on_error (bool, optional) – If true then the function will stop on error, otherwise it will continue waiting (default: True)
  • timeout (float, optional) – Maximal amount of time to wait before forcing exit

Example:

experiment.wait_until(lambda experimentv: len(experimentv.models) > 3)
Raises:PrevisionException – If the resource could not be fetched or there was a timeout.
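
To make the roles of raise_on_error and timeout concrete, here is a minimal polling loop with the same parameters. It is a sketch, not the library's implementation, and rests on several assumptions: the timeout is taken to be in seconds, `poll_interval` is a hypothetical extra parameter, the resource is passed explicitly instead of being `self`, and the built-in `TimeoutError` stands in for PrevisionException.

```python
import time
from typing import Callable


def wait_until(resource, condition: Callable[[object], bool],
               raise_on_error: bool = True, timeout: float = 3600.0,
               poll_interval: float = 1.0) -> None:
    """Poll `condition(resource)` until it is true or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if condition(resource):
                return  # condition fulfilled: stop waiting
        except Exception:
            if raise_on_error:
                raise
            # Otherwise ignore the error and keep waiting.
        time.sleep(poll_interval)
    raise TimeoutError(f"condition not fulfilled within {timeout} s")
```

A monotonic clock is used for the deadline so that system clock adjustments do not shorten or lengthen the wait.
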
class previsionio.text_similarity.TextSimilarityLang

Bases: enum.Enum

An enumeration.

class previsionio.text_similarity.TextSimilarityModels

Bases: enum.Enum

Similarity search models for Text Similarity

BruteForce = 'brute_force'

Brute force search

ClusterPruning = 'cluster_pruning'

Cluster Pruning

HKM = 'hkm'

Hierarchical K-Means

IVFOPQ = 'ivf_opq'

InVerted File system and Optimized Product Quantization

LSH = 'lsh'

Locality Sensitive Hashing
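
As an illustration of the simplest combination listed in this reference (a “tf_idf” embedding with a “brute_force” search), the sketch below scores a query against every description and keeps the best matches. It is a pure-Python toy under stated assumptions, not the platform's implementation: tokenization is a lowercase whitespace split, the IDF uses a common smoothed formula, similarity is cosine, and `top_k` is taken to mean the number of results returned. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tfidf_vectors(docs: List[str]) -> Tuple[List[Dict[str, float]], Dict[str, float]]:
    """Return one sparse TF-IDF vector per document, plus the IDF table."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: a term present in every document keeps a non-zero weight.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1.0 for term, df in doc_freq.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenized]
    return vectors, idf


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def brute_force_search(query: str, docs: List[str], top_k: int = 2) -> List[Tuple[int, float]]:
    """Compare the query with every document and return the top_k (index, score) pairs."""
    vectors, idf = tfidf_vectors(docs)
    counts = Counter(query.lower().split())
    # Terms unseen in the descriptions carry no weight.
    query_vec = {t: c * idf[t] for t, c in counts.items() if t in idf}
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Brute force compares the query with every description, so its cost grows linearly with the dataset; the other searching models listed above (cluster pruning, HKM, IVF-OPQ, LSH) are approximate methods that avoid this exhaustive scan.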