Getting started

This document is a step-by-step usage example of the Prevision.io Python SDK. The full documentation of the software is available here.

Pre-requisites

You need to have an account at cloud.prevision.io or on an on-premise version installed in your company. Contact us or your IT manager for more information.

You will be working on a specific “instance”. This instance corresponds to the subdomain at the beginning of the url in your prevision.io address: https://<your instance>.prevision.io.

Get the package

pip install previsionio

Set up your client

Prevision.io’s SDK client uses a specific master token to authenticate with the instance’s server and allows you to perform various requests. To get your master token, log in to the online interface of your instance, navigate to the admin page and copy the token.

You can set the instance URL and the token either as environment variables (PREVISION_URL and PREVISION_MASTER_TOKEN) or at the beginning of your script:

import previsionio as pio

# The client is initialized with your master token and the url of the prevision.io server
# (or local installation, if applicable)
url = "https://<your instance>.prevision.io"
token = "<your token>"
pio.client.init_client(url, token)

# You can manage the verbosity (only output warnings and errors by default)
pio.verbose(
    False,           # whether to activate info logging
    debug=False,     # whether to activate detailed debug logging
    event_log=False, # whether to activate detailed event managers debug logging
)

# You can manage the duration you wish to wait for an asynchronous response
pio.config.default_timeout = 3600

# You can manage the number of retries for each call to the Prevision.io API
pio.config.request_retries = 6

# You can manage the duration to wait between retries of a call to the Prevision.io API
pio.config.request_retry_time = 10
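
If you go the environment-variable route, it can be convenient to validate both values before initializing the client. The sketch below is only an illustration: read_prevision_settings is a hypothetical helper, not part of the SDK, and the example URL and token are made up. Only the variable names PREVISION_URL and PREVISION_MASTER_TOKEN come from the text above.

```python
import os


def read_prevision_settings(env=None):
    """Return (url, token) read from the environment, or raise if one is missing.

    Hypothetical helper for illustration; it is not part of the SDK.
    """
    env = os.environ if env is None else env
    missing = [key for key in ("PREVISION_URL", "PREVISION_MASTER_TOKEN")
               if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variable(s): " + ", ".join(missing))
    # Drop a trailing slash so the URL has a single canonical form.
    return env["PREVISION_URL"].rstrip("/"), env["PREVISION_MASTER_TOKEN"]


# Example with an explicit mapping instead of the real environment
url, token = read_prevision_settings({
    "PREVISION_URL": "https://my-instance.prevision.io/",
    "PREVISION_MASTER_TOKEN": "dummy-token",
})
print(url)
```

The returned pair can then be passed to pio.client.init_client(url, token).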

Create a project

First things first, to upload data or train a usecase, you need to create a project.

# create project
project = pio.Project.new(name="project_name",
                          description="project description")

Data

To train a usecase, you need to gather some training data. This data must be uploaded to your instance using either a data source, a file path or a pandas.DataFrame.

Managing datasources & connectors

Datasources and connectors are Prevision.io’s way of keeping a link to a source of data and taking snapshots when needed. The available data sources are:

  • SQL
  • FTP
  • SFTP
  • S3
  • GCP

Connectors hold the credentials to connect to the distant data sources. Then you can specify the exact resource to extract from a data source (be it the path to the file to load, the name of the database table to parse, …).

Creating a connector

To create a connector, use the appropriate method of the Project class. For example, to create a connector to an SQL database, call create_sql_connector() and pass in your credentials:

connector = project.create_sql_connector('my_sql_connector',
                                         'https://myserver.com',
                                         port=3306,
                                         username='username',
                                         password='password')

For more information on all the available connectors, check out the Project full documentation.

Creating a data source

After you’ve created a connector, you need to use a datasource to actually refer to and fetch a resource in the distant data source. To create a datasource, you need to link the matching connector and to supply the relevant info, depending on the connector type:

datasource = project.create_datasource(connector,
                                       'my_sql_datasource',
                                       database='my_db',
                                       table='table1')

For more details on the creation of a datasource, check out the Project full documentation of the method create_datasource.

You can then create datasets from this datasource as explained in Uploading Data.

Listing available connectors and data sources

Connectors and datasources already registered in your workspace can be listed using the list_connectors() and list_datasource() methods of the Project class:

connectors = project.list_connectors()
for connector in connectors:
    print(connector.name)

datasources = project.list_datasource()
for datasource in datasources:
    print(datasource.name)

Uploading Data

You can upload data from three different sources: a path to a local (csv, zip) file, a pandas.DataFrame or a previously created data source:

# Upload tabular data from a CSV file
data_path = 'path/to/your/data.csv'
dataset = project.create_dataset(name='helloworld', file_name=data_path)

# or use a pandas DataFrame
import pandas as pd
dataframe = pd.read_csv(data_path)
dataset = project.create_dataset(name='helloworld', dataframe=dataframe)

# or use a created data source
datasource = pio.DataSource.from_id('my_datasource_id')
dataset = project.create_dataset(name='helloworld', datasource=datasource)

# Upload an image folder
image_folder_path = 'path/to/your/image_data.zip'
image_folder = project.create_image_folder(name='helloworld', file_name=image_folder_path)

This will automatically upload the data as a new dataset in your workspace. If you go to the online interface, you will see this new dataset in the list of datasets (in the “Data” tab).
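
To try the upload calls without real data, you can first generate a small CSV file. The sketch below uses only the standard library; the column names (ID, feature1, feature2, TARGET), the values and the file name are made up for illustration.

```python
import csv
import os
import tempfile

# Made-up rows: an identifier, two features and a binary target
rows = [
    {"ID": 1, "feature1": 0.5, "feature2": 42, "TARGET": 0},
    {"ID": 2, "feature1": 1.5, "feature2": 17, "TARGET": 1},
    {"ID": 3, "feature1": 2.5, "feature2": 8, "TARGET": 0},
]

# Write the rows to a temporary directory so nothing in the project is touched
data_path = os.path.join(tempfile.mkdtemp(), "helloworld.csv")
with open(data_path, "w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)

print(data_path)
```

The resulting data_path can be given as the file_name argument of project.create_dataset, as in the snippet above.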

Listing available datasets

To get a list of all the datasets currently available in your workspace, use the list_datasets() method:

# List tabular datasets
datasets = project.list_datasets()
for dataset in datasets:
    print(dataset.name)

# List image folders
image_folders = project.list_image_folders()
for folder in image_folders:
    print(folder.name)

Downloading data from your workspace

If you created or uploaded a dataset in your workspace and want to grab it locally, simply use the Dataset.download method:

out_path = dataset.download(download_path="your/local/path")

Regression/Classification/Multi-classification usecases

Configuring the dataset

To start a usecase, you need to specify the dataset to be used and its configuration (target column, weight column, id column, …). For full documentation, see the API reference of ColumnConfig in Usecase configuration.

column_config = pio.ColumnConfig(target_column='TARGET', id_column='ID')
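
A mistyped column name is an easy mistake to make, so it can help to check the names against the file header locally before starting a training. The sketch below is a hypothetical pre-flight check, not an SDK feature; the function name and the sample data are made up.

```python
import csv
import io


def missing_columns(csv_text, required):
    """Return the required column names that are absent from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return [name for name in required if name not in header]


sample = "ID,feature1,TARGET\n1,0.5,0\n2,1.5,1\n"
print(missing_columns(sample, ["TARGET", "ID"]))      # []
print(missing_columns(sample, ["TARGET", "WEIGHT"]))  # ['WEIGHT']
```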

Configuring the training parameters

If you want, you can also specify some training parameters, such as which models are used, which transformations are applied, and how the models are optimized. For full documentation, see the API reference of TrainingConfig in Usecase configuration.

training_config = pio.TrainingConfig(
    advanced_models=[pio.AdvancedModel.LinReg],
    normal_models=[pio.NormalModel.LinReg],
    simple_models=[pio.SimpleModel.DecisionTree],
    features=[pio.Feature.Counts],
    profile=pio.Profile.Quick,
)

Starting training

You can now create a new usecase based on:

  • a usecase name
  • a dataset
  • a column config
  • (optional) a metric type
  • (optional) a training config
  • (optional) a holdout dataset (dataset only used for evaluation)

usecase_version = project.fit_classification(
    name='helloworld_classif',
    dataset=dataset,
    column_config=column_config,
    metric=pio.metrics.Classification.AUC,
    training_config=training_config,
    holdout_dataset=None,
)

If you want to use image data for your usecase, you need to provide the API with both the tabular dataset and the image folder:

usecase_version = project.fit_image_classification(
    name='helloworld_images_classif',
    dataset=(dataset, image_folder),
    column_config=column_config,
    metric=pio.metrics.Classification.AUC,
    training_config=training_config,
    holdout_dataset=None,
)

For an exhaustive list of the available metrics, see the API reference Metrics.

Making predictions

To make predictions from a dataset and a usecase, you need to wait until at least one model is trained. This can be achieved in the following way:

# block until there is at least 1 model trained
usecase_version.wait_until(lambda usecasev: len(usecasev.models) > 0)

# check out the usecase status and other info
usecase_version.print_info()
print('Current (best model) score:', usecase_version.score)

Note

The wait_until method takes a function that takes the usecase as an argument, and can therefore access any info relative to the usecase.
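
The predicate pattern behind wait_until can be illustrated with a plain polling loop. This is a generic sketch, not the SDK's implementation: the function name poll_until, its timeout and interval parameters and the FakeUsecase class are all assumptions made for the example.

```python
import time


def poll_until(obj, condition, timeout=5.0, interval=0.01):
    """Call condition(obj) repeatedly until it is true or the timeout expires."""
    deadline = time.monotonic() + timeout
    while not condition(obj):
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


class FakeUsecase:
    """Stand-in object whose model list stays empty for the first two reads."""

    def __init__(self):
        self._reads = 0

    @property
    def models(self):
        self._reads += 1
        # Pretend a first model finishes training on the third read.
        return ["model_1"] if self._reads >= 3 else []


usecase = FakeUsecase()
poll_until(usecase, lambda uc: len(uc.models) > 0)
print("models available after", usecase._reads, "reads")
```

Because the condition receives the whole object, it can test any attribute, for example a minimum number of models or a score threshold.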

Then you have two options:

  1. You can predict from a dataset in your workspace, which returns a previsionio.ValidationPrediction object. This lets you keep working while the prediction is still running.
  2. You can predict from a pd.DataFrame, which returns a pd.DataFrame once the prediction is complete.

# predict from a dataset of your workspace
validation_prediction = usecase_version.predict_from_dataset(test_dataset)
# get the result as a pandas.DataFrame
prediction_df = validation_prediction.get_result()

# predict from a pandas.DataFrame
prediction_df = usecase_version.predict(test_dataframe)

Time Series usecases

A time series usecase is very similar to a regression usecase. The main differences lie in the dataset configuration and the specification of a time window.

Configuring the dataset

Here you need to specify which column in the dataset defines the time steps. You can also specify the group_columns (columns defining a unique time series) as well as the apriori_columns (columns containing information known in advance):

column_config = pio.ColumnConfig(
    target_column='Sales',
    id_column='ID',
    time_column='Date',
    group_columns=['Store', 'Product'],
    apriori_columns=['is_holiday'],
)

Configuring the training parameters

The training config is the same as for a regression usecase (detailed in Configuring the training parameters).

Starting training

You can now create a new usecase based on:

  • a usecase name
  • a dataset
  • a column config
  • a time window
  • (optional) a metric type
  • (optional) a training config

In particular the time_window parameter defines the period in the past that you have for each prediction, and the period in the future that you want to predict:

# Define your time window:
# example here using 2 weeks in the past to predict the next week
time_window = pio.TimeWindow(
    derivation_start=-28,
    derivation_end=-14,
    forecast_start=1,
    forecast_end=7,
)

usecase_version = project.fit_timeseries_regression(
    name='helloworld_time_series',
    dataset=dataset,
    time_window=time_window,
    column_config=column_config,
    metric=pio.metrics.Regression.RMSE,
    training_config=training_config,
    holdout_dataset=None,
)

For full documentation, see the API reference Time Series usecases.
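
A quick sanity check on the four offsets can catch sign or ordering mistakes before a training is launched. The sketch below encodes only the constraints suggested by the description above (a derivation window in the past, a forecast window in the future, each with its start before its end). The function name and the exact rules are assumptions; the SDK may enforce different or additional ones.

```python
def check_time_window(derivation_start, derivation_end, forecast_start, forecast_end):
    """Return a list of problems found in the window offsets (empty if none)."""
    problems = []
    if derivation_start >= derivation_end:
        problems.append("derivation_start must be before derivation_end")
    if derivation_end > 0:
        problems.append("the derivation window must lie in the past")
    if forecast_start <= 0:
        problems.append("the forecast window must lie in the future")
    if forecast_start > forecast_end:
        problems.append("forecast_start must not be after forecast_end")
    return problems


# The window used in the example above passes every check
print(check_time_window(-28, -14, 1, 7))  # []

# Swapped derivation bounds and a forecast starting at 0 are both reported
print(check_time_window(-14, -28, 0, 7))
```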

Making predictions

The prediction workflow is the same as for a classic usecase (detailed in Making predictions).

Text Similarity usecases

A Text Similarity usecase matches the most similar texts between a dataset containing descriptions (can be seen as a catalog) and a dataset containing queries. It first converts texts to numerical vectors (text embeddings) and then performs a similarity search to retrieve the most similar documents to a query.
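
The two stages described above (embed the texts, then search by similarity) can be sketched in pure Python with term-count vectors and a brute-force cosine search. This is a toy illustration of the general idea only, not the models the platform trains; the catalog, the query and all function names are made up.

```python
import math
from collections import Counter


def embed(text):
    """Turn a text into a sparse term-count vector (a very crude embedding)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(count * v[term] for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(query, catalog, top_k=2):
    """Return the ids of the top_k catalog entries most similar to the query."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(text)), item_id)
              for item_id, text in catalog.items()]
    # Sort by decreasing similarity; ties are broken by id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item_id for _, item_id in scored[:top_k]]


catalog = {
    "A": "red cotton t-shirt",
    "B": "blue denim jeans",
    "C": "red leather shoes",
}
print(search("red shoes", catalog))  # ['C', 'A']
```

Here the catalog plays the role of the descriptions dataset and the string passed to search plays the role of a single query.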

Configuring the datasets

To start a usecase you need to specify the datasets to be used and their configuration. Note that a DescriptionsDataset is required while a QueriesDataset is optional during training (used for scoring).

# Required: configuration of the DescriptionsDataset
description_column_config = pio.TextSimilarity.DescriptionsColumnConfig(
    content_column='text_descriptions',
    id_column='ID',
)

# Optional: configuration of the QueriesDataset
queries_column_config = pio.TextSimilarity.QueriesColumnConfig(
    content_column='text_queries',
    id_column='ID',
)

For full documentation, see the API reference of DescriptionsColumnConfig and QueriesColumnConfig.

Configuring the training parameters

If you want, you can also specify some training parameters, such as which embedding models, searching models and preprocessing are used. Here you need to specify one configuration per embedding model you want to use:

# Using TF-IDF as embedding model
models_parameters_1 = pio.ModelsParameters(
    model_embedding=pio.ModelEmbedding.TFIDF,
    preprocessing=pio.Preprocessing(),
    models=[pio.TextSimilarityModels.BruteForce, pio.TextSimilarityModels.ClusterPruning],
)

# Using Transformer as embedding model
models_parameters_2 = pio.ModelsParameters(
    model_embedding=pio.ModelEmbedding.Transformer,
    preprocessing=pio.Preprocessing(),
    models=[pio.TextSimilarityModels.BruteForce, pio.TextSimilarityModels.IVFOPQ],
)

# Using fine-tuned Transformer as embedding model
models_parameters_3 = pio.ModelsParameters(
    model_embedding=pio.ModelEmbedding.TransformerFineTuned,
    preprocessing=pio.Preprocessing(),
    models=[pio.TextSimilarityModels.BruteForce, pio.TextSimilarityModels.IVFOPQ],
)

# Gather everything
models_parameters = [models_parameters_1, models_parameters_2, models_parameters_3]
models_parameters = pio.ListModelsParameters(models_parameters=models_parameters)

For full documentation, see the API reference of ModelsParameters.

Note

If you want the default configuration of text similarity models, simply use:

models_parameters = pio.ListModelsParameters()

Starting the training

You can then create a new text similarity usecase based on:

  • a usecase name
  • a dataset
  • a description column config
  • (optional) a queries dataset
  • (optional) a queries column config
  • (optional) a metric type
  • (optional) the number of top k results you want per query
  • (optional) a language
  • (optional) a models parameters list

usecase_version = project.fit_text_similarity(
    name='helloworld_text_similarity',
    dataset=dataset,
    description_column_config=description_column_config,
    metric=pio.metrics.TextSimilarity.accuracy_at_k,
    top_k=10,
    queries_dataset=queries_dataset,
    queries_column_config=queries_column_config,
    models_parameters=models_parameters,
)

For full documentation, see the API reference of previsionio.metrics.TextSimilarity.
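
As a rough guide to what an accuracy-at-k style metric measures, the sketch below computes the share of queries whose expected description appears among the first k results. It follows the usual definition of the term, which is an assumption about the platform's metric rather than its actual implementation; the function name and the data are made up.

```python
def accuracy_at_k(ranked_results, expected, k):
    """Share of queries whose expected id is among their first k results."""
    if not expected:
        return 0.0
    hits = sum(1 for query_id, target in expected.items()
               if target in ranked_results.get(query_id, [])[:k])
    return hits / len(expected)


# Ranked description ids returned for each query (best match first)
ranked_results = {
    "q1": ["C", "A", "B"],
    "q2": ["B", "C", "A"],
    "q3": ["A", "B", "C"],
}
# The description each query should have matched
expected = {"q1": "C", "q2": "A", "q3": "B"}

print(accuracy_at_k(ranked_results, expected, k=1))
print(accuracy_at_k(ranked_results, expected, k=3))
```

With this definition the score can only stay equal or increase as k grows, which is why the value of top_k matters when comparing results.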

Making predictions

The prediction workflow is very similar to a classic usecase (detailed in Making predictions).

The only differences are the specific parameters top_k and queries_dataset_matching_id_description_column which are optional.

For full documentation, see the API reference of the TextSimilarityModel prediction methods.

Deployed usecases

Prevision.io’s SDK allows you to deploy a usecase’s models. Deployed models are made available for unit and bulk prediction through APIs. You can then follow the usage of a model and the evolution of the distribution of its input features.

You first need to deploy a main model (and, optionally, a challenger model) from an existing usecase:

# retrieve the best model of your usecase
uc_best_model = usecase_version.best_model

# deploy the usecase model
usecase_deployment = project.create_usecase_deployment(
    'my_deployed_usecase',
    main_model=uc_best_model,
    challenger_model=None,
)

Now you can make bulk predictions from your deployed model(s):

# make predictions
deployment_prediction = usecase_deployment.predict_from_dataset(test_dataset)

# retrieve prediction from main model
prediction_df = deployment_prediction.get_result()

# retrieve prediction from challenger model (if any)
prediction_df = deployment_prediction.get_challenger_result()

For full documentation, see the API reference Usecase Deployment.

You can also make unitary predictions from the main model:

# create an api key for your model
usecase_deployment.create_api_key()

# retrieve the last client id and client secret
creds = usecase_deployment.get_api_keys()[-1]

# initialize the deployed model with its url, your client id and client secret
model = pio.DeployedModel(
    prevision_app_url=usecase_deployment.url,
    client_id=creds['client_id'],
    client_secret=creds['client_secret'],
)

# make a prediction
prediction, confidence, explain = model.predict(
    predict_data={'feature1': 0, 'feature2': 42},
    use_confidence=True,
    explain=True,
)

For full documentation, see the API reference Deployed model.

Exporters

Once you have trained a model and made predictions from it, you might want to export your results to a remote filesystem or database. To do so, you need a registered connector in your project (described in section Creating a connector).

Creating an exporter

The first step is to create an exporter in your project:

exporter = project.create_exporter(
    connector=connector,
    name='my_exporter',
    path='remote/file/path.csv',
    write_mode=pio.ExporterWriteMode.timestamp,
)

For full documentation, see the API reference Exporter.

Exporting

Once your exporter is operational you can export your datasets or predictions:

# export a dataset stored in your project
export = exporter.export_dataset(
    dataset=dataset,
    wait_for_export=False,
)

# export a prediction stored in your project
export = exporter.export_prediction(
    prediction=deployment_prediction,
    wait_for_export=False,
)

For full documentation, see the API reference Export.

Additional util methods

Retrieving a use case

Since a use case can be somewhat long to train, it can be useful to separate the training, monitoring and prediction phases.

To do that, we need to be able to recreate a usecase object in Python from its id:

usecase_version = pio.Supervised.from_id('<a usecase id>')
# usecase_version now has the same methods as a usecase version
# created directly from a file or a dataframe
usecase_version.print_info()

Stopping and deleting

Once you’re satisfied with model performance, don’t want to wait for the complete training process to be over, or need to free up some resources to start a new training, you can simply stop the usecase version:

usecase_version.stop()

You’ll still be able to make predictions and get info, but the performance won’t improve anymore. Note: there’s no difference in state between a stopped usecase and a usecase that has finished training.

You can decide to completely delete the usecase:

uc = pio.Usecase.from_id(usecase_version.usecase_id)
uc.delete()

Be careful, however: in that case every detail about the usecase will be removed, and you won’t be able to make predictions from it anymore.