Examples
This page provides example usages of the enc4ppm package.
Basic example
The following example shows how to set up the encoder so that it reads the correct log columns, such as the case ID, activity name, and timestamp.
import pandas as pd
from enc4ppm.frequency_encoder import FrequencyEncoder
from enc4ppm.constants import LabelingType
log = pd.read_csv('log.csv')
encoder = FrequencyEncoder(
    labeling_type=LabelingType.NEXT_ACTIVITY,
    case_id_key='CaseID', # if not set, defaults to case:concept:name
    activity_key='Activity', # if not set, defaults to concept:name
    timestamp_key='Complete Timestamp', # if not set, defaults to time:timestamp
)
encoded_log = encoder.encode(log)
Encode train-test data: freezing the encoder
When working with a train-test split, the encoder should not look at the test data, to avoid data leakage. It is usually a good idea to first encode the training data and freeze the encoder on it, then use the frozen encoder to encode the test data.
The encoder is frozen by calling .encode() with freeze=True. When an encoder is not frozen, calling .encode() builds its internal vocabularies of activities, attributes, and so on; when .encode() is called on a frozen encoder, it reuses the previously computed vocabularies instead.
For example, the test set may contain an activity ActivityX that is not present in the training set. If frequency encoding is performed on the full log (training + test), the encoded dataframe will contain a column ActivityX even though that activity should not be known at training time.
The following example avoids this problem by encoding training and test sets separately, freezing the encoder on the training data.
import pandas as pd
from enc4ppm.simple_index_encoder import SimpleIndexEncoder
from enc4ppm.constants import LabelingType
def split_log(log):
    # your split logic here; must return (train_log, test_log)
    # (a possible implementation is sketched below)
    ...
log = pd.read_csv('log.csv')
train_log, test_log = split_log(log) # split the log before encoding
encoder = SimpleIndexEncoder(
    labeling_type=LabelingType.REMAINING_TIME,
)
encoded_train_log = encoder.encode(train_log, freeze=True) # freeze encoder on train log
encoded_test_log = encoder.encode(test_log) # use frozen encoder on test log
The encoder is frozen on the train log. As a result, when encoding the test log, ActivityX will be mapped to the activity UNKNOWN, because it was not present in the train log.
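The splitting itself is left to you. As a minimal sketch, and not part of enc4ppm, split_log could perform a temporal split by case; the column names case:concept:name and time:timestamp below are the encoder defaults and are assumed to be present in log.csv:
def split_log(log, train_fraction=0.8):
    # Order cases by their first timestamp (assumed sortable, e.g. ISO-formatted),
    # then put the earliest train_fraction of cases in the train log and the rest in the test log.
    case_starts = (
        log.groupby('case:concept:name')['time:timestamp']
        .min()
        .sort_values()
    )
    n_train = int(len(case_starts) * train_fraction)
    train_cases = case_starts.index[:n_train]
    train_mask = log['case:concept:name'].isin(train_cases)
    return log[train_mask], log[~train_mask]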
Save and load encoder to disk
You can save the encoder object to a file and load it back for later use. To save an encoder to disk, you need to freeze it first.
The following example shows a simple save/load workflow.
import pandas as pd
from enc4ppm.frequency_encoder import FrequencyEncoder
from enc4ppm.constants import LabelingType
from enc4ppm.base_encoder import BaseEncoder # assumed import path for BaseEncoder, which provides load(); adjust if it is exposed elsewhere
log = pd.read_csv('log.csv')
encoder = FrequencyEncoder(
    labeling_type=LabelingType.NEXT_ACTIVITY,
)
encoded_log = encoder.encode(log, freeze=True) # freeze encoder
encoder.summary() # print info about the encoder
encoder.save('/path/to/encoder.pkl') # save encoder to disk
loaded_encoder = BaseEncoder.load('/path/to/encoder.pkl') # load encoder from disk
loaded_encoder.summary() # print info about loaded_encoder (should output the same as encoder.summary())
inference_log = pd.read_csv('inference.csv')
encoded_inference_log = loaded_encoder.encode(inference_log)
Prefix length and strategy
You can specify prefix_length to set a specific prefix length; otherwise, the maximum prefix length found in the log is used. You can also specify prefix_strategy: either up_to_specified (the default), which considers all prefix lengths from 1 up to prefix_length, or only_specified, which considers only prefixes of exactly length prefix_length.
The following code encodes a log keeping only examples with a prefix length of 10.
import pandas as pd
from enc4ppm.simple_index_encoder import SimpleIndexEncoder
from enc4ppm.constants import LabelingType, PrefixStrategy
log = pd.read_csv('log.csv')
encoder = SimpleIndexEncoder(
    labeling_type=LabelingType.REMAINING_TIME,
    prefix_length=10,
    prefix_strategy=PrefixStrategy.ONLY_SPECIFIED,
)
encoded_log = encoder.encode(log)
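For comparison, keeping the default up_to_specified strategy with the same prefix_length produces examples for every prefix length from 1 to 10. A sketch reusing the imports and log above:
encoder = SimpleIndexEncoder(
    labeling_type=LabelingType.REMAINING_TIME,
    prefix_length=10, # default strategy: all prefixes of length 1 up to 10
)
encoded_log = encoder.encode(log)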
Categorical encoding
The categorical_encoding parameter determines whether categorical values (activity names and categorical attributes) are kept as strings (the default) or one-hot encoded.
The following example encodes with one-hot encoding.
import pandas as pd
from enc4ppm.simple_index_encoder import SimpleIndexEncoder
from enc4ppm.constants import LabelingType, CategoricalEncoding
log = pd.read_csv('log.csv')
encoder = SimpleIndexEncoder(
    labeling_type=LabelingType.REMAINING_TIME,
    categorical_encoding=CategoricalEncoding.ONE_HOT,
)
encoded_log = encoder.encode(log)
Numerical scaling
The numerical_scaling parameter can be used to scale numerical values (numerical attributes, the label in the case of remaining time, and the TimeSinceCaseStart and TimeSincePreviousActivity features). It can be either none (the default), which applies no scaling, or standardization, which applies standardization. The dictionary encoder.numerical_scaling_info contains the mean and std values needed to transform standardized numerical values back to their original range; the helper method unscale_numerical_feature undoes the standardization automatically.
The following example first standardizes all numerical values, then restores the 'label' column back to its original space.
import pandas as pd
from enc4ppm.frequency_encoder import FrequencyEncoder
from enc4ppm.constants import LabelingType, NumericalScaling
log = pd.read_csv('log.csv')
encoder = FrequencyEncoder(
    labeling_type=LabelingType.REMAINING_TIME,
    numerical_scaling=NumericalScaling.STANDARDIZATION,
)
encoded_log = encoder.encode(log)
# Use encoder.numerical_scaling_info to restore original label values...
restored_label = (
    encoder.numerical_scaling_info[encoder.LABEL_KEY]['std'] * encoded_log[encoder.LABEL_KEY]
    + encoder.numerical_scaling_info[encoder.LABEL_KEY]['mean']
)
# ... or use the method unscale_numerical_feature()
restored_label = encoder.unscale_numerical_feature(encoded_log[encoder.LABEL_KEY], encoder.LABEL_KEY)
Label remaining time as a classification task
Instead of labeling remaining time as a regression task (with the label being the number of hours until trace completion), it is also possible to label it as a classification task.
The following example labels remaining time as a classification problem, also specifying the number of bins the remaining times are divided into.
import pandas as pd
from enc4ppm.simple_index_encoder import SimpleIndexEncoder
from enc4ppm.constants import LabelingType
log = pd.read_csv('log.csv')
encoder = SimpleIndexEncoder(
    labeling_type=LabelingType.REMAINING_TIME_CLASSIFICATION,
)
encoder.set_remaining_time_num_bins(10) # cut in 10 bins (10 classes)
encoded_log = encoder.encode(log)
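After encoding, each example's label is the bin its remaining time falls into. Assuming the label column name is exposed as encoder.LABEL_KEY (as in the numerical scaling example above), you can inspect the resulting class distribution like this:
print(encoded_log[encoder.LABEL_KEY].value_counts()) # number of examples per remaining-time bin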