<a href="https://colab.research.google.com/github/Shea-Fyffe/transforming-personality-scales/blob/main/tutorials/fine-tuning-transformers-for-text-classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
# Fine-tuning Transformer Models for Text Classification of Big Five Items
---
This tutorial illustrates how to *fine-tune* (see [Liu et al., 2020](https://doi.org/10.1007/978-981-15-5573-2)) **Transformer** models to classify Big Five personality items. When applied for this specific purpose, fine-tuning involves training a classification model using a set of items with known Big Five trait labels. Afterwards, a fine-tuned model can be used to predict the Big Five trait of new items (or text).
<br></br>
While this notebook demonstrates how these models can be used for text classification of personality items (i.e., as an automated form of content analysis; [Short et al., 2018](https://doi.org/10.1146/annurev-orgpsych-032117-104622)), the same steps can be taken with other scale inventories or forms of text, merely by changing the training data and labels.

---
## Setup
---

Below, we provide information regarding the libraries, functions, and classes used in this tutorial. *Text Blocks* (like this) serve as informative signposts. `Code Blocks`, which have a black background, actually run the commands. We recommend adding a *Scratch Code Cell* (**Ctrl+Alt+N**) for running commands interactively.
<br></br>
**Libraries and Modules**

Colab comes with a large number of Python libraries pre-loaded. However, `Transformers` is not initially available in Colab. The `Transformers` library can be installed by using the code below. More information on the `Transformers` library can be seen [here](https://huggingface.co/transformers/quicktour.html).
<br></br>
**User-Defined Functions and Classes**

Below we provide several classes and functions to help make the process a bit easier. For each function, help text is provided and can be printed via `print(fun_name.__doc__)`. For example, to see documentation for the `fine_tune()` function:

```python
print(fine_tune.__doc__)

Output:

  Fine-tune transformer model for text classification.

  A wrapper function for fine-tuning a pre-trained transformer
  from the popular transformers library. Abstracts away many of the steps
  involved, such as loading a tokenizer and formatting data.
  
  Arguments
  ---------
  model: a string usually returned from ``get_model()``.
  text: a list of text.
  labels: a list of labels.
  train_args: a dictionary of training arguments.
  multi_label: a boolean specifying whether to perform multi-label classification (False by default).
  max_seq_len: 'longest' to pad to the longest text, or an integer length to pad/truncate to ('longest' by default).
  kwargs: additional keyword arguments to pass to ``Trainer.__init__``.

  Returns
  -------
  trainer : transformers.Trainer
    a fine-tuned transformer model, with its tokenizer attached as ``trainer.tokenizer``.

```
This provides the arguments that can be modified to customize the fine-tuning process.
<br></br>
**Using a GPU**

To speed things up you can use a *GPU* (*optional*). First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

You can confirm that you have an active GPU by using the following command:
```
# check using a command line interface
!nvidia-smi
```
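
The same check can also be made from Python. The block below is a minimal sketch, assuming PyTorch is available (it is normally pre-loaded in Colab); `describe_device` is a hypothetical helper name used only for this illustration and is not one of the tutorial's functions.

```python
import importlib.util


def describe_device() -> str:
    """Return a short description of the device PyTorch would use."""
    # torch may be missing outside Colab, so check before importing it
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed"
    import torch

    if torch.cuda.is_available():
        # name of the currently selected CUDA device
        return "GPU: " + torch.cuda.get_device_name(torch.cuda.current_device())
    return "CPU only (no CUDA GPU found)"


print(describe_device())
```

If the message reports "CPU only", the notebook will still run, just more slowly.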


In [None]:
#@markdown __Run:__ Install libraries
## Install Transformers and SentencePiece (comment these lines out if already installed)
! pip install transformers
! pip install sentencepiece

In [None]:
#@markdown __Run:__ Import libraries and modules
# load relevant modules from transformers
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, Trainer

# data libraries
from torch.utils.data import Dataset # for formatting data before training
import pandas as pd # for importing and exporting data
import numpy as np

# util libraries
from scipy.special import softmax
from sklearn.metrics import classification_report
from google.colab import drive # optional for getting data
from typing import Dict, List, Tuple # for type hinting
import torch
import os
import sys
import datetime
import gc
import warnings
import requests
from io import StringIO

---
## Functions and Classes
---


In [None]:
#@markdown __RUN:__ Data-Related Functions
# Custom data class
class TextClassificationDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        if self.labels is not None:
            item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.encodings["input_ids"])

# Import data function
def import_data(path: str, text_col: str = 'text', label_col: str = None, enc: str = 'latin1'):
    """
    Import text data from a csv file.

    A wrapper function around pandas.read_csv. Includes URL support.
    
    Arguments
    ---------
    path: a string indicating a local csv file path or url.
    text_col: a string indicating the name of column in csv containing text ('text' by default).
    label_col: a string indicating the name of column in csv containing text labels (None by default; if None, no labels are returned).
    enc: a string indicating csv file encoding ('latin1' by default).

    Returns
    -------
    List[str]
      a list of text.
    List[str]
      a list of labels (only returned when ``label_col`` is given).
    pandas.DataFrame
      the raw data.
    """
    if (path.startswith("http")):
        res = requests.get(path,
                           headers= {'User-Agent': 'Mozilla/5.0',
                                     "X-Requested-With": "XMLHttpRequest"})
        path = StringIO(res.text)
    df = pd.read_csv(path, encoding = enc)
    
    if label_col is None:
      return df[text_col].tolist(), df
    return df[text_col].tolist(), df[label_col].tolist(), df

# Format output data function
def format_output_data(raw_outputs, test_case_ids = None, label_values = None, output_probabilities: bool = True,
                       output_predicted_label: bool = True):
    """
    Format model predictions to DataFrame.

    A helper function that formats classification predictions taken from
    ``transformers.Trainer.predict()`` into various outputs.
    
    Arguments
    ---------
    raw_outputs: a numpy.ndarray of predictions from ``transformers.Trainer.predict()``.
    test_case_ids: a list of test case ids (None by default).
    label_values: a list of *unique ordered* labels (None by default).
    output_probabilities: A boolean specifying whether to convert logit predictions to probabilities (True by default).
    output_predicted_label: A boolean whether to append a 'predicted' column with the most likely label (True by default).

    Returns
    -------
    pandas.DataFrame
      a dataset of predicted values.
    """
    

    if output_probabilities:
        raw_outputs = softmax(raw_outputs, axis=1)
    
    out_df = pd.DataFrame(raw_outputs)

    if label_values is not None:
        out_df.columns = label_values

    if output_predicted_label:
        out_df['predicted'] = out_df.idxmax(axis = 1)
    
    if test_case_ids is not None:
        out_df.insert(0, 'id', test_case_ids)
    return out_df

In [None]:
#@markdown __RUN:__  Model-related functions
# Custom fine-tuning function
def fine_tune(model, text, labels, train_args, multi_label: bool = False, max_seq_len = 'longest', **kwargs):
    """
    Fine-tune transformer model for text classification.
  
    A wrapper function for fine-tuning a pre-trained transformer
    from the popular transformers library. Abstracts away many of the steps
    involved, such as loading a tokenizer and formatting data.
    
    Arguments
    ---------
    model: a string usually returned from ``get_model()``.
    text: a list of text.
    labels: a list of labels.
    train_args: a ``transformers.TrainingArguments`` object holding the training arguments.
    multi_label: a boolean specifying whether to perform multi-label classification (False by default).
    max_seq_len: 'longest' to pad to the longest text, or an integer length to pad/truncate to ('longest' by default).
    kwargs: additional keyword arguments to pass to ``Trainer.__init__``.
  
    Returns
    -------
    trainer : transformers.Trainer
      a fine-tuned transformer model, with its tokenizer attached as ``trainer.tokenizer``.
    """
    _, model_name = get_model(model)
  
    tokenizer = AutoTokenizer.from_pretrained(model_name)
  
    train_labels_indx, lab_to_id, num_labs = map_labels_to_keys(labels)
    
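    if max_seq_len == 'longest':
      # pad every item to the longest item in the training set
      train_encodings = tokenizer(text, truncation=True, padding=True)
    else:
      # pad or truncate every item to a fixed number of tokens
      train_encodings = tokenizer(text, truncation=True, padding='max_length', max_length=max_seq_len)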
  
    train_dataset = TextClassificationDataset(train_encodings, train_labels_indx)
      
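    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labs, label2id = lab_to_id,
        # also store the reverse map so predictions can be reported by trait name
        id2label = {idx: lab for lab, idx in lab_to_id.items()}
        )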
    # multi-label classification is only enabled when there are more than 2 labels;
    # note that it expects multi-hot float label vectors, which this wrapper does not build
    if multi_label and num_labs > 2:
      model.config.problem_type = "multi_label_classification"
    
    # initialize Trainer class
    trainer = Trainer(model=model,
        args = train_args,
        train_dataset = train_dataset,
        **kwargs
      )
   
    # train model
    trainer.train()

    # add tokenizer to use on the testing set
    trainer.tokenizer = tokenizer

    return trainer

# Get model for classification transformers
def get_model(model_type: str) -> Tuple[str, str]:
    """
    Get pre-trained transformer model.
    
    A helper function that looks up a pre-trained model given a model type. If the
    model is *not* found in the lookup, it will output the string used as input.
    
    Arguments
    ---------
    model_type: a string indicating model type name (e.g., 'bart', 'bert', 'deberta', 'xlnet')
    
    Returns
    -------
    model_name : tuple(str, str)
      a tuple of the model type and specific model name.
      
    See Also
    --------
    See https://huggingface.co/models for the complete repository of usable transformer models      
    """
    model_dict = {
    'albert': ("albert", "albert-xlarge-v2"),
    'bart': ("bart", "facebook/bart-large"),
    'bert': ("bert", "bert-base-cased"),
    'deberta': ("debertav2", "microsoft/deberta-v3-large"),
    'distilbert': ("distilbert", "distilbert-base-cased-distilled-squad"),
    'distilroberta': ("roberta", "cross-encoder/stsb-distilroberta-base"),
    'electra': ("electra", "cross-encoder/ms-marco-electra-base"),
    'roberta': ("roberta", "roberta-large"),
    'xlnet': ("xlnet", "xlnet-large-cased"),
    'xlmroberta': ("xlmroberta", "xlm-roberta-large"),
    }
    # if model is not found will try model_type as model_name
    model_name = model_dict.get(model_type, (model_type, model_type))
    # returns a Tuple  
    return model_name
  
# Compute evaluation metrics
def evaluate_model(actual: List[str], predicted: List[str], label_values = None, **kwargs):
    """
    Calculate evaluation metrics on test data (given labels are available).

    A helper function that returns model evaluation metrics. 
    
    Arguments
    ---------
    actual: list of actual labels.
    predicted: list of predicted labels.
    label_values: a list of *unique ordered* labels (None by default).
    kwargs: additional keyword arguments to pass to ``sklearn.metrics.classification_report()``.

    Returns
    -------
    dict
      summary of the precision, recall, F1 score for each class
    """
    if label_values is None:
        # default to the labels in order of first appearance
        label_values = list(dict.fromkeys(actual))
    label_values = list(label_values)
    # pass `labels` as well so the display names line up with the classes they describe
    kwargs.update({'labels': label_values, 'target_names': label_values})
        
    res = classification_report(y_true = actual, y_pred = predicted, output_dict = True, **kwargs)

    class_level = {k: res.get(k, None) for k in res.keys() if k in kwargs['target_names']}
    overall = {k: res.get(k, None) for k in res.keys() if k not in kwargs['target_names']}
    return {'overall' : pd.DataFrame(overall), 'by_label': pd.DataFrame(class_level)}

In [None]:
#@markdown __RUN:__ Utility functions
# Map labels to keys
def map_labels_to_keys(labels: List[str], sort_labels: bool = True):
    """
    Map text labels to integers.
    
    This function maps a list of strings to integers.
    
    Arguments
    ---------
    labels: a list of labels
    sort_labels: a boolean specifying if labels should be sorted alphabetically before recoding (True by default)

    Returns
    -------
    List[int]
      a list of labels recoded as integers.
    dict{str : int}
      a dictionary where labels are keys and mapped int are values.
    int
      the number of class labels.
    """
    k = list(dict.fromkeys(labels))
    if sort_labels:
      k.sort()
    labels_to_id = {k[i] : int(i) for i in range(0, len(k))}
    labels_out = []
    for j in labels:
      labels_out.append(labels_to_id[j])
    return labels_out, labels_to_id, len(k)

# Helper to return labels from trained model
def get_labels(trained_model):
    """
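    Return list of class labels from a model returned by `Trainer.train()`
    
    Arguments
    ---------
    trained_model: a trained transformer model of class Trainer.

    Returns
    -------
    List[str]
      a list of labels, ordered by their integer id.
    """
    label2id = trained_model.model.config.label2id
    # sort by id so the labels line up with the columns of the model's logits
    return sorted(label2id, key=label2id.get)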
    
# Helper to check for GPU device and garbage collect
def get_gpu ():
    """
    Check if CUDA compatible GPU is available.

    To manually check if you are able to use a GPU environment in Colab click
    the `Runtime` menu above, then select `Change Runtime Type`, then pick "GPU"
    for the `Hardware Accelerator` dropdown.
    
    Returns
    -------
    int
      index of the current CUDA GPU device. If -1, no GPU was found.
    """
    if torch.cuda.is_available():
      torch.cuda.empty_cache()
      gc.collect()
      return torch.cuda.current_device()
    else:
      return -1
    

---
## Selecting Model and Hyper-Parameters
---
We set the variables below to match the choices described in our research manuscript. However, we encourage researchers and practitioners to try out alternative models (by manually overriding `transformer_model`). In addition, we kept hyper-parameter tuning to a minimum, as the aim of this research is to show how Transformers perform as a baseline.
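
To get a feel for what the hyper-parameters in this section imply, the sketch below estimates how many optimizer steps a run takes and roughly how many of them fall in the warm-up phase. It is only an approximation, assuming a single device and no gradient accumulation; `schedule_summary` is a hypothetical helper and the training-set size of 500 is made up for illustration.

```python
import math


def schedule_summary(n_items: int, batch_size: int = 16, epochs: int = 10,
                     warmup_ratio: float = 0.10) -> dict:
    """Estimate optimizer steps for a single-device run without gradient accumulation."""
    steps_per_epoch = math.ceil(n_items / batch_size)
    total_steps = steps_per_epoch * epochs
    # share of the run during which the learning rate is still ramping up
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {"steps_per_epoch": steps_per_epoch,
            "total_steps": total_steps,
            "warmup_steps": warmup_steps}


# n_items=500 is a made-up training-set size used only for illustration
print(schedule_summary(n_items=500))
```

Changing the batch size or the number of epochs changes the total number of steps, and with it the length of the warm-up phase.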

In [None]:
#@markdown __RUN:__  Select Pre-Trained Transformer Model
transformer_model = "deberta" #@param ["deberta", "albert", "bert", "bart", "distilbert","distilroberta", "electra", "roberta", "xlnet", "xlmroberta"]

In [None]:
#@markdown __RUN:__ Define Hyper-Parameters

# length to pad items to (~each word is 1.15 sequence units)
SEQ_LEN = 32

# first we initialize the TrainingArguments object
training_args = TrainingArguments(
   num_train_epochs = 10,
   learning_rate = 2e-5,
   warmup_ratio = 0.10,
   weight_decay = 0.01,
   per_device_train_batch_size = 16,
   seed = 42,
   logging_strategy="epoch", 
   output_dir = f"{transformer_model}/outputs",
)

---
## Uploading and Importing Data
---

**Uploading Data**

While there are several ways to import data into Colab ([see here](https://colab.research.google.com/notebooks/io.ipynb)), the most intuitive way is to use the project's code repository url:

```python
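# Assign the online data repository to a url so it does not have to be repeated later
# (the raw.githubusercontent.com address serves the csv files themselves rather than a GitHub web page)
repository_data_path = "https://raw.githubusercontent.com/Shea-Fyffe/transforming-personality-scales/main/data/text-classification/"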
```

As an alternative, you can also upload a local `.csv` file. You can do this by:
- Visiting the project url above and clicking the `download file` button (top right in project repository)
- Clicking the ***Files*** pane in Colab (the folder icon on the left in Colab)
- Clicking the ***Upload to session storage*** icon (left-most icon in Colab)
- Selecting the local data file you would like to use (e.g., `.csv`,`.tsv`)

If using this method, the path to the uploaded file can be used. To locate the file path, use the *Colab File Pane* (folder icon on the left-hand side). Generally, uploaded files will be in the `/content/` directory. Once the file is found, right-click the file and select "Copy path." This path can be pasted into the `import_data` function directly or assigned to an object that can be used later.

```python
local_file_path = "content/train-data.csv"
```
<br>

**Importing Data**

To properly import the training data, we must specify the file path, the name of the column containing our items, and the name of the column containing our labels. The `import_data()` function then returns three objects:

- a list (vector) of items
- a list (vector) of labels
- a copy of our training data

```python
# Example using the url
train_text, train_labels, train_raw_data = import_data(repository_data_path + 'train-data.csv', "text", "label")

# Example using a local file path
train_text, train_labels, train_raw_data = import_data("/" + local_file_path, "text", "label")
```

The code above assigns these to objects named `train_text`, `train_labels`, and `train_raw_data`, respectively.
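
Once the labels are imported, it can be useful to check how evenly the classes are represented before training. The sketch below shows one way to do so with the standard library; the `example_labels` list is made up and merely stands in for the `train_labels` list returned by `import_data()`.

```python
from collections import Counter

# made-up labels standing in for the `train_labels` list returned by import_data()
example_labels = ["extraversion", "neuroticism", "extraversion",
                  "openness", "extraversion", "neuroticism"]

label_counts = Counter(example_labels)

# most_common() sorts labels from most to least frequent
for label, count in label_counts.most_common():
    print(f"{label}: {count} ({count / len(example_labels):.0%})")
```

A heavily skewed distribution would suggest looking at per-label metrics rather than overall accuracy alone.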

<br>

### Importing the training and testing data

We will now import the training and testing data, named `train-data.csv` and `test-data.csv`, respectively. These data can be found on our [GitHub repo](https://github.com/Shea-Fyffe/transforming-personality-scales) in the directory `data/text-classification/`.

In [None]:
#@markdown __RUN:__ Import Train and Testing Data
# Assign the online data repository to a url so it doesn't have to be repeated later
# (the raw.githubusercontent.com address serves the csv files themselves rather than a GitHub web page)
repository_data_url = 'https://raw.githubusercontent.com/Shea-Fyffe/transforming-personality-scales/main/data/text-classification/'

# the import_data function will return a list of sentences, a list of labels, and the original dataset
train_text, train_labels, raw_training_data = import_data(repository_data_url + 'train-data.csv', "text", "label")

# the import_data function will return a list of sentences and the original dataset if label is left blank
test_text, raw_test_data = import_data(repository_data_url + 'test-data.csv', "text")

In [None]:
#@markdown __RUN:__ Inspect Data
# here we show the first 10 items in the training set
# ... and their corresponding labels
for x,y in zip(train_text[:10], train_labels[:10]):
    print("Item: %s | Label: %s" %(x, y))

Item: I rarely feel depressed. | Label: neuroticism
Item: I always know what I am doing. | Label: conscientiousness
Item: I do not put my mind on the task at hand. | Label: conscientiousness
Item: I keep things tidy. | Label: conscientiousness
Item: I laugh a lot. | Label: extraversion
Item: I rarely get caught up in the excitement. | Label: extraversion
Item: I am not a very enthusiastic person. | Label: extraversion
Item: I see myself as a good leader. | Label: extraversion
Item: I can talk others into doing things. | Label: extraversion
Item: I do not have an assertive personality. | Label: extraversion


---
## Training the Model
---

To clarify: *fine-tuning* is a specific type of training applied to models that have been pre-trained. This process allows the model to update its parameters to better align with our classification task.

The `fine_tune()` function requires that we define four arguments. We describe each below, along with the (`object`) holding the corresponding data:
- The model or type of transformer model (`transformer_model`)
- Text or personality items (`train_text`)
- Text or item class labels (`train_labels`)
- The training hyper-parameters (`training_args`)

This results in a function call that looks like:

```python
fine_tuned_model = fine_tune(model = transformer_model,
                             text = train_text,
                             labels = train_labels,
                             train_args = training_args)
```

There are several *optional* arguments, such as `max_seq_len`, which determines the length that text is padded or truncated to (discussed below). Additionally, there is the `multi_label` argument: by setting `multi_label` to `True`, i.e., `fine_tune(..., multi_label = True)`, one can train a model that treats items as multi-dimensional, so items may belong to multiple classes at once.
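
For multi-label training, classification heads in the `Transformers` library are typically trained with a binary cross-entropy loss, which expects each item's labels as a vector of 0/1 floats (one position per class) rather than a single integer. The sketch below illustrates that encoding under stated assumptions: `to_multi_hot` is a hypothetical helper and the label map and items are made up.

```python
from typing import Dict, List


def to_multi_hot(item_labels: List[List[str]], label_to_id: Dict[str, int]) -> List[List[float]]:
    """Encode each item's set of labels as a 0/1 float vector."""
    encoded = []
    for labels in item_labels:
        row = [0.0] * len(label_to_id)
        for label in labels:
            # switch on the position that belongs to this label
            row[label_to_id[label]] = 1.0
        encoded.append(row)
    return encoded


# hypothetical label map and items; the second item taps two traits at once
label_to_id = {"agreeableness": 0, "extraversion": 1, "neuroticism": 2}
item_labels = [["extraversion"], ["agreeableness", "extraversion"]]

print(to_multi_hot(item_labels, label_to_id))
# [[0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
```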


**Tokenizing**

The `fine_tune()` function outputs the fine-tuned model (i.e., `fine_tuned_model`) and attaches the model’s tokenizer to that object (i.e., `fine_tuned_model.tokenizer`). This step ensures that the testing and training items are tokenized in the same way. Since we do not pass the test data to the `fine_tune()` function, the model's tokenizer (i.e., `fine_tuned_model.tokenizer`) is applied to the test items right before predicting their class labels.

The `fine_tuned_model.tokenizer()` function has two notable arguments: `truncation` and `padding`. Truncation is rarely triggered in our case (because personality items tend to be relatively short text documents), but setting `truncation=True` ensures that any document longer than the maximum sequence length is truncated. Setting `padding=True` pads every shorter document up to the length of the longest document in the batch. Many transformer models accept sequences of up to 512 tokens; however, when padding to a fixed length, it is good practice to set that length to roughly 150% of the number of words in the longest text document.
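
The toy sketch below illustrates what padding and truncation do to a batch of token ids. It is not the tokenizer's actual implementation (real tokenizers, for instance, take care to keep special tokens when truncating); `pad_and_truncate` and the token ids are made up for illustration.

```python
from typing import Dict, List, Optional


def pad_and_truncate(sequences: List[List[int]], max_length: Optional[int] = None,
                     pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Pad (and, if needed, truncate) token-id sequences to a common length."""
    # with no max_length, pad to the longest sequence in the batch
    target = max_length if max_length is not None else max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        kept = seq[:target]          # truncation: drop tokens beyond the target length
        n_pad = target - len(kept)   # padding: number of filler tokens needed
        input_ids.append(kept + [pad_id] * n_pad)
        attention_mask.append([1] * len(kept) + [0] * n_pad)  # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# made-up token ids for two short items
batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 102]]
print(pad_and_truncate(batch))                # pads the first item to length 6
print(pad_and_truncate(batch, max_length=5))  # truncates the second item to length 5
```

The attention mask tells the model which positions hold real tokens and which are padding to be ignored.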

In [None]:
#@markdown __RUN:__ Fine-Tune Model
fine_tuned_model = fine_tune(transformer_model, train_text, train_labels, training_args)

---
## Testing the Model
---

Since we've fine-tuned the model, we can now use the `.predict()` method to predict the labels of new personality items as well as other types of text documents (e.g., survey responses, social media comments, and performance evaluations).

After performing predictions on the test data, we can clean up the results with `format_output_data()`. By default the function will return multi-class probabilities and the most likely label, which is appended as a column named *'predicted'*. These options can be modified by setting the arguments `output_probabilities` and `output_predicted_label` to `False`. For example:

```python
# output predicted label and logit values
out_test_df = format_output_data(predictions, output_probabilities = False)

# output probabilities but no predicted label
out_test_df = format_output_data(predictions, output_predicted_label = False)

```
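
To make the logit-to-probability conversion concrete, the sketch below applies a softmax to a single row of logits and picks the most likely label. It is a plain-Python illustration of the idea rather than the code used by `format_output_data()`; `softmax_row`, the labels, and the logit values are made up.

```python
import math
from typing import List


def softmax_row(logits: List[float]) -> List[float]:
    """Convert one row of logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# hypothetical labels and logits for a single item
labels = ["agreeableness", "extraversion", "neuroticism"]
logits = [0.5, 2.0, -1.0]

probs = softmax_row(logits)
predicted = labels[probs.index(max(probs))]  # label with the highest probability
print([round(p, 3) for p in probs], predicted)
```

Because softmax preserves the ordering of the logits, the predicted label is the same whether it is taken from the logits or from the probabilities.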

---
## Evaluating the Model
---

In a case where we are provided the *ground truth* test labels (e.g., the *'label'* column in the `raw_test_data` dataset), we provide the `evaluate_model()` function to calculate model evaluation metrics. 

**Note:** The *'predicted'* column needs to be present in `out_test_df` (or calculated manually) and then passed as the `predicted` argument.
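
For readers unfamiliar with the reported metrics, the sketch below computes precision, recall, and F1 for a single label by hand. It is an illustration of the definitions, not the implementation behind `evaluate_model()`; `per_label_metrics` and the example labels are made up.

```python
from typing import Dict, List


def per_label_metrics(actual: List[str], predicted: List[str], label: str) -> Dict[str, float]:
    """Compute precision, recall, and F1 for one label from paired lists."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == label and p == label for a, p in pairs)  # correctly predicted as label
    fp = sum(a != label and p == label for a, p in pairs)  # wrongly predicted as label
    fn = sum(a == label and p != label for a, p in pairs)  # label missed by the model
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1-score": f1, "support": tp + fn}


# made-up labels for four test items
actual = ["extraversion", "extraversion", "neuroticism", "openness"]
predicted = ["extraversion", "neuroticism", "neuroticism", "openness"]

print(per_label_metrics(actual, predicted, "extraversion"))
```

Here *support* is simply the number of test items that truly carry the label.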

In [None]:
#@markdown __RUN:__ Predict labels of the test items

# pre-process the test data before prediction
test_encodings = fine_tuned_model.tokenizer(test_text, truncation=True, padding=True)
test_dataset = TextClassificationDataset(test_encodings)

# predict the test set; the first returned value holds the raw logits for each item
predictions, _, _ = fine_tuned_model.predict(test_dataset)

In [None]:
#@markdown __RUN:__ Format test predictions
# we can format the output and save it, be sure to add label values
out_test_df = format_output_data(predictions, label_values = get_labels(fine_tuned_model))

In [None]:
#@markdown __RUN:__ Calculate model evaluation metrics
eval_metrics = evaluate_model(actual = raw_test_data["label"], predicted = out_test_df["predicted"])

# Print Results
eval_metrics

{'overall':            accuracy   macro avg  weighted avg
 precision  0.823529    0.830026      0.833937
 recall     0.823529    0.824778      0.823529
 f1-score   0.823529    0.821486      0.823006
 support    0.823529  119.000000    119.000000,
 'by_label':            agreeableness  extraversion   openness  neuroticism  \
 precision           0.90      0.875000   0.850000     0.703704   
 recall              0.72      0.840000   0.739130     0.904762   
 f1-score            0.80      0.857143   0.790698     0.791667   
 support            25.00     25.000000  23.000000    21.000000   
 
            conscientiousness  
 precision           0.821429  
 recall              0.920000  
 f1-score            0.867925  
 support            25.000000  }

---
### Saving the Model
---
Fine-tuned models can also be saved for further training or prediction. Since we utilized a testing set, the model trained here did not get to train on all the items collected. Thus, after saving the model, we perform some additional training using the testing data. For example:

```python
# Save the fine tuned model
fine_tuned_model.save_model("fine-tuned-big5-personality-model")

# Then re-run the fine_tune function changing the model path, training text, and labels
really_fine_tuned_model = fine_tune("fine-tuned-big5-personality-model",
    test_text, raw_test_data["label"], training_args)
```


In [None]:
#@markdown __RUN:__ Write Predictions to File
out_test_df.to_csv(f"{transformer_model}-test-preds.csv", index=False)

In [None]:
#@markdown __RUN:__ Save fine-tuned model
# Uncomment the line below to save the fine-tuned model for later use
# fine_tuned_model.save_model("fine-tuned-big5-personality-model")

---
### Classifying New Examples
---

```python
# Load Python libraries
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import pipeline

# Import model (including its classification head) to classify new items
big5_model = AutoModelForSequenceClassification.from_pretrained("fine-tuned-big5-personality-model")
big5_tokenizer = AutoTokenizer.from_pretrained("fine-tuned-big5-personality-model")

# Create classification pipeline
classify_items = pipeline("text-classification", model=big5_model, tokenizer=big5_tokenizer)

# Import or generate items to classify (taken from openpsychometrics.org)
new_items = ["I put family first.",  
             "When other people are arguing, I leave the room.", 
             "I have a bland facial expression when I talk to people.", 
             "Does your heart ever thump in your ears so that you cannot sleep?"]

# Classify items to the Big Five factors
results = classify_items(new_items)
```
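
A text-classification pipeline returns one dictionary per input, holding a `label` and a `score`. The sketch below shows one way to pair such results with their items; the labels and scores in `example_results` are placeholders invented for illustration and are not output from the fine-tuned model.

```python
# placeholder pipeline output; real labels and scores depend on the fine-tuned model
example_items = ["I put family first.",
                 "When other people are arguing, I leave the room."]
example_results = [{"label": "agreeableness", "score": 0.91},
                   {"label": "agreeableness", "score": 0.64}]

# pair each item with its predicted label and score
rows = [{"item": item, "predicted": res["label"], "score": res["score"]}
        for item, res in zip(example_items, example_results)]

# flag predictions below an arbitrary, illustrative confidence threshold
low_confidence = [row for row in rows if row["score"] < 0.70]

for row in rows:
    print(f'{row["predicted"]:<15} {row["score"]:.2f}  {row["item"]}')
```

Items flagged as low confidence may be worth reviewing by hand before relying on the predicted trait.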
