<a href="https://colab.research.google.com/github/Shea-Fyffe/transforming-personality-scales/blob/main/tutorials/create-fixed-sentence-embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
# Creating Fixed Sentence Embeddings
---
This Colab notebook, written in **Python**, creates *fixed* sentence embeddings from pre-trained models: *Universal Sentence Encoder* (USE; [Cer et al., 2018](https://arxiv.org/abs/1803.11175)), *Sentence BERT* (SBERT; [Reimers & Gurevych, 2019](https://arxiv.org/abs/1908.10084)), and mean-aggregate GloVe embeddings (GloVe; [Pennington et al., 2014](https://aclanthology.org/D14-1162/)). These examples can be adapted to other types of text documents; here, we focus on generating embeddings for Big Five personality statements.

---
## Setup
---

### Libraries

Colab comes with a large number of Python libraries pre-loaded. However, `Sentence Transformers` is not one of those libraries. The `Sentence Transformers` library can be installed by using the code below.
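
If the notebook is re-run within the same session, it can be useful to check whether the package is already importable before installing it again. The sketch below is one way to do that with the standard library only; the helper name `is_installed` is hypothetical and not part of the original notebook.

```python
import importlib.util

def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found (without importing it)."""
    return importlib.util.find_spec(module_name) is not None

# "sentence_transformers" is the import name used later in this notebook
if is_installed("sentence_transformers"):
    print("sentence_transformers is already available")
else:
    print("sentence_transformers is missing; run the pip install cell")
```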

In [None]:
#@markdown __RUN:__ Installing Sentence Transformers

# Install Sentence Transformers (not pre-loaded in Colab)
!pip install sentence_transformers

In [None]:
#@markdown __RUN:__ Loading Libraries
# load libraries for USE (tensorflow is a native library in Colab)
import tensorflow as tf
import tensorflow_hub as hub

# load sentence_transformers for SBERT embeddings
from sentence_transformers import SentenceTransformer

# Util libraries
import pandas as pd
import numpy as np
import os
import requests
from io import StringIO

### Using a GPU
To speed things up you can use a *GPU* (*optional*).

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

The command below can be used to check the GPU instance, and additionally, the memory usage of the GPU.
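
The same check can be made from Python, which lets the notebook continue gracefully on a CPU-only runtime. This is a minimal sketch using only the standard library; the function name `gpu_summary` is hypothetical, and it assumes the `nvidia-smi` tool is on the path whenever a GPU is attached.

```python
import shutil
import subprocess

def gpu_summary():
    """Return one line per GPU (name, memory used, memory total), or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA driver tools on this runtime
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.used,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return None  # the tool exists but no usable GPU was found
    return result.stdout.strip()

summary = gpu_summary()
print(summary if summary else "No GPU detected; the notebook will run on CPU.")
```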

In [None]:
# Get GPU status and info
!nvidia-smi

---
## Functions
---

In [None]:
#@markdown __RUN:__ Load user-defined function to create SBERT embeddings
def create_sbert_embeddings(model: str, text, return_numpy: bool = True):
  """Create Sentence BERT embeddings from a list of strings

  Args:
    model: Name of the SBERT model to load with Sentence Transformers
    text: a list of sentences to embed
    return_numpy: Convert the embeddings to a numpy array before building the DataFrame?

  Returns:
    A pandas DataFrame with one row per sentence
  """
  embedding_model = SentenceTransformer(model)
  sbert_embeddings = embedding_model.encode(text)
  if return_numpy:
    sbert_embeddings = np.array(sbert_embeddings)
  return pd.DataFrame(sbert_embeddings)

In [None]:
#@markdown __RUN:__ Load user-defined function to create USE embeddings
def create_use_embeddings(model: str, text, return_numpy: bool = False):
  """Create Universal Sentence Encoder embeddings from a list of strings

  Args:
    model: URL of the USE model to load from TensorFlow Hub
    text: a list of sentences to embed
    return_numpy: Convert the embeddings to a numpy array before building the DataFrame?

  Returns:
    A pandas DataFrame with one row per sentence
  """
  embedding_model = hub.load(model)
  use_embeddings = embedding_model(text)
  if return_numpy:
    use_embeddings = np.array(use_embeddings)
  return pd.DataFrame(use_embeddings)

In [None]:
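#@markdown __RUN:__ Load user-defined function to create Aggregate GloVe embeddings
def create_agg_doc_embeddings(text, return_numpy: bool = False):
  """Create aggregate GloVe embeddings from a list of strings

  The averaged GloVe model is loaded through Sentence Transformers,
  so this simply wraps create_sbert_embeddings.

  Args:
    text: a list of sentences to embed
    return_numpy: Convert the embeddings to a numpy array before building the DataFrame?
  """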
  return create_sbert_embeddings("average_word_embeddings_glove.840B.300d", text, return_numpy)

In [None]:
#@markdown __RUN:__ Load user-defined utility functions

# Import Data function
def import_data(path: str, text_col, enc = 'latin1'):
  """Import a CSV of sentences
  
  Args:
    path: A csv file path or url pointing at CSV file
    text_col: Name of column in csv containing sentences
    enc: File encoding to be used (optional)
  """
  if (path.startswith("http")):
      res = requests.get(path,
                         headers= {'User-Agent': 'Mozilla/5.0',
                                   "X-Requested-With": "XMLHttpRequest"})
      path = StringIO(res.text)
  df = pd.read_csv(path, encoding = enc)
  return df[text_col].tolist(), df

# Format output data function
def format_output_data(emb_df, add_df = None, emb_names_prefix = "f_V"):
  """Format data to be output to CSV
  
  Args:
    emb_df: A dictionary of embedding DataFrames
    add_df: A dictionary of DataFrames with additional information to merge (optional)
    emb_names_prefix: A string to prefix embedding column names (so that they're not numbers)
  """
  out_df = pd.concat(emb_df)
  out_df = out_df.add_prefix(emb_names_prefix)
  out_df.insert(0, "set", [x[0] for x in out_df.index])
  if add_df is not None:
      add_df = pd.concat(add_df)
      add_df.reset_index(drop=True, inplace=True)
      out_df.reset_index(drop=True, inplace=True)
      out_df = pd.concat([add_df, out_df], axis=1)
  return out_df

---
## Model Selection
---

In [None]:
#@markdown __RUN:__ Define Universal Sentence Encoder's model
use_model = "https://tfhub.dev/google/universal-sentence-encoder/4" #@param ["https://tfhub.dev/google/universal-sentence-encoder/4", "https://tfhub.dev/google/universal-sentence-encoder-large/5"]

In [None]:
#@markdown __RUN:__ Define SBERT model
sbert_model = "paraphrase-mpnet-base-v2" #@param ["all-mpnet-base-v2", "paraphrase-mpnet-base-v2", "paraphrase-xlm-r-multilingual-v1", "paraphrase-distilroberta-base-v2","distilbert-base-nli-stsb-quora-ranking", "average_word_embeddings_glove.840B.300d"]

---
## Importing and formatting Data
---

While there are several ways to import data into Colab ([see here](https://colab.research.google.com/notebooks/io.ipynb)), the simplest is to read the files directly from the project's code repository URL:

```python
# both the training and testing data can be downloaded automatically from the repository below
# (raw-file URL with a trailing slash, so the file names can be appended directly)
repository_data_url = 'https://raw.githubusercontent.com/Shea-Fyffe/transforming-personality-scales/main/data/text-classification/'

# the import_data function will return a list of items and the original dataset
train_text, raw_train_data = import_data(repository_data_url + "train-data.csv", "text")
test_text, raw_test_data = import_data(repository_data_url + "test-data.csv", "text")
```
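
Note that `pd.read_csv` needs a file's raw contents, whereas a `github.com/.../tree/...` or `.../blob/...` address returns an HTML page. The helper below is a hypothetical sketch (not part of the original notebook) of rewriting such an address into its `raw.githubusercontent.com` form. It assumes the branch name contains no slashes.

```python
def to_raw_github_url(url: str) -> str:
    """Rewrite a github.com blob/tree URL as a raw.githubusercontent.com URL."""
    prefix = "https://github.com/"
    if not url.startswith(prefix):
        return url  # not a GitHub page URL; leave unchanged
    parts = url[len(prefix):].split("/")
    # Expected layout: user / repo / ("blob" or "tree") / branch / path...
    if len(parts) < 5 or parts[2] not in ("blob", "tree"):
        return url
    user, repo, _, branch, *path = parts
    return "https://raw.githubusercontent.com/" + "/".join([user, repo, branch] + path)

print(to_raw_github_url("https://github.com/user/repo/blob/main/data/file.csv"))
```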


You can also upload a local `.csv` file. You can do this by:
- Visiting the [project repository](https://github.com/Shea-Fyffe/transforming-personality-scales/tree/main/data/text-classification), opening the data file, and clicking the `download file` button (top right in project repository)
- Clicking the ***Files*** pane in Colab (the folder icon on the left in Colab)
- Clicking the ***Upload to session storage*** icon (the left-most icon in the pane)
- Selecting the local data file you would like to use (e.g., `.csv`, `.tsv`)
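
Once a file is uploaded, it can be read by path in the same way as a URL. The sketch below uses an in-memory CSV as a stand-in for an uploaded file so that it runs anywhere; the two rows and the `label` column are made up for illustration, while the `text` column name matches the one used in this notebook.

```python
from io import StringIO

import pandas as pd

# Stand-in for an uploaded file; in Colab this would be a path in session storage
fake_csv = StringIO(
    "text,label\n"
    "I am the life of the party.,E\n"
    "I get stressed out easily.,N\n"
)

df = pd.read_csv(fake_csv)
texts = df["text"].tolist()  # the list of sentences that will be embedded
print(texts)
```

With a real upload, the notebook's own `import_data()` function accepts the file path in place of the URL, along with the name of the text column.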

In [None]:
#@markdown __RUN:__ Importing training data

# Assign the repository's raw-data URL to a variable so it doesn't have to be repeated later
# (note the trailing slash, so that file names can be appended directly)
repository_data_url = 'https://raw.githubusercontent.com/Shea-Fyffe/transforming-personality-scales/main/data/text-classification/'
# the import_data function will return a list of sentences and the original dataset
train_text, raw_train_data = import_data(repository_data_url + "train-data.csv", "text")

In [None]:
#@markdown __RUN:__ Importing test data

# the import_data function will return a list of sentences and the original dataset
test_text, raw_test_data = import_data(repository_data_url + "test-data.csv", "text")

In [None]:
#@markdown __RUN:__ Inspecting Data

# we can combine both sets of text in one dictionary for easier use downstream
all_text = {"train": train_text,
            "test": test_text
            }

raw_train_data.head()

---
## Text Representation: Embedding Training and Testing Data
---




---
### Universal Sentence Encoder (USE)

#### USE: Embedding Training and Testing Data

The sentences from our data are stored as `train_text` and `test_text` in a dictionary called `all_text`. We can now iterate over this dictionary to encode both sets.
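
One design point worth noting: `create_use_embeddings()` calls `hub.load()` each time it runs, so looping over two sets loads the model twice. A possible refinement is to cache the loaded model. The sketch below illustrates the idea with a hypothetical stand-in loader (`slow_load`) so that it runs without downloading anything; in practice the stand-in would be replaced by the real loading call.

```python
from functools import lru_cache

load_calls = []  # records each real load so the caching effect is visible

def slow_load(model_name: str):
    """Hypothetical stand-in for an expensive loader such as hub.load."""
    load_calls.append(model_name)
    # Toy "model": maps each sentence to a 1-dimensional vector (its length)
    return lambda sentences: [[float(len(s))] for s in sentences]

@lru_cache(maxsize=None)
def get_model(model_name: str):
    """Load a model once per name and reuse it on later calls."""
    return slow_load(model_name)

all_text = {"train": ["I feel calm."], "test": ["I like parties."]}
embeddings = {key: get_model("toy-model")(values) for key, values in all_text.items()}
print(len(load_calls))  # number of times the loader actually ran
```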

In [None]:
#@markdown __RUN:__ Create USE Embeddings
# We use our custom function *create_use_embeddings* to produce embeddings
# Let's store them in a dictionary
use_embeddings = {}
for key, values in all_text.items():
    use_embeddings[key] = create_use_embeddings(use_model, values)

#### USE: Formatting Data for Output
We can now use the `format_output_data()` function to combine our embedding data with information from our raw datasets (`raw_train_data` & `raw_test_data`).
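
To see what this formatting step does, the sketch below reproduces its core pandas operations on toy 2-dimensional vectors (made-up values, not real embeddings): concatenating a dictionary of DataFrames, prefixing the column names, and copying the dictionary key into a `set` column.

```python
import numpy as np
import pandas as pd

# Toy 2-dimensional "embeddings"; real embeddings have many more columns
emb = {
    "train": pd.DataFrame(np.array([[0.1, 0.2], [0.3, 0.4]])),
    "test": pd.DataFrame(np.array([[0.5, 0.6]])),
}

out = pd.concat(emb)            # dictionary keys become the outer index level
out = out.add_prefix("use_V")   # columns 0, 1 become use_V0, use_V1
out.insert(0, "set", [idx[0] for idx in out.index])  # copy the key into a column
print(out.reset_index(drop=True))
```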

In [None]:
# Remember that the first argument should be a dictionary of embedding DataFrames
use_output_df = format_output_data(use_embeddings, {'raw_train':raw_train_data, 'raw_test':raw_test_data}, "use_V")

#### USE: Output Data
*Note:* if you are using Colab, the file will be exported to the session's virtual working directory, which can be found with the command `%pwd` or `!pwd` (print working directory).
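
The working directory can also be inspected from Python. The sketch below writes a toy DataFrame (made-up values) to a temporary folder and reads it back. It passes `index=False`, an optional choice that leaves the row-number column out of the file; the notebook's own export keeps the pandas default.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"set": ["train", "test"], "use_V0": [0.1, 0.2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy-embedding-data.csv")
    toy.to_csv(out_path, index=False)  # index=False omits the row-number column
    print("Working directory:", os.getcwd())
    print("File written:", os.path.exists(out_path))
    round_trip = pd.read_csv(out_path)  # read back to confirm the layout
```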

In [None]:
use_output_df.to_csv("sentence-USE-embedding-data.csv")

---
### Sentence BERT (SBERT)



#### SBERT: Embedding Training and Testing Data

We essentially repeat the process described above, changing only the embedding function. This time we will use the `create_sbert_embeddings` function.

In [None]:
#@markdown __RUN:__ Create SBERT Embeddings
# We use our custom function *create_sbert_embeddings* to produce embeddings
# Let's store them in a dictionary
sbert_embeddings = {}
for key, values in all_text.items():
    sbert_embeddings[key] = create_sbert_embeddings(sbert_model, values)

#### SBERT: Formatting Data for Output
We can now use the `format_output_data()` function to combine our embedding data with information from our raw datasets (`raw_train_data` & `raw_test_data`).
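
Because `format_output_data()` joins the raw data and the embeddings side by side by row position, the two must have the same number of rows in the same order. The sketch below shows that positional join with an explicit length check; the helper name `bind_columns` and the toy values are hypothetical.

```python
import pandas as pd

def bind_columns(info_df: pd.DataFrame, emb_df: pd.DataFrame) -> pd.DataFrame:
    """Column-bind two frames by row position after checking they have equal length."""
    if len(info_df) != len(emb_df):
        raise ValueError(
            f"Row mismatch: {len(info_df)} info rows vs {len(emb_df)} embedding rows"
        )
    # Resetting both indexes makes the join positional rather than label-based
    return pd.concat(
        [info_df.reset_index(drop=True), emb_df.reset_index(drop=True)], axis=1
    )

info = pd.DataFrame({"text": ["I feel calm.", "I like parties."]})
emb = pd.DataFrame({"sbert_V0": [0.1, 0.2]}, index=[5, 6])  # deliberately odd index
combined = bind_columns(info, emb)
print(combined)
```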

In [None]:
# Remember that the first argument should be a dictionary of embedding DataFrames
sbert_output_df = format_output_data(sbert_embeddings, {'raw_train':raw_train_data, 'raw_test':raw_test_data}, "sbert_V")

#### SBERT: Output Data
*Note:* if you are using Colab, the file will be exported to the session's virtual working directory, which can be found with the command `%pwd` or `!pwd` (print working directory).

In [None]:
sbert_output_df.to_csv("sentence-SBERT-embedding-data.csv")



---



### Aggregate Word Embeddings (GloVe)

#### GloVe: Embedding Training and Testing Data

We essentially repeat the process described above, changing only the embedding function. This time we will use the `create_agg_doc_embeddings` function.
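
To make "aggregate" concrete: a sentence vector is obtained by averaging the vectors of its words. The sketch below illustrates this with hypothetical 3-dimensional word vectors and naive whitespace tokenization. The actual model uses 300-dimensional GloVe vectors (as its name suggests) and its own tokenizer, so this is an illustration of the idea rather than the library's implementation.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, for illustration only
toy_vectors = {
    "i": np.array([1.0, 0.0, 0.0]),
    "feel": np.array([0.0, 1.0, 0.0]),
    "calm": np.array([0.0, 0.0, 1.0]),
}

def mean_embedding(sentence: str, vectors: dict, dim: int = 3) -> np.ndarray:
    """Average the vectors of known words; return zeros if no word is known."""
    tokens = sentence.lower().replace(".", "").split()
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

print(mean_embedding("I feel calm.", toy_vectors))
```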

In [None]:
#@markdown __RUN:__ Create GloVe Embeddings
# We use our custom function *create_agg_doc_embeddings* to produce embeddings
# Let's store them in a dictionary
glove_embeddings = {}
for key, values in all_text.items():
    glove_embeddings[key] = create_agg_doc_embeddings(values)

#### GloVe: Formatting Data for Output
We can now use the `format_output_data()` function to combine our embedding data with information from our raw datasets (`raw_train_data` & `raw_test_data`).

In [None]:
# Remember that the first argument should be a dictionary of embedding DataFrames
glove_output_df = format_output_data(glove_embeddings, {'raw_train':raw_train_data, 'raw_test':raw_test_data}, "glove_V")

#### GloVe: Output Data
*Note:* if you are using Colab, the file will be exported to the session's virtual working directory, which can be found with the command `%pwd` or `!pwd` (print working directory).

In [None]:
glove_output_df.to_csv("aggregate-word-embedding-data.csv")