Transformer Model Fine-Tuning for Text Classification with PyTorch Lightning

The years 2018-2020 marked the start of a new era in natural language processing (NLP): computers now understand human language much better than before. Something similar occurred in image processing about six years earlier, when computers learned to understand images much better. Many authors have compared the recent progress in NLP to that earlier progress in image processing.

In this post we'll demonstrate code that takes advantage of the vastly improved NLP software landscape, applying it to one of the classic NLP use cases: text classification.

More specifically, we use the new capabilities to predict, from a user's app review in the Google Play Store, the star rating that the same user gave to the app. This is useful, for example, for spotting clearly erroneous star ratings, such as cases where the user misinterpreted the rating scale as "5 is very bad, 1 is very good", whereas the intended meaning is "5 is very good, 1 is very bad".​1​

Paradigm Shift

The dawn of the new era witnessed a true paradigm shift, and the new paradigm is dominated by one thing: fine-tuning. This again mirrors the earlier development in image processing, where a similar shift occurred. So what is fine-tuning? It is really quite simple: fine-tuning is jargon for reusing a general model, trained on a huge amount of general data, for the specific data at hand. A model trained on such general data is called "pretrained". For instance, a model might typically have been pretrained on Wikipedia, e.g. the English Wikipedia, which means it has learned and therefore now "understands" general English. If our specific task is then to classify newspaper articles, it will typically pay off to do a moderate amount of fine-tuning on the newspaper data, so that the model better understands the specifics of newspaper English.
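
To make this concrete, here is a minimal sketch of the fine-tuning idea using the huggingface/transformers library (the model name, the two example reviews and the hyperparameters are purely illustrative; this is not the training script discussed later in this post):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# a tokenizer and a model pretrained on general English ...
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
# ... topped with a freshly initialised 5-class classification head
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=5)

# one tiny fine-tuning step on task-specific data (two made-up app reviews)
batch = tokenizer(['Great app, love it!', 'Crashes all the time.'],
                  padding=True, truncation=True, return_tensors='pt')
labels = torch.tensor([4, 0])  # 5 stars and 1 star, encoded as classes 0..4

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels)[0]  # the first element of the output is the loss
loss.backward()
optimizer.step()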

In this article we demonstrate code that follows this new paradigm. In particular, we demonstrate fine-tuning on a dataset of Google Play app reviews.

For better organisation of our code and for general convenience, we will use PyTorch Lightning.

The paradigm shift of 2018-2020 was driven by the arrival of many new models. A timeline of these arrivals is shown in the figure Timeline below.

Timeline: Timeline of NLP model arrivals, showing model size vs. arrival time. BERT, released in October 2018, has 340 million parameters; 🤗 DistilBERT, released in 2019, has 66 million parameters. Adapted from DistilBERT, a Distilled Version of BERT​2​.

In the diagram you can clearly see that the models have generally been getting bigger. They have also generally been getting better; this is of course quite a general trend in research, otherwise the results would not be worth publishing. The en vogue expression for a proof that a model is indeed an improvement is "beating the state of the art", or "beating SOTA". The state of the art is measured on various text-related tasks, from simpler ones such as text classification to more complicated ones such as text summarisation.

So the progress of the state of the art is partially driven by model size. Model size is generally measured by the number of free parameters that need to be trained, a number typically counted in millions. In fact a new model published in May 2020, the very trendy GPT-3​3​ by OpenAI, has 175 000 million parameters, more than 10 times the size of the largest model shown in the diagram, Turing-NLG, which with 17 000 million parameters had previously been the biggest model.

To some extent this is a worrying trend, as it reinforces the lead of the very big players in the field. In fact OpenAI – in a significant deviation from current practices – has not published a pretrained version of the GPT-3 model, but instead sells access to an API service powered by it. Moreover, there are very few players who can afford to train such a model in the first place; the hardware and electricity costs alone would be substantial. One estimate puts the price tag at ~5 million US dollars​4​, which is certainly out of the question for most individual researchers.

Another thing to notice in the diagram is the continual and steady flow of new models, so there is really no single point at which the revolution happened. This is quite a general observation​5​, and it can be seen very plainly, for example, in the evolution of playing strength of history's strongest​6​ chess engine, Stockfish​7​.

The code provided in this post allows you to switch models very easily.

Possible choices for pretrained models are

  • albert-base-v2
  • bert-base-uncased
  • distilbert-base-uncased

We have run the Colab notebook with these choices; other pretrained models from Huggingface might also work.

The interest in these particular models can be summarised as follows:

  • BERT​8​ (published in October 2018 by researchers from Google) managed to capture a lot of attention. Its enormous hold on the collective psyche of NLP researchers is illustrated by the choice of subsequent model names incorporating the term BERT, such as RoBERTa​9​, ALBERT​10​ and DistilBERT​2​. For many researchers and practitioners, BERT is therefore a symbol of the NLP revolution, and is as suitable as any model to represent the paradigm shift.
  • DistilBERT​2​ (published in October 2019 by researchers from Huggingface) was put forward with the explicit aim of going against the prevailing big-is-beautiful trend, by creating a BERT-like model that is "smaller, faster, cheaper and lighter"​2​.
  • ALBERT​10​ (version 1 published in September 2019 by researchers from Google and Toyota, with an updated version 2 published in January 2020) was an effort to beat BERT and the existing SOTA while limiting model size.

The bigger-is-better trend has since continued unabated, so by now BERT itself, at 340 million parameters or ~1.3 gigabytes, looks small compared to recent models. However, our preference is for a model that can be fine-tuned with the limited resources available to an individual, so in the notebook we show results for DistilBERT.

To run a different model, e.g. the BERT model, just one name needs to be changed in the Colab notebook. For bert-base-uncased, the python command (shown in full further down in this post) becomes:
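
! python nlp_finetuning_lightning_google.py \
  --pretrained bert-base-uncased \
  --gpus 1 \
  --auto_select_gpus true \
  --nr_frozen_epochs 5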

where distilbert-base-uncased was changed to bert-base-uncased. For albert-base-v2 the same procedure can be used.

We have also run the code for BERT instead of DistilBERT; the results are only marginally better, and for ALBERT they were marginally worse. However, this post will not provide a detailed comparison of model performance.

We will do the fine-tuning with Google Colab, which is currently arguably the best GPU resource that is entirely free.

Requirements

  • Hardware
    • A computer with at least 4 GB of RAM
  • Software
    • The computer can run Linux, macOS or Windows
    • VS Code is an advantage
  • Wetware
    • Familiarity with Python
    • For the technical code, familiarity with PyTorch Lightning definitely helps
    • … as does familiarity with NLP transformers
    • … as does, in particular, familiarity with huggingface/transformers

For the technical code of this post we owe thanks to the following two posts, by Chris and Nick and by Venelin; they also contain some background information, just in case the reader wants to brush up on any of the wetware requirements.

The defining features of our post, i.e. the points where it distinguishes itself from Chris and Nick's and/or Venelin's, are:

  • Task: We focus on the classic task of text classification, using a different dataset, namely the Google Play app reviews dataset from Venelin Valkov's post. Similar to Venelin, different from Chris.
  • Colab: We train with Google Colab, which, as mentioned, is currently arguably the best GPU resource that is entirely free. Its use is greatly simplified by the point Structure below. Similar to Venelin, similar to Chris. The main difference here is that we develop locally first, e.g. in VS Code, and then "wrap" the code in a Colab notebook.
  • Structure: We structure our code with PyTorch Lightning, which makes everything very readable. It also makes switching from coding locally on your CPU to running on a cloud-based GPU very simple (a "breeze"), literally one line of code. Different from Venelin, different from Chris.

Transformers from Huggingface

The github-hosted package https://github.com/huggingface/transformers is the go-to place for modern natural language processing. Its popularity can be seen from its fast-rising GitHub star count. The package supplies the architectures of a huge zoo of models such as BERT, RoBERTa, ALBERT and DistilBERT, and it is available in PyTorch and TensorFlow.

Best of all, it provides fully pretrained models – with the notable exception of GPT-3 🙁.

Structure with Pytorch Lightning

The github-hosted package https://github.com/PytorchLightning/pytorch-lightning is also highly popular, as can be seen from its equally fast-rising GitHub star count.

This is from the lightning README:

“Lightning disentangles PyTorch code to decouple the science from the engineering by organizing it into 4 categories:

  1. Research code (the LightningModule).
  2. Engineering code (you delete, and is handled by the Trainer).
  3. Non-essential research code (logging, etc… this goes in Callbacks).
  4. Data (use PyTorch Dataloaders or organize them into a LightningDataModule).

Once you do this, you can train on multiple-GPUs, TPUs, CPUs and even in 16-bit precision without changing your code!”

So, in essence, Lightning promotes readability and reasonable expectations regarding code structure, and promises that you can focus on the interesting bits while being able to switch painlessly from CPU to GPU/TPU.
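
To illustrate this organisation, here is a minimal, generic Lightning example (a toy classifier on random data, not the fine-tuning code of this post); the numbered comments refer to the four categories above:

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class LitClassifier(pl.LightningModule):
    # 1. research code: the model and what happens in one training step
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        return {'loss': loss}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# 4. data: plain PyTorch DataLoaders (or a LightningDataModule)
train_loader = DataLoader(TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,))), batch_size=8)

# 2. engineering code (loops, devices, checkpointing) is handled by the Trainer;
#    with the Lightning version used in this post, switching to a GPU is just pl.Trainer(max_epochs=1, gpus=1)
trainer = pl.Trainer(max_epochs=1)
trainer.fit(LitClassifier(), train_loader)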

Code

The code is available on github https://github.com/hfwittmann/transformer_finetuning_lightning.

As mentioned, for the technical code it is helpful to have familiarity with

  1. PyTorch Lightning (definitely)
  2. NLP transformers
  3. huggingface/transformers

Workflow

We start with a Python file which we can develop locally, e.g. in VS Code, thereby benefitting from its convenience features, above all the very nice debugging facilities. Because PyTorch Lightning makes switching between CPU and GPU easy, we can do this development on our local CPU.

Once this is done, we transfer the file to Google Drive, mount the appropriate folder in Google Colab, and then execute our file from a jupyter-type Colab notebook.

To make this really useful we use the excellent argparse facilities of PyTorch Lightning, so that we can, for example, change various parameters on the fly via the command line.
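
A sketch of this pattern, using the pytorch-lightning 0.9 API that this post runs on (the arguments --pretrained and --nr_frozen_epochs mirror the run command further down, but this is not the full script):

from argparse import ArgumentParser
import pytorch_lightning as pl

parser = ArgumentParser()
# our own arguments, e.g. which pretrained model to fine-tune
parser.add_argument('--pretrained', type=str, default='distilbert-base-uncased')
parser.add_argument('--nr_frozen_epochs', type=int, default=5)
# Lightning adds all Trainer flags (--gpus, --max_epochs, --auto_select_gpus, ...) for free
parser = pl.Trainer.add_argparse_args(parser)
args = parser.parse_args()

trainer = pl.Trainer.from_argparse_args(args)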

So here is the fully commented Python code: it is the file nlp_finetuning_lightning_google.py in the repository linked above, and the Colab notebook below shows how it is run.

Mount Google Drive

... where we've already uploaded the necessary files

  • the requirements files to install the necessary python packages
  • nlp_finetuning_lightning_google.py, the python file that was already discussed
In [1]:
from pathlib import Path
import os

from google import colab
colab.drive.mount('/content/gdrive')

os.chdir(Path('/content/gdrive/My Drive/Colab Notebooks/lightning/transformer_finetuning_lightning'))

os.environ["PYTHONPATH"] = os.getcwd()
Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).

Show GPU Specification

In [4]:
!nvidia-smi
Tue Sep 29 18:52:51 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P8    12W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Directory structure

... take a look at what's already there.

  • The files in the data folder are from Venelin's post
  • nlp_finetuning_lightning_google.py is the previously discussed script file
In [5]:
!apt install tree
!tree
Reading package lists... Done
Building dependency tree       
Reading state information... Done
tree is already the newest version (1.7.0-5).
0 upgraded, 0 newly installed, 0 to remove and 21 not upgraded.
.
├── LICENSE
├── nlp-fine-tune-run.ipynb
├── nlp_finetuning_lightning_google.py
├── README.md
├── requirements.in
├── requirements.txt
└── run.txt

0 directories, 7 files

Install packages

Remark: this only attempts to install the packages if pytorch_lightning has not been installed yet. This is convenient, as it allows you to click through the cells quickly – as is sometimes necessary in Google Colab – when everything has already been installed.

In [6]:
ls -all
total 4368
drwx------ 2 root root    4096 Sep 29 18:45 .git/
-rw------- 1 root root    2869 Sep 29 18:45 .gitignore
-rw------- 1 root root    1071 Sep 29 18:45 LICENSE
-rw------- 1 root root 4437852 Sep 29 18:51 nlp-fine-tune-run.ipynb
-rw------- 1 root root   18929 Sep 29 18:45 nlp_finetuning_lightning_google.py
-rw------- 1 root root     367 Sep 29 18:45 README.md
-rw------- 1 root root     263 Sep 29 18:45 requirements.in
-rw------- 1 root root    5079 Sep 29 18:45 requirements.txt
-rw------- 1 root root     545 Sep 29 18:45 run.txt
In [7]:
try: 
  import pytorch_lightning
except:
  !pip install -U ipython
  !pip install -r requirements.in
In [8]:
%reload_ext watermark
%watermark
2020-09-29T18:53:10+00:00

CPython 3.6.9
IPython 7.16.1

compiler   : GCC 8.4.0
system     : Linux
release    : 4.19.112+
machine    : x86_64
processor  : x86_64
CPU cores  : 2
interpreter: 64bit
In [9]:
%watermark -v -p numpy,pandas,torch,transformers,pytorch_lightning
CPython 3.6.9
IPython 7.16.1

numpy 1.18.5
pandas 1.1.2
torch 1.6.0+cu101
transformers 3.3.1
pytorch_lightning 0.9.0

Run the script

... with some specified parameters

In [10]:
%%time
! python nlp_finetuning_lightning_google.py \
  --pretrained distilbert-base-uncased \
  --gpus 1 \
  --auto_select_gpus true \
  --nr_frozen_epochs 5
# --limit_train_batches 1.0 \
# --limit_val_batches 1.0 \
# --frac 1 \
# --max_epochs 2 \
# --fast_dev_run False
2020-09-29 18:53:33.723348: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
args: Namespace(accumulate_grad_batches=1, amp_backend='native', amp_level='O2', auto_lr_find=False, auto_scale_batch_size=False, auto_select_gpus=True, batch_size=32, benchmark=False, check_val_every_n_epoch=1, checkpoint_callback=True, default_root_dir=None, deterministic=False, distributed_backend=None, early_stop_callback=False, fast_dev_run=False, frac=1, gpus=1, gradient_clip_val=0, learning_rate=2e-05, limit_test_batches=1.0, limit_train_batches=1.0, limit_val_batches=1.0, log_gpu_memory=None, log_save_interval=100, logger=True, max_epochs=1000, max_steps=None, min_epochs=1, min_steps=None, nr_frozen_epochs=5, num_nodes=1, num_processes=1, num_sanity_val_steps=2, overfit_batches=0.0, overfit_pct=None, precision=32, prepare_data_per_node=True, pretrained='distilbert-base-uncased', process_position=0, profiler=None, progress_bar_refresh_rate=1, reload_dataloaders_every_epoch=False, replace_sampler_ddp=True, resume_from_checkpoint=None, row_log_interval=50, sync_batchnorm=False, terminate_on_nan=False, test_percent_check=None, tpu_cores=<function Trainer._gpus_arg_default at 0x7fee5c466730>, track_grad_norm=-1, train_percent_check=None, training_portion=0.9, truncated_bptt_steps=None, val_check_interval=1.0, val_percent_check=None, weights_save_path=None, weights_summary='top')
kwargs: {}
Loading BERT tokenizer
PRETRAINED:distilbert-base-uncased
Downloading: 100% 442/442 [00:00<00:00, 414kB/s]
Downloading: 100% 232k/232k [00:00<00:00, 842kB/s]
Type tokenizer: <class 'transformers.tokenization_distilbert.DistilBertTokenizer'>
Setting up dataset
Downloading...
From: https://drive.google.com/uc?id=1S6qMioqPJjyBLpLVz4gmRTnJHnjitnuV
To: /content/gdrive/My Drive/Colab Notebooks/lightning/transformer_finetuning_lightning/data/apps.csv
100% 134k/134k [00:00<00:00, 54.0MB/s]
Downloading...
From: https://drive.google.com/uc?id=1zdmewp7ayS4js4VtrJEHzAheSW-5NBZv
To: /content/gdrive/My Drive/Colab Notebooks/lightning/transformer_finetuning_lightning/data/reviews.csv
7.17MB [00:00, 142MB/s]
Number of training sentences: 15,746

14,171 training samples
1,575 validation samples
DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "vocab_size": 30522
}

Downloading: 100% 268M/268M [00:03<00:00, 87.4MB/s]
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Model Type <class 'transformers.modeling_distilbert.DistilBertForSequenceClassification'>
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: Could not log computational graph since the `model.example_input_array` attribute is not set or `input_array` was not given
  warnings.warn(*args, **kwargs)

  | Name  | Type                                | Params
--------------------------------------------------------------
0 | model | DistilBertForSequenceClassification | 66 M  
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: Your val_dataloader has `shuffle=True`, it is best practice to turn this off for validation and test dataloaders.
  warnings.warn(*args, **kwargs)
Epoch 0: 100% 493/493 [00:57<00:00,  8.53it/s, loss=1.477, v_num=0, avg_loss=1.44, avg_accuracy=0.408]
Epoch 1: 100% 493/493 [01:03<00:00,  7.73it/s, loss=1.357, v_num=0, avg_loss=1.34, avg_accuracy=0.436]
Epoch 2: 100% 493/493 [01:02<00:00,  7.85it/s, loss=1.279, v_num=0, avg_loss=1.29, avg_accuracy=0.445]
Epoch 3: 100% 493/493 [01:02<00:00,  7.90it/s, loss=1.255, v_num=0, avg_loss=1.26, avg_accuracy=0.464]
Epoch 4: 100% 493/493 [01:02<00:00,  7.94it/s, loss=1.279, v_num=0, avg_loss=1.24, avg_accuracy=0.465]
Epoch 5: 100% 493/493 [02:45<00:00,  2.98it/s, loss=1.016, v_num=0, avg_loss=1.01, avg_accuracy=0.559]
Epoch 6: 100% 493/493 [02:41<00:00,  3.06it/s, loss=0.901, v_num=0, avg_loss=1.01, avg_accuracy=0.575]
Epoch 7: 100% 493/493 [02:44<00:00,  2.99it/s, loss=0.611, v_num=0, avg_loss=0.911, avg_accuracy=0.69]
Epoch 8: 100% 493/493 [02:45<00:00,  2.97it/s, loss=0.420, v_num=0, avg_loss=0.919, avg_accuracy=0.728]
Epoch 9: 100% 493/493 [02:45<00:00,  2.98it/s, loss=0.299, v_num=0, avg_loss=0.941, avg_accuracy=0.767]
Epoch 10: 100% 493/493 [02:46<00:00,  2.96it/s, loss=0.228, v_num=0, avg_loss=1.06, avg_accuracy=0.768]
                                               Saving latest checkpoint..
CPU times: user 5.33 s, sys: 886 ms, total: 6.22 s
Wall time: 22min 14s

Show Results in Tensorboard

In [11]:
%load_ext tensorboard
%tensorboard --logdir lightning_logs/
Output hidden; open in https://colab.research.google.com to view.
In [ ]:
!kill 575
/bin/bash: line 0: kill: (575) - No such process

Load model from checkpoint

In [15]:
try:
  del Model
except:
  pass

from nlp_finetuning_lightning_google import Model, Data

To load the model from the checkpoint, we need to know the checkpoint's location. Every time PyTorch Lightning saves a new checkpoint, the version number is increased. So let's take a look at the directory tree.

In [16]:
!tree ./
./
├── data
│   ├── apps.csv
│   └── reviews.csv
├── LICENSE
├── lightning_logs
│   └── version_0
│       ├── checkpoints
│       │   └── epoch=10.ckpt
│       ├── events.out.tfevents.1601405646.03e090310b0a.726.0
│       └── hparams.yaml
├── nlp-fine-tune-run.ipynb
├── nlp_finetuning_lightning_google.py
├── __pycache__
│   └── nlp_finetuning_lightning_google.cpython-36.pyc
├── README.md
├── requirements.in
├── requirements.txt
└── run.txt

5 directories, 13 files

Generally we need to select a checkpoint, e.g. we might look for the most recent one. In the present case we have deleted all but one checkpoint, so there is only one.

In [17]:
model = Model.load_from_checkpoint('./lightning_logs/version_0/checkpoints/epoch=10.ckpt')
DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "vocab_size": 30522
}

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Model Type <class 'transformers.modeling_distilbert.DistilBertForSequenceClassification'>

Set model to evaluation mode

In [18]:
model.freeze()
_ = model.eval()
In [19]:
from argparse import Namespace


#parser.add_argument('--pretrained', type=str, default="bert-base-uncased")
#parser.add_argument('--use_autos', type=bool, default=True)
#parser.add_argument('--training_portion', type=float, default=0.9)
#parser.add_argument('--batch_size', type=float, default=32)
#parser.add_argument('--learning_rate', type=float, default=2e-5)
#parser.add_argument('--frac', type=float, default=1)


args = {
    'pretrained':"distilbert-base-uncased",
    'use_autos': True,
    'frac': 1,
    'training_portion': 0.9
}
args = Namespace(**args)


d = Data(args)
d.prepare_data()
d.setup()
args: Namespace(frac=1, pretrained='distilbert-base-uncased', training_portion=0.9, use_autos=True)
kwargs: {}
Loading BERT tokenizer
PRETRAINED:distilbert-base-uncased
Type tokenizer: <class 'transformers.tokenization_distilbert.DistilBertTokenizer'>
Setting up dataset
Number of training sentences: 15,746

14,171 training samples
1,575 validation samples

Predict labels in validation set

In [20]:
%%time
import torch

label_predictions = []
labels = []

for ix, x in enumerate(d.val_dataset):
    if ix % 100 == 0: print('ix:', ix)

    # each validation item is (input_ids, attention_mask, label); add a batch dimension of 1
    input_ids = x[0].unsqueeze(dim=0)
    masks = x[1].unsqueeze(dim=0)
    batch = (input_ids, masks)

    # pick the logits out of the model output and take the argmax over the 5 star classes
    l_p = int(torch.argmax(model(batch)[1][0], axis=1).numpy())
    label_predictions.append(l_p)

    labels.append(int(x[2].numpy()))
ix: 0
ix: 100
ix: 200
ix: 300
ix: 400
ix: 500
ix: 600
ix: 700
ix: 800
ix: 900
ix: 1000
ix: 1100
ix: 1200
ix: 1300
ix: 1400
ix: 1500
CPU times: user 5min 51s, sys: 1.44 s, total: 5min 52s
Wall time: 5min 52s

Save validation set

In [21]:
import pandas as pd
validation = pd.DataFrame({
    'predictions': label_predictions,
    'labels': labels    
    })
validation.to_csv('validation.csv')

Load validation data

In [22]:
import pandas as pd
validation = pd.read_csv('validation.csv')
(validation['labels'] == validation['predictions']).mean()
Out[22]:
0.7758730158730158

Plot confusion matrix

In [23]:
import matplotlib.pyplot as plt

# change figure size
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 10
fig_size[1] = 8
plt.rcParams["figure.figsize"] = fig_size
In [25]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
#Get the confusion matrix
cm = confusion_matrix(validation['labels'], validation['predictions'])
cm_relative = cm / cm.sum()
cm_relative = cm_relative.round(2)
disp = ConfusionMatrixDisplay(confusion_matrix=cm_relative, display_labels=['★','★★','★★★','★★★★','★★★★★'])

# plot the relative (normalised) confusion matrix with star labels
disp = disp.plot()
plt.title('Google App Ratings Stars\n(Predictions are from texts)\n(Numbers add up to 1)')
plt.show()

Classify some text examples

In [26]:
from transformers import AutoTokenizer

def classify(text):

  # tokenize the text and wrap it into a batch of size 1
  tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
  tt = tokenizer(text)
  batch = [torch.tensor(tt['input_ids']).unsqueeze(dim=0), torch.tensor(tt['attention_mask']).unsqueeze(dim=0)]
  prediction = model(batch)
  # class indices 0..4 correspond to 1..5 stars
  stars = torch.argmax(prediction[1][0]) + 1

  print(f'Text: {text}')
  print(f'On the scale 1 (bad) to 5 (good) this text received a {stars}-star rating')
In [27]:
classify('It is the most awesome shit.')
Text: It is the most awesome shit.
On the scale 1 (bad) to 5 (good) this text received a 5-star rating
In [28]:
classify('It is the most aweful shit.')
Text: It is the most aweful shit.
On the scale 1 (bad) to 5 (good) this text received a 1-star rating

Finally: take a look at the directory

In [29]:
!tree ./
./
├── data
│   ├── apps.csv
│   └── reviews.csv
├── LICENSE
├── lightning_logs
│   └── version_0
│       ├── checkpoints
│       │   └── epoch=10.ckpt
│       ├── events.out.tfevents.1601405646.03e090310b0a.726.0
│       └── hparams.yaml
├── nlp-fine-tune-run.ipynb
├── nlp_finetuning_lightning_google.py
├── __pycache__
│   └── nlp_finetuning_lightning_google.cpython-36.pyc
├── README.md
├── requirements.in
├── requirements.txt
├── run.txt
└── validation.csv

5 directories, 14 files

References

  1. Aralikatte R, Sridhara G, Gantayat N, Mani S. Fault in Your Stars: An Analysis of Android App Reviews. 2017. http://arxiv.org/abs/1708.04968
  2. Sanh V, Debut L, Chaumond J, Wolf T. DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. 2019. https://arxiv.org/abs/1910.01108
  3. Brown TB, Mann B, Ryder N, et al. Language Models Are Few-Shot Learners. 2020. https://arxiv.org/abs/2005.14165
  4. Dickson B. The untold story of GPT-3 is the transformation of OpenAI. bdtechtalks.com. Accessed September 12, 2020. https://bdtechtalks.com/2020/08/17/openai-gpt-3-commercial-ai/#:~:text=According%20to%20one%20estimate%2C%20training,increase%20the%20cost%20several%2Dfold.
  5. Ridley M. How Innovation Works: And Why It Flourishes in Freedom. Harper; 2020.
  6. CCRL 40/15 Rating List. computerchess.org.uk. Accessed September 12, 2020.
  7. Pohl S. Stockfish testing. Stefan Pohl Computer Chess. Accessed September 12, 2020. https://www.sp-cc.de/
  8. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. 2018. http://arxiv.org/abs/1810.04805
  9. Liu Y, Ott M, Goyal N, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach. 2019.
  10. Lan Z, Chen M, Goodman S, Gimpel K, Sharma P, Soricut R. ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations. 2019. https://arxiv.org/abs/1909.11942