“The first duty in life is to be as artificial as possible. What the second duty is no one has as yet discovered.” Oscar Wilde
Transformer model Fine-tuning for text classification with Pytorch Lightning

Update 3 June 2021: I have updated the code and notebook on GitHub to reflect the most recent API versions of the packages, especially pytorch-lightning.

The years 2018-2020 marked the start of a new era in the field of natural language processing (NLP): computers can now understand humans much better. Something similar occurred in image processing about six years earlier, when computers learned to understand images much better. Many authors have compared the recent developments in NLP favourably to those in image processing.

In this post we’ll demonstrate code that uses the vastly improved NLP software landscape, applying it to one of the classic NLP use cases: text classification.

More specifically, we use the new capabilities to predict, from a user’s app review in the Google Play Store, the star rating that the same user gave to the app. This is useful, for instance, for spotting clearly erroneous star ratings, e.g. where the user misinterpreted the rating scale, thinking “5 is very bad, 1 is very good”, whereas the normal meaning is “5 is very good, 1 is very bad”.​1​
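One simple way to operationalise this idea is to compare the rating a user gave with the rating a model predicts from the review text. The following is a minimal illustrative sketch; the function names, the toy stand-in classifier and the gap threshold are all hypothetical and not taken from the actual notebook.

```python
# Hypothetical sketch: flag reviews whose model-predicted rating is far
# from the rating the user actually gave. Names and the threshold are
# illustrative, not from this post's actual code.

def flag_suspicious_ratings(reviews, predict_rating, max_gap=3):
    """Return reviews where the given star rating differs from the
    model's prediction by at least `max_gap` stars (1-5 scale)."""
    suspicious = []
    for text, given in reviews:
        predicted = predict_rating(text)
        if abs(predicted - given) >= max_gap:
            suspicious.append((text, given, predicted))
    return suspicious

# Toy stand-in for a fine-tuned classifier: positive wording -> high rating.
def toy_predict(text):
    return 5 if "great" in text.lower() else 1

reviews = [
    ("Great app, love it!", 1),   # looks like a reversed scale
    ("Crashes constantly", 1),    # rating matches the text
]
print(flag_suspicious_ratings(reviews, toy_predict))
# [('Great app, love it!', 1, 5)]
```

In practice, `predict_rating` would be the fine-tuned transformer classifier this post builds.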

Paradigm Shift

The dawn of the new era witnessed a true paradigm shift, dominated by one thing: fine-tuning. This again mirrors the earlier development in image processing, where a similar paradigm shift occurred. So what is this fine-tuning? Really, it is quite simple. Fine-tuning is jargon for reusing a general model, trained on a huge amount of general data, for the specific data at hand. A model trained on such general data is called “pretrained”. For instance, a model might typically have been pretrained on Wikipedia, e.g. English Wikipedia, which means it has learned, and therefore now “understands”, general English. If our specific task is then to classify newspaper articles, it will typically be useful to do a moderate amount of fine-tuning on the newspaper data, so that the model better understands the specifics of newspaper English.

In this article we will demonstrate some code that follows this new paradigm. In particular, we will demonstrate this fine-tuning on a Google Play app reviews dataset.

For better organisation of our code, and general convenience, we will use PyTorch Lightning.

The paradigm shift in 2018-2020 was driven by the arrival of many new models. A timeline of these arrivals is shown in figure Timeline.

Timeline: Timeline of NLP model arrivals, showing model size vs arrival time. BERT, released in October 2018, has 340 million parameters; 🤗 DistilBERT, released in 2019, has 66 million parameters. Adapted from DistilBERT, a Distilled Version of BERT​2​.

In the diagram you can clearly see that the models have generally been getting bigger. They have also generally been getting better; improvement is of course quite a general trend in research, otherwise the results would not be worth publishing. The en vogue expression for a proof that a model is indeed an improvement is “beating the state of the art”, or “beating SOTA”. The state of the art is measured on various text-related tasks, simpler ones such as text classification and more complicated ones such as text summarisation.

So the progress of the state of the art is partially driven by model size. Model size is generally measured by the number of free parameters that need to be trained, a number typically counted in millions. In fact a new model published in May 2020, the very trendy so-called GPT-3​3​ by OpenAI, has 175 000 million parameters, more than ten times the size of the largest model shown in the diagram, Turing-NLG, which has 17 000 million parameters and had previously been the biggest model.

To some extent this is a worrying trend, as it reinforces the lead of the very big players in the field. In fact OpenAI, in a significant deviation from current practices, has not published a pretrained version of the GPT-3 model, but instead sells an API service powered by it. In any case, very few players could afford to train such a model: the hardware cost plus electricity bill would be considerable. One estimate puts the price tag at ~5 million US dollars​4​, which is certainly out of the question for most individual researchers.

Another thing to notice in the diagram is the continual and steady flow of new models. So there is really no single moment of revolution. This is quite a general observation​5​, and it can be seen very plainly e.g. in the evolution of playing strength of history’s strongest​6​ chess-playing software, Stockfish​7​.

The code provided in this post allows you to switch models very easily.

Possible choices for pretrained models are

  • albert-base-v2
  • bert-base-uncased
  • distilbert-base-uncased

We have run the Colab notebook with these choices; other pretrained models from Hugging Face might also work.

The significance of these models can be summarised as follows:

  • BERT​8​ (published in October 2018 by researchers from Google) managed to capture a lot of attention. Its enormous hold on the collective psyche of NLP researchers is illustrated by the subsequent model names incorporating the term BERT, such as RoBERTa​9​, ALBERT​10​ and DistilBERT​2​. BERT is therefore, to many researchers and practitioners, a symbol of the NLP revolution, and as suitable as any model to represent the paradigm shift.
  • DistilBERT​2​ (published in October 2019 by researchers from Huggingface) was put forward with the explicit aim of going against the prevailing big-is-beautiful trend, by creating a BERT-like model that is “Smaller, Faster, Cheaper and Lighter”​2​.
  • ALBERT​10​ (version 1 was published in September 2019 by researchers from Google and Toyota, an updated version 2 was published in January 2020) was an effort to beat BERT and existing SOTAs while limiting model size.

The bigger-is-better trend has since continued unabated, so by now BERT itself, at 340 million parameters or ~1.3 gigabytes, looks small in comparison to recent models. However, our preference is for a model that can be fine-tuned using the limited resources available to an individual. Therefore the notebook shows results for DistilBERT.

To run a different model, e.g. the BERT model, just one name needs to be changed in the Colab notebook: in the python command, distilbert-base-uncased should be changed to bert-base-uncased. For albert-base-v2 the same procedure can be used.
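As a sketch of what such a command-line switch might look like, here is a minimal argparse example. The flag name --pretrained_model_name is a hypothetical choice for illustration, not necessarily the interface of the actual notebook.

```python
# Hypothetical sketch of a command-line switch for the pretrained model;
# the flag name is illustrative, not the actual notebook's interface.
import argparse

parser = argparse.ArgumentParser(description="Fine-tune a pretrained transformer.")
parser.add_argument(
    "--pretrained_model_name",
    default="distilbert-base-uncased",
    choices=["albert-base-v2", "bert-base-uncased", "distilbert-base-uncased"],
    help="Hugging Face pretrained model to fine-tune.",
)

# Simulates e.g. `python train.py --pretrained_model_name bert-base-uncased`
args = parser.parse_args(["--pretrained_model_name", "bert-base-uncased"])
print(args.pretrained_model_name)  # bert-base-uncased
```

Called with no arguments, the parser falls back to the distilbert-base-uncased default.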

We have also run the code with BERT instead of DistilBERT, and the results were only marginally better; for ALBERT they were marginally worse. However, this post will not provide a detailed comparison of model performance.

We will do the fine-tuning with Google Colab, which is currently arguably the best GPU resource that is entirely free.


  • Hardware
    • A computer with at least 4 GB RAM
  • Software
    • The computer can run on Linux, MacOS or Windows
    • VS Code is an advantage
  • Wetware
    • Familiarity with Python
    • For the technical code, familiarity with PyTorch Lightning definitely helps
    • … as does familiarity with NLP transformers
    • … as does familiarity with huggingface/transformers in particular

For the technical code of this post we owe thanks to the following two posts; they also contain some background information, in case the reader wants to brush up on any of the wetware requirements.

The features where our post distinguishes itself from Chris and Nick’s and/or Venelin’s are:

  • Task: We focus on the classic task of text classification, using a different dataset, viz. the Google Play app reviews dataset from Venelin Valkov’s post. Similar to Venelin, different from Chris.
  • Colab: We train with Google Colab, which, as mentioned, is currently arguably the best GPU resource that is entirely free. Its use is greatly simplified by the point Structure. Similar to Venelin, similar to Chris. The main difference here is that we develop locally first, e.g. in VS Code, and then “wrap” the code in a Colab notebook.
  • Structure: We structure our code with PyTorch Lightning, which makes everything very readable. Further, it makes switching from coding locally on your CPU to running on a cloud-based GPU very simple (a “breeze”), literally one line of code. Different from Venelin, different from Chris.

Transformers from Huggingface

The GitHub-hosted package https://github.com/huggingface/transformers is the go-to place when it comes to modern natural language processing. Its popularity can be seen from its fast-rising GitHub star count. The package supplies the architectures of a huge zoo of models such as BERT, RoBERTa, ALBERT and DistilBERT, and is available in PyTorch and TensorFlow.

But best of all, it provides fully pretrained models (with the notable exception of GPT-3 🙁).

Structure with Pytorch Lightning

The GitHub-hosted package https://github.com/PytorchLightning/pytorch-lightning is also highly popular, as can likewise be seen from its fast-rising GitHub star count.

This is from the lightning README:

“Lightning disentangles PyTorch code to decouple the science from the engineering by organizing it into 4 categories:

  1. Research code (the LightningModule).
  2. Engineering code (you delete, and is handled by the Trainer).
  3. Non-essential research code (logging, etc… this goes in Callbacks).
  4. Data (use PyTorch Dataloaders or organize them into a LightningDataModule).

Once you do this, you can train on multiple-GPUs, TPUs, CPUs and even in 16-bit precision without changing your code!”

So, in essence, it promotes readability and a sensible code structure, with the promise that you can focus on the interesting bits while being able to switch painlessly from CPU to GPU/TPU.


The code is available on GitHub: https://github.com/hfwittmann/transformer_finetuning_lightning.

As mentioned, for the technical code it is helpful to have a familiarity with

  1. PyTorch Lightning (definitely)
  2. NLP transformers
  3. huggingface/transformers


We start with a Python file which we can develop locally, e.g. in VS Code, thereby benefiting from its convenience features, above all the very nice debugging facilities. Because PyTorch Lightning makes it easy to switch from CPU to GPU code, we can do this development on our local CPU.

Once this is done, we transfer the file to Google Drive, mount the appropriate folder in Google Colab, and then execute our file from a jupyter-type Colab notebook.

To make it really useful, we use the excellent argparse facilities of PyTorch Lightning, so that we can e.g. change various parameters on the fly via the command line.

So here is the fully commented Python code:



  1. Aralikatte R, Sridhara G, Gantayat N, Mani S. Fault in Your Stars: An Analysis of Android App Reviews. Vol abs/1708.04968.; 2017. http://arxiv.org/abs/1708.04968
  2. Sanh V, Debut L, Chaumond J, Wolf T. DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. arXiv; 2019. https://arxiv.org/abs/1910.01108
  3. Brown TB, Mann B, Ryder N, et al. Language Models Are Few-Shot Learners.; 2020. https://arxiv.org/abs/2005.14165
  4. Dickson B. The Untold Story of GPT-3 is the Transformation of OpenAI. bdtechtalks.com. Published September 12, 2020. Accessed September 12, 2020. https://bdtechtalks.com/2020/08/17/openai-gpt-3-commercial-ai/
  5. Ridley M. How Innovation Works: And Why It Flourishes in Freedom. Harper; 2020.
  6. CCRL 40/15 Rating List. computerchess.org.uk. Published September 12, 2020. Accessed September 12, 2020.
  7. Pohl S. Stockfish Testing. Stefan Pohl Computer Chess. Published September 12, 2020. Accessed September 12, 2020. https://www.sp-cc.de/
  8. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. Vol abs/1810.04805.; 2018. http://arxiv.org/abs/1810.04805
  9. Liu Y, Ott M, Goyal N, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach. Published online 2019.
  10. Lan Z, Chen M, Goodman S, Gimpel K, Sharma P, Soricut R. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. Published online 2019. https://arxiv.org/abs/1909.11942