Natural Language Processing

Bert-TweetEval: Natural Language Classification

Bidirectional Encoder Representations from Transformers (BERT) is an encoder-only transformer architecture that is well suited to tasks such as natural language classification.
Bence Danko
Last updated March 5, 2026 at 3:00 PM
Index Terms-- Emotion analysis, natural language understanding, transformer, DistilBERT, RoBERTa, tokenizer
Appendices: Appendix A, Appendix B, Appendix C, Appendix D, Appendix E, Appendix F, Appendix G

Abstract

Extracting sentiment and intent from natural language holds immense value for strategic decision-making in many domains. A variety of transformer architectures and base models have emerged as notable language processors, but they vary widely in training scale, vocabulary, and parameter count. In real-world deployment, models are constrained by cost and latency, and production models face stress cases such as lexical diversity, unknown symbols and vocabulary, and class imbalance inherited from the training data. In this work, we analyze the performance of lightweight base and fine-tuned variants of Bidirectional Encoder Representations from Transformers (BERT) on the TweetEval emotion classification task. We train and compare DistilBERT and DistilRoBERTa variants, examining the suitability of their tokenizer architectures (WordPiece, BPE) for the emotion classification domain and their impact on performance. We construct a framework to stress-test distribution shifts and corrupted inputs, conduct structured error analysis, and interpret model confidence and calibration. We also benchmark two competitive LLMs, Qwen3-4B-Instruct-2507 and GPT-4o-mini, under consistent prompting strategies on the same classification task. All code is publicly released at https://github.com/bencejdanko/bert-tweeteval. Models are publicly released at https://huggingface.co/bdanko.

Introduction and Related Work

TweetEval [1] consists of seven Twitter-specific classification tasks: emoji prediction, emotion recognition, hate speech detection, irony detection, offensive language identification, sentiment analysis, and stance detection. TweetEval and BERT-variant combinations have already been extensively explored. BERTweet [2], a prior RoBERTa-based model trained on a corpus of 850 million English tweets, established the state-of-the-art (SOTA) baseline across most of TweetEval’s subtasks and demonstrated the value of domain-specific pre-training, outperforming the original BERT and RoBERTa. TimeLMs [3] later introduced models continuously trained on fresh Twitter data, outperforming BERTweet in all TweetEval domains except irony detection. SuperTweetEval [4] has since been released, adding several NLP task domains that TweetEval lacked.

Task Description

In this work, we target the emotion classification task from the original TweetEval. Each sample is labeled with one of four classes: “anger”, “sadness”, “joy”, or “optimism”. The dataset is heavily imbalanced: “anger” is overrepresented at 42.98% of the samples and “optimism” underrepresented at 9.03%, while “joy” and “sadness” sit at 21.74% and 26.25% respectively. The training and validation sets contain 3257 and 374 samples; we test our results on 1421 samples.

We compare models across accuracy, macro precision, macro recall, latency, expected calibration error (ECE), and our primary metric, Macro F1, the unweighted mean of the per-class F1 scores over the n classes:

$$\text{Macro F1} = \frac{1}{n} \sum_{i=1}^{n} F1_{i}$$

To ensure reproducibility, we set all configurable seeds to 15179996. No further data processing was done. In future work, it would be possible to experiment with NLP augmentation techniques such as Easy Data Augmentation (EDA) [5], back-translation [6], random masking, or MixUp [7] and its variants; however, these are out of scope for our tasks, so we omit them. All local evaluations were run on an NVIDIA L4 GPU.
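As a concrete reference for the formula above, Macro F1 can be computed from per-class precision and recall. A pure-Python sketch for illustration; in practice `sklearn.metrics.f1_score(y_true, y_pred, average="macro")` is equivalent.

```python
# Macro F1: the unweighted mean of per-class F1 scores,
# matching the formula above.
def macro_f1(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Because each class contributes equally regardless of its frequency, this metric penalizes models that ignore the rare “optimism” class.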

Summarized Results

Evaluation Reference

| Question | Location |
| --- | --- |
| Part A – Zero-Shot Baseline | Final results at Table 2 |
| Part B – Fine-Tuning Transformers | Final results at Table 2. Training strategy at Section 2. Loss curves and explanation at Appendix G. Deployment stress tests at Section 3 |
| Part C – Error Analysis | Error analysis at Section 4 |
| Part D – LLMs | Minimal prompt at Appendix A. Structured prompt at Appendix B. Final results at Table 2. Prompt explanation at Section 5 |
| Other Requirements | Summary of hyperparameters at Section 2. Final output table at Table 2. Training screenshots at Appendix D. Confusion matrices at Appendix C |

A reference guide for evaluators.

In our studies, we reaffirm that domain-specific pre-training provides BERT models with state-of-the-art results on NLP classification tasks. We also demonstrate that open-source decoder models are becoming competitive with closed-source ones.

Summarized Results from TweetEval Emotion Classification

| Model | Accuracy | Macro F1 | Macro Precision | Macro Recall | ms/100 samples | ECE |
| --- | --- | --- | --- | --- | --- | --- |
| DistilBERT (WordPiece) | 0.083744 | 0.064219 | 0.384379 | 0.192035 | 312.318607 | 0.167510 |
| DistilRoBERTa (BPE) | 0.217452 | 0.155317 | 0.177579 | 0.200315 | 299.071175 | 0.056845 |
| bdanko/bert-tweeteval-distilbert | 0.798030 | 0.761196 | 0.767879 | 0.756296 | — | 0.0364 |
| bdanko/bert-tweeteval-distilroberta | 0.788881 | 0.750644 | 0.799548 | 0.728545 | — | 0.0364 |
| GPT-4o-mini (Minimal Prompt) | 0.800141 | 0.601466 | 0.653766 | 0.579754 | 5060.112734 | — |
| GPT-4o-mini (Structured Prompt) | 0.821956 | 0.781499 | 0.791218 | 0.773544 | 3823.308211 | — |
| Qwen3-4B-Instruct-2507 (Minimal Prompt) | 0.751583 | 0.584864 | 0.594974 | 0.581751 | 2571.471757 | — |
| Qwen3-4B-Instruct-2507 (Structured Prompt) | 0.812104 | 0.758003 | 0.793805 | 0.741873 | 4931.231370 | — |

Final comparison across all model evaluations on TweetEval emotion classification. While the LLMs lead most performance benchmarks, the BERT-architecture models reach nearly the same performance.

Baseline Analysis

In our baseline analysis, we found that the original DistilBERT and DistilRoBERTa models severely underperform on the TweetEval dataset. From Figure fig:distilbert-base and Figure fig:distilroberta-base, we see an extreme bias toward particular classes, with prediction distributions that do not match our dataset and domain.

Training Strategy

For both models, we train for up to 20 epochs with a batch size of 16, using AdamW with a learning rate of 2e-5 and a weight decay of 0.01. We employ EarlyStoppingCallback, stopping on the validation-set Macro F1 and keeping the checkpoint with the best Macro F1 after 3 consecutive epochs without improvement (patience of 3). We choose early stopping because our models overfit very easily in practice, and we select Macro F1 as the stopping metric because it weighs each class equally, which is particularly important given our imbalanced dataset.
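In the Hugging Face `transformers` Trainer API, this strategy corresponds roughly to the configuration below. This is a sketch: the output path is illustrative, and `metric_for_best_model="macro_f1"` assumes a `compute_metrics` function that returns a `"macro_f1"` key.

```python
# Sketch of the training setup described above. Hyperparameters match
# the text; paths and metric wiring are illustrative assumptions.
from transformers import TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="bert-tweeteval",        # illustrative path
    num_train_epochs=20,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,                  # AdamW is the Trainer default
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,        # keep the best checkpoint
    metric_for_best_model="macro_f1",   # assumes compute_metrics provides it
    greater_is_better=True,
)
early_stop = EarlyStoppingCallback(early_stopping_patience=3)
# trainer = Trainer(model=..., args=args, callbacks=[early_stop], ...)
```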

Corruption Stress Testing

To simulate data corruption in our stress tests, we randomly introduce typos, split hashtags, and remove emoji.

We also conduct a domain-shift simulation: we create a shift by filtering to tweets without mentions, links, or hashtags and compare performance.
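The three corruption operators can be sketched as below. These are illustrative re-implementations under simple assumptions (adjacent-character swaps as the typo model, camel-case splitting for hashtags, a regex over the main emoji code-point blocks), not the exact functions from the released repository.

```python
# Input-corruption sketches used for stress testing (illustrative).
import random
import re

def inject_typo(text, rng):
    """Swap one random adjacent character pair (a simple typo model)."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def split_hashtags(text):
    """#CamelCaseTag -> camel case tag."""
    def _split(m):
        return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", m.group(1)).lower()
    return re.sub(r"#(\w+)", _split, text)

def remove_emoji(text):
    """Strip characters in the main emoji code-point blocks."""
    return re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", "", text)
```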

Corruption Ablations

| Dataset Shift / Corruption | Accuracy (DistilBERT) | ECE (DistilBERT) | Macro F1 (DistilBERT) | Macro Precision (DistilBERT) | Macro Recall (DistilBERT) | Accuracy (DistilRoBERTa) | ECE (DistilRoBERTa) | Macro F1 (DistilRoBERTa) | Macro Precision (DistilRoBERTa) | Macro Recall (DistilRoBERTa) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| All corruptions | 0.777621 | 0.186762 | 0.731208 | 0.748215 | 0.719606 | 0.769880 | 0.027936 | 0.729321 | 0.785784 | 0.704836 |
| Emoji Removal | 0.796622 | 0.169403 | 0.759471 | 0.765310 | 0.754525 | 0.788177 | 0.044574 | 0.754264 | 0.799723 | 0.730673 |
| Hashtag Splitting | 0.797326 | 0.169666 | 0.758076 | 0.767174 | 0.751265 | 0.790289 | 0.049607 | 0.750437 | 0.804039 | 0.727826 |
| Typos | 0.774806 | 0.192172 | 0.735679 | 0.750336 | 0.726397 | 0.762139 | 0.032455 | 0.719843 | 0.767177 | 0.700259 |
| Baseline | 0.798030 | 0.169843 | 0.761196 | 0.767879 | 0.756296 | 0.788881 | 0.042115 | 0.750644 | 0.799548 | 0.728545 |
| All Domain Shifts | 0.798030 | 0.169843 | 0.761196 | 0.767879 | 0.756296 | 0.788881 | 0.042115 | 0.750644 | 0.799548 | 0.728545 |
| No Hashtags | 0.779412 | 0.183594 | 0.729214 | 0.740608 | 0.721849 | 0.783422 | 0.055099 | 0.730822 | 0.795115 | 0.706223 |
| No HTTP Links | 0.798030 | 0.169843 | 0.761196 | 0.767879 | 0.756296 | 0.788881 | 0.042115 | 0.750644 | 0.799548 | 0.728545 |
| No @ Mentions | 0.786865 | 0.178595 | 0.764952 | 0.763810 | 0.766616 | 0.788104 | 0.041246 | 0.763955 | 0.796607 | 0.749939 |
All corruption and domain-shift ablations. While typos and the all-corruptions setting reduce both models’ accuracy (to roughly 0.77), DistilRoBERTa maintains substantially better calibration (lower ECE) throughout.
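The ECE values reported in the tables follow the standard binned definition: predictions are grouped by confidence, and the gap between each bin’s accuracy and mean confidence is summed, weighted by bin size. A minimal sketch, assuming 10 equal-width bins (the bin count used in our runs is an assumption):

```python
# Expected Calibration Error over equal-width confidence bins.
# confidences: predicted max-probabilities; correct: 0/1 per sample.
def ece(confidences, correct, n_bins=10):
    n = len(confidences)
    total = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if (c > lo or b == 0) and c <= hi]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        total += len(idx) / n * abs(acc - conf)  # bin-size weighted gap
    return total
```

A perfectly calibrated model (75% confidence, 75% accuracy) scores 0; an always-certain model that is right half the time scores 0.5.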

Error Analysis

Eight misclassified samples from both models can be seen in Appendix F. We can see how misspellings such as “Deppression” produce unrecognizable token fragments: ‘#’, ‘de’, ‘##pp’, ‘##ress’, ‘##ion’. These fragments cannot be clearly mapped to any particular emotional state and may be absent from the pre-trained vocabulary.

To handle misspellings, we would need character-level data augmentation that simulates typos: injecting random character insertions, deletions, and keyboard-distance substitutions, especially targeting emotional keywords. By forcing the model to see misspelled variants, we train the attention heads to recognize the fragment patterns, which may improve the score.
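A minimal sketch of this proposed augmentation: substitute one character of a targeted emotion keyword with a keyboard-adjacent key. The adjacency map, keyword targeting, and probability `p` are all illustrative assumptions, not details from our pipeline.

```python
# Keyboard-distance typo augmentation targeting emotion keywords
# (illustrative sketch; adjacency map is a small QWERTY subset).
import random

KEY_NEIGHBORS = {"a": "sqwz", "e": "wrds", "i": "uojk", "o": "ipkl",
                 "s": "adwe", "d": "sfer"}

def keyboard_typo(word, rng):
    """Substitute one mapped character with a keyboard neighbour."""
    candidates = [i for i, ch in enumerate(word) if ch in KEY_NEIGHBORS]
    if not candidates:
        return word
    i = rng.choice(candidates)
    repl = rng.choice(KEY_NEIGHBORS[word[i]])
    return word[:i] + repl + word[i + 1:]

def augment(text, keywords, rng, p=0.5):
    """Apply keyboard typos to targeted keywords with probability p."""
    out = []
    for tok in text.split():
        if tok.lower() in keywords and rng.random() < p:
            tok = keyboard_typo(tok, rng)
        out.append(tok)
    return " ".join(out)
```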

Other misclassified sentences contain words that are semantically biased toward one class, where subtle but important tokens change the meaning significantly. For example, both tokenizers correctly break down “revolting”, but neither model weighs “i am” heavily enough to overcome the word’s angry connotations. Thus “i am revolting” is misclassified as “anger”.

To counteract this, we would need to increase our training samples or introduce more robust data augmentation to broaden the models’ semantic representations. The best-suited technique in this case is counterfactual data augmentation (CDA): generating additional samples that use the word “revolting” across varied scenarios, diversifying the model’s interpretation of “revolting” over a greater number of samples and classes.

Open Source and Closed Models

We evaluated two LLMs, Qwen3-4B-Instruct-2507 and GPT-4o-mini, on the prompts in Appendix A and Appendix B. They achieved SOTA results without manual tuning, and their responses aligned with the distribution of the training data.

The structured prompt outperformed the minimal prompt on every metric, except throughput for Qwen. Longer prompts tap further into a model’s parametric memory: the added context primes pre-trained knowledge toward more aligned responses, raising the statistical odds of an accurate emotion classification. Without such context priming, the model is not activated in the same specialized manner and is less likely to produce an aligned response.
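The shape of the setup can be sketched as a label-constrained prompt plus a tolerant parser for the model’s free-form reply. The exact prompts live in Appendix A and Appendix B; the wording below is illustrative only, and falling back to the majority class (“anger”) on unparseable replies is an assumption.

```python
# Illustrative structured prompt and reply parser for the four
# TweetEval emotion labels (not the exact appendix prompts).
LABELS = ("anger", "joy", "optimism", "sadness")

def structured_prompt(tweet):
    return (
        "You are an emotion classifier for tweets.\n"
        f"Choose exactly one label from: {', '.join(LABELS)}.\n"
        "Answer with the label only, no explanation.\n"
        f"Tweet: {tweet}\nLabel:"
    )

def parse_label(reply, default="anger"):
    """Map a free-form reply onto a known label; fall back to the
    majority class if nothing matches (assumed policy)."""
    reply = reply.strip().lower()
    for label in LABELS:
        if label in reply:
            return label
    return default
```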

The final results for each tested model are collected and summarized in Table 2.

[1] F. Barbieri, J. Camacho-Collados, L. Neves, and L. Espinosa-Anke, “TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification.” 2020. [Online]. Available: https://arxiv.org/abs/2010.12421
[2] D. Q. Nguyen, T. Vu, and A. Tuan Nguyen, “BERTweet: A pre-trained language model for English Tweets,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Q. Liu and D. Schlangen, Eds., Online: Association for Computational Linguistics, Oct. 2020, pp. 9–14. doi: 10.18653/v1/2020.emnlp-demos.2.
[3] D. Loureiro, F. Barbieri, L. Neves, L. E. Anke, and J. Camacho-Collados, “TimeLMs: Diachronic Language Models from Twitter.” 2022. [Online]. Available: https://arxiv.org/abs/2202.03829
[4] D. Antypas et al., “SuperTweetEval: A Challenging, Unified and Heterogeneous Benchmark for Social Media NLP Research.” 2023. [Online]. Available: https://arxiv.org/abs/2310.14757
[5] J. Wei and K. Zou, “EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks.” 2019. [Online]. Available: https://arxiv.org/abs/1901.11196
[6] S. Edunov, M. Ott, M. Auli, and D. Grangier, “Understanding Back-Translation at Scale.” 2018. [Online]. Available: https://arxiv.org/abs/1808.09381
[7] L. Sun, C. Xia, W. Yin, T. Liang, P. S. Yu, and L. He, “Mixup-Transformer: Dynamic Data Augmentation for NLP Tasks.” 2020. [Online]. Available: https://arxiv.org/abs/2010.02394