Natural Language Processing

Bert-TweetEval: Natural Language Classification

Bidirectional Encoder Representations from Transformers (BERT) is an encoder-only transformer architecture that is well suited to some language tasks, such as natural language classification.

Bence Danko

Last updated March 5, 2026 at 3:00 PM

Index Terms-- Emotion analysis, natural language understanding, transformer, DistilBERT, RoBERTa, tokenizer

Appendices: Appendix A, Appendix B, Appendix C, Appendix D, Appendix E, Appendix F, Appendix G

Abstract

Extracting sentiment and intent from natural language holds immense value for strategic decision-making in many domains. A variety of transformer architectures and base models have emerged as notable language processors, but they vary widely in training scale, vocabulary, and parameter count. In real-world deployment, models can be constrained by cost and latency. Models deployed to production are also subject to real-world stress cases, such as lexical diversity, unknown symbols and vocabulary, and class imbalance in the training data. In this work, we analyze the performance of lightweight base and fine-tuned Bidirectional Encoder Representations from Transformers (BERT) models on the TweetEval emotion classification task. We train and compare DistilBERT and DistilRoBERTa variants, and examine the suitability of their tokenizer architectures (WordPiece, BPE) for the emotion classification domain and their impact on performance. We construct a framework to stress-test the models under distribution shifts and corrupted inputs, conduct a structured error analysis, and interpret model confidence and calibration. We also benchmark two competitive LLMs, Qwen3-4B-Instruct-2507 and GPT-4o-mini, under consistent prompting strategies on the same classification task. All code is released to the public at https://github.com/bencejdanko/bert-tweeteval. Models are released to the public at https://huggingface.co/bdanko.

Introduction and Related Work

TweetEval [1] consists of seven Twitter-specific classification tasks: emoji prediction, emotion recognition, hate speech detection, irony detection, offensive language identification, sentiment analysis, and stance detection. Combinations of TweetEval and BERT variants have already been explored extensively. BERTweet [2], a RoBERTa-based model trained on a corpus of 850 million English tweets, established the state-of-the-art (SOTA) baseline across most of TweetEval’s subtasks and demonstrated the value of domain-specific pre-training, outperforming the original BERT and RoBERTa. TimeLMs [3] later introduced models continuously trained on fresh Twitter data, outperforming BERTweet in all TweetEval domains except irony detection. SuperTweetEval [4] has since been released, adding several NLP task domains that TweetEval lacked.

Task Description

In this work, we target the emotion classification task from the original TweetEval. Each sample is labeled with one of four classes: “anger”, “sadness”, “joy”, or “optimism”. The dataset is heavily imbalanced: “anger” is overrepresented at 42.98% of the dataset and “optimism” is underrepresented at 9.03%, while “joy” and “sadness” sit at 21.74% and 26.25% respectively. The training and validation sets contain 3257 and 374 samples, and we test our results on 1421 samples.

We compare models across these key metrics:

Accuracy: The number of correct classifications divided by the total number of evaluated samples.
Macro F1: The arithmetic mean of the F1 scores calculated for each individual class. It treats all classes equally, regardless of how many samples each class contains, so rare and frequent classes carry the same weight:

$$\text{Macro F1} = \frac{1}{n} \sum_{i=1}^{n} \text{F1}_{i}$$

Precision: The number of true positives divided by the number of samples predicted as positive. It measures the quality of a positive prediction. We report the macro average.
Recall: The true positive rate: out of all positive cases in the data, the fraction the classifier identifies. We report the macro average.
ms/100 samples: A test of throughput. We run evaluation inference in batches of 100 on all our models except GPT-4o-mini, for which we instead measure the wall-clock time of 100 asynchronous API requests (up to 20 concurrent).
Expected Calibration Error (ECE): A measure of how well a model’s confidence aligns with its actual accuracy. If a model assigns a probability of 0.90 to its predictions, it should be correct about 90% of the time. This tests for overconfident or underconfident models; an ECE of around 0.01 to 0.02 is typically considered well calibrated.
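As a rough illustration of the metrics above, the following is a minimal sketch in plain Python, not the evaluation code used in this work. The function names, the 10 equal-width confidence bins, and the toy inputs are assumptions made for the example; the binned ECE shown here is one common formulation.

```python
def macro_scores(y_true, y_pred, labels):
    """Return (macro precision, macro recall, macro F1) over `labels`."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    # Each class contributes equally, regardless of its sample count.
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Samples whose top-class confidence falls in the half-open bin (lo, hi].
        idx = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if not idx:
            continue
        accuracy = sum(correct[i] for i in idx) / len(idx)
        mean_conf = sum(confidences[i] for i in idx) / len(idx)
        ece += len(idx) / total * abs(accuracy - mean_conf)
    return ece


# Toy example: four predictions made with confidence 0.9, three of them correct.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

In the toy example the model is overconfident: it claims 0.9 but is right 75% of the time, so the single occupied bin contributes a gap of about 0.15.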

To ensure reproducibility, we set all configurable seeds to 15179996. No further data processing was done. Future work could experiment with NLP augmentation techniques such as Easy Data Augmentation (EDA) [5], back-translation [6], random masking, or MixUp [7] and its variants. However, these are out of scope for this work, so we omit them. All local tests were run on an NVIDIA L4 GPU.

Summarized Results

Evaluation Reference

Question	Location
Part A – Zero-Shot Baseline	Final Results at Table 2
Part B – Fine-Tuning Transformers	Final Results at Table 2.
	Training strategy located at Section 2.
	Loss curves and explanation at Appendix G
	Deployment Stress tests at Section 3
Part C – Error Analysis	Error Analysis at Section 4
Part D – LLMs	Minimal Prompt at Appendix A
	Structured Prompt at Appendix B
	Final Results at Table 2
	Prompt Explanation at Section 5
Other Requirements	Summary of hyperparameters at Section 2.
	Final output table at Table 2
	Training screenshots located at Appendix D
	Confusion matrices located at Appendix C

A reference guide for evaluators

In our study, we reaffirm that domain-specific pre-training gives BERT models state-of-the-art results on NLP classification tasks. We also demonstrate the emerging competitive performance of open-source decoder models against closed-source ones.

Summarized Results from TweetEval Emotion Classification

Model	Accuracy	Macro F1	Macro Precision	Macro Recall	ms/100 samples	ECE
DistilBERT (WordPiece)	0.083744	0.064219	0.384379	0.192035	312.318607	0.167510
DistilRoBERTa (BPE)	0.217452	0.155317	0.177579	0.200315	299.071175	0.056845
bdanko/bert-tweeteval-distilbert	0.79803	0.761196	0.767879	0.756296		0.0364
bdanko/bert-tweeteval-distilroberta	0.788881	0.750644	0.799548	0.728545		0.0364
GPT-4o-mini (Minimal Prompt)	0.800141	0.601466	0.653766	0.579754	5060.112734
GPT-4o-mini (Structured Prompt)	0.821956	0.781499	0.791218	0.773544	3823.308211
Qwen3-4B-Instruct-2507 (Minimal Prompt)	0.751583	0.584864	0.594974	0.581751	2571.471757
Qwen3-4B-Instruct-2507 (Structured Prompt)	0.812104	0.758003	0.793805	0.741873	4931.231370

Final comparison chart across all model evaluations on TweetEval Emotion Classification. While we see that LLMs dominate on performance benchmarks, BERT-architecture models can reach nearly the same performance.

Baseline Analysis

In our baseline analysis, we found that the original DistilBERT and DistilRoBERTa models severely underperformed on the TweetEval dataset. From the DistilBERT and DistilRoBERTa baseline figures (fig:distilbert-base and fig:distilroberta-base), we see an extreme bias towards particular classes, with predicted distributions that do not match our dataset and domain.

Training Strategy

We train both models for up to 20 epochs with a batch size of 16, using AdamW optimization with a learning rate of 2e-5 and a weight decay of 0.01. We employ EarlyStoppingCallback to stop training once the Macro F1 score on the validation set has failed to improve 3 times in a row (patience of 3), and we then select the checkpoint with the best Macro F1 score. We choose early stopping because our models overfit very easily in our experiments, and we select Macro F1 as the stopping metric because it weighs each class equally, which is particularly important for our imbalanced dataset.
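The patience rule described above can be sketched in a few lines. This is a simplified illustration of the stopping logic only, not the library's EarlyStoppingCallback implementation; the function name and the validation scores in the example are hypothetical.

```python
def select_best_epoch(val_macro_f1, patience=3, max_epochs=20):
    """Return (best_epoch, last_epoch_run) under a patience-based stopping rule.

    `val_macro_f1` holds one validation Macro F1 score per epoch (1-indexed epochs).
    """
    best_score = float("-inf")
    best_epoch = None
    bad_evals = 0
    scores = val_macro_f1[:max_epochs]
    for epoch, score in enumerate(scores, start=1):
        if score > best_score:
            # New best checkpoint: remember it and reset the patience counter.
            best_score, best_epoch, bad_evals = score, epoch, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                # Stop early and fall back to the best checkpoint seen so far.
                return best_epoch, epoch
    return best_epoch, len(scores)


# Hypothetical validation curve: improvement stalls after epoch 3.
print(select_best_epoch([0.60, 0.70, 0.74, 0.73, 0.74, 0.72, 0.80]))  # (3, 6)
```

Note that the final score of 0.80 is never reached in this example: training stops at epoch 6, which is the trade-off a small patience value makes against the risk of overfitting.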

Corruption Stress Testing

To stress-test robustness to data corruption, we randomly introduce typos, split hashtags, and remove emojis.

Typos: We randomly swap, delete, or insert characters in words with probability $p=0.1$. This tests the model’s robustness to misspelled but recognizable words.
Split Hashtags: We identify hashtags and split CamelCase words or remove the hashtag. This tests whether the model relies on the hashtag itself or on the underlying semantic content.
Remove Emoji: Emojis strongly indicate emotion, and removing them tests whether the model is robust enough to rely on the remaining semantic cues.
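The three corruptions above could be implemented along the following lines. This is a minimal sketch under several assumptions that the text does not specify: the typo probability is applied per word, hashtag splitting drops the "#" symbol and splits at lower-to-upper case boundaries, and the emoji pattern covers only the main emoji code-point blocks. All function names are illustrative.

```python
import random
import re

# Rough emoji matcher (assumption): main emoji blocks, not every emoji sequence.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")
HASHTAG_RE = re.compile(r"#(\w+)")


def add_typos(text, p=0.1, rng=random):
    """With probability p per word, swap, delete, or insert one character."""
    words = []
    for word in text.split():
        if len(word) > 1 and rng.random() < p:
            i = rng.randrange(len(word) - 1)
            op = rng.choice(["swap", "delete", "insert"])
            if op == "swap":
                word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
            elif op == "delete":
                word = word[:i] + word[i + 1:]
            else:
                word = word[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + word[i:]
        words.append(word)
    return " ".join(words)


def split_hashtags(text):
    """Drop the '#' and split CamelCase hashtags into separate words."""
    def _split(match):
        # Insert a space at every lowercase-to-uppercase boundary.
        return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", match.group(1))
    return HASHTAG_RE.sub(_split, text)


def remove_emoji(text):
    """Delete characters that fall in the emoji ranges above."""
    return EMOJI_RE.sub("", text)


print(split_hashtags("so #SoHappy today"))  # so So Happy today
print(add_typos("this is a recognizable sentence", p=0.1, rng=random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the corruption reproducible, which matters when the corrupted and clean evaluation sets are compared side by side.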

We also simulate domain shift. We create a shift by removing mentions, links, or hashtags from the evaluation tweets and compare performance against the unmodified set.

Mentions: We compare performance once we strip all @user mentions from the evaluation set.
Links: We compare performance once we strip all http links.
Hashtags: We compare performance once we strip all hashtags.
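A simple way to produce these shifted evaluation sets is a regular-expression pass per pattern. The sketch below is an assumption about how the stripping could be done; the patterns, the whitespace normalization, and the example tweet are illustrative and not taken from the released code.

```python
import re

# Illustrative patterns (assumptions): one per domain-shift condition.
MENTION_RE = re.compile(r"@\w+")
LINK_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")


def strip_pattern(text, pattern):
    """Remove every match of `pattern`, then collapse leftover whitespace."""
    return re.sub(r"\s+", " ", pattern.sub("", text)).strip()


tweet = "@user I am so mad #angry http://t.co/x"  # hypothetical tweet
print(strip_pattern(tweet, MENTION_RE))  # I am so mad #angry http://t.co/x
print(strip_pattern(tweet, LINK_RE))     # @user I am so mad #angry
print(strip_pattern(tweet, HASHTAG_RE))  # @user I am so mad http://t.co/x
```

If a pattern never occurs in the evaluation set, the stripped set is identical to the original and the metrics will match the baseline exactly.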

Corruption Ablations

Dataset Shift / Corruption	Accuracy (DistilBERT)	ECE (DistilBERT)	Macro F1 (DistilBERT)	Macro Precision (DistilBERT)	Macro Recall (DistilBERT)	Accuracy (DistilRoBERTa)	ECE (DistilRoBERTa)	Macro F1 (DistilRoBERTa)	Macro Precision (DistilRoBERTa)	Macro Recall (DistilRoBERTa)
All corruptions	0.777621	0.186762	0.731208	0.748215	0.719606	0.769880	0.027936	0.729321	0.785784	0.704836
Emoji Removal	0.796622	0.169403	0.759471	0.765310	0.754525	0.788177	0.044574	0.754264	0.799723	0.730673
Hashtag Splitting	0.797326	0.169666	0.758076	0.767174	0.751265	0.790289	0.049607	0.750437	0.804039	0.727826
Typos	0.774806	0.192172	0.735679	0.750336	0.726397	0.762139	0.032455	0.719843	0.767177	0.700259
Baseline	0.798030	0.169843	0.761196	0.767879	0.756296	0.788881	0.042115	0.750644	0.799548	0.728545
All Domain Shifts	0.798030	0.169843	0.761196	0.767879	0.756296	0.788881	0.042115	0.750644	0.799548	0.728545
No Hashtags	0.779412	0.183594	0.729214	0.740608	0.721849	0.783422	0.055099	0.730822	0.795115	0.706223
No HTTP Links	0.798030	0.169843	0.761196	0.767879	0.756296	0.788881	0.042115	0.750644	0.799548	0.728545
No @ Mentions	0.786865	0.178595	0.764952	0.763810	0.766616	0.788104	0.041246	0.763955	0.796607	0.749939

Corruption and domain-shift ablations. Typos and the combined corruptions lower DistilBERT’s accuracy to about 0.77, while the DistilRoBERTa model stays better calibrated, keeping a much lower ECE throughout.

Error Analysis

Eight misclassified samples from both models are listed in Appendix F. Misspellings such as “Deppression” are split into token fragments that are hard to interpret: ‘#’, ‘de’, ‘##pp’, ‘##ress’, ‘##ion’. These fragments cannot be clearly mapped to any particular emotional state, and the resulting sequence may not resemble anything the model saw during pre-training.
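To see why a single extra letter fragments the word, it helps to look at greedy longest-match-first subword splitting, the strategy WordPiece tokenizers use. The sketch below uses a tiny hypothetical vocabulary chosen to reproduce the fragments above; the real DistilBERT vocabulary is far larger, and the leading ‘#’ token would come from the punctuation split that happens before WordPiece, which this sketch omits.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split, WordPiece style."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, for illustration only.
VOCAB = {"depression", "de", "##pp", "##ress", "##ion"}

print(wordpiece("depression", VOCAB))   # ['depression']
print(wordpiece("deppression", VOCAB))  # ['de', '##pp', '##ress', '##ion']
```

The correctly spelled word maps to one token with a learned emotional association, while the misspelling falls through to generic fragments that carry little of that signal.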

For misspellings, we would need to implement character-level data augmentation to simulate typos. We would inject random character insertions, deletions, and keyboard-distance typos, especially targeting emotional keywords. Exposing the model to misspelled variants during training should teach the attention heads to recognize the resulting fragment patterns, which may improve the score.
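One possible form of the keyboard-distance augmentation is sketched below. It is an illustration only and was not part of the experiments in this work: the QWERTY adjacency is approximated by a simple row-and-column grid, and the keyword set and example sentence are hypothetical.

```python
import random

# Approximate QWERTY layout as a grid (assumption: ignores the row stagger).
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def neighbors(ch):
    """Return the keys adjacent to `ch` on the approximate grid."""
    for r, row in enumerate(ROWS):
        c = row.find(ch)
        if c == -1:
            continue
        out = []
        for rr in (r - 1, r, r + 1):
            if 0 <= rr < len(ROWS):
                for cc in (c - 1, c, c + 1):
                    if 0 <= cc < len(ROWS[rr]) and ROWS[rr][cc] != ch:
                        out.append(ROWS[rr][cc])
        return out
    return []  # not a letter key: no neighbors


def keyboard_typo(word, rng):
    """Replace one character of `word` with a neighboring key, if it has any."""
    i = rng.randrange(len(word))
    options = neighbors(word[i].lower())
    if not options:
        return word
    return word[:i] + rng.choice(options) + word[i + 1:]


def augment(text, keywords, rng=random):
    """Apply a keyboard typo only to words listed in `keywords`."""
    return " ".join(
        keyboard_typo(w, rng) if w.lower() in keywords else w
        for w in text.split()
    )


# Hypothetical emotional keyword list and sentence.
print(augment("i feel sad today", {"sad"}, rng=random.Random(0)))
```

Restricting the typos to a keyword list concentrates the augmentation on the tokens that carry the label, at the cost of needing such a list in the first place.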

Other misclassified sentences contain words that are semantically biased toward one class, where subtle but important tokens change the meaning significantly. For example, both tokenizers correctly break down “revolting”, but neither model weighs “i am” heavily enough to overcome the word’s association with anger. Thus “i am revolting” is misclassified as “anger”.

To counteract this, we need to increase our training samples or introduce more robust data augmentation to broaden the semantic representations our models learn. A well-suited technique in this case would be counterfactual data augmentation (CDA): we generate additional samples that use the word “revolting” in varied contexts, diversifying the model’s interpretation of “revolting” across a greater number of samples and classes.

Open Source and Closed Models

We evaluated two LLMs, Qwen3-4B-Instruct-2507 and GPT-4o-mini, on the prompts from Appendix A and Appendix B. With the structured prompt, they achieved the highest accuracy in our comparison without any fine-tuning, and their responses aligned with the distribution of the training data.

The structured prompt outperformed the minimal prompt on every metric, except throughput for Qwen. A plausible explanation is that longer, more specific prompts draw more effectively on a model’s parametric memory and elicit pre-trained knowledge relevant to the task. Our longer prompt thus increased the likelihood of an accurate emotion classification. Without such context priming, the model is not steered toward the task in the same specialized manner and is less likely to produce an aligned response.

The final results for each tested model have been collected and summarized in Table 2.

[1] F. Barbieri, J. Camacho-Collados, L. Neves, and L. Espinosa-Anke, “TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification.” 2020. [Online]. Available: https://arxiv.org/abs/2010.12421

[2] D. Q. Nguyen, T. Vu, and A. T. Nguyen, “BERTweet: A pre-trained language model for English Tweets,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Q. Liu and D. Schlangen, Eds., Online: Association for Computational Linguistics, Oct. 2020, pp. 9–14. doi: 10.18653/v1/2020.emnlp-demos.2.

[3] D. Loureiro, F. Barbieri, L. Neves, L. E. Anke, and J. Camacho-Collados, “TimeLMs: Diachronic Language Models from Twitter.” 2022. [Online]. Available: https://arxiv.org/abs/2202.03829

[4] D. Antypas et al., “SuperTweetEval: A Challenging, Unified and Heterogeneous Benchmark for Social Media NLP Research.” 2023. [Online]. Available: https://arxiv.org/abs/2310.14757

[5] J. Wei and K. Zou, “EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks.” 2019. [Online]. Available: https://arxiv.org/abs/1901.11196

[6] S. Edunov, M. Ott, M. Auli, and D. Grangier, “Understanding Back-Translation at Scale.” 2018. [Online]. Available: https://arxiv.org/abs/1808.09381

[7] L. Sun, C. Xia, W. Yin, T. Liang, P. S. Yu, and L. He, “Mixup-Transformer: Dynamic Data Augmentation for NLP Tasks.” 2020. [Online]. Available: https://arxiv.org/abs/2010.02394