The Dark Side of the Language: Pre-Trained Transformers in the DarkNet
January 2023
TL;DR: Pre-trained Transformers need extreme domain adaptation (full masked-language-model retraining) to perform well on DarkNet data.
Pre-trained Transformers, which excel in many NLP tasks thanks to their massive pre-training datasets, were tested on unseen sentences from a DarkNet corpus. Surprisingly, syntactic and lexical neural networks performed comparably to pre-trained Transformers, even after the Transformers were fine-tuned. Only after extreme domain adaptation, that is, retraining with the masked language model task on the entire novel corpus, did pre-trained Transformers reach their usual high performance. This indicates that the extensive pre-training corpora give Transformers an advantage chiefly by exposing them to a wide range of possible sentences.
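As a rough illustration of what such domain adaptation looks like in practice, the sketch below continues masked-language-model pre-training of a BERT checkpoint on an in-domain text file using the Hugging Face Transformers library. The model name, corpus file, and hyperparameters are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch: continued MLM pre-training on an in-domain corpus
# before any downstream fine-tuning. File names and hyperparameters
# are assumptions for illustration only.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Hypothetical in-domain corpus, one sentence or document per line.
dataset = load_dataset("text", data_files={"train": "darknet_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard MLM objective: randomly mask 15% of tokens and predict them.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert-darknet-adapted",
    num_train_epochs=3,              # illustrative; heavier adaptation may need more
    per_device_train_batch_size=16,
    learning_rate=5e-5,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```

The adapted checkpoint saved in `bert-darknet-adapted` would then be fine-tuned on the downstream task in the usual way; the point of the result above is that skipping this adaptation step leaves the Transformer no better than simpler syntactic and lexical models on this domain.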