The Dark Side of the Language: Pre-Trained Transformers in the DarkNet
January 2023
TL;DR: Pre-trained Transformers need extreme domain adaptation (full masked-language-model retraining) to perform well on DarkNet data.
Pre-trained Transformers, which excel in many NLP tasks thanks to their massive pre-training datasets, were tested on unseen sentences from a DarkNet corpus. Surprisingly, syntactic and lexical neural networks performed comparably to pre-trained Transformers, even after the Transformers were fine-tuned. Only after extreme domain adaptation, that is, retraining with the masked language model task on the entire novel corpus, did pre-trained Transformers reach their usual high performance. This indicates that the extensive pre-training corpora give Transformers an advantage chiefly by exposing them to a wide range of possible sentences.
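As a rough illustration of what such domain adaptation looks like in practice, the sketch below continues masked-language-model pre-training of a BERT checkpoint on an in-domain text file using the Hugging Face Transformers library. The model name, corpus file, and hyperparameters are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch: continued MLM pre-training on an in-domain corpus
# before any downstream fine-tuning. File names and hyperparameters
# are assumptions for illustration only.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Hypothetical in-domain corpus, one sentence or document per line.
dataset = load_dataset("text", data_files={"train": "darknet_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard MLM objective: randomly mask 15% of tokens and predict them.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert-darknet-adapted",
    num_train_epochs=3,              # illustrative; heavier adaptation may need more
    per_device_train_batch_size=16,
    learning_rate=5e-5,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```

The adapted checkpoint saved in `bert-darknet-adapted` would then be fine-tuned on the downstream task in the usual way; the point of the result above is that skipping this adaptation step leaves the Transformer no better than simpler syntactic and lexical models on this domain.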