The Dark Side of the Language: Pre-Trained Transformers in the DarkNet

    January 2023
    Leonardo Ranaldi, Aria Nourbakhsh, Arianna Patrizi, Elena Sofia Ruzzetti, Dario Onorati, Francesca Fallucchi, Fabio Massimo Zanzotto
    TLDR: Pre-trained Transformers need extreme domain adaptation to perform well on DarkNet data.
    Pre-trained Transformers, which excel at many NLP tasks thanks to their massive pre-training datasets, were tested on unseen sentences from a DarkNet corpus. Surprisingly, syntactic and lexical neural networks performed comparably to pre-trained Transformers, even after the Transformers were fine-tuned. Only after extreme domain adaptation, that is, retraining with the masked language model objective on the entire novel corpus, did pre-trained Transformers reach their usual high performance. This suggests that the extensive pre-training corpora give Transformers an advantage chiefly by exposing them to a wide range of possible sentences.
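    The domain adaptation the study describes amounts to continuing masked language model (MLM) pre-training on the new corpus before fine-tuning. Below is a minimal sketch of that step using the Hugging Face transformers and datasets libraries; the file name darknet_corpus.txt, the output directory, the base checkpoint, and all hyperparameters are illustrative assumptions, not the paper's actual configuration.

    ```python
    # Sketch: continue pre-training a BERT-style model with the MLM objective
    # on a new domain corpus (here a hypothetical "darknet_corpus.txt",
    # one sentence per line), before fine-tuning on the downstream task.
    from datasets import load_dataset
    from transformers import (
        AutoModelForMaskedLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    model_name = "bert-base-uncased"  # assumed base checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)

    # Load and tokenize the raw domain corpus.
    raw = load_dataset("text", data_files={"train": "darknet_corpus.txt"})
    tokenized = raw.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
        batched=True,
        remove_columns=["text"],
    )

    # The collator randomly masks tokens, turning plain text into MLM examples.
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

    args = TrainingArguments(
        output_dir="bert-darknet-adapted",
        num_train_epochs=3,               # illustrative, not the paper's setting
        per_device_train_batch_size=16,
        learning_rate=5e-5,
    )

    Trainer(
        model=model,
        args=args,
        train_dataset=tokenized["train"],
        data_collator=collator,
    ).train()

    # The adapted checkpoint in "bert-darknet-adapted" can then be loaded with
    # AutoModelForSequenceClassification and fine-tuned on the labeled DarkNet task.
    ```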