Building open-source classifiers

The Challenge

Accurately classifying and annotating scientific texts is crucial for managing and leveraging vast amounts of data, yet existing models often fall short in both accuracy and efficiency. The research support offices at SciLifeLab and KTH contacted us seeking a robust solution to streamline the classification process and improve accuracy in their bibliometric work on scientific articles.

“With the fast-growing community, some of the most used open-source ML libraries and tools, and a talented science team exploring the edge of tech, Hugging Face is at the heart of the AI revolution.”

Hugging Face

The Solution

Terran, in collaboration with SciLifeLab Data Center and KTH Research Support Office, developed an advanced, fine-tuned open-source classifier model. This model was designed to predict 134 classes from the WOS-46985 dataset, as detailed in the publication by Kowsari et al. (2017). The model is open source and available on Hugging Face.

Key Features of Terran’s Classifier Model:

  1. Fine-Tuned BERT Model: The classifier is based on the BERT (Bidirectional Encoder Representations from Transformers) model, specifically the bert-base-uncased version, which was fine-tuned to enhance its performance on the WOS-46985 dataset.
  2. High Accuracy: The model achieved an accuracy of 83% on the final (most fine-grained) classification level, surpassing the previous state-of-the-art accuracy of 77%. This improvement was validated using a 10/90 validation/training split.
  3. Open Source and Accessible: Released under the Apache 2.0 license, the model is open source, allowing researchers and institutions worldwide to leverage and further develop it for their specific needs.
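Since the model is released on the Hugging Face Hub, it could be used through the standard `transformers` text-classification pipeline. The sketch below assumes that API; the repository id in the comment is hypothetical, as the exact Hub id is not stated on this page.

```python
def load_classifier(model_id: str):
    """Build a text-classification pipeline for a fine-tuned BERT model."""
    # Imported lazily so the helper can be defined without transformers installed.
    from transformers import pipeline
    return pipeline("text-classification", model=model_id)

# Usage (hypothetical repo id -- downloading the model requires network access):
# clf = load_classifier("terran/wos-46985-bert")
# clf("An abstract about CRISPR-Cas9 genome editing in human cells")
```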

The Results

The implementation of Terran’s classifier delivered the following benefits to SciLifeLab and KTH:

  1. Enhanced Accuracy: The model’s 83% accuracy represents a substantial leap over the previous state-of-the-art, significantly improving the reliability of scientific text annotation.
  2. Efficiency in Annotation: By automating the classification process, the model has streamlined scientific text annotation, reducing the time and effort required by researchers and support staff.
  3. Wider Adoption and Use: Being open source, the model can be adopted by other institutions, facilitating broader improvements in scientific text classification across the academic community.
  4. Improved Research Support: The enhanced classification accuracy and efficiency have helped in data management, literature reviews, and trend analysis.
  5. Foundation for Future Developments: The model serves as a robust foundation for future advancements in scientific text classification, encouraging further innovations and refinements.

Conclusion

By significantly enhancing accuracy and efficiency, Terran’s open-source model has addressed critical challenges in scientific text annotation, providing a powerful tool for academic research. This case study highlights the impact of cutting-edge AI solutions in advancing research capabilities.


Model Details

Model Description: A fine-tuned model that predicts the 134 classes of the WOS-46985 dataset, enhancing the classification of scientific texts.

Developed by: Terran, SciLifeLab Data Center, and KTH Research Support Office

License: Apache 2.0

Fine-tuned from model: bert-base-uncased

Evaluation: Conducted using a 10/90 validation/training split, achieving an accuracy of 83%.
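The 10/90 validation/training split mentioned above can be sketched with the standard library alone; the document names are placeholders, and only the dataset size (46,985 documents) comes from the WOS-46985 dataset itself.

```python
import random

def split_10_90(examples, seed=42):
    """Shuffle examples and hold out 10% for validation; return (train, val)."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_val = len(shuffled) // 10  # 10% held out for validation
    return shuffled[n_val:], shuffled[:n_val]

docs = [f"abstract-{i}" for i in range(46985)]  # WOS-46985 has 46,985 documents
train, val = split_10_90(docs)
print(len(train), len(val))  # 42287 4698
```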

Summary: An invaluable model for annotating scientific texts, significantly outperforming previous state-of-the-art models.
