Turkish-ColBERT: An open-source model for precise Turkish paragraph retrieval

Turkish Colbert

Developed by ytu-ce-cosmos

Turkish paragraph retrieval model based on the ColBERT architecture, fine-tuned on the Turkish-translated MS MARCO dataset

Text Embedding

Safetensors

Open Source License: MIT
Tags: #Turkish retrieval #Paragraph similarity #Scientific literature retrieval

Downloads 1,724

Release Time: 12/3/2024

Model Overview

This is a Turkish paragraph retrieval model based on the ColBERT architecture. It was fine-tuned on 500,000 triplets from the Turkish-translated MS MARCO dataset.

Model Features

Turkish optimization

Paragraph retrieval model optimized for Turkish, fine-tuned from the Cosmos Turkish Base BERT model

Efficient retrieval

Uses the ColBERT architecture for efficient paragraph retrieval

Case handling

Documents a manual lowercasing workaround for the Turkish-specific 'I' character, which a known tokenizer issue otherwise mishandles

Model Capabilities

Turkish paragraph retrieval

Sentence similarity calculation

Document indexing and search

Use Cases

Information retrieval

Scientific literature retrieval

Retrieve relevant information from scientific literature databases

Achieved an R@1 of 48.38 on the Scifact-tr dataset

Encyclopedia knowledge retrieval

Retrieve relevant information from encyclopedia knowledge bases

Achieved an R@1 of 31.21 on the WikiRAG-TR dataset

🚀 Turkish-ColBERT

This is a Turkish passage retrieval model based on the ColBERT architecture, offering efficient and accurate retrieval of Turkish passages.

🚀 Quick Start

The Cosmos Turkish Base BERT model was fine-tuned on 500k triplets (query, positive passage, negative passage) from a Turkish-translated version of the MS MARCO dataset.

⚠️ Important Note

To use the model uncased, you must convert text to lowercase manually. This is due to a known issue with the tokenizer.

💡 Usage Tip

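Convert your text to lowercase as follows. Replacing "I" with the dotless "ı" first matters because Python's lower() would otherwise map "I" to "i":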

text.replace("I", "ı").lower()
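As a minimal sketch, the one-liner above can be wrapped in a small helper so that documents and queries are always normalized the same way. The function name `turkish_lower` is hypothetical and not part of the model card or of RAGatouille. Note that this replacement only handles the ASCII capital "I"; Python lowercases the dotted capital "İ" to "i" followed by a combining dot (U+0307), which the helper leaves untouched.

```python
def turkish_lower(text: str) -> str:
    """Lowercase text using the replacement shown above.

    Python's str.lower() maps "I" to "i", but Turkish pairs "I" with the
    dotless "ı", so the capital is swapped before lowercasing.
    """
    return text.replace("I", "ı").lower()


# Hypothetical helper in action, compared with plain str.lower().
print(turkish_lower("ISPARTA"))  # ısparta
print("ISPARTA".lower())         # isparta

# The dotted capital "İ" becomes "i" + U+0307 (combining dot above).
print(turkish_lower("İki") == "i\u0307ki")  # True
```

Applying the same helper to both the indexed passages and the incoming queries keeps the two sides consistent.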

✨ Features

Based on the ColBERT architecture for efficient passage retrieval.
Fine-tuned on a Turkish-translated version of the MS MARCO dataset.
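For readers unfamiliar with ColBERT, the sketch below illustrates the late-interaction ("MaxSim") scoring idea the architecture is known for: every query token embedding is matched against its most similar passage token embedding, and those maxima are summed. This is a toy illustration in plain Python with made-up 2-dimensional vectors, not the code this model or RAGatouille runs; the names `dot` and `maxsim_score` are hypothetical.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Sum, over query tokens, of the best-matching passage token similarity."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy token embeddings, invented purely for illustration.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_b = [[0.5, 0.5], [0.5, 0.5]]

print(maxsim_score(query, doc_a))  # 2.0
print(maxsim_score(query, doc_b))  # 1.0
```

Here `doc_a` scores higher because it contains an exact match for each query token, whereas `doc_b` only matches each token partially.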

📦 Installation

!pip install ragatouille

💻 Usage Examples

Basic Usage

from ragatouille import RAGPretrainedModel

rag = RAGPretrainedModel.from_pretrained("ytu-ce-cosmos/turkish-colbert")

docs = [
    "Marie Curie, radyoaktivite üzerine yaptığı çalışmalarla bilim dünyasına büyük katkılar sağlamıştır. Polonyum ve radyum elementlerini keşfetmiştir. İki farklı dalda Nobel Ödülü alan ilk kişi olmuştur.",
    "Isaac Newton, fizik ve matematik alanında yaptığı çalışmalarla bilinir. Yerçekimi teorisi ve hareket yasaları, bilim dünyasında çığır açmıştır. Ayrıca, matematiksel analiz üzerinde de önemli katkıları vardır.",
    "Albert Einstein, izafiyet teorisini geliştirerek modern fiziğin temellerini atmıştır. 1921 yılında Nobel Fizik Ödülü'nü kazanmıştır. Kütle-enerji eşitliği (E=mc²) onun en ünlü formülüdür.",
    "Alexander Fleming, 1928 yılında penisilini keşfederek modern tıpta devrim yaratmıştır. Bu keşfi sayesinde 1945 yılında Nobel Tıp Ödülü kazanmıştır. Fleming'in çalışmaları antibiyotiklerin gelişimine öncülük etmiştir.",
    "Nikola Tesla, alternatif akım (AC) sistemini geliştirmiştir. Elektrik mühendisliği alanında devrim niteliğinde çalışmalar yapmıştır. Kablosuz enerji aktarımı üzerine projeleriyle tanınır."
]

# Lowercase manually (see the note above); swap "I" for dotless "ı" first.
docs = [doc.replace("I", "ı").lower() for doc in docs]

rag.index(docs, index_name="sampleTest")

query = "Birden fazla Nobel Ödülü alan bilim insanı kimdir?"
query = query.replace("I", "ı").lower()

results = rag.search(query, k=1)
print(results[0]['content']) # "marie curie, radyoaktivite üzerine yaptığı çalışmalarla bilim dünyasına büyük katkılar sağlamıştır. polonyum ve radyum elementlerini keşfetmiştir. i̇ki farklı dalda nobel ödülü alan ilk kişi olmuştur."

📚 Documentation

Evaluation

Dataset	R@1	R@5	R@10	MRR@10
Scifact-tr	48.38	67.85	75.52	56.88
WikiRAG-TR	31.21	75.63	79.63	49.08
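The model card does not describe the evaluation protocol, so the following is only a sketch of how metrics such as R@k and MRR@10 are commonly computed, assuming each query comes with a ranked list of document IDs and a set of relevant IDs. The function names and the sample data are hypothetical, and the table values above are presumably these averages expressed as percentages.

```python
from typing import List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def mrr_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Reciprocal rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical rankings and relevance judgments for two queries.
runs = {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d9", "d4"]}
qrels = {"q1": {"d1"}, "q2": {"d2"}}

n = len(runs)
r1 = sum(recall_at_k(runs[q], qrels[q], 1) for q in runs) / n
mrr10 = sum(mrr_at_k(runs[q], qrels[q], 10) for q in runs) / n
print(r1, mrr10)  # 0.5 0.75
```

In this toy run, only `q2` has its relevant document at rank 1, so R@1 is 0.5, while `q1` finds its relevant document at rank 2 and contributes a reciprocal rank of 0.5.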

📄 License

This project is licensed under the MIT license.

Acknowledgments

Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC). Thanks for providing access to the TFRC ❤️
Thanks to the generous support from the Hugging Face team, it is possible to download models from their S3 storage 🤗

Citations

@article{kesgin2023developing,
  title={Developing and Evaluating Tiny to Medium-Sized Turkish BERT Models},
  author={Kesgin, Himmet Toprak and Yuce, Muzaffer Kaan and Amasyali, Mehmet Fatih},
  journal={arXiv preprint arXiv:2307.14134},
  year={2023}
}

Contact

COSMOS AI Research Group, Yildiz Technical University Computer Engineering Department
https://cosmos.yildiz.edu.tr/
cosmos@yildiz.edu.tr
