Cocom-v1-128-Mistral-7B Open-Source Q&A Model - Efficiently Compress Context to Accelerate Q&A Generation

Cocom V1 128 Mistral 7b

Developed by naver

COCOM is an efficient context compression method that compresses long contexts into a small number of context embeddings, significantly accelerating the generation time for QA tasks.

Large Language Model

Transformers

English

#Context Compression #Efficient QA Generation #Retrieval-Augmented Generation Optimization

Downloads 53

Release Time: 10/15/2024

Model Overview

COCOM is a context compression method for Retrieval-Augmented Generation (RAG) that enhances generation speed by compressing long contexts into a small number of context embeddings, supporting different compression rates for flexible trade-offs between decoding time and answer quality.

Model Features

Efficient Context Compression

Compresses long contexts into a small number of context embeddings, significantly reducing decoding time.

Flexible Compression Rate

Supports different compression rates, allowing flexible trade-offs between decoding time and answer quality.

Multi-context Processing

Efficiently handles multi-context scenarios, greatly reducing decoding time for long inputs.

Model Capabilities

Context compression

QA generation

Retrieval-Augmented Generation

Use Cases

Information Retrieval & QA

Movie Character QA

Quickly generates accurate answers based on multiple context segments.

Achieves speed improvements of up to 5.69x while maintaining high performance.

🚀 COCOM - Context Compression for Efficient RAG

COCOM is an effective context compression method. It reduces long contexts to a small number of context embeddings, which significantly shortens answer generation time in question answering.

🚀 Quick Start

COCOM is designed to address the increased decoding time that Retrieval-Augmented Generation (RAG) incurs on long inputs. It compresses long contexts into a small number of context embeddings, which speeds up answer generation.

✨ Features

Effective Context Compression: COCOM reduces long contexts to a handful of context embeddings, speeding up generation for question answering.
Adjustable Compression Rates: It supports different compression rates, enabling a trade-off between decoding time and answer quality.
Multi-Context Handling: Compared to earlier methods, COCOM handles multiple contexts more effectively, significantly reducing decoding time for long inputs.
High Performance: The method demonstrates a speed-up of up to 5.69 times while achieving higher performance than existing efficient context compression methods.

💻 Usage Examples

Basic Usage

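from transformers import AutoModel

# trust_remote_code=True lets transformers load the custom model code shipped in the model repository.
model = AutoModel.from_pretrained('naver/cocom-v1-128-mistral-7b', trust_remote_code=True)
model = model.to('cuda')  # move the model to the GPU
# One inner list of contexts per question; this example passes 5 contexts for a single question.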
contexts = [[
  'Rosalind Bailey. Rosalind Bailey Rosalind Bailey (born 1946) is a British actress, known for her portrayal of Sarah Headley ("née" Lytton) in the 1970s and 1980s BBC television drama “When the Boat Comes In". Bailey has appeared in numerous British television drama series, including "Byker Grove", “Distant Shores" and "Burn Up". Her stage work includes playing Miss Mary Shepherd in Alan Bennett’s play "The Lady in the Van”.',
  'Malcolm Terris. Malcolm Terris Malcolm Terris (born 11 January 1941 in Sunderland, County Durham) is a British actor. He had a lengthy career in a large number of television programmes. Possibly his best-known role was in "When the Boat Comes In", a popular 1970s series, where he played the part of Matt Headley. His film career includes appearances in "The First Great Train Robbery" (1978), "McVicar" (1980), "The Plague Dogs" (1982, voice only), "Slayground" (1983), “The Bounty" (1984) as Thomas Huggan, ship’s surgeon, "Mata Hari" (1985), "Revolution" (1985), “Scandal" (1989), and “Chaplin” (1992). His TV appearances include: One episode of',
  'When the Boat Comes In. When the Boat Comes In When the Boat Comes In is a British television period drama produced by the BBC between 1976 and 1981. The series stars James Bolam as Jack Ford, a First World War veteran who returns to his poverty-stricken (fictional) town of Gallowshield in the North East of England. The series dramatises the political struggles of the 1920s and 1930s and explores the impact of national and international politics upon Ford and the people around him. Section:Production. The majority of episodes were written by creator James Mitchell, but in Series 1 north-eastern',
  'Susie Youssef. Youssef began her comedy career as a writer for "The Ronnie Johns Half Hour" in 2006, and made her acting debut in the short film "Clicked" in the role of Lina in 2011. In 2014, she played Jane in the short film "Kevin Needs to Make New Friends: Because Everyone Hates Him for Some Reason" and then turned to television where she appeared in "The Chaser’s Media Circus". In 2014, Youssef played the lead role of Sarah in the Hayloft Project’s stage play "The Boat People" which won the Best On Stage award at the FBi SMAC Awards',
  'Madelaine Newton. Madelaine Newton Madelaine Newton is a British actress best known for her portrayal of Dolly in 1970s BBC television drama "When the Boat Comes In". She is married to actor Kevin Whately, known for his role as Robert "Robbie" Lewis in both "Inspector Morse” and its spin-off "Lewis". They have two children. She starred alongside her husband in the “Inspector Morse" episode "Masonic Mysteries" as Beryl Newsome - the love-interest of Morse - whom Morse was wrongly suspected of murdering. She played Whately’s on-screen wife in the 1988 Look and Read children’s serial, Geordie Racer. She also made'
  ]]
questions = ['who played sarah hedley in when the boat comes in?']

# Generate answers for the batch of questions from the provided contexts.
answers = model.generate_from_text(contexts=contexts, questions=questions, max_new_tokens=128)

print(answers)

📚 Documentation

Model Inference

For batch processing, the model takes the following as input:

questions (list): A list containing questions.
contexts (list of lists): For each question, a list of contexts. The number of contexts must be the same for every question in the batch. The models were fine-tuned with 5 contexts and should also be run with 5 contexts at inference.

The model compresses the contexts into context embeddings and answers each question based on the provided context embeddings.
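The input format described above can be checked before calling the model. The following is a minimal sketch, not part of the COCOM code: `prepare_batch` and `EXPECTED_CONTEXTS` are hypothetical names, and trimming extra passages to the first 5 assumes the retrieved passages are already ranked best-first.

```python
from typing import List, Sequence, Tuple

# The model card states the models were fine-tuned with 5 contexts per question.
EXPECTED_CONTEXTS = 5


def prepare_batch(
    questions: Sequence[str],
    retrieved: Sequence[Sequence[str]],
    num_contexts: int = EXPECTED_CONTEXTS,
) -> Tuple[List[str], List[List[str]]]:
    """Validate a batch and trim each question's contexts to `num_contexts`.

    Hypothetical helper for illustration; it is not provided by the model.
    """
    if len(questions) != len(retrieved):
        raise ValueError(
            f"got {len(questions)} questions but {len(retrieved)} context lists"
        )
    contexts: List[List[str]] = []
    for i, passages in enumerate(retrieved):
        if len(passages) < num_contexts:
            raise ValueError(
                f"question {i} has {len(passages)} contexts, need {num_contexts}"
            )
        # Assumption: passages are ranked best-first, so keep the leading ones.
        contexts.append(list(passages[:num_contexts]))
    return list(questions), contexts
```

The returned `questions` and `contexts` have the shapes expected by `model.generate_from_text(contexts=contexts, questions=questions, max_new_tokens=128)` as used in the basic example.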

References

Paper: https://arxiv.org/pdf/2407.09252

@misc{rau2024contextembeddingsefficientanswer,
      title={Context Embeddings for Efficient Answer Generation in RAG}, 
      author={David Rau and Shuai Wang and Hervé Déjean and Stéphane Clinchant},
      year={2024},
      eprint={2407.09252},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.09252}, 
}

🔧 Technical Details

COCOM is an effective context compression method for Retrieval-Augmented Generation (RAG). RAG extends the input of LLMs with external context to overcome their limited knowledge. However, longer inputs in RAG lead to a significant increase in decoding time. COCOM addresses this issue by compressing long contexts into a small number of context embeddings, which speeds up generation. It supports different compression rates, letting users trade decoding time against answer quality. Compared to earlier methods, COCOM handles multiple contexts more effectively, reducing decoding time for long inputs. It demonstrates a speed-up of up to 5.69 times while achieving higher performance than existing efficient context compression methods.
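To make the trade-off concrete, the sketch below estimates how many context embeddings the decoder sees for a given compression rate. It rests on two assumptions that the text above does not confirm: that the "128" in the model name is the compression rate (input tokens per context embedding), and that a partial chunk rounds up to one embedding. The function names are hypothetical, and the actual implementation may round differently.

```python
import math
from typing import Iterable


def num_context_embeddings(context_tokens: int, compression_rate: int = 128) -> int:
    """Estimate the number of embeddings a single context is compressed into.

    Assumption: one embedding per `compression_rate` tokens, rounded up.
    """
    if context_tokens < 0 or compression_rate <= 0:
        raise ValueError("context_tokens must be >= 0 and compression_rate > 0")
    return math.ceil(context_tokens / compression_rate)


def decoder_context_length(
    context_token_counts: Iterable[int], compression_rate: int = 128
) -> int:
    """Total embeddings across all contexts, each compressed independently."""
    return sum(num_context_embeddings(n, compression_rate) for n in context_token_counts)


# Five contexts of 128 tokens each: 640 raw tokens versus 5 embeddings under these assumptions.
raw_tokens = sum([128] * 5)
compressed = decoder_context_length([128] * 5)
print(raw_tokens, compressed)
```

A lower compression rate yields more embeddings per context, which preserves more information at the cost of a longer decoder input.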

