fine-tuned-embedding-model: an open-source sentence transformer for computing semantic text similarity for free

Fine Tuned Embedding Model

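Developed by svb01

This is a sentence-transformers model fine-tuned from sentence-transformers/all-MiniLM-L6-v2. It maps text to a 384-dimensional vector space and supports tasks such as semantic similarity computation.

Text Embedding

Safetensors

#Short-text semantic matching #Contrastive learning with multiple negatives #Risk management text analysis

Downloads: 17

Release date: 9/23/2024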

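Model Overview

This model maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks such as semantic textual similarity, semantic search, paraphrase mining, text classification, and clustering.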
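The mapping described above can be sketched as follows. This is a minimal sketch, not the author's own usage code: the repository id `svb01/fine-tuned-embedding-model` is an assumption inferred from the developer and model names on this page, and the helper names `embed`, `demo`, and `cosine_similarity` are illustrative. `SentenceTransformer(...)` and `.encode(...)` are calls from the sentence-transformers library.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (plain Python)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def embed(sentences, model_id="svb01/fine-tuned-embedding-model"):
    """Encode sentences into dense vectors (384 dimensions per this page).

    model_id is an ASSUMED repository id; replace it with the real one.
    Requires `pip install sentence-transformers` and network access the
    first time the model is downloaded.
    """
    from sentence_transformers import SentenceTransformer  # lazy import

    model = SentenceTransformer(model_id)
    return model.encode(sentences).tolist()


def demo():
    """Embed two sentences and print the vector size and their similarity."""
    vectors = embed([
        "What does this text say about risk management?",
        "Organizations should re-assess model risks after fine-tuning.",
    ])
    print(len(vectors[0]))
    print(cosine_similarity(vectors[0], vectors[1]))
```

Calling `demo()` downloads the model and prints the embedding size followed by one similarity score. The similarity helper is kept in plain Python so it can be reused with vectors from any source.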

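Model Features

Efficient semantic encoding

Encodes text into 384-dimensional vectors efficiently while preserving semantic information.

Multi-task support

Supports a range of downstream tasks, including semantic similarity computation, text classification, and clustering.

Lightweight model

Built on the MiniLM architecture, which reduces compute requirements while maintaining performance.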

Model Capabilities

Semantic textual similarity

Semantic search

Paraphrase mining

Text classification

Text clustering

Feature extraction
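As one illustration of using embeddings for text classification, the sketch below assigns a vector to the class whose centroid is most similar. Nearest-centroid classification is a technique chosen here for illustration, not something this page specifies; the labels and the 2-dimensional toy vectors are hypothetical stand-ins for real 384-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def classify(vector, labeled_examples):
    """Return the label whose centroid is most similar to `vector`.

    labeled_examples maps each label to a list of example embeddings.
    """
    centroids = {label: centroid(vs) for label, vs in labeled_examples.items()}
    return max(centroids, key=lambda label: cosine_similarity(vector, centroids[label]))


# Hypothetical labels with toy vectors standing in for real embeddings.
examples = {
    "privacy": [[1.0, 0.1], [0.9, 0.0]],
    "security": [[0.0, 1.0], [0.1, 0.9]],
}
predicted = classify([0.8, 0.2], examples)
print(predicted)  # prints privacy
```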

Use Cases

Information retrieval

Document similarity matching

Computes semantic similarity between documents and recommends related documents.
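Document similarity matching of this kind usually reduces to ranking document embeddings by cosine similarity to a query embedding. The sketch below shows that ranking step only; the function name `rank_documents` and the 3-dimensional toy vectors are illustrative, standing in for real 384-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return (index, score) pairs for the top_k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors in place of real embeddings.
query = [1.0, 0.0, 0.0]
docs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # close to the query
    [0.5, 0.5, 0.0],  # partially related
]
ranking = rank_documents(query, docs, top_k=2)
print([index for index, _ in ranking])  # prints [1, 2]
```

For large collections an approximate nearest-neighbor index would normally replace the linear scan; the exhaustive version is shown because it is the easiest to verify.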

Content management

Duplicate content detection

Identifies duplicate content that is semantically similar.
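One common way to detect semantically duplicated content is to flag every pair of embeddings whose cosine similarity exceeds a threshold. The sketch below assumes that approach; the threshold of 0.9 and the toy vectors are illustrative values, not settings taken from this page, and a real threshold would need tuning on your own data.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_near_duplicates(vectors, threshold=0.9):
    """Return index pairs (i, j) with i < j whose similarity >= threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(vectors), 2)
        if cosine_similarity(a, b) >= threshold
    ]


# Toy vectors: items 0 and 1 point in almost the same direction.
vectors = [
    [1.0, 0.0],
    [0.99, 0.05],
    [0.0, 1.0],
]
pairs = find_near_duplicates(vectors, threshold=0.9)
print(pairs)  # prints [(0, 1)]
```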

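🚀 Document Summary

This document describes a sentence similarity model built with Sentence Transformers. It covers basic information such as the base model, library name, pipeline tag, and tags, and shows widget examples for testing sentence similarity. The widget examples also cover related topics such as risk management and data privacy.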

📦 Model Information

Attribute	Details
Base model	sentence-transformers/all-MiniLM-L6-v2
Library name	sentence-transformers
Pipeline tag	sentence-similarity
Tags	sentence-transformers, sentence-similarity, feature-extraction, generated_from_trainer, dataset_size:555, loss:MultipleNegativesRankingLoss
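The `loss:MultipleNegativesRankingLoss` tag indicates training on (anchor, positive) pairs, where the positives of the other pairs in a batch act as negatives. The sketch below reimplements the idea in plain Python for illustration only; it is not the library's code. The scale factor of 20.0 and the use of cosine similarity are assumptions about typical settings, not values confirmed by this page.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over a batch of (anchor, positive) pairs.

    positives[i] is the correct match for anchors[i]; every other
    positives[j] in the batch is treated as a negative for anchors[i].
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine_similarity(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over this anchor's logits.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]  # each anchor matches its own positive
swapped = [[0.0, 1.0], [1.0, 0.0]]  # each anchor matches the wrong positive
loss_aligned = multiple_negatives_ranking_loss(anchors, aligned)
loss_swapped = multiple_negatives_ranking_loss(anchors, swapped)
print(loss_aligned, loss_swapped)
```

The loss is close to zero when each anchor is most similar to its own positive and close to the scale factor when the pairing is reversed, which is why larger batches (more in-batch negatives) generally make the training signal harder.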

💻 Usage Examples

Sentence similarity tests via the widget

Source sentence: "What does this text say about unclassified?"

- Sentence 1: "these sources. \nErrors in third-party GAI components can also have downstream impacts on accuracy and robustness. \nFor example, test datasets commonly used to benchmark or validate models can contain label errors. \nInaccuracies in these labels can impact the “stability” or robustness of these benchmarks, which many \nGAI practitioners consider during the model selection process.  \nTrustworthy AI Characteristics: Accountable and Transparent, Explainable and Interpretable, Fair with \nHarmful Bias Managed, Privacy Enhanced, Safe, Secure and Resilient, Valid and Reliable \n3. \nSuggested Actions to Manage GAI Risks \nThe following suggested actions target risks unique to or exacerbated by GAI. \nIn addition to the suggested actions below, AI risk management activities and actions set forth in the AI \nRMF 1.0 and Playbook are already applicable for managing GAI risks. Organizations are encouraged to"
- Sentence 2: "and hardware vulnerabilities; labor practices; data privacy and localization \ncompliance; geopolitical alignment). \nData Privacy; Information Security; \nValue Chain and Component \nIntegration; Harmful Bias and \nHomogenization \nMG-3.1-003 \nRe-assess model risks after ﬁne-tuning or retrieval-augmented generation \nimplementation and for any third-party GAI models deployed for applications \nand/or use cases that were not evaluated in initial testing. \nValue Chain and Component \nIntegration \nMG-3.1-004 \nTake reasonable measures to review training data for CBRN information, and \nintellectual property, and where appropriate, remove it. Implement reasonable \nmeasures to prevent, ﬂag, or take other action in response to outputs that \nreproduce particular training data (e.g., plagiarized, trademarked, patented, \nlicensed content or trade secret material). \nIntellectual Property; CBRN \nInformation or Capabilities \n \n43"
- Sentence 3: "• \nStage of the AI lifecycle: Risks can arise during design, development, deployment, operation, \nand/or decommissioning. \n• \nScope: Risks may exist at individual model or system levels, at the application or implementation \nlevels (i.e., for a speciﬁc use case), or at the ecosystem level – that is, beyond a single system or \norganizational context. Examples of the latter include the expansion of “algorithmic \nmonocultures,3” resulting from repeated use of the same model, or impacts on access to \nopportunity, labor markets, and the creative economies.4 \n• \nSource of risk: Risks may emerge from factors related to the design, training, or operation of the \nGAI model itself, stemming in some cases from GAI model or system inputs, and in other cases, \nfrom GAI system outputs. Many GAI risks, however, originate from human behavior, including \n \n \n3 “Algorithmic monocultures” refers to the phenomenon in which repeated use of the same model or algorithm in"

Source sentence: "What does this text say about risk management?"

- Sentence 1: "robust watermarking techniques and corresponding detectors to identify the source of content or \nmetadata recording techniques and metadata management tools and repositories to trace content \norigins and modiﬁcations. Further narrowing of GAI task deﬁnitions to include provenance data can \nenable organizations to maximize the utility of provenance data and risk management eﬀorts. \nA.1.7. Enhancing Content Provenance through Structured Public Feedback \nWhile indirect feedback methods such as automated error collection systems are useful, they often lack \nthe context and depth that direct input from end users can provide. Organizations can leverage feedback \napproaches described in the Pre-Deployment Testing section to capture input from external sources such \nas through AI red-teaming.  \nIntegrating pre- and post-deployment external feedback into the monitoring process for GAI models and"
- Sentence 2: "tools for monitoring third-party GAI risks; Consider policy adjustments across GAI \nmodeling libraries, tools and APIs, ﬁne-tuned models, and embedded tools; \nAssess GAI vendors, open-source or proprietary GAI tools, or GAI service \nproviders against incident or vulnerability databases. \nData Privacy; Human-AI \nConﬁguration; Information \nSecurity; Intellectual Property; \nValue Chain and Component \nIntegration; Harmful Bias and \nHomogenization \nGV-6.1-010 \nUpdate GAI acceptable use policies to address proprietary and open-source GAI \ntechnologies and data, and contractors, consultants, and other third-party \npersonnel. \nIntellectual Property; Value Chain \nand Component Integration \nAI Actor Tasks: Operation and Monitoring, Procurement, Third-party entities \n \nGOVERN 6.2: Contingency processes are in place to handle failures or incidents in third-party data or AI systems deemed to be \nhigh-risk. \nAction ID \nSuggested Action \nGAI Risks \nGV-6.2-001"
- Sentence 3: "MEASURE 2.3: AI system performance or assurance criteria are measured qualitatively or quantitatively and demonstrated for \nconditions similar to deployment setting(s). Measures are documented. \nAction ID \nSuggested Action \nGAI Risks \nMS-2.3-001 Consider baseline model performance on suites of benchmarks when selecting a \nmodel for ﬁne tuning or enhancement with retrieval-augmented generation. \nInformation Security; \nConfabulation \nMS-2.3-002 Evaluate claims of model capabilities using empirically validated methods. \nConfabulation; Information \nSecurity \nMS-2.3-003 Share results of pre-deployment testing with relevant GAI Actors, such as those \nwith system release approval authority. \nHuman-AI Conﬁguration \n \n31 \nMS-2.3-004 \nUtilize a purpose-built testing environment such as NIST Dioptra to empirically \nevaluate GAI trustworthy characteristics. \nCBRN Information or Capabilities; \nData Privacy; Confabulation; \nInformation Integrity; Information \nSecurity; Dangerous, Violent, or"

Source sentence: "What does this text say about data privacy?"

- Sentence 1: "Property. We also note that some risks are cross-cutting between these categories.  \n \n4 \n1. CBRN Information or Capabilities: Eased access to or synthesis of materially nefarious \ninformation or design capabilities related to chemical, biological, radiological, or nuclear (CBRN) \nweapons or other dangerous materials or agents. \n2. Confabulation: The production of conﬁdently stated but erroneous or false content (known \ncolloquially as “hallucinations” or “fabrications”) by which users may be misled or deceived.6 \n3. Dangerous, Violent, or Hateful Content: Eased production of and access to violent, inciting, \nradicalizing, or threatening content as well as recommendations to carry out self-harm or \nconduct illegal activities. Includes diﬃculty controlling public exposure to hateful and disparaging \nor stereotyping content. \n4. Data Privacy: Impacts due to leakage and unauthorized use, disclosure, or de-anonymization of"
- Sentence 2: "information during GAI training and maintenance. \nHuman-AI Conﬁguration; Obscene, \nDegrading, and/or Abusive \nContent; Value Chain and \nComponent Integration; \nDangerous, Violent, or Hateful \nContent \nMS-2.6-002 \nAssess existence or levels of harmful bias, intellectual property infringement, \ndata privacy violations, obscenity, extremism, vio