Bert Chunker 3
A text chunker based on BertForTokenClassification, applicable to both structured and unstructured text and optimized in particular for RAG scenarios.
Downloads: 1,226
Release date: 2/9/2025
Model Overview
bert-chunker-3 is a BERT-based text chunking model. It predicts the start token of each text chunk and uses a sliding window to split documents of any size into chunks. It is particularly suited to scenarios such as retrieval-augmented generation (RAG) and handles unstructured, messy text well.
Model Features
Unstructured text handling
Specifically optimized for chunking unstructured and messy text.
Sliding window mechanism
Uses a sliding window to process documents of arbitrary length.
Probability threshold tuning
The prob_threshold parameter gives flexible control over chunking granularity (see the decision-rule sketch after this list).
LLM-annotated data
Training data is annotated by a large language model, which improves the model's stability.
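As a sketch of how prob_threshold enters the decision (this mirrors the logits_threshold computation in the code below): the model emits two logits per token, and a token starts a new chunk when the sigmoid of their difference exceeds the threshold.

import math

def is_chunk_start(l0: float, l1: float, prob_threshold: float = 0.5) -> bool:
    # sigmoid(l1 - l0) > prob_threshold  <=>  l1 > l0 - log(1/prob_threshold - 1),
    # the same comparison chunk_text makes against logits_threshold below.
    return l1 > l0 - math.log(1 / prob_threshold - 1)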
Model Capabilities
Text chunking
Document splitting
Unstructured text handling
RAG scenario support
Use Cases
Retrieval-augmented generation (RAG)
Document preprocessing
Splits documents into chunks for RAG systems (a minimal pipeline sketch follows this list).
Improves retrieval efficiency and accuracy.
Text analysis
Technical documentation processing
Splits technical documents into logical passages.
Facilitates downstream analysis and processing.
Ad content analysis
Splits advertising text into meaningful blocks.
Supports content classification and feature extraction.
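A minimal, hypothetical sketch of such a RAG preprocessing step, assuming the model, tokenizer, and chunk_text function from the Quick Start below, plus an off-the-shelf embedding model (the sentence-transformers model name here is illustrative, not part of this project):

from sentence_transformers import SentenceTransformer

# Split the document with bert-chunker-3, then embed each chunk for retrieval.
chunks, _ = chunk_text(model, doc, tokenizer, prob_threshold=0.5)
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice
vectors = embedder.encode(chunks)  # one vector per chunk, ready for a vector index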
🚀 bert-chunker-3
bert-chunker-3 is a text chunker based on BertForTokenClassification. It predicts the start token of each text chunk (useful in scenarios such as retrieval-augmented generation (RAG)) and uses a sliding window to split documents of any size into chunks. It can be viewed as an alternative to the Kamradt semantic chunker, and it applies not only to structured text but also to unstructured, messy text.
Unlike bc-2 and bc, its training data is annotated by a large language model (LLM) to overcome data-distribution shift, and the training procedure has been improved, so it is more stable. Its performance is also competitive.
🚀 Quick Start
Run the following code:
import torch
from transformers import AutoTokenizer, BertForTokenClassification
import math
model_path = "tim1900/bert-chunker-3"
tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    padding_side="right",
    model_max_length=255,
    trust_remote_code=True,
)
device = "cpu"  # or "cuda"
model = BertForTokenClassification.from_pretrained(
    model_path,
).to(device)
def chunk_text(model, text, tokenizer, prob_threshold=0.5):
    # Sliding-window chunking: score every token, and split wherever the
    # "chunk start" logit clears the probability threshold.
    MAX_TOKENS = 255
    tokens = tokenizer(text, return_tensors="pt", truncation=False)
    input_ids = tokens["input_ids"]
    # Keep the [CLS]/[SEP] ids so each window can be re-wrapped with them.
    CLS = input_ids[:, 0].unsqueeze(0)
    SEP = input_ids[:, -1].unsqueeze(0)
    input_ids = input_ids[:, 1:-1]
    model.eval()
    split_str_poses = []  # character offsets where chunks start
    token_pos = []        # token indices where chunks start
    windows_start = 0
    windows_end = 0
    # sigmoid(l1 - l0) > prob_threshold  <=>  l1 > l0 - logits_threshold
    logits_threshold = math.log(1 / prob_threshold - 1)
    print(f"Processing {input_ids.shape[1]} tokens...")
    with torch.no_grad():
        while windows_end <= input_ids.shape[1]:
            windows_end = windows_start + MAX_TOKENS - 2
            ids = torch.cat((CLS, input_ids[:, windows_start:windows_end], SEP), 1)
            ids = ids.to(model.device)
            output = model(
                input_ids=ids,
                attention_mask=torch.ones(1, ids.shape[1], device=model.device),
            )
            # Drop the [CLS]/[SEP] positions before thresholding.
            logits = output["logits"][:, 1:-1, :]
            chunk_decision = logits[:, :, 1] > (logits[:, :, 0] - logits_threshold)
            greater_rows_indices = torch.where(chunk_decision)[1].tolist()
            # Any split found in this window (beyond its very first token)?
            if len(greater_rows_indices) > 0 and (
                not (greater_rows_indices[0] == 0 and len(greater_rows_indices) == 1)
            ):
                split_str_pos = [
                    tokens.token_to_chars(sp + windows_start + 1).start
                    for sp in greater_rows_indices
                    if sp > 0
                ]
                token_pos += [
                    sp + windows_start for sp in greater_rows_indices if sp > 0
                ]
                split_str_poses += split_str_pos
                # Restart the window at the last split found.
                windows_start = greater_rows_indices[-1] + windows_start
            else:
                windows_start = windows_end
    substrings = [
        text[i:j] for i, j in zip([0] + split_str_poses, split_str_poses + [len(text)])
    ]
    token_pos = [0] + token_pos
    return substrings, token_pos
# chunking code docs
print("\n>>>>>>>>> Chunking code docs...")
doc = r"""
Of course, as our first example shows, it is not always _necessary_ to declare an expression holder before it is created or used. But doing so provides an extra measure of clarity to models, so we strongly recommend it.
## Chapter 4 The Basics
## Chapter 5 The DCP Ruleset
### 5.1 A taxonomy of curvature
In disciplined convex programming, a scalar expression is classified by its _curvature_. There are four categories of curvature: _constant_, _affine_, _convex_, and _concave_. For a function \(f:\mathbf{R}^{n}\rightarrow\mathbf{R}\) defined on all \(\mathbf{R}^{n}\)the categories have the following meanings:
\[\begin{array}{llll}\text{constant}&f(\alpha x+(1-\alpha)y)=f(x)&\forall x,y\in \mathbf{R}^{n},\;\alpha\in\mathbf{R}\\ \text{affine}&f(\alpha x+(1-\alpha)y)=\alpha f(x)+(1-\alpha)f(y)&\forall x,y\in \mathbf{R}^{n},\;\alpha\in\mathbf{R}\\ \text{convex}&f(\alpha x+(1-\alpha)y)\leq\alpha f(x)+(1-\alpha)f(y)&\forall x,y \in\mathbf{R}^{n},\;\alpha\in[0,1]\\ \text{concave}&f(\alpha x+(1-\alpha)y)\geq\alpha f(x)+(1-\alpha)f(y)&\forall x,y \in\mathbf{R}^{n},\;\alpha\in[0,1]\end{array}\]
Of course, there is significant overlap in these categories. For example, constant expressions are also affine, and (real) affine expressions are both convex and concave.
Convex and concave expressions are real by definition. Complex constant and affine expressions can be constructed, but their usage is more limited; for example, they cannot appear as the left- or right-hand side of an inequality constraint.
### Top-level rules
CVX supports three different types of disciplined convex programs:
* A _minimization problem_, consisting of a convex objective function and zero or more constraints.
* A _maximization problem_, consisting of a concave objective function and zero or more constraints.
* A _feasibility problem_, consisting of one or more constraints and no objective.
### Constraints
Three types of constraints may be specified in disciplined convex programs:
* An _equality constraint_, constructed using \(==\), where both sides are affine.
* A _less-than inequality constraint_, using \(<=\), where the left side is convex and the right side is concave.
* A _greater-than inequality constraint_, using \(>=\), where the left side is concave and the right side is convex.
_Non_-equality constraints, constructed using \(\sim=\), are never allowed. (Such constraints are not convex.)
One or both sides of an equality constraint may be complex; inequality constraints, on the other hand, must be real. A complex equality constraint is equivalent to two real equality constraints, one for the real part and one for the imaginary part. An equality constraint with a real side and a complex side has the effect of constraining the imaginary part of the complex side to be zero."""
# Chunk the text. prob_threshold should be in (0, 1); the lower it is, the more chunks are produced.
# Adjust it to your needs: with a very small value such as 0.000001 nearly every token becomes its own chunk,
# while a value close to 1 keeps the whole text as a single chunk.
chunks, token_pos = chunk_text(model, doc, tokenizer, prob_threshold=0.5)
# print chunks
for i, (c, t) in enumerate(zip(chunks, token_pos)):
    print(f"-----chunk: {i}----token_idx: {t}--------")
    print(c)
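Since chunk_text slices the original string at character offsets, the chunks are contiguous and concatenate back to the exact input; a quick sanity check:

# No text is lost or duplicated at the chunk boundaries.
assert "".join(chunks) == doc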
# chunking ads
print("\n>>>>>>>>> Chunking ads...")
ad = r"""The causes and effects of dropouts in vocational and professional education are more pressing than ever. A decreasing attractiveness of vocational education, particularly in payment and quality, causes higher dropout rates while hitting ongoing demographic changes resulting in extensive skill shortages for many regions. Therefore, tackling the internationally high dropout rates is of utmost political and scientific interest. This thematic issue contributes to the conceptualization, analysis, and prevention of vocational and professional dropouts by bringing together current research that progresses to a deeper processual understanding and empirical modelling of dropouts. It aims to expand our understanding of how dropout and decision processes leading to dropout can be conceptualized and measured in vocational and professional contexts. Another aim is to gather empirical studies on both predictors and dropout consequences. Based on this knowledge, the thematic issue intends to provide evidence of effective interventions to avoid dropouts and identify promising ways for future dropout research in professional and vocational education to support evidence-based vocational education policy.
We thus welcome research contributions (original empirical and conceptual/measurement-related articles, literature reviews, meta-analyses) on dropouts (e.g., premature terminations, intentions to terminate, vertical and horizontal dropouts) that are situated in vocational and professional education at workplaces, schools, or other tertiary professional education institutions.
Part 1 of the thematic series outlines central theories and measurement concepts for vocational and professional dropouts. Part 2 outlines measurement approaches for dropout. Part 3 investigates relevant predictors of dropout. Part 4 analyzes the effects of dropout on an individual, organizational, and systemic level. Part 5 deals with programs and interventions for the prevention of dropouts.
We welcome papers that include but are not limited to:
Theoretical papers on the concept and processes of vocational and professional dropout or retention
Measurement approaches to assess dropout or retention
Quantitative and qualitative papers on the causes of dropout or retention
Quantitative and qualitative papers on the effects of dropout or retention on learners, providers/organizations and the (educational) system
Design-based research and experimental papers on dropout prevention programs or retention
Submission instructions
Before submitting your manuscript, please ensure you have carefully read the Instructions for Authors for Empirical Research in Vocational Education and Training. The complete manuscript should be submitted through the Empirical Research in Vocational Education and Training submission system. To ensure that you submit to the correct thematic series please select the appropriate section in the drop-down menu upon submission. In addition, indicate within your cover letter that you wish your manuscript to be considered as part of the thematic series on series title. All submissions will undergo rigorous peer review, and accepted articles will be published within the journal as a collection.
Lead Guest Editor:
Prof. Dr. Viola Deutscher, University of Mannheim
viola.deutscher@uni-mannheim.de
Guest Editors:
Prof. Dr. Stefanie Findeisen, University of Konstanz
stefanie.findeisen@uni-konstanz.de
Prof. Dr. Christian Michaelis, Georg-August-University of Göttingen
christian.michaelis@wiwi.uni-goettingen.de
Deadline for submission
This Call for Papers is open from now until 29 February 2023. Submitted papers will be reviewed in a timely manner and published directly after acceptance (i.e., without waiting for the accomplishment of all other contributions). Thanks to the Empirical Research in Vocational Education and Training (ERVET) open access policy, the articles published in this thematic issue will have a wide, global audience.
Option of submitting abstracts: Interested authors should submit a letter of intent including a working title for the manuscript, names, affiliations, and contact information for all authors, and an abstract of no more than 500 words to the lead guest editor Viola Deutscher (viola.deutscher@uni-mannheim.de) by July, 31st 2023. Due to technical issues, we also ask authors who already submitted an abstract before May, 30th to send their abstracts again to the address stated above. However, abstract submission is optional and is not mandatory for the full paper submission.
Different dropout directions in vocational education and training: the role of the initiating party and trainees’ reasons for dropping out
The high rates of premature contract termination (PCT) in vocational education and training (VET) programs have led to an increasing number of studies examining the reasons why adolescents drop out. Since adol...
Authors:Christian Michaelis and Stefanie Findeisen
Citation:Empirical Research in Vocational Education and Training 2024 16:15
Content type:Research
Published on: 6 August 2024"
"""
# Chunk the text. prob_threshold should be in (0, 1); the lower it is, the more chunks are produced.
# Adjust it to your needs: with a very small value such as 0.000001 nearly every token becomes its own chunk,
# while a value close to 1 keeps the whole text as a single chunk.
chunks, token_pos = chunk_text(model, ad, tokenizer, prob_threshold=0.5)
# print chunks
for i, (c, t) in enumerate(zip(chunks, token_pos)):
    print(f"-----chunk: {i}----token_idx: {t}--------")
    print(c)
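To see the effect described in the threshold comments above, a small illustrative sweep prints how the number of chunks grows as prob_threshold shrinks:

# Lower thresholds split more aggressively; exact counts depend on the input.
for p in (0.9, 0.5, 0.1):
    sweep_chunks, _ = chunk_text(model, doc, tokenizer, prob_threshold=p)
    print(f"prob_threshold={p}: {len(sweep_chunks)} chunks")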
💻 Usage Examples
Basic usage
Basic usage is identical to the Quick Start code above.
Advanced usage
The variant below, chunk_text_with_max_chunk_size, additionally enforces an upper bound (max_tokens_per_chunk) on the number of tokens per chunk: when no split point clears the probability threshold before the budget is used up, it forces a split at the highest-scoring candidate position seen so far.
import torch
from transformers import AutoTokenizer, BertForTokenClassification
import math
model_path = "tim1900/bert-chunker-3"
tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    padding_side="right",
    model_max_length=255,
    trust_remote_code=True,
)
device = "cpu"  # or "cuda"
model = BertForTokenClassification.from_pretrained(
    model_path,
).to(device)
def chunk_text_with_max_chunk_size(model, text, tokenizer, prob_threshold=0.5, max_tokens_per_chunk=400):
    # Sliding-window chunking with a hard cap on tokens per chunk.
    with torch.no_grad():
        MAX_TOKENS = 255
        tokens = tokenizer(text, return_tensors="pt", truncation=False)
        input_ids = tokens["input_ids"]
        CLS = input_ids[:, 0].unsqueeze(0)
        SEP = input_ids[:, -1].unsqueeze(0)
        input_ids = input_ids[:, 1:-1]
        model.eval()
        split_str_poses = []  # character offsets where chunks start
        token_pos = []        # token indices where chunks start
        windows_start = 0
        windows_end = 0
        logits_threshold = math.log(1 / prob_threshold - 1)
        unchunk_tokens = 0  # tokens accumulated since the last split
        backup_pos = None   # best fallback split position seen so far
        best_logits = torch.finfo(torch.float32).min
        STEP = round(((MAX_TOKENS - 2) // 2) * 1.75)
        print(f"Processing {input_ids.shape[1]} tokens...")
        while windows_start < input_ids.shape[1]:
            windows_end = windows_start + MAX_TOKENS - 2
            ids = torch.cat((CLS, input_ids[:, windows_start:windows_end], SEP), 1)
            ids = ids.to(model.device)
            output = model(
                input_ids=ids,
                attention_mask=torch.ones(1, ids.shape[1], device=model.device),
            )
            logits = output["logits"][:, 1:-1, :]
            logit_diff = logits[:, :, 1] - logits[:, :, 0]
            chunk_decision = logit_diff > -logits_threshold
            greater_rows_indices = torch.where(chunk_decision)[1].tolist()
            # Any split found in this window (beyond its very first token)?
            if len(greater_rows_indices) > 0 and (
                not (greater_rows_indices[0] == 0 and len(greater_rows_indices) == 1)
            ):
                # Tokens until the first split in this window (excluding index 0).
                unchunk_tokens_this_window = (
                    greater_rows_indices[0]
                    if greater_rows_indices[0] != 0
                    else greater_rows_indices[1]
                )
                # Forced split: the budget would run out before the next natural split.
                if unchunk_tokens + unchunk_tokens_this_window > max_tokens_per_chunk:
                    big_windows_end = max_tokens_per_chunk - unchunk_tokens
                    max_value, max_index = (
                        logit_diff[:, 1:big_windows_end].max(),
                        logit_diff[:, 1:big_windows_end].argmax() + 1,
                    )
                    if best_logits < max_value:
                        backup_pos = windows_start + max_index
                    windows_start = backup_pos
                    split_str_pos = [tokens.token_to_chars(backup_pos + 1).start]
                    split_str_poses = split_str_poses + split_str_pos
                    token_pos = token_pos + [backup_pos]
                    best_logits = torch.finfo(torch.float32).min
                    backup_pos = -1
                    unchunk_tokens = 0
                # Natural splits: keep them, but drop any that would leave a gap
                # larger than the budget between consecutive split points.
                else:
                    if len(greater_rows_indices) >= 2:
                        for gi, (gri0, gri1) in enumerate(
                            zip(greater_rows_indices[:-1], greater_rows_indices[1:])
                        ):
                            if gri1 - gri0 > max_tokens_per_chunk:
                                greater_rows_indices = greater_rows_indices[: gi + 1]
                                break
                    split_str_pos = [
                        tokens.token_to_chars(sp + windows_start + 1).start
                        for sp in greater_rows_indices
                        if sp > 0
                    ]
                    split_str_poses = split_str_poses + split_str_pos
                    token_pos = token_pos + [
                        sp + windows_start for sp in greater_rows_indices if sp > 0
                    ]
                    windows_start = greater_rows_indices[-1] + windows_start
                    best_logits = torch.finfo(torch.float32).min
                    backup_pos = -1
                    unchunk_tokens = 0
            else:
                unchunk_tokens_this_window = (
                    min(windows_start + STEP, input_ids.shape[1]) - windows_start
                )
                # Forced split with no natural split in sight.
                if unchunk_tokens + unchunk_tokens_this_window > max_tokens_per_chunk:
                    big_windows_end = max_tokens_per_chunk - unchunk_tokens
                    if logit_diff.shape[1] > 1:
                        max_value, max_index = (
                            logit_diff[:, 1:big_windows_end].max(),
                            logit_diff[:, 1:big_windows_end].argmax() + 1,
                        )
                        if best_logits < max_value:
                            backup_pos = windows_start + max_index
                    windows_start = backup_pos
                    split_str_pos = [tokens.token_to_chars(backup_pos + 1).start]
                    split_str_poses = split_str_poses + split_str_pos
                    token_pos = token_pos + [backup_pos]
                    best_logits = torch.finfo(torch.float32).min
                    backup_pos = -1
                    unchunk_tokens = 0
                # No split yet: remember the best candidate and slide forward.
                else:
                    if logit_diff.shape[1] > 1:
                        max_value, max_index = (
                            logit_diff[:, 1:].max(),
                            logit_diff[:, 1:].argmax() + 1,
                        )
                        if best_logits < max_value:
                            best_logits = max_value
                            backup_pos = windows_start + max_index
                    unchunk_tokens = unchunk_tokens + STEP
                    windows_start = windows_start + STEP
    substrings = [
        text[i:j] for i, j in zip([0] + split_str_poses, split_str_poses + [len(text)])
    ]
    token_pos = [0] + token_pos
    return substrings, token_pos
# chunking ads -- reuse the same call-for-papers text (`ad`) as in the Quick Start example above
print("\n>>>>>>>>> Chunking ads...")
chunks, token_pos = chunk_text_with_max_chunk_size(model, ad, tokenizer, prob_threshold=0.5, max_tokens_per_chunk=400)
for i, (c, t) in enumerate(zip(chunks, token_pos)):
    print(f"-----chunk: {i}----token_idx: {t}--------")
    print(c)
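To confirm the cap is honored, one can re-tokenize each chunk and inspect its length; with max_tokens_per_chunk=400 the counts should stay at or below roughly 400 (an illustrative check, not part of the original example):

# Token count per chunk, excluding special tokens.
for i, c in enumerate(chunks):
    n = len(tokenizer(c, add_special_tokens=False)["input_ids"])
    print(f"chunk {i}: {n} tokens")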
📄 License
This project is released under the MIT License.