DictaBERT-joint: Open-Source Joint Model Supporting Five Hebrew Language Analysis Tasks, Including Prefix Segmentation

DictaBERT-joint

Developed by dicta-il

State-of-the-art multi-task joint parsing BERT model for Modern Hebrew, supporting five tasks: prefix segmentation, morphological disambiguation, lexical analysis, syntactic parsing, and named entity recognition

Sequence Labeling

Transformers

Tags: Hebrew Joint Parsing, Multi-task NLP, Morphosyntactic Analysis

Downloads 3,678

Release date: 1/10/2024

Model Overview

This is a joint parsing model for Modern Hebrew. A single model simultaneously performs prefix segmentation, morphological disambiguation, lemmatization, dependency parsing, and named entity recognition.

Model Features

Multi-task Joint Parsing

A single model simultaneously handles morphological, lexical, and syntactic analysis, plus named entity recognition, for Hebrew

Syntax Tree Visualization Support

Output results can be directly used to generate syntax tree visualizations

Flexible Task Combination

Selectively enable/disable specific task heads to use model functionalities as needed

Multiple Output Formats

Supports three output formats: JSON, UD format, and IAHLT-style UD format

Model Capabilities

Hebrew prefix segmentation

Hebrew morphological disambiguation

Hebrew lexical analysis (lemmatization)

Hebrew syntactic parsing (dependency tree)

Hebrew named entity recognition

Use Cases

Academic Research

Hebrew Linguistic Analysis

Used to study the morphological and syntactic features of Hebrew

Provides comprehensive linguistic analysis results

Educational Applications

Hebrew Learning Assistance

Helps learners understand Hebrew syntactic structures and morphological changes

Visualized grammatical analysis results

🚀 DictaBERT: A State-of-the-Art BERT Suite for Modern Hebrew

DictaBERT is a state-of-the-art language model for Hebrew, which can jointly handle multiple tasks such as prefix segmentation, morphological disambiguation, etc.

🚀 Quick Start

DictaBERT-joint is a fine-tuned model designed for the joint parsing of the following tasks:

Prefix Segmentation
Morphological Disambiguation
Lexicographical Analysis (Lemmatization)
Syntactical Parsing (Dependency-Tree)
Named-Entity Recognition

You can find a live demo of the model with instant visualization of the syntax tree here. For a faster model, you can use the equivalent bert-tiny model for this task here. For the bert-base models for other tasks, see here.

💻 Usage Examples

Basic Usage

The model currently supports 3 types of output:

JSON: The model returns a JSON object for each sentence in the input, where for each sentence we have the sentence text, the NER entities, and the list of tokens. For each token we include the output from each of the tasks.
```
model.predict(..., output_style='json')
```
UD: The model returns the full UD output for each sentence, according to the style of the Hebrew UD Treebank.
```
model.predict(..., output_style='ud')
```
UD, in the style of IAHLT: The model returns the full UD output, with slight modifications to match the style of IAHLT. The differences are mostly in the granularity of some dependency relations, how the suffix of a word is broken up, and implicit definite articles. The actual tagging behavior doesn't change.
```
model.predict(..., output_style='iahlt_ud')
```

If you only need the output of some of the tasks, you can tell the model not to initialize the other heads, for example:

```
model = AutoModel.from_pretrained('dicta-il/dictabert-joint', trust_remote_code=True, do_lex=False)
```

The available options are: do_lex, do_syntax, do_ner, do_prefix, do_morph.
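As a minimal sketch of how these flags could be assembled, the hypothetical helper below (`head_flags` and `load_model` are illustrative names, not part of the model's API) turns a set of wanted task names into the `do_*` keyword arguments from the option list above. It assumes every head is enabled by default and that any combination of heads may be switched off; the model card does not say whether some heads depend on others, so that is worth checking before relying on it.

```python
# Flag names come from the option list above; the helpers are illustrative only.
ALL_FLAGS = ("do_lex", "do_syntax", "do_ner", "do_prefix", "do_morph")


def head_flags(wanted):
    """Return {flag: False} for every head that is NOT in `wanted`.

    `wanted` holds short task names such as "ner" or "lex".
    """
    wanted = set(wanted)
    known = {flag[len("do_"):] for flag in ALL_FLAGS}
    unknown = wanted - known
    if unknown:
        raise ValueError(f"unknown task(s): {sorted(unknown)}")
    return {flag: False for flag in ALL_FLAGS if flag[len("do_"):] not in wanted}


def load_model(wanted):
    """Load dicta-il/dictabert-joint, skipping heads that are not wanted."""
    # Imported lazily so head_flags can be used without transformers installed.
    from transformers import AutoModel

    return AutoModel.from_pretrained(
        "dicta-il/dictabert-joint",
        trust_remote_code=True,
        **head_flags(wanted),
    )
```

For example, `head_flags({"ner", "prefix"})` yields `do_lex=False`, `do_syntax=False`, and `do_morph=False`, mirroring the `do_lex=False` call shown above.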

Advanced Usage

Here is a sample usage:

```
from transformers import AutoModel, AutoTokenizer

# Load the tokenizer and the joint model (the model ships custom code, hence trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('dicta-il/dictabert-joint')
model = AutoModel.from_pretrained('dicta-il/dictabert-joint', trust_remote_code=True)

model.eval()  # switch to inference mode

sentence = 'בשנת 1948 השלים אפרים קישון את לימודיו בפיסול מתכת ובתולדות האמנות והחל לפרסם מאמרים הומוריסטיים'
print(model.predict([sentence], tokenizer, output_style='json')) # see below for other return formats
```

Output:

[
  {
    "text": "בשנת 1948 השלים אפרים קישון את לימודיו בפיסול מתכת ובתולדות האמנות והחל לפרסם מאמרים הומוריסטיים",
    "tokens": [
      {
        "token": "בשנת",
        "syntax": {
          "word": "בשנת",
          "dep_head_idx": 2,
          "dep_func": "obl",
          "dep_head": "השלים"
        },
        "seg": [
          "ב",
          "שנת"
        ],
        "lex": "שנה",
        "morph": {
          "token": "בשנת",
          "pos": "NOUN",
          "feats": {
            "Gender": "Fem",
            "Number": "Sing"
          },
          "prefixes": [
            "ADP"
          ],
          "suffix": false
        }
      },
      {
        "token": "1948",
        "syntax": {
          "word": "1948",
          "dep_head_idx": 0,
          "dep_func": "compound",
          "dep_head": "בשנת"
        },
        "seg": [
          "1948"
        ],
        "lex": "1948",
        "morph": {
          "token": "1948",
          "pos": "NUM",
          "feats": {},
          "prefixes": [],
          "suffix": false
        }
      },
      {
        "token": "השלים",
        "syntax": {
          "word": "השלים",
          "dep_head_idx": -1,
          "dep_func": "root",
          "dep_head": "הומוריסטיים"
        },
        "seg": [
          "השלים"
        ],
        "lex": "השלים",
        "morph": {
          "token": "השלים",
          "pos": "VERB",
          "feats": {
            "Gender": "Masc",
            "Number": "Sing",
            "Person": "3",
            "Tense": "Past"
          },
          "prefixes": [],
          "suffix": false
        }
      },
      {
        "token": "אפרים",
        "syntax": {
          "word": "אפרים",
          "dep_head_idx": 2,
          "dep_func": "nsubj",
          "dep_head": "השלים"
        },
        "seg": [
          "אפרים"
        ],
        "lex": "אפרים",
        "morph": {
          "token": "אפרים",
          "pos": "PROPN",
          "feats": {},
          "prefixes": [],
          "suffix": false
        }
      },
      {
        "token": "קישון",
        "syntax": {
          "word": "קישון",
          "dep_head_idx": 3,
          "dep_func": "flat",
          "dep_head": "אפרים"
        },
        "seg": [
          "קישון"
        ],
        "lex": "קישון",
        "morph": {
          "token": "קישון",
          "pos": "PROPN",
          "feats": {},
          "prefixes": [],
          "suffix": false
        }
      },
      {
        "token": "את",
        "syntax": {
          "word": "את",
          "dep_head_idx": 6,
          "dep_func": "case",
          "dep_head": "לימודיו"
        },
        "seg": [
          "את"
        ],
        "lex": "את",
        "morph": {
          "token": "את",
          "pos": "ADP",
          "feats": {},
          "prefixes": [],
          "suffix": false
        }
      },
      {
        "token": "לימודיו",
        "syntax": {
          "word": "לימודיו",
          "dep_head_idx": 2,
          "dep_func": "obj",
          "dep_head": "השלים"
        },
        "seg": [
          "לימודיו"
        ],
        "lex": "לימוד",
        "morph": {
          "token": "לימודיו",
          "pos": "NOUN",
          "feats": {
            "Gender": "Masc",
            "Number": "Plur"
          },
          "prefixes": [],
          "suffix": "PRON",
          "suffix_feats": {
            "Gender": "Masc",
            "Number": "Sing",
            "Person": "3"
          }
        }
      },
      {
        "token": "בפיסול",
        "syntax": {
          "word": "בפיסול",
          "dep_head_idx": 6,
          "dep_func": "nmod",
          "dep_head": "לימודיו"
        },
        "seg": [
          "ב",
          "פיסול"
        ],
        "lex": "פיסול",
        "morph": {
          "token": "בפיסול",
          "pos": "NOUN",
          "feats": {
            "Gender": "Masc",
            "Number": "Sing"
          },
          "prefixes": [
            "ADP"
          ],
          "suffix": false
        }
      },
      {
        "token": "מתכת",
        "syntax": {
          "word": "מתכת",
          "dep_head_idx": 7,
          "dep_func": "compound",
          "dep_head": "בפיסול"
        },
        "seg": [
          "מתכת"
        ],
        "lex": "מתכת",
        "morph": {
          "token": "מתכת",
          "pos": "NOUN",
          "feats": {
            "Gender": "Fem",
            "Number": "Sing"
          },
          "prefixes": [],
          "suffix": false
        }
      },
      {
        "token": "ובתולדות",
        "syntax": {
          "word": "ובתולדות",
          "dep_head_idx": 7,
          "dep_func": "conj",
          "dep_head": "בפיסול"
        },
        "seg": [
          "וב",
          "תולדות"
        ],
        "lex": "תולדה",
        "morph": {
          "token": "ובתולדות",
          "pos": "NOUN",
          "feats": {
            "Gender": "Fem",
            "Number": "Plur"
          },
          "prefixes": [
            "CCONJ",
            "ADP"
          ],
          "suffix": false
        }
      },
      {
        "token": "האמנות",
        "syntax": {
          "word": "האמנות",
          "dep_head_idx": 9,
          "dep_func": "compound",
          "dep_head": "ובתולדות"
        },
        "seg": [
          "ה",
          "אמנות"
        ],
        "lex": "אומנות",
        "morph": {
          "token": "האמנות",
          "pos": "NOUN",
          "feats": {
            "Gender": "Fem",
            "Number": "Sing"
          },
          "prefixes": [
            "DET"
          ],
          "suffix": false
        }
      },
      {
        "token": "והחל",
        "syntax": {
          "word": "והחל",
          "dep_head_idx": 2,
          "dep_func": "conj",
          "dep_head": "השלים"
        },
        "seg": [
          "ו",
          "החל"
        ],
        "lex": "החל",
        "morph": {
          "token": "והחל",
          "pos": "VERB",
          "feats": {
            "Gender": "Masc",
            "Number": "Sing",
            "Person": "3",
            "Tense": "Past"
          },
          "prefixes": [
            "CCONJ"
          ],
          "suffix": false
        }
      },
      {
        "token": "לפרסם",
        "syntax": {
          "word": "לפרסם",
          "dep_head_idx": 11,
          "dep_func": "xcomp",
          "dep_head": "והחל"
        },
        "seg": [
          "לפרסם"
        ],
        "lex": "פרסם",
        "morph": {
          "token": "לפרסם",
          "pos": "VERB",
          "feats": {},
          "prefixes": [],
          "suffix": false
        }
      },
      {
        "token": "מאמרים",
        "syntax": {
          "word": "מאמרים",
          "dep_head_idx": 12,
          "dep_func": "obj",
          "dep_head": "לפרסם"
        },
        "seg": [
          "מאמרים"
        ],
        "lex": "מאמר",
        "morph": {
          "token": "מאמרים",
          "pos": "NOUN",
          "feats": {
            "Gender": "Masc",
            "Number": "Plur"
          },
          "prefixes": [],
          "suffix": false
        }
      },
      {
        "token": "הומוריסטיים",
        "syntax": {
          "word": "הומוריסטיים",
          "dep_head_idx": 13,
          "dep_func": "amod",
          "dep_head": "מאמרים"
        },
        "seg": [
          "הומוריסטיים"
        ],
        "lex": "הומוריסטי",
        "morph": {
          "token": "הומוריסטיים",
          "pos": "ADJ",
          "feats": {
            "Gender": "Masc",
            "Number": "Plur"
          },
          "prefixes": [],
          "suffix": false
        }
      }
    ],
    "root_idx": 2,
    "ner_entities": [
      {
        "phrase": "1948",
        "label": "TIMEX"
      },
      {
        "phrase": "אפרים קישון",
        "label": "PER"
      }
    ]
  }
]
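The JSON structure above can be walked with plain dictionary access. The sketch below is illustrative rather than part of the model's API: the function names are hypothetical, and the `sample` dictionary is a trimmed copy of the first three tokens from the output above. Judging from that sample alone, `dep_head_idx` appears to be a 0-based position in `tokens`, with -1 marking the root; the root's `dep_head` string names the last token, which looks like a side effect of indexing with -1, so the sketch ignores it. If a head was disabled at load time, the corresponding keys may be absent, which this sketch does not handle.

```python
def token_rows(sentence):
    """Yield (token, lemma, POS) for each token of one JSON sentence."""
    for tok in sentence["tokens"]:
        yield tok["token"], tok["lex"], tok["morph"]["pos"]


def dependency_edges(sentence):
    """Return (dependent, relation, head) triples; the root's head is None."""
    tokens = sentence["tokens"]
    edges = []
    for tok in tokens:
        syn = tok["syntax"]
        head_idx = syn["dep_head_idx"]
        # Assumption from the sample: a negative index marks the root.
        head = None if head_idx < 0 else tokens[head_idx]["token"]
        edges.append((syn["word"], syn["dep_func"], head))
    return edges


def entities(sentence):
    """Return (phrase, label) pairs from the NER output."""
    return [(ent["phrase"], ent["label"]) for ent in sentence["ner_entities"]]


# Trimmed copy of the first three tokens of the output above.
sample = {
    "tokens": [
        {"token": "בשנת", "lex": "שנה", "morph": {"pos": "NOUN"},
         "syntax": {"word": "בשנת", "dep_head_idx": 2, "dep_func": "obl"}},
        {"token": "1948", "lex": "1948", "morph": {"pos": "NUM"},
         "syntax": {"word": "1948", "dep_head_idx": 0, "dep_func": "compound"}},
        {"token": "השלים", "lex": "השלים", "morph": {"pos": "VERB"},
         "syntax": {"word": "השלים", "dep_head_idx": -1, "dep_func": "root"}},
    ],
    "root_idx": 2,
    "ner_entities": [{"phrase": "1948", "label": "TIMEX"}],
}

for row in token_rows(sample):
    print(row)
for edge in dependency_edges(sample):
    print(edge)
print(entities(sample))
```

Edge triples of this shape are a convenient input for tree-drawing or graph libraries, since each one names a dependent, its relation, and its head.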

You can also choose to get your response in UD format:

```
sentence = 'בשנת 1948 השלים אפרים קישון את לימודיו בפיסול מתכת ובתולדות האמנות והחל לפרסם מאמרים הומוריסטיים'
print(model.predict([sentence], tokenizer, output_style='ud'))
```

Results:

[
  [
    "# sent_id = 1",
    "# text = בשנת 1948 השלים אפרים קישון את לימודיו בפיסול מתכת ובתולדות האמנות והחל לפרסם מאמרים הומוריסטיים",
    "1-2\tבשנת\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tב\tב\tADP\tADP\t_\t2\tcase\t_\t_",
    "2\tשנת\tשנה\tNOUN\tNOUN\tGender=Fem|Number=Sing\t4\tobl\t_\t_",
    "3\t1948\t1948\tNUM\tNUM\t\t2\tcompound:smixut\t_\t_",
    "4\tהשלים\tהשלים\tVERB\tVERB\tGender=Masc|Number=Sing|Person=3|Tense=Past\t0\troot\t_\t_",
    "5\tאפרים\tאפרים\tPROPN\tPROPN\t\t4\tnsubj\t_\t_",
    "6\tקישון\tקישון\tPROPN\tPROPN\t\t5\tflat\t_\t_",
    "7\tאת\tאת\tADP\tADP\t\t8\tcase:acc\t_\t_",
    "8-10\tלימודיו\t_\t_\t_\t_\t_\t_\t_\t_",
    "8\tלימוד_\tלימוד\tNOUN\tNOUN\tGender=Masc|Number=Plur\t4\tobj\t_\t_",
    "9\t_של_\tשל\tADP\tADP\t_\t10\tcase\t_\t_",
    "10\t_הוא\tהוא\tPRON\tPRON\tGender=Masc|Number=Sing|Person=3\t8\tnmod:poss\t_\t_",
    "11-12\tבפיסול\t_\t_\t_\t_\t_\t_\t_\t_",
    "11\tב\tב\tADP\tADP\t_\t12\tcase\t_\t_",
    "12\tפיסול\tפיסול\tNOUN\tNOUN\tGender=Masc|Number=Sing\t8\tnmod\t_\t_",
    "13\tמתכת\tמתכת\tNOUN\tNOUN\tGender=Fem|Number=Sing\t12\tcompound:smixut\t_\t_",
    "14-16\tובתולדות\t_\t_\t_\t_\t_\t_\t_\t_",
    "14\tו\tו\tCCONJ\tCCONJ\t_\t16\tcc\t_\t_",
    "15\tב\tב\tADP\tADP\t_\t16\tcase\t_\t_",
    "16\tתולדות\tתולדה\tNOUN\tNOUN\tGender=Fem|Number=Plur\t12\tconj\t_\t_",
    "17-18\tהאמנות\t_\t_\t_\t_\t_\t_\t_\t_",
    "17\tה\tה\tDET\tDET\t_\t18\tdet\t_\t_",
    "18\tאמנות\tאומנות\tNOUN\tNOUN\tGender=Fem|Number=Sing\t16\tcompound:smixut\t_\t_",
    "19-20\tוהחל\t_\t_\t_\t_\t_\t_\t_\t_",
    "19\tו\tו\tCCONJ\tCCONJ\t_\t20\tcc\t_\t_",
    "20\tהחל\tהחל\tVERB\tVERB\tGender=Masc|Number=Sing|Person=3|Tense=Past\t4\tconj\t_\t_",
    "21\tלפרסם\tפרסם\tVERB\tVERB\t\t20\txcomp\t_\t_",
    "22\tמאמרים\tמאמר\tNOUN\tNOUN\tGender=Masc|Number=Plur\t21\tobj\t_\t_",
    "23\tהומוריסטיים\tהומוריסטי\tADJ\tADJ\tGender=Masc|Number=Plur\t22\tamod\t_\t_"
  ]
]
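Each sentence in the UD output is a list of tab-separated lines, so it can be split into columns without extra libraries. The sketch below is a hypothetical helper (`parse_ud` is not part of the model's API) that assumes the ten-column layout seen in the sample. Two details are taken from the sample rather than from any specification of the model: some rows leave the FEATS column empty instead of `_`, which the sketch normalizes, and multiword tokens appear as ID ranges such as `1-2`. Note that heads here are 1-based with 0 for the root, unlike the `dep_head_idx` values in the JSON output.

```python
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def parse_ud(lines):
    """Split one sentence's UD lines into comments, multiword ranges, and word rows."""
    comments, ranges, words = [], [], []
    for line in lines:
        if line.startswith("#"):
            comments.append(line[1:].strip())
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}: {line!r}")
        # Empty columns (seen in FEATS in the sample) are normalized to "_".
        row = {name: (value or "_") for name, value in zip(FIELDS, cols)}
        if "-" in row["id"]:
            ranges.append(row)  # multiword token range such as "1-2"
        else:
            row["id"] = int(row["id"])
            row["head"] = int(row["head"])
            words.append(row)
    return comments, ranges, words


# First lines of the output above, with real tab characters.
sample_lines = [
    "# sent_id = 1",
    "1-2\tבשנת\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tב\tב\tADP\tADP\t_\t2\tcase\t_\t_",
    "2\tשנת\tשנה\tNOUN\tNOUN\tGender=Fem|Number=Sing\t4\tobl\t_\t_",
    "3\t1948\t1948\tNUM\tNUM\t\t2\tcompound:smixut\t_\t_",
]

comments, ranges, words = parse_ud(sample_lines)
for word in words:
    print(word["id"], word["upos"], word["head"], word["deprel"])
```

To write a standard CoNLL-U file, joining each sentence's lines with newlines and separating sentences with a blank line should be enough, although the empty FEATS columns may need the same normalization first.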

📄 License

The model is licensed under cc-by-4.0.

📚 Documentation

Citation

If you use DictaBERT-joint in your research, please cite "MRL Parsing Without Tears: The Case of Hebrew".

BibTeX:

```
@misc{shmidman2024mrl,
      title={MRL Parsing Without Tears: The Case of Hebrew},
      author={Shaltiel Shmidman and Avi Shmidman and Moshe Koppel and Reut Tsarfaty},
      year={2024},
      eprint={2403.06970},
      archivePrefix={arXiv},
}
```
