en_core_web_sm: Open-Source English Language Processing Model for Tokenization, Part-of-Speech Tagging, and More

En Core Web Sm

Developed by spaCy (Explosion)

A small, CPU-optimized English pipeline from spaCy that covers tokenization, part-of-speech tagging, dependency parsing, and named entity recognition.

Sequence Labeling | English
Open Source License: MIT
#English Text Processing #Named Entity Recognition #Dependency Parsing

Downloads 2,707

Release Time: 3/2/2022

Model Overview

This is an English natural language processing model primarily used for basic NLP tasks such as text tokenization, part-of-speech tagging, dependency parsing, and named entity recognition. The model is optimized for CPU usage, making it suitable for lightweight application scenarios.

Model Features

CPU Optimization

Specially optimized for CPU usage scenarios, suitable for resource-limited environments

Multi-task Processing

Single pipeline simultaneously handles tokenization, part-of-speech tagging, dependency parsing, and named entity recognition

Lightweight

Small model size without word vectors, suitable for quick deployment

High Accuracy

Reports 97.25% tagging accuracy, 91.75% dependency UAS, and an 84.56% NER F-score; OntoNotes 5 is among the listed data sources

Model Capabilities

Text Tokenization

Part-of-speech Tagging

Dependency Parsing

Named Entity Recognition

Sentence Segmentation

Lemmatization

Use Cases

Text Analysis

Information Extraction

Extract entity information such as person names, locations, and organizations from text

NER F1 score 84.56%

Syntax Analysis

Analyze sentence grammatical structure and word dependency relationships

Dependency Parsing UAS 91.75%

Content Processing

Text Preprocessing

Prepare text data for machine learning models, including tokenization and part-of-speech tagging

Tokenization accuracy 99.86%
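The UAS figure quoted under Syntax Analysis is an unlabeled attachment score: the share of tokens whose predicted head is correct. Its labeled counterpart, LAS, additionally requires the dependency label to match. The sketch below illustrates both on hypothetical toy annotations (the sentence, the `attachment_scores` helper, and the predictions are invented for illustration). It counts every token, whereas evaluation tools may treat punctuation differently.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for two equal-length lists of (head_index, label) pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        return 0.0, 0.0
    # UAS: only the head has to match.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS: head and dependency label both have to match.
    full_hits = sum(g == p for g, p in zip(gold, predicted))
    return head_hits / len(gold), full_hits / len(gold)


# Hypothetical annotations for the four tokens of "She reads books ."
# Heads are 1-based token positions; 0 marks the root.
gold = [(2, "nsubj"), (0, "ROOT"), (2, "dobj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "ROOT"), (2, "attr"), (3, "punct")]

uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")  # UAS=0.75 LAS=0.50
```

In the toy case three of four heads are right (UAS 0.75), but only two tokens also carry the right label (LAS 0.50), which is why LAS is never higher than UAS.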

🚀 en_core_web_sm

An English pipeline optimized for CPU, equipped with components like tok2vec, tagger, parser, senter, ner, attribute_ruler, and lemmatizer.

🚀 Quick Start

`en_core_web_sm` is an English spaCy pipeline optimized for CPU.
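As a minimal sketch of typical use, assuming spaCy and the `en_core_web_sm` package are both installed, the pipeline can be loaded by name and applied to a string. The helper name `annotate` is illustrative and not part of the model card; the attributes it reads (`text`, `lemma_`, `tag_`, `dep_`, `head`, `ents`, `sents`) are standard spaCy `Token` and `Doc` attributes.

```python
def annotate(text, model_name="en_core_web_sm"):
    """Run the pipeline on `text` and collect a few of its annotations."""
    # Imported lazily so that defining this helper does not require spaCy.
    import spacy

    nlp = spacy.load(model_name)  # raises OSError if the package is not installed
    doc = nlp(text)

    tokens = [
        # (surface form, lemma, fine-grained tag, dependency label, head word)
        (tok.text, tok.lemma_, tok.tag_, tok.dep_, tok.head.text)
        for tok in doc
    ]
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    sentences = [sent.text for sent in doc.sents]
    return tokens, entities, sentences
```

Calling `annotate("Apple is looking at buying a U.K. startup for $1 billion.")` returns one tuple per token, the recognized entity spans with their labels, and the sentence strings. The exact labels depend on the installed model version.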

✨ Features

Optimized for CPU processing.
Contains multiple components for different NLP tasks.

📦 Installation

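With spaCy installed, the pipeline package can be downloaded with `python -m spacy download en_core_web_sm`. The installed spaCy version should fall within the compatibility range listed under Model Information (`>=3.7.2,<3.8.0` for version `3.7.1`).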

📚 Documentation

Model Information

Property	Details
Model Name	`en_core_web_sm`
Version	`3.7.1`
spaCy Compatibility	`>=3.7.2,<3.8.0`
Default Pipeline	`tok2vec`, `tagger`, `parser`, `attribute_ruler`, `lemmatizer`, `ner`
Components	`tok2vec`, `tagger`, `parser`, `senter`, `attribute_ruler`, `lemmatizer`, `ner`
Vectors	0 keys, 0 unique vectors (0 dimensions)
Sources	OntoNotes 5 (Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, Ann Houston); ClearNLP Constituent-to-Dependency Conversion (Emory University); WordNet 3.0 (Princeton University)
License	`MIT`
Author	Explosion
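Two details in the table above are easy to miss: `senter` appears under Components but not under Default Pipeline, and the package declares a bounded spaCy version range. The sketch below reads both off the table using only the standard library. The helper names are illustrative, and the version check is a simplification that handles plain `X.Y.Z` strings only.

```python
# Values copied from the Model Information table.
DEFAULT_PIPELINE = ["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner"]
COMPONENTS = ["tok2vec", "tagger", "parser", "senter", "attribute_ruler", "lemmatizer", "ner"]
MIN_SPACY = (3, 7, 2)  # inclusive lower bound, from ">=3.7.2"
MAX_SPACY = (3, 8, 0)  # exclusive upper bound, from "<3.8.0"


def disabled_by_default(components, default_pipeline):
    """Components that ship with the package but are not in the default pipeline."""
    return [name for name in components if name not in default_pipeline]


def is_compatible(spacy_version):
    """Check a plain 'X.Y.Z' version string against the range in the table.

    Simplification: pre-release or dev suffixes are not handled.
    """
    parts = tuple(int(p) for p in spacy_version.split("."))
    return MIN_SPACY <= parts < MAX_SPACY


print(disabled_by_default(COMPONENTS, DEFAULT_PIPELINE))  # ['senter']
print(is_compatible("3.7.4"))  # True
print(is_compatible("3.8.0"))  # False
```

In spaCy v3, a component that is shipped but not active can be switched on after loading with `nlp.enable_pipe("senter")`, for example when only sentence boundaries are needed and the parser is excluded.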

Label Scheme

The label scheme comprises 113 labels across 3 components.

Component	Labels
`tagger`	`$`, `''`, `,`, `-LRB-`, `-RRB-`, `.`, `:`, `ADD`, `AFX`, `CC`, `CD`, `DT`, `EX`, `FW`, `HYPH`, `IN`, `JJ`, `JJR`, `JJS`, `LS`, `MD`, `NFP`, `NN`, `NNP`, `NNPS`, `NNS`, `PDT`, `POS`, `PRP`, `PRP$`, `RB`, `RBR`, `RBS`, `RP`, `SYM`, `TO`, `UH`, `VB`, `VBD`, `VBG`, `VBN`, `VBP`, `VBZ`, `WDT`, `WP`, `WP$`, `WRB`, `XX`, `_SP`, ``` `` ```
`parser`	`ROOT`, `acl`, `acomp`, `advcl`, `advmod`, `agent`, `amod`, `appos`, `attr`, `aux`, `auxpass`, `case`, `cc`, `ccomp`, `compound`, `conj`, `csubj`, `csubjpass`, `dative`, `dep`, `det`, `dobj`, `expl`, `intj`, `mark`, `meta`, `neg`, `nmod`, `npadvmod`, `nsubj`, `nsubjpass`, `nummod`, `oprd`, `parataxis`, `pcomp`, `pobj`, `poss`, `preconj`, `predet`, `prep`, `prt`, `punct`, `quantmod`, `relcl`, `xcomp`
`ner`	`CARDINAL`, `DATE`, `EVENT`, `FAC`, `GPE`, `LANGUAGE`, `LAW`, `LOC`, `MONEY`, `NORP`, `ORDINAL`, `ORG`, `PERCENT`, `PERSON`, `PRODUCT`, `QUANTITY`, `TIME`, `WORK_OF_ART`
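Many of these labels are terse (`oprd`, `NORP`, `FAC`). spaCy ships a glossary that can be queried with `spacy.explain`; the sketch below wraps it in an illustrative helper, `explain_labels`, which is not part of the model card. It assumes spaCy is installed and does not need the model package itself.

```python
def explain_labels(labels):
    """Map each label to spaCy's glossary description (None if it has no entry)."""
    # Imported lazily so that defining this helper does not require spaCy.
    import spacy

    return {label: spacy.explain(label) for label in labels}
```

For example, `explain_labels(["GPE", "NORP", "nsubjpass"])` returns a dictionary with a short human-readable description for each of the three labels.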

Accuracy

Type	Score
`TOKEN_ACC`	99.86
`TOKEN_P`	99.57
`TOKEN_R`	99.58
`TOKEN_F`	99.57
`TAG_ACC`	97.25
`SENTS_P`	92.02
`SENTS_R`	89.21
`SENTS_F`	90.59
`DEP_UAS`	91.75
`DEP_LAS`	89.87
`ENTS_P`	84.55
`ENTS_R`	84.57
`ENTS_F`	84.56
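Each `_F` row in the table is the harmonic mean of the matching `_P` (precision) and `_R` (recall) rows. As a quick consistency check, the sketch below recomputes it from the rounded percentages shown above; the helper name `f_score` is illustrative.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the balanced F-score, F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F) copied from the Accuracy table, in percent.
rows = {
    "TOKEN": (99.57, 99.58, 99.57),
    "SENTS": (92.02, 89.21, 90.59),
    "ENTS": (84.55, 84.57, 84.56),
}

for name, (p, r, reported) in rows.items():
    print(f"{name}_F  computed={f_score(p, r):.2f}  reported={reported:.2f}")
```

Because the inputs are already rounded to two decimals, a recomputed value can differ from the reported one in the last digit; for these three rows the difference stays below 0.01.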

Model Index

Model Name: en_core_web_sm
Results:
- Task: NER (Token Classification)
  - Metrics:
    - NER Precision: 0.8454836771
    - NER Recall: 0.8456530449
    - NER F Score: 0.8455683525
- Task: TAG (Token Classification)
  - Metrics:
    - TAG (XPOS) Accuracy: 0.97246532
- Task: UNLABELED_DEPENDENCIES (Token Classification)
  - Metrics:
    - Unlabeled Attachment Score (UAS): 0.9175304332
- Task: LABELED_DEPENDENCIES (Token Classification)
  - Metrics:
    - Labeled Attachment Score (LAS): 0.89874821
- Task: SENTS (Token Classification)
  - Metrics:
    - Sentences F-Score: 0.9059485531
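The Model Index reports the same scores as the Accuracy table, but as fractions at full precision instead of rounded percentages. The sketch below converts one form to the other and checks that they agree. The dictionary keys reuse the Accuracy table's row names (for example `ENTS_P` for "NER Precision"), which is a mapping assumed here for illustration.

```python
# Fractional scores copied from the Model Index.
MODEL_INDEX = {
    "ENTS_P": 0.8454836771,
    "ENTS_R": 0.8456530449,
    "ENTS_F": 0.8455683525,
    "TAG_ACC": 0.97246532,
    "DEP_UAS": 0.9175304332,
    "DEP_LAS": 0.89874821,
    "SENTS_F": 0.9059485531,
}

# Percentages copied from the Accuracy table.
ACCURACY_TABLE = {
    "ENTS_P": 84.55,
    "ENTS_R": 84.57,
    "ENTS_F": 84.56,
    "TAG_ACC": 97.25,
    "DEP_UAS": 91.75,
    "DEP_LAS": 89.87,
    "SENTS_F": 90.59,
}


def to_percent(fraction, digits=2):
    """Convert a 0-1 score to a percentage rounded to `digits` decimals."""
    return round(fraction * 100, digits)


# Collect any score whose rounded percentage differs from the table.
mismatches = {
    name: (to_percent(value), ACCURACY_TABLE[name])
    for name, value in MODEL_INDEX.items()
    if abs(to_percent(value) - ACCURACY_TABLE[name]) > 1e-9
}
print(mismatches or "all Model Index scores match the Accuracy table")
```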

📄 License

The model is released under the MIT license.
