CaCl2#
Other language versions: Simplified Chinese, English
I. Introduction to CaCl2#
CaCl2 (CA Chinese Language Lexicon) is a Chinese lexicon derived from a domestic financial-industry NLP project. It was built by analyzing existing corpora to collect a large volume of entries, which are cataloged and classified according to financial-industry standards. In natural language processing (NLP), it can be used for word segmentation, keyword extraction, content summarization, entity recognition, and similar tasks. The CaCl2 project aims to provide an industry-specific, complete, and accurate lexicon that covers the foundational work of Chinese NLP, so that users can devote more effort to business research. CaCl2 is an important component of the open project CaOCl (CA Open Chinese Lexical Analysis Toolkit).
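To illustrate why lexicon coverage matters for word segmentation, here is a minimal sketch of forward maximum matching, one of the simplest dictionary-driven segmentation techniques. It is not how jieba or any particular tool segments text, and the entries in `lexicon` are hypothetical examples rather than data taken from CaCl2.

```python
def forward_max_match(text, lexicon, max_len=8):
    """Greedy segmentation: at each position, take the longest lexicon entry."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        # A single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical entries, for illustration only (not taken from CaCl2).
lexicon = {"商业银行", "银行", "发行", "理财产品", "理财"}
tokens = forward_max_match("商业银行发行理财产品", lexicon)
print(tokens)  # ['商业银行', '发行', '理财产品']
```

Without the industry terms in the lexicon, the same input falls back to single characters, which is the gap an industry-specific lexicon is meant to close.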
Statistics#
1. Number of Entries#
Date | Total Entries | Candidate Entries | Public Entries | Preview Entries |
---|---|---|---|---|
2021-02-01 | Approximately 21,000,000 | Approximately 3,000,000 | 2,553,806 | 280,000 |
2. Number of Industry Dictionaries#
Date | Industry | Number of Dictionaries | Public | Preview | Unpublished |
---|---|---|---|---|---|
2021-02-01 | Primary Industry | 28 | 2 | 26 | 0 |
2021-02-01 | Secondary Industry | 104 | 5 | 99 | 0 |
For detailed statistics, see: CaCl2 Open Status Statistics
II. Quick Start#
1. Clone or Download CaCl2 Lexicon as Needed#
Clone
git clone https://github.com/limccn/cacl2.git
Download
wget https://github.com/limccn/cacl2/raw/master/archive/v0.2/[dictionary code].zip
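The download step can also be scripted. The sketch below builds the archive URL from a dictionary code and extracts the zip with the Python standard library. It assumes the archives sit under `archive/<version>/<code>.zip` as in the command above and that GitHub's `raw` path serves the file itself (a `blob` URL returns an HTML page). The function names and the `dicts` target directory are illustrative choices, not part of CaCl2.

```python
import urllib.request
import zipfile
from pathlib import Path

# Assumption: archives are served from the repository's raw path.
BASE_URL = "https://github.com/limccn/cacl2/raw/master/archive"


def archive_url(code, version="v0.2"):
    """Build the download URL for one dictionary archive, e.g. code '480000'."""
    return f"{BASE_URL}/{version}/{code}.zip"


def fetch_dictionary(code, dest_dir="dicts", version="v0.2"):
    """Download one archive, extract it into dest_dir, and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    zip_path = dest / f"{code}.zip"
    urllib.request.urlretrieve(archive_url(code, version), str(zip_path))
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
        return zf.namelist()


print(archive_url("480000"))
```

Calling `fetch_dictionary("480000")` would then download and unpack the archive for that code. The names of the files inside the archive are whatever the archive contains and are not assumed here.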
2. Import and Configure the Lexicon#
The publicly available CaCl2 lexicon supports use in various word segmentation tools and environments.
Usage Example (jieba)#
import os
import jieba
# BASE_PATH_TO_DICT: directory that contains the extracted dictionary files
dict_name = '480000.txt'
jieba.load_userdict(os.path.join(BASE_PATH_TO_DICT, dict_name))
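jieba documents its user-dictionary format as one entry per line: the word, an optional frequency, and an optional part-of-speech tag, separated by spaces. Whether a given CaCl2 file carries the optional columns is not stated here, so the sketch below only shows how such lines can be checked before loading. The sample lines and the helper name `parse_userdict_line` are hypothetical.

```python
import re

# word, optional integer frequency, optional lowercase POS tag
_LINE = re.compile(r"^(.+?)( [0-9]+)?( [a-z]+)?$")


def parse_userdict_line(line):
    """Return (word, freq, tag) for one line; freq and tag are None when absent."""
    match = _LINE.match(line.strip())
    if match is None:
        return None  # blank line
    word, freq, tag = match.groups()
    return word, int(freq) if freq else None, tag.strip() if tag else None


# Hypothetical sample lines, not taken from a CaCl2 dictionary.
sample = ["商业银行 120 n", "理财产品", "同业拆借 35"]
entries = [parse_userdict_line(s) for s in sample]
print(entries)
```

Running a check like this over a dictionary file before calling `jieba.load_userdict` makes malformed or blank lines visible early.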
Usage Example (XML properties configuration)#
<properties>
<entry key="ext_dict">480000.txt;480100.txt;</entry>
</properties>
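When several dictionaries are in use, the `ext_dict` value can be generated rather than edited by hand. This is a small sketch that assumes the semicolon-terminated list format shown above. The helper name `ext_dict_entry` is illustrative.

```python
from xml.sax.saxutils import escape


def ext_dict_entry(dict_files):
    """Render the ext_dict entry for a list of dictionary file names."""
    # Each file name is followed by a semicolon, matching the example above.
    value = "".join(f"{name};" for name in dict_files)
    return f'<entry key="ext_dict">{escape(value)}</entry>'


print(ext_dict_entry(["480000.txt", "480100.txt"]))
# <entry key="ext_dict">480000.txt;480100.txt;</entry>
```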
3. Test and Start Using CaCl2, Enjoy!#
III. Lexicon Open Source Progress#
1. Open Sourced#
Industry Code | Lexicon Name | Number of Entries | Public Release Date | Current Version | Format | Download Link |
---|---|---|---|---|---|---|
480000 | Banking - General | 40,612 | 2021-02 | v0.2 | txt | 480000.zip |
480100 | Banking - Bank | 224,433 | 2021-02 | v0.2 | txt | 480100.zip |
490000 | Non-Bank Finance - General | 341,235 | 2021-02 | v0.2 | txt | 490000.zip |
490100 | Non-Bank Finance - Securities | 311,121 | 2021-02 | v0.2 | txt | 490100.zip |
490200 | Non-Bank Finance - Insurance | 31,020 | 2021-02 | v0.2 | txt | 490200.zip |
2. Planned Open Source#
Industry Code | Lexicon Name | Number of Entries | Planned Public Release Date | Current Version | Format | Download Link |
---|---|---|---|---|---|---|
490300 | Non-Bank Finance - Diversified Finance | 10,000 | Q2 2021 | v0.2 | txt | 490300.zip |
3. Technical Preview#
Before the dictionaries are publicly released, a technical preview of 10,000 entries is provided for each of the 28 primary industries. For the actual number of entries in each dictionary, see: CaCl2 Open Status Statistics
Industry Code | Lexicon Name | Number of Entries | Format | Download Link |
---|---|---|---|---|
110000 | Agriculture, Forestry, Animal Husbandry, and Fishery - General | 10,000 | txt | 110000.zip |
210000 | Mining - General | 10,000 | txt | 210000.zip |
220000 | Chemical Industry - General | 10,000 | txt | 220000.zip |
230000 | Steel - General | 10,000 | txt | 230000.zip |
240000 | Non-Ferrous Metals - General | 10,000 | txt | 240000.zip |
270000 | Electronics - General | 10,000 | txt | 270000.zip |
280000 | Automotive - General | 10,000 | txt | 280000.zip |
330000 | Home Appliances - General | 10,000 | txt | 330000.zip |
340000 | Food and Beverage - General | 10,000 | txt | 340000.zip |
350000 | Textiles and Apparel - General | 10,000 | txt | 350000.zip |
360000 | Light Industry Manufacturing - General | 10,000 | txt | 360000.zip |
370000 | Pharmaceuticals and Biotechnology - General | 10,000 | txt | 370000.zip |
410000 | Public Utilities - General | 10,000 | txt | 410000.zip |
420000 | Transportation - General | 10,000 | txt | 420000.zip |
430000 | Real Estate - General | 10,000 | txt | 430000.zip |
450000 | Commercial Trade - General | 10,000 | txt | 450000.zip |
460000 | Leisure Services - General | 10,000 | txt | 460000.zip |
480000 | Banking - General | 10,000 | txt | 480000.zip |
490000 | Non-Bank Finance - General | 10,000 | txt | 490000.zip |
510000 | Comprehensive - General | 10,000 | txt | 510000.zip |
610000 | Building Materials - General | 10,000 | txt | 610000.zip |
620000 | Building Decoration - General | 10,000 | txt | 620000.zip |
630000 | Electrical Equipment - General | 10,000 | txt | 630000.zip |
640000 | Machinery and Equipment - General | 10,000 | txt | 640000.zip |
650000 | National Defense and Military Industry - General | 10,000 | txt | 650000.zip |
710000 | Computers - General | 10,000 | txt | 710000.zip |
720000 | Media - General | 10,000 | txt | 720000.zip |
730000 | Communication - General | 10,000 | txt | 730000.zip |
For entries in the original format, see: /dicts
For detailed open status, see: CaCl2 Open Status Statistics
IV. Usage Effects#
1. Tool Testing Comparison#
1.1 Comparison of Word Segmentation Results Using CaCl2 Standard Lexicon and Jieba Standard Library (@CaoWJ)#
1.2 Word Segmentation Comparison Using CaCl2 and Financial Industry Lexicon [Zhaojin Ciku] (@CaoWJ)#
1.3 Word Segmentation and Summary Using CaCl2 and Financial Industry Lexicon [Zhaojin Ciku] (@CaoWJ)#
2. Metrics and Scores#
2.1 Industry Dataset Testing#
2.1.1 Financial Industry (Banking Industry), Word Segmentation Test#
2.1.2 Financial Industry (Financial Industry, Excluding Banking), Word Segmentation Test#
2.2 Standard Dataset Testing#
V. History and Change Log#
1. Regular Release Versions#
Version | Release Date | Change Log |
---|---|---|
0.2 | 2021 | Version currently being released |
0.1.1 | 2020 | Cataloged and classified the lexicon using the Shenwan industry classification, with a total of 28 primary industries and 104 secondary industries |
0.1 | 2019 | First release version, containing 21 million Chinese entries from the internet, mainly from Baidu Baike, Wikipedia Chinese, and other sources |
2. Automatic Release Versions#
Latest Version | Release Cycle | Release Date | Change Log |
---|---|---|---|
v0.2.21.01 | monthly | 2021-02-01 | Release of financial industry (banking and non-bank finance) lexicon |
v0.2.20.12 | monthly | 2021-01-01 | Initial 0.2 release and first open-source version, providing a preview of 10,000 entries for each of the 28 primary industries |
For historical automatic release versions, please refer to the link: Version History
VI. License#
1. Open Source Software License#
The source code of CaCl2 is open-sourced under the Apache License 2.0.
Copyright 2021 limc.cn All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
2. Creative Commons License#
The open lexicon, corpora, models, and other materials of CaCl2 are licensed under the Creative Commons BY-NC-SA 4.0 license.
VII. Contributions and Contributors#
Thanks to all contributors of CaCl2 for their efforts. We welcome all contributors who wish to participate and contribute to the CaCl2 project.
1. How to Contribute?#
1.1 Fork or Star our CaCl2#
1.2 Participate in CaCl2 community discussions on GitHub#
2. Contributors#
@CaoWJ
VIII. Frequently Asked Questions#
IX. Other Notes#
Some CaCl2 content comes from publicly available information and data on the internet. CaCl2 does not guarantee the completeness or accuracy of the data, and nothing in it constitutes advice of any kind. We do not hold any securities mentioned in this document and have no affiliation with the companies mentioned in it.