李大仁博客

CaCl2 (Chinese)

CaCl2

Other language versions: Simplified Chinese, English

I. Introduction to CaCl2

CaCl2 (CA Chinese Language Lexicon) is a Chinese lexicon derived from a domestic financial industry NLP project. It is built by analyzing existing corpora to collect a large volume of entries, which are then cataloged and classified according to financial industry standards. In natural language processing (NLP), it can be used for word segmentation, keyword extraction, content summarization, entity recognition, and similar tasks. The goal of the CaCl2 project is to provide an industry-specific, complete, and accurate lexicon that covers the foundational work of Chinese NLP, so that users can devote more energy to business research. CaCl2 is an important component of the open project CaOCl (CA Open Chinese Lexical Analysis Toolkit).

Statistics

1. Number of Entries

| Date | Total Entries | Candidate Entries | Public Entries | Preview Entries |
| --- | --- | --- | --- | --- |
| 2021-02-01 | Approximately 21,000,000 | Approximately 3,000,000 | 2,553,806 | 280,000 |

2. Number of Industry Dictionaries

| Date | Industry | Number of Dictionaries | Public | Preview | Unpublished |
| --- | --- | --- | --- | --- | --- |
| 2021-02-01 | Primary Industry | 28 | 2 | 26 | 0 |
| 2021-02-01 | Secondary Industry | 104 | 5 | 99 | 0 |

For detailed statistical status, please refer to the link: CaCl2 Open Status Statistics

II. Quick Start

1. Clone or Download the CaCl2 Lexicon as Needed

Clone

git clone https://github.com/limccn/cacl2.git

Download

wget https://github.com/limccn/cacl2/blob/master/archive/v0.2/[dictionary code].zip
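
Once an archive has been downloaded, it can be unpacked with the Python standard library. The following is a minimal sketch, assuming each archive contains plain-text `.txt` dictionary files; the function name `extract_dictionary` and the paths in the usage note are hypothetical and not part of CaCl2.

```python
import zipfile
from pathlib import Path
from typing import List


def extract_dictionary(archive_path: str, target_dir: str) -> List[str]:
    """Extract every .txt file from a dictionary archive and return their paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(archive_path) as archive:
        for name in archive.namelist():
            # Assumption: dictionary files inside the archive end in ".txt".
            if name.endswith(".txt"):
                archive.extract(name, str(target))
                extracted.append(str(target / name))
    return sorted(extracted)
```

For example, `extract_dictionary("480000.zip", "dicts")` would unpack the archive into a local `dicts` directory and return the paths of the extracted text files.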

2. Import and Configure the Lexicon

The publicly available CaCl2 lexicon supports use in various word segmentation tools and environments.
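
To illustrate why an industry lexicon matters for word segmentation, here is a minimal sketch of forward maximum matching, a simple dictionary-driven segmentation technique. It is shown for illustration only and is not the algorithm used by any particular segmentation tool; the lexicon entries below are hypothetical examples, not taken from the CaCl2 files.

```python
def forward_max_match(text, lexicon, max_len=None):
    """Greedy left-to-right segmentation: always take the longest lexicon hit."""
    if max_len is None:
        max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical entries, chosen only to show the effect of a domain lexicon.
general = {"银行", "理财"}
finance = general | {"商业银行", "理财产品"}

text = "商业银行理财产品"
print(forward_max_match(text, general))  # falls back to short words and single characters
print(forward_max_match(text, finance))  # keeps the domain terms intact
```

With only the general entries, the domain terms are broken into fragments; adding the industry entries lets the segmenter keep them as whole tokens.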

Usage Example (jieba)

import os
import jieba

# BASE_PATH_TO_DICT: directory that holds the extracted dictionary files
dict_name = '480000.txt'
jieba.load_userdict(os.path.join(BASE_PATH_TO_DICT, dict_name))
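
Before loading a file, it can be useful to check that its lines are well formed. The sketch below assumes the CaCl2 `.txt` files follow jieba's user dictionary layout (one entry per line: the word, an optional frequency, and an optional part-of-speech tag, separated by spaces); that assumption is not confirmed by this document, and `DictEntry` and `parse_userdict_line` are hypothetical helper names.

```python
import re
from typing import NamedTuple, Optional

# A word, then an optional integer frequency, then an optional lowercase POS tag.
_LINE = re.compile(r"^(.+?)(?: ([0-9]+))?(?: ([a-z]+))?$")


class DictEntry(NamedTuple):
    word: str
    freq: Optional[int]
    tag: Optional[str]


def parse_userdict_line(line: str) -> Optional[DictEntry]:
    """Parse one dictionary line; return None for blank lines."""
    line = line.strip()
    if not line:
        return None
    word, freq, tag = _LINE.match(line).groups()
    return DictEntry(word, int(freq) if freq else None, tag)


# Hypothetical sample line, not taken from a CaCl2 file.
print(parse_userdict_line("招商银行 100 nt"))
```

Running this over every line of a dictionary file gives a quick sanity check on entry counts and field layout before the file is handed to a segmentation tool.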

Usage Example (XML configuration)

<properties>
<entry key="ext_dict">480000.txt;480100.txt;</entry>
</properties>
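
When many dictionaries are in use, the `ext_dict` entry can be generated instead of edited by hand. This is a minimal sketch that reproduces the element layout shown above; the function name `build_ext_dict_config` is hypothetical, and any additional header lines a given tool expects in its configuration file are not covered here.

```python
import xml.etree.ElementTree as ET


def build_ext_dict_config(dict_files):
    """Return a <properties> XML string listing the dictionary files under ext_dict."""
    root = ET.Element("properties")
    entry = ET.SubElement(root, "entry", {"key": "ext_dict"})
    # Each file name is followed by a semicolon, as in the example above.
    entry.text = "".join(f"{name};" for name in dict_files)
    return ET.tostring(root, encoding="unicode")


print(build_ext_dict_config(["480000.txt", "480100.txt"]))
```

The output is a single-line XML string; it can be pretty-printed or merged into an existing configuration file as needed.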

3. Test and Start Using CaCl2. Enjoy!

III. Lexicon Open Source Progress

1. Open Sourced

| Industry Code | Lexicon Name | Number of Entries | Public Release Date | Current Version | Format | Download Link |
| --- | --- | --- | --- | --- | --- | --- |
| 480000 | Banking - General | 40,612 | 2021-02 | v0.2 | txt | 480000.zip |
| 480100 | Banking - Bank | 224,433 | 2021-02 | v0.2 | txt | 480100.zip |
| 490000 | Non-Bank Finance - General | 341,235 | 2021-02 | v0.2 | txt | 490000.zip |
| 490100 | Non-Bank Finance - Securities | 311,121 | 2021-02 | v0.2 | txt | 490100.zip |
| 490200 | Non-Bank Finance - Insurance | 31,020 | 2021-02 | v0.2 | txt | 490200.zip |
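
The industry codes in these tables appear to be hierarchical: codes ending in `0000` name the "General" lexicon of a primary industry, and codes such as `480100` share their first two digits with it. The sketch below groups codes on that basis. The prefix rule is an assumption inferred from the tables, not a documented specification, and the function names are hypothetical.

```python
def primary_code(code: str) -> str:
    """Return the primary-industry code that a six-digit code falls under."""
    if len(code) != 6 or not code.isdigit():
        raise ValueError(f"expected a six-digit industry code, got {code!r}")
    # Assumption: the first two digits identify the primary industry.
    return code[:2] + "0000"


def is_primary(code: str) -> bool:
    """True when the code itself is a primary-industry ("General") code."""
    return primary_code(code) == code


def group_by_primary(codes):
    """Group codes under their primary-industry code, preserving input order."""
    groups = {}
    for code in codes:
        groups.setdefault(primary_code(code), []).append(code)
    return groups


print(group_by_primary(["480000", "480100", "490000", "490100", "490200"]))
```

Grouping this way makes it easy to load a primary industry's general lexicon together with its secondary lexicons.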

2. Planned Open Source

| Industry Code | Lexicon Name | Number of Entries | Planned Public Release Date | Current Version | Format | Download Link |
| --- | --- | --- | --- | --- | --- | --- |
| 490300 | Non-Bank Finance - Diversified Finance | 10,000 | Q2 2021 | v0.2 | txt | 490300.zip |

3. Technical Preview

Ahead of each dictionary's public release, we provide a technical preview of 10,000 entries for each of the 28 primary industries. For the actual number of entries in each dictionary, see CaCl2 Open Status Statistics.

| Industry Code | Lexicon Name | Number of Entries | Format | Download Link |
| --- | --- | --- | --- | --- |
| 110000 | Agriculture, Forestry, Animal Husbandry, and Fishery - General | 10,000 | txt | 110000.zip |
| 210000 | Mining - General | 10,000 | txt | 210000.zip |
| 220000 | Chemical Industry - General | 10,000 | txt | 220000.zip |
| 230000 | Steel - General | 10,000 | txt | 230000.zip |
| 240000 | Non-Ferrous Metals - General | 10,000 | txt | 240000.zip |
| 270000 | Electronics - General | 10,000 | txt | 270000.zip |
| 280000 | Automotive - General | 10,000 | txt | 280000.zip |
| 330000 | Home Appliances - General | 10,000 | txt | 330000.zip |
| 340000 | Food and Beverage - General | 10,000 | txt | 340000.zip |
| 350000 | Textiles and Apparel - General | 10,000 | txt | 350000.zip |
| 360000 | Light Industry Manufacturing - General | 10,000 | txt | 360000.zip |
| 370000 | Pharmaceuticals and Biotechnology - General | 10,000 | txt | 370000.zip |
| 410000 | Public Utilities - General | 10,000 | txt | 410000.zip |
| 420000 | Transportation - General | 10,000 | txt | 420000.zip |
| 430000 | Real Estate - General | 10,000 | txt | 430000.zip |
| 450000 | Commercial Trade - General | 10,000 | txt | 450000.zip |
| 460000 | Leisure Services - General | 10,000 | txt | 460000.zip |
| 480000 | Banking - General | 10,000 | txt | 480000.zip |
| 490000 | Non-Bank Finance - General | 10,000 | txt | 490000.zip |
| 510000 | Comprehensive - General | 10,000 | txt | 510000.zip |
| 610000 | Building Materials - General | 10,000 | txt | 610000.zip |
| 620000 | Building Decoration - General | 10,000 | txt | 620000.zip |
| 630000 | Electrical Equipment - General | 10,000 | txt | 630000.zip |
| 640000 | Machinery and Equipment - General | 10,000 | txt | 640000.zip |
| 650000 | National Defense and Military Industry - General | 10,000 | txt | 650000.zip |
| 710000 | Computers - General | 10,000 | txt | 710000.zip |
| 720000 | Media - General | 10,000 | txt | 720000.zip |
| 730000 | Communication - General | 10,000 | txt | 730000.zip |

For entries in their original format, see /dicts. For detailed open status, see CaCl2 Open Status Statistics.

IV. Usage Effects

1. Tool Testing Comparison

1.1 Comparison of Word Segmentation Results Using the CaCl2 Standard Lexicon and the Jieba Standard Library (@CaoWJ)

1.2 Word Segmentation Comparison Using CaCl2 and a Financial Industry Lexicon [Zhaojin Ciku] (@CaoWJ)

1.3 Word Segmentation and Summary Using CaCl2 and a Financial Industry Lexicon [Zhaojin Ciku] (@CaoWJ)

2. Metrics and Scores

2.1 Industry Dataset Testing

2.1.1 Financial Industry (Banking), Word Segmentation Test

2.1.2 Financial Industry (Non-Bank Finance), Word Segmentation Test

2.2 Standard Dataset Testing

V. History and Change Log

1. Regular Release Versions

| Version | Release Date | Change Log |
| --- | --- | --- |
| 0.2 | 2021 | Version currently being released |
| 0.1.1 | 2020 | Cataloged and classified the lexicon using the Shenwan industry classification, with a total of 28 primary industries and 104 secondary industries |
| 0.1 | 2019 | First release version, containing 21 million Chinese entries from the internet, mainly from Baidu Baike, Chinese Wikipedia, and other sources |

2. Automatic Release Versions

| Latest Version | Release Cycle | Release Date | Change Log |
| --- | --- | --- | --- |
| v0.2.21.01 | monthly | 2021-02-01 | Release of the financial industry (banking and non-bank finance) lexicon |
| v0.2.20.12 | monthly | 2021-01-01 | Initial build of version 0.2 and first open-source version, providing a preview of 10,000 entries from each of the 28 primary industries |

For historical automatic release versions, please refer to the link: Version History

VI. License

1. Open Source Software License

The source code of CaCl2 is open-sourced under the Apache License 2.0.

    Copyright 2021 limc.cn All rights reserved.
    
    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
    You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.

2. Creative Commons License

The open lexicon, corpora, models, and other materials of CaCl2 are licensed under Creative Commons BY-NC-SA 4.0.

VII. Contributions and Contributors

Thanks to all contributors to CaCl2 for their efforts. We welcome anyone who wishes to participate in and contribute to the CaCl2 project.

1. How to Contribute?

1.1 Fork or Star our CaCl2

1.2 Participate in CaCl2 community discussions on GitHub

2. Contributors

@CaoWJ

VIII. Frequently Asked Questions

IX. Other Notes

Some content of CaCl2 comes from publicly available information and data on the internet. CaCl2 does not guarantee the completeness or accuracy of the data, and nothing in it constitutes advice of any kind. We do not hold any securities mentioned in this article and have no affiliation with the companies mentioned in this article.

X. References

  1. Shenwan Guotai Junan Research Institute Industry Classification Standards. 2014