1. Platform background
The platform was launched by the China Cyberspace Security Association and the National Internet Emergency Response Center. It aims to provide high-quality, reliable Chinese Internet corpus resources to support artificial intelligence model training, natural language processing research, and other applications.
2. Resource characteristics
The platform has released "Chinese Internet Basic Corpus 2.0", which covers 27 datasets with a total volume of about 2.7 TB. The basic corpus portion is about 120 GB and contains about 38 million data entries. All data is source-verified, content-filtered, and deduplicated to ensure the accuracy and reliability of the content.
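The platform does not publish how its filtering and deduplication are implemented, so the following is only a minimal sketch of what such a step can look like, not the platform's actual pipeline. The function names, the `min_chars` threshold, and the choice of exact-match hashing are all assumptions made for illustration.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Apply NFKC normalization and collapse whitespace so trivially different copies match."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def filter_and_dedupe(records, min_chars=5):
    """Yield normalized records that pass a length filter and have not been seen before.

    min_chars is a hypothetical threshold standing in for a real content filter.
    """
    seen = set()
    for raw in records:
        text = normalize(raw)
        if len(text) < min_chars:  # illustrative filter: drop very short fragments
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:  # exact duplicate after normalization
            continue
        seen.add(digest)
        yield text
```

Exact hashing only removes copies that are identical after normalization. Catching near-duplicates would need a different technique, such as MinHash, which this sketch does not attempt.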
3. Open source value
After registration and certification, users can download and use the corpus for scientific research, industry, and other needs. This supports the development of the open-source ecosystem and promotes the innovation and application of large models and natural language processing technology for the Chinese language.
For details, please refer to the official website:
https://corpus.cybersac.cn/?home#/index