Reduction of computational requirements for generating C4_200M

Generating the clean version of the C4 corpus, version 2.2.1, seems to be computationally expensive.

Are there any plans for an alternative way to generate the C4_200M dataset?

For example, obtaining the C4 clean version 3.0.1 seems easier, as it is made available by [allennlp](https://github.com/allenai/allennlp/discussions/5056). Alternatively, providing a downloadable copy of the C4 clean version 2.2.1 would also make this easier.

Thank you 🙂

Reduction of computational requirements for generating C4_200M #2