Obtaining and generating the clean version of the C4 corpus (version 2.2.1) seems to be computationally expensive.
Are there any plans for an alternative way to generate the C4_200M dataset?
For example, the clean C4 version 3.0.1 seems easier to obtain, as it is available from allennlp. Alternatively, providing a downloadable copy of the clean C4 version 2.2.1 would also make this easier.
Thank you 🙂