CogVideo is the largest pretrained transformer for general-domain text-to-video generation, with 9.4 billion parameters. It efficiently finetunes a pretrained text-to-image generative model (CogView2) for text-to-video generation and adopts a multi-frame-rate hierarchical training strategy.
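
One way to picture the multi-frame-rate idea is that a single video can yield fixed-length training clips at several temporal spacings, with the chosen spacing passed to the model alongside the text. The sketch below is only an illustration under that reading and is not taken from the CogVideo code base: the function names `sample_clip` and `build_example`, the default clip length of 5, and the `<stride=N>` prompt tag are all hypothetical.

```python
from typing import Dict, List, Union


def sample_clip(num_video_frames: int, stride: int,
                clip_len: int = 5, start: int = 0) -> List[int]:
    """Return indices of `clip_len` frames spaced `stride` frames apart.

    A larger stride corresponds to a lower effective frame rate.
    All names and defaults here are illustrative assumptions.
    """
    if stride < 1 or clip_len < 1:
        raise ValueError("stride and clip_len must be positive")
    last = start + stride * (clip_len - 1)
    if start < 0 or last >= num_video_frames:
        raise ValueError("clip does not fit inside the video")
    return [start + i * stride for i in range(clip_len)]


def build_example(text: str, num_video_frames: int, stride: int,
                  clip_len: int = 5) -> Dict[str, Union[str, List[int]]]:
    """Pair a text prompt with frame indices sampled at the given stride.

    The "<stride=N>" prefix is a hypothetical stand-in for whatever
    frame-rate conditioning the real model uses.
    """
    return {
        "prompt": f"<stride={stride}> {text}",
        "frame_indices": sample_clip(num_video_frames, stride, clip_len),
    }


if __name__ == "__main__":
    # The same 32-frame video sampled at three different spacings.
    for stride in (1, 2, 4):
        print(build_example("a cat playing piano", 32, stride))
```

Under this reading, clips sampled with a large stride cover long-range motion, while clips with a small stride cover fine-grained motion between neighbouring frames.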