Stability AI has introduced Stable Cascade, a new text-to-image model aimed at enthusiasts and developers alike. It is released under a non-commercial license, which opens the door to research, learning, and other non-commercial projects.
The model uses a three-stage approach that makes it comparatively easy to train and fine-tune, even on standard consumer hardware. Its hierarchical compression technique allows it to produce high-quality images from a highly compressed latent space, which makes image generation both powerful and efficient.
Stable Cascade is also integrated with the diffusers library, so users can run inference with little setup. To foster transparency and collaboration, Stability AI has made the model's training and inference code publicly accessible on its GitHub page.
Features of Stable Cascade
What sets Stable Cascade apart is its architecture, which consists of three distinct stages (A, B, and C) that work together to produce the final image. This is a departure from the Stable Diffusion models and shows Stability AI's willingness to try different designs.
Adding to its impressive capabilities, the model offers additional features such as image variations and image-to-image generation. These features not only enhance the creative possibilities but also demonstrate the flexibility of the model to cater to a wide range of artistic and practical applications.
The comprehensive release of Stable Cascade does not stop at the model itself. It includes all the necessary code for training and fine-tuning, accompanied by tools like ControlNet and LoRA, which aim to lower the barriers to further experimentation and refinement of this already remarkable architecture.
With Stable Cascade, Stability AI gives creators and developers a new text-to-image model to build on, experiment with, and fine-tune.
Stable Cascade’s Unique Architecture
Stable Cascade is a new text-to-image model released by Stability AI. It is built on a three-stage architecture, comprising Stages A, B, and C, which compresses images hierarchically and produces high-quality outputs from a highly compressed latent space. The model is easy to train and fine-tune on consumer hardware, and it is released under a license that permits non-commercial use only.
The three stages of the Stable Cascade architecture are listed below. During generation they run in reverse alphabetical order, from Stage C to Stage A:
- Stage A: This stage is an autoencoder that decodes the latent from Stage B into the final, high-resolution image.
- Stage B: This stage expands the small latent from Stage C into the larger, more detailed latent that Stage A decodes.
- Stage C: This stage generates a small, highly compressed latent from the text prompt.
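To give a feel for how compressed the latent space is, here is a minimal sketch of the latent sizes involved. The compression factors are assumptions taken from the public Stable Cascade model card and are not stated in this article: roughly 42.67x per side for the Stage C latent, 4x for the Stage A autoencoder, and 8x for the Stable Diffusion VAE as a point of comparison. The function name `latent_sizes` is made up for illustration.

```python
import math

# Assumed compression factors (per side) from the public model card.
STAGE_C_COMPRESSION = 42.67  # Stage C latent space
STAGE_A_COMPRESSION = 4      # Stage A autoencoder
SD_COMPRESSION = 8           # Stable Diffusion VAE, for comparison only


def latent_sizes(height, width):
    """Return the approximate spatial size of each latent for a target image."""
    stage_c = (
        math.ceil(height / STAGE_C_COMPRESSION),
        math.ceil(width / STAGE_C_COMPRESSION),
    )
    stage_a = (height // STAGE_A_COMPRESSION, width // STAGE_A_COMPRESSION)
    stable_diffusion = (height // SD_COMPRESSION, width // SD_COMPRESSION)
    return {
        "stage_c": stage_c,
        "stage_a": stage_a,
        "stable_diffusion": stable_diffusion,
    }


print(latent_sizes(1024, 1024))
# {'stage_c': (24, 24), 'stage_a': (256, 256), 'stable_diffusion': (128, 128)}
```

Under these assumptions, the text-conditional stage works on a 24x24 grid for a 1024x1024 image, where Stable Diffusion works on 128x128. That smaller grid is where most of the claimed savings in training and inference come from.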
Stable Cascade introduces an interesting three-stage approach that targets quality, flexibility, fine-tuning, and efficiency, with a focus on lowering hardware barriers. The model is available for inference in the diffusers library. The architecture allows additional training or fine-tuning, including ControlNets and LoRAs, to be done on Stage C alone, which Stability AI says comes with a 16x cost reduction compared to training a similar-sized Stable Diffusion model. The modular design keeps the expected VRAM requirement for inference at approximately 20 GB, and this can be lowered further by using the smaller model variants. In Stability AI's reported comparisons, Stable Cascade performs best in both prompt alignment and aesthetic quality against almost all of the other models.
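For readers who want to try inference through diffusers, the following is a hedged sketch rather than an official recipe. It assumes the `StableCascadePriorPipeline` and `StableCascadeDecoderPipeline` classes and the `stabilityai/stable-cascade-prior` and `stabilityai/stable-cascade` checkpoints as described in the public diffusers documentation. The step counts and guidance scales are that documentation's example values, not tuned settings. The `generate` helper is a made-up name, and it imports torch and diffusers lazily so that the heavy dependencies load only when an image is actually requested.

```python
# Checkpoint names follow the public diffusers documentation (assumption).
PRIOR_ID = "stabilityai/stable-cascade-prior"  # Stage C
DECODER_ID = "stabilityai/stable-cascade"      # Stages B and A


def generate(prompt, height=1024, width=1024, negative_prompt=""):
    """Run Stage C, then Stages B and A, and return a PIL image."""
    # Lazy imports: torch and diffusers are only needed when generating.
    import torch
    from diffusers import StableCascadeDecoderPipeline, StableCascadePriorPipeline

    prior = StableCascadePriorPipeline.from_pretrained(
        PRIOR_ID, variant="bf16", torch_dtype=torch.bfloat16
    )
    decoder = StableCascadeDecoderPipeline.from_pretrained(
        DECODER_ID, variant="bf16", torch_dtype=torch.float16
    )
    # Offload idle sub-models to the CPU to reduce peak VRAM use.
    prior.enable_model_cpu_offload()
    decoder.enable_model_cpu_offload()

    # Stage C: text prompt -> compressed image embeddings.
    prior_output = prior(
        prompt=prompt,
        height=height,
        width=width,
        negative_prompt=negative_prompt,
        guidance_scale=4.0,
        num_inference_steps=20,
    )
    # Stages B and A: embeddings -> full-resolution image.
    result = decoder(
        image_embeddings=prior_output.image_embeddings.to(torch.float16),
        prompt=prompt,
        negative_prompt=negative_prompt,
        guidance_scale=0.0,
        num_inference_steps=10,
        output_type="pil",
    )
    return result.images[0]
```

Calling `generate("a lighthouse at dusk").save("cascade.png")` would download both checkpoints on first use and write the result to disk, assuming a GPU with enough memory is available.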
In addition to standard text-to-image generation, Stable Cascade can generate image variations and image-to-image generations. The release includes all the code for training, finetuning, ControlNet, and LoRA to lower the requirements to experiment with this architecture further.
Final Thoughts
The model overall looks promising. It seems to do pretty well with text in images, something AI has struggled with. That said, most AI image models are getting better at it. Ideogram was one of the first to produce decent text in images, then came DALL-E 3 and eventually Midjourney.
My concern with these models has always been whether they can be freely downloaded and fucked around with. As long as the community is able to get their hands on them and fine-tune them, train new base models and LoRAs, and just generally break them in new and unexpected ways, then the existence of a commercial license seems completely fine to me. From what I've seen, it works better. It's not 100% perfect, but hands and text finally seem a lot better.
I'm excited about the new base model and architecture from Stability AI, which, like SD 1.5 and SDXL, is a foundational model that needs fine-tuning by the open-source community. There's one concern weighing on my mind, though: the $20/month licensing fee. If I have to pay this even without generating any net earnings from a project, it could make devs pause before diving in. Ideally, I'd prefer a structure where I only need to pay once my earnings can cover the cost. It's worth noting that Stability AI is currently losing $8 million per month and relies heavily on support from its community for survival. Nonetheless, stability remains crucial as it ensures continued progress in this field.