Stream-Omni

Stream-Omni is an end-to-end language-vision-speech chatbot that simultaneously supports interaction across various modality combinations, with the following features💡:

- Omni Interaction: Supports any multimodal input, including text, vision, and speech, and generates both text and speech responses.
- Seamless "see-while-hear" Experience: Simultaneously outputs intermediate textual results (e.g., ASR transcriptions and model responses) during speech interactions, like the advanced voice service of GPT-4o.
- Efficient Training: Requires only a small amount of omni-modal data for training.