
Provide examples to use alternative Text-to-Speech services #26

Open
Fu-u0718 opened this issue Apr 9, 2024 · 6 comments
Comments


Fu-u0718 commented Apr 9, 2024

I'd like to have conversations with the avatar not only in Japanese but also in English, but I learned that VOICEVOX, which the code uses, cannot speak English. Have you built any programs that use Google's or Azure's Text-to-Speech, for example?

@uezo uezo changed the title Text-to-Speechについて Provide examples to use alternative Text-to-Speech services Apr 13, 2024

uezo commented Apr 13, 2024

@Fu-u0718 says:
I'm interested in having conversations not only in Japanese using an avatar, but also in English. However, I found out that the VOICEVOX software used in the code does not support English. Have you created any programs that utilize Text-to-Speech services like Google or Azure for this purpose?


uezo commented Apr 13, 2024

Hi @Fu-u0718,
You can make a custom SpeechController based on any TTS service you like:

  1. Make a SpeechController that implements aiavatar.speech.SpeechController
  2. Set an instance of your custom SpeechController to AvatarController

Here is an example for Azure:

  1. Make AzureSpeechController

```python
import aiohttp
import asyncio
import io
from logging import getLogger, NullHandler
import traceback
import wave
import numpy
import sounddevice
from . import SpeechController  # base class from aiavatar.speech


class VoiceClip:
    def __init__(self, text: str):
        self.text = text
        self.download_task = None
        self.audio_clip = None


class AzureSpeechController(SpeechController):
    def __init__(self, api_key: str, region: str, speaker_name: str="ja-JP-AoiNeural", speaker_gender: str="Female", lang: str="ja-JP", device_index: int=-1, playback_margin: float=0.1):
        self.logger = getLogger(__name__)
        self.logger.addHandler(NullHandler())

        self.api_key = api_key
        self.region = region
        self.speaker_name = speaker_name
        self.speaker_gender = speaker_gender
        self.lang = lang

        self.device_index = device_index
        self.playback_margin = playback_margin
        self.voice_clips = {}
        self._is_speaking = False

    async def download(self, voice: VoiceClip):
        url = f"https://{self.region}.tts.speech.microsoft.com/cognitiveservices/v1"
        headers = {
            "X-Microsoft-OutputFormat": "riff-16khz-16bit-mono-pcm",
            "Content-Type": "application/ssml+xml",
            "Ocp-Apim-Subscription-Key": self.api_key
        }
        ssml_text = f"<speak version='1.0' xml:lang='{self.lang}'><voice xml:lang='{self.lang}' xml:gender='{self.speaker_gender}' name='{self.speaker_name}'>{voice.text}</voice></speak>"
        data = ssml_text.encode("utf-8")

        async with aiohttp.ClientSession() as session:
            async with session.post(url, headers=headers, data=data) as response:
                if response.status == 200:
                    voice.audio_clip = await response.read()
                else:
                    self.logger.error(f"Failed to download voice: {response.status} {await response.text()}")

    def prefetch(self, text: str):
        # Return the cached clip if it exists; otherwise start downloading in the background
        v = self.voice_clips.get(text)
        if v:
            return v

        v = VoiceClip(text)
        v.download_task = asyncio.create_task(self.download(v))
        self.voice_clips[text] = v
        return v

    async def speak(self, text: str):
        voice = self.prefetch(text)

        if not voice.audio_clip:
            await voice.download_task

        if not voice.audio_clip:
            self.logger.error("No audio clip to play (download may have failed)")
            return

        with wave.open(io.BytesIO(voice.audio_clip), "rb") as f:
            try:
                self._is_speaking = True
                data = numpy.frombuffer(
                    f.readframes(f.getnframes()),
                    dtype=numpy.int16
                )
                framerate = f.getframerate()
                sounddevice.play(data, framerate, device=self.device_index, blocking=False)
                # Wait for the estimated playback duration plus a small margin
                await asyncio.sleep(len(data) / framerate + self.playback_margin)

            except Exception as ex:
                self.logger.error(f"Error at speaking: {str(ex)}\n{traceback.format_exc()}")

            finally:
                self._is_speaking = False

    def is_speaking(self) -> bool:
        return self._is_speaking
```
  2. Set the instance of your custom SpeechController to AvatarController

```python
app.avatar_controller.speech_controller = AzureSpeechController(
    AZURE_SUBSCRIPTION_KEY, AZURE_REGION,
    speaker_name="en-US-AvaNeural",
    speaker_gender="Female",
    lang="en-US",
    device_index=2    # Set the output device number on your PC
)
```
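One caveat worth noting (not part of the original answer): the Azure example interpolates `voice.text` into the SSML body as-is, so text containing characters like `&` or `<` would produce invalid XML and a request failure. A small hypothetical helper using only the standard library can escape the text first; `build_ssml` is an assumed name, not part of aiavatar:

```python
from xml.sax.saxutils import escape


def build_ssml(text: str, lang: str, gender: str, name: str) -> str:
    """Build an Azure TTS SSML document with the spoken text XML-escaped."""
    return (
        f"<speak version='1.0' xml:lang='{lang}'>"
        f"<voice xml:lang='{lang}' xml:gender='{gender}' name='{name}'>"
        f"{escape(text)}</voice></speak>"  # escape() handles &, <, >
    )
```

In `download`, `ssml_text` could then be built as `build_ssml(voice.text, self.lang, self.speaker_gender, self.speaker_name)`.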

However, I've found that AIAvatar has an issue handling English responses from ChatGPT. I will fix it soon.
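Since the original question also asked about Google, a similar controller could target Google Cloud Text-to-Speech (REST API `v1 text:synthesize`), which returns JSON with base64-encoded audio in `audioContent` rather than raw WAV bytes. This is a hedged sketch of just the request/response shaping, not code from aiavatar; the function names are assumptions:

```python
import base64
import json

# Google Cloud TTS synthesize endpoint (v1 REST API)
GOOGLE_TTS_URL = "https://texttospeech.googleapis.com/v1/text:synthesize"


def build_google_tts_request(text: str, lang: str = "en-US",
                             voice_name: str = "en-US-Neural2-A") -> str:
    """Build the JSON body for a Google Cloud TTS synthesize request."""
    return json.dumps({
        "input": {"text": text},
        "voice": {"languageCode": lang, "name": voice_name},
        "audioConfig": {"audioEncoding": "LINEAR16"},  # LINEAR16 = PCM WAV
    })


def decode_google_tts_response(response_json: str) -> bytes:
    """Extract raw audio bytes from the base64 'audioContent' field."""
    return base64.b64decode(json.loads(response_json)["audioContent"])
```

A `download` method for a `GoogleSpeechController` would POST the built body (with an API key or OAuth credentials) and assign the decoded bytes to `voice.audio_clip`; the rest of the Azure example's playback logic could stay the same.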


uezo commented Apr 13, 2024

I've fixed it👍
#32


Fu-u0718 commented Apr 13, 2024

Thank you! I learned a lot. I'd also like to enjoy conversations in English. Thank you for taking the time out of your busy schedule to respond!


mosu7 commented Jul 15, 2024

Hi, I tried this with the OpenAI speech service, but it got stuck at [INFO] 2024-07-15 17:28:44,009 : Listening... (OpenAIWakewordListener)


uezo commented Jul 15, 2024

Hi @mosu7,
Thank you for your post, but this issue is about Text-to-Speech, not the wake word listener.
Please open another issue if you want to discuss it.
