Abstract

In the age of voice assistants, "personal" vocal companions have proliferated across smartphones, smart speakers, and other smart devices. Yet are these virtual assistants genuinely "personal"? Most of them cannot remember past interactions or truly understand who the user is. They rely heavily on an internet connection to process spoken words on remote servers. Even though users give informed consent to these interactions, concerns linger about potential misuse of speech data, such as invasive targeted advertising. The advent of high-performance co-processors, such as GPUs and TPUs, in modern smartphones has rendered cloud-based speech processing obsolete, paving the way for local, on-device solutions.
Personal assistants for the elderly serve a unique role and require functionalities distinct from those catering to digital natives. Notably, they must excel at aiding the recall of conversations, which makes them invaluable in scenarios like medical examinations. By documenting and contextualizing exchanges during medical visits through diarization, a personal assistant lets individuals or caregivers revisit and understand the details at their convenience.
This autonomy requires operation without an internet connection, which ensures privacy during such sensitive interactions.
The e-ViTA project has developed a versatile conversational application with a rich set of features:
• Local use on both Android and iOS smartphones, with no internet connection required.
• The capability to remember previous interactions.
• Speaker recognition for personalized experiences.
• Local processing for automatic speech recognition, spoken language understanding, dialogue management, and speech synthesis.
• Secure web searches after anonymizing requests.
• The ability to handle telephone calls and to read emails, SMS, and messages.
• Text preparation through voice dictation.
• Assistance with daily activities, acting as a companion or butler.
• Inter-lingual communication via integration with TalkMondo, among other functions.
Unlike facial recognition, voice recognition and speaker differentiation provide a less invasive and more cost-effective solution. Because they use the smartphone's microphone, they do not rely on the camera, which would require complex mechanisms to track the speaker's position.
This paper highlights the critical importance of speaker diarization, which allows the system to preserve users' conversations while ensuring the highest level of privacy. Additionally, when deployed on embedded devices, this technology can contribute to monitoring the well-being of the elderly, offering vital contextual information enriched by domotics sensors (motion, intrusion, door or window sensors), actimetry sensors from smartphones or smartwatches, or weather stations. Fusing these data streams enables more personalized and optimized assistance and services, delivered through dialogues adapted to the user, in particular the elderly person, based on their context and activity.
In conclusion, the ability of a system to generate personalized dialogue is pivotal in the realm of personal voice assistants.
With secure, local processing and advanced features, such as speaker differentiation and diarization, enriched by sensor data fusion, we can ensure that virtual companions truly cater to the individual needs of users, without compromising their privacy or data security. This marks a significant step towards a more "personal" experience with our digital assistants.
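
To make the role of diarization in memory recall concrete, the following is a minimal sketch, not the e-ViTA implementation. It assumes that an on-device diarizer and speech recognizer have already produced speaker-labeled, time-stamped text segments; the `Segment` type, the `merge_turns` and `recall` helpers, and the sample medical visit are all hypothetical names and data introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    speaker: str  # label assumed to come from a local diarizer
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # transcript assumed to come from a local recognizer


def merge_turns(segments, max_gap=1.0):
    """Merge consecutive segments of the same speaker into single turns."""
    turns = []
    for seg in sorted(segments, key=lambda s: s.start):
        last = turns[-1] if turns else None
        same_speaker = last is not None and last.speaker == seg.speaker
        if same_speaker and seg.start - last.end <= max_gap:
            # Extend the previous turn instead of starting a new one.
            turns[-1] = Segment(
                last.speaker,
                last.start,
                max(last.end, seg.end),
                f"{last.text} {seg.text}",
            )
        else:
            turns.append(seg)
    return turns


def recall(turns, speaker, keyword):
    """Return the turns of one speaker that mention a keyword."""
    kw = keyword.lower()
    return [t for t in turns if t.speaker == speaker and kw in t.text.lower()]


# Hypothetical excerpt of a medical visit (illustrative data only).
visit = [
    Segment("doctor", 0.0, 2.5, "Take one tablet"),
    Segment("doctor", 2.8, 5.0, "every morning after breakfast."),
    Segment("patient", 5.6, 7.0, "For how long?"),
    Segment("doctor", 7.4, 9.0, "For two weeks."),
]

turns = merge_turns(visit)
for turn in recall(turns, "doctor", "tablet"):
    print(f"[{turn.start:.1f}s] {turn.speaker}: {turn.text}")
```

Because everything here operates on data already held on the device, a caregiver could later ask what the doctor said about the tablets without any audio or text leaving the phone. A real system would likely add persistent storage and more robust matching than a keyword test.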
