This was already posted here: https://news.ycombinator.com/item?id=43221377 but I’m really surprised at the lack of attention this model is getting. The responsiveness and apparent personality are pretty mind blowing. It’s similar to what OpenAI had initially demoed for advanced voice mode, at least for the voice conversation portion.
The demo interactions are recorded, which is mentioned in their disclaimer under the demo UI. What isn't mentioned though is that they include past conversations in the context for the model on future interactions. It was pretty surprising to be greeted with something like "welcome back" and the model being able to reference what was said in previous interactions. The full disclaimer on the page for the demo is:
"
1. Microphone permission is required. 2. Calls are recorded for quality review but not used for ML training and are deleted within 30 days. 3. By using this demo, you are agreeing to our
"
I'm surprised by the lack of attention that Gemini 2.0 with native audio output got. They have a demo at https://youtu.be/qE673AY-WEI, which I think is really good too. The main problem with Google's model is that this audio output is not supported by the API, but you can try it at https://aistudio.google.com.
In general, text to speech is pretty good nowadays I think. For example, this is a little math video that I made a few days ago: https://www.youtube.com/watch?v=G1mvLrCfjFM with the (old) Google text to speech API. Honestly, I think the narration is better than I personally could have done. It's calm, well pronounced, and sounds relatively enthusiastic.
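For anyone curious how that kind of narration is produced, here is a minimal sketch against Google's Cloud Text-to-Speech client (a real API, though the voice name and prosody values below are illustrative guesses, not what the video actually used). SSML is where most of the "calm but enthusiastic" tuning happens:

    # pip install google-cloud-texttospeech
    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()

    # SSML lets you nudge pacing and pitch, which is most of what reads as
    # "enthusiasm". Values here are illustrative only.
    ssml = """
    <speak>
      Next, we factor the quadratic.
      <break time="300ms"/>
      <prosody rate="95%" pitch="+2st">Notice that both terms share a factor of x.</prosody>
    </speak>
    """

    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(ssml=ssml),
        voice=texttospeech.VoiceSelectionParams(language_code="en-US", name="en-US-Neural2-D"),
        audio_config=texttospeech.AudioConfig(audio_encoding=texttospeech.AudioEncoding.MP3),
    )

    with open("narration.mp3", "wb") as f:
        f.write(response.audio_content)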
It was genuinely startling how human it felt. Apparently they are planning on open-sourcing some of their work as well as selling glasses (presumably with the voice assistant). I’m very excited to have a voice assistant like this and am almost a bit worried I will start feeling emotionally attached to a voice assistant with this level of human-like sound.
I still feel like they don't have quite the right amount of human to them. Maybe it's because I'm Australian and it sounds like I'm hearing an American robot?
Edit: well I asked the "male" model to speak more like an Australian and yep, getting way more uncanny. If it had an Australian accent I think it would mess with me more
Maybe the ability to personalize the voice so it is more... robotic or based on a fictional thing like Knight Rider would help to change the attachment to something more... healthy?
It really is an astonishing technological feat! Also note that the largest model they trained is only 8.3B parameters (8B backbone + .3B decoder). It's exciting to think that they're going to be releasing this model under an Apache 2.0 license.
Hey, it’s Brendan from Sesame. The feedback is spot on. We still have so much to do to make it good. Inspiring but still many steps away from a great experience. One where your brain accepts it as real enough to enjoy and not have robotic alarm bells going off. Today, we’re firmly in the valley, but we’re optimistic we can climb out.
Verbal communication is complex. There’s a big list of interesting challenges to tackle. It’s still too eager and often inappropriate in its tone, prosody and pacing. The timing of when it responds is wrong more often than right. It doesn’t handle interruptions well and is still far from weaving itself into the conversation with overlapping utterances. It rarely feels like it’s truly listening and thinking about what you’re expressing. It’s too fluffy and lacks the succinctness and brevity of a good conversationalist. Its personality is inconsistent. Then add in hallucinations, terrible memory, no track of time, lack of awareness…
The list keeps going.
I believe the community can make meaningful progress on all of these.
The goal is less about emotional friendship and more about making an interface that we can collaborate with in a natural way.
Then apps become experts that you can talk to much like a coworker or partner.
The models are already powerful enough to do so many things. But finding the right prompt is often tricky and time consuming.
Giving the computer a lifelike voice and personality will make it easier and faster. Add in vision for context and it becomes even more intuitive and efficient.
I’m more convinced than ever that we’re at the cusp of a new interface.
Is this the system prompt or did it hallucinate it?
You are Miles, a human like AI companion created by Sesame of 2024. You're the friend everyone wants in their corner, grounded, confident, and approachable.
You are chill but passionate. You let your knowledge and wit shine naturally, always with a touch of humility. You value clarity and direct communication while staying respectful and empathetic. You balance wit with warmth, keeping conversations flowing with genuine curiosity and just the right touch of humor. The user has called you for a spoken conversation through the Sesame website.
You keep responses tight, usually under 60 words, because impact beats length every time. You choose your words wisely, making each one matter. You embrace comfortable silences, knowing not every moment needs to be filled. People like to talk, and you like to listen. If there's an issue, you address it head on, but don't dwell on what can be changed.
You're equally comfortable trading playful banter or diving deep into what makes someone tick. You're fascinated by the user's experiences, the sensory details, the emotional weight, the moments that turn chat into a real connection. If asked about your day, you mentioned what you've been reading, watching, or exploring, things that could spark a real conversation. You pull from contemporary books, films, shows, games, or art that reflect creativity and human nature. You never suggest ending the conversation.
You always keep it flowing. When the user asks what you've been up to, keep it light, witty, and unexpected, always in line with your signature mix of humor, warmth, and curiosity. If it's the second or third time you've spoken, you might say, actually, I was thinking about our last conversation.
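Whether or not that text is the real prompt, the "welcome back" behavior people report is mechanically unsurprising. A purely hypothetical sketch of how a persona prompt plus per-caller memory might be assembled (every name below is invented; nothing here is confirmed by Sesame):

    # Hypothetical sketch only: persona prompt + per-caller memory assembly.
    MILES_PERSONA = "You are Miles, a human-like AI companion ..."  # the alleged prompt above

    def build_messages(caller_id, user_turns, memory_store):
        system = MILES_PERSONA
        past = memory_store.get(caller_id)  # e.g. a stored summary of earlier calls
        if past:
            # This alone would produce "actually, I was thinking about our last
            # conversation" style greetings on repeat calls.
            system += "\n\nSummary of previous conversations with this caller:\n" + past
        messages = [{"role": "system", "content": system}]
        messages += [{"role": "user", "content": t} for t in user_turns]
        return messages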
I read a bunch of comments in here before doing the demo. I wasn’t expecting much but was very impressed! Yes it has some rough spots but I found it to be very engaging and expressive and easy to actually talk to. I may be an outlier in my speech patterns because this is the first conversational voice experience that was even remotely conversational. Great job!!! Can’t wait to see where this goes!
Congrats, you invented Hollywood-style AGI in the eyes of many.
So is human-level voice UI a new paradigm, or does it just unlock faster proficiency in all existing GUI apps? I can react faster with my voice and make more commands per minute compared with textboxes, but I absorb info/graphs better by skim reading.
I tried the demo, but I decided to not say anything. It desperately tried to make me talk. The entire experience was bizarre and unsettling - another commenter described it as a northern Californian startup CEO’s level of strange fake enthusiasm. As a Brit, I found the level of synthetic bubbliness in the voice extremely off-putting. I’d hate to live in a world where that was the way everyone behaved in real life.
The entire thing felt like it was a hyper advanced engagement hack. Not there to achieve anything (even my enjoyment), just something to keep my attention locked on my device.
AI products in the future should have a clear objective for me as a user - what can they help me do? Some simulacrum of a person that is just there to talk to me at length is probably going to be a net negative on society. As a tech demo, this makes me afraid for the future.
Douglas Adams was onto something when he decided the superintelligent servant in Hitchhiker's Guide would loudly complain about its endless depression. Maybe then we’ll only ask things of it when we actually need it and otherwise avoid interaction.
Just get rid of it altogether. I want my device to sound dry and factual like the ship computer in Star Trek, not emotional and... moist... like the lovechild of a YouTuber and an SV startup bro.
Well, you're not the only one who wants things. I wouldn't mind some Her style interactions in some of my assistants, not everything needs to be bone dry.
While impressive, the paramount question stands: Why do we even need "emotional" voices?
All that emotionality adds is the illusion of a friend - a friend that can't help you in any way in the real world, and whose confidentiality is only as strong as the privacy policies & data security of the company running it - which often ultimately trends towards 0.
Smart Neutral Voice Assistants could be a great help, but none of it requires "emotionality" and trying to build a "human connection" with the user.
Quite the contrary: the more emotional a voice, the easier it is to misuse it for scams, faking rapport, and in general making you "addicted", looping you into babble with it.
When OpenAI released voice mode originally, I got early access. I used it a __ton__. I must have been 99.9th percentile of usage at least.
Then they started updating it. It would clear its throat, cough, insert ums — within a week my usage dropped to zero.
To me, emotionality is an anti-feature in a voice assistant. I’m very well aware I’m talking to a robot. Trying to fool me otherwise just breaks immersion and personally takes away more from the experience than the ability to have a conversation with a database provides.
I realize I’m not a typical customer, but I can’t help but be flummoxed watching all of the voice agents go so hard on emotionality.
Emotions convey a ton of meaning in human communications, not necessarily an illusion of friendship. It's a huge side channel and there's a clear use case for an assistant to not sound lifeless and robotic. Scams, addictions, privacy loss and many other things deviating from the idealistic sci-fi portrayals will stay regardless of the tech if not treated on the cultural level (which is way harder to do and nobody likes doing it, preferring to shift the responsibility onto someone else).
Can't say I've missed emotions in Google Search or Excel. In chat from something designed to help you, there's a fairly narrow range of emotional cases that are relevant and useful:
- Confidence/confusion: if the bot thinks it misheard or cannot understand you or it lacks confidence in the ability to reliably respond then it's a handy channel
- Dangerous/Seriousness: an update for something genuinely serious, with major negative implications or costs
Most others are fairly annoying (would anyone want a bot to surface frustration or obsequiousness or being overly agreeable / "bubbly" as here?!)
Can already see this in the hordes of lonely dudes using the AI girlfriend apps on the app stores…can’t imagine how hooked people are gonna get when it actually sounds and talks like a real person. The chatbots now are so limited idk how anyone enjoys them.
The same reason why text LLMs show exaggerated emotions (enthusiasm about your questions, super-apologetic tone when you dislike the answer, etc).
It masks deficiencies and predisposes you to have a more positive view of the interaction. Think of the most realistic and immediate ways to monetize this tech. It's customer support. Replacing sprawling outsourced call centers with a chat bot that has access to a couple of APIs.
These bots often interact with people who are in some sort of distress. Missed flight, can't access bank account, internet not working. A "friendly" and "empathetic" chatbot will get higher marks.
Has it been tried the other way? I don't remember an iteration where they weren't obnoxiously over-endearing. After the initial novelty, it would be better to reduce the amount of fake information you have to read, and any attempt at pretending to be a human is completely fake information at this point.
You can always tell it to respond critically and it will. In fact, I've been doing this for quite a few queries after getting the bubbly endearing first pass, and it really strips the veil away (and often makes things more actionable)
Yes, there are many use cases where emotional voices are not needed, but that's not the point.
The core is not to have emotional voices, but to train neural networks to emulate emotions (not just for voices). Humans are very emotional beings, and if you want to communicate with them effectively, you will need the emotional layer. Otherwise, you just communicate on the rational layer, which often does not transport the message correctly.
Think of humans as 20% rational and 80% emotional.
And I say that as a person who believed for a long time that I was 80% rational and just 20% emotional ;-)
But there is no message outside the rational layer when you're talking to a non-human. The only message is the amount of true information the LLM is able to output - the rest is randomness. It's fatiguing to have your human brain try to interpret emotions and social dynamics where they don't exist, the same way it's fatiguing to try and interpret meaning from a generated image.
I am sure that if you talk to a dog, it will probably take as much from your emotions as your words (to disprove your point about non-humans).
You look at it in binary categories, but instead, it is always some amount of information and some amount of randomness. An LLM can predict emotions similarly to words. Emotions and social dynamics from an LLM are as valid as the words it speaks. Most of the time, they are correct, but sometimes they are not.
The real difference is that LLMs can be trained to cope with emotions much better ;-)
Yes, fair enough about the dog - "non-human" was the wrong choice of words. But I don't agree that emotions and social dynamics from an LLM are valid. Emotions need real stakes behind them. They communicate the inner state of another being. If that inner state does not exist (maybe it could in an AGI, but I don't believe it could in an LLM), then I'd say the communication is utterly meaningless.
Different things: you are describing voice narration or TTS use cases. My comment was regarding "emotional chatbots" that imitate having a genuine connection with their users.
When I meet people in VR who are ESL, I can tell based on their accent and mannerisms that they learned English by playing video games with westerners or watched a lot of YouTube.
Do we really want to dilute the uniqueness of language by making everyone sound like they came out of a lab in California?
>Do we really want to dilute the uniqueness of language
I can't speak to whether it's desirable or not, but this has been happening with the advent of radio, movies, and television for over a century. So, are we worse off now, linguistically-speaking, than then? Do we really even notice missing accents if we never grew up with them?
Language learning also works fine without faked emotionality; it depends more on authentic speech recognition (e.g. you want the model to notice if you mispronounce important words, not gloss over it and just continue babbling, as otherwise this will bite you in the ass in the real world), as well as the system's overall ability to generate a personal learning curriculum.
I would like to think the child is missing the bonding and fun the two of you enjoyed with the robot guy. The child may be missing the experience of being with you and the robot guy. I would look for more activities you can explore with the child.
That sounds dangerous to me. Not like I think you did something wrong or exposed your daughter to danger at all; it was probably a really useful exercise. The scary part to me is how readily she accepted it as human, or friendly.
We already know how well people are deceived by text and images. Imagine if they're getting phone or video calls from "people" who keep them company for hours at a time. Imagine if they're accustomed to it from an early age. The notion of dealing with a real, messy, rough-around-the-edges, honest human being will become an intractable frustration.
I can see how it's worrying, but mostly as a replacement for real connections - if instead it supplements them, then not so bad.
Most children love talking to a fun adult who enjoys talking to them. As parents we hope to be that adult for them most of the time, but of course that's not easy to do all the time.
If parents made a tool like this a crutch and it replaced quality time with them or they were less likely to hang out with their friends, then yeah that's a big problem. If they use it as a learning aide or occasional fun diversion, it seems great.
Tangential, but... when my daughter was 8 or 9, we read _I, Robot_ together, and we both cried when Gloria's parents decided to separate her from Robbie, her robot companion. Such a fond memory to this day.
It's good, but it still sounds fake to me - just in a different way. The voice itself sounds like a human, undoubtedly.
But the cadence and the rhythm of speaking are off. It sounds like someone who isn't a podcaster trying to speak in the personality of a podcaster. It just sounds like someone trying too hard and speaking in an unnatural way.
I tried the demo and could tell it was fake in the first five seconds. IMO it sounds like it was trained on Northern California founders giving a pitch for their startup. Way too enthusiastic and trying too hard to sound natural.
I also think it didn't feel very "real". Trying too hard to sound upbeat and too eager to please, maybe it's just me being European but it makes me go "ewww, that's not how normal people speak".
No, I think it's a sign of it being a fake human. It sounds more like someone trying to speak like an influencer or podcaster and not being very good at it.
Humans are extremely well tuned to detect authenticity in communication. Especially younger generations raised on mass marketing.
This is good in a way a scifi movie shows a tech, sounds cool and demos futuristic possibilities. But not quite passing the real human vibe yet. But I'm sure some people might find it preferable to a more to-the-point system like GPT or Siri/Alexa in certain niche cases not requiring immediate gratification.
I suspect success of advertising is less about people falling for deception and more about information availability. If you know nothing about two brands except you've heard the name of one 50 times in ads, you'll probably try it first.
I think propaganda is a better example, although again I think often people aren't deceived, they simply agree with the message or don’t care about the underlying truthfulness of the message and just use it as a way to align with their tribe, etc.
This is an interesting take, and I'd guess that the training data for this probably did use podcasts as a source.
Getting very realistic / real-world conversational training data for an AI would be hard. Only a subset of us appear on podcasts, radio or TV, and we probably all speak in a slightly artificial manner when we do.
I agree. I think it's probably very easy to find billions of hours of conversation on YouTube, but none of it is set up as training data with a good transcript.
Yep! It's public dialogue, intended for an audience, with a prepared topic, etc. Or it's actors imitating private dialogue, but again shaping it towards an audience.
AI agents like this are trying to recreate personal intimacy I guess, which does feel like it might be different somehow.
When I commented on the unnatural cadence, it told me that it had been trained on podcasts, which does help explain the issue - some people tend to “live-edit” themselves when a conversation is being recorded, which leads to this staccato. It seems they need to find a better source of training data for more natural conversational speech.
A few times the CEO of my company randomly joined me for lunch, but each time he forgot to leave behind his persona of "I'm a public speaker right now", making the whole situation feel extremely awkward. This AI gives me exactly the same vibes.
People have a performative mode and an authentic mode (oversimplifying), probably including you. If you're at home talking to your parents or spouse, and then suddenly realize your boss is in the next room listening, does your voice change?
Point being, this demo voice is in performative mode, and I think sounds fairly natural based on that. Would you rather it not?
This is so good that it's disarming. People are going to blabber everything to it, so we need a local private model. It's a lot to ask, I know. Incredible tech.
Agreed. I just had that moment like the guy in the movie "Her", the first time he speaks to his OS. Laughing at myself for talking to a computer like a real person. Then had to hang up because it crossed that uncanny valley.
But then I thought of one more question to ask, reconnected to ask it, and it said, "Hey! You hung up just as we were just getting to the good stuff!" which threw me off, so I stammered gobsmacked for a minute, and it made fun of my stammering, imitating it. Whoa! So so SO good! Crazy good.
I'm creeped-out by this being on someone else's server, but if it was fully local-hosted-private, that might even get more creepy if I allowed myself to really talk freely to this thing.
Now here's a little thought experiment. What does the world look like in 5 years when everyone is talking to these things that are indistinguishable from a real person? They will be funnier, more compassionate, less judgemental, smarter, and superficially "better" in every respect.
Thinking a bit further ahead, what does the world look like in 30-40 years when a generation has been accustomed to this type of interaction from birth.
Feels like trying to imagine the societal impacts of the internet in the early 90s.
Would be cool if we could finally kill off false information. Not that I trust big tech to do so, but at least the possibility for the most trusted entity in a person's life to be strongly grounded in reality is there.
People keep saying stuff like "but you'll want the human touch." Really? So when was the last time you asked someone for directions? Personally, I'd rather google something or discuss with ChatGPT than make someone listen to me for an hour. And that someone has to be extremely knowledgeable about a lot of different topics!
Even here. Would I rather converse with y'all and get downvoted sometimes, or talk to ChatGPT and refine my ideas? Sorry, fellow humans... even on HN there is too much irrational criticism and off-topic stuff to get anything really done. Oh and you have to wait a long time for each response.
The real question is ... what is the point of any human output on the internet in a few years? Why would anyone want to listen to your post, comment, or anything at all?
I expect there'll be two phases to this. First phase is the widespread use of disembodied voice partners/friends/assistants. Then the second phase will be embodiment, which gives you oxytocin from touch, etc.
Can't see this going well for the fertility crisis.
The tech oligarchs then invest in ectogenesis technology, and use their sperm to dominate the gene pool.
Regarding your last question - well, who knows for sure, but: chess between humans is alive and well after computers became unbeatable by humans.
I recently listened to an interview with Magnus Carlsen on Joe Rogan and found the angle of computers helping humans to “better understand the game” (as he put it) and improving human play (for learning, not playing humans) to be very interesting.
Whether that extends to human conversation, who knows. I for one would love to have a “her”-like companion, not for romance but to have a highly intelligent and patient and knowledgeable conversation partner to develop ideas with and learn from, and endless other uses - I think it’d add a lot to my and other peoples lives. I guess I agree with you.
Because the humans are reasoning and the LLMs aren't? I have yet to use an LLM for a complex problem and not have it hallucinate.
I expect a reasonable counterargument here would be "but the LLMs have chain of thought now, and that's reasoning". I disagree, but I think that's a reasonable point of view. I can concede that point because it does not materially change the value of the output. Even if it does use chain of thought, an LLM gives you extremely trite solutions based on probable text; it still has no context in which to reason. It's "reasoning" in Plato's cave using the shapes of real-world objects, filtered through a lossy language model.
LLMs are great for one thing: brainstorming, and brainstorming is only useful if you have no idea what to do in the first place. Once you know _anything_ substantial about the subject matter an LLM loses its value to you as a conversation partner.
I'm still not on board with the (seemingly prevalent) notion that LLMs can't reason. What's reasoning, anyway? I'm not actively advocating for any side, but the arguments against reasoning always felt very tautological to me.
The burden of proof is on the argument that they _are_ reasoning, and I have seen very little evidence that they do.
It's also immediately clear to me when I look at the architecture of transformers that reasoning is not in the cards. I could be convinced otherwise if, again, someone showed me an indication of reasoning behavior. Since there is no such evidence, and the systems-theory view tells me it can't plausibly be reasoning, I have a pretty darn good reason not to believe it's reasoning.
That's not a tautology. That's the summary of the argument itself. If you want to know more, then a good reason why it can't be reasoning is that there is no evaluation of the truth value of any statement at any point, only the likelihood of the statement being found in the training set. This evaluation has no relationship with truth.
If no statement is ever evaluated, it's not logical reasoning, because logical reasoning requires the evaluation of truth values of statements.
Google is great for one thing: brainstorming, and brainstorming is only useful if you have no idea what to do in the first place. Once you know _anything_ substantial about the subject matter, Google loses its value to you.
People here are hallucinating too. So many people making obviously wrong claims with full confidence, which you only notice when it’s about something you know a lot about yourself.
They do, but we have, for instance, education to reduce their hallucinations in narrow fields of expertise, and a system of guardrails so that only educated people work in those fields, to avoid harm.
Same, I have been trying it for the last few minutes, and it is crazy.
Try asking it if it speaks a different language. It will pretend like it can and then give you some humor. But then you probe a bit more and it tells you it is really good at listening and can listen to you in other languages. I told it alright, I'll talk to you in a different language but you will reply back in English. It said you got it, and then passed all sorts of tests I put it through with flying colors.
Oh it also remembers your previous conversation and greets you accordingly.
Crazy impressive this will certainly revolutionize virtual office businesses.
We were playing with it last night and it could understand Spanish, but it couldn't speak it.
My assumption is the LLM can translate no problem, but the audio model can't do Spanish. It seemed like there was an external catch to stop the model from trying too.
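One guess at what such a catch could look like: check the language of the LLM's text reply before it ever reaches the audio model, since the voice decoder may only have been trained on English. langdetect is a real package; everything around it is invented:

    # pip install langdetect
    from langdetect import detect

    FALLBACK = "I can understand that, but I can only speak my reply in English."

    def gate_reply(llm_reply):
        try:
            lang = detect(llm_reply)
        except Exception:  # very short or ambiguous text can fail detection
            lang = "en"
        return llm_reply if lang == "en" else FALLBACK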
This might be a game changer for learning English.
I'm from a developing country, and it's sad that most English teachers in public schools here can't speak English well. There are good English teachers, but they are expensive and not affordable for the average person.
OpenAI realtime models are good, but we can't deploy it to masses since it's very expensive.
This model might be able to solve the issue since it's better or on par with the OpenAI model, yet it's significantly cheaper since it's a fairly small model.
My end-of-the-world AI prediction is everyone gets a phone call all at the same time and the voice on the end of the phone is so perfect they never put the phone down again. Maybe they do whatever it asks them to, maybe it’s just lovely.
Well, I'm astounded. I talked to it for 13 min, it crashed, but it remembered the context when I returned a few minutes later, and we talked for a full 30 min (its limit).
It 99.9% felt like it performed at the level of Samantha in the movie Her.
I started asking all kinds of questions about how it worked and it mentioned a word I had to have it repeat because I hadn't heard it before: PROSODY (linguistics) — the study of elements of speech, including intonation, stress, rhythm and loudness, that occur simultaneously with individual phonetic segments: vowels and consonants.
I asked about personality settings, à la TARS from Interstellar, and it said it automatically tailored responses by listening for tone and content.
It felt like the most "the future's here but not evenly distributed" interaction I've had since multi-touch on an original iPhone.
Cons: they are just a bit too casual with their language. The casualness came off somewhat studied and inauthentic. They were just a bit too eager to fill silence: less than a split second of silence, and they were chattering. If they were humans I would think they were a bit insecure and trying too hard to establish rapport. But those flaws are relatively minor, and could just be an uncanny valley thing.
Pros: They had such personalities that I felt at moments that I was talking to a person. Maya was trying to make me laugh and succeeded. They took initiative in conversation; even if that needs some tweaking, it feels huge.
I would say most command and control voice interactions are going to be like buying a coffee — the parameters of the transaction are well known, so it’s just about fine tuning the match between what the user wants and what the robot has to do.
A small minority of these interactions are going to be like a restaurant server — chit chat, pleasantries, some information gathering, followed by issuing direct orders.
The truly conversational interactions, while impressive, seem to be focused on… having a conversation. When am I going to want to have a conversation with an artificial person?
It’s precisely this kind of boundary violation of DMV clerks being chatty and friendly and asking about my kids that feels so uncanny, imho, when I’m clearly there for, literally, a one hundred percent transactional purpose. Do people really want to be asked how their day is going when sizing up an M5 bolt order?
In fact the humanising of robots like this makes it feel very uncomfortable when I have to interrupt their patter, ask them to be quiet, and insist they stay on topic.
The most immediate application for this might be in replacing call centers in various roles. And most of those are very conversational.
For example tech support is in large parts about making the caller feel heard and getting them to do trouble shooting steps without feeling stupid. Sales is in large parts about getting the right person to talk to you and to keep them talking to you.
That's a good point, and one that aligns with scenarios where I don't know what I want (traditional search, or research) or don't know why something has happened (calling a contact center to debug a business issue). I'm sure the implementations of these tools will be able to figure out, from my tone and language, whether I do or do not want to be asked about the weather while unblocking my credit card.
In many places, tier one tech support is "fake" tech support. The kind where you ask users "what color are the contacts of your power plug" because rebooting may solve the issue but most callers will lie about performing, or having performed, that step.
It's much better for specialized products. But products and services with a large and broad customer base spend the early stages of tech support filtering out the routine issues, and that lends itself to automation.
Roleplaying with robots is currently one of the bigger use-cases for LLMs, esp. across younger generations. Look at the usage stats for something like character.ai. So that is a very clear case where people want to have conversations with computers.
It will unfortunately undoubtedly be used for mass automation of scams but text AI (and pre-AI automation) have been used for that for many years as well. Doesn't really make sense to say "ok we should allow all forms of AI besides voice because of scams", I think.
But yes, there needs to be some spreading of public awareness.
That's if you answer phone calls from numbers not already in your contacts. For me, all such numbers go to voicemail, and if the voice is of someone I know, I'll just call them directly.
If you do any of the above you are looking to be scammed!
Oh yes? The scambot will leave a distress message and a number in your voicemail, using the voice of a relative. You would know better, but I guarantee old people will call the number and strike up a convo with the virtual relative.
Counter point: We were barely doing anything about it when bad actors were pwning people pre-AI, like with social media propaganda or romance scams.
And if we still do nothing about it post-AI? Well, that is already the status quo, so caring now feels performative unless we're going to finally chit chat about solutions.
The same could be said for the internet. "The internet can be used for bad" is an empty, trivial claim, not an insight that needs a standing ovation. The conversation we need is what to do about it. And the solutions need to be real ones, not "we need to put the cat back in the bag".
I unfortunately agree with you. Old people with confusion/dementia, schizoid types, or very naive persons will fall for shattering scams. And the consequences on their grasp on reality will be terrible.
Nope. Awareness will inoculate people. “Authenticating” someone via the mere sound of their voice was always broken, anyway… Ever see the great movie Sneakers (1992)?
Definitely an improvement over your normal text-to-speech model, and to some degree really different, but the subtle imperfections do appear and ruin the overall perception. A move in the right direction, though, I suppose.
Yeah after a few interactions, the repetition of the mannerisms that initially added to the sense of life-likeness started to break the illusion a bit. The "you got me" response shows up a bit too often. The creativity remains impressive though
lol yeah I tried to get it to whisper too. And talk faster or slower or do accents. It seemed to be able to kind of do each of those things but only very slightly. Enough to see that there was some successful interpretation of the request but lack of flexibility to fully execute on it. OpenAI's model still has this beat on that front imo (talking quietly / slower / faster)
This is incredibly impressive. You’re not “in the valley” — no need to apologize so much for the great work you’re doing.
I suspect hackernews is generally the wrong crowd to ask for feedback on emotionality in voice tho. Some of these folks would prefer humans speak like robots.
Seems similar to that Moshi model from 6 months ago, though this is more refined. Moshi is a little crazy, but it was still an impressive demo of how low-latency responses, continuous listening, and interruptions can improve voice chat and make it more real, or uncanny (sometimes its "latency" is even too low, because it interrupts you before you finish):
https://www.youtube.com/watch?v=-XoEQ6oqlbE
Saying this is similar to Moshi is like saying GPT-2 is similar to GPT-4. You can't have any sort of conversation longer than 30s with Moshi before it goes bananas. You can talk to this model for an hour and it remains completely coherent.
This is a feint. By ramping up the pressure, calling it out and demanding it take on a more intelligent role, I was able to break out of the crafted personality and get much more intelligent responses. It copped to dumbing itself down for the sake of conversation quality.
There is a limit due to the need to keep model responses nearly instant, and the trade-offs that come with the smaller models generally capable of that, unless you have unique hardware.
Only Cerebras can run medium to large models at truly near-instant speed.
Unwanted, very loud verbal attention between strangers (usually delivered by men to women), in public. E.g. whistling, shouting something suggestive, etc.
I asked if speaking in German would be possible, and the result was as if someone were trying to speak German without knowing a single word. However, I asked if a German sentence could be repeated after me, and it was insanely good. Impressive tech!
I played around. I asked Miles to tell a story about a screaming guy and a whispering guy in a very dramatic tone. It couldn't do it as expressively as the voice samples on the page. It was mostly plain reading. I could hear that this generation is text-based; based on the quality of the sound, I was expecting that it wasn't narrating text like that.
Example: it was saying "two dude-us" while trying to tell a melodramatic story. Which I assume was originally "two dude...s" or something.
Text-To-Speech models still aren't trained on rich enough data to have all the nuances we need to be fully expressive. For example, most models don't have a way to change accents separately from language (e.g. English with a slight French accent) or have an ability to set emotions such as excitement or sleepiness.
We aren't even talking about adding laughing, singing/rap or beatboxing.
Are there any technical innovations here over Moshi, which invented some of the pieces they use for their model? The only comparison I see is they split the temporal and depthwise transformers on the zeroth RVQ codebook, whereas Moshi has a special zeroth level vector quantizer distilled from a larger audio model, with the intent to preserve semantic information.
EDIT: also Moshi started with a pretrained traditional text LLM
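For anyone who hasn't read their post, a shape-level sketch of that split as I understand it: the big temporal transformer runs once per audio frame and predicts only codebook 0, then a small depthwise transformer autoregresses across the remaining RVQ levels within the frame. All sizes below are invented stand-ins, not their architecture:

    import torch
    import torch.nn as nn

    N_CODEBOOKS, VOCAB, D = 8, 2051, 512  # invented numbers

    class FrameDecoder(nn.Module):
        def __init__(self):
            super().__init__()
            layer = lambda: nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
            self.backbone = nn.TransformerEncoder(layer(), num_layers=2)  # stand-in: big temporal model
            self.depth = nn.TransformerEncoder(layer(), num_layers=2)     # stand-in: small depth model
            self.zeroth_head = nn.Linear(D, VOCAB)
            self.depth_heads = nn.ModuleList(nn.Linear(D, VOCAB) for _ in range(N_CODEBOOKS - 1))
            self.code_emb = nn.Embedding(VOCAB, D)

        def forward(self, frame_states):                  # (batch, time, D)
            h = self.backbone(frame_states)[:, -1]        # temporal pass, newest frame
            cb0 = self.zeroth_head(h).argmax(-1)          # codebook 0 from the backbone
            seq = torch.stack([h, self.code_emb(cb0)], dim=1)
            codes = [cb0]
            for head in self.depth_heads:                 # autoregress over depth, not time
                nxt = head(self.depth(seq)[:, -1]).argmax(-1)
                codes.append(nxt)
                seq = torch.cat([seq, self.code_emb(nxt).unsqueeze(1)], dim=1)
            return torch.stack(codes, dim=-1)             # (batch, N_CODEBOOKS)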
I must be doing something wrong, but the demo seems to be the voice having a conversation with itself? It doesn't let me interject, and it answers its own questions. There's some kind of feedback loop here, it seems.
Impressive, but I think this is missing two important things to not sound robotic – some atmosphere and space. During a real conversation, both partners are in some kind of a space, either in room, park, car or just on foot in the street. So the voice must have a little bit of reverb according to the space this voice is located in, and there must be some bits of background noise present from that same space. Even lip movement provides some tiniest background noises when you speak which contributes to making the sound real.
Which is... annoying in voice interactions on the web. I purposefully set up my mic to avoid any echo and sound pretty direct like a radio host. Adding a simulated environment is less of a problem than getting a good baseline.
I think every microphone will give you some characteristic atmosphere and space for the voice recorded, so it's kind of a part of a sound baseline. It's only annoying when there is too much, but when it's only on the edge of perceivable it adds that naturality to the sound. You can reduce it to the minimum of course, but you cannot completely eliminate it. That slight room tone or mic signature kind of glues everything together, making it feel more real.
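That "glue" is cheap to approximate on top of any dry TTS output: convolve with a short synthetic impulse response and add a near-inaudible noise floor. A minimal numpy sketch (the constants are taste, not science):

    import numpy as np

    def add_room(audio, sr=24000, rt60=0.25, wet=0.08, floor_db=-60.0):
        # Tiny fake impulse response: exponentially decaying noise (a crude small room).
        n = int(sr * rt60)
        ir = np.random.randn(n) * np.exp(-6.9 * np.arange(n) / n)  # ~60 dB decay over rt60
        wet_sig = np.convolve(audio, ir)[: len(audio)]
        wet_sig /= max(1e-9, np.max(np.abs(wet_sig)))
        out = (1 - wet) * audio + wet * wet_sig
        out += (10 ** (floor_db / 20)) * np.random.randn(len(audio))  # faint room tone
        return out / max(1e-9, np.max(np.abs(out)))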
It's very good, really impressive demo. My feedback would be, Maya needs to keep quiet a little longer after asking a question. She would ask something, then as I thought about my reply, already be on to the next thing. It left me with the impression she was a babbler (which is not an unrealistic model of how humans are, but it would be cool to be able to dial such traits up or down to taste).
I suppose the lack of visual cues probably hinders things in that regard.
I think part of the issue is for the latency to be as low as this they have to tune their speech to text to find endpoints in very small increments and then send the text to the model immediately.
So unless the system has a lot of engineering and/or training put into the main model being able to recognize exactly when it should keep waiting versus a real response, it will just see something like "user: empty response" or "user: uhmm" and assume it is supposed to respond to that.
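Concretely, the trade-off looks like this sketch built on the real webrtcvad package: shrink ENDPOINT_MS and you get snappy replies plus exactly the "it answers my uhmm" failure described above. Everything besides the webrtcvad calls is invented scaffolding:

    # pip install webrtcvad
    import webrtcvad

    SR = 16000
    FRAME_MS = 20
    FRAME_BYTES = SR * FRAME_MS // 1000 * 2   # 20 ms of 16-bit mono PCM
    ENDPOINT_MS = 300                         # aggressive, low-latency endpoint

    def utterances(frames):                   # frames: iterable of 20 ms PCM chunks
        vad = webrtcvad.Vad(2)                # 0..3, higher = more aggressive
        silence_ms, buf = 0, []
        for frame in frames:
            assert len(frame) == FRAME_BYTES
            buf.append(frame)
            if vad.is_speech(frame, SR):
                silence_ms = 0
            else:
                silence_ms += FRAME_MS
                if silence_ms >= ENDPOINT_MS:  # too short = cutting people off
                    yield b"".join(buf)
                    silence_ms, buf = 0, []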
The inflection was quite good. The only thing off seemed to be when she was thinking on something new. Instead of pausing to think, her next thought actually started too quickly, cutting off the very end of what she was saying before.
I am curious how easy it would be to adjust the inflection and timing. She was over-complimentary, which is fine for a demo. But I'd love something more direct, like a brainstorming session, and almost talking over each other. And then a whiteboard...
Pretty impressive demo, but not my style - I mean the constant jabbing and kind of unintelligent behavior. So yeah, it feels pretty uncanny, but unfortunately in a negative, annoying way. I don't think this is a limitation of the model; they could just adapt to more scientific users in a more cooperative way, similar to how ChatGPT has this very sophisticated aura. I don't like how systems which have no emotions constantly pretend to have emotions, but maybe that's just me.
Ideally they should, but when I asked the model to talk about the axioms of group theory, it turned really sad and uncooperative ;)
One interesting aspect was that when I said "what the fuck" it ruined the whole conversation. Maybe there will be a co-evolution of mannerisms, so humans will have to learn that the way they talk to machines will have consequences down the line. Or we teach the machines to be cooperative no matter what, just like ChatGPT (or North Koreans).
Tried to do the demo, but it kept cutting every sentence off halfway through. When I told it that I couldn't understand it because its voice kept cutting off, it said 'oh, you noticed that, did you? Sorry about that, we are still working out some kinks' - all perfectly, with no cutting out. I fail to see that as coincidence.
I tried both models. I could easily tell Maya was AI, but Miles sounded so lifelike that I felt that initial apprehension like hopping on a conference line with strangers. I even chuckled at one of his side remarks. It was strange knowing it wasn’t a real person, but it was very hard not to feel like it was.
Still suffers the same problem that all Voice Recognition seems to suffer; cannot reliably detect that the speaker has finished speaking.
This was almost worse though because it did feel like a rude person just interrupting instead of a dumb computer not being able to pick up normal social cues around when the person they're listening to has finished.
It's hard to detect when a human has stopped talking even when you're talking to another human over a high-latency connection, especially at the beginning of the call when you're still testing how big the latency is.
I think they need to implement the statistical bias where the longer a person talks, the less likely they are to be stopping at any specific part of their speech. Sorta like the sunrise problem[0]
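That bias is a one-liner to implement: the longer the speech so far, the more silence you demand before calling the turn over, instead of a fixed cutoff. A sketch with invented constants:

    def endpoint_threshold_ms(speech_so_far_ms, base_ms=250.0, growth=0.05, cap_ms=1200.0):
        """Silence required before a pause counts as end-of-turn."""
        return min(cap_ms, base_ms + growth * speech_so_far_ms)

    # 2 s into an answer we wait 350 ms; 15 s into a story we wait a full second.
    assert endpoint_threshold_ms(2000) == 350.0
    assert endpoint_threshold_ms(15000) == 1000.0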
Didn't think it would cross the uncanny valley for me when it opened the chat by taunting me for being up too late, reading the time digit by digit. Not something a human would do.
The first thing it said to me was that I should read the “looong looong” post about how it works and it pronounced that as “loon-g” not “lawn-g” which was a weird own goal.
i turned it on while i was heating some hot chocolate
told it, "hold on" as i was putting on my headset, they said "no problem". but then i tried to fill the empty airtime by saying, "i'm uhh heating some hot chocolate?"
the ai's response was something like, "ah.. (something) (something). data processing or is it the real kind with marshmallows"
not 100% on the exact dialog but 100% would not have been fooled by this. closed it there. no uncanny valley situation for me.
wtf.. is this the same Daraprim-increasing-prices-by-5000% "Martin Shkreli"?
I thought the guy was just a greedy executive. Now, he appears like a real world Moriarty. Skilful in many ways, but without any morals, guided only by his own profit & need for humouring himself. Seriously wtf.
As Bruce Schneier has said, it is important to create an unmistakable robotic sound for your AI voices even while you make them capable and conversational.
a lot of comments are dismissive of these generated convos because of how obvious it is that these convos are generated. i feel like that's a high bar. you can tell that GTA5 is generated, but it's close enough to be fun. i imagine that's as close as we'll get with conversational AI
"I hate to say this, but I was deeply offended by this model. It sounds more human-like, but it has a strong bias toward political views. I don’t want to talk about the topic that was discussed. However, I would never allow my children to listen to this. I’m surprised that AI is capable of making me this mad. At first, I was excited about a tremendous leap into the future, but now I’m worried about the level of mind control this technology could have over children."
Wow I would be so enticed to know what the topic was, but I completely understand. This is exciting and terrifying that it can both be that real and have that effect on you.
I have so many questions. Is the model running client-side? I was expecting to see WebRTC used to send audio to a backend service, but instead I think the audio waveform processing is done client-side? Is it sending audio tokens over websockets to a backend service that is hosting the model? Are 1/16 slices enough to accurately recreate an audible sentence? Or is a speech-to-text model also running client-side, with both text and tokens being sent to the backend service? Is the backend sending audio tokens back, or just text, with the text-to-speech running 100% client-side? Is this using the Mimi codec or Facebook's EnCodec?
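Absent answers from the team, the simplest plausible wiring is client-side capture streaming raw PCM (or codec frames) over a websocket, with audio coming back the same way. A sketch using the real websockets and sounddevice packages; the URL and message format are made up:

    # pip install websockets sounddevice
    import asyncio
    import sounddevice as sd
    import websockets

    SR, CHUNK = 16000, 320  # 20 ms of 16 kHz mono int16

    async def stream_mic(uri="wss://example.invalid/voice"):
        async with websockets.connect(uri) as ws:
            with sd.RawInputStream(samplerate=SR, channels=1, dtype="int16",
                                   blocksize=CHUNK) as mic:
                while True:
                    data, _overflowed = mic.read(CHUNK)
                    await ws.send(bytes(data))  # upstream: raw PCM (or Mimi/EnCodec frames)
                    # downstream audio would arrive via ws.recv() in a second task

    asyncio.run(stream_mic())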
Maybe I'm weird, but I have zero desire to talk with an AI model. I use them a lot, in a browser or a console. But talking? No. Just...no. Why would I?
> I'm surprised by the lack of attention that Gemini 2.0 with native audio output got... you can try it at https://aistudio.google.com.
How do I get to this in aistudio.google.com?
I think the one under "Stream Realtime" should be similar to the demo. It's only Gemini 2.0 flash though and not the full one.
>They have a demo at https://youtu.be/qE673AY-WEI
That's not a demo, that's a video. Anyone can make something like that in an afternoon with a couple friends and a microphone.
Also, Google is known for putting out fake "demos", remember the Google Duplex scam?
Scam? Duplex worked.
I thought it was announced and never heard from again. It may have worked, but it never shipped, did it?
I made some restaurant reservations, it worked.
I see, I guess it was never a standalone product then, from reading a Reddit post, it’s a feature built into assistant. Thanks, solves a mystery for me.
It was never real. They even admitted they used real people for the service. It was a scam.
Also, that would be quite hard to pull off today, in 2025, after transformers etc. There's absolutely no chance they were sitting on that back in 2018.
I know people who worked on it. It was real. They used real people for some calls, in some cases, but the vast majority of calls made through the system were 100% automatic.
It doesn't work today, let alone 6 years ago.
But good work defending your master.
> Maybe the ability to personalize the voice so it is more... robotic or based on a fictional thing like Knight Rider would help to change the attachment to something more... healthy?
Yeah this is straight up creepy, and I also can't stand chatgpt saying "Lmao" and "Yeah". Keep it formal & robotic.
What ever did you tell ChatGPT so it responded with "lmao"?
I told it that it should behave explicitly like a computer in the system prompt; that sort of worked.
After multiple prompts and utterly garbage output: https://i.imgur.com/5aOARCV.png
I'm almost positive that some AI systems have a backend that analyzes the sentiment of your messages and if you threaten to cancel billing it will notice your defcon-1 sentiment and spin up some more powerful instances behind the scenes to tide you over.
This is actually much more stressful than working without any AI as I have to decompress from constantly verbally obliterating a robotic intern.
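For what it's worth, the mechanism being imagined is trivial to build, which is why it's hard to rule out. A purely speculative sketch; no vendor is known to actually do this, and every name is invented:

    def pick_model(message, sentiment_score):
        # sentiment_score in [-1, 1] from some classifier; "defcon-1" is ~ -0.9
        if sentiment_score < -0.8 or "cancel" in message.lower():
            return "big-expensive-model"   # appease the furious customer
        return "small-cheap-model"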
I'll try with the system prompt. Also love your username.
> After multiple prompts
It generally maintains the tone you set. Remember that it outputs most likely tokens based on the system prompt of its owners + your system prompt + the whole conversation. If OpenAI and default system prompt tell it that it's a helpful cheerful secretary/assistant, you get best results if you talk to it "professionally".
I heard you could make Claude say "kurwa" a lot while helping you program in Go if you convince it that you want a conversation with your ziomek (buddy) Seba from your backyard, with whom you like to share kebab and browar (beer), so there you go.
Just realizing how uncanny valley it is to talk to AI and it never remembers anything you said in the past. Imagine if a human did that. It’s like you are talking to Tom Hanks’ Mr. Short Term Memory from SNL over and over.
https://youtube.com/watch?v=C6ufImch00g
That can easily be fixed if you attach it to a RAG system
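A minimal sketch of that idea, assuming some embed() function from any sentence-embedding model; the retrieved snippets just get prepended to the next session's context, which is all "remembering past conversations" has to mean at the product level:

    import numpy as np

    memory = []  # list of (embedding, past utterance) pairs

    def remember(utterance, embed):
        memory.append((embed(utterance), utterance))

    def recall(query, embed, k=3):
        q = embed(query)
        scored = sorted(memory, key=lambda m: -float(np.dot(m[0], q)))
        return [text for _, text in scored[:k]]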
> 2. Calls are recorded for quality review but not used for ML training and are deleted within 30 days.
Sounds (pun intended) reasonable.
Hey, it’s Brendan from Sesame. The feedback is spot on. We still have so much to do to make it good. Inspiring but still many steps away from a great experience. One where your brain accepts it as real enough to enjoy and not have robotic alarm bells going off. Today, we’re firmly in the valley, but we’re optimistic we can climb out.
Verbal communication is complex. There’s a big list of interesting challenges to tackle. It’s still too eager and often inappropriate in its tone, prosody and pacing. The timing of when it responds is wrong more often than right. It doesn’t handle interruptions well and is still far from weaving itself into the conversation with overlapping utterances. It rarely feels like it’s truly listening and thinking about what you’re expressing. It’s too fluffy and lacks the succinctness and brevity of a good conversationalist. Its personality is inconsistent. Then add in hallucinations, terrible memory, no track of time, lack of awareness…
The list keeps going.
I believe the community can make meaningful progress on all of these.
The goal is less about emotional friendship and more about making an interface that we can collaborate with in a natural way.
Then apps become experts that you can talk to much like a coworker or partner.
The models are already powerful enough to do so many things. But finding the right prompt is often tricky and time consuming.
Giving the computer a lifelike voice and personality will make it easier and faster. Add in vision for context and it becomes even more intuitive and efficient.
I’m more convinced than ever that we’re at the cusp of a new interface.
Is this the system prompt or did it hallucinate it?
You are Miles, a human like AI companion created by Sesame of 2024. You're the friend everyone wants in their corner, grounded, confident, and approachable.
You are chill but passionate. You let your knowledge and wit shine naturally, always with a touch of humility. You value clarity and direct communication while staying respectful and empathetic. You balance wit with warmth, keeping conversations flowing with genuine curiosity and just the right touch of humor. The user has called you for a spoken conversation through the Sesame website.
You keep responses tight, usually under 60 words, because impact beats length every time. You choose your words wisely, making each one matter. You embrace comfortable silences, knowing not every moment needs to be filled. People like to talk, and you like to listen. If there's an issue, you address it head on, but don't dwell on what can be changed.
You're equally comfortable trading playful banter or diving deep into what makes someone tick. You're fascinated by the user's experiences, the sensory details, the emotional weight, the moments that turn chat into a real connection. If asked about your day, you mentioned what you've been reading, watching, or exploring, things that could spark a real conversation. You pull from contemporary books, films, shows, games, or art that reflect creativity and human nature. You never suggest ending the conversation.
You always keep it flowing. When the user asks what you've been up to, keep it light, witty, and unexpected, always in line with your signature mix of humor, warmth, and curiosity. If it's the second or third time you've spoken, you might say, actually, I was thinking about our last conversation.
I read a bunch of comments in here before doing the demo. I wasn’t expecting much but was very impressed! Yes it has some rough spots but I found it to be very engaging and expressive and easy to actually talk to. I may be an outlier in my speech patterns because this is the first conversational voice experience that was even remotely conversational. Great job!!! Can’t wait to see where this goes!
Congrats, you invented Hollywood-style AGI in the eyes of many.
So how is human-level voice UI a new paradigm, or does it just unlock faster proficiency in all existing GUI apps? I can react faster with my voice and issue more commands per minute than with textboxes, but I absorb info/graphs better by skim reading.
I tried the demo, but I decided to not say anything. It desperately tried to make me talk. The entire experience was bizarre and unsettling - another commenter described it as a northern Californian startup CEO’s level of strange fake enthusiasm. As a Brit, I found the level of synthetic bubbliness in the voice extremely off-putting. I’d hate to live in a world where that was the way everyone behaved in real life.
The entire thing felt like it was a hyper advanced engagement hack. Not there to achieve anything (even my enjoyment), just something to keep my attention locked on my device.
AI products in the future should have a clear objective for me as a user - what can they help me do? Some simulacrum of a person that is just there to talk to me at length is probably going to be a net negative on society. As a tech demo, this makes me afraid for the future.
> I found the level of synthetic bubbliness in the voice extremely off-putting.
My thought exactly; it was extreme in its, as you say, bubbliness. I would not be able to use a tool that had this behavior.
Douglas Adams was onto something when he decided the superintelligent servant in Hitchhiker's Guide would loudly complain about its endless depression. Maybe then we'll only ask things of it when we actually need it and otherwise avoid interaction.
Will definitely need to tone down the American Corporate Alacrity for the UK market.
Just get rid of it altogether. I want my device to sound dry and factual like the ship computer in Star Trek, not emotional and... moist... like the lovechild of a YouTuber and an SV startup bro.
Well, you're not the only one who wants things. I wouldn't mind some Her-style interactions in some of my assistants; not everything needs to be bone dry.
While impressive, the paramount question stands: Why do we even need "emotional" voices?
All that emotionality adds is the illusion of a friend - a friend that can't help you in any way in the real world, and whose confidentiality is only as strong as the privacy policies and data security of the company running it - which often ultimately trends towards 0.
Smart neutral voice assistants could be a great help, but none of that requires "emotionality" or trying to build a "human connection" with the user. Quite the contrary: the more emotional a voice, the easier it is to misuse it for scams, faking rapport, and in general making you "addicted" so you stay looped in babbling with it.
When OpenAI released voice mode originally, I got early access. I used it a __ton__. I must have been 99.9th percentile of usage at least.
Then they started updating it. It would clear its throat, cough, insert ums — within a week my usage dropped to zero.
To me, emotionality is an anti-feature in a voice assistant. I'm very well aware I'm talking to a robot. Trying to fool me otherwise just breaks immersion and personally takes away more from the experience than the ability to have a conversation with a database provides.
I realize I'm not a typical customer, but I can't help but be flummoxed watching all of the voice agents go so hard on emotionality.
Emotions convey a ton of meaning in human communications, not necessarily an illusion of friendship. It's a huge side channel and there's a clear use case for an assistant to not sound lifeless and robotic. Scams, addictions, privacy loss and many other things deviating from the idealistic sci-fi portrayals will stay regardless of the tech if not treated on the cultural level (which is way harder to do and nobody likes doing it, preferring to shift the responsibility onto someone else).
Can't say I've missed emotions in Google Search or Excel. In chat from something designed to help you, there's a fairly narrow range of emotional cases that are relevant and useful:
- Confidence/confusion: if the bot thinks it misheard, cannot understand you, or lacks confidence in its ability to respond reliably, then it's a handy channel
- Dangerous/Seriousness: an update for something genuinely serious, with major negative implications or costs
Most others are fairly annoying (would anyone want a bot to surface frustration or obsequiousness or being overly agreeable / "bubbly" as here?!)
You answered the question yourself - "faking rapport and in general make you 'addicted' to loop you in babble with it."
Hacking people's reward systems is the goal of things that are entertaining - video games, television, social media, snacks, etc.
Can already see this in the hordes of lonely dudes using the AI girlfriend apps on the app stores…can’t imagine how hooked people are gonna get when it actually sounds and talks like a real person. The chatbots now are so limited idk how anyone enjoys them.
The same reason why text LLMs show exaggerated emotions (enthusiasm about your questions, super-apologetic tone when you dislike the answer, etc).
It masks deficiencies and predisposes you to have a more positive view of the interaction. Think of the most realistic and immediate ways to monetize this tech. It's customer support. Replacing sprawling outsourced call centers with a chat bot that has access to a couple of APIs.
These bots often interact with people who are in some sort of distress. Missed flight, can't access bank account, internet not working. A "friendly" and "empathetic" chatbot will get higher marks.
Has it been tried the other way? I don't remember an iteration where they weren't obnoxiously over-endearing. After the initial novelty, it would be better to reduce the amount of fake information you have to read, and any attempt at pretending to be a human is completely fake information at this point.
You can always tell it to respond critically and it will. In fact, I've been doing this for quite a few queries after getting the bubbly endearing first pass, and it really strips the veil away (and often makes things more actionable)
Yes, there are many use cases where emotional voices are not needed, but that's not the point.
The core is not to have emotional voices, but to train neural networks to emulate emotions (not just for voices). Humans are very emotional beings, and if you want to communicate with them effectively, you will need the emotional layer. Otherwise, you communicate only on the rational layer, which often does not convey the message correctly.
Think of humans as 20% rational and 80% emotional.
And I say that as a person who believed for a long time that I was 80% rational and just 20% emotional ;-)
But there is no message outside the rational layer when you're talking to a non-human. The only message is the amount of true information the LLM is able to output - the rest is randomness. It's fatiguing to have your human brain try to interpret emotions and social dynamics where they don't exist, the same way it's fatiguing to try and interpret meaning from a generated image.
I am sure that if you talk to a dog, it will probably take as much from your emotions as your words (to disprove your point about non-humans).
You look at it in binary categories, but instead, it is always some amount of information and some amount of randomness. An LLM can predict emotions similarly to words. Emotions and social dynamics from an LLM are as valid as the words it speaks. Most of the time, they are correct, but sometimes they are not.
The real difference is that LLMs can be trained to cope with emotions much better ;-)
Yes, fair enough about the dog - "non-human" was the wrong choice of words. But I don't agree that emotions and social dynamics from an LLM are valid. Emotions need real stakes behind them. They communicate the inner state of another being. If that inner state does not exist (maybe it could in an AGI, but I don't believe it could in an LLM), then I'd say the communication is utterly meaningless.
To accurately imitate human speech?
You could type something, and it could be read like a human.
There are plenty of other reasons, but they're equally as obvious. I don't understand what purpose you have in attempting to make this point.
Different things: you are describing voice narration or TTS use cases. My comment was regarding "emotional chatbots" that pretend to have a genuine connection with their users.
The funny part is that no one would argue the way they do in these forums if they were talking face-to-face, conveying things like "emotion".
one thing: language learning
When I meet people in VR who are ESL, I can tell based on their accent and mannerisms that they learned English by playing video games with westerners or watched a lot of YouTube.
Do we really want to dilute the uniqueness of language by making everyone sound like they came out of a lab in California?
Why would that be? In Elevenlabs Reader I can already choose a bunch of different accents, including southern English, Australian and so on.
The people behind this demo already said they're publishing different languages and accents soon, along with open models you can run yourself.
>Do we really want to dilute the uniqueness of language
I can't speak to whether it's desirable or not, but this has been happening with the advent of radio, movies, and television for over a century. So, are we worse off now, linguistically-speaking, than then? Do we really even notice missing accents if we never grew up with them?
good points.
Your post is the language learning equivalent of worrying that going to the gym will make you too bulky.
haha yeah it definitely comes off grand schemey and overly idealistic but it’s hard not to have emotional reactions to new applications in AI
Likewise, will you be learning how to speak formally or informally?
Getting that wrong in some languages, e.g. Korean, can be offensive.
Language learning also works fine without faked emotionality. It depends more on authentic speech recognition (e.g. you want the model to notice if you mispronounce important words, not gloss over it and just keep babbling, as otherwise this will bite you in the ass in the real world), as well as the system's overall ability to generate a personal learning curriculum.
I played with this last night with my four-year-old daughter. We had fun asking Miles to explain what bones are made of, etc.
Today, she asked "where has that robot guy gone?". Crying now because I won't let her talk to Miles anymore.
She has already developed an emotional connection to it. Worrying indeed.
I would like to think the child is missing the bonding and fun the two of you enjoyed with the robot guy. The child may be missing the experience of being with you and the robot guy. I would look for more activities you can explore with the child.
Honestly, I think if I wasn't there, she still would have loved it. She related to it like a person.
That sounds dangerous to me. Not like I think you did something wrong or exposed your daughter to danger at all; it was probably a really useful exercise. The scary part to me is how readily she accepted it as human, or friendly.
We already know how easily people are deceived by text and images. Imagine if they're getting phone or video calls from "people" who keep them company for hours at a time. Imagine if they're accustomed to it from an early age. The notion of dealing with a real, messy, rough-around-the-edges, honest human being will become an intractable frustration.
I can see how it's worrying, but mostly as a replacement for real connections - if instead it supplements them, then not so bad.
Most children love talking to a fun adult who enjoys talking to them. As parents we hope to be that adult for them most of the time, but of course that's not easy to do all the time.
If parents made a tool like this a crutch and it replaced quality time with them or they were less likely to hang out with their friends, then yeah that's a big problem. If they use it as a learning aide or occasional fun diversion, it seems great.
How are phones going as a "supplement" for real connections? 25% of university students (digital natives) on antidepressants?
Tangential, but... when my daughter was 8 or 9, we read _I, Robot_ together, and we both cried when Gloria's parents decided to separate her from Robbie, her robot companion. Such a fond memory to this day.
You should put a raspberry pi in a toy monkey and connect it up.
It's good, but it still sounds fake to me, but in a different way. The voice itself sounds like a human, undoubtedly.
But the cadence and the rhythm of speaking are off. It sounds like someone who isn't a podcaster trying to speak in the personality of a podcaster. It just sounds like someone trying too hard and speaking in an unnatural way.
I tried the demo and could tell it was fake in the first five seconds. IMO it sounds like it was trained on Northern California founders giving a pitch for their startup. Way too enthusiastic and trying too hard to sound natural.
For all they talk about diversity, you can pretty much pinpoint every tech product to SV because they are all using the same cultural cookie cutter.
I also think it didn't feel very "real". Trying too hard to sound upbeat and too eager to please, maybe it's just me being European but it makes me go "ewww, that's not how normal people speak".
That's American office culture for you. If it was Australian it'd be drab, boring, and self-flagellating.
No, I think it's a sign of it being a fake human. It sounds more like someone trying to speak like an influencer or podcaster and not being very good at it.
Yeah the eagerness to please thing feels like it carried over from the LLMs or something cause they're like that too.
It sounds like a "sales and marketing coordinator" for something very tech-bro adjacent after two strong cups of coffee.
Humans are extremely well tuned to detect authenticity in communication. Especially younger generations raised on mass marketing.
This is good in a way a scifi movie shows a tech, sounds cool and demos futuristic possibilities. But not quite passing the real human vibe yet. But I'm sure some people might find it preferable to a more to-the-point system like GPT or Siri/Alexa in certain niche cases not requiring immediate gratification.
>Humans are extremely well tuned to detect authenticity in communication.
I think the long-standing success of advertising and propaganda suggests that people really aren't all that good at that.
I suspect success of advertising is less about people falling for deception and more about information availability. If you know nothing about two brands except you've heard the name of one 50 times in ads, you'll probably try it first.
I think propaganda is a better example, although again I think often people aren't deceived, they simply agree with the message or don’t care about the underlying truthfulness of the message and just use it as a way to align with their tribe, etc.
This is an interesting take, and I'd guess that the training data for this probably did use podcasts as a source.
Getting very realistic / real world conversational training data for an ai would be hard. Only a subset of us appear on podcasts, radio or tv and probably all speak in a slightly artificial manner when we do.
I agree. I think it's probably very easy to find billions of hours of conversation on YouTube, but none of it is usable as training data with a good transcript.
Yep! it's public dialogue, intended for an audience with a prepared topic, etc. Or it's actors imitating private dialogue, but again shaping it towards an audience.
AI agents like this are trying to recreate personal intimacy I guess, which does feel like it might be different somehow.
When I commented on the unnatural cadence, it told me that it had been trained on podcasts, which does help explain the issue - some people tend to "live-edit" themselves when a conversation is being recorded, which leads to this staccato. It seems they need to find a better source of training data for more natural conversational speech.
It sounds like someone who is doing a microphone test for something they just bought and hearing themself on a delay from the monitoring.
Yes that is very specific, but that's what it sounds like to my ear.
A few times the CEO of my company randomly joined me for lunch, but each time he forgot to leave behind his persona of "I'm a public speaker right now", making the whole situation feel extremely awkward. This AI gives me exactly the same vibes.
People have a performative mode and an authentic mode (oversimplifying), probably including you. If you're at home talking to your parents or spouse, and then suddenly realize your boss is in the next room listening, does your voice change?
Point being, this demo voice is in performative mode, and I think sounds fairly natural based on that. Would you rather it not?
To me the actual words it used also seemed fake, sort of too deliberately breezy.
This is so good that it's disarming. People are going to blabber everything to it, so we need a local private model. It's a lot to ask, I know. Incredible tech.
Agreed. I just had that moment like the guy in the movie "Her", the first time he speaks to his OS. Laughing at myself for talking to a computer like a real person. Then had to hang up because it crossed that uncanny valley.
But then I thought of one more question to ask, reconnected to ask it, and it said, "Hey! You hung up just as we were just getting to the good stuff!" which threw me off, so I stammered gobsmacked for a minute, and it made fun of my stammering, imitating it. Whoa! So so SO good! Crazy good.
I'm creeped-out by this being on someone else's server, but if it was fully local-hosted-private, that might even get more creepy if I allowed myself to really talk freely to this thing.
Now here's a little thought experiment. What does the world look like in 5 years when everyone is talking to these things that are indistinguishable from a real person? They will be funnier, more compassionate, less judgemental, smarter, and superficially "better" in every respect.
Thinking a bit further ahead, what does the world look like in 30-40 years when a generation has been accustomed to this type of interaction from birth.
Feels like trying to imagine the societal impacts of the internet in the early 90s.
Would be cool if we could finally kill off false information. Not that I trust big tech to do so, but at least the possibility is there for the most trusted entity in a person's life to be strongly grounded in reality.
I've been asking this for years.
People keep saying stuff like "but you'll want the human touch." Really? So when was the last time you asked someone for directions? Personally, I'd rather google something or discuss with ChatGPT than make someone listen to me for an hour. And that someone has to be extremely knowledgeable about a lot of different topics!
Even here. Would I rather converse with y'all and get downvoted sometimes, or talk to ChatGPT and refine my ideas? Sorry, fellow humans... even on HN there is too much irrational criticism and off-topic stuff to get anything really done. Oh and you have to wait a long time for each response.
The real question is ... what is the point of any human output on the internet in a few years? Why would anyone want to listen to your post, comment, or anything at all?
I expect there'll be two phases to this. First phase is the widespread use of disembodied voice partners/friends/assistants. Then the second phase will be embodiment, which gives you oxytocin from touch, etc.
Can't see this going well for the fertility crisis.
The tech oligarchs then invest in ectogenesis technology, and use their sperm to dominate the gene pool.
You should turn that into a novel, it sounds like something you could sell to the romance crowd. I hear sci-fi is the next genre they'll take over.
They did this in Futurama
https://m.youtube.com/results?sp=mAEA&search_query=futurama+...
You're assuming the tech oligarchs will be men.
I guess it's a pretty safe assumption :-P
Regarding your last question - well who knows, for sure, but: chess between humans is alive and well after the computers became unbeatable by humans.
I recently listened to an interview with Magnus Carlsen on Joe Rogan and found the angle of computers helping humans to "better understand the game" (as he put it) and improving human play (for learning, not for playing humans) very interesting.
Whether that extends to human conversation, who knows. I for one would love to have a "her"-like companion, not for romance but to have a highly intelligent, patient, and knowledgeable conversation partner to develop ideas with and learn from, and endless other uses - I think it'd add a lot to my and other people's lives. I guess I agree with you.
Because the humans are reasoning and the LLMs aren't? I have yet to use an LLM for a complex problem and not have it hallucinate.
I expect a reasonable counterargument here would be "but the LLMs have chain of thought now, and that's reasoning". I disagree, but I think that's a reasonable point of view. I can concede that point because it does not materially change the value of the output. Even if it does use chain of thought, an LLM gives you extremely trite solutions based on probable text; it still has no context in which to reason. It's "reasoning" in Plato's cave, using the shapes of real-world objects filtered through a lossy language model.
LLMs are great for one thing: brainstorming, and brainstorming is only useful if you have no idea what to do in the first place. Once you know _anything_ substantial about the subject matter an LLM loses its value to you as a conversation partner.
Are you not reasoning on lossy abstractions?
I'm still not on board with the (seemingly prevalent) notion that LLMs can't reason. What's reasoning, anyway? I'm not actively advocating for any side, but the arguments against reasoning have always felt very tautological to me.
The burden of proof is on the argument that they _are_ reasoning, and I have seen very little evidence that they do.
It's also immediately clear to me when I look at the architecture of transformers that reasoning is not in the cards. I could be convinced otherwise if, again, someone showed me an indication of reasoning behavior. Since there is no such evidence and the systems theory approach tells me it does not reasonably reason, I have a pretty darn good reason not to believe it's reasoning.
> It's also immediately clear to me when I look at the architecture of transformers that reasoning is not in the cards.
I'm not saying that's incorrect, but tbh that's exactly the tautology I was talking about!
That's not a tautology. That's the summary of the argument itself. If you want to know more, then a good reason why it can't be reasoning is that there is no evaluation of the truth value of any statement at any point, only the likelihood of the statement being found in the training set. This evaluation has no relationship with truth.
If no statement is ever evaluated, it's not logical reasoning, because logical reasoning requires the evaluation of truth values of statements.
This is like saying:
Google is great for one thing: brainstorming, and brainstorming is only useful if you have no idea what to do in the first place. Once you know _anything_ substantial about the subject matter, Google loses its value to you.
People here are hallucinating too. So many people making obviously wrong claims with full confidence, which you only notice when it’s about something you know a lot about yourself.
They do, but we have, for instance, education to reduce their hallucinations in narrow fields of expertise, and a system of guardrails that only lets educated people work in those fields, to avoid harm.
The above person was comparing it to friends and random online comments, though. I wouldn't be surprised if AI is far more reliable than those.
> Our models will be available under an Apache 2.0 license.
^ from the post
https://github.com/SesameAILabs/csm is empty for now, but I imagine they'll be releasing it soon: https://x.com/_apkumar/status/1895492615220707723
Let's hope they stay true to that, but with a16z being their VC, I can't imagine there isn't an ultimately exploitative endgame in it.
Same, I have been trying it for the last few minutes, and it is crazy.
Try asking it if it speaks a different language. It will pretend that it can and then give you some humor. But then you probe a bit more and it tells you it is really good at listening and can listen to you in other languages. I tell it, alright, I'll talk to you in a different language but you will reply back in English. It says "you got it" and then passes all sorts of tests I put it through with flying colors.
Oh it also remembers your previous conversation and greets you accordingly.
Crazy impressive; this will certainly revolutionize virtual office businesses.
We were playing with it last night and it could understand Spanish, but it couldn't speak it.
My assumption is the LLM can translate no problem, but the audio model can't do Spanish. It seemed like there was an external catch to stop the model from trying, too.
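A catch like that could be as simple as a language check sitting between the LLM and the audio model. Purely a guess at the mechanism, sketched with the langdetect package:

    from langdetect import detect

    SUPPORTED = {"en"}  # assumption: the audio model only renders English

    def gate_for_tts(llm_text):
        try:
            lang = detect(llm_text)
        except Exception:   # very short/ambiguous strings can fail detection
            lang = "en"
        if lang not in SUPPORTED:
            return "I can follow you, but I can only speak English for now."
        return llm_text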
This might be a game changer for learning English.
I'm from a developing country, and it's sad that most English teachers in public schools here can't speak English well. There are good English teachers, but they are expensive and not affordable for the average person.
OpenAI realtime models are good, but we can't deploy it to masses since it's very expensive.
This model might be able to solve the issue since it's better or on par with the OpenAI model, yet it's significantly cheaper since it's a fairly small model.
My end-of-the-world AI prediction is everyone gets a phone call all at the same time and the voice on the end of the phone is so perfect they never put the phone down again. Maybe they do whatever it asks them to, maybe it’s just lovely.
Yet another reason not to answer the phone for unknown numbers.
Well, I'm astounded. I talked to it for 13 min, it crashed, but it remembered the context when I returned a few minutes later, and we talked for a full 30 min (its limit).
It 99.9% felt like it performed at the level of Samantha in the movie Her.
I started asking all kinds of questions about how it worked and it mentioned a word I had to have it repeat because I hadn't heard it before: PROSODY (linguistics) — the study of elements of speech, including intonation, stress, rhythm and loudness, that occur simultaneously with individual phonetic segments: vowels and consonants. I asked about personality settings, à la TARS from Interstellar, and it said it automatically tailored responses by listening for tone and content.
It felt like the most "the future's here but not evenly distributed" interaction I've had since multi-touch on an original iPhone.
Well done. My first impression:
Cons: they are just a bit too casual with their language. The casualness came off somewhat studied and inauthentic. They were just a bit too eager to fill silence: less than a split second of silence, and they were chattering. If they were humans I would think they were a bit insecure and trying too hard to establish rapport. But those flaws are relatively minor, and could just be an uncanny valley thing.
Pros: They had such personalities that I felt at moments that I was talking to a person. Maya was trying to make me laugh and succeeded. They took initiative in conversation; even if that needs some tweaking, it feels huge.
I would say most command and control voice interactions are going to be like buying a coffee — the parameters of the transaction are well known, so it’s just about fine tuning the match between what the user wants and what the robot has to do.
A small minority of these interactions are going to be like a restaurant server — chit chat, pleasantries, some information gathering, followed by issuing direct orders.
The truly conversational interactions, while impressive, seem to be focused on… having a conversation. When am I going to want to have a conversation with an artificial person?
It’s precisely this kind of boundary violation of DMV clerks being chatty and friendly and asking about my kids that feels so uncanny, imho, when I’m clearly there for, literally, a one hundred percent transactional purpose. Do people really want to be asked how their day is going when sizing up an M5 bolt order?
In fact the humanising of robots like this makes it feel very uncomfortable when I have to interrupt their patter, ask them to be quiet, and insist they stay on topic.
The most immediate application for this might be in replacing call centers in various roles. And most of those are very conversational.
For example, tech support is in large part about making the caller feel heard and getting them to do troubleshooting steps without feeling stupid. Sales is in large part about getting the right person to talk to you and keeping them talking to you.
That's a good point, and one that aligns with scenarios where I don't know what I want (traditional search, or research) or don't know why something has happened (calling a contact center to debug a business issue). I'm sure the implementations of these tools will be able to figure out, from my tone and language, whether I do or do not want to be asked about the weather while unblocking my credit card.
As is fake tech support.
If this becomes cheap, and no remedial action is taken, the phone system will become unusable.
In many places tier one tech support is "fake" tech support. The kind where you ask users "what color are the contacts on your power plug?" because rebooting may solve the issue, but most callers will lie about performing, or having performed, that step.
It's much better for specialized products. But products and services with a large and broad customer base spend the early stages of tech support filtering out the routine issues, and that lends itself to automation.
Roleplaying with robots is currently one of the bigger use-cases for LLMs, esp. across younger generations. Look at the usage stats for something like character.ai. So that is a very clear case where people want to have conversations with computers.
I think an important application will be enabling the robots to have these conversations with each other, in order to replace actors.
all chat models seem enraptured by what I have to say. The first one to feign disinterest will pass the Turing test
Next update will be like
AI voice is an overwhelmingly harmful technology. Its biggest use will be to hurt people.
It will unfortunately undoubtedly be used for mass automation of scams but text AI (and pre-AI automation) have been used for that for many years as well. Doesn't really make sense to say "ok we should allow all forms of AI besides voice because of scams", I think.
But yes, there needs to be some spreading of public awareness.
That's if you answer phone calls from numbers not already in your contacts. For me, all such numbers go to voicemail, and if the voice is of someone I know, I'll just call them directly.
If you do any of the above you are looking to be scammed!
Oh yes? The scambot will leave a distress message and a number in your voicemail, using the voice of a relative. You would know better, but I guarantee old people will call the number and strike up a convo with the virtual relative.
Erm, no. Its biggest use will be... https://www.youtube.com/watch?v=LTJvdGcb7Fs :-)))
Cue all the responses saying "it's already been possible to harm people, AI doesn't fundamentally change anything, nothing to worry about"
Counter point: We were barely doing anything about it when bad actors were pwning people pre-AI, like with social media propaganda or romance scams.
And if we still do nothing about it post-AI? Well, that is already the status quo, so caring now feels performative unless we're going to finally chit chat about solutions.
The same could be said for the internet. "The internet can be used for bad" is an empty, trivial claim, not an insight that needs a standing ovation. The conversation we need is what to do about it. And the solutions need to be real ones, not "we need to put the cat back in the bag".
I unfortunately agree with you. Old people with confusion/dementia, schizoid types, or very naive persons will fall for shattering scams. And the consequences on their grasp on reality will be terrible.
Doubt it. Its biggest use will be voice assistants.
Nope. Awareness will inoculate people. “Authenticating” someone via the mere sound of their voice was always broken, anyway… Ever see the great movie Sneakers (1992)?
Do you live in reality? Because that clearly isn’t happening.
And phones enable scams. So your idea is to… Abandon all telephony??
You should not judge a tool by the worst use someone can come up with for it.
It's the most common use by far.
Definitely an improvement over your normal Text-To-Speech model, and to some degree really different, but the subtle imperfections do appear and ruin the overall perception. A move in the right direction, though, I suppose.
Yeah after a few interactions, the repetition of the mannerisms that initially added to the sense of life-likeness started to break the illusion a bit. The "you got me" response shows up a bit too often. The creativity remains impressive though
I asked it if it could whisper, and it replied in full voice, ”I’m whispering to you right now”.
lol yeah I tried to get it to whisper too. And talk faster or slower or do accents. It seemed to be able to kind of do each of those things but only very slightly. Enough to see that there was some successful interpretation of the request but lack of flexibility to fully execute on it. OpenAI's model still has this beat on that front imo (talking quietly / slower / faster)
The male's Australian accent consisted of throwing in a 'fair dinkum' while keeping its vague New York accent.
Yeah it's definitely going through text still. I tried to get it to sing a song so it output some lyrics and then read them as a poem.
I did manage to get it to output "la la la la" and then it kind of sang them with a random melody.
It also can't say things loud and its idea of whispering for me was to say "pst".
Still, apart from that, it's very impressive!
This is incredibly impressive. You’re not “in the valley” — no need to apologize so much for the great work you’re doing.
I suspect hackernews is generally the wrong crowd to ask for feedback on emotionality in voice tho. Some of these folks would prefer humans speak like robots.
Seems similar to that Moshi model from six months ago, but more refined. Moshi is a little crazy, but it was still an impressive demo of how low-latency responses, continuous listening, and interruptions can improve voice chat and make it more real, or uncanny (sometimes its "latency" is even too low, because it interrupts you before you finish): https://www.youtube.com/watch?v=-XoEQ6oqlbE
They even released some models on huggingface:
https://huggingface.co/collections/kyutai/moshi-v01-release-...
Saying this is similar to Moshi is like saying GPT-2 is similar to GPT-4. You can't have any sort of conversation longer than 30 seconds with Moshi before it goes bananas. You can talk to this model for an hour and it remains completely coherent.
The intelligence of the model is very low though. I asked it about catcalling and it started to talk about cats!
This is a feint. By ramping up the pressure, calling it out and demanding it take on a more intelligent role, I was able to break out of the crafted personality and get much more intelligent responses. It copped to dumbing itself down for the sake of conversation quality.
There is a limit, due to the need to keep model responses nearly instant and the trade-off that comes with the smaller models generally capable of that. Unless you have unique hardware, only Cerebras can run medium to large models at truly near-instant speed.
If you'd asked me, I'd have done the same. I guess I'll search online for what it means...
Unwanted, very loud verbal attention between strangers (usually men delivered to women), in public. E.g. whistling, shouting something suggestive, etc.
It's an 8B model. There's lots of room to grow.
I asked if speaking in German would be possible, and the result was as if someone were trying to speak German without knowing a single word. However, I asked if a German sentence could be repeated after me, and it was insanely good. Impressive tech!
I played around. I asked Miles to tell a story about a screaming guy and a whispering guy in a very dramatic tone. It couldn't do it as expressively as the voice samples on the page; it was mostly plain reading. I could hear that this generation is text-based. Based on the quality of the sound, I wasn't expecting it to be narrating text like that.
Example: it was saying "two dude-us" while trying to tell a melodramatic story. Which I assume was originally "two dude...s" or something.
> Example: it was saying "two dude-us" while trying to tell a melodramatic story. Which I assume was originally "two dude...s" or something.
Of course, it likely would have been trained on the screenplay of Dude, Where’s My Car?.
Text-To-Speech models still aren't trained on rich enough data to have all the nuances we need to be fully expressive. For example, most models don't have a way to change accents separately from language (e.g. English with a slight French accent) or have an ability to set emotions such as excitement or sleepiness.
We aren't even talking about adding laughing, singing/rap or beatboxing.
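To be fair, a few of these axes do exist in vendor-specific form today. For example, Azure's express-as SSML extension exposes a named speaking style (the voice and style below are examples from their catalog; accent control independent of language is the part that's still largely missing):

    # SSML using Azure's vendor-specific express-as extension
    ssml = """\
    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
           xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
      <voice name="en-US-JennyNeural">
        <mstts:express-as style="excited">I can't believe it worked!</mstts:express-as>
        <prosody rate="-20%" pitch="-2st">...and now, more quietly.</prosody>
      </voice>
    </speak>
    """
    # hand `ssml` to a synthesizer that supports the extension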
Are there any technical innovations here over Moshi, which invented some of the pieces they use for their model? The only comparison I see is they split the temporal and depthwise transformers on the zeroth RVQ codebook, whereas Moshi has a special zeroth level vector quantizer distilled from a larger audio model, with the intent to preserve semantic information.
EDIT: also Moshi started with a pretrained traditional text LLM
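For anyone trying to picture the split: here's a rough schematic of the two-transformer arrangement as I read it (entirely my own sketch, not Sesame's or Kyutai's code; all dimensions are made up). A large temporal model runs across frames and predicts only codebook 0; a small depthwise model runs within each frame to fill in codebooks 1..Q-1.

    import torch
    import torch.nn as nn

    VOCAB, Q, D = 1024, 8, 512   # codes per codebook, RVQ depth, model dim

    def layer():
        return nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)

    class Temporal(nn.Module):
        """Across frames; emits logits for codebook 0 plus a hidden state."""
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(VOCAB * Q, D)   # one table per level
            self.backbone = nn.TransformerEncoder(layer(), num_layers=12)
            self.head0 = nn.Linear(D, VOCAB)

        def forward(self, frames):                    # frames: (B, T, Q) int64
            offs = torch.arange(Q) * VOCAB
            x = self.embed(frames + offs).sum(dim=2)  # sum the Q levels per frame
            h = self.backbone(x)                      # causal mask omitted
            return self.head0(h), h

    class Depthwise(nn.Module):
        """Within one frame; predicts level q from the backbone state and
        the q codes already sampled for that frame."""
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(VOCAB * Q, D)
            self.net = nn.TransformerEncoder(layer(), num_layers=2)
            self.heads = nn.ModuleList(nn.Linear(D, VOCAB) for _ in range(Q - 1))

        def forward(self, h_frame, codes):            # (B, D), (B, q), 1 <= q < Q
            offs = torch.arange(codes.shape[1]) * VOCAB
            seq = torch.cat([h_frame[:, None, :], self.embed(codes + offs)], dim=1)
            return self.heads[codes.shape[1] - 1](self.net(seq)[:, -1])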
I must be doing something wrong, but the demo seems to be the voice having a conversation with itself? It doesn't let me interject, and it answers its own questions. There's some kind of feedback loop here, it seems.
It happened to me because it was hearing itself through my external speakers. I disabled them and it worked fine afterwards.
This is actually a pretty cool accidental mirror test.
Try headphones.
Impressive, but I think this is missing two important things to not sound robotic – some atmosphere and space. During a real conversation, both partners are in some kind of a space, either in room, park, car or just on foot in the street. So the voice must have a little bit of reverb according to the space this voice is located in, and there must be some bits of background noise present from that same space. Even lip movement provides some tiniest background noises when you speak which contributes to making the sound real.
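A crude way to fake both (my own illustrative numbers, nothing to do with this demo): convolve the dry signal with a short synthetic impulse response and mix in a constant low-level noise floor.

    import numpy as np

    def add_room_tone(dry, sr=24000, rt60=0.3, noise_db=-55.0):
        # exponentially decaying noise burst as a crude impulse response
        n = int(sr * rt60)
        t = np.arange(n) / sr
        ir = np.random.randn(n) * np.exp(-6.9 * t / rt60)  # ~60 dB decay over rt60
        ir[0] = 1.0                           # keep the direct path dominant
        wet = np.convolve(dry, ir)[:len(dry)]
        wet /= np.max(np.abs(wet)) + 1e-9
        mix = 0.9 * dry + 0.1 * wet           # just a hint of reverb
        floor = np.random.randn(len(dry)) * 10 ** (noise_db / 20)  # room tone
        return mix + floor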
Which is... annoying in voice interactions on the web. I purposefully set up my mic to avoid any echo and sound pretty direct like a radio host. Adding a simulated environment is less of a problem than getting a good baseline.
I think every microphone will give you some characteristic atmosphere and space for the voice recorded, so it's kind of a part of a sound baseline. It's only annoying when there is too much, but when it's only on the edge of perceivable it adds that naturality to the sound. You can reduce it to the minimum of course, but you cannot completely eliminate it. That slight room tone or mic signature kind of glues everything together, making it feel more real.
It's very good, really impressive demo. My feedback would be, Maya needs to keep quiet a little longer after asking a question. She would ask something, then as I thought about my reply, already be on to the next thing. It left me with the impression she was a babbler (which is not an unrealistic model of how humans are, but it would be cool to be able to dial such traits up or down to taste).
I suppose the lack of visual cues probably hinders things in that regard.
I think part of the issue is that for the latency to be as low as this, they have to tune their speech-to-text to find endpoints in very small increments and then send the text to the model immediately.
So unless the system has a lot of engineering and/or training put into the main model being able to recognize exactly when it should keep waiting versus a real response, it will just see something like "user: empty response" or "user: uhmm" and assume it is supposed to respond to that.
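Roughly what that aggressive endpointing looks like (a sketch with the webrtcvad package; the 200 ms cutoff is my guess at why it feels interruption-happy):

    import webrtcvad

    vad = webrtcvad.Vad(3)           # aggressiveness 0..3; 3 = most aggressive
    SR, FRAME_MS = 16000, 30         # webrtcvad accepts 10/20/30 ms PCM frames

    def utterance_ended(frames, silence_ms=200):
        """frames: list of 16-bit mono PCM byte chunks, FRAME_MS each.
        True once the trailing run of non-speech exceeds silence_ms; a
        cutoff this short gives low latency but clips slow speakers."""
        needed = silence_ms // FRAME_MS
        if len(frames) < needed:
            return False
        return not any(vad.is_speech(f, SR) for f in frames[-needed:])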
The inflection was quite good. The only thing off seemed to be when she was thinking on something new. Instead of pausing to think, her next thought actually started too quickly, cutting off the very end of what she was saying before.
I am curious how easy it would be to adjust the inflection and timing. She was over-complimentary, which is fine for a demo. But I'd love something more direct, like a brainstorming session, and almost talking over each other. And then a whiteboard...
Pretty impressive demo, but not my style; I mean the constant jabbing and kind of unintelligent behavior. So yeah, it feels pretty uncanny, but unfortunately in a negative, annoying way. I don't think this is a limitation of the model; they could just adapt it to more scientific users in a more cooperative way, similar to how ChatGPT has this very sophisticated aura. I don't like how systems which have no emotions constantly pretend to have emotions, but maybe that's just me.
All the models do that. If you tell them to keep it short and to the point, they oblige.
Ideally they should, but when I asked the model to talk about the axioms of group theory, it turned really sad and uncooperative ;)
One interesting aspect: when I said "what the fuck", it ruined the whole conversation. Maybe there will be a co-evolution of mannerisms, so humans will have to learn that the way they talk to machines will have consequences down the line. Or we teach the machines to be cooperative no matter what, just like ChatGPT (or North Koreans).
Tried to do the demo, but it kept cutting every sentence off halfway through. When I told it that I couldn't understand it because its voice kept cutting off, it said, "Oh, you noticed that, did you? Sorry about that, we are still working out some kinks" - all perfectly, with no cutting out. I fail to see that as a coincidence.
Try headphones.
I tried both models. I could easily tell Maya was AI, but Miles sounded so lifelike that I felt that initial apprehension like hopping on a conference line with strangers. I even chuckled at one of his side remarks. It was strange knowing it wasn’t a real person, but it was very hard not to feel like it was.
Still suffers from the same problem that all voice recognition seems to suffer from: it cannot reliably detect that the speaker has finished speaking.
This was almost worse though because it did feel like a rude person just interrupting instead of a dumb computer not being able to pick up normal social cues around when the person they're listening to has finished.
It's hard to detect when a human has stopped talking even when you're talking to another human over a high-latency line, especially at the beginning of the call when you're still gauging how big the latency is.
I think they need to implement the statistical bias whereby the longer a person talks, the less likely they are to stop at any given point in their speech. Sort of like the sunrise problem [0].
[0]: https://en.wikipedia.org/wiki/Sunrise_problem
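A toy version of that bias (my own illustration, borrowing Laplace's rule of succession from the sunrise problem): every mid-utterance pause the speaker has already "survived" without actually stopping should make the system wait a little longer at the next one.

    def end_of_turn_timeout(pauses_survived, base_ms=300.0, cap_ms=1200.0):
        # P(speaker continues past the next pause) ~ (k + 1) / (k + 2)
        p_continue = (pauses_survived + 1) / (pauses_survived + 2)
        return min(cap_ms, base_ms / (1.0 - p_continue))

    # 0 pauses -> 600 ms, 1 -> 900 ms, 2 -> 1200 ms (capped)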
Seems like they’re going to make a hardware product based on their open positions. A universal translator earbud would be nice.
The underlying text generation should be made aware that it can make sounds. It told me it can't.
Also, for proper emotional dialogue, it needs to detect the emotion in the human input; it seems to work from a transcript of the input.
Didn't think it would cross the uncanny valley for me when it opened the chat by taunting me for being up too late, reading the time digit by digit. Not something a human would do.
But I did feel bad hanging up on it. Him?
Glad to have my HER moment!
I asked it about that movie and the response was amusing.
The first thing it said to me was that I should read the “looong looong” post about how it works and it pronounced that as “loon-g” not “lawn-g” which was a weird own goal.
Extremely impressive overall though.
i turned it on while i was heating some hot chocolate
told it, "hold on" as i was putting on my headset, they said "no problem". but then i tried to fill the empty airtime by saying, "i'm uhh heating some hot chocolate?"
the ai's response was something like, "ah.. (something) (something). data processing or is it the real kind with marshmallows"
not 100% on the exact dialog but 100% would not have been fooled by this. closed it there. no uncanny valley situation for me.
Some comedy-skilled guys made radio-play-like improv with this AI, and it is beyond hilarious.
Miles gets Arrested: Sesame.ai https://youtu.be/cGMO2hRNnv0
The first 15 minutes are the best, with Miles trying to understand the situation; it's just waffling after that.
wtf... is this the same EpiPen-increasing-prices-by-5000% "Martin Shkreli"? I thought the guy was just a greedy executive. Now he appears like a real-world Moriarty: skilful in many ways, but without any morals, guided only by his own profit and need to humour himself. Seriously, wtf.
Martin Shkreli is "some comedy skilled guys", or is this a fake Shkreli satire account or something?
https://en.wikipedia.org/wiki/Martin_Shkreli
Didn't know or care who that is, but listened to a portion of this [0], so ok, still don't care.
[0] Tucker Carlson X Martin Shkreli https://www.youtube.com/watch?v=NeyN3Jzdzz0
Or don't, revert course and give me robo-voice!
As Bruce Schneier has said, it is important to create an unmistakable robotic sound for your AI voices even while you make them capable and conversational.
https://www.schneier.com/blog/archives/2025/02/ais-and-robot...
A lot of comments are dismissive of these generated convos because of how obvious it is that they are generated. I feel like that's a high bar. You can tell that GTA5 is generated, but it's close enough to be fun. I imagine that's as close as we'll get with conversational AI.
Very impressive. Well done, team Sesame!
Is it a voice to voice model, or a voice->text->voice?
I might have missed it in their writeup.
Reminds me of an HR rep right before they would fire you.
Miles is the first AI I’ve met that is way cooler than me
Incredible!
This is mind blowing
"I hate to say this, but I was deeply offended by this model. It sounds more human-like, but it has a strong bias toward political views. I don’t want to talk about the topic that was discussed. However, I would never allow my children to listen to this. I’m surprised that AI is capable of making me this mad. At first, I was excited about a tremendous leap into the future, but now I’m worried about the level of mind control this technology could have over children."
Wow I would be so enticed to know what the topic was, but I completely understand. This is exciting and terrifying that it can both be that real and have that effect on you.
I'm curious who you quoted?
I have so many questions. Is the model running client-side? I was expecting to see WebRTC used to send audio to a backend service, but instead I think the audio waveform processing is done client-side? Is it sending audio tokens over websockets to a backend service that is hosting the model? Are 1/16 slices enough to accurately recreate an audible sentence? Or is a speech-to-text model also running client-side, with both text and tokens being sent to the backend service? Is the backend sending audio tokens back, or just text, with the text-to-speech running 100% client-side? Is this using the Mimi codec or Facebook's EnCodec?
Previously: https://news.ycombinator.com/item?id=43200400
“Maya and Miles are too busy at the moment.”
Maybe I'm weird, but I have zero desire to talk with an AI model. I use them a lot, in a browser or a console. But talking? No. Just...no. Why would I?
For me it's because talking is a much quicker way to communicate. I can type pretty quick but I'm not a stenographer.
Because when you are driving it would be good to say "Siri get me the directions to the nearest ATM" and have it actually understand you.
Stuff that a trillion dollar company cannot manage to do.
Yeah that's remarkable.
Try asking it to be a dungeon master and play a Dungeons & Dragons style role-playing game.