My most interesting conversations with Claude tend to be long. Certainly longer than most people would ever read. I’ve been saving the “keepers” as PDFs and archiving them here. I’ve long thought these “chats” might make good audio conversations. Still long, but something that could be listened to while commuting. I assume people are already doing this. In the ever-more-frantic feature race, I keep thinking some frontier lab will add this as a feature. I ran the idea by Claude (my comments in bold):
Claude: The thing that makes those conversations feel alive on the page is your agency in them — the back-and-forth, the moments where you pushed back or redirected. Converted to audio, that becomes a monologue with interjections, or at best a scripted dialogue. The texture changes substantially. “My Dinner With Andre” worked because both people were performing; a real conversation converted to audio often sounds like two people talking past each other, with uneven, mismatched turn lengths.
There’s also a selection problem. You’re saving the keepers — conversations that already survived your editorial filter. But the thing that made them keepers was probably the cumulative density, the way a thread developed over many turns. Audio compresses poorly when the value is in the accumulation rather than any individual exchange.
That said — people are definitely doing versions of this. NotebookLM’s podcast feature will take source material and generate a synthetic two-host discussion about it, which sidesteps the raw-transcript problem by reprocessing the content into a more audio-native form. The result is smoothed out and loses the original voice, but it’s listenable. ElevenLabs and similar tools can do TTS on raw transcripts, though the result tends toward audiobook rather than conversation.
The one-button feature you’re imagining from a frontier lab is more likely to arrive as “generate a podcast summary of this conversation” than as a faithful audio rendering of the original — because the original, however good on the page, probably needs an edit pass to work aurally.
What’s the actual use case you’re imagining — preservation/archive, or something you’d actually share with people?
**Text is fine for archiving. Mostly for me. I was thinking of an audio post on my blog, with my voice captured as the conversation unfolded… and a synthetic voice for Claude. Not sure I agree with your take on “loss of agency.”**
Fair enough on the agency point — I was projecting a problem that may not exist for your use case. If it’s your voice in real time, the agency is fully intact. You’re not converting a transcript; you’re recording a session as it happens, which is a different thing entirely. The spontaneity is preserved because it was never scripted.
That’s actually a more interesting format than I gave credit for. It’s closer to a radio documentary or a recorded interview than a podcast in the produced sense. Your thinking out loud, mid-conversation pivots, the occasional “wait, that’s not quite right” — that’s the content. The synthetic Claude voice is just the other side of the phone call.
The technical lift is pretty minimal. A screen recorder with audio capture (or a dedicated mic) for your side, TTS for Claude’s responses, then stitching. Tedious but not hard. The bigger editorial question is whether you’d want to trim at all or post it raw — and for a blog that’s explicitly not written for an audience, raw probably fits the ethos.
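The stitching step can be sketched in a few lines. Assuming a hypothetical transcript export format where each turn begins with a `Me:` or `Claude:` prefix (not any particular app’s actual format), the first task is separating the two halves so only the Claude side goes to TTS while the human side stays as the live mic recording:

```python
import re

sample = """Me: Could these chats work as audio?
Claude: The texture changes substantially.
It becomes a monologue with interjections."""

def split_turns(transcript):
    """Split a saved chat transcript into (speaker, text) turns.

    Assumes each turn starts with a 'Me:' or 'Claude:' prefix at the
    start of a line -- an illustrative format, not a real export spec.
    Continuation lines are folded into the current turn.
    """
    turns = []
    speaker, lines = None, []
    for line in transcript.splitlines():
        m = re.match(r"^(Me|Claude):\s*(.*)", line)
        if m:
            if speaker is not None:
                turns.append((speaker, " ".join(lines).strip()))
            speaker, lines = m.group(1), [m.group(2)]
        elif speaker is not None:
            lines.append(line.strip())
    if speaker is not None:
        turns.append((speaker, " ".join(lines).strip()))
    return turns

# Only the Claude turns are sent to a TTS engine; the human turns
# are already on tape from the mic.
claude_text = [text for who, text in split_turns(sample) if who == "Claude"]
```

From there, any TTS tool that accepts plain text per utterance could render `claude_text` to audio files for interleaving with the recorded human side.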
The thing I’d actually push on: is the audio format adding something, or is it mostly a novelty? Your blog readers (such as they are, given it’s not written for them) can skim text and jump around. Audio is linear and unforgiving of slow stretches. If the conversation has a 20-minute detour that went nowhere, they’re in it with you.
…
**I want to record the conversation as it happens. Stitching the audio together is not a problem. And almost certainly “post raw.” As to your question, “is the audio format adding something,” perhaps that’s a question that would only be asked by an AI that has never heard the spoken human voice.**
I asked the question from a text-native perspective, and you’re right that I have no felt sense of what audio does that text doesn’t — the grain of a voice, the pause before a word, the difference between someone reading and someone thinking out loud. Those aren’t decorative; they’re informational in ways that don’t survive transcription.
And for someone with radio in their bones, that’s not even an argument to be made — it’s just obvious, the way certain things are obvious to people who’ve worked in a medium long enough.
I’d add one thing though, not as a counterargument but as a genuine observation: the asymmetry in the audio might be interesting in itself. Your voice, unmediated, captured in the moment — and then a synthetic voice that is by definition not thinking, not pausing, not searching for a word. That contrast could be more revealing than a matched pair of human voices would be. The listener hears where the actual cognition is happening.
Which might be the most honest representation of what these conversations actually are.
**Final point… I thought about having the conversation, recording my voice as we go. Then converting the AI half to audio and “stitching” it together. But something would be lost by me not hearing and reacting to your voice in real time.**
Right — it would become post-production instead of conversation. The reactive quality, the way your next thought is shaped by what you just heard, gone. You’d be reconstructing something that should have been live.
Which points to the technical piece that actually needs solving: real-time TTS for the Claude side, playing back as the response generates, so you’re genuinely in a dialogue when the mic is on. That capability exists now — it’s just not packaged as a one-button “record this session” feature yet. That’s probably the actual product someone should build.
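The buffering that such a feature would need can be sketched simply: accumulate the streamed response and hand off complete sentences as soon as they close, so playback can start while the rest of the reply is still generating. This is an illustrative sketch, not a real API; the sentence-boundary regex is a deliberate simplification.

```python
import re

# Flush at a period, question mark, or exclamation point followed by
# whitespace -- a crude sentence boundary, good enough for a sketch.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def chunks_for_tts(token_stream):
    """Yield complete sentences from an incremental text stream.

    Buffers tokens as they arrive and flushes whenever a sentence
    boundary appears, so a TTS engine could begin speaking the first
    sentence before the model finishes the response.
    """
    buffer = ""
    for token in token_stream:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        # Everything except the trailing fragment is a full sentence.
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()
```

Each yielded sentence would be passed to whatever streaming TTS voice plays the Claude side, keeping the mic-on session genuinely conversational.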