Merging each panelist's local feed AFTER transcription to minimize/eliminate transcript misattributions

floatingbones · February 24, 2026, 5:18pm

The Whisper app does very well with transcription, but it gets confused when hosts talk over each other. I asked the AIs; they note solutions to this problem. Each individual local audio recording can be transcribed separately and then merged. There exist specialty apps to merge the individual transcripts. Alternatively, @leo could exercise his mad Claude-skills to spec a Python script to perform the merges.
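A minimal sketch of what that merge script could look like, assuming each panelist's recording has already been transcribed into a list of segments with `start`, `end`, and `text` fields (the shape Whisper's Python package returns under `"segments"`). It also assumes every recording starts at the same instant, so the timestamps are directly comparable; if they don't, a per-track offset would have to be added first. The `Line` class and `merge_transcripts` function are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass
class Line:
    start: float   # seconds from the start of the recording
    end: float
    speaker: str
    text: str


def merge_transcripts(per_speaker):
    """Interleave per-speaker segments into one transcript ordered by start time.

    per_speaker maps a speaker name to a list of dicts with "start", "end",
    and "text" keys. Assumption: all tracks share the same time origin.
    """
    lines = [
        Line(seg["start"], seg["end"], speaker, seg["text"].strip())
        for speaker, segments in per_speaker.items()
        for seg in segments
    ]
    # Attribution comes from which file a segment was found in, so overlapping
    # speech stays attributed correctly; only the ordering has to be decided.
    lines.sort(key=lambda line: (line.start, line.speaker))
    return lines


example = {
    "Leo": [
        {"start": 0.0, "end": 4.0, "text": " Welcome to the show."},
        {"start": 9.0, "end": 12.0, "text": " Exactly."},
    ],
    "Paul": [
        {"start": 3.2, "end": 8.5, "text": " Thanks, good to be here."},
    ],
}

for line in merge_transcripts(example):
    print(f"{line.start:7.1f}  {line.speaker}: {line.text}")
```

Here Paul starts talking at 3.2 seconds, before Leo's first segment ends at 4.0; because each speaker comes from a separate file, the overlap does not affect who gets credited with which words.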

Leo · February 25, 2026, 12:13am

Unfortunately we don’t normally do local recordings.

floatingbones · February 25, 2026, 7:04pm

The critical factor is not whether you still do local recordings. The critical factor is whether or not you have access to the separate audio streams before they’re mixed together.

TWiT used to describe online what systems you were using to capture the audio/video. It’s good news you no longer have to perform local recording to maintain high podcast quality. What software is TWiT currently using to capture the audio/video?

PHolder · February 26, 2026, 1:13am

It’s Zoom in some cases, and https://restream.io/ in others. I think, in each case, all that is available for recording is the mixed stream.

floatingbones · February 27, 2026, 3:46pm

Zoom definitely lets you record separate audio streams for each source. Here is a discussion that specifies how to do it.
I see no public discussion of restreamio being used as an alternative to Zoom to record a subset of shows. Is there somewhere that discusses their use of an alternative conferencing/recording service? Since all shows are being recorded and packaged by one production team, why would they choose to use a different service for some shows?

iFish · February 27, 2026, 3:50pm

They use restreamio for Windows Weekly and another show. I don’t remember if it was Security Now or Intelligent Machines. I don’t have a link to the source but when they were shutting down the studio it was talked about a bit. MBW is definitely on Zoom. I’m pretty sure it was Security Now but I also seem to have a memory of it not being Security Now because Steve doesn’t like WebRTC or something like that.

They do it because some shows Leo can mostly take care of himself. But maybe it’s changed since.

I can also see a world where it used to be used for IM, but with the move to an interview show, they use Zoom to make it easier on the guest. I’m just guessing.

PHolder · February 27, 2026, 8:31pm

Being technically possible doesn’t mean it’s fundamentally feasible in the context of how the shows are produced. The shows are meant to be consumed by humans, not by AIs. While the transcripts are useful, they are not the primary goal of the podcasts, and I don’t think they should interfere with the existing process, possibly making a worse product overall, just to enable better transcripts.

floatingbones · March 2, 2026, 1:51am

I’m not sure what this means. The way to see if it’s fundamentally feasible is to try it: a feasibility study. If it’s feasible, it can be introduced into the workflow.

The transcripts are highly useful for a diverse audience of human subscribers. They allow humans to locate a particular episode where a topic was discussed. Humans with difficulty hearing can use the transcript to fill in gaps or follow the entire conversation. IIRC, there are stories of individuals who dramatically improved their English language skills by following along each week with the audio and the transcripts together. Maybe that was feedback to a “Security Now!” show.

The TWiT network has had excellent production quality since its beginning, and the transcripts are the icing on the cake.

Untangling talkovers would make a better product. The way to see if it would “make a worse product” is to try it and see what happens. There’s no reason to speculate about bad outcomes; just try it.

PHolder · March 2, 2026, 7:36am

I wasn’t actually speculating, it was a turn of phrase. Most co-hosts do not have the computer power and network link that would support the necessary ability to send a live feed for the live recording at the same time as they’re trying to make a local copy of their own recording. Leo HAS tried this already with Restream and it caused all sorts of glitching and other problems.

Leo · March 2, 2026, 1:26pm

And incidentally hardly anyone downloads the transcripts. They’re more for SEO and search than anything else.

floatingbones · March 3, 2026, 8:12pm

Thank you, Leo. SEO and search are valuable things, too, and they help algorithm the podcasts. If my recollection is correct about (SN!) listeners who scrutinize the transcripts for detailed technical study of the content and/or detailed study of the English language itself, the humans who do those downloads get disproportionately high value from those transcripts. The transcripts also give an easy way to jump to a particular point of a show. I do that frequently with YouTube videos that provide a TOC with timestamped links.

What I was expecting in this thread was, “Thank you! We could try that out. That would be an interesting Claude-assisted project.” Or maybe just, “That’s an interesting end run around Whisper’s confusion with crosstalk. Alex would appreciate that suggestion.” I am expecting that this will be a completely solved and automated problem in the future. Killing those specific transcript bugs seems valuable for many business users of Zoom where attribution is important.

Paul, when I see a timestamp of 2:46am for your last posting, I get concerned. I don’t think there’s anything here worth “correcting” in a 2am posting. Think of the cortisol! It’s not worth it.

The Zoom web interface is for saving conversations in the cloud. Since Zoom Cloud services are used through the web interface when the [merged] output is sent to the cloud, it makes sense that the separate audio files would also be sent to the cloud. I asked Claude, and it said that users report those separate copies are indeed stored in the cloud. Claude did note that the separate cloud audio files are stored in a somewhat obscure cloud location archive for each meeting. In the spirit of maintaining a fact-based discussion, I’ll include those details here. Look at the last question in the Claude transcript.

PHolder · March 4, 2026, 3:44am

Phil, you need to stick to tech and avoid speculating about my (or anyone else’s) motivations or personal lives. [We’ve covered that before, let’s not regress please.]

knewman · March 5, 2026, 1:14am

Instead, you got a nuanced discussion around the caveats and challenges of implementing this, beyond that which a chatbot has the capability of generating. If you didn’t want a real discussion, then why post at all?

The reality is that TWiT has limited resources, and they don’t see the value in making changes to production to accommodate this functionality. I tend to agree - the tool should adapt to meet the project, not the other way around. Maybe the transcription software will be good enough someday but it’s not there yet.

floatingbones · March 6, 2026, 6:52pm

No, it hasn’t been a nuanced discussion. We got the claim:

Most co-hosts do not have the computer power and network link that would support the necessary ability to send a live feed for the live recording at the same time as they’re trying to make a local copy of their own recording.

That wasn’t nuanced; it’s not even correct. The only change to how the co-hosts’ audio is recorded is very simple, and it’s on the server side:

It’s one check-box in the Zoom Cloud host configuration for the session. Individual participants don’t change anything; they wouldn’t even know that the option had been selected. There would be no difference in the computational load or in network bandwidth for any co-host. It’s the Zoom Cloud Service that creates the separate files. Those audio files are included in the batch of files available for download from the Zoom Cloud Service.

What would be the delta in production resources? The production team would download each file, queue it up for Whisper transcription, and merge the output together for a single clarified transcript. It’s an automation, and the automation script should be well within the capabilities of agentic AIs to generate. If Claude Opus 4.6 scores a 53 on Humanity’s Last Exam, I think generating those scripts is well within its capabilities.
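As a rough sketch of that automation, assuming the per-participant audio files have already been downloaded and that all of them start at the same instant: the function below takes a mapping of speaker name to file path plus a `transcribe` callable, and returns formatted transcript lines in time order. The callable is injected so the sketch runs without any model installed; with the `openai-whisper` package it could be something like `lambda path: model.transcribe(path)["segments"]`. The function names and file names are hypothetical, and the file names Zoom actually produces are not assumed here.

```python
from typing import Callable, Dict, List


def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS, dropping fractions of a second."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def build_transcript(
    tracks: Dict[str, str],
    transcribe: Callable[[str], List[dict]],
) -> List[str]:
    """Transcribe each speaker's file separately, then merge by start time.

    tracks maps a speaker name to an audio file path (hypothetical names).
    transcribe returns segments as dicts with "start", "end", "text" keys.
    Assumption: every file shares the same time origin.
    """
    rows = []
    for speaker, path in tracks.items():
        for seg in transcribe(path):
            rows.append((seg["start"], speaker, seg["text"].strip()))
    rows.sort()  # orders by start time, then speaker name
    return [
        f"[{format_timestamp(start)}] {speaker}: {text}"
        for start, speaker, text in rows
    ]


# Stand-in for a real transcriber so the example runs anywhere.
canned = {
    "leo.m4a": [
        {"start": 0.0, "end": 4.0, "text": " Welcome back."},
        {"start": 65.0, "end": 70.0, "text": " Moving on."},
    ],
    "paul.m4a": [
        {"start": 3.5, "end": 6.0, "text": " Thanks, Leo."},
    ],
}

for row in build_transcript({"Leo": "leo.m4a", "Paul": "paul.m4a"}, canned.__getitem__):
    print(row)
```

Whether the downloaded tracks really are time-aligned, and how long transcribing several full-length files takes compared with one mixed file, are the parts a trial run would have to confirm.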

Doesn’t that cover it? Can you produce a nuanced discussion spelling out how you think the load would increase in order to produce those clarified transcripts?

No, the existing transcription software is already fully capable of doing the job. You just have to perform an end-run around its existing limitation with processing crosstalk. Selecting that one check-box in the Zoom Cloud options for a recording is an easy way to do that.

If this had been a nuanced discussion, everybody would already understand the scope of the proposed change. There is no change in the recording procedure for any co-host. Technically, co-hosts wouldn’t even know that that check-box had been checked. After the episode was recorded, the transcription process would be automatic.

I posted because I think it’s an easily-solvable problem, and I provided the receipts. What I’d like to know: why would anyone participating in this discussion still think that some property of the co-hosts’ computers or network connections would somehow prohibit saving the audio file in the Zoom Cloud? Also, why do you think there would need to be any improvement whatsoever in Whisper to automate the proposed changes?

PHolder · March 6, 2026, 7:00pm

Since I presume you’ve never ACTUALLY been a host on one of the shows, it’s very easy for you to imagine that simply checking a single box is all it takes. The issue is that it is much more complex than you imagine, and I tried to gloss over it, because it’s technical, and as I’ve never been such a host, I know I don’t know all the technicalities involved. What I do know is that it HAS been tried, and failed, repeatedly. Despite the claims to the contrary from ISPs everywhere, bandwidth is not unlimited, or even sufficient in many cases. Despite the claims of web browser developers, OS suppliers and the various video conferencing software suppliers, there are complications with having a SMOOTH experience when a PC is trying to do too much networking, echo cancellation, backdrop blurring and disk writing all at once. These limitations lead to hiccups and crackles and drop-outs and all sorts of technical issues that affect the resulting quality of a show.

You don’t seem to want to hear that, so you want to continue to argue that because you think it’s trivial it must certainly be. It’s not, and it’s time to let it go.