Use WebVTT standard instead of a cooked-up JSON file #756

dascritch · 2026-03-04T21:17:44Z

dascritch
Mar 4, 2026

WebVTT is a standard that is well implemented in browsers for chapters, annotations and captions, because it is part of the WHATWG and W3C recommended implementations for media objects.

In https://github.com/Podcastindex-org/podcast-namespace/blob/main/docs/1.0.md#user-content-chapters , for the tag <podcast:chapters>, it states that browsers don't support ID3 chapter tags. This is true. But instead the proposal uses a less standard solution: a brand-new JSON file.

Support for WebVTT files is nearly complete in web browsers. WebVTT is a [W3C standard](https://www.w3.org/TR/webvtt1/), works smoothly in browsers covering 99% of market share, and is exposed to and used by accessibility tools. So why recommend a new file format without native implementation instead of using one that has worked well for 10 years (in subtitling, but it works perfectly for chaptering too; I'm using it)?

I suggest changing this to recommend WebVTT as the preferred solution (MIME type text/vtt, documentation: https://www.w3.org/TR/webvtt1/), and to suggest application/json+chapters as an alternative.
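To make the suggestion concrete, here is a minimal Python sketch of what reading a WebVTT chapters file could involve. The sample file, the chapter titles and the function names (`to_seconds`, `parse_chapters`) are illustrative assumptions, not part of any specification, and the parser only handles simple cues (no cue settings validation, no regions or styles).

```python
import re

# Matches "mm:ss.ttt" or "hh:mm:ss.ttt" WebVTT timestamps.
TIMESTAMP = re.compile(r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})")

# Hypothetical chapters file; the titles are made up for illustration.
SAMPLE = """WEBVTT

intro
00:00:00.000 --> 00:05:31.000
Intro

00:05:31.000 --> 00:12:59.000
Morning
"""


def to_seconds(stamp: str) -> float:
    """Convert a WebVTT timestamp to seconds."""
    match = TIMESTAMP.fullmatch(stamp)
    if match is None:
        raise ValueError(f"not a WebVTT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = match.groups()
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_chapters(vtt_text: str) -> list[dict]:
    """Return one dict per cue with start, end (seconds) and title."""
    blocks = re.split(r"\n\s*\n", vtt_text.replace("\r\n", "\n").strip())
    if not blocks or not blocks[0].startswith("WEBVTT"):
        raise ValueError("missing WEBVTT header")
    chapters = []
    for block in blocks[1:]:
        lines = block.split("\n")
        # Blocks without a timing line (NOTE, STYLE, ...) are skipped.
        timing = next((i for i, line in enumerate(lines) if "-->" in line), None)
        if timing is None:
            continue
        start, _, rest = lines[timing].partition("-->")
        end = rest.split()[0]  # anything after the end time is cue settings
        chapters.append({
            "start": to_seconds(start.strip()),
            "end": to_seconds(end),
            "title": " ".join(lines[timing + 1:]).strip(),
        })
    return chapters


if __name__ == "__main__":
    for chapter in parse_chapters(SAMPLE):
        print(chapter)
```

In a browser none of this parsing is needed, since a `<track kind="chapters">` element is handled natively; the sketch is only meant to show that the format is also easy to read outside a browser.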

cf #315

theDanielJLewis · 2026-03-04T21:58:07Z

theDanielJLewis
Mar 4, 2026

I think it's an interesting idea, especially since they also have a way to handle metadata, but I recommend against this for three reasons:

1. The documentation itself calls chapters with metadata "non-normative."
2. Simple JSON is much easier for a developer to parse and support. Look at how many apps already support this, including Apple Podcasts, finally.
3. The "chapters file" is actually not a "chapters file." It's a file containing episode metadata, allowing for much more than only chapters, potentially being able to shift some just-in-time metadata to an external file instead of overloading the RSS feed. And when such features are adopted, it will be extremely easy for developers to support if they already load the chapters, because the other data will be right there in the JSON object they're already processing.

0 replies

samsethi · 2026-03-08T21:22:22Z

samsethi
Mar 8, 2026

Reading the transcript text - https://github.com/Podcastindex-org/podcast-namespace/blob/main/docs/examples/transcripts/transcripts.md

"Want to support only one format? WebVTT is used by Apple Podcasts for ingest, and also natively supported by web browsers. Because the WebVTT format is the most flexible, it's an ideal choice if you can only support one format."

The JSON representation is a flexible format that accommodates various degrees of fidelity in a concise way. At the most precise, it enables word-by-word highlighting. This format for podcast transcripts should adhere to the following specifications.

Apple uses the VTT format with accurate word highlighting.

1 reply

jamescridland Mar 9, 2026
Collaborator

Apple uses the VTT format with accurate word highlighting.

There's no accurate word highlighting in VTT format; and I know I'm not supplying word-by-word to Apple. Yet, they are producing word highlighting for all shows.

Here's how I think Apple works:

Apple does its own transcription on a podcast, which includes accurate word-by-word highlighting.
Where a VTT format file is provided by the publisher, Apple appears to a) ingest that VTT format file; b) compare it to the transcription Apple has done itself; c) if the text is above 90% similar, Apple accepts the publisher's VTT format file, but applies timing from its own transcription.
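The acceptance step guessed at above can be illustrated with a short Python sketch. This is not Apple's method, which is not public; the 90% threshold comes from the guess in this comment, while the word-level comparison with `difflib.SequenceMatcher` and the names `similarity` and `choose_transcript` are assumptions made purely for illustration.

```python
import re
from difflib import SequenceMatcher

# "Above 90% similar" is the guess from the comment, not a documented value.
SIMILARITY_THRESHOLD = 0.90


def words(text: str) -> list[str]:
    """Lower-case the text and split it into words, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def similarity(publisher_text: str, platform_text: str) -> float:
    """Word-level similarity ratio between two transcripts (0.0 to 1.0)."""
    matcher = SequenceMatcher(None, words(publisher_text), words(platform_text), autojunk=False)
    return matcher.ratio()


def choose_transcript(publisher_text: str, platform_text: str) -> str:
    """Prefer the publisher's text when it is close enough to the platform's own."""
    if similarity(publisher_text, platform_text) > SIMILARITY_THRESHOLD:
        return publisher_text
    return platform_text


if __name__ == "__main__":
    publisher = "When the moon hits your eye like a big pizza pie"
    platform = "when the moon hits your eye like a big-a pizza pie"
    print(round(similarity(publisher, platform), 3))
    print(choose_transcript(publisher, platform))
```

The sketch covers only the accept-or-reject decision; re-applying word timings from one transcript to the other would need an alignment step that is not shown here.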

In both cases, Apple redacts dynamic advertising from its transcript. I don't know how it does that. It isn't on-device, since the transcription appears before the audio is played. It may be requesting different copies of the audio and comparing them.

I don't think this relates to the files being sent to Apple (indeed, I can confirm it doesn't, given Podnews Daily has no word timing information).

jamescridland · 2026-03-09T00:28:25Z

jamescridland
Mar 9, 2026
Collaborator

I think there are three proposals here from @dascritch - let's see if I can help unpack them:

1. "Use WebVTT standard"

I agree with this part of the proposal. I'd like to propose that we retire SRT/TXT/HTML format to simplify the specification.

WebVTT is a standard for browsers, which support VTT files out of the box for video, and support VTT files for audio quite simply as well. Here's use of an AUDIO player with VTT support.

SRT files are much less well supported by the web.

If you have to have an SRT file for your application, then it's an easy transform from a VTT file. They're almost identical in nature.

Similarly, a TXT or HTML transcription can be built from a VTT file as well.
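As a rough illustration of how small those transforms are, here is a minimal Python sketch that turns a simple VTT file into SRT and into plain text. The function names and the sample cues are assumptions for the example; the sketch drops voice and inline tags, and ignores cue settings, styles and regions, so it is not a complete converter.

```python
import re

SAMPLE = """WEBVTT

1
00:16.500 --> 00:18.500
<v Dean Martin>When the moon hits your eye

2
00:18.500 --> 00:20.500
<v Dean Martin>Like a <00:19.000>big-a <00:19.500>pizza <00:20.000>pie
"""


def split_cues(vtt_text: str):
    """Yield (start, end, text) for each cue, with all tags removed."""
    blocks = re.split(r"\n\s*\n", vtt_text.replace("\r\n", "\n").strip())
    for block in blocks:
        lines = block.split("\n")
        timing = next((i for i, line in enumerate(lines) if "-->" in line), None)
        if timing is None:
            continue  # header, NOTE or STYLE block
        start, _, rest = lines[timing].partition("-->")
        end = rest.split()[0]  # ignore cue settings after the end time
        text = "\n".join(lines[timing + 1:])
        yield start.strip(), end, re.sub(r"<[^>]+>", "", text)


def srt_time(stamp: str) -> str:
    """WebVTT may omit hours and uses '.', SRT wants hh:mm:ss,mmm."""
    if stamp.count(":") == 1:
        stamp = "00:" + stamp
    return stamp.replace(".", ",")


def vtt_to_srt(vtt_text: str) -> str:
    """Renumber the cues and rewrite the timestamps in SRT style."""
    cues = []
    for number, (start, end, text) in enumerate(split_cues(vtt_text), start=1):
        cues.append(f"{number}\n{srt_time(start)} --> {srt_time(end)}\n{text}")
    return "\n\n".join(cues) + "\n"


def vtt_to_text(vtt_text: str) -> str:
    """One line of plain text per cue, without timing."""
    return "\n".join(text.replace("\n", " ") for _, _, text in split_cues(vtt_text))


if __name__ == "__main__":
    print(vtt_to_srt(SAMPLE))
    print(vtt_to_text(SAMPLE))
```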

Removing complexity from the podcast:transcript specification would enable this feature to be more effective, since publishers would clearly be told to produce ONE file format, and consumers would only have to deal with that one format. The specification at the moment is messy and complex and needn't be.

2. "Use WebVTT chapters"

WebVTT has chapter support. However, podcasting uses four chapter formats currently - "chapters in descriptions" (Apple, YouTube, Spotify); "chapters in podlove format" (Spotify?); "JSON chapters" (Apple); ID3 tags (Apple).

I would be keen to avoid adding a fifth chapter format without a clear understanding of the benefits. I don't believe that chapters are in use by browser implementations.

3. "Improve the "cooked-up" JSON file"

I do see the benefit of offering a word-by-word format (which VTT isn't). Is there any prior art in word-by-word format? Should we be aligning with a standard?

Next steps

Do we split out the three parts of this proposal?

5 replies

nathangathright Mar 10, 2026

FYI, VTT does support word-by-word formatting via an inline timestamp tag which would allow us to retire JSON transcripts as well.

1
00:16.500 --> 00:18.500
<v Dean Martin>When the moon <00:17.500>hits your eye

2
00:18.500 --> 00:20.500
<v Dean Martin>Like a <00:19.000>big-a <00:19.500>pizza <00:20.000>pie

3
00:20.500 --> 00:21.500
<v Dean Martin>That's <00:21.000>amore
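For an app that wants to drive highlighting from cues like these, a minimal Python sketch of reading the inline timestamps might look as follows. The function names are assumptions for illustration; the sketch strips voice and formatting tags and returns each run of text with the time at which it becomes active.

```python
import re

# Inline timestamp tags such as <00:19.000> or <01:02:03.000>.
INLINE_STAMP = re.compile(r"<((?:\d+:)?\d{2}:\d{2}\.\d{3})>")
# Any other tag (<v Name>, <b>, </b>, ...); these start with a letter.
OTHER_TAG = re.compile(r"</?[a-z][^>]*>")


def to_seconds(stamp: str) -> float:
    """Convert mm:ss.ttt or hh:mm:ss.ttt to seconds."""
    parts = stamp.split(":")
    hours = int(parts[-3]) if len(parts) == 3 else 0
    return hours * 3600 + int(parts[-2]) * 60 + float(parts[-1])


def timed_segments(cue_start: str, payload: str) -> list[tuple[float, str]]:
    """Split a cue payload into (start_seconds, text) segments."""
    text = OTHER_TAG.sub("", payload)
    # re.split with one capturing group alternates text and timestamps:
    # [text, stamp, text, stamp, text, ...]
    pieces = INLINE_STAMP.split(text)
    segments = []
    first = pieces[0].strip()
    if first:
        # Text before the first inline timestamp starts with the cue itself.
        segments.append((to_seconds(cue_start), first))
    for stamp, chunk in zip(pieces[1::2], pieces[2::2]):
        chunk = chunk.strip()
        if chunk:
            segments.append((to_seconds(stamp), chunk))
    return segments


if __name__ == "__main__":
    payload = "<v Dean Martin>Like a <00:19.000>big-a <00:19.500>pizza <00:20.000>pie"
    for start, text in timed_segments("00:18.500", payload):
        print(start, text)
```

Granularity is whatever the author chose: in the second cue above, "Like a" is a single segment because no timestamp separates the two words.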

samsethi Mar 10, 2026

TrueFans currently supports both HTML/SRT input and JSON output for word highlighting. As James said above, "I'd like to propose that we retire SRT/TXT/HTML format to simplify the specification." We would like to switch to (Web)VTT.

Admin view of transcripts:

User view of transcripts with person label, timestamps and highlighting.

Besides Apple supporting VTT is there another good reason to deprecate the other formats?

jamescridland Mar 10, 2026
Collaborator

Besides Apple supporting VTT is there another good reason to deprecate the other formats?

(Apple supports SRT as well, let's not forget.)

But:
a) WebVTT is supported by browsers.
b) It's a simple transform if you really need an SRT for something, but supporting two/three/four formats makes it complicated for podcast apps and for podcast producers.
c) Simplification is a good thing for all parts of the Podcasting 2.0 spec, and this is low-hanging fruit.

Deprecating support for SRT, TXT and HTML makes sense purely for simplification purposes.

If we're comfortable that inline timing tags in VTT are good, then we can also deprecate that JSON format as well. (I'd be interested to know the takeup of that format, too).

Marzal Mar 11, 2026
Collaborator

We already have:

Want to support only one format? WebVTT is used by Apple Podcasts for ingest, and also natively supported by web browsers. Because the WebVTT format is the most flexible, it's an ideal choice if you can only support one format

So I would do something like this:
#764
And showcase the recommended format more, and leave the old ones as used in the wild.

I still think that the TXT variant has meaning, as a literal transcription to read, instead of subs to accompany the audio. I've seen some hosts and podcasts using it to embed or to link on the episode webpage. I do the same.

PS: The PR is just an example.

jamescridland Mar 11, 2026
Collaborator

We already have:

Well, I wrote that. But it isn't part of the spec; it's part of the implementation notes.

Anyway, let's move to #764 for this specific proposal.

There are two other parts here:

"Use WebVTT chapters", which I'd like to suggest adds complication, and doesn't remove it, so, please, no

and "the cooked-up JSON file". I don't really see any specific reason why it needs to be changed for now.

dascritch · 2026-03-11T10:50:45Z

dascritch
Mar 11, 2026
Author

"Use WebVTT chapters", which I'd like to suggest adds complication

Which complications? It works perfectly in browsers and webviews, which fire events. I've been using WebVTT for my JS player for years, and it's far more practical to use those native standards than to recreate a whole event controller that would inevitably be buggy.

The BBC, on a test site (now retired and no longer available) for their archived "Tomorrow's World" TV show, even used those native chapter events to fire chapter highlights and trigger graphics.

We're only in a “Not Invented Here” syndrome.

2 replies

jamescridland Mar 11, 2026
Collaborator

We're only in a “Not Invented Here” syndrome.

If you read anything I write, you'll know that I have no patience for "not invented here" people.

WebVTT has chapter support. However, podcasting uses four chapter formats currently - "chapters in descriptions" (Apple, YouTube, Spotify); "chapters in podlove format" (Spotify?); "JSON chapters" (Apple); ID3 tags (Apple).

I would be keen to avoid adding a fifth chapter format without a clear understanding of the benefits, which are unclear to me.

dascritch Mar 11, 2026
Author

Simple: native browser support, and so immediate availability in any native app using a webview. It's also audited and standardized by a renowned organization: the W3C. I haven't tested yet whether it's used by accessibility tools, but I wouldn't be surprised if it already is.

I'm also dealing with both WebVTT and ID3 tags, but as I'm using WebVTT as the reference in my workflow to create my ID3 tags during mp3/mp4 encoding, I'm not even thinking about it. Here is my Python reformatter: https://github.com/cpuprogramme/cpu-15/blob/master/bin/mp3chaps_from_webvtt.py
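The same single-source idea is not tied to ID3. As a hedged sketch (this is not the linked script), a chapter list already extracted from a WebVTT file could just as well be serialised to the namespace's JSON chapters layout. The `version`, `chapters`, `startTime` and `title` keys follow that format as I understand it; the function name and the sample chapters are assumptions.

```python
import json


def chapters_to_json(chapters: list[tuple[float, str]]) -> str:
    """Serialise (start_seconds, title) pairs as a JSON chapters document."""
    document = {
        "version": "1.2.0",
        "chapters": [
            {"startTime": start, "title": title}
            for start, title in sorted(chapters)  # keep chapters in time order
        ],
    }
    return json.dumps(document, indent=2)


if __name__ == "__main__":
    # Hypothetical chapters, e.g. as read from a WebVTT chapters file.
    print(chapters_to_json([(331.0, "Morning"), (0.0, "Intro")]))
```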

nathangathright · 2026-03-12T03:27:56Z

nathangathright
Mar 12, 2026

I may be missing something here, so I want to sanity check my understanding.

I share the interest in supporting web standards, but as I understand it, the browsers only recognize chapter title text, not the richer metadata we already have in ID3 tags and the JSON format. As I understand it, the only way to include things like images, links, hidden toc markers, would be metadata cues, but those don't share the same broad support.

So I think the practical question is:

Are we talking about adopting plain chapter VTTs, knowing they only cover titles/navigation?
Or are we talking about adopting a new metadata-VTT profile for podcast chapters?

If I’m missing existing support for rich metadata VTT chapter payloads, I’d genuinely love to see examples. That may change the tradeoff quite a bit.
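To make the second option concrete, here is a purely hypothetical Python sketch of what a "metadata-VTT profile" could mean for a consumer: each cue payload is either a plain title or a JSON object carrying richer fields. No such profile exists in the WebVTT specification or the namespace; the field names (`title`, `img`, `url`) and the function name are assumptions for illustration only.

```python
import json


def parse_chapter_payload(payload: str) -> dict:
    """Read a cue payload as JSON metadata, falling back to a plain title."""
    payload = payload.strip()
    if payload.startswith("{"):
        try:
            data = json.loads(payload)
        except json.JSONDecodeError:
            data = None
        # Only accept objects that at least carry a title.
        if isinstance(data, dict) and "title" in data:
            return data
    return {"title": payload}


if __name__ == "__main__":
    print(parse_chapter_payload("Intro"))
    print(parse_chapter_payload('{"title": "Morning", "img": "https://example.com/morning.jpg"}'))
```

A native chapters track would show the raw JSON string as the chapter title, so a profile like this would need its own track kind or its own file, which is part of the tradeoff being asked about.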

0 replies

J-J-Chiarella · 2026-05-19T15:17:49Z

J-J-Chiarella
May 19, 2026

I think that chapters should be in the RSS feed itself. The only times to link to external media are the podcast audio itself, the image for channel (and the item, if different), and transcripts (which can be very long).

Allowing for much more than only chapters, potentially being able to shift some just-in-time metadata to an external file instead of overloading the RSS feed.

Maybe I do not understand the likelihood, implications, or length of the potential JIT metadata, but wouldn’t this same argument also hold for everything in the RSS file?

As it is, episode descriptions are usually far longer than 00:00 Intro / 05:31 Morning <image:https://images.com/morning.jpg>/ 12:59 Midday <image:https://images.com/midday.jpg> / 23:01 Evening <image:https://images.com/evening.jpg>. Right?
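For comparison, the "chapters in descriptions" convention can be read with a few lines of Python. This is a minimal sketch assuming one `mm:ss Title` or `hh:mm:ss Title` entry per line; the function name and sample text are illustrative, and the `<image:...>` notation above is not handled because it is this comment's own shorthand rather than an existing convention.

```python
import re

# A line that starts with mm:ss or hh:mm:ss followed by a title.
CHAPTER_LINE = re.compile(
    r"^[ \t]*(?:(\d{1,2}):)?(\d{1,2}):(\d{2})[ \t]+(.+?)[ \t]*$",
    re.MULTILINE,
)


def chapters_from_description(description: str) -> list[tuple[int, str]]:
    """Return (start_seconds, title) for every timestamped line."""
    chapters = []
    for hours, minutes, seconds, title in CHAPTER_LINE.findall(description):
        start = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)
        chapters.append((start, title))
    return chapters


if __name__ == "__main__":
    notes = "Show notes here.\n\n00:00 Intro\n05:31 Morning\n12:59 Midday\n23:01 Evening\n"
    print(chapters_from_description(notes))
```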

The link to an external text file that itself writes another link to an image and contains less overall text than a description . . . I do not understand the justification for the technical hurdles. (Just means that I do not get the reasons, not that the reasons do not exist.)

A major caveat, however: Does the current way of doing chapters (external file) help in accessibility in the same way that transcripts do? Could they?

The insistence of YouTube videos and Spotify podcasts for info in descriptions, the Podlove tag style (my preference probably), Apple’s longtime insistence on the tag alone (I like both file-based ID3 tags for the standalone files after download and for the RSS feed that is read “live” by podcast apps), and general confusion over making JSON (not friendly to non-techie people) and metadata and WebVTT . . . it is currently overwhelming me.

I am missing something, and I would appreciate learning what the plans or predictions may be. If, as James Cridland wrote above, the project wants to avoid NIH Syndrome, then what was the impetus behind the creation of a JSON-based standard instead of doing what YouTube, Spotify (and others?) did with chapters in the description? Again, I am just unable to see it. Please forgive my inability to see the greater implications. I am also unable to see it “in action.” I readily admit a chicken-egg problem exists where low support in apps means low implementation by podcast creators, which itself results in low support in apps.

4 replies

developerzeke May 19, 2026

The reason for extracting it all out is for bandwidth.

Large hosts are always concerned about bandwidth. Adding 15+ lines to 100K+ feeds that get requested thousands of times every day is a big deal.

from Buzzsprout's Kevin Finn

J-J-Chiarella May 19, 2026

I thought about that angle, but if the description is allowed to be so long, shouldn’t a cap exist on description as a first-order task? I do not doubt the need to cut down on bandwidth, and I advocate a www-less, no-trailing-slash world, but should we also do something about description and, worse yet, the many podcasts that redundantly copy the entire description into content:encoded? Thinking aloud in part.

In any case, thank you for the link to the discussion.

developerzeke May 19, 2026

True, I'd assume the description takes up the majority of the feed bandwidth. But since the description is necessary, the hosts have to deal with it (and most probably enforce their own character limits, e.g. matching Apple's max length). Chapters and other P2.0 features aren't required, so if they use up too much bandwidth the hosts will just not adopt them.

nathangathright May 19, 2026

Episode show notes are considered a table-stakes feature. Hosts were likely making a reasonable trade-off. Eliminating churn due to complaints like, "my feed from foo.fm doesn't show the description in bar.app, I'm going to switch hosts" probably pays for itself rather quickly over begging an app developer to properly handle either/both tags. Historically, chapters have not risen to that degree of importance. Most hosts haven't decided to take on the burden of ensuring that every listener in every app can enjoy your chapters by including multiple redundant formats per episode.

J-J-Chiarella · 2026-05-19T15:24:31Z

J-J-Chiarella
May 19, 2026

Apologies for not writing this in the original post . . .

I am still too old-fashioned to read threaded and sub-threaded as well as I do flat forums. A detail I liked, and compliments to the chef, er, thinker:

The proposal to bake chapter information into WebVTT files

It seems that the transcripts are a thousand times better for accessibility, and that such a standard change could encourage people who create transcript files with chapter information to also put in subtitles. (My hope is that people think Why not, eh?) I will allow that this is one area where GenAI saves energy today, as manufacturing time-stamps for captions means your computers run way longer than they otherwise would. (People just have to skim/proof the transcripts so that proper nouns are correct.)

6 replies

J-J-Chiarella May 19, 2026

Good points and thank you for the link. I suppose it may now come to a point of choosing between (1) to establish a de facto standard for a “more comprehensive” WebVTT and see if standards bodies (e.g., W3C) ratify it and (2) to stick to the more customizable JSON.

I would only add that the Open Document Foundation did the former with the various ODF (e.g., .odt). Openoffice.org and then LibreOffice would implement extra features in the software program, and then later that implementation may get an update in OASIS and then ISO/IEC. The overview on Wikipedia is not bad: https://en.wikipedia.org/wiki/OpenDocument_standardization.

J-J-Chiarella May 19, 2026

AFAICT, if we want any of those features, we'd need to define our own schema for VTT metadata cue payloads, which is just JSON stuffed into VTT, defeating the point.

It defeats the point as far as shortening the total files and whether to go through the work of changing the standard, yes, but some podcast hosts do not provide a way to upload or host JSON files (although I have seen a quick uptick in transcripts among hosts). I still hold out hope that jamming this into the file on the podcast:transcript tag could encourage more people to implement subtitles for language learners and the hard of hearing. (Yes, I know I am veering into wish-list territory with that last bit.)

nathangathright May 19, 2026

If a host doesn't support any method to provide chapters, then they don't see it as a risk to their business. I struggle to understand how advocating for a slightly different chapter spec in the namespace would change that.

If it's important to you, vote with your wallet.

J-J-Chiarella May 19, 2026

I was saying that it does not necessarily “sweeten the deal” for hosts if the chapter spec is its own file versus in the RSS feed. The pitch seems to fall flat to me. The pitch seems to be “Hey, podcast hosts, hello! We know that it is too much bandwidth to include chapter information in the RSS feed. Therefore, how about you create a new file-hosting service for chapters and integrate some way for podcast creators to create JSON files into your UI? Why not?”

The alternative pitch is this: “Hey, podcast hosts, hello! We know you are facing pressure to have transcripts available and may even, for legal reasons, have to comply. We will not ask anything from you. Podcast creators may have those transcript files (WebVTT) be slightly larger.”

Nobody likes to see his hard work go to waste. Therefore, why not salvage it by expanding the specification of WebVTT? I could be dead wrong, but I think that the mass adoption of transcripts will make more people include actual transcriptions in those files. They may not have the timing etc., but having some text file, some downloadable text file, is better than putting up the transcript in HTML on some webpage.

If I misinterpreted the advantages in the current deployment or anything like that, I am sorry. I do know that I have serious bias toward transcripts both for language learners and the hard of hearing. Please excuse my stubbornness, but I am just trying to better explain myself, not argue against anyone. Nuance gets lost in text.

nathangathright May 21, 2026

I think we may be working from different assumptions here.

If podcast:chapters added support for WebVTT chapters, that would still be a standalone chapters file referenced by podcast:chapters; it would not be the same file as a WebVTT transcript referenced by podcast:transcript. Under the current tag model, hosts would still need to build support for a separate chapters resource either way.

I'm also not sure the namespace's current format needs any help pitching to hosts. If they care about chapters of any kind, we all win. Apple's adoption of Podcasting 2.0 chapters is a massive success for the namespace, but it doesn't come at the cost of support for other formats.
