Media accessibility

Derek Herman

Joe Medley

Published: August 20, 2020

Captions and screen reader descriptions are the only way many users can experience your videos, and in some jurisdictions, they're even required by law or regulation. The WebVTT (Web Video Text Tracks) format is used to describe timed text data, such as closed captions or subtitles, to make your videos more accessible.

Add `<track>` tags

To add captions or screen reader descriptions to a web video, add a <track> tag within a <video> tag. In addition to captions and screen reader descriptions, <track> tags may also be used for subtitles and chapter titles.

Screenshot showing captions displayed using the track element in Chrome on Android.

The <track> tag is similar to the <source> element in that both have a src attribute that points to referenced content. For a <track> tag, it points to a WebVTT file. The label attribute specifies how a particular track can be identified in the interface.

To provide tracks for multiple languages, add a separate <track> tag for each WebVTT file you're providing and indicate the language using the srclang attribute.

Take a look at this example <video> tag with two <track> tags, each added as a child of the <video> element.

<video controls>
  <source src="https://storage.googleapis.com/webfundamentals-assets/videos/chrome.webm" type="video/webm">
  <source src="https://storage.googleapis.com/webfundamentals-assets/videos/chrome.mp4" type="video/mp4">
  <track src="chrome-subtitles-en.vtt" label="English captions" kind="captions" srclang="en" default>
  <track src="chrome-subtitles-zh.vtt" label="中文字幕" kind="captions" srclang="zh">
  <p>This browser does not support the video element.</p>
</video>

There's also a sample you can view on Glitch.
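If your pages are generated from data, the <track> tags can be produced from a list of track descriptions. The following is a minimal sketch, not a definitive implementation: the TrackInfo shape and the renderTrackTag and escapeAttribute helpers are hypothetical names made up for this illustration, and the file names are the ones used in the example above.

```typescript
// Hypothetical shape mirroring the attributes of a <track> tag.
interface TrackInfo {
  src: string;      // URL of the WebVTT file
  label: string;    // name shown to the user in the player interface
  kind: string;     // for example "captions" or "subtitles"
  srclang: string;  // language of the track text, for example "en"
  isDefault?: boolean;
}

// Escape characters that would otherwise break out of a quoted attribute.
function escapeAttribute(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;");
}

// Build the markup for one <track> tag.
function renderTrackTag(track: TrackInfo): string {
  const attributes: string[] = [
    'src="' + escapeAttribute(track.src) + '"',
    'label="' + escapeAttribute(track.label) + '"',
    'kind="' + escapeAttribute(track.kind) + '"',
    'srclang="' + escapeAttribute(track.srclang) + '"',
  ];
  if (track.isDefault) {
    attributes.push("default");
  }
  return "<track " + attributes.join(" ") + ">";
}

// One entry per WebVTT file, one language each.
const demoTracks: TrackInfo[] = [
  { src: "chrome-subtitles-en.vtt", label: "English captions", kind: "captions", srclang: "en", isDefault: true },
  { src: "chrome-subtitles-zh.vtt", label: "中文字幕", kind: "captions", srclang: "zh" },
];

const trackMarkup: string = demoTracks.map(renderTrackTag).join("\n");
console.log(trackMarkup);
```

The resulting string is meant to be placed inside a <video> tag, after the <source> tags.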

WebVTT file structure

Here's a hypothetical WebVTT file for the demo. This is a text file that starts with the line WEBVTT, followed by a series of cues. Each cue is a block of text to display on screen, and the time range during which it's displayed.

WEBVTT

00:00.000 --> 00:04.999
Man sitting on a tree branch, using a laptop.

00:05.000 --> 00:08.000
The branch breaks, and he starts to fall.

...

Each item within the track file is a cue. Each cue has a start time and end time, separated by an arrow, followed by cue text. Cues can also have IDs, placed on the line before the cue times, such as railroad and manuscript in the following example. Cues are separated by an empty line.

WEBVTT

railroad
00:00:10.000 --> 00:00:12.500
Left uninspired by the crust of railroad earth

manuscript
00:00:13.200 --> 00:00:16.900
that touched the lead to the pages of your manuscript.
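The structure shown in these samples can be sketched as a small serializer. This is an illustration under stated assumptions rather than a complete WebVTT writer: CueSpec and buildWebVtt are hypothetical names, the times are expected to be already formatted, and nothing here checks that IDs or cue text are free of blank lines or arrows.

```typescript
// Hypothetical description of a single cue.
interface CueSpec {
  id?: string;    // optional cue ID, written on the line before the times
  start: string;  // already formatted start time, for example "00:00:10.000"
  end: string;    // already formatted end time
  text: string;   // cue text to display
}

// Serialize cues into the text of a .vtt file: the WEBVTT line first,
// then one block per cue, with an empty line between blocks.
function buildWebVtt(cues: CueSpec[]): string {
  const blocks: string[] = ["WEBVTT"];
  for (const cue of cues) {
    const lines: string[] = [];
    if (cue.id !== undefined) {
      lines.push(cue.id);
    }
    lines.push(cue.start + " --> " + cue.end);
    lines.push(cue.text);
    blocks.push(lines.join("\n"));
  }
  return blocks.join("\n\n") + "\n";
}

const sampleFile: string = buildWebVtt([
  { id: "railroad", start: "00:00:10.000", end: "00:00:12.500", text: "Left uninspired by the crust of railroad earth" },
  { id: "manuscript", start: "00:00:13.200", end: "00:00:16.900", text: "that touched the lead to the pages of your manuscript." },
]);
console.log(sampleFile);
```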

Cue times are in hours:minutes:seconds.milliseconds format. The hours part is optional, which is why the first sample above uses the shorter minutes:seconds.milliseconds form. Parsing is strict, so numbers must be zero padded where necessary: hours (when present) must have at least two digits, minutes and seconds must have exactly two digits (00 for a zero value), and milliseconds must have exactly three digits (000 for a zero value). Live WebVTT Validator is an excellent tool that checks for errors in time formatting and for problems such as non-sequential times.
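The padding rules are easy to get wrong when generating cue times by hand, so here is a minimal sketch of a formatter and a format check. The names zeroPad, formatTimestamp, and isValidTimestamp are hypothetical, and the pattern reflects my reading of the format: hours are treated as optional because the first sample file above omits them. It's a quick sanity check, not a replacement for a real validator.

```typescript
// Left-pad a non-negative whole number with zeros to the given width.
function zeroPad(value: number, width: number): string {
  let digits = String(value);
  while (digits.length < width) {
    digits = "0" + digits;
  }
  return digits;
}

// Convert a whole number of milliseconds into hours:minutes:seconds.milliseconds.
function formatTimestamp(totalMs: number): string {
  if (!isFinite(totalMs) || totalMs < 0 || Math.floor(totalMs) !== totalMs) {
    throw new RangeError("expected a non-negative whole number of milliseconds");
  }
  const ms = totalMs % 1000;
  const totalSeconds = (totalMs - ms) / 1000;
  const seconds = totalSeconds % 60;
  const totalMinutes = (totalSeconds - seconds) / 60;
  const minutes = totalMinutes % 60;
  const hours = (totalMinutes - minutes) / 60;
  return zeroPad(hours, 2) + ":" + zeroPad(minutes, 2) + ":" +
    zeroPad(seconds, 2) + "." + zeroPad(ms, 3);
}

// Optional hours (two or more digits), two-digit minutes and seconds
// in the range 00-59, and exactly three digits of milliseconds.
const TIMESTAMP_PATTERN = /^(?:\d{2,}:)?[0-5]\d:[0-5]\d\.\d{3}$/;

function isValidTimestamp(text: string): boolean {
  return TIMESTAMP_PATTERN.test(text);
}

console.log(formatTimestamp(13200));         // "00:00:13.200"
console.log(isValidTimestamp("0:13.2"));     // false, not zero padded
```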

You can create a VTT file by hand, though there are many services that create them for you.

As you can see in our previous examples, the WebVTT format is pretty simple. Just add your text data along with timing.

However, what if you want your captions to render in a different position, with left or right alignment? You might want to align the captions with the current speaker's position, or keep them out of the way of in-camera text. WebVTT defines settings to do that, and more, directly inside the .vtt file. Take note of how the caption placement is defined by adding settings after the time interval definitions.

WEBVTT

00:00:05.000 --> 00:00:10.000 line:0 position:20% size:60% align:start
The first line of the subtitles.
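To make the layout of that line concrete, here is a minimal sketch that pulls the name:value settings out of a cue's timing line. The parseCueSettings function is a hypothetical helper written for this illustration. It only splits the text and does not check that the setting names or values are ones the format actually accepts.

```typescript
// Extract the settings that follow the end time on a cue timing line.
// Returns an empty object when the line has no arrow or no settings.
function parseCueSettings(timingLine: string): { [setting: string]: string } {
  const settings: { [setting: string]: string } = {};
  const arrowIndex = timingLine.indexOf("-->");
  if (arrowIndex === -1) {
    return settings;
  }
  // After the arrow: the end time first, then whitespace-separated settings.
  const tokens = timingLine.slice(arrowIndex + 3).trim().split(/\s+/);
  for (const token of tokens.slice(1)) {
    const colonIndex = token.indexOf(":");
    // Keep only tokens with text on both sides of the first colon.
    if (colonIndex > 0 && colonIndex < token.length - 1) {
      settings[token.slice(0, colonIndex)] = token.slice(colonIndex + 1);
    }
  }
  return settings;
}

const placement = parseCueSettings(
  "00:00:05.000 --> 00:00:10.000 line:0 position:20% size:60% align:start"
);
console.log(placement);
```

In the sample cue, this yields four settings: line, position, size, and align.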

Another handy feature is the ability to style cues using CSS. Perhaps you want to use a gray linear gradient as the background and a foreground color of papayawhip for all captions, with all bold text colored peachpuff.

video::cue {
  background-image: linear-gradient(to bottom, dimgray, lightgray);
  color: papayawhip;
}

video::cue(b) {
  color: peachpuff;
}

If you're interested in learning more about styling and tagging of individual cues, the WebVTT specification is a good source for advanced examples.

Kinds of text tracks

Did you notice the kind attribute of the <track> element? It's used to indicate what relation the particular text track has to the video. The possible values of the kind attribute are:

captions: For closed captions, that is, transcriptions and possibly translations of any audio. Suitable for users who are deaf or hard of hearing, and for cases when the video is playing muted.
subtitles: For subtitles, that is, translations of speech and text in a language different from the main language of the video.
descriptions: For descriptions of the visual parts of the video content. Suitable for users who are blind or have low vision.
chapters: For chapter titles, intended to be displayed when the user is navigating within the video.
metadata: Not visible, and may be used by scripts.
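If you handle kind values in your own scripts, it helps to know what happens when the attribute is missing or misspelled. The sketch below encodes the five values from the list above and, as I understand the HTML specification, the two fallbacks: a <track> tag with no kind attribute is treated as subtitles, and an unrecognized value is treated as metadata. TRACK_KINDS, TrackKind, and effectiveKind are hypothetical names for this illustration.

```typescript
// The five values listed above.
const TRACK_KINDS = ["captions", "subtitles", "descriptions", "chapters", "metadata"] as const;
type TrackKind = typeof TRACK_KINDS[number];

// Work out which kind applies for a raw attribute value.
// Pass null when the attribute is absent from the tag.
function effectiveKind(attribute: string | null): TrackKind {
  if (attribute === null) {
    return "subtitles"; // assumed fallback for a missing attribute
  }
  const lowered = attribute.toLowerCase();
  for (const kind of TRACK_KINDS) {
    if (kind === lowered) {
      return kind;
    }
  }
  return "metadata"; // assumed fallback for an unrecognized value
}

console.log(effectiveKind("captions"));
console.log(effectiveKind(null));
console.log(effectiveKind("caption"));
```

The practical consequence, if that reading is right, is that a typo in the kind attribute can silently hide a track from viewers, so it's worth checking the value when captions don't appear.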

Now that you understand the basics of making a video available and accessible on your web page, you might wonder about more complex use cases. Learn about Media frameworks and how they can help you add videos to your web page, while providing advanced features.