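Media Source Extensions for Audio

Dale Curtis

Introduction

Media Source Extensions (MSE) provide extended buffering and playback control for the HTML5 <audio> and <video> elements. While originally developed to facilitate Dynamic Adaptive Streaming over HTTP (DASH) based video players, below we'll see how they can be used for audio; specifically for gapless playback.

You've likely listened to a music album where songs flowed seamlessly across tracks; you may even be listening to one right now. Artists create these gapless playback experiences both as an artistic choice and as an artifact of vinyl records and CDs, where audio was written as one continuous stream. Unfortunately, due to the way modern audio codecs like MP3 and AAC work, this seamless aural experience is often lost today.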

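We'll get into the details of why below, but for now let's start with a demonstration. Below is the first thirty seconds of the excellent Sintel, chopped into five separate MP3 files and reassembled using MSE. The red lines indicate gaps introduced during the creation (encoding) of each MP3; you'll hear glitches at these points.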

Demo

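Yuck! That's not a great experience; we can do better. With a little more work, using the exact same MP3 files as in the demo above, we can use MSE to remove those annoying gaps. The green lines in the next demo indicate where the files have been joined and the gaps removed. On Chrome 38+ this will play back seamlessly!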

Demo

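There are a variety of ways to create gapless content. For the purposes of this demo, we'll focus on the type of files a normal user might have lying around, where each file has been encoded separately without regard for the audio segments before or after it.

Basic Setup

First, let's backtrack and cover the basic setup of a MediaSource instance. Media Source Extensions, as the name implies, are just extensions to the existing media elements. Below, we're assigning an Object URL, representing our MediaSource instance, to the src attribute of an audio element, just as you would set a standard URL.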

var audio = document.createElement('audio');
var mediaSource = new MediaSource();
var SEGMENTS = 5;

mediaSource.addEventListener('sourceopen', function () {
  var sourceBuffer = mediaSource.addSourceBuffer('audio/mpeg');

  function onAudioLoaded(data, index) {
    // Append the ArrayBuffer data into our new SourceBuffer.
    sourceBuffer.appendBuffer(data);
  }

  // Retrieve an audio segment via XHR.  For simplicity, we're retrieving the
  // entire segment at once, but we could also retrieve it in chunks and append
  // each chunk separately.  MSE will take care of assembling the pieces.
  GET('sintel/sintel_0.mp3', function (data) {
    onAudioLoaded(data, 0);
  });
});

audio.src = URL.createObjectURL(mediaSource);

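Once the MediaSource object is connected, it will perform some initialization and eventually fire a sourceopen event, at which point we can create a SourceBuffer. In the example above, we're creating an audio/mpeg one, which is able to parse and decode our MP3 segments; there are several other types available.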
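The GET() helper is called in the example above but is not defined in this article. The following is a minimal sketch of what such a helper might look like, assuming it wraps XMLHttpRequest and hands the response to the callback as an ArrayBuffer. The optional third parameter is a hypothetical addition that only exists so the constructor can be swapped out when no browser is available.

```javascript
// Sketch of a GET() helper (an assumption; the article does not show it).
// XhrCtor is a hypothetical, optional parameter used only for testing.
function GET(url, callback, XhrCtor) {
  var Ctor = XhrCtor || XMLHttpRequest;
  var xhr = new Ctor();
  xhr.open('GET', url, true);

  // Ask for raw bytes so the result can be passed straight to appendBuffer().
  xhr.responseType = 'arraybuffer';

  xhr.onload = function () {
    if (xhr.status !== 200) {
      console.error('Unexpected status ' + xhr.status + ' for ' + url);
      return;
    }
    callback(xhr.response);
  };
  xhr.send();
}
```

In a browser it would be called exactly as in the listing above: GET('sintel/sintel_0.mp3', function (data) { ... }).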

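Anomalous Waveforms

We'll come back to the code in a moment, but let's now look more closely at the file we've just appended, specifically at the end of it. Below is a graph of the last 3000 samples, averaged across both channels, from the sintel_0.mp3 track. Each pixel on the red line is a floating point sample in the range [-1.0, 1.0].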

[Figure: waveform of the last 3000 samples of sintel_0.mp3, showing the gap]

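What's with all those zero (silent) samples? They're actually due to compression artifacts introduced during encoding. Almost every encoder introduces some type of padding. In this case LAME added exactly 576 padding samples to the end of the file.

In addition to the padding at the end, each file also had padding added to the beginning. If we peek ahead at the sintel_1.mp3 track, we'll see that another 576 samples of padding exist at the front. The amount of padding varies by encoder and content, but we know the exact values based on metadata included within each file.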
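To put those sample counts in terms of time, here is a small worked calculation. The 44100 Hz sample rate is an assumption for illustration; the real rate should be read from the media itself.

```javascript
// Convert a count of audio samples (per channel) into seconds.
function samplesToSeconds(sampleCount, sampleRate) {
  return sampleCount / sampleRate;
}

// ASSUMPTION: 44100 Hz is used for illustration only.
var SAMPLE_RATE = 44100;

// 576 samples of padding on one edge of a file.
var edgePadding = samplesToSeconds(576, SAMPLE_RATE);

// End padding of one file plus front padding of the next.
var junctionGap = samplesToSeconds(576 + 576, SAMPLE_RATE);

console.log('per edge: ' + (edgePadding * 1000).toFixed(2) + ' ms');
console.log('per junction: ' + (junctionGap * 1000).toFixed(2) + ' ms');
```

Under that assumption each padded edge is roughly 13 ms long, so every junction between two files carries about 26 ms of silence, which is short but clearly audible in continuous music.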

[Figure: waveform of the MP3 gap at the start of sintel_1.mp3]

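The sections of silence at the beginning and end of each file are what cause the glitches between segments in the previous demo. To achieve gapless playback, we need to remove these sections of silence. Luckily, this is easily done with MediaSource. Below, we'll modify our onAudioLoaded() method to use an append window and a timestamp offset to remove this silence.

Example Code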

function onAudioLoaded(data, index) {
  // Parsing gapless metadata is unfortunately non-trivial and a bit messy, so
  // we'll gloss over it here; see the appendix for details.
  // ParseGaplessData() will return a dictionary with two elements:
  //
  //    audioDuration: Duration in seconds of all non-padding audio.
  //    frontPaddingDuration: Duration in seconds of the front padding.
  //
  var gaplessMetadata = ParseGaplessData(data);

  // Each segment must be appended relative to the previous one.  To avoid any
  // overlaps, we'll use the end timestamp of the last append as the starting
  // point for our next append or zero if we haven't appended anything yet.
  var appendTime = index > 0 ? sourceBuffer.buffered.end(0) : 0;

  // Simply put, an append window allows you to trim off audio (or video) frames
  // which fall outside of a specified time range.  Here, we'll use the end of
  // our last append as the start of our append window and the end of the real
  // audio data for this segment as the end of our append window.
  sourceBuffer.appendWindowStart = appendTime;
  sourceBuffer.appendWindowEnd = appendTime + gaplessMetadata.audioDuration;

  // The timestampOffset field essentially tells MediaSource where in the media
  // timeline the data given to appendBuffer() should be placed.  I.e., if the
  // timestampOffset is 1 second, the appended data will start 1 second into
  // playback.
  //
  // MediaSource requires that the media timeline starts from time zero, so we
  // need to ensure that the data left after filtering by the append window
  // starts at time zero.  We'll do this by shifting all of the padding we want
  // to discard before our append time (and thus, before our append window).
  sourceBuffer.timestampOffset =
    appendTime - gaplessMetadata.frontPaddingDuration;

  // When appendBuffer() completes, it will fire an updateend event signaling
  // that it's okay to append another segment of media.  Here, we'll chain the
  // append for the next segment to the completion of our current append.
  if (index == 0) {
    sourceBuffer.addEventListener('updateend', function () {
      if (++index < SEGMENTS) {
        GET('sintel/sintel_' + index + '.mp3', function (data) {
          onAudioLoaded(data, index);
        });
      } else {
        // We've loaded all available segments, so tell MediaSource there are no
        // more buffers which will be appended.
        mediaSource.endOfStream();
        URL.revokeObjectURL(audio.src);
      }
    });
  }

  // appendBuffer() will now use the timestamp offset and append window settings
  // to filter and timestamp the data we're appending.
  //
  // Note: While this demo uses very little memory, more complex use cases need
  // to be careful about memory usage or garbage collection may remove ranges of
  // media in unexpected places.
  sourceBuffer.appendBuffer(data);
}
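To make the arithmetic in onAudioLoaded() concrete, the sketch below computes the append window and timestamp offset for a sequence of segments without touching any media APIs. It assumes that buffered.end(0) after each append equals the previous append window end, and the durations in the example are hypothetical values, not measurements from the Sintel files.

```javascript
// For each segment, compute the three values onAudioLoaded() assigns before
// calling appendBuffer(). All durations are in seconds.
function planAppends(segments) {
  var appendTime = 0;
  return segments.map(function (segment) {
    var plan = {
      appendWindowStart: appendTime,
      appendWindowEnd: appendTime + segment.audioDuration,
      // Shift the front padding so it lands before the append window.
      timestampOffset: appendTime - segment.frontPaddingDuration
    };
    // ASSUMPTION: the buffered range ends exactly at the append window end.
    appendTime = plan.appendWindowEnd;
    return plan;
  });
}

// Hypothetical metadata: 6.5 s of audio with 576 samples of front padding
// at an assumed 44100 Hz.
var examplePlans = planAppends([
  { audioDuration: 6.5, frontPaddingDuration: 576 / 44100 },
  { audioDuration: 6.5, frontPaddingDuration: 576 / 44100 }
]);
console.log(examplePlans);
```

Note that the first timestampOffset is slightly negative: the front padding is placed before time zero, where the append window discards it.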

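A Seamless Waveform

Let's see what our shiny new code has accomplished by taking another look at the waveform after we've applied our append windows. Below, you can see that the silent section at the end of sintel_0.mp3 (in red) and the silent section at the beginning of sintel_1.mp3 (in blue) have been removed, leaving us with a seamless transition between segments.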

[Figure: waveform across the join between sintel_0.mp3 and sintel_1.mp3]

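Conclusion

With that, we've stitched all five segments seamlessly into one and have subsequently reached the end of our demo. Before we go, you may have noticed that our onAudioLoaded() method has no consideration for containers or codecs. That means all of these techniques will work irrespective of the container or codec type. Below you can replay the original demo with DASH-ready fragmented MP4 instead of MP3.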

Demo
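Since only the type passed to addSourceBuffer() changes between the MP3 and MP4 versions, one way to switch containers is a small lookup keyed on the file extension. This is a sketch, not the demo's code: the table and the function name are hypothetical, and the 'mp4a.40.2' codec string assumes the M4A files contain AAC-LC.

```javascript
// ASSUMPTION: illustrative extension-to-type table. 'mp4a.40.2' is the codec
// string for AAC-LC; adjust it if the files use a different AAC profile.
var MIME_BY_EXTENSION = {
  mp3: 'audio/mpeg',
  mp4: 'audio/mp4; codecs="mp4a.40.2"'
};

// isTypeSupported is passed in so the function can run outside a browser.
function mimeTypeForUrl(url, isTypeSupported) {
  var extension = url.split('.').pop().toLowerCase();
  if (!Object.prototype.hasOwnProperty.call(MIME_BY_EXTENSION, extension)) {
    throw new Error('No MIME type known for: ' + url);
  }
  var mimeType = MIME_BY_EXTENSION[extension];
  if (!isTypeSupported(mimeType)) {
    throw new Error('Unsupported type: ' + mimeType);
  }
  return mimeType;
}
```

In a browser the second argument would typically be function (type) { return MediaSource.isTypeSupported(type); }, and the result would be passed to mediaSource.addSourceBuffer().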

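If you'd like to know more, check the appendices below for a deeper look at gapless content creation and metadata parsing. You can also explore gapless.js for a closer look at the code powering this demo.

Thanks for reading!

Appendix A: Creating Gapless Content

Creating gapless content can be hard to get right. Below we'll walk through the creation of the Sintel media used in this demo. To start, you'll need a copy of the lossless FLAC soundtrack for Sintel; for posterity, the SHA1 is included below. For tools, you'll need FFmpeg, MP4Box, LAME, and an OS X installation with afconvert.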

    unzip Jan_Morgenstern-Sintel-FLAC.zip
    sha1sum 1-Snow_Fight.flac
    # 0535ca207ccba70d538f7324916a3f1a3d550194  1-Snow_Fight.flac

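First, we'll split out the first 31.5 seconds of the 1-Snow_Fight.flac track. We also want to add a 2.5 second fade out starting at 28 seconds in, to avoid any clicks once playback finishes. Using the FFmpeg command line below, we can accomplish all of this and put the results in sintel.flac.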

    ffmpeg -i 1-Snow_Fight.flac -t 31.5 -af "afade=t=out:st=28:d=2.5" sintel.flac

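Next, we'll split the file into 5 Wave files of at most 6.5 seconds each; it's easiest to use Wave since almost every encoder supports ingesting it. Again, we can do this precisely with FFmpeg, after which we'll have sintel_0.wav, sintel_1.wav, sintel_2.wav, sintel_3.wav, and sintel_4.wav.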

    ffmpeg -i sintel.flac -acodec pcm_f32le -map 0 -f segment \
           -segment_list out.list -segment_time 6.5 sintel_%d.wav
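As a quick check on the segment count, the sketch below computes the expected segment lengths for a 31.5 second input cut every 6.5 seconds. It assumes the cuts fall exactly on multiples of the segment time, which is an idealization of what the segmenter does.

```javascript
// List the duration of each segment when a file of totalSeconds is cut
// every segmentSeconds. ASSUMPTION: cuts land exactly on the boundaries.
function segmentDurations(totalSeconds, segmentSeconds) {
  var durations = [];
  for (var start = 0; start < totalSeconds; start += segmentSeconds) {
    durations.push(Math.min(segmentSeconds, totalSeconds - start));
  }
  return durations;
}

console.log(segmentDurations(31.5, 6.5));
```

Under that assumption there are four full 6.5 second segments followed by a shorter 5.5 second one, which accounts for the five output files.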

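Next, let's create the MP3 files. LAME has several options for creating gapless content. If you're in control of the content, you might consider using --nogap with a batch encoding of all files to avoid padding between segments altogether. For the purposes of this demo, though, we want that padding, so we'll use a standard high quality VBR encoding of the Wave files.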

    lame -V=2 sintel_0.wav sintel_0.mp3
    lame -V=2 sintel_1.wav sintel_1.mp3
    lame -V=2 sintel_2.wav sintel_2.mp3
    lame -V=2 sintel_3.wav sintel_3.mp3
    lame -V=2 sintel_4.wav sintel_4.mp3

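That's all that's necessary to create the MP3 files. Now let's cover the creation of the fragmented MP4 files. We'll follow Apple's directions for creating media that is mastered for iTunes. Below, we'll convert the Wave files into intermediate CAF files, per the instructions, before encoding them as AAC in an MP4 container using the recommended parameters.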

    afconvert sintel_0.wav sintel_0_intermediate.caf -d 0 -f caff \
              --soundcheck-generate
    afconvert sintel_1.wav sintel_1_intermediate.caf -d 0 -f caff \
              --soundcheck-generate
    afconvert sintel_2.wav sintel_2_intermediate.caf -d 0 -f caff \
              --soundcheck-generate
    afconvert sintel_3.wav sintel_3_intermediate.caf -d 0 -f caff \
              --soundcheck-generate
    afconvert sintel_4.wav sintel_4_intermediate.caf -d 0 -f caff \
              --soundcheck-generate
    afconvert sintel_0_intermediate.caf -d aac -f m4af -u pgcm 2 --soundcheck-read \
              -b 256000 -q 127 -s 2 sintel_0.m4a
    afconvert sintel_1_intermediate.caf -d aac -f m4af -u pgcm 2 --soundcheck-read \
              -b 256000 -q 127 -s 2 sintel_1.m4a
    afconvert sintel_2_intermediate.caf -d aac -f m4af -u pgcm 2 --soundcheck-read \
              -b 256000 -q 127 -s 2 sintel_2.m4a
    afconvert sintel_3_intermediate.caf -d aac -f m4af -u pgcm 2 --soundcheck-read \
              -b 256000 -q 127 -s 2 sintel_3.m4a
    afconvert sintel_4_intermediate.caf -d aac -f m4af -u pgcm 2 --soundcheck-read \
              -b 256000 -q 127 -s 2 sintel_4.m4a

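We now have several M4A files which we need to fragment appropriately before they can be used with MediaSource. For our purposes, we'll use a fragment size of one second. MP4Box will write out each fragmented MP4 as sintel_#_dashinit.mp4, along with an MPEG-DASH manifest (sintel_#_dash.mpd) which can be discarded.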

    MP4Box -dash 1000 sintel_0.m4a && mv sintel_0_dashinit.mp4 sintel_0.mp4
    MP4Box -dash 1000 sintel_1.m4a && mv sintel_1_dashinit.mp4 sintel_1.mp4
    MP4Box -dash 1000 sintel_2.m4a && mv sintel_2_dashinit.mp4 sintel_2.mp4
    MP4Box -dash 1000 sintel_3.m4a && mv sintel_3_dashinit.mp4 sintel_3.mp4
    MP4Box -dash 1000 sintel_4.m4a && mv sintel_4_dashinit.mp4 sintel_4.mp4
    rm sintel_{0,1,2,3,4}_dash.mpd

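That's it! We now have fragmented MP4 and MP3 files with the correct metadata necessary for gapless playback. See Appendix B for more details on just what that metadata looks like.

Appendix B: Parsing Gapless Metadata

Just like creating gapless content, parsing the gapless metadata can be tricky since there's no standard method for storage. Below we'll cover how the two most common encoders, LAME and iTunes, store their gapless metadata. Let's start by setting up some helper methods and an outline for the ParseGaplessData() used above.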

    // Since most MP3 encoders store the gapless metadata in binary, we'll need a
    // method for turning bytes into integers.  Note: This doesn't work for values
    // larger than 2^30 since we'll overflow the signed integer type when shifting.
    function ReadInt(buffer) {
      var result = buffer.charCodeAt(0);
      for (var i = 1; i < buffer.length; ++i) {
        result <<= 8;
        result += buffer.charCodeAt(i);
      }
      return result;
    }

    function ParseGaplessData(arrayBuffer) {
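      // Gapless data is generally within the first 512 bytes, so limit parsing.
      // Map each byte to one character so charCodeAt() yields raw byte values;
      // decoding as UTF-8 would corrupt bytes >= 0x80 and shift the offsets.
      var byteStr = String.fromCharCode.apply(
          null, new Uint8Array(arrayBuffer.slice(0, 512)));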

      var frontPadding = 0, endPadding = 0, realSamples = 0;

      // ... we'll fill this in as we go below.
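As an alternative to ReadInt(), which works on a string and is limited by signed 32-bit shifts, the bytes can be read directly from the ArrayBuffer. The following is a sketch using DataView; readUintBE is a hypothetical name and is not part of the demo code.

```javascript
// Read a big-endian unsigned integer of 1 to 4 bytes from an ArrayBuffer.
function readUintBE(arrayBuffer, byteOffset, byteLength) {
  var view = new DataView(arrayBuffer, byteOffset, byteLength);
  var result = 0;
  for (var i = 0; i < byteLength; ++i) {
    // Multiply instead of shifting so the result never becomes negative.
    result = result * 256 + view.getUint8(i);
  }
  return result;
}

// Example: the four bytes 00 00 08 40 decode to 0x840.
var exampleBytes = new Uint8Array([0x00, 0x00, 0x08, 0x40]).buffer;
console.log(readUintBE(exampleBytes, 0, 4));
```

Because it never converts the data to text, this approach sidesteps character encoding concerns entirely; the trade-off is that tag searches such as indexOf('Xing') then have to be done on bytes rather than on a string.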

We'll cover Apple's iTunes metadata format first since it's the easiest to parse and explain. In MP3 and M4A files, iTunes (and afconvert) write a short section in ASCII like so:

    iTunSMPB[ 26 bytes ]0000000 00000840 000001C0 0000000000046E00

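This is written inside an ID3 tag within the MP3 container and within a metadata atom inside the MP4 container. For our purposes, we can ignore the first 0000000 token. The next three tokens are the front padding, end padding, and total non-padding sample count. Dividing each of these by the sample rate of the audio gives us the duration for each.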

// iTunes encodes the gapless data as hex strings like so:
//
//    'iTunSMPB[ 26 bytes ]0000000 00000840 000001C0 0000000000046E00'
//    'iTunSMPB[ 26 bytes ]####### frontpad  endpad    real samples'
//
// The approach here elides the complexity of actually parsing MP4 atoms. It
// may not work for all files without some tweaks.
var iTunesDataIndex = byteStr.indexOf('iTunSMPB');
if (iTunesDataIndex != -1) {
  var frontPaddingIndex = iTunesDataIndex + 34;
  frontPadding = parseInt(byteStr.substr(frontPaddingIndex, 8), 16);

  var endPaddingIndex = frontPaddingIndex + 9;
  endPadding = parseInt(byteStr.substr(endPaddingIndex, 8), 16);

  var sampleCountIndex = endPaddingIndex + 9;
  realSamples = parseInt(byteStr.substr(sampleCountIndex, 16), 16);
}
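As a worked example of the token layout, the sketch below decodes the sample string shown earlier by splitting on whitespace instead of using fixed offsets. The function name is hypothetical, the payload is assumed to begin at the first token, and the 44100 Hz sample rate is an assumption for illustration.

```javascript
// Decode the space-separated hex tokens that follow the iTunSMPB tag.
// ASSUMPTION: payload starts at the first (ignored) token.
function parseITunSMPBTokens(payload, sampleRate) {
  var tokens = payload.trim().split(/\s+/);
  if (tokens.length < 4) return null;

  var frontPadding = parseInt(tokens[1], 16);
  var endPadding = parseInt(tokens[2], 16);
  var realSamples = parseInt(tokens[3], 16);

  return {
    frontPadding: frontPadding,
    endPadding: endPadding,
    realSamples: realSamples,
    audioDuration: realSamples / sampleRate,
    frontPaddingDuration: frontPadding / sampleRate
  };
}

var smpbInfo = parseITunSMPBTokens(
    '0000000 00000840 000001C0 0000000000046E00', 44100);
console.log(smpbInfo);
```

For this string the front padding is 0x840 = 2112 samples, the end padding is 0x1C0 = 448 samples, and the real sample count is 0x46E00 = 290304.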

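On the flip side, most open source MP3 encoders will store the gapless metadata within a special Xing header placed inside of a silent MPEG frame (it's silent so that decoders which don't understand the Xing header will simply play silence). Sadly, this tag is not always present and has a number of optional fields. For the purposes of this demo, we have control over the media, but in practice some additional sanity checks will be required to know when gapless metadata is actually available.

First we'll parse the total sample count. For simplicity we'll read this from the Xing header, but it could be constructed from the normal MPEG audio header. Xing headers can be marked by either a Xing or Info tag. Exactly 4 bytes after this tag there are 32 bits representing the total number of frames in the file; multiplying this value by the number of samples per frame gives us the total samples in the file.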

    // Xing padding is encoded as 24bits within the header.  Note: This code will
    // only work for Layer3 Version 1 and Layer2 MP3 files with XING frame counts
    // and gapless information.  See the following document for more details:
    // http://www.codeproject.com/Articles/8295/MPEG-Audio-Frame-Header
    var xingDataIndex = byteStr.indexOf('Xing');
    if (xingDataIndex == -1) xingDataIndex = byteStr.indexOf('Info');
    if (xingDataIndex != -1) {
      // See section 2.3.1 in the link above for the specifics on parsing the Xing
      // frame count.
      var frameCountIndex = xingDataIndex + 8;
      var frameCount = ReadInt(byteStr.substr(frameCountIndex, 4));

      // For Layer3 Version 1 and Layer2 there are 1152 samples per frame.  See
      // section 2.1.5 in the link above for more details.
      var paddedSamples = frameCount * 1152;

      // ... we'll cover this below.
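One of the sanity checks mentioned above can be sketched as follows. The 4 bytes between the tag and the frame count are commonly documented as a flags field whose lowest bit indicates that the frame count is present; this sketch relies on that layout as an assumption, works on bytes rather than a string, and uses hypothetical function names.

```javascript
// Find an ASCII tag inside a byte array; returns -1 when it is absent.
function indexOfTag(bytes, tag) {
  for (var i = 0; i + tag.length <= bytes.length; ++i) {
    var matches = true;
    for (var j = 0; j < tag.length; ++j) {
      if (bytes[i + j] !== tag.charCodeAt(j)) {
        matches = false;
        break;
      }
    }
    if (matches) return i;
  }
  return -1;
}

// Returns the frame count, or null when the header or the optional
// frame-count field is missing.
function readXingFrameCount(arrayBuffer) {
  var length = Math.min(512, arrayBuffer.byteLength);
  var bytes = new Uint8Array(arrayBuffer, 0, length);
  var tagIndex = indexOfTag(bytes, 'Xing');
  if (tagIndex === -1) tagIndex = indexOfTag(bytes, 'Info');
  if (tagIndex === -1 || tagIndex + 12 > length) return null;

  var view = new DataView(arrayBuffer);
  // ASSUMPTION: bit 0 of the flags means "frame count field present".
  var flags = view.getUint32(tagIndex + 4); // DataView reads big-endian
  if ((flags & 0x1) === 0) return null;
  return view.getUint32(tagIndex + 8);
}
```

A null result signals that the caller should fall back to some other way of determining the sample count, or skip gapless trimming for that file.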

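Now that we have the total number of samples, we can move on to reading out the number of padding samples. Depending on your encoder, this may be written under a LAME or Lavf tag nested in the Xing header. Exactly 17 bytes after this tag there are 3 bytes representing the front and end padding in 12 bits each, respectively.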

        xingDataIndex = byteStr.indexOf('LAME');
        if (xingDataIndex == -1) xingDataIndex = byteStr.indexOf('Lavf');
        if (xingDataIndex != -1) {
          // See http://gabriel.mp3-tech.org/mp3infotag.html#delays for details of
          // how this information is encoded and parsed.
          var gaplessDataIndex = xingDataIndex + 21;
          var gaplessBits = ReadInt(byteStr.substr(gaplessDataIndex, 3));

          // Upper 12 bits are the front padding, lower are the end padding.
          frontPadding = gaplessBits >> 12;
          endPadding = gaplessBits & 0xFFF;
        }

        realSamples = paddedSamples - (frontPadding + endPadding);
      }

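      // NOTE: SECONDS_PER_SAMPLE is not defined in this listing; it is assumed to
      // be a constant equal to 1 / sampleRate of the media being parsed.
      return {
        audioDuration: realSamples * SECONDS_PER_SAMPLE,
        frontPaddingDuration: frontPadding * SECONDS_PER_SAMPLE
      };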
    }
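To illustrate the 12-bit split on its own, here is a small worked example. The three bytes are hypothetical, chosen so that both halves decode to 576 (0x240), and the function name is not part of the demo code.

```javascript
// Split the 24-bit padding field into its two 12-bit halves.
function unpackLamePadding(byte0, byte1, byte2) {
  var bits = (byte0 << 16) | (byte1 << 8) | byte2;
  return {
    frontPadding: bits >> 12, // upper 12 bits
    endPadding: bits & 0xFFF  // lower 12 bits
  };
}

// Hypothetical bytes 24 02 40 form the 24-bit value 0x240240.
var lamePadding = unpackLamePadding(0x24, 0x02, 0x40);
console.log(lamePadding);
```

Because each half is only 12 bits wide, neither padding value stored this way can exceed 4095 samples.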

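With that, we have a complete function for parsing the vast majority of gapless content. Edge cases surely abound, though, so caution is recommended before using similar code in production.

Appendix C: On Garbage Collection

Memory belonging to SourceBuffer instances is actively garbage collected according to content type, platform specific limits, and the current play position. In Chrome, memory will first be reclaimed from already played buffers. However, if memory usage exceeds platform specific limits, it will remove memory from unplayed buffers.

When playback reaches a gap in the timeline due to reclaimed memory, it may glitch if the gap is small enough or stall completely if the gap is too large. Neither is a great user experience, so it's important to avoid appending too much data at once and to manually remove ranges from the media timeline that are no longer necessary.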
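One way to notice such gaps is to inspect the buffered ranges of a SourceBuffer. The sketch below accepts anything shaped like a TimeRanges object (length, start(i), end(i)); the function name is hypothetical.

```javascript
// Return the gaps between consecutive buffered ranges as [start, end] pairs,
// in seconds. `buffered` is anything shaped like a TimeRanges object.
function findBufferedGaps(buffered) {
  var gaps = [];
  for (var i = 1; i < buffered.length; ++i) {
    gaps.push([buffered.end(i - 1), buffered.start(i)]);
  }
  return gaps;
}
```

In a browser it could be called as findBufferedGaps(sourceBuffer.buffered); an empty result means the buffered data forms a single contiguous range.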

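Ranges can be removed via the remove() method on each SourceBuffer, which takes a [start, end] range in seconds. Similar to appendBuffer(), each remove() will fire an updateend event once it completes. Other removes or appends should not be issued until the event fires.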
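A minimal sketch of trimming already played media might look like the following. The function name and the keepSeconds parameter are hypothetical; the sketch skips the call while the SourceBuffer is still updating and leaves it to the caller to wait for updateend before issuing anything else.

```javascript
// Remove buffered media that ends more than keepSeconds before currentTime.
// Returns true when a remove() was issued, false otherwise.
function removePlayedRange(sourceBuffer, currentTime, keepSeconds) {
  // A pending append or remove must finish (updateend) before we act.
  if (sourceBuffer.updating || sourceBuffer.buffered.length === 0) {
    return false;
  }
  var start = sourceBuffer.buffered.start(0);
  var end = currentTime - keepSeconds;
  if (end <= start) return false; // Nothing old enough to discard.

  sourceBuffer.remove(start, end);
  return true;
}
```

It could be called from a timeupdate handler, for example removePlayedRange(sourceBuffer, audio.currentTime, 10), where the 10 second margin is an arbitrary illustration.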

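On desktop Chrome, you can keep approximately 12 megabytes of audio content and 150 megabytes of video content in memory at once. You should not rely on these values across browsers or platforms; for example, they are most certainly not representative of mobile devices.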
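For a rough sense of scale, the sketch below estimates how many seconds of audio fit under a byte limit at a given bitrate. It assumes the limit applies to the encoded bytes that were appended, treats a megabyte as 1,000,000 bytes, and ignores container overhead; all three are simplifying assumptions.

```javascript
// Estimate the seconds of audio that fit within limitBytes at a constant
// bitrate. ASSUMPTION: container overhead is ignored.
function secondsThatFit(limitBytes, bitsPerSecond) {
  return (limitBytes * 8) / bitsPerSecond;
}

// 12 megabytes (taken as 12,000,000 bytes) of 256 kbps audio.
var audioSeconds = secondsThatFit(12 * 1000 * 1000, 256000);
console.log(audioSeconds + ' seconds');
```

Under those assumptions, about 375 seconds of 256 kbps audio fits, so a long album appended in one go would exceed the limit several times over.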

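Garbage collection only impacts data added to SourceBuffers; there are no limits on how much data you can keep buffered in JavaScript variables. You may also reappend the same data in the same position if necessary.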
