The nuances of base64 encoding strings in JavaScript

Matt Joseph

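Base64 encoding and decoding is a common way to transform binary content so that it can be represented as web-safe text. It is commonly used for data URLs, such as inline images.

What happens when you apply base64 encoding and decoding to strings in JavaScript? This post explores the nuances and the common pitfalls to avoid.

btoa() and atob()

The core functions for base64 encoding and decoding in JavaScript are btoa() and atob(). btoa() goes from a string to a base64-encoded string, and atob() decodes it back.

The following is a quick example: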

// A really plain string that is just code points below 128.
const asciiString = 'hello';

// This will work. It will print:
// Encoded string: [aGVsbG8=]
const asciiStringEncoded = btoa(asciiString);
console.log(`Encoded string: [${asciiStringEncoded}]`);

// This will work. It will print:
// Decoded string: [hello]
const asciiStringDecoded = atob(asciiStringEncoded);
console.log(`Decoded string: [${asciiStringDecoded}]`);

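Unfortunately, as noted in the MDN docs, this only works with strings that contain ASCII characters, or characters that can be represented by a single byte. In other words, it does not work with arbitrary Unicode text.

To see what happens, try the following code: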

// Sample string that represents a combination of small, medium, and large code points.
// This sample string is valid UTF-16.
// 'hello' has code points that are each below 128.
// '⛳' is a single 16-bit code unit.
// '❤️' is two 16-bit code units, U+2764 and U+FE0F (a heart and a variation selector).
// '🧀' is a code point above U+FFFF (U+1F9C0), which UTF-16 represents as the surrogate pair of two 16-bit code units '\ud83e\uddc0'.
const validUTF16String = 'hello⛳❤️🧀';

// This will not work. It will print:
// DOMException: Failed to execute 'btoa' on 'Window': The string to be encoded contains characters outside of the Latin1 range.
try {
  const validUTF16StringEncoded = btoa(validUTF16String);
  console.log(`Encoded string: [${validUTF16StringEncoded}]`);
} catch (error) {
  console.log(error);
}

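Any one of the emojis in the string causes an error. Why does Unicode cause this problem?

To understand, take a step back and look at strings, both in computer science and in JavaScript.

Strings in Unicode and JavaScript

Unicode is the current global standard for character encoding, that is, the practice of assigning a number to a specific character so that it can be used in computer systems. For a deeper dive into Unicode, see this W3C article.

Here are some examples of Unicode characters and their associated numbers: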

h - 104
ñ - 241
❤ - 10084 (hexadecimal 2764)
❤️ - 10084, with a hidden modifier numbered 65039
⛳ - 9971
🧀 - 129472

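The numbers representing each character are called "code points". You can think of a code point as the address of a character. The red heart emoji actually consists of two code points: one for the heart and one variation selector that "varies" its appearance so that it is always shown as the red emoji.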
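As a quick way to see these numbers yourself, the following sketch lists the code points of a few of the characters above. The helper name codePointsOf is made up for this illustration; it relies only on string iteration and String.prototype.codePointAt(). Note that 10084 is the decimal form of hexadecimal 2764.

```javascript
// Hypothetical helper: returns the decimal code points in a string.
// Spreading a string iterates by code point, not by 16-bit code unit.
function codePointsOf(str) {
  return [...str].map((ch) => ch.codePointAt(0));
}

console.log(codePointsOf('h'));            // [104]
console.log(codePointsOf('\u00F1'));       // ñ: [241]
console.log(codePointsOf('\u26F3'));       // ⛳: [9971]
console.log(codePointsOf('\u2764\uFE0F')); // ❤️: [10084, 65039]
console.log(codePointsOf('\u{1F9C0}'));    // 🧀: [129472]
```

The escape sequences are used only so that the exact code points are visible in the source; pasting the characters directly gives the same result.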

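Unicode has two common ways of taking these code points and turning them into sequences of bytes that computers can interpret consistently: UTF-8 and UTF-16.

A simplified view is this:

In UTF-8, a code point can use between one and four bytes (8 bits per byte).
In UTF-16, a code unit is always two bytes (16 bits). Most common characters fit in one code unit; larger code points need two, as described later in this post.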
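To make the size difference concrete, here is a small sketch that measures the same characters in both encodings. The helper name sizes is an assumption for this example; TextEncoder always produces UTF-8, and String.prototype.length counts 16-bit UTF-16 code units.

```javascript
const encoder = new TextEncoder();

// Hypothetical helper: reports how much space a string takes in each encoding.
function sizes(str) {
  return {
    utf8Bytes: encoder.encode(str).length, // TextEncoder always produces UTF-8
    utf16CodeUnits: str.length,            // .length counts 16-bit code units
  };
}

console.log(sizes('h'));         // utf8Bytes: 1, utf16CodeUnits: 1
console.log(sizes('\u00F1'));    // ñ: utf8Bytes: 2, utf16CodeUnits: 1
console.log(sizes('\u26F3'));    // ⛳: utf8Bytes: 3, utf16CodeUnits: 1
console.log(sizes('\u{1F9C0}')); // 🧀: utf8Bytes: 4, utf16CodeUnits: 2
```

The last line shows where the simplified view stops being accurate: the cheese emoji takes two 16-bit code units in UTF-16.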

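Importantly, JavaScript processes strings as UTF-16. This breaks functions like btoa(), which effectively operate on the assumption that each character in the string maps to a single byte. This is stated explicitly on MDN:

The btoa() method creates a Base64-encoded ASCII string from a binary string (i.e., a string in which each character in the string is treated as a byte of binary data).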
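The boundary described in that quote can be checked directly. The following sketch, which assumes an environment where btoa() is a global (browsers and recent Node.js versions), shows that a character with a value up to 255 is accepted while anything larger is rejected.

```javascript
// 'ñ' is U+00F1 (241), which fits in one byte, so btoa() accepts it.
console.log(btoa('\u00F1')); // 8Q==

// '⛳' is U+26F3 (9971), which does not fit in one byte, so btoa() throws.
let errorName = null;
try {
  btoa('\u26F3');
} catch (error) {
  errorName = error.name;
}
console.log(errorName); // InvalidCharacterError
```

Note that 8Q== is the base64 form of the single byte 0xF1, not of the two UTF-8 bytes for ñ, so even when btoa() succeeds on such a string the result may not be what a UTF-8 consumer expects.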

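Now that you know that characters in JavaScript often need more than one byte, the next section shows how to handle this case for base64 encoding and decoding.

btoa() and atob() with Unicode

As you now know, the error is thrown because the string contains characters whose UTF-16 code units do not fit in a single byte.

Fortunately, the MDN article on base64 includes helpful sample code for solving this "Unicode problem". You can adapt that code to work with the preceding example: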

// From https://developer.mozilla.org/en-US/docs/Glossary/Base64#the_unicode_problem.
function base64ToBytes(base64) {
  const binString = atob(base64);
  return Uint8Array.from(binString, (m) => m.codePointAt(0));
}

// From https://developer.mozilla.org/en-US/docs/Glossary/Base64#the_unicode_problem.
function bytesToBase64(bytes) {
  const binString = String.fromCodePoint(...bytes);
  return btoa(binString);
}

// Sample string that represents a combination of small, medium, and large code points.
// This sample string is valid UTF-16.
// 'hello' has code points that are each below 128.
// '⛳' is a single 16-bit code unit.
// '❤️' is two 16-bit code units, U+2764 and U+FE0F (a heart and a variation selector).
// '🧀' is a code point above U+FFFF (U+1F9C0), which UTF-16 represents as the surrogate pair of two 16-bit code units '\ud83e\uddc0'.
const validUTF16String = 'hello⛳❤️🧀';

// This will work. It will print:
// Encoded string: [aGVsbG/im7PinaTvuI/wn6eA]
const validUTF16StringEncoded = bytesToBase64(new TextEncoder().encode(validUTF16String));
console.log(`Encoded string: [${validUTF16StringEncoded}]`);

// This will work. It will print:
// Decoded string: [hello⛳❤️🧀]
const validUTF16StringDecoded = new TextDecoder().decode(base64ToBytes(validUTF16StringEncoded));
console.log(`Decoded string: [${validUTF16StringDecoded}]`);

The following steps explain what this code does to encode the string:

Use the TextEncoder interface to take the UTF-16 encoded JavaScript string and convert it to a stream of UTF-8 encoded bytes using TextEncoder.encode().
This returns a Uint8Array, which is a less commonly used data type in JavaScript and a subclass of TypedArray.
Take that Uint8Array and pass it to the bytesToBase64() function, which uses String.fromCodePoint() to treat each byte in the Uint8Array as a code point and create a string from it. The result is a string of code points that can each be represented as a single byte.
Take that string and use btoa() to base64 encode it.
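The encoding steps above can be traced with their intermediate values. This is only an illustrative walk-through on a one-character input (ñ); the variable names are chosen for this sketch.

```javascript
const input = '\u00F1'; // ñ

// TextEncoder: UTF-16 string -> UTF-8 bytes in a Uint8Array.
const bytes = new TextEncoder().encode(input);
console.log(Array.from(bytes)); // [195, 177]

// String.fromCodePoint(): one character per byte.
const binString = String.fromCodePoint(...bytes);
console.log(binString.length); // 2

// Every character in the intermediate string fits in a single byte.
const allSingleByte = [...binString].every((ch) => ch.codePointAt(0) < 256);
console.log(allSingleByte); // true

// btoa() now only sees single-byte characters, so it does not throw.
const encoded = btoa(binString);
console.log(encoded); // w7E=
```

One character of input became two UTF-8 bytes, and those two bytes became a two-character intermediate string before base64 encoding.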

The decoding process is the same thing, but in reverse.
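As a sketch of that reverse direction, the following traces the decoding of w7E=, which is the base64 form of the UTF-8 bytes for ñ. The variable names are again specific to this illustration.

```javascript
const encoded = 'w7E='; // base64 of the UTF-8 bytes for 'ñ'

// atob() yields a binary string: one character per original byte.
const binString = atob(encoded);
console.log(binString.length); // 2

// Turn each character back into its byte value.
const bytes = Uint8Array.from(binString, (ch) => ch.codePointAt(0));
console.log(Array.from(bytes)); // [195, 177]

// Interpret the bytes as UTF-8 to recover the JavaScript string.
const decoded = new TextDecoder().decode(bytes);
console.log(decoded === '\u00F1'); // true
```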

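This works because the step between the Uint8Array and the string guarantees that, although a JavaScript string is represented as UTF-16 with two-byte code units, the code point that each code unit represents is always less than 256, so it fits in a single byte.

This code works well in most circumstances, but it fails silently in others.

Silent failure case

Use the same code, but with a different string: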

// From https://developer.mozilla.org/en-US/docs/Glossary/Base64#the_unicode_problem.
function base64ToBytes(base64) {
  const binString = atob(base64);
  return Uint8Array.from(binString, (m) => m.codePointAt(0));
}

// From https://developer.mozilla.org/en-US/docs/Glossary/Base64#the_unicode_problem.
function bytesToBase64(bytes) {
  const binString = String.fromCodePoint(...bytes);
  return btoa(binString);
}

// Sample string that represents a combination of small, medium, and large code points.
// This sample string is invalid UTF-16.
// 'hello' has code points that are each below 128.
// '⛳' is a single 16-bit code unit.
// '❤️' is two 16-bit code units, U+2764 and U+FE0F (a heart and a variation selector).
// '🧀' is a code point above U+FFFF (U+1F9C0), which UTF-16 represents as the surrogate pair of two 16-bit code units '\ud83e\uddc0'.
// '\uDE75' is a code unit that is one half of a surrogate pair.
const partiallyInvalidUTF16String = 'hello⛳❤️🧀\uDE75';

// This will work. It will print:
// Encoded string: [aGVsbG/im7PinaTvuI/wn6eA77+9]
const partiallyInvalidUTF16StringEncoded = bytesToBase64(new TextEncoder().encode(partiallyInvalidUTF16String));
console.log(`Encoded string: [${partiallyInvalidUTF16StringEncoded}]`);

// This will work. It will print:
// Decoded string: [hello⛳❤️🧀�]
const partiallyInvalidUTF16StringDecoded = new TextDecoder().decode(base64ToBytes(partiallyInvalidUTF16StringEncoded));
console.log(`Decoded string: [${partiallyInvalidUTF16StringDecoded}]`);

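If you take the last character after decoding (�) and check its hex value, you'll find that it is \uFFFD rather than the original \uDE75. Nothing fails or throws an error, but the data has silently changed between input and output. Why?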
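One way to check that hex value is sketched below. It repeats the same round trip inline on a shorter string, and lastUnitHex is a hypothetical helper written for this example.

```javascript
const original = 'hello\uDE75';

// Same round trip as above, written inline.
const encoded = btoa(String.fromCodePoint(...new TextEncoder().encode(original)));
const decoded = new TextDecoder().decode(
  Uint8Array.from(atob(encoded), (ch) => ch.codePointAt(0))
);

// Hypothetical helper: hex value of the last UTF-16 code unit in a string.
const lastUnitHex = (str) => str.charCodeAt(str.length - 1).toString(16);

console.log(lastUnitHex(original)); // de75
console.log(lastUnitHex(decoded));  // fffd
console.log(original === decoded);  // false
```

Comparing the input with the round-tripped output, as in the last line, is a simple way to detect this kind of silent change in tests.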

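Strings vary by JavaScript API

As described earlier, JavaScript processes strings as UTF-16. But UTF-16 strings have a unique property.

Take the cheese emoji as an example. The emoji (🧀) has a Unicode code point of 129472. Unfortunately, the maximum value of a 16-bit number is 65535. So how does UTF-16 represent this much larger number?

UTF-16 has a concept called surrogate pairs. You can think of it this way:

The first number in the pair specifies which "book" to search in. This is called a "surrogate".
The second number in the pair is the entry in the "book".

As you might imagine, it can be a problem to have only the number that identifies the book, without the actual entry in that book. In UTF-16, this is known as a lone surrogate.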
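The "book" and "entry" numbers can be computed by hand. The sketch below uses the standard UTF-16 arithmetic for code points above 65535; the function name toSurrogatePair is an assumption for this example.

```javascript
// Hypothetical helper: splits a code point above 0xFFFF into its surrogate pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10);  // the "book"
  const low = 0xdc00 + (offset & 0x3ff); // the entry in the "book"
  return [high, low];
}

const [high, low] = toSurrogatePair(129472); // 🧀, U+1F9C0
console.log(high.toString(16), low.toString(16)); // d83e ddc0

// The pair reassembles into the cheese emoji.
console.log(String.fromCharCode(high, low) === '\u{1F9C0}'); // true

// Either half on its own is a lone surrogate: a one-unit string with no valid code point.
console.log('\uD83E'.length); // 1
```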

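This is particularly challenging in JavaScript, because some APIs work despite lone surrogates while others fail.

In this case, you are using TextDecoder when decoding back from base64. In particular, the default for TextDecoder's fatal option is documented as follows:

It defaults to false, which means that the decoder will substitute malformed data with a replacement character.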
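The difference between the two settings can be seen with a deliberately malformed byte sequence. This is a minimal sketch; the input bytes are chosen for illustration, and it assumes a runtime whose TextDecoder supports the fatal option.

```javascript
// 0xFF can never appear in valid UTF-8, so this input is malformed.
const malformed = new Uint8Array([0x68, 0x69, 0xff]); // 'h', 'i', invalid byte

// Default (fatal: false): the invalid byte silently becomes U+FFFD.
const lenient = new TextDecoder().decode(malformed);
console.log(lenient === 'hi\uFFFD'); // true

// fatal: true: the same input throws a TypeError instead.
let threw = false;
try {
  new TextDecoder('utf-8', { fatal: true }).decode(malformed);
} catch (error) {
  threw = error instanceof TypeError;
}
console.log(threw); // true
```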

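The � character you observed earlier, represented as \uFFFD in hex, is that replacement character. In UTF-16, strings with lone surrogates are considered "malformed" or "not well formed".

There are various web standards (examples 1, 2, 3, 4) that specify exactly when a malformed string affects API behavior, and notably TextDecoder is one of those APIs. TextEncoder is another: its encode() method replaces a lone surrogate with U+FFFD before producing bytes, which is why the encoded output above already ends in 77+9 (the bytes EF BF BD). It is good practice to make sure strings are well formed before doing text processing.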
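To illustrate how differently APIs treat the same malformed string, the following sketch compares plain string operations with TextEncoder.encode() on a lone surrogate. The variable names are specific to this example.

```javascript
const loneSurrogate = '\uDE75';

// Plain string operations accept the lone surrogate unchanged.
console.log(loneSurrogate.length);                     // 1
console.log(loneSurrogate.charCodeAt(0).toString(16)); // de75

// TextEncoder.encode() replaces it with U+FFFD, whose UTF-8 bytes are EF BF BD.
const bytes = new TextEncoder().encode(loneSurrogate);
const hex = Array.from(bytes, (b) => b.toString(16)).join(' ');
console.log(hex); // ef bf bd
```

In other words, in the round trip shown earlier the original code unit is already gone by the time the bytes reach btoa().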

Check for well-formed strings

Recent browsers now have a function for this purpose: isWellFormed().

You can achieve a similar outcome by using encodeURIComponent(), which throws a URIError if the string contains a lone surrogate.
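A short sketch of that behavior, using one well-formed string and one string with a lone surrogate:

```javascript
// A well-formed string encodes without error.
console.log(encodeURIComponent('\u26F3')); // %E2%9B%B3

// A lone surrogate makes encodeURIComponent() throw a URIError.
let caught = null;
try {
  encodeURIComponent('hello\uDE75');
} catch (error) {
  caught = error;
}
console.log(caught instanceof URIError); // true
```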

The following function uses isWellFormed() if it is available and encodeURIComponent() if it is not. Similar code could be used to create a polyfill for isWellFormed().

// Quick polyfill since older browsers do not support isWellFormed().
// encodeURIComponent() throws an error for lone surrogates, which is essentially the same.
function isWellFormed(str) {
  if (typeof str.isWellFormed === 'function') {
    // Use the newer isWellFormed() feature.
    return str.isWellFormed();
  } else {
    // Use the older encodeURIComponent().
    try {
      encodeURIComponent(str);
      return true;
    } catch (error) {
      return false;
    }
  }
}

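Put it all together

Now that you know how to handle both Unicode and lone surrogates, you can put everything together to create code that handles all cases and does so without silent text replacement.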

// From https://developer.mozilla.org/en-US/docs/Glossary/Base64#the_unicode_problem.
function base64ToBytes(base64) {
  const binString = atob(base64);
  return Uint8Array.from(binString, (m) => m.codePointAt(0));
}

// From https://developer.mozilla.org/en-US/docs/Glossary/Base64#the_unicode_problem.
function bytesToBase64(bytes) {
  const binString = String.fromCodePoint(...bytes);
  return btoa(binString);
}

// Quick polyfill since older browsers do not support isWellFormed().
// encodeURIComponent() throws an error for lone surrogates, which is essentially the same.
function isWellFormed(str) {
  if (typeof str.isWellFormed === 'function') {
    // Use the newer isWellFormed() feature.
    return str.isWellFormed();
  } else {
    // Use the older encodeURIComponent().
    try {
      encodeURIComponent(str);
      return true;
    } catch (error) {
      return false;
    }
  }
}

const validUTF16String = 'hello⛳❤️🧀';
const partiallyInvalidUTF16String = 'hello⛳❤️🧀\uDE75';

if (isWellFormed(validUTF16String)) {
  // This will work. It will print:
  // Encoded string: [aGVsbG/im7PinaTvuI/wn6eA]
  const validUTF16StringEncoded = bytesToBase64(new TextEncoder().encode(validUTF16String));
  console.log(`Encoded string: [${validUTF16StringEncoded}]`);

  // This will work. It will print:
  // Decoded string: [hello⛳❤️🧀]
  const validUTF16StringDecoded = new TextDecoder().decode(base64ToBytes(validUTF16StringEncoded));
  console.log(`Decoded string: [${validUTF16StringDecoded}]`);
} else {
  // Not reached in this example.
}

if (isWellFormed(partiallyInvalidUTF16String)) {
  // Not reached in this example.
} else {
  // This is not a well-formed string, so we handle that case.
  console.log(`Cannot process a string with lone surrogates: [${partiallyInvalidUTF16String}]`);
}

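There are many optimizations that can be made to this code, such as generalizing it into a polyfill or changing the TextDecoder parameters so that it throws instead of silently replacing lone surrogates.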
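As one possible direction, here is a minimal sketch of a pair of helpers that throw rather than replace. The names encodeBase64Strict and decodeBase64Strict and the error message are assumptions made for this example, not part of any standard API.

```javascript
// Hypothetical helpers; the names and error messages are illustrative only.

// True when the string has no lone surrogates.
function isWellFormed(str) {
  if (typeof str.isWellFormed === 'function') return str.isWellFormed();
  try {
    encodeURIComponent(str);
    return true;
  } catch (error) {
    return false;
  }
}

// Encodes a string as base64 of its UTF-8 bytes, rejecting lone surrogates.
function encodeBase64Strict(str) {
  if (!isWellFormed(str)) {
    throw new TypeError('Cannot encode a string with lone surrogates');
  }
  const bytes = new TextEncoder().encode(str);
  return btoa(String.fromCodePoint(...bytes));
}

// Decodes base64 to a string, throwing if the bytes are not valid UTF-8.
function decodeBase64Strict(base64) {
  const bytes = Uint8Array.from(atob(base64), (ch) => ch.codePointAt(0));
  return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
}

const roundTrip = decodeBase64Strict(encodeBase64Strict('hello\u26F3'));
console.log(roundTrip === 'hello\u26F3'); // true
```

With these helpers, a string such as 'hello\uDE75' is rejected at encoding time, and base64 input that does not decode to valid UTF-8 is rejected at decoding time, so neither direction alters data silently.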

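With this knowledge and code, you can also make explicit decisions about how to handle malformed strings, such as rejecting the data, explicitly enabling data replacement, or throwing an error for later analysis.

In addition to being a valuable example of base64 encoding and decoding, this post shows why careful text processing matters, especially when the text data comes from user-generated or external sources.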