Font Subsetting Strategies: Content-Based vs Alphabetical

Font Subsetting Strategies: Content-Based vs Alphabetical

At 4/19/2024

A grid of tiles with the letters A-Z on them. A number of other tiles showing non-Latin glyphs are broken off from the grid

Font subsetting allows you to split a font’s characters (letters, numbers, symbols, etc.) into separate files so your visitors only download what they need. There are two main subsetting strategies that have different advantages depending on the type of site you’re building.

On a recent project, I did a deep dive into optimizing web font loading. I wanted to switch from Google Fonts to hosting the fonts ourselves in order to speed up load times and reduce external dependencies. We were using the excellent, open source font Inter. I downloaded the font files and added them to the project. But, I quickly noticed that the size of the font files I was downloading had increased drastically since switching from Google Fonts.

Modern fonts often support text in a multitude of different languages and alphabets. Inter supports 51 languages and has 2,504 characters. This is awesome, but I don’t need all of that for the site I’m building, and I don’t want my site visitors to download 317kb of font files.


Inter

This is a TrueType variable font with 2504 characters. It has 2 axes and 18 named instances. It has 35 layout features.

Characters: 2504
Glyphs: 2548
Layout features: 35
Languages: 51

Filename: Inter.var.woff2
Size: 317 KB
Format: WOFF2

Designed by: Rasmus Andersson
Manufactured by: rsms

Copyright: Copyright © 2020 The Inter Project Authors
Version: Version 3.019;git-0a5106e0b

License
This Font Software is licensed under the SIL Open Font License, Version 1.1. This license is available with a FAQ at: http://scripts.sil.org/OFL

Language support

51 languages: Afrikaans, Albanian, Azerbaijani, Basque, Belarusian, Bosnian, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, English, Estonian, Faroese, Filipino, Finnish, French, Galician, German, Greek, Hungarian, Icelandic, Indonesian, Irish, Italian, Kazakh, Kyrgyz, Latvian, Lithuanian, Macedonian, Malay, Mongolian, Norwegian Bokmål, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tongan, Turkish, Ukrainian, Uzbek, Vietnamese, Welsh, Zulu
Screenshot taken from the excellent tool Wakamai Fondue.

Google’s font files were smaller because they were subset. I wanted to do the same thing for the locally hosted version of the font. I found an excellent open source tool called GlyphhangerFootnote 1 which would allow me to split a font into smaller pieces, but I wasn’t sure what characters I should include in my subsets. I did some research and found two main strategies…

Heads up! Some font licenses do not allow you to legally modify the font you’re using. Therefore, it could be illegal for you to subset the font. You may want to review your font license, speak with the font’s creators or consult a lawyer before subsetting a font.

If you know your site content ahead of time, you can create a super-targeted subset that only contains the characters used on your site. For example, if the only text on your site was “Hello World”, you could create a subset that only contained those characters: “H”, “e”, “l”, “o”, “W”, “r”, and “d”.

Your site probably has more text than “Hello World,” but it probably has a lot less than 2,504 characters. For example, all of Alice in Wonderland only contains 75 unique characters!

!()*,-.03:;?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxyzù—‘’“”Code language: CSS (css)

All of the characters used in Alice in Wonderland

After subsetting Inter to the 75 characters in Alice in Wonderland, the file size drops from 317kb to 33kb. That’s almost one tenth the size of the original, and reduces the size of fonts loaded by 284kb!

If you want to subset based on content, you first need to determine what content is on your site. Luckily, there are a couple of prebuilt tools that can help:

  • Glyphhanger (mentioned above) can scan your website and find the specific characters used by each font.
  • Unicode Range Interchange allows you to paste in your site content and see a list of unique characters.

I won’t go too in-depth on this since you can view the links above for more instructions, but here’s an example of how to subset a font based on site content with Glyphhanger:

glyphhanger ./my-site.html --subset=Inter.var.woff2 --family='Inter, sans-serif'
Code language: Bash (bash)

Subsetting based on content allows you to make the smallest subset possible! But there’s a major downside: if your site has dynamic content there’s no good way to subset ahead of time. Even if your site is static, every time you make content changes you’ll need to re-subset your font. You can set up a build task to automate this for static sitesFootnote 2 , but for dynamic sites you’ll need to look for other options.

Luckily, there’s another approach for font subsetting that works well with dynamic content…

The strategy outlined above creates a single subset that’s as small as possible. For dynamic sites, it often makes sense to create a number of small subsets and let the browser pick and choose the ones it needs. That way, you’re providing the entire font but visitors only download the chunks that are used on the page.

This is possible because of the awesome unicode-range CSS feature. This allows you to specify that a certain font file contains certain characters. Then, the browser can check whether the page contains those characters and determine whether to download the file or not.

It’s very common to subset fonts by alphabet. For example, you could have one font file that contains the Latin alphabet, another that contains the Cyrillic alphabet, and another that contains the Greek alphabet.

The letter "A" in the latin alphabet, the letter "Ӂ" in the Cyrillic alphabet and the letter "ξ" in the Greek alphabet

You may think that your site only contains Latin characters. And you might be right! But once you’re creating a single subset, it’s not much more work to create the others.

Skipping this step now may set you up for problems in the future. If your site allows user accounts or has users fill out forms, you should definitely provide these additional alphabets. Your site visitors’ names or form input may require these characters.

We can take this strategy a step further and split large alphabets into even smaller subsets. For example, a base subset could contain commonly used letters from that alphabet and an extended subset could contain rarer letters that are less likely to be used on a page. On most pages, the browser is likely to only download the base subset, but the extended subset is there if it’s needed.

This is how Google Fonts subsets Inter. If you view Google’s CSS file for Inter, you’ll see that it contains 7 different alphabet-based subsets: Cyrillic, Cyrillic Extended, Greek, Greek Extended, Latin, Latin Extended, and Vietnamese.Footnote 3

First, we’ll need to figure out the appropriate unicode ranges for different alphabets. A unicode range tells the browser which characters a font file includes. I based my subsets off of Google Fonts but there are a number of other resources you can consult.Footnote 4

Once you’ve determined your alphabets and their unicode ranges, you can use Glyphhanger to generate your subsets. I put together a Node script to automate this process.

First, we’ll define our alphabets up front:

const alphabets = [
  {
    name: "cyrillic-ext",
    unicodeRange:
      "U+0460-052F, U+1C80-1C88, U+20B4, U+2DE0-2DFF, U+A640-A69F, U+FE2E-FE2F",
  },
  {
    name: "cyrillic",
    unicodeRange: "U+0400-045F, U+0490-0491, U+04B0-04B1, U+2116",
  },
  { name: "greek-ext", unicodeRange: "U+1F00-1FFF" },
  { name: "greek", unicodeRange: "U+0370-03FF" },
  {
    name: "vietnamese",
    unicodeRange:
      "U+0102-0103, U+0110-0111, U+0128-0129, U+0168-0169, U+01A0-01A1, U+01AF-01B0, U+1EA0-1EF9, U+20AB",
  },
  {
    name: "latin-ext",
    unicodeRange:
      "U+0100-024F, U+0259, U+1E00-1EFF, U+2020, U+20A0-20AB, U+20AD-20CF, U+2113, U+2C60-2C7F, U+A720-A7FF",
  },
  {
    name: "latin",
    unicodeRange:
      "U+0000-00FF, U+0131, U+0152-0153, U+02BB-02BC, U+02C6, U+02DA, U+02DC, U+2000-206F, U+2074, U+20AC, U+2122, U+2191, U+2193, U+2212, U+2215, U+FEFF, U+FFFD",
  },
];
Code language: JavaScript (javascript)

Then, we can iterate over our alphabets and create subsets. This script will create new subsets in a subsets subdirectory. You’ll likely need to tweak it to make it work with your font files.

import { spawn } from "child_process";
import path from "path";
import { renameSync } from "fs";
import { fileURLToPath } from "url";

// Set up dirname to make it easier to place our files
// @see https://leveluptutorials.com/posts/how-to-use-__dirname-with-esm
const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);

for (const alphabet of alphabets) {
  // We want to wait for each subset to finish before starting
  // the next one (this is required for some file renaming
  // done below)
  await new Promise((resolve) => {
    const glyphhangerProcess = spawn(
      "glyphhanger",
      [
        // Spaces in the whitelist breaks the CLI script
        `--whitelist=${alphabet.unicodeRange.replaceAll(" ", "")}`,
        `--subset=Inter-var.woff2`,
        "--output=subsets",
        "--formats=woff2",
      ],
      { stdio: "inherit" }
    );

    glyphhangerProcess.on("close", () => {
      // I couldn't figure out a way to change the output
      // file name in glyphhanger. Instead we manually change
      // it after the file is generated
      const oldPath = path.join(
        __dirname,
        `subsets/Inter-${font.name}-subset.woff2`
      );
      const newPath = path.join(
        __dirname,
        `subsets/Inter-${font.name}-${subset.name}.woff2`
      );
      renameSync(oldPath, newPath);

      resolve();
    });
  });
}
Code language: JavaScript (javascript)

Finally, we can create a CSS file that loads the subsets with the correct unicode ranges. Again, you’ll probably want to tweak this.

import path from "path";
import { writeFile } from "fs";
import { fileURLToPath } from "url";

const cssCode = "";

// Set up dirname to make it easier to place our files
// @see https://leveluptutorials.com/posts/how-to-use-__dirname-with-esm
const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);

// Create our font face rules
// This would need to be modified for other weights and styles
alphabets.forEach((alphabet) => {
  cssCode += `
    @font-face {
      font-display: swap;
      font-family: 'Inter';
      font-style: normal;
      font-weight: 400;
      src: url('subsets/Inter-var-${alphabet.name}.woff2') format('woff2');
      unicode-range: ${alphabet.unicodeRange};
    }
  `;
});

// Determine where to save our file
const cssPath = path.join(__dirname, "inter.css");

// Write our CSS file
writeFile(
  cssPath,
  `
/* Stop: this is a generated file. Do not edit directly. */
${cssCode}
  `.trim(),
  {},
  () => {
    console.log("CSS file written to:", cssPath);
  }
);
Code language: JavaScript (javascript)

As you can see, this can be trickier than creating a single subset, but it allows us to serve the entire font, support dynamic content, and only have site visitors download the parts they need.

Subsetting can have a big impact on font loading performance, but the ideal strategy varies depending on the type of site you’re building.

Subsetting is just one technique to improve the performance of web fonts. I highly recommend Zach Leatherman’s definitive article on the subject if you’re looking for other ways to improve your font loading performance.

Footnotes

  1. Glyphhanger requires some Python dependencies and can be a little tricky to install and get running. I recommend this excellent article by Sara Soueidan if you get stuck.  Return to the text before footnote 1
  2. If you add a subsetting build step, you’ll probably only want to run it for production builds, and serve the whole font file locally during development.  Return to the text before footnote 2
  3. The Vietnamese alphabet is really cool! According to wikipedia it “produces words that have no silent letters, with letters and words consistent in how they are read and spoken, with rare exceptions. The elaborate use of diacritics produces a highly accurate sound transcription for tonal languages.”  Return to the text before footnote 3
  4. unicode.org splits the unicode range into a number of alphabet-based ranges. unicode-table.com splits the alphabets slightly differently and has a nice scrolling visualizer to help understand the ranges. JRX provides another option for splitting up the alphabets.  Return to the text before footnote 4
Copyrights

We respect the property rights of others and are always careful not to infringe on their rights, so authors and publishing houses have the right to demand that an article or book download link be removed from the site. If you find an article or book of yours and do not agree to the posting of a download link, or you have a suggestion or complaint, write to us through the Contact Us, or by email at: support@freewsad.com.

More About us