Normalizers: Demystifying HuggingFace Tokenizers in C++

Jul 30, 2023

This blog post marks the beginning of a series that delves into the C++ details of HuggingFace Tokenizers. To begin with, the spotlight will be on the Normalizer, leaving other tokenizer components for subsequent entries in the series. We will explore how it works, its underlying mechanisms, and get some insight into handling special characters. The aim is to expose researchers and engineers who are already familiar with HF Tokenizers to its internals, but if you are not, here’s an interesting course with YouTube videos from the HuggingFace team itself!

Source Code: GitHub

Karpathy's Tweet: "Why re-implement in C++?"


Normalizer: The Secret Ingredient to Good Tokenization

When working with text data, one should ensure that it’s clean and consistent before starting the tokenization process. This is where the normalizer comes in. The normalizer helps you clean up the text by performing tasks like lowercasing, removing accents, and stripping whitespace.

But the normalizer does more than just clean up the text. It also helps to improve the performance and interpretability of machine learning models. For example, if the normalizer removes all accents from the text, then the machine learning model does not need to worry about distinguishing between different representations of the same character. This can help to improve the model’s accuracy.

Here’s a good refresher video from the HuggingFace Team on “What is Normalization?” with very easy-to-understand examples.


C++ Details

Overview

HuggingFace Tokenizers is Rust-first, with bindings for Python, Node, and Ruby. There are several GitHub issues asking how to use it from C++ or requesting C++ bindings, but I doubt that will ever see the light of day. The idea here is not to translate Rust to C++ but to build C++ normalizers from the ground up. The C++ version doesn’t have to mirror Rust’s exact constructs, but it should stick to the Rust behavior.

Using std::wstring in C++

Working with strings in tokenizers is a bit more complex than usual. We need support for characters from any Unicode character set, including non-ASCII characters, and the ability to process text from many languages without converting it to a different encoding. APIs for string operations can also differ across operating systems. Keeping these requirements in mind, std::wstring seems to be a good choice as it fulfills most of them. It also has advantages over std::string: the ability to represent characters more precisely and to store larger strings.
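
As a quick illustration, here’s a minimal sketch (assuming a platform where wchar_t is 32-bit, e.g. Linux, and an environment locale that can print wide output) of holding non-ASCII text in a std::wstring:

#include <iostream>
#include <locale>
#include <string>

int main() {
  std::wcout.imbue(std::locale(""));   // use the environment locale for wide output
  std::wstring text = L"Héllò wörld";  // non-ASCII characters stored directly
  // with a 32-bit wchar_t, each accented character is a single element
  std::wcout << text << L" -> " << text.size() << L" code points" << std::endl;
  return 0;
}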

Handling Unicode Representations

The International Components for Unicode (ICU) is a set of libraries and tools for handling text in multiple languages. It provides a wide range of features, including support for Unicode, normalization, collation, and date and time formatting. ICU also provides many C++ APIs that make it suitable for building normalizers in C++. For example, the icu::Normalizer::normalize() function can be used to normalize a string to a specific normalization form, e.g. NFC, NFD, NFKC, or NFKD.

Here’s an example of performing Unicode normalization using ICU:

#include <string>
#include <unicode/normlzr.h>  // icu::Normalizer
#include <unicode/unistr.h>   // icu::UnicodeString

std::wstring input = L"Héllò"; // input text (example with accented characters)
// convert std::wstring to icu::UnicodeString
// (assumes a 32-bit wchar_t, e.g. on Linux, so each wchar_t is one code point)
icu::UnicodeString uInput = icu::UnicodeString::fromUTF32(
    reinterpret_cast<const UChar32*>(input.c_str()), static_cast<int32_t>(input.length()));
UErrorCode status = U_ZERO_ERROR;
icu::UnicodeString uNormalizedInput;
// perform normalization and assign result to variable uNormalizedInput;
// mode can be UNORM_NFC, UNORM_NFD, UNORM_NFKC or UNORM_NFKD
UNormalizationMode mode = UNORM_NFD;
icu::Normalizer::normalize(uInput, mode, 0, uNormalizedInput, status);
// convert it back to std::wstring
std::wstring normalizedInput;
for (int32_t i = 0; i < uNormalizedInput.length(); ) {
  UChar32 c = uNormalizedInput.char32At(i);
  normalizedInput.push_back(static_cast<wchar_t>(c));
  i += U16_LENGTH(c);  // advance by one code point (1 or 2 UTF-16 code units)
}

Normalizers

Sequence

Allows defining multiple normalizers as a sequence that runs them in the given order.
Rust · C++
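
Conceptually, a sequence just runs each child normalizer over the same text, one after another. Here’s a minimal sketch of that idea; the NormalizerFn and SequenceSketch names are illustrative, while the actual code uses a Normalizer base class and a NormalizedString wrapper, as the end-to-end example below shows:

#include <functional>
#include <string>
#include <vector>

// each normalizer is modeled as a function mutating the text in place
using NormalizerFn = std::function<void(std::wstring&)>;

struct SequenceSketch {
  std::vector<NormalizerFn> steps;
  void normalize(std::wstring& text) const {
    for (const auto& step : steps)
      step(text);  // run each normalizer in the given order
  }
};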

Lowercase

Converts all the uppercase characters to lowercase.
This can be implemented easily using the std::towlower function (the wide-character counterpart of tolower).
Rust · C++
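
For illustration, here’s a minimal free-function sketch (the repo wraps this in a Lowercase class):

#include <algorithm>
#include <cwctype>
#include <string>

// lowercase every character of a wide string in place
void lowercaseSketch(std::wstring& text) {
  std::transform(text.begin(), text.end(), text.begin(),
                 [](wchar_t c) { return static_cast<wchar_t>(std::towlower(c)); });
}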

Strip

Removes whitespace from specific sides, i.e. left, right, or both.
Stripping whitespace can be done using the std::wstring::erase and std::iswspace functions.
Rust · C++
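
A minimal sketch of the same idea, using std::iswspace to find the boundaries and std::wstring::erase to drop them:

#include <cstddef>
#include <cwctype>
#include <string>

// strip whitespace from the left and/or right side of a wide string
void stripSketch(std::wstring& text, bool left, bool right) {
  if (left) {
    std::size_t i = 0;
    while (i < text.size() && std::iswspace(text[i])) ++i;
    text.erase(0, i);   // drop leading whitespace
  }
  if (right) {
    std::size_t j = text.size();
    while (j > 0 && std::iswspace(text[j - 1])) --j;
    text.erase(j);      // drop trailing whitespace
  }
}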

StripAccents

Removes all accent symbols in Unicode (to be used with NFD for consistency).
This one’s tricky and we need to use ICU’s getCombiningClass.
Rust · C++
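
A minimal sketch of the idea, assuming the text has already been NFD-decomposed and that wchar_t is 32-bit (e.g. Linux):

#include <string>
#include <unicode/uchar.h>  // u_getCombiningClass
#include <utility>

// combining accents have a non-zero canonical combining class, so drop them
void stripAccentsSketch(std::wstring& text) {
  std::wstring result;
  for (wchar_t c : text) {
    if (u_getCombiningClass(static_cast<UChar32>(c)) == 0)
      result.push_back(c);  // keep base characters, skip combining marks
  }
  text = std::move(result);
}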

Replace

Replaces all matching occurrences of a regex pattern with the new content.
Regex replacement is as straightforward as using the std::regex_replace function with a pattern and replacement text.
Rust · C++
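
A minimal sketch with std::wregex; the repo’s Replace class takes the pattern and replacement content as shown in the end-to-end example below:

#include <regex>
#include <string>

// replace every match of a wide-character regex pattern with the given content
void replaceSketch(std::wstring& text, const std::wstring& pattern,
                   const std::wstring& content) {
  std::wregex re(pattern);
  text = std::regex_replace(text, re, content);
}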

Prepend

Adds the given prepend content at the start of the string.
Just some straightforward C++ logic to iterate over the string and add a prefix.
Rust · C++
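
A minimal sketch following the description above (the prefix goes at the start of the string):

#include <string>

// prepend the given content at the start of a non-empty string
void prependSketch(std::wstring& text, const std::wstring& content) {
  if (!text.empty())
    text.insert(0, content);
}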

NFC, NFKC, NFD, NFKD, Nmt

Support for handling Unicode characters.
The ICU library does the heavy lifting here, as mentioned above in Handling Unicode Representations.
Rust · C++
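
For reuse, the ICU call from earlier can be wrapped in a small helper. This is only a sketch, and normalizeUnicodeSketch is an illustrative name, not the repo’s API:

#include <string>
#include <unicode/normlzr.h>  // icu::Normalizer
#include <unicode/unistr.h>   // icu::UnicodeString

// normalize a wide string to the requested form
// (mode: UNORM_NFC, UNORM_NFD, UNORM_NFKC or UNORM_NFKD)
std::wstring normalizeUnicodeSketch(const std::wstring& input, UNormalizationMode mode) {
  icu::UnicodeString uInput = icu::UnicodeString::fromUTF32(
      reinterpret_cast<const UChar32*>(input.c_str()), static_cast<int32_t>(input.length()));
  UErrorCode status = U_ZERO_ERROR;
  icu::UnicodeString uOut;
  icu::Normalizer::normalize(uInput, mode, 0, uOut, status);
  std::wstring out;
  for (int32_t i = 0; i < uOut.length(); ) {
    UChar32 c = uOut.char32At(i);
    out.push_back(static_cast<wchar_t>(c));
    i += U16_LENGTH(c);  // advance by one code point (1 or 2 UTF-16 units)
  }
  return out;
}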

Bert

Supports normalization for BERT-like models: lowercasing, punctuation handling, Unicode normalization, stripping accents, and special character handling.
This is a combination of the already implemented Lowercase and NFD normalizers, along with ICU's ICU_NON_SPACING_MARK constant and C++ std::transform for stripping accents and cleaning up text. Rust · C++

End-to-End C++ Example

Here’s an example of running the Sequence normalizer, which encompasses almost all of the normalizers:

// initialize individual normalizers
Lowercase lowercase;
Strip strip(true, true);
std::wstring replaceWhat = L"'";
std::wstring replaceWith = L"!";
Replace replace(replaceWhat, replaceWith);
NFD nfd;
StripAccents sa;
std::wstring prepender = L"_";
Prepend prepend(prepender);

// construct sequence normalizer
std::vector<Normalizer*> normalizers;
normalizers.push_back(&lowercase);
normalizers.push_back(&strip);
normalizers.push_back(&replace);
normalizers.push_back(&nfd);
normalizers.push_back(&sa);
normalizers.push_back(&prepend);
Sequence seq(normalizers);

// run normalization
std::wstring input = L"  Héllò 'user' hôw are ü  ";
NormalizedString normalized(input);
seq.normalize(normalized);
std::wcout << normalized.get() << std::endl; // _hello _!user! _how _are _u

The above C++ example can be easily understood using this flow diagram:

Normalizer Example Flow Diagram

Future Scope

The idea is to make this C++ version more modular and extensible, similar to the Rust/Python version. It should be easy to use this Normalizer component in an end-to-end tokenization pipeline and to create a custom normalization sequence.
There is always room for optimizing performance by refactoring, including caching, batching, and parallelization, but like many real-world projects, I see this evolving organically.

If you liked this blog post, go check it out on GitHub. If you have an idea or a suggestion for improvement, feel free to contribute via Issues/Pull Requests!