Unicode is a universal character encoding standard that lets text be encoded, represented, and processed consistently across different languages, platforms, and systems. It underpins text on the modern web and can represent text across all environments.
Shout out to the many YouTube channels that provide concise explanations of how text in computer systems evolved across the globe. Here are a few of my recommendations for understanding the what, how, and when of Unicode before reading further in this blog:
If you have an idea of what Unicode is, then you also know how hard it is to parse and understand text that contains it. Doing this across many languages and operating systems means writing code that stays backward compatible with different encodings. To make our lives easier, IBM introduced a library project called ICU (International Components for Unicode) so that developers can build better software applications while dealing with all kinds of text. The project later moved under the Unicode Consortium, the organization responsible for developing and maintaining the Unicode standard.
The library comes in two variants, ICU4C for C/C++ and ICU4J for Java, each containing APIs for working with ICU in its language.
Android supports Unicode by exposing a subset of APIs from ICU4C via the Android NDK and from ICU4J via the Android Java APIs. They also use CLDR to reduce data size: since most of the data in ICU is strings, the converted data uses shorter keys and smaller values for exchange. References to the ICU4C APIs included in the Android NDK can be found here, though many headers are missing.
iOS supports Unicode by exposing a subset of ICU4C APIs via the FoundationICU Swift package, which in turn is extracted from Apple’s version of OSS Distribution’s ICU. It is essentially a stripped-down version of ICU that is missing many properties and APIs.
However, neither platform’s documentation mentions which APIs are missing or broken.
Running language models from online providers such as HuggingFace on mobile is a tedious task. Computer Vision PyTorch models can be converted for mobile/edge devices, but doing the same for NLP models is a struggle. The initial tokenization step is a barrier to reproducing a language model on mobile phones, because the tokenizers built for training are written in Python/Rust, which makes them hard to run there. We can benefit from a tokenization library that supports Unicode and can run operations such as NFC/NFD normalization, regex matching, etc.
As mentioned earlier, ICU comes in two language variants, C/C++ and Java. To build a cross-platform mobile library on top of ICU, we can use icu4c, the C/C++ variant. Since it’s a C++ project, here’s a quick guide on how to build it as a static library for the different architectures of mobile platforms like Android and iOS. Since Android and iOS run on multiple architectures, here’s a quick support matrix for the proposed project.
Platform | SDK | Architecture |
---|---|---|
ios | iphoneos | arm64 |
ios-simulator | iphonesimulator | arm64 |
ios-simulator | iphonesimulator | x86_64 |
ABI | Triple | Architecture |
---|---|---|
arm64-v8a | aarch64-linux-android | arm64 |
armeabi-v7a | armv7a-linux-androideabi | armv7 |
x86_64 | x86_64-linux-android | x86_64 |
x86 | i686-linux-android | x86 |
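The Android rows of the matrix can be captured as a small lookup, which keeps the ABI, triple, and architecture names in one place. This is only a sketch: `android_target_for_abi` is a hypothetical helper name and is not part of the repository's script.

```shell
# Hypothetical helper (not from the original script): print "<triple> <arch>"
# for an Android ABI, mirroring the Android rows of the support matrix.
android_target_for_abi() {
  case "$1" in
    arm64-v8a)   echo "aarch64-linux-android arm64" ;;
    armeabi-v7a) echo "armv7a-linux-androideabi armv7" ;;
    x86_64)      echo "x86_64-linux-android x86_64" ;;
    x86)         echo "i686-linux-android x86" ;;
    *)           echo "unsupported ABI: $1" >&2; return 1 ;;
  esac
}

# Example lookup; prints "aarch64-linux-android arm64".
android_target_for_abi arm64-v8a
```

The two printed fields are the triple and the architecture, in that order; an unknown ABI produces an error message and a non-zero status.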
For building icu4c, refer to this repository: icu4c-mobile. It contains a bash script that downloads a specific release version of ICU and builds it for either Android or iOS.
Here’s what the functions for building ICU for Android and iOS look like:
COMMON_CONFIGURE_ARGS="--enable-static --disable-shared --with-data-packaging=static"
BUILD_DIR=$(pwd)
INSTALL_DIR="$BUILD_DIR/final"
build_host() {
printf "Building ICU for host system\n"
mkdir -p $BUILD_DIR/icu-host-build && cd $BUILD_DIR/icu-host-build
../$ICU4C_FOLDER/source/configure --prefix=$BUILD_DIR/icu-host $COMMON_CONFIGURE_ARGS \
CFLAGS="-fPIC" \
CXXFLAGS="-fPIC"
make -j$THREAD_COUNT
make install
cd $BUILD_DIR
}
build_ios() {
PLATFORM=$1
SDK=$2
SDK_MIN_VERSION=$3
ARCH=$4
BUILD_FOLDER="$BUILD_DIR/icu-$PLATFORM-$ARCH-build"
printf "Building ICU for $PLATFORM $ARCH $SDK ($SDK_MIN_VERSION)\n"
mkdir -p $BUILD_FOLDER && cd $BUILD_FOLDER
export CC="$(xcrun --sdk $SDK --find clang)"
export CXX="$(xcrun --sdk $SDK --find clang++)"
export AR="$(xcrun --sdk $SDK --find ar)"
export AS="$(xcrun --sdk $SDK --find as)"
export LD="$(xcrun --sdk $SDK --find ld)"
export RANLIB="$(xcrun --sdk $SDK --find ranlib)"
export STRIP="$(xcrun --sdk $SDK --find strip)"
CFLAGS="-arch $ARCH -fPIC -isysroot $(xcrun --sdk $SDK --show-sdk-path) -m$SDK-version-min=$SDK_MIN_VERSION -DTARGET_IOS=1"
LDFLAGS="$CFLAGS"
../$ICU4C_FOLDER/source/configure \
--host=$ARCH-apple-darwin \
--with-cross-build=$BUILD_DIR/icu-host-build \
--prefix=$INSTALL_DIR/$PLATFORM-$ARCH \
$COMMON_CONFIGURE_ARGS \
CFLAGS="$CFLAGS" \
CXXFLAGS="$CFLAGS" \
LDFLAGS="$LDFLAGS"
make -j$THREAD_COUNT
make install
cd $BUILD_DIR
}
build_android() {
ABI=$1
TARGET=$2
ARCH=$3
TOOLCHAIN="$ANDROID_NDK_ROOT/toolchains/llvm/prebuilt/darwin-x86_64"
SYSROOT="$TOOLCHAIN/sysroot"
BUILD_FOLDER="$BUILD_DIR/icu-android-$ARCH-build"
printf "Building ICU for android $ABI $TARGET $ARCH\n"
mkdir -p $BUILD_FOLDER && cd $BUILD_FOLDER
export CC="$TOOLCHAIN/bin/$TARGET$ANDROID_API-clang"
export CXX="$TOOLCHAIN/bin/$TARGET$ANDROID_API-clang++"
export AR="$TOOLCHAIN/bin/llvm-ar"
export AS="$TOOLCHAIN/bin/llvm-as"
export LD="$TOOLCHAIN/bin/ld"
export RANLIB="$TOOLCHAIN/bin/llvm-ranlib"
export STRIP="$TOOLCHAIN/bin/llvm-strip"
CFLAGS="--sysroot=$SYSROOT -fPIC -D__ANDROID_API__=$ANDROID_API"
LDFLAGS="$CFLAGS"
../$ICU4C_FOLDER/source/configure \
--host=$TARGET \
--with-cross-build=$BUILD_DIR/icu-host-build \
--prefix=$INSTALL_DIR/android-$ARCH \
$COMMON_CONFIGURE_ARGS \
CFLAGS="$CFLAGS" \
CXXFLAGS="$CFLAGS" \
LDFLAGS="$LDFLAGS"
make -j$THREAD_COUNT
make install
cd $BUILD_DIR
}
# Build the host tools first (the cross builds rely on them), then the targets:
# build_host
# build_ios ios iphoneos <min-ios-version> arm64
# build_android arm64-v8a aarch64-linux-android arm64
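To cover the whole support matrix, the functions can be chained in one driver. The sketch below assumes `build_host`, `build_ios`, and `build_android` are already defined in the same shell, along with the variables they read (`ICU4C_FOLDER`, `THREAD_COUNT`, `ANDROID_NDK_ROOT`, `ANDROID_API`), which are set outside the excerpt above. `build_all` and its minimum-iOS-version argument are hypothetical names; this post does not state which minimum version to use.

```shell
# Sketch of a driver (hypothetical, not from the original script).
# Assumes build_host, build_ios and build_android are defined beforehand.
build_all() {
  local ios_min_version="$1"
  if [ -z "$ios_min_version" ]; then
    echo "usage: build_all <min-ios-version>" >&2
    return 1
  fi
  # Host build first: every cross build passes --with-cross-build pointing at it.
  build_host || return 1
  # iOS rows of the matrix: platform, SDK, minimum version, architecture.
  build_ios ios iphoneos "$ios_min_version" arm64 || return 1
  build_ios ios-simulator iphonesimulator "$ios_min_version" arm64 || return 1
  build_ios ios-simulator iphonesimulator "$ios_min_version" x86_64 || return 1
  # Android rows of the matrix: ABI, target triple, architecture.
  build_android arm64-v8a aarch64-linux-android arm64 || return 1
  build_android armeabi-v7a armv7a-linux-androideabi armv7 || return 1
  build_android x86_64 x86_64-linux-android x86_64 || return 1
  build_android x86 i686-linux-android x86 || return 1
}
```

The driver only stops early when a build function returns a non-zero status. The functions shown above end with a `cd`, so a failed `make` would not be caught unless they are extended to check it.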
To begin, here is a guide on how to add C/C++ to an Android project and a guide on how to mix C/C++ in an iOS project. To build an application with a C++ static library like the ICU we just built, we can follow the steps below.
For Android, we can start by creating a CMake configuration that specifies the headers and libraries to include when building an Android project that uses the NDK.
cmake_minimum_required(VERSION 3.22.1)
project("icu4cmobile")
set(ICU4C_ROOT_PATH ${CMAKE_SOURCE_DIR}/../../../../../final)
set(ICU_INCLUDE_PATH "")
if (${CMAKE_ANDROID_ARCH_ABI} STREQUAL "armeabi-v7a")
set(ICU_INCLUDE_PATH "${ICU4C_ROOT_PATH}/android-armv7/include")
elseif (${CMAKE_ANDROID_ARCH_ABI} STREQUAL "arm64-v8a")
set(ICU_INCLUDE_PATH "${ICU4C_ROOT_PATH}/android-arm64/include")
elseif (${CMAKE_ANDROID_ARCH_ABI} STREQUAL "x86")
set(ICU_INCLUDE_PATH "${ICU4C_ROOT_PATH}/android-x86/include")
elseif (${CMAKE_ANDROID_ARCH_ABI} STREQUAL "x86_64")
set(ICU_INCLUDE_PATH "${ICU4C_ROOT_PATH}/android-x86_64/include")
endif()
include_directories(${ICU_INCLUDE_PATH})
set(ICU_LIB_PATH "")
if (${CMAKE_ANDROID_ARCH_ABI} STREQUAL "armeabi-v7a")
set(ICU_LIB_PATH "${ICU4C_ROOT_PATH}/android-armv7/lib")
elseif (${CMAKE_ANDROID_ARCH_ABI} STREQUAL "arm64-v8a")
set(ICU_LIB_PATH "${ICU4C_ROOT_PATH}/android-arm64/lib")
elseif (${CMAKE_ANDROID_ARCH_ABI} STREQUAL "x86")
set(ICU_LIB_PATH "${ICU4C_ROOT_PATH}/android-x86/lib")
elseif (${CMAKE_ANDROID_ARCH_ABI} STREQUAL "x86_64")
set(ICU_LIB_PATH "${ICU4C_ROOT_PATH}/android-x86_64/lib")
endif()
add_library(${CMAKE_PROJECT_NAME} SHARED
icu4cmobile_jni.cpp)
target_link_libraries(${CMAKE_PROJECT_NAME}
android
log
# Dependents first: icui18n uses icuuc, which uses icudata
${ICU_LIB_PATH}/libicui18n.a
${ICU_LIB_PATH}/libicuuc.a
${ICU_LIB_PATH}/libicudata.a)
As you can see, Android by default tries to build for all the ABIs, i.e. armeabi-v7a, arm64-v8a, x86, x86_64, so we need to specify the locations of the C++ includes and libraries based on the ABI for which the project is being compiled.
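Since the CMake configuration links the archives by absolute path, a missing build for one ABI only surfaces at link time. A small pre-flight check can catch that earlier. This is a sketch under the assumption that each install directory follows the `final/<platform>-<arch>/lib` layout used above; `check_icu_libs` is a hypothetical helper name.

```shell
# Hypothetical pre-flight check (not from the original repository): confirm that
# an install directory contains the three static libraries the CMake file links.
check_icu_libs() {
  local dir="$1"
  local status=0
  local lib
  for lib in libicui18n.a libicuuc.a libicudata.a; do
    if [ ! -f "$dir/lib/$lib" ]; then
      echo "missing: $dir/lib/$lib" >&2
      status=1
    fi
  done
  return "$status"
}
```

For example, `check_icu_libs final/android-arm64` returns a non-zero status and lists each missing archive if the arm64 build has not been installed yet.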
Now, depending on the operations we want to expose, we can write JNI code that exposes functions callable from Kotlin/Java in our user-level activities.
#include <jni.h>
#include <unicode/unistr.h>
#include <unicode/uscript.h>
extern "C"
JNIEXPORT jstring JNICALL
Java_xyz_omkar_icu4cmobile_MainActivity_getScript(JNIEnv *env, jobject obj,
                                                  jstring input) {
    // Java strings are UTF-16, which is also ICU's internal encoding,
    // so the code units can be copied straight into a UnicodeString.
    const jchar *inputStr = env->GetStringChars(input, nullptr);
    if (inputStr == nullptr) {
        return nullptr; // an OutOfMemoryError is already pending
    }
    jsize inputLength = env->GetStringLength(input);
    icu::UnicodeString unicodeStr(
            reinterpret_cast<const UChar *>(inputStr), inputLength);
    env->ReleaseStringChars(input, inputStr);
    // Look up the script of the first code point (surrogate pairs included).
    UErrorCode errorCode = U_ZERO_ERROR;
    UChar32 firstChar = unicodeStr.char32At(0);
    UScriptCode scriptCode = uscript_getScript(firstChar, &errorCode);
    if (U_FAILURE(errorCode)) {
        return env->NewStringUTF("Unknown");
    }
    const char *scriptName = uscript_getName(scriptCode);
    return env->NewStringUTF(scriptName);
}
We can then load the library and declare the function in our Main Activity like:
external fun getScript(input: String): String
companion object {
    init {
        System.loadLibrary("icu4cmobile")
    }
}
For iOS, the process is trickier, as explained in the documentation. One has to write Objective-C code that defines the functions to be called from Swift Views. As on Android, we can use a helper header and source file that call ICU APIs to get the script of the text. Because the implementation uses ICU’s C++ classes, it has to be compiled as Objective-C++ (a .mm file).
#import <Foundation/Foundation.h>
@interface ICU4CHelper : NSObject
+ (NSString *)getScript:(NSString *)input;
@end
#import "ICU4CHelper.h"
#import <unicode/unistr.h>
#import <unicode/uscript.h>
@implementation ICU4CHelper
+ (NSString *)getScript:(NSString *)input {
icu::UnicodeString unicodeStr = icu::UnicodeString::fromUTF8(input.UTF8String);
UErrorCode errorCode = U_ZERO_ERROR;
UChar32 firstChar = unicodeStr.char32At(0);
UScriptCode scriptCode = uscript_getScript(firstChar, &errorCode);
if (U_FAILURE(errorCode)) {
return @"Unknown";
}
const char *scriptName = uscript_getName(scriptCode);
return [NSString stringWithUTF8String:scriptName];
}
@end
One has to create a Bridging Header to work with Objective-C functions. It is a special header file that allows you to use Objective-C code in a Swift project. It acts as a bridge between Swift and Objective-C, enabling interoperability between the two languages.
Unlike with CMake for Android projects, we have to manually specify the paths to our built ICU’s headers and libraries. We can do that in the Xcode project settings as shown below.
There are endless possibilities now that we can run ICU on Android and iOS. We can do better internationalization and localization, and perform operations such as regex matching that are missing from the ICU APIs the platforms expose.
The repository contains an example Android and iOS application that performs a simple Script Detection task.
Future use cases also include building a lightweight and fast tokenizer library for running HuggingFace language models on mobile. This will help achieve consistent inference results for LLMs.
Go check it out on GitHub. If you have an idea or a suggestion for improvement, feel free to contribute via Issues/Pull Requests!