Enabling Unicode Support with ICU for Android and iOS

February 28, 2025

WHAT is Unicode and WHY does it matter?

Unicode

Unicode is a universal character encoding standard for encoding, representing, and processing text consistently across different languages, platforms, and systems. It is the common language of the modern web and can represent text in all environments.
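To make "encoding" concrete, here is a minimal sketch of how a single Unicode code point turns into UTF-8 bytes. The function name encode_utf8 is my own and not part of any library; ICU has its own, far more complete, conversion routines.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode code point as UTF-8 (1 to 4 bytes).
// Returns an empty string for values that are not valid scalar values:
// surrogates (U+D800..U+DFFF) or anything above U+10FFFF.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
        return out;
    }
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    // Every valid scalar value encodes to between 1 and 4 bytes.
    assert(!out.empty() && out.size() <= 4);
    return out;
}
```

For example, encode_utf8(0x20AC) (the euro sign) yields the three bytes E2 82 AC, while a plain ASCII letter stays a single byte.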
Shout out to the many YouTube channels that provide concise explanations of how text in computer systems evolved across the globe. Watching a few of them is a good way to understand the What/How/When of Unicode before reading further in this blog.

ICU

If you have an idea of what Unicode is, then you understand how hard it is to parse and process text that contains Unicode. To do this across many languages and operating systems, one would have to write code that maintains backward compatibility with different encodings. To make our lives easier, IBM introduced a library project called ICU (International Components for Unicode) so that developers can build better software applications while dealing with all kinds of text. The project later moved under the Unicode Consortium, the organization responsible for developing and maintaining the Unicode standard.

The library comes in two variants: ICU4C for C/C++ and ICU4J for Java, each containing the APIs for working with ICU in its language.


Challenges in Unicode Support on Mobile

Existing Gaps in Unicode Support Across Android, iOS, and ICU

Android supports Unicode by exposing a subset of ICU4C APIs via the Android NDK and ICU4J via the Android Java APIs. They also use CLDR for data size reduction. As most of the data in ICU is strings, the converted data uses shorter keys and smaller values for exchange. One can find references to the inclusion of ICU4C APIs in the Android NDK here; a lot of headers are missing.
iOS supports Unicode by exposing a subset of ICU4C APIs via the FoundationICU Swift package, which in turn is extracted from Apple’s version of OSS Distribution’s ICU. It is essentially a stripped-down version of ICU that is missing a lot of properties and APIs.
However, neither platform's documentation mentions which APIs are missing or broken.

Running Language Models on Mobile

Running language models from online providers such as HuggingFace on mobile is a tedious task. Computer Vision PyTorch models can be converted to run on mobile/edge devices, but doing the same for NLP models is a struggle. The initial tokenization step is a barrier to reproducing a language model on mobile phones, because the tokenizers built for training are written in Python/Rust, which makes them hard to run there. We can benefit from a tokenization library that supports Unicode and can run operations such as NFC/NFD normalization, regex matching, etc.
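To see why normalization matters for tokenization, here is a small sketch using only the C++ standard library (no ICU). The constant and function names are my own. It shows that the same visible character can arrive as two different byte sequences, so a tokenizer that looks up raw bytes would treat them as different tokens unless the text is normalized first.

```cpp
#include <cassert>
#include <string>

// "é" as one precomposed code point, U+00E9 (what NFC produces), in UTF-8.
const std::string kComposed = "\xC3\xA9";

// "é" as 'e' (U+0065) plus COMBINING ACUTE ACCENT (U+0301),
// which is what NFD produces, in UTF-8.
const std::string kDecomposed = "e\xCC\x81";

// Byte-wise equality, which is all a vocabulary lookup on raw bytes sees.
bool same_bytes(const std::string& a, const std::string& b) {
    return a == b;
}

// Self-check of the claim: same visible text, different bytes.
void check_normalization_gap() {
    assert(kComposed.size() == 2);
    assert(kDecomposed.size() == 3);
    assert(!same_bytes(kComposed, kDecomposed));
}
```

A normalization step (NFC or NFD, applied consistently at training and inference time) is what maps both inputs to one canonical form, and that is the kind of operation ICU provides.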


Integrating ICU for Mobile

As mentioned earlier, ICU comes in two language variants: C/C++ and Java. To build a cross-platform mobile library on top of ICU, we can use ICU4C, the C/C++ variant. Since it's a C++ project, it has to be built as a static library for each architecture of each mobile platform. Android and iOS both run on multiple architectures, so here's a quick support matrix for the proposed project.

iOS Support Matrix

Platform       SDK              Architecture
ios            iphoneos         arm64
ios-simulator  iphonesimulator  arm64
ios-simulator  iphonesimulator  x86_64

Android Support Matrix

ABI          Triple                    Architecture
arm64-v8a    aarch64-linux-android     arm64
armeabi-v7a  armv7a-linux-androideabi  armv7
x86_64       x86_64-linux-android      x86_64
x86          i686-linux-android        x86

Building Cross-platform C++ ICU Library

For building icu4c, refer to this repository: icu4c-mobile. It contains a bash script for downloading a specific release version of ICU and building it for either Android or iOS.
Here’s what the functions for building ICU for Android and iOS look like:

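# ICU4C_FOLDER, THREAD_COUNT, ANDROID_NDK_ROOT and ANDROID_API are assumed
# to be set earlier in the script.
COMMON_CONFIGURE_ARGS="--enable-static --disable-shared --with-data-packaging=static"
BUILD_DIR=$(pwd)
INSTALL_DIR="$BUILD_DIR/final"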

build_host() {
    printf "Building ICU for host system\n"
    mkdir -p $BUILD_DIR/icu-host-build && cd $BUILD_DIR/icu-host-build
    ../$ICU4C_FOLDER/source/configure --prefix=$BUILD_DIR/icu-host $COMMON_CONFIGURE_ARGS \
        CFLAGS="-fPIC" \
        CXXFLAGS="-fPIC"
    make -j$THREAD_COUNT
    make install
    cd $BUILD_DIR
}

build_ios() {
    PLATFORM=$1
    SDK=$2
    SDK_MIN_VERSION=$3
    ARCH=$4
    
    BUILD_FOLDER="$BUILD_DIR/icu-$PLATFORM-$ARCH-build"
    
    printf "Building ICU for %s %s (SDK %s, min version %s)\n" "$PLATFORM" "$ARCH" "$SDK" "$SDK_MIN_VERSION"

    mkdir -p $BUILD_FOLDER && cd $BUILD_FOLDER

    export CC="$(xcrun --sdk $SDK --find clang)"
    export CXX="$(xcrun --sdk $SDK --find clang++)"
    export AR="$(xcrun --sdk $SDK --find ar)"
    export AS="$(xcrun --sdk $SDK --find as)"
    export LD="$(xcrun --sdk $SDK --find ld)"
    export RANLIB="$(xcrun --sdk $SDK --find ranlib)"
    export STRIP="$(xcrun --sdk $SDK --find strip)"

    CFLAGS="-arch $ARCH -fPIC -isysroot $(xcrun --sdk $SDK --show-sdk-path) -m$SDK-version-min=$SDK_MIN_VERSION -DTARGET_IOS=1"
    LDFLAGS="$CFLAGS"

    ../$ICU4C_FOLDER/source/configure \
        --host=$ARCH-apple-darwin \
        --with-cross-build=$BUILD_DIR/icu-host-build \
        --prefix=$INSTALL_DIR/$PLATFORM-$ARCH \
        $COMMON_CONFIGURE_ARGS \
        CFLAGS="$CFLAGS" \
        CXXFLAGS="$CFLAGS" \
        LDFLAGS="$LDFLAGS"
    make -j$THREAD_COUNT
    make install

    cd $BUILD_DIR
}

build_android() {
    ABI=$1
    TARGET=$2
    ARCH=$3
    
    # Prebuilt NDK toolchain for a macOS host; on Linux the directory is linux-x86_64.
    TOOLCHAIN="$ANDROID_NDK_ROOT/toolchains/llvm/prebuilt/darwin-x86_64"
    SYSROOT="$TOOLCHAIN/sysroot"
    BUILD_FOLDER="$BUILD_DIR/icu-android-$ARCH-build"

    printf "Building ICU for android $ABI $TARGET $ARCH\n"

    mkdir -p $BUILD_FOLDER && cd $BUILD_FOLDER

    export CC="$TOOLCHAIN/bin/$TARGET$ANDROID_API-clang"
    export CXX="$TOOLCHAIN/bin/$TARGET$ANDROID_API-clang++"
    export AR="$TOOLCHAIN/bin/llvm-ar"
    export AS="$TOOLCHAIN/bin/llvm-as"
    export LD="$TOOLCHAIN/bin/ld"
    export RANLIB="$TOOLCHAIN/bin/llvm-ranlib"
    export STRIP="$TOOLCHAIN/bin/llvm-strip"

    CFLAGS="--sysroot=$SYSROOT -fPIC -D__ANDROID_API__=$ANDROID_API"
    LDFLAGS="$CFLAGS"

    ../$ICU4C_FOLDER/source/configure \
        --host=$TARGET \
        --with-cross-build=$BUILD_DIR/icu-host-build \
        --prefix=$INSTALL_DIR/android-$ARCH \
        $COMMON_CONFIGURE_ARGS \
        CFLAGS="$CFLAGS" \
        CXXFLAGS="$CFLAGS" \
        LDFLAGS="$LDFLAGS"

    make -j$THREAD_COUNT
    make install

    cd $BUILD_DIR
}

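# You can build by calling something like the following. The host build must
# run first because the cross builds reuse its tools; 13.0 is only an example
# minimum iOS version.
# build_host
# build_ios ios iphoneos 13.0 arm64
# build_android arm64-v8a aarch64-linux-android arm64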

Running ICU on Android and iOS

To begin, here is a guide on how to add C/C++ to an Android project and a guide on how to mix C/C++ in an iOS project. To build an application with a C++ static library like the ICU we just built, we can follow the steps below.

C++ and CMake Configuration with Kotlin

For Android, we can start by creating a CMake configuration which specifies the header and library inclusion while building the Android project that uses NDK.

cmake_minimum_required(VERSION 3.22.1)

project("icu4cmobile")

set(ICU4C_ROOT_PATH ${CMAKE_SOURCE_DIR}/../../../../../final)

# Map the Android ABI to the install directory produced by the build script.
set(ICU_ARCH_DIR "")
if (CMAKE_ANDROID_ARCH_ABI STREQUAL "armeabi-v7a")
    set(ICU_ARCH_DIR "android-armv7")
elseif (CMAKE_ANDROID_ARCH_ABI STREQUAL "arm64-v8a")
    set(ICU_ARCH_DIR "android-arm64")
elseif (CMAKE_ANDROID_ARCH_ABI STREQUAL "x86")
    set(ICU_ARCH_DIR "android-x86")
elseif (CMAKE_ANDROID_ARCH_ABI STREQUAL "x86_64")
    set(ICU_ARCH_DIR "android-x86_64")
endif()

# Header and library locations for the ABI currently being compiled.
set(ICU_INCLUDE_PATH "${ICU4C_ROOT_PATH}/${ICU_ARCH_DIR}/include")
set(ICU_LIB_PATH "${ICU4C_ROOT_PATH}/${ICU_ARCH_DIR}/lib")
include_directories(${ICU_INCLUDE_PATH})

add_library(${CMAKE_PROJECT_NAME} SHARED
        icu4cmobile_jni.cpp)

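# Static libraries are listed so that each one comes before the libraries it
# depends on: i18n uses uc, and uc uses data.
target_link_libraries(${CMAKE_PROJECT_NAME}
        android
        log
        ${ICU_LIB_PATH}/libicui18n.a
        ${ICU_LIB_PATH}/libicuuc.a
        ${ICU_LIB_PATH}/libicudata.a)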

By default, Android will try to build for all the ABIs (armeabi-v7a, arm64-v8a, x86, and x86_64), so we need to specify the locations of the C++ headers and libraries based on the ABI for which the project is being compiled.

Android JNI

Depending on the operations we want to expose, we can write JNI code that exposes functions to be called from Kotlin/Java in our user-level activities.

#include <jni.h>
#include <unicode/unistr.h>
#include <unicode/uscript.h>

extern "C"
JNIEXPORT jstring JNICALL
Java_xyz_omkar_icu4cmobile_MainActivity_getScript(JNIEnv *env, jobject obj,
                                                  jstring input) {
    // Java strings are UTF-16, which is also what icu::UnicodeString stores,
    // so the code units can be copied over directly without going via UTF-32.
    const jchar *inputStr = env->GetStringChars(input, nullptr);
    if (inputStr == nullptr) {
        return env->NewStringUTF("Unknown");
    }
    jsize inputLength = env->GetStringLength(input);
    icu::UnicodeString unicodeStr(reinterpret_cast<const UChar *>(inputStr), inputLength);
    env->ReleaseStringChars(input, inputStr);

    // Look up the script of the first code point; char32At combines a
    // surrogate pair into one code point when needed.
    UErrorCode errorCode = U_ZERO_ERROR;
    UChar32 firstChar = unicodeStr.char32At(0);
    UScriptCode scriptCode = uscript_getScript(firstChar, &errorCode);
    if (U_FAILURE(errorCode)) {
        return env->NewStringUTF("Unknown");
    }
    const char* scriptName = uscript_getName(scriptCode);
    return env->NewStringUTF(scriptName);
}

We can load the library and declare the function in our MainActivity like this:

external fun getScript(input: String): String

companion object {
    init {
        System.loadLibrary("icu4cmobile")
    }
}
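The JNI function above only looks at the first code point of the input. Because Java strings are UTF-16, that first code point may span two 16-bit units (a surrogate pair). The following is a standalone sketch, using only the standard library, of the arithmetic involved. The name first_code_point and the choice to return U+FFFD for an empty string are my own assumptions and do not mirror ICU's behavior exactly.

```cpp
#include <cassert>
#include <string>

// Return the first code point of a UTF-16 string, combining a surrogate
// pair when one is present. Returns U+FFFD (replacement character) for an
// empty string. An unpaired surrogate is returned unchanged.
char32_t first_code_point(const std::u16string& s) {
    if (s.empty()) {
        return 0xFFFD;
    }
    const char32_t lead = s[0];
    const bool is_lead = lead >= 0xD800 && lead <= 0xDBFF;
    if (is_lead && s.size() > 1) {
        const char32_t trail = s[1];
        if (trail >= 0xDC00 && trail <= 0xDFFF) {
            // 10 bits come from the lead unit, 10 bits from the trail unit.
            const char32_t cp =
                0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00);
            // Surrogate pairs always land in the supplementary planes.
            assert(cp >= 0x10000 && cp <= 0x10FFFF);
            return cp;
        }
    }
    return lead;
}
```

For a string that starts with an emoji such as U+1F600, the two units D83D DE00 combine into the single code point 0x1F600, which is the value a script lookup should receive.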

Objective-C and Bridging Header

For iOS, the process is trickier, as explained in the documentation. One has to write Objective-C code to define the functions that will be called from Swift views. As on Android, we can use a helper header and source file that call ICU APIs to get the script of the text.

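// ICU4CHelper.h
#import <Foundation/Foundation.h>

@interface ICU4CHelper : NSObject
+ (NSString *)getScript:(NSString *)input;
@end

// Implementation file: it uses ICU's C++ API, so it must be compiled as
// Objective-C++ (for example, with a .mm extension).
#import "ICU4CHelper.h"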
#import <unicode/unistr.h>
#import <unicode/uscript.h>

@implementation ICU4CHelper
+ (NSString *)getScript:(NSString *)input {
    icu::UnicodeString unicodeStr = icu::UnicodeString::fromUTF8(input.UTF8String);
    UErrorCode errorCode = U_ZERO_ERROR;
    UChar32 firstChar = unicodeStr.char32At(0);
    UScriptCode scriptCode = uscript_getScript(firstChar, &errorCode);
    if (U_FAILURE(errorCode)) {
        return @"Unknown";
    }
    const char *scriptName = uscript_getName(scriptCode);
    return [NSString stringWithUTF8String:scriptName];
}
@end

One has to create a Bridging Header to work with Objective-C functions. It is a special header file that allows you to use Objective-C code in a Swift project. It acts as a bridge between Swift and Objective-C, enabling interoperability between the two languages.

iOS Bridging Header

Unlike Android projects, which use CMake, here we have to manually specify the paths to our built ICU headers and libraries. We can do that in the Xcode project settings as shown below.

[Figure: Xcode Search Paths. Update Header Search Paths and Library Search Paths in the Build Settings of the Xcode project.]
[Figure: Xcode Link Binaries. Include the ICU binaries to be linked in the Xcode project.]

Practical Use Cases and Next Steps

Running ICU on Android and iOS opens up many possibilities. We can do better internationalization and localization and perform operations, such as regex matching, that are missing from the ICU subsets the platforms expose.

Script Detection on Mobile

The repository contains an example Android and iOS application which does a simple task of Script Detection.

[Figure: ICU4C Android and ICU4C iOS screenshots. Script Detection on Android and iOS using ICU.]

Tokenization

Future use cases also include building a lightweight and fast tokenizer library for running HuggingFace language models on mobile. This will help achieve consistent inference results for LLMs.


Go check it out on GitHub. If you have an idea or a suggestion for improvement, feel free to contribute via Issues/Pull Requests!