Get Started

The uchardet library is Mozilla's encoding detector library. It takes a sequence of bytes in an unknown character encoding, without any additional information, and attempts to determine the encoding of the text.

The uchardet package solves three types of tasks: detecting the encoding of character strings, of raw byte vectors, and of files.

The uchardet package includes demo files. You can get their paths with this command, where path is a file path relative to the examples directory (for instance "zh/big5.txt"):

system.file("examples", path, package = "uchardet")

Character encoding detection

To detect the encoding of strings, use the detect_str_enc function. It is vectorized and accepts a character vector. Missing values are skipped, and the names attribute is preserved.

A simple example, detection of ASCII characters:

detect_str_enc("Hello, useR!")
#> [1] "ASCII"

Strings in R can be marked with only three encodings: UTF-8, Latin1 and native. This means that we cannot read a file in WINDOWS-1252 encoding and then keep or print it in that same encoding; it will be converted into one of the basic encodings (usually ASCII or UTF-8).
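The encoding marks can be inspected in base R. The following is a minimal sketch that does not use uchardet at all; the variable names are illustrative only.

```r
# A string written with a \u escape is marked as UTF-8
x_utf8 <- "fa\u00e7ade"
Encoding(x_utf8)
#> [1] "UTF-8"

# Converting to Latin-1 changes both the bytes and the mark
x_latin1 <- iconv(x_utf8, from = "UTF-8", to = "latin1")
Encoding(x_latin1)
#> [1] "latin1"

# The same text occupies a different number of bytes in each encoding
nchar(c(x_utf8, x_latin1), type = "bytes")
#> [1] 7 6

# Pure ASCII strings are never marked
Encoding("facade")
#> [1] "unknown"
```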

Because of this limitation, to test detect_str_enc() we have to convert a UTF-8 string into another encoding and then pass the result to detect_str_enc().

# get file path
file <- system.file("examples", "zh/big5.txt", package = "uchardet")
# create the file connection with the encoding
con <- file(file, encoding = "BIG-5")
# read file into the working env
zh_utf8 <- paste(readLines(con, warn = FALSE), collapse = "\n")
# close connection
close(con)
# print content
print(zh_utf8)
#> [1] "繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文繁體中文"
# check the encoding of the created object
Encoding(zh_utf8)
#> [1] "UTF-8"
# detection result
detect_str_enc(zh_utf8)
#> [1] "UTF-8"

Detection of less common encodings:

# convert zh_utf8 from UTF-8 into unusual encodings
zh_big5 <- iconv(zh_utf8, "UTF-8", "BIG-5")
print(zh_big5)
#> [1] "\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5\xc1c\xc5餤\xa4\xe5"

zh_gb <- iconv(zh_utf8, "UTF-8", "GB18030")
print(zh_gb)
#> [1] "\xb7\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xceķ\xb1\xf3w\xd6\xd0\xce\xc4"

# detect encoding
detect_str_enc(c(zh_utf8, zh_big5, zh_gb))
#> [1] "UTF-8"   "BIG5"    "GB18030"

The base Encoding() function reports the converted strings as unknown:

Encoding(c(zh_utf8, zh_big5, zh_gb))
#> [1] "UTF-8"   "unknown" "unknown"

Raw bytes encoding detection

Sometimes a file can't be read as a string, for example when it contains an embedded nul (\000). In such cases, read the file as a raw byte vector and detect the encoding with the detect_raw_enc function.

Let's define a function that reads the demo files as raw byte vectors:
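To see why the string route fails, here is a small base R sketch. The byte values are an assumed example (the text "Hi" encoded as UTF-16LE, where every second byte is a nul), not one of the package's demo files.

```r
# "Hi" in UTF-16LE: each character is followed by a nul byte
bytes <- as.raw(c(0x48, 0x00, 0x69, 0x00))

# rawToChar() refuses raw vectors that contain nul bytes
res <- tryCatch(rawToChar(bytes), error = function(e) e)
inherits(res, "error")
#> [1] TRUE

# the raw vector itself is unaffected and can be inspected byte by byte
sum(bytes == as.raw(0))
#> [1] 2
```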

read_bin <- function(path) {
  # get file path
  file <- system.file("examples", path, package = "uchardet")
  # read file to raw vector
  readBin(file, raw(), file.size(file))
}

# print first 5 bytes
read_bin("de/iso-8859-1.txt")[1:5]
#> [1] 49 53 4f 20 38

Detection of ASCII characters from a byte vector:

detect_raw_enc(charToRaw("Hello, useR!"))
#> [1] "ASCII"

Detection of less common encodings (each file has its own encoding):

detect_raw_enc(read_bin("de/iso-8859-1.txt"))
#> [1] "ISO-8859-1"
detect_raw_enc(read_bin("de/windows-1252.txt"))
#> [1] "WINDOWS-1252"
detect_raw_enc(read_bin("fr/utf-16.be"))
#> [1] "UTF-16"
detect_raw_enc(read_bin("zh/big5.txt"))
#> [1] "BIG5"

detect_raw_enc can also be used with curl:

library(curl)
# fetch page content as raw vector
cnt <- curl_fetch_memory("http://www.ppomppu.co.kr")$content
# detect page encoding
enc <- detect_raw_enc(cnt)
# convert encoding to UTF-8
cnt <- iconv(readBin(cnt, character()), enc, "UTF-8")

We used readBin(cnt, character()) instead of rawToChar(cnt) because the rawToChar() function may raise an error with some encodings.
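As an alternative sketch, base R's iconv() also accepts a list of raw vectors, so the bytes can be converted without building an intermediate string. The helper name raw_to_utf8 is hypothetical and not part of uchardet, and the byte values are hand-written test data.

```r
# hypothetical helper: convert a raw vector from a known encoding to UTF-8
raw_to_utf8 <- function(bytes, enc) {
  stopifnot(is.raw(bytes), is.character(enc), length(enc) == 1L)
  # iconv() treats each raw element of a list as one encoded string
  iconv(list(bytes), from = enc, to = "UTF-8")
}

# "café" in ISO-8859-1 (0xe9 is "é")
raw_to_utf8(as.raw(c(0x63, 0x61, 0x66, 0xe9)), "ISO-8859-1")
#> [1] "café"

# "Hi" in UTF-16LE, which contains nul bytes
raw_to_utf8(as.raw(c(0x48, 0x00, 0x69, 0x00)), "UTF-16LE")
#> [1] "Hi"
```

In the curl example, enc would be the value returned by detect_raw_enc(cnt).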

Files encoding detection

The detect_file_enc function detects the encoding of files without importing them into the working environment. It uses a sliding window 65536 bytes wide, so there is no need to load the entire file.
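The idea of reading a file window by window can be pictured in plain R. This is only an illustration with consecutive fixed-size chunks; it is not the package's implementation, and process_in_chunks is a hypothetical helper name.

```r
# hypothetical helper: apply `fun` to successive chunks of a file
process_in_chunks <- function(path, fun, chunk_size = 65536L) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  out <- list()
  repeat {
    chunk <- readBin(con, what = raw(), n = chunk_size)
    if (length(chunk) == 0L) break  # end of file reached
    out[[length(out) + 1L]] <- fun(chunk)
  }
  out
}

# demo on a temporary 150000-byte file: only one chunk is held at a time
tmp <- tempfile()
writeBin(as.raw(rep(0x41, 150000)), tmp)
chunk_sizes <- unlist(process_in_chunks(tmp, length))
chunk_sizes
#> [1] 65536 65536 18928
unlink(tmp)
```

Memory use is bounded by chunk_size rather than by the file size, which is the property the paragraph above describes for detect_file_enc.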

detect_file_enc is vectorized and accepts a character vector of file paths. Non-existent files are skipped, and the file names are attached to the output vector as its names attribute.

# paths to examples files
ex_path <- system.file("examples", package = "uchardet")
ex_files <- Sys.glob(file.path(ex_path, "*", "*"))
# detect encoding
res <- detect_file_enc(ex_files)

Let's compare the results with the files' original encodings:

# regex pattern
pattern <- ".*/examples/((.*)/(.*)\\.(?:.*))$"
proto <- list(file = character(1L), lang = character(1L), original = character(1L))
cmp <- strcapture(pattern, ex_files, proto)
cmp$lang <- toupper(cmp$lang)
cmp$original <- toupper(cmp$original)
cmp$uchardet <- res
head(cmp, n = 15)
#>                        file lang          original     uchardet
#> 1         ar/iso-8859-6.txt   AR        ISO-8859-6   ISO-8859-6
#> 2              ar/utf-8.txt   AR             UTF-8        UTF-8
#> 3       ar/windows-1256.txt   AR      WINDOWS-1256 WINDOWS-1256
#> 4       bg/windows-1251.txt   BG      WINDOWS-1251 WINDOWS-1251
#> 5             cs/ibm852.txt   CS            IBM852        UTF-8
#> 6         cs/iso-8859-2.txt   CS        ISO-8859-2   ISO-8859-2
#> 7  cs/mac-centraleurope.txt   CS MAC-CENTRALEUROPE        UTF-8
#> 8              cs/utf-8.txt   CS             UTF-8        UTF-8
#> 9       cs/windows-1250.txt   CS      WINDOWS-1250   ISO-8859-2
#> 10        da/iso-8859-1.txt   DA        ISO-8859-1  ISO-8859-15
#> 11       da/iso-8859-15.txt   DA       ISO-8859-15  ISO-8859-15
#> 12             da/utf-8.txt   DA             UTF-8        UTF-8
#> 13      da/windows-1252.txt   DA      WINDOWS-1252 WINDOWS-1252
#> 14        de/iso-8859-1.txt   DE        ISO-8859-1   ISO-8859-1
#> 15      de/windows-1252.txt   DE      WINDOWS-1252 WINDOWS-1252

C++ API

You can also call the uchardet package functions from C++ via Rcpp.

// [[Rcpp::depends(uchardet)]]

#include <Rcpp.h>
#include <uchardet.h>

using namespace Rcpp;

// [[Rcpp::export]]
StringVector test_string(StringVector x) {
  return uchardet::detect_str_enc(x);
}

// [[Rcpp::export]]
StringVector test_file(StringVector x) {
  return uchardet::detect_file_enc(x);
}

// [[Rcpp::export]]
StringVector test_raw(RawVector x) {
  return uchardet::detect_raw_enc(x);
}