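This package applies WordPiece tokenization to input text, given an appropriate WordPiece vocabulary. Currently, the BERT tokenization conventions are used. The basic tokenization algorithm is:

1. Put spaces around punctuation.
2. For each resulting word, if the word is in the WordPiece vocabulary, keep it as is. If not, starting from the beginning, pull off the longest piece that is in the vocabulary, and prefix "##" to the remaining piece. Repeat until the whole word is represented by pieces from the vocabulary, if possible.
3. If the word cannot be represented by vocabulary pieces, replace it with a specified "unknown" token.
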
Ideally, a WordPiece vocabulary will be complete enough to represent any word, but this is not required.
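
To make the piece-by-piece matching concrete, here is a minimal sketch of greedy longest-match-first splitting for a single word, which is the convention used by BERT's reference tokenizer. The function `tokenize_word_sketch()` and the toy vocabulary are hypothetical and are not part of this package's API; the package's own implementation may differ in its details.

```r
# Hypothetical helper for illustration; not a function exported by wordpiece.
tokenize_word_sketch <- function(word, vocab, unk_token = "[UNK]") {
  pieces <- character(0)
  start <- 1L
  n <- nchar(word)
  while (start <= n) {
    end <- n
    found <- NA_character_
    # Try the longest remaining substring first, shrinking from the right.
    while (end >= start) {
      candidate <- substr(word, start, end)
      # Pieces that do not begin the word carry the "##" prefix.
      if (start > 1L) candidate <- paste0("##", candidate)
      if (candidate %in% names(vocab)) {
        found <- candidate
        break
      }
      end <- end - 1L
    }
    # No piece fits at this position: the whole word becomes the unknown token.
    if (is.na(found)) return(vocab[unk_token])
    pieces <- c(pieces, found)
    start <- end + 1L
  }
  vocab[pieces]
}

# Toy vocabulary (assumed for this sketch only).
toy_vocab <- c("[UNK]" = 0L, "app" = 1L, "##les" = 2L, "tacos" = 3L)

tokenize_word_sketch("apples", toy_vocab)  # splits into "app" and "##les"
tokenize_word_sketch("salsa", toy_vocab)   # falls back to "[UNK]"
```

A word that cannot be fully covered is replaced as a whole, rather than being partially split, which is why an incomplete vocabulary still yields a valid tokenization.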
The vocabulary is represented by the package as a named integer vector, with a logical attribute `is_cased` to indicate whether the vocabulary is case sensitive. The names are the actual tokens, and the integer values are the token indices (this would be the input to a BERT model, for example).
A vocabulary can be read from a text file containing a single token per line. The token index is taken to be the line number, starting from zero. These conventions are adopted for compatibility with the vocabulary and file format used in the pretrained BERT checkpoints released by Google Research. The casedness of the vocabulary is inferred from the content of the vocabulary.
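
The file conventions above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's loader: `read_vocab_sketch()` is a hypothetical name, and the casedness check (any token starting with an uppercase letter) is an assumed heuristic based on the behavior described in this README.

```r
# Hypothetical helper; the package's own loader is load_or_retrieve_vocab().
read_vocab_sketch <- function(path) {
  tokens <- readLines(path, encoding = "UTF-8")
  # The token index is the zero-based line number.
  vocab <- seq_along(tokens) - 1L
  names(vocab) <- tokens
  # Assumed heuristic: a token starting with an uppercase letter marks the
  # vocabulary as cased. Special tokens such as "[PAD]" start with "[".
  attr(vocab, "is_cased") <- any(grepl("^[[:upper:]]", tokens))
  vocab
}

# Write a tiny vocabulary file to a temporary location and read it back.
vocab_file <- tempfile(fileext = ".txt")
writeLines(c("[PAD]", "[CLS]", "[SEP]", "!", "hi"), vocab_file)
toy_vocab <- read_vocab_sketch(vocab_file)

toy_vocab[["hi"]]            # 4: fifth line, counting from zero
attr(toy_vocab, "is_cased")  # FALSE: no token starts with an uppercase letter
```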
When a text vocabulary is loaded in an interactive R session, the option is given to cache the vocabulary as an RDS file for faster future loading.
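
For readers unfamiliar with RDS caching, the idea can be sketched with base R's `saveRDS()` and `readRDS()`. The cache location and the toy vocabulary below are assumptions made for this example; the package manages its own cache location.

```r
# Hypothetical cache location, chosen only for this sketch.
cache_file <- file.path(tempdir(), "toy_vocab.rds")

# Toy vocabulary in the same shape the package uses.
toy_vocab <- c("[PAD]" = 0L, "[CLS]" = 1L, "hi" = 2L)
attr(toy_vocab, "is_cased") <- FALSE

# Serialize the parsed vocabulary once...
saveRDS(toy_vocab, cache_file)

# ...and reload it later, names and attributes included, without re-parsing
# the text file.
cached_vocab <- readRDS(cache_file)
identical(cached_vocab, toy_vocab)
```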
```r
library(wordpiece)

# Get path to sample vocabulary included with package.
vocab_path <- system.file("extdata", "tiny_vocab.txt", package = "wordpiece")

# Load the vocabulary.
vocab <- load_or_retrieve_vocab(vocab_path, use_cache = FALSE)

# Take a peek at the vocabulary.
head(vocab)
#> [PAD] [CLS] [SEP]     !     .     ,
#>     0     1     2     3     4     5
```
Tokenize text by calling `wordpiece_tokenize` on the text, passing the vocabulary as the `vocab` parameter. The output of `wordpiece_tokenize` is a named integer vector of token indices.
```r
# Now tokenize some text!
wordpiece_tokenize(text = "I love tacos, apples, and tea!", vocab = vocab)
#>     i  love tacos     ,   app ##les     ,   and     t   ##e   ##a     !
#>     6     7     8     5    10    11     5     9    30    41    37     3
```
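
Because the result is an ordinary named integer vector, base R is enough to pull it apart. The sketch below uses a hand-built stand-in for the first few tokens of the output shown above, so that it runs without the package; the variable names are illustrative only.

```r
# Hand-built stand-in for part of a wordpiece_tokenize() result.
tokens <- c("i" = 6L, "love" = 7L, "tacos" = 8L, "," = 5L,
            "app" = 10L, "##les" = 11L)

token_strings <- names(tokens)  # the word pieces
token_ids <- unname(tokens)     # bare integer indices, e.g. for model input

# Rejoin pieces into words by gluing "##" continuations onto the previous
# piece.
rejoined <- gsub(" ##", "", paste(token_strings, collapse = " "), fixed = TRUE)
rejoined
#> [1] "i love tacos , apples"
```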
The above vocabulary contained no tokens starting with an uppercase letter, so it was assumed to be uncased. When tokenizing text with an uncased vocabulary, the input is converted to lowercase before any other processing is applied. If the vocabulary contains at least one capitalized token, it will be taken as case-sensitive, and the case of the input text is preserved. Note that in a cased vocabulary, capitalized and uncapitalized versions of the same word are different tokens, and must both be included in the vocabulary to be recognized.
```r
# The above vocabulary was uncased.
attr(vocab, "is_cased")
#> [1] FALSE

# Here is the same vocabulary, but containing the capitalized token "Hi".
vocab_path2 <- system.file("extdata", "tiny_vocab_cased.txt", package = "wordpiece")
vocab_cased <- load_or_retrieve_vocab(vocab_path2, use_cache = FALSE)
head(vocab_cased)
#> [PAD] [CLS] [SEP]     !     .     ,
#>     0     1     2     3     4     5

# vocab_cased is inferred to be case-sensitive...
attr(vocab_cased, "is_cased")
#> [1] TRUE

# ... so the tokenization will *not* convert strings to lowercase, and so the
# words "I" and "And" are not found in the vocabulary (though "and" still is).
wordpiece_tokenize(text = "And I love tacos and salsa!", vocab = vocab_cased)
#> [UNK] [UNK]  love tacos   and     s   ##a   ##l   ##s   ##a     !
#>    64    64     8     9    10    30    38    49    56    38     3
```
Note that the default value for the `unk_token` argument, "[UNK]", is present in the above vocabularies, so it had an integer index in the tokenization. If that token were not in the vocabulary, its index would be coded as `NA`.
```r
wordpiece_tokenize(text = "I love tacos!", vocab = vocab_cased, unk_token = "[missing]")
#> [missing]      love     tacos         !
#>        NA         8         9         3
```
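
Since `NA` indices are usually not valid model input, it can help to check for them before moving on. The following is a minimal base R sketch with a toy vocabulary and a hand-built tokenization result; `has_unk()` is a hypothetical helper, not a package function.

```r
# Toy vocabulary (assumed for this sketch only).
toy_vocab <- c("[UNK]" = 0L, "love" = 1L, "tacos" = 2L, "!" = 3L)

# Hypothetical helper: is the chosen unknown token actually in the vocabulary?
has_unk <- function(vocab, unk_token = "[UNK]") {
  unk_token %in% names(vocab)
}

has_unk(toy_vocab)               # TRUE
has_unk(toy_vocab, "[missing]")  # FALSE

# Stand-in for a result produced with an out-of-vocabulary unk_token.
tokens <- c("[missing]" = NA_integer_, "love" = 1L, "tacos" = 2L, "!" = 3L)

# Positions whose index could not be resolved.
unknown_positions <- which(is.na(tokens))
```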
The package defaults are set to be compatible with BERT tokenization. If you have a different use case, be sure to check all parameter values.