I’ve been into Golang lately, and today I’m glad to announce my second open source project in Go, following the feature flags API. This one is all about word segmentation.
What is the word segmentation problem?
Word segmentation is the process of dividing a phrase without spaces back into its constituent parts. For example, consider a phrase like thisisatest. Humans can immediately identify that the correct phrase should be this is a test. But for machines, this is a tricky problem.
An approach to this problem
A basic idea would be to use a dictionary and to split off a word whenever the current chunk of letters is a valid word. But then you run into issues with phrases like peanutbutter, which this approach splits as pea nut butter instead of peanut butter, as the sketch below shows.
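To see where the naive approach goes wrong, here is a minimal sketch of a greedy dictionary-based splitter. The naiveSplit function and the tiny dictionary are made up for the illustration and are not part of the package.

```go
package main

import "fmt"

// naiveSplit greedily takes the first (shortest) dictionary word it can
// find at each position. With "peanutbutter" it matches "pea" before it
// ever considers "peanut", so the result is "pea nut butter".
func naiveSplit(text string, dictionary map[string]bool) []string {
	var words []string
	start := 0
	for start < len(text) {
		matched := false
		for end := start + 1; end <= len(text); end++ {
			if dictionary[text[start:end]] {
				words = append(words, text[start:end])
				start = end
				matched = true
				break
			}
		}
		if !matched {
			// No dictionary word starts here; keep the rest as-is.
			words = append(words, text[start:])
			break
		}
	}
	return words
}

func main() {
	dictionary := map[string]bool{
		"pea": true, "nut": true, "peanut": true, "butter": true,
	}
	fmt.Println(naiveSplit("peanutbutter", dictionary))
	// Output: [pea nut butter]
}
```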
The idea is instead to take advantage of the frequencies of words in a corpus. This is where the concept of an n-gram comes in. In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application.
For example, this is an extract of some unigrams in a corpus composed of 1,024,908,267,229 words distributed by the Linguistic Data Consortium.
used 421438139
go 421086358
b 419765694
work 419483948
last 417601616
most 416210411
music 414028837
buy 410780176
data 406908328
make 405084642
them 403000411
should 402028056
Using unigrams and bigrams, we can score an arrangement of words. This is what the score method does, for example.
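To give an idea of how such a score can work, here is a simplified sketch based on unigram and bigram log-probabilities. The counts, the variable names and the exact formula are illustrative assumptions, not the package's actual score implementation.

```go
package main

import (
	"fmt"
	"math"
)

// Illustrative frequency tables; the real package loads them from the
// corpus files. The counts below are made up for the example.
var (
	total    = 1024908267229.0
	unigrams = map[string]float64{"peanut": 3300000, "butter": 4800000, "pea": 1200000, "nut": 2100000}
	bigrams  = map[string]float64{"peanut butter": 580000}
)

// score returns a log-probability for a word, conditioned on the
// previous word when a bigram count is available.
func score(word, previous string) float64 {
	if previous != "" {
		if big, ok := bigrams[previous+" "+word]; ok {
			if uni, ok := unigrams[previous]; ok {
				// P(word | previous) = count(previous word) / count(previous)
				return math.Log10(big / uni)
			}
		}
	}
	if uni, ok := unigrams[word]; ok {
		return math.Log10(uni / total)
	}
	// Heavily penalize unknown words, more so the longer they are.
	return math.Log10(10.0 / (total * math.Pow(10, float64(len(word)))))
}

func main() {
	// "peanut butter" scores higher than "pea nut butter".
	fmt.Println(score("peanut", "") + score("butter", "peanut"))
	fmt.Println(score("pea", "") + score("nut", "pea") + score("butter", "nut"))
}
```

The arrangement with the highest total log-probability wins, which is why peanut butter beats pea nut butter even though all three short words exist in the corpus.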
Concurrency and channels
This was also a great opportunity for me to work with channels, because some parts of the program can run in parallel. I’m just starting to work with goroutines and channels, but I really like them!
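As an illustration of the pattern rather than the package's actual code, here is a small sketch that scores candidate splits in separate goroutines and collects the results through a channel. The scoreWords function is a stand-in so the example stays self-contained.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

type candidate struct {
	words []string
	score float64
}

// scoreWords is a placeholder for the real scoring logic; here it simply
// favors longer words so the example runs on its own.
func scoreWords(words []string) float64 {
	s := 0.0
	for _, w := range words {
		s += float64(len(w) * len(w))
	}
	return s
}

func main() {
	candidates := [][]string{
		{"pea", "nut", "butter"},
		{"peanut", "butter"},
	}

	results := make(chan candidate)
	var wg sync.WaitGroup

	// Score each candidate split in its own goroutine and send the
	// result back through a channel.
	for _, words := range candidates {
		wg.Add(1)
		go func(words []string) {
			defer wg.Done()
			results <- candidate{words, scoreWords(words)}
		}(words)
	}

	// Close the channel once every goroutine has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	best := candidate{score: -1}
	for c := range results {
		if c.score > best.score {
			best = c
		}
	}
	fmt.Println(strings.Join(best.words, " "))
	// Output: peanut butter
}
```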
Take a look at the source code and the documentation on GitHub: github.com/AntoineAugusti/wordsegmentation