I’ve been into Golang lately, and today I’m glad to announce my second open source project in Golang, following the feature flags API. This new package is all about word segmentation.
What is the word segmentation problem?
Word segmentation is the process of dividing a phrase without spaces back into its constituent parts. For example, consider a phrase like thisisatest. Humans can immediately identify that the correct phrase should be this is a test. But for machines, this is a tricky problem.
An approach to this problem
A basic idea would be to use a dictionary and to split as soon as the current chunk of letters is a valid word. But then you run into issues with inputs like peanutbutter, which this approach splits as pea nut butter instead of peanut butter.
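To make the failure concrete, here is a minimal sketch of the dictionary approach. The function name greedySplit and the tiny dictionary are made up for illustration, and the code assumes ASCII input since it slices the string by bytes.

```go
package main

import (
	"fmt"
	"strings"
)

// greedySplit walks the input left to right and cuts as soon as the
// current chunk of letters is a dictionary word (shortest match wins).
// Letters left over at the end are returned as a final chunk.
// Hypothetical helper, not part of the wordsegmentation package.
func greedySplit(text string, dict map[string]bool) []string {
	var words []string
	start := 0
	for end := 1; end <= len(text); end++ {
		if dict[text[start:end]] {
			words = append(words, text[start:end])
			start = end
		}
	}
	if start < len(text) {
		words = append(words, text[start:])
	}
	return words
}

func main() {
	// Toy dictionary, invented for this example.
	dict := map[string]bool{"pea": true, "nut": true, "peanut": true, "butter": true}
	fmt.Println(strings.Join(greedySplit("peanutbutter", dict), " "))
}
```

Because the split happens at the first valid word, pea is cut off before peanut ever gets a chance, even though peanut is in the dictionary.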
A better idea is to take advantage of how frequently words appear in a corpus. This is where the concept of an n-gram comes in. In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs, depending on the application.
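As a quick illustration of the definition, the sketch below extracts word n-grams from a list of words. The function name ngrams is an assumption made for this example. With n = 1 it yields unigrams and with n = 2 it yields bigrams.

```go
package main

import (
	"fmt"
	"strings"
)

// ngrams returns every contiguous run of n items from the input slice.
// It returns nil when n is not positive or is larger than the input.
func ngrams(items []string, n int) [][]string {
	if n <= 0 || n > len(items) {
		return nil
	}
	out := make([][]string, 0, len(items)-n+1)
	for i := 0; i+n <= len(items); i++ {
		out = append(out, items[i:i+n])
	}
	return out
}

func main() {
	words := strings.Fields("this is a test")
	// Print the bigrams (n = 2), one per line.
	for _, g := range ngrams(words, 2) {
		fmt.Println(strings.Join(g, " "))
	}
}
```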
For example, this is an extract of some unigrams in a corpus composed of 1,024,908,267,229 words distributed by the Linguistic Data Consortium.
Using unigrams and bigrams, we can score an arrangement of words. This is what the score method does, for example.
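Here is a minimal sketch of how such a score could be computed. It is not the package's actual score method. The counts, the names wordProb and scoreArrangement, the "<s>" start marker and the penalty for unknown words are all assumptions made for this example. Each word's probability comes from the bigram count when one is available, otherwise from the unigram frequency, and the log probabilities are summed.

```go
package main

import (
	"fmt"
	"math"
)

// Hypothetical counts, invented for this example. Real counts would come
// from a corpus.
var (
	total    = 1000000.0
	unigrams = map[string]float64{"pea": 50, "nut": 80, "peanut": 120, "butter": 200}
	bigrams  = map[string]float64{"peanut butter": 90}
)

// wordProb estimates P(word | prev). It uses count(prev word) / count(prev)
// when the bigram is known, and falls back to the unigram frequency.
func wordProb(word, prev string) float64 {
	if b, ok := bigrams[prev+" "+word]; ok && unigrams[prev] > 0 {
		return b / unigrams[prev]
	}
	if u, ok := unigrams[word]; ok {
		return u / total
	}
	// Unknown word: a small probability that shrinks with word length
	// (an assumed penalty, chosen for illustration).
	return 10 / (total * math.Pow(10, float64(len(word))))
}

// scoreArrangement sums log10 probabilities over the words of an
// arrangement. Scores are negative, and a higher score is better.
func scoreArrangement(words []string) float64 {
	sum, prev := 0.0, "<s>"
	for _, w := range words {
		sum += math.Log10(wordProb(w, prev))
		prev = w
	}
	return sum
}

func main() {
	good := scoreArrangement([]string{"peanut", "butter"})
	bad := scoreArrangement([]string{"pea", "nut", "butter"})
	fmt.Println(good > bad)
}
```

With these made-up counts, peanut butter scores higher than pea nut butter, since the bigram makes butter very likely after peanut and the three-word split pays for an extra word.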
Concurrency and channels
This was also a great opportunity for me to work with channels, because some parts of the program can be run in parallel. I’m just starting to work with goroutines and channels, but I really like them!
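As an illustration only, and not necessarily how the package is organised, here is a sketch of the pattern: two independent pieces of work run in their own goroutines and hand their results back over channels. The names loadCounts and loadAll are hypothetical, and the in-memory maps stand in for whatever data would really be loaded.

```go
package main

import "fmt"

// loadCounts is a hypothetical stand-in for parsing a data file. Here it
// only copies in-memory data, then sends the result on the channel.
func loadCounts(data map[string]float64, out chan<- map[string]float64) {
	counts := make(map[string]float64, len(data))
	for k, v := range data {
		counts[k] = v
	}
	out <- counts
}

// loadAll starts both loads concurrently and waits for both results.
func loadAll() (map[string]float64, map[string]float64) {
	uniCh := make(chan map[string]float64)
	biCh := make(chan map[string]float64)

	go loadCounts(map[string]float64{"peanut": 120, "butter": 200}, uniCh)
	go loadCounts(map[string]float64{"peanut butter": 90}, biCh)

	// Each receive blocks until the matching goroutine has sent its map.
	uni := <-uniCh
	bi := <-biCh
	return uni, bi
}

func main() {
	uni, bi := loadAll()
	fmt.Println(len(uni), len(bi))
}
```

The channels do double duty here: they carry the results and they synchronise the goroutines, so no extra locking is needed.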
Take a look at the source code and the documentation on GitHub: github.com/AntoineAugusti/wordsegmentation