Word segmentation library in Golang

I’ve been into Golang lately, and today I’m glad to announce my second open source project in Golang, following the feature flags API. My second package is all about word segmentation.

What is the word segmentation problem?

Word segmentation is the process of dividing a phrase without spaces back into its constituent parts. For example, consider a phrase like thisisatest. Humans can immediately identify that the correct phrase should be this is a test. But for machines, this is a tricky problem.

An approach to this problem

A basic idea would be to use a dictionary, and then to try to split words if the current chunk of letters is a valid word. But then you run into issues with sentences like peanutbutter that you will split with this approach as pea nut butter instead of peanut butter.

The idea was to take advantage of frequencies of words in a corpus. This is where the concept of a n-gram is used. In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application.

For example, this is an extract of some unigrams in a corpus composed of 1,024,908,267,229 words distributed by the Linguistic Data Consortium.
used 421438139 go 421086358 b 419765694 work 419483948 last 417601616 most 416210411 music 414028837 buy 410780176 data 406908328 make 405084642 them 403000411 should 402028056

Using unigrams and bigrams, we can score an arrangement of words. This is what is done in the score method for example.

Concurrency and channels

This was also a great opportunity for me to work with channels, because some parts of the program can be run in parallel. I’m just starting to work around goroutines and channels, but I really like it!

Take a look at the source code and the documentation on GitHub: github.com/AntoineAugusti/wordsegmentation