fastText is a library for efficient learning of word representations and sentence classification.
In NLP, word vectors play an important role in almost all algorithms. They add transfer learning capabilities to an algorithm, which in turn reduces the need for massive datasets. Several word vectorization algorithms are available; among them, Facebook's fastText is one of the newest and best-performing.
Almost all word vectorization algorithms assume some lexical element as the basic building block of a language. Some of them, including Word2Vec and Stanford's GloVe, treat the word itself as that building block. This is where fastText is different. Although Word2Vec and GloVe perform really well on languages like English and Hindi, they are less effective on morphologically rich languages, precisely because they treat the word as the basic unit. fastText instead assumes a character n-gram as the basic lexical unit. With this assumption, it can generate vectors for each possible n-gram in a sentence, and each word is then represented as the sum of its constituent n-gram vectors. This is great, because we can represent any word: even for a word outside the pre-trained dictionary, whether valid or misspelled, fastText can compute a best-matching vector representation by summing over its component n-grams. In this way, fastText handles spelling mistakes elegantly. fastText also performs strongly at text classification for morphologically complex languages. For more technical information, refer to the papers mentioned here.
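To make the subword idea concrete, here is a minimal, illustrative sketch (not the library's actual code) of how a word can be split into character n-grams and represented as the sum of their vectors. The boundary markers `<` and `>` follow fastText's convention; the toy vector table is an assumption for demonstration.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of a word, using fastText's < > boundary markers."""
    token = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(token) - n + 1):
            grams.append(token[i:i + n])
    return grams

def word_vector(word, ngram_vectors, dim):
    """Toy version of fastText's word representation: the sum of its n-gram vectors.
    Unknown n-grams contribute nothing here; the real library hashes them into buckets."""
    vec = [0.0] * dim
    for gram in char_ngrams(word):
        for i, v in enumerate(ngram_vectors.get(gram, [0.0] * dim)):
            vec[i] += v
    return vec

print(char_ngrams("where", minn=3, maxn=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Because a misspelled word still shares most of its n-grams with the correct form, the summed vectors stay close, which is why fastText degrades gracefully on typos.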
Building a Q&A chatbot with fastText #
This is a fairly straightforward application of fastText. To build a chatbot from scratch, three separate machine learning problems need to be solved, two of which belong to the field of NLP. The first of these is text classification. As mentioned above, fastText is good at text classification: it handles the languages that existing systems already address, along with a new set of languages, and in principle it can address almost all languages. We have tested it on Malayalam, a highly inflectional and agglutinative language, and the results were impressive.
Below is the overall architecture of our Q&A chatbot.
Training data: #
Data for training is provided by the user of the application as question-answer pairs. In this demo we currently accept the data as JSON, which makes it easy to later expose an API service that collects the same data from any web-based UI. At the backend, this data is transformed into fastText's supervised input format for training.
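The conversion step can be sketched as follows. The JSON field names (`question`, `intent`) and the sample data are hypothetical; the output format, however, is fastText's standard supervised format of one example per line with the label prefixed by `__label__`.

```python
import json

# Hypothetical question-answer training data, keyed by intent label.
qa_json = '''
[
  {"question": "what are your working hours", "intent": "hours"},
  {"question": "where is your office located", "intent": "location"}
]
'''

def to_fasttext_format(pairs):
    """Convert JSON QA pairs into fastText supervised input lines."""
    lines = []
    for pair in pairs:
        lines.append(f"__label__{pair['intent']} {pair['question']}")
    return "\n".join(lines)

print(to_fasttext_format(json.loads(qa_json)))
# __label__hours what are your working hours
# __label__location where is your office located
```

In a real deployment the resulting lines would be written to a training file that fastText reads directly.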
Model parameters: #
Below are the fastText model parameters we used to train our model.
- learning rate: 0.1
- size of word vectors: 300
- number of negatives sampled: 5
- min length of char n-gram: 3
- max length of char n-gram: 6
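These map directly onto keyword arguments of the official fastText Python API's `train_supervised` function. The sketch below shows that mapping; the training-file path and the example query are placeholders, and the actual training call is commented out since it requires the `fasttext` package and a prepared data file.

```python
# The parameters above, as fastText keyword arguments.
params = {
    "lr": 0.1,    # learning rate
    "dim": 300,   # size of word vectors
    "neg": 5,     # number of negatives sampled
    "minn": 3,    # min length of char n-gram
    "maxn": 6,    # max length of char n-gram
}

# import fasttext
# model = fasttext.train_supervised(input="qa_train.txt", **params)  # placeholder path
# labels, probs = model.predict("where is your office located")
```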