9 Key Concepts for Understanding Natural Language Processing
Oct. 16, 2017
Natural Language Processing, or NLP, allows the automatic extraction of information expressed in human language. It is used in text classification, machine translation, opinion analysis and chatbots, among many other applications.
When talking about Natural Language Processing, terms from linguistics, statistics and computation come together. This varied terminology can be confusing when you first approach NLP. With this list of 9 key concepts, we try to clear up that confusion.
In this list we explain in a simple way some of the most common terms used when talking about NLP.
Named Entity Recognition (NER)
It is the process of identifying and classifying the entities mentioned in a text into predefined categories, such as names of people, places, monetary amounts or organizations.
Think, for example, of a Natural Language Processing solution that classifies the minutes of our company's meetings. Before analyzing more complex issues, the system should automatically identify meeting participants or meeting dates.
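As an illustration, here is a minimal dictionary-based sketch of this idea. The entity names and categories below are made up for the example; real NER systems use statistical or neural models trained on labelled data rather than a fixed lookup table.

```python
# Toy dictionary-based entity tagger (hypothetical names and categories).
ENTITY_LEXICON = {
    "Alice": "PERSON",
    "Bob": "PERSON",
    "London": "LOCATION",
    "Acme Corp": "ORGANIZATION",
}

def tag_entities(text):
    """Return (entity, category) pairs found in the text."""
    found = []
    for entity, category in ENTITY_LEXICON.items():
        if entity in text:
            found.append((entity, category))
    return found

minutes = "Alice and Bob met in London to discuss the Acme Corp budget."
print(tag_entities(minutes))
```

A real system would also handle entities it has never seen before, which is exactly what the lookup table cannot do.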
Lemmatization
It consists of identifying the lemma of a word, its base or dictionary form, and grouping its inflected forms under it.
If, for example, we want to search for texts on cleaning, the system should also recognize words such as clean, cleaner, cleanliness or cleanse.
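A minimal sketch of lookup-based lemmatization follows; the word table is a hand-written illustration mirroring the "cleaning" example, while real lemmatizers combine large dictionaries with morphological rules.

```python
# Tiny lookup table mapping inflected forms to their lemma (illustrative only).
LEMMA_TABLE = {
    "cleaning": "clean", "cleaned": "clean", "cleans": "clean",
    "cleaner": "clean",
    "am": "be", "is": "be", "are": "be",
}

def lemmatize(word):
    # Fall back to the lowercased word itself when no lemma is known.
    return LEMMA_TABLE.get(word.lower(), word.lower())

print([lemmatize(w) for w in ["Cleaning", "cleans", "is", "house"]])
# -> ['clean', 'clean', 'be', 'house']
```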
Part-of-speech (POS) tagging
This term is used to describe the grammatical labelling process. Specifically, POS tagging consists of identifying which category (noun, adjective, verb...) each word of a text belongs to.
Advanced systems take into account that the same term can change its grammatical category depending on its context. For instance, a huge number of words (such as face, answer, cook, attack, control, and many others) can act as both nouns and verbs.
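The context dependence can be sketched with a deliberately simple rule: after a determiner, guess noun; otherwise, guess verb. The rule and word lists below are illustrative assumptions; real taggers use statistical models such as hidden Markov models or neural networks.

```python
# Words that can be either a noun or a verb depending on context.
AMBIGUOUS = {"cook", "face", "answer", "attack", "control"}
DETERMINERS = {"the", "a", "an"}

def tag(tokens):
    tags = []
    for i, word in enumerate(tokens):
        if word in AMBIGUOUS:
            prev = tokens[i - 1] if i > 0 else ""
            # Naive rule: a determiner before the word suggests a noun.
            tags.append("NOUN" if prev in DETERMINERS else "VERB")
        elif word in DETERMINERS:
            tags.append("DET")
        else:
            tags.append("OTHER")
    return list(zip(tokens, tags))

print(tag("the cook will cook dinner".split()))
```

Note how the same word, cook, receives two different tags in one sentence.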
Stop words
Stop words are words that provide very little information about the overall meaning of a text. This category includes articles, prepositions and pronouns.
Let's say that our NLP solution automatically organizes our company's documents according to the department to which they belong. The most common words (such as the, at or which) don’t offer relevant clues about the content of the text. So, those words should be filtered out before further processing.
Bag of words
The bag-of-words method, or BoW model, represents a text by the number of times each word appears in it.
This system focuses on word frequency and it disregards grammar and word order.
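In Python, such a representation is just a word-count table, as this short sketch shows:

```python
from collections import Counter

# Bag-of-words: only word frequencies survive; order and grammar are lost.
def bag_of_words(text):
    return Counter(text.lower().split())

print(bag_of_words("to be or not to be"))
```

Note that "to be or not to be" and "be to be not or to" produce exactly the same bag, which is precisely the information the model throws away.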
Word Sense Disambiguation
A word may have different meanings. Word Sense Disambiguation, or WSD, is the process of identifying which of these meanings is the correct one in a given context.
Analyzing semantic relationships between terms, an NLP system is able to distinguish whether the word capital refers to a city or a sum of money.
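A classic simple approach is Lesk-style overlap: pick the sense whose definition shares the most words with the surrounding context. The sense inventory below is hand-written for illustration, not a real lexical resource.

```python
# Hypothetical sense inventory: each sense maps to words from its definition.
SENSES = {
    "capital": {
        "city": {"city", "government", "country", "seat"},
        "money": {"money", "funds", "investment", "wealth"},
    }
}

def disambiguate(word, context_words):
    context = {w.lower() for w in context_words}
    # Choose the sense with the largest overlap between definition and context.
    best = max(SENSES[word].items(), key=lambda kv: len(kv[1] & context))
    return best[0]

print(disambiguate("capital", "the capital of the country".split()))
# prints "city"
```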
Sentiment analysis
It is the process of identifying the subjective information in a text. Sentiment analysis is usually used to classify texts as positive, negative or neutral.
Sentiment analysis is widely applied in Marketing and Customer service. For instance, one of its most common applications is to monitor customer opinions expressed on the Internet.
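At its simplest, this can be done with polarity lexicons, as in the sketch below; the word lists are illustrative assumptions, and real systems also handle negation, intensifiers and context.

```python
# Tiny illustrative polarity lexicons.
POSITIVE = {"good", "great", "excellent", "love", "happy"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "slow"}

def sentiment(text):
    words = text.lower().split()
    # Score = positive hits minus negative hits.
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("great product, I love it"))   # prints "positive"
print(sentiment("terrible and slow service"))  # prints "negative"
```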
Latent semantic analysis (LSA)
This technique consists of identifying patterns in the relationships of the concepts contained in a document. It is based on the premise that words used in the same context tend to have similar meanings.
LSA uses a mathematical technique called singular value decomposition (SVD) to reduce a term-document matrix to a smaller number of latent dimensions.
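A minimal sketch with NumPy, on a tiny hand-made term-document count matrix (rows are terms, columns are documents): SVD factors the matrix into U, the singular values s, and Vt, and keeping only the largest singular values gives the low-rank "latent semantic" approximation.

```python
import numpy as np

# Toy term-document counts: two "car" documents and one "banana" document.
A = np.array([
    [2, 1, 0],   # "car"
    [1, 2, 0],   # "engine"
    [0, 0, 3],   # "banana"
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # keep the two largest singular values (the latent dimensions)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(A_k, 2))
```

In the reduced space, terms that co-occur in the same documents (car and engine here) end up close together, which is how LSA captures similarity of meaning.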
Latent Dirichlet Allocation (LDA) Model
It is based on the principle that each text is a mixture of different topics and that each word is attributable to one of these topics.
It is a generative method, in which topics are automatically inferred through word analysis.
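The generative story can be sketched directly: each document draws a topic mixture from a Dirichlet distribution (sampled here as normalized Gamma draws), then each word first picks a topic from that mixture and then a word from the topic. The two topics and their vocabularies are hypothetical; real LDA infers them from data rather than being given them.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical topics with tiny vocabularies (for illustration only).
TOPICS = {
    "sports": ["match", "goal", "team", "coach"],
    "finance": ["market", "stock", "capital", "bank"],
}

def sample_dirichlet(alpha, k):
    # A Dirichlet sample is a normalized vector of Gamma draws.
    draws = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha=0.5):
    names = list(TOPICS)
    mixture = sample_dirichlet(alpha, len(names))  # per-document topic mixture
    words = []
    for _ in range(n_words):
        topic = random.choices(names, weights=mixture)[0]  # pick a topic
        words.append(random.choice(TOPICS[topic]))         # pick a word from it
    return words

print(generate_document(6))
```

LDA inference runs this story in reverse: given only the documents, it recovers the topics and each document's mixture.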