Welcome


  • This lesson on Natural language processing in Python is aimed at researchers working in the humanities and/or social sciences.
  • It introduces NLP and aims to implement first practical NLP applications from scratch.

Episode 1: Introducing NLP


Episode 2: Preprocessing


  • Preprocessing involves a number of steps that you can apply to your text to prepare it for further processing.
  • Preprocessing is important because it can improve your results.
  • You do not always need to apply every preprocessing step; which steps matter depends on the task at hand.
  • Common preprocessing steps are: lowercasing, tokenization, stop word removal, lemmatization, and part-of-speech tagging (see the sketch after this list).
  • Often you can use a pretrained model to process and analyse your data.
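
A minimal sketch of these steps using spaCy, one such pretrained pipeline (the model name en_core_web_sm and the example sentence are assumptions for illustration):

```python
# Minimal preprocessing sketch using spaCy's pretrained English pipeline.
# Assumes the model was installed with: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
text = "The researchers were analysing historical newspapers."

doc = nlp(text.lower())  # lowercasing; the pipeline then tokenizes the text

for token in doc:
    # stop word (and punctuation) removal
    if token.is_stop or token.is_punct:
        continue
    # token.text: the token; token.lemma_: its lemma; token.pos_: its POS tag
    print(token.text, token.lemma_, token.pos_)
```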

Episode 3: Word embeddings


  • We can represent text as vectors of numbers, which makes it interpretable for machines.
  • The most efficient and useful representation is word embeddings.
  • We can easily compute how similar two words are using cosine similarity (see the sketch after this list).
  • Corpus-based word embeddings have many dimensions, and these dimensions are not transparent (they have no direct interpretation).
  • We can explore linguistic categories via word2vec by extracting the vectors of words belonging to a category we wish to investigate.
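
A minimal sketch of cosine similarity between embeddings, using gensim (the model name "glove-wiki-gigaword-50" is just one of the pre-trained options in gensim's downloader, and the word pair is arbitrary):

```python
import numpy as np
import gensim.downloader as api

# Load a small set of pre-trained embeddings (~65 MB download)
model = api.load("glove-wiki-gigaword-50")

v1, v2 = model["king"], model["queen"]

# cosine similarity = dot(v1, v2) / (||v1|| * ||v2||)
cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cos)

# gensim computes the same value directly:
print(model.similarity("king", "queen"))
```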

  • To visualise word embeddings, we must reduce their dimensionality to 2 (see the sketch below).
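
A sketch of one way to do this, using PCA from scikit-learn (t-SNE is a common alternative; the model name and word list are assumptions for illustration):

```python
import gensim.downloader as api
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

model = api.load("glove-wiki-gigaword-50")

words = ["king", "queen", "man", "woman", "paris", "london"]
vectors = [model[w] for w in words]

# Project the 50-dimensional vectors onto 2 dimensions
coords = PCA(n_components=2).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y))
plt.show()
```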

  • Word2vec does not handle polysemy well, as it does not produce a different embedding for a word depending on its context.

  • We can either train our own word2vec model or load a pre-trained one.

  • The embeddings of a model we train ourselves will reflect the statistics of our input dataset.

  • Loading a (big) pre-trained word2vec model gives us embeddings that better reflect general syntactic and semantic relationships among (pairs of) words. Which option to use depends on your research question (see the sketch below).
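
A minimal sketch of both options with gensim 4.x (the toy sentences stand in for a real tokenized corpus, and "word2vec-google-news-300" is one widely used pre-trained model in gensim's downloader):

```python
from gensim.models import Word2Vec
import gensim.downloader as api

# Option 1: train word2vec on our own (tokenized) corpus; the resulting
# embeddings reflect the statistics of this input data.
sentences = [
    ["the", "judge", "read", "the", "letter"],
    ["the", "clerk", "copied", "the", "letter"],
]
own_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)
print(own_model.wv.most_similar("letter", topn=2))

# Option 2: load a big pre-trained model (~1.6 GB download); its embeddings
# better reflect general syntactic/semantic relationships between words.
pretrained = api.load("word2vec-google-news-300")
print(pretrained.most_similar("letter", topn=2))
```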