Welcome


  • This lesson on Natural language processing in Python is aimed at researchers working in the humanities and/or social sciences.
  • It introduces NLP and aims to implement first practical NLP applications from scratch.

Episode 1: Introducing NLP


Episode 2: Preprocessing


  • Preprocessing involves a number of steps that you can apply to your text to prepare it for further processing.
  • Preprocessing is important because it can improve your results.
  • You do not always need to apply every preprocessing step; which steps matter depends on the task at hand.
  • Common preprocessing steps are: lowercasing, tokenization, stop word removal, lemmatization, and part-of-speech tagging (see the sketch after this list).
  • Often you can use a pretrained model to process and analyse your data.
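
A minimal sketch of these steps using spaCy, one such pretrained pipeline (the model name en_core_web_sm and the example sentence are assumptions for illustration):

```python
# Minimal preprocessing sketch using spaCy's pretrained English pipeline.
# Assumes the model was installed with: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
text = "The researchers were analysing historical newspapers."

doc = nlp(text.lower())  # lowercasing; the pipeline then tokenizes the text

for token in doc:
    # stop word (and punctuation) removal
    if token.is_stop or token.is_punct:
        continue
    # token.text: the token; token.lemma_: its lemma; token.pos_: its POS tag
    print(token.text, token.lemma_, token.pos_)
```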

Episode 3: Word embeddings


  • We can represent text as vectors of numbers, which makes it interpretable for machines.
  • The most efficient and useful representation is word embeddings.
  • We can easily compute how similar two words are using cosine similarity (see the sketch after this list).
  • Corpus-based word embeddings have many dimensions, and these dimensions are not transparent (they have no direct interpretation).
  • We can explore linguistic categories via word2vec by extracting the vectors of words belonging to a category we wish to investigate.
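
A minimal sketch of cosine similarity between embeddings, using gensim (the model name "glove-wiki-gigaword-50" is just one of the pre-trained options in gensim's downloader, and the word pair is arbitrary):

```python
import numpy as np
import gensim.downloader as api

# Load a small set of pre-trained embeddings (~65 MB download)
model = api.load("glove-wiki-gigaword-50")

v1, v2 = model["king"], model["queen"]

# cosine similarity = dot(v1, v2) / (||v1|| * ||v2||)
cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cos)

# gensim computes the same value directly:
print(model.similarity("king", "queen"))
```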

  • To visualise word embeddings, we must reduce their dimensionality to 2 (see the sketch below).
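
A sketch of one way to do this, using PCA from scikit-learn (t-SNE is a common alternative; the model name and word list are assumptions for illustration):

```python
import gensim.downloader as api
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

model = api.load("glove-wiki-gigaword-50")

words = ["king", "queen", "man", "woman", "paris", "london"]
vectors = [model[w] for w in words]

# Project the 50-dimensional vectors onto 2 dimensions
coords = PCA(n_components=2).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y))
plt.show()
```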

  • Word2vec does not handle polysemy well, as it does not produce a different embedding for a word depending on its context.

  • We can either train our own word2vec model or load a pre-trained one.

  • The embeddings of a model we train ourselves will reflect the statistics of our input dataset.

  • Loading a (big) pre-trained word2vec model gives us embeddings that better reflect general syntactic and semantic relationships among (pairs of) words. Which option to use depends on your research question (see the sketch below).
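
A minimal sketch of both options with gensim 4.x (the toy sentences stand in for a real tokenized corpus, and "word2vec-google-news-300" is one widely used pre-trained model in gensim's downloader):

```python
from gensim.models import Word2Vec
import gensim.downloader as api

# Option 1: train word2vec on our own (tokenized) corpus; the resulting
# embeddings reflect the statistics of this input data.
sentences = [
    ["the", "judge", "read", "the", "letter"],
    ["the", "clerk", "copied", "the", "letter"],
]
own_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)
print(own_model.wv.most_similar("letter", topn=2))

# Option 2: load a big pre-trained model (~1.6 GB download); its embeddings
# better reflect general syntactic/semantic relationships between words.
pretrained = api.load("word2vec-google-news-300")
print(pretrained.most_similar("letter", topn=2))
```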