Class Description

This course is a graduate-level course on fundamental techniques in information retrieval and text mining. Students will learn how to crawl, clean, process, mine, and infer knowledge from massive amounts of text data; how to build a search engine from scratch, including indexing, building retrieval models, and evaluating search performance; key machine learning and deep learning techniques for text data, including topic models, LSTMs, and BERT; and frontier research topics in text mining and information retrieval, gaining research experience with these topics through the final project.

Learning Goals:

  • Perform basic text processing tasks such as crawling data from the web and pre-processing and cleaning natural language text;
  • Evaluate ranking algorithms using information retrieval evaluation techniques, and implement text retrieval models such as TF-IDF and BM25 (see the scoring sketch after this list);
  • Use Elasticsearch to implement a prototype search engine on Twitter data;
  • Derive inference algorithms for maximum likelihood estimation (MLE) and implement the expectation-maximization (EM) algorithm;
  • Use tools such as LSTMs/BERT for text classification tasks.
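
As a concrete illustration of the retrieval models mentioned above, the sketch below scores a query against a small in-memory corpus with BM25. This is a minimal sketch, not course-provided code: the toy documents, the whitespace tokenization, and the parameter values k1=1.5 and b=0.75 are illustrative assumptions, not part of the assignments.

import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` with BM25.

    `docs` is a list of token lists; `query` is a list of tokens.
    k1 and b are the usual BM25 hyperparameters (illustrative defaults).
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N   # average document length
    df = Counter()                          # document frequency per term
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)                     # term frequencies in this document
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(d) / avgdl))
            score += idf * norm
        scores.append(score)
    return scores

# Toy example: rank three tokenized "tweets" for the query "text mining".
docs = [["text", "mining", "and", "retrieval"],
        ["search", "engine", "indexing"],
        ["mining", "massive", "text", "data", "with", "text", "models"]]
print(bm25_scores(["text", "mining"], docs))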

Recommended reading:


Schedule

Week     Date    Topic
Week 1   Sep 11  History of IR, vector space model
Week 2   Sep 18  Probabilistic ranking principle, probabilistic/LM-based retrieval
Week 3   Sep 25  IR evaluation, relevance feedback
Week 4   Oct 2   IR infrastructure: inverted index
Week 5   Oct 9   Web search; learning to rank
Week 6   Oct 16  Topic models, representing semantics
Week 7   Oct 23  Word2vec, recurrent networks
Week 8   Oct 30  Midterm
Week 9   Nov 6   Attention; Transformer architecture
Week 10  Nov 13  Pre-trained LMs, automated ML
Week 11  Nov 20  Parameter-efficient fine-tuning, in-context learning
Week 12  Nov 27  Security/hate speech/misinformation
Week 13  Dec 4   Bias/evaluation/AI safety