Class Description

This course is a graduate-level course on fundamental techniques in information retrieval and text mining. Students will learn how to crawl, clean, process, mine, and infer knowledge from massive amounts of text data; how to build a search engine from scratch, including indexing, building retrieval models, and evaluating search performance; key machine learning and deep learning techniques for text data, including topic models, LSTMs, and BERT; and frontier research topics in text mining and information retrieval, gaining research experience with these topics through the final project.

Learning Goals:

  • Perform basic text processing tasks such as crawling data from the web and pre-processing and cleaning natural language text;
  • Evaluate ranking algorithms using information retrieval evaluation techniques, and implement text retrieval models such as TF-IDF and BM25 (see the scoring sketch after this list);
  • Use Elasticsearch to implement a prototype search engine on Twitter data;
  • Derive inference algorithms for maximum likelihood estimation (MLE) and implement the expectation-maximization (EM) algorithm;
  • Use tools such as LSTMs/BERT for text classification tasks.
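
As a concrete illustration of the retrieval models mentioned above, the sketch below scores a query against a small in-memory corpus with BM25. This is a minimal sketch, not course-provided code: the toy documents, the whitespace tokenization, and the parameter values k1=1.5 and b=0.75 are illustrative assumptions, not part of the assignments.

import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` with BM25.

    `docs` is a list of token lists; `query` is a list of tokens.
    k1 and b are the usual BM25 hyperparameters (illustrative defaults).
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N   # average document length
    df = Counter()                          # document frequency per term
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)                     # term frequencies in this document
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(d) / avgdl))
            score += idf * norm
        scores.append(score)
    return scores

# Toy example: rank three tokenized "tweets" for the query "text mining".
docs = [["text", "mining", "and", "retrieval"],
        ["search", "engine", "indexing"],
        ["mining", "massive", "text", "data", "with", "text", "models"]]
print(bm25_scores(["text", "mining"], docs))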

Recommended reading:


Schedule

Week     Date    Topic
Week 1   Sep 11  History of IR, vector space model
Week 2   Sep 18  Probabilistic ranking principle, probabilistic/LM-based retrieval
Week 3   Sep 25  IR evaluation, relevance feedback
Week 4   Oct 2   IR infrastructure: inverted index
Week 5   Oct 9   Web search; learning to rank
Week 6   Oct 16  Topic models, representing semantics
Week 7   Oct 23  Word2vec, recurrent networks
Week 8   Oct 30  Midterm
Week 9   Nov 6   Attention; Transformer architecture
Week 10  Nov 13  Pre-trained LMs, automated ML
Week 11  Nov 20  Parameter-efficient fine-tuning, in-context learning
Week 12  Nov 27  Security/hate speech/misinformation
Week 13  Dec 4   Bias/evaluation/AI safety