Class Description
This is a graduate-level course on fundamental techniques in information retrieval and text mining. Students learn how to crawl, clean, process, and mine massive amounts of text data and infer knowledge from it. They learn how to build a search engine from scratch, including indexing, building retrieval models, and evaluating search performance. They also learn important machine learning and deep learning techniques for text data, including topic models, LSTM, and BERT. Finally, they explore frontier research topics in text mining and information retrieval and gain research experience in these topics by working on the final project.
Learning Goals:
- Perform basic text processing tasks, such as crawling data from the web and pre-processing and cleaning natural language sentences;
- Evaluate ranking algorithms by using information retrieval evaluation techniques, and implement text retrieval models such as TF-IDF and BM25;
- Use Elasticsearch to implement a prototype search engine on Twitter data;
- Derive inference algorithms for maximum likelihood estimation (MLE) and implement the expectation-maximization (EM) algorithm;
- Use tools such as LSTM/BERT for text classification tasks.
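As an illustration of the retrieval models named in the goals above, here is a minimal sketch of BM25 scoring in Python. It is not course material: the function name bm25_scores, the toy documents, whitespace tokenization, the parameter values k1=1.5 and b=0.75, and the smoothed IDF variant are all assumptions chosen for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against a tokenized `query`.

    Hypothetical helper for illustration; k1 and b are common default
    choices, not values prescribed by the course.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF (assumed variant) that stays positive for every term.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (made up for this example), tokenized by whitespace.
docs = [
    "information retrieval ranks documents for a query".split(),
    "topic models infer themes from text".split(),
    "search engines index documents and rank documents".split(),
]
scores = bm25_scores("rank documents".split(), docs)
```

In this toy corpus the third document matches both query terms and should score highest, while the second matches neither and scores zero.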
Recommended reading:
- Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining. ChengXiang Zhai and Sean Massung
- Introduction to Information Retrieval. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze