Instructor Susan (Xueqing) Liu (xueqing.liu AT stevens DOT edu)
Meeting
  • Monday 6:30-9:00PM, Babbio 122
Susan's Office Hour
  • Wed: 11-1, GS 250
TA
  • Guanqun Yang
  • Fri: 4-6

Class Description

This graduate-level course covers fundamental techniques in information retrieval and text mining. Students learn how to crawl, clean, process, mine, and infer knowledge from massive amounts of text data, and how to build a search engine from scratch, including indexing, building retrieval models, and evaluating search performance. They also learn important machine learning and deep learning techniques for text data, including topic models, LSTM, and BERT. Finally, they study state-of-the-art research topics in text mining and information retrieval and gain research experience in these topics by working on the final project.

Learning Goals: Upon successful completion of this course, students should be able to:

  • Perform basic text processing tasks such as crawling data from the web, pre-processing and cleaning natural language sentences;
  • Evaluate ranking algorithms by using information retrieval evaluation techniques, and implement text retrieval models such as TF-IDF and BM25;
  • Use Elasticsearch to implement a prototypical search engine on Twitter data;
  • Derive inference algorithms for maximum likelihood estimation (MLE) and implement the expectation-maximization (EM) algorithm;
  • Use state-of-the-art tools such as LSTM and BERT for text classification tasks.
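
As a taste of the retrieval models named in the goals above, here is a minimal sketch of BM25 scoring in Python. It is not the course's reference implementation: the function name `bm25_scores`, the parameter defaults `k1=1.2` and `b=0.75`, the smoothed IDF variant, and the three toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`.

    Hypothetical helper for illustration; k1 and b are common default
    values, not values prescribed by the course.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF; the "1 +" inside the log keeps it non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (made up for this example), already lowercased and tokenized.
docs = [
    "how to parse json in python".split(),
    "java null pointer exception".split(),
    "python list comprehension examples".split(),
]
scores = bm25_scores("python json".split(), docs)
# Document indices ordered from most to least relevant.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

The first document matches both query terms and ranks first; the second matches neither and scores zero.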

Recommended reading:

Schedule

Note: some slides require a password, which you can find on the course home page in Canvas.

Date Slides/Readings Homework/Exams
Week 1 of Sept 12
Week 2 of Sept 19
Week 3 of Sept 26
  • HW1 due: implementing IR models and evaluations to retrieve Stack Overflow posts
Week 4 of Oct 3
Week 5 of Oct 10
  • HW2 due: using Elasticsearch to retrieve Stack Overflow posts
Week 6 of Oct 17
Week 7 of Oct 24
  • Project paper review due
Week 8 of Oct 31
  • Midterm
Week 9 of Nov 7
  • HW3 due: deriving and implementing the EM algorithm for Stack Overflow word tagging
Week 10 of Nov 14
  • Project proposal due
Week 11 of Nov 21
  • HW4 due: Stack Overflow post tag prediction
Week 12 of Nov 28
  • Frontier topic 3

  • Reading
Week 13 of Dec 5
  • Project presentation

  • Reading
Week 14 of Dec 12
  • Project presentation

Final Grade Calculator

Homework 40%
Midterm 30%
Project 30%
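
The weights above combine into a final grade as a weighted average. The short Python sketch below shows the arithmetic, assuming each component is scored on a 0-100 scale; the function name `final_grade` and the sample scores are hypothetical, and the sketch ignores late penalties.

```python
# Component weights taken from the breakdown above.
WEIGHTS = {"homework": 0.40, "midterm": 0.30, "project": 0.30}


def final_grade(scores):
    """Return the weighted average of component scores (0-100 scale assumed)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Made-up scores for illustration only.
example = final_grade({"homework": 90, "midterm": 80, "project": 85})
```

With these sample scores, the components contribute 36, 24, and 25.5 points, for a final grade of 85.5.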

Policies

Late Policy: submissions within 24 hours of the deadline receive 90% credit; within 48 hours, 70%; after 48 hours, 0 points. Code that does not compile receives 0 points.

Academic Integrity: Students must follow the Stevens Honor System. This course has a zero-tolerance policy regarding plagiarism. You should complete all assignments and quizzes on your own. You may help your classmates with questions such as how to use the programming language, what the library classes or methods do, what the errors mean, and how to interpret the assignment instructions. You are encouraged to come to the instructor's and the CAs' office hours with any questions you have, or to email your questions to the instructor or the CAs. However, you may not give help to or receive help from others (except the CAs) with the actual implementation or answers for any of the assignments or tests. Do not show or share your code with others, and do not view or copy source code from others. For the same reason, you are not allowed to copy and paste code snippets found online into your assignments. All electronic work submitted for this course will be archived and subjected to automatic plagiarism detection. Whenever in doubt, please seek clarification from the instructor. Students who violate the academic integrity principles of Stevens will be immediately reported to the department and the school, which could leave a permanent mark on the transcript.

If you need special accommodations because of a disability, please contact the instructor by email.

Thanks to Prof. Gang Wang for the website template