
1 Text preprocessing

2 Outline

3 Outline

4 Outline

5 Zipf’s law

6 High-frequency words

7 Low-frequency words

8 Zipf’s law vs. real data

9 Outline

10 Heaps’ law

11 Outline

12 Text preprocessing pipeline

13 Example

14 Stop-word removal

15 Outline

16 Stemming

17 Algorithmic stemming (Porter stemmer)

18 Algorithmic stemming (Porter stemmer)

19 Dictionary-based stemming

20 Hybrid stemming (Krovetz stemmer)

21 Stemming example

22 Outline

23 Example

24 Dealing with phrases

25 Summary

26 Additional references
