Course cluster

Data Mining

A preserved cluster of undergraduate notes grouped by subject area.

6 notes

01

1. Data Warehouse

2020-02-02

Big Data. GB: $2^{30}$ B; larger units: TB, PB, EB, ZB. Data newly generated globally: 2006: 180 EB; 2011: 1.8 ZB; 2020: 35 ZB. Data mining examples: supermarket transactions, valuable customers, network...

02

Preprocessing

2020-02-02

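Preprocessing workflow. General: data cleaning, data reduction, data integration, data transformation. Data cleaning. Incomplete (mainly from data collection): ignore the attribute, fill in manually, use a global constant, use the attribute mean, use the attribute mean of the same class, use the most probable value. Noise: mainly fr...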
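Three of the missing-value strategies in this excerpt (global constant, attribute mean, attribute mean within the same class) can be sketched in a few lines. This is a minimal illustration, not the note's own code: the record layout, the function names, and the `income`/`risk` attributes are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fill_with_constant(rows, attr, constant):
    # Replace every missing (None) value of attr with one global constant.
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]


def fill_with_mean(rows, attr):
    # Replace missing values with the mean of all known values of attr.
    m = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: m if r[attr] is None else r[attr]} for r in rows]


def fill_with_class_mean(rows, attr, label):
    # Replace missing values with the mean of attr among rows of the same class.
    # Assumes every class has at least one known value of attr.
    groups = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            groups[r[label]].append(r[attr])
    means = {c: mean(values) for c, values in groups.items()}
    return [{**r, attr: means[r[label]] if r[attr] is None else r[attr]} for r in rows]


# Hypothetical toy data: two records have no income value.
rows = [
    {"income": 40, "risk": "low"},
    {"income": 60, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 100, "risk": "high"},
    {"income": None, "risk": "high"},
]

by_class = fill_with_class_mean(rows, "income", "risk")
```

The same-class mean uses the class label, so it is only applicable when a label is available for the incomplete records.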

03

Association

2020-06-14

Quantitative Discriminant Rule. General form: $\forall X,\ \text{target\_class}(X) \Leftrightarrow \text{condition}_1(X)\,[t: w_1, d: \omega_1] \vee \cdots \vee \text{condition}_n(X)\,[t: w_n, d: \omega_n]$ Disc...

04

4. NLP

2020-02-09

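Text. NLP preprocessing: (web pages) identify the main block; remove punctuation and special symbols; (web pages) strip tags; (Chinese) word segmentation; (English) lowercasing; remove stop words; (English) stemming + lemmatization. Representation: bag of words: binary, frequency, TF-IDF = $\text{tf}(t,d)\log\frac{|...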
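The binary and frequency bag-of-words representations named in this excerpt can be illustrated with a small sketch. It covers only lowercasing, punctuation removal, and stop-word removal for English text. The stop-word list, function names, and sample sentences are hypothetical. TF-IDF weighting is left out because the formula is cut off in the excerpt.

```python
import string
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "is", "on", "and"}


def tokenize(text):
    # Lowercase, strip ASCII punctuation, split on whitespace, drop stop words.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [t for t in text.split() if t not in STOP_WORDS]


def bag_of_words(docs, binary=False):
    # Build a sorted vocabulary, then one vector per document.
    # binary=True records presence (0/1); otherwise raw term counts.
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([min(counts[t], 1) if binary else counts[t] for t in vocab])
    return vocab, vectors


docs = ["The cat sat on the mat.", "The cat and the cat!"]
vocab, freq_vectors = bag_of_words(docs)
_, binary_vectors = bag_of_words(docs, binary=True)
```

Here `vocab` is `["cat", "mat", "sat"]`; the second document maps to `[2, 0, 0]` under frequency counts and `[1, 0, 0]` under the binary scheme.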

05

4. Prediction and Clustering

2020-02-06

Prediction. Evaluate prediction algorithms: generalization, speed, robustness, scalability, comprehensibility. Semi-supervised learning: generative methods, low-density separations, graph ba...

06

5. Outlier Analysis

2020-02-06

Outlier analysis. Outliers. Global: deviates from the rest of the data set. Contextual: deviates significantly with respect to a specific context of the object. Collective: objects as...