Data Mining
A preserved cluster of undergraduate notes grouped by subject area.
6 notes
1. Data Warehouse
Big Data: GB: $2^{30}$ B; TB, PB, EB, ZB. Data newly generated globally: 2006: 180 EB; 2011: 1.8 ZB; 2020: 35 ZB. Data Mining examples: supermarket transactions, valuable customers, network...
Preprocessing
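Preprocessing workflow. General: data cleaning, data reduction, data integration, data transformation. Data cleaning: incomplete (mainly from data collection): ignore the attribute, fill in manually, use a global constant, attribute mean, attribute mean within the same class, most probable value. Noise: mainly fr...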
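Two of the fill strategies named in this excerpt, the attribute mean and the attribute mean within the same class, can be sketched as follows. This is a minimal illustration, not the notes' own code: the function names, the `income` and `cls` fields, and the sample rows are all hypothetical, and missing values are assumed to be stored as `None`.

```python
from collections import defaultdict
from statistics import mean


def fill_with_attribute_mean(rows, attr):
    """Replace missing (None) values of `attr` with the mean of all observed values."""
    observed = [r[attr] for r in rows if r[attr] is not None]
    fill = mean(observed)
    return [{**r, attr: fill if r[attr] is None else r[attr]} for r in rows]


def fill_with_class_mean(rows, attr, label):
    """Replace missing values of `attr` with the mean observed within the same class."""
    by_class = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            by_class[r[label]].append(r[attr])
    class_mean = {c: mean(values) for c, values in by_class.items()}
    # Assumption: every class has at least one observed value; otherwise this raises KeyError.
    return [
        {**r, attr: class_mean[r[label]] if r[attr] is None else r[attr]}
        for r in rows
    ]


# Hypothetical sample data: two classes, one missing value in each.
rows = [
    {"income": 30.0, "cls": "a"},
    {"income": 50.0, "cls": "a"},
    {"income": None, "cls": "a"},
    {"income": 100.0, "cls": "b"},
    {"income": None, "cls": "b"},
]

by_attribute = fill_with_attribute_mean(rows, "income")
by_class = fill_with_class_mean(rows, "income", "cls")
```

Both helpers return new rows and leave the input untouched. The class-conditional version keeps the filled value closer to similar records, at the cost of needing a class label for every row.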
Association
Quantitative Discriminant Rule, general form: $\forall X,\ \text{target\_class}(X)\Leftrightarrow \text{condition}_1(X)[t:w_1,d:\omega_1]\vee\cdots\vee \text{condition}_n(X)[t:w_n,d:\omega_n]$ Disc...
4. NLP
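Text NLP preprocessing: (web pages) identify the main block; remove punctuation and special symbols; (web pages) strip tags; (Chinese) word segmentation; (English) lowercase; remove stop words; (English) stemming + lemmatization. Representation: bag of words: Binary, Frequency, TF-IDF = $\text{tf}(t,d)\log\frac{|...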
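A few of the English-text steps named in this excerpt (lowercasing, removing punctuation, removing stop words) and the binary and frequency bag-of-words representations can be sketched as below. This is a rough illustration under stated assumptions: the stop-word set, the function names, and the two sample documents are hypothetical, and tag stripping, Chinese word segmentation, stemming, lemmatization, and TF-IDF weighting are deliberately left out.

```python
import re
from collections import Counter

# Illustrative stop-word set only; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "is"}


def preprocess(text):
    """Lowercase, replace punctuation and special symbols with spaces, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def bag_of_words(docs):
    """Return the sorted vocabulary plus binary and frequency vectors for each document."""
    tokenized = [preprocess(d) for d in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    freq = []
    for doc in tokenized:
        counts = Counter(doc)
        freq.append([counts[tok] for tok in vocab])
    # Binary representation: 1 if the term occurs at all, else 0.
    binary = [[1 if c > 0 else 0 for c in row] for row in freq]
    return vocab, binary, freq


# Hypothetical sample documents.
docs = ["The cat sat; the cat ran.", "A dog sat."]
vocab, binary, freq = bag_of_words(docs)
```

The frequency vectors keep how often each term occurs, while the binary vectors record presence only; both share the same vocabulary ordering, so rows are directly comparable across documents.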
4. Prediction and Clustering
Prediction: Evaluate prediction algorithms: generalization, speed, robustness, scalability, comprehensibility. Semi-supervised learning: generative methods, low-density separation, graph ba...
5. Outlier Analysis
Outlier analysis. Outliers: Global: deviates from the rest of the data set; Contextual: deviates significantly with respect to a specific context of the object; Collective: objects as...