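The core idea of the kNN algorithm is this: if most of the k nearest neighbors of a sample in feature space belong to one class, the sample is assigned to that class as well and is assumed to share the characteristics of the samples in that class. The classification decision depends only on the classes of the one or few nearest samples, so only a very small number of neighboring samples take part in it. Because kNN relies on a limited set of nearby samples rather than on discriminating class regions, it is often better suited than region-based methods to sample sets whose class regions intersect or overlap heavily.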
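The kNN procedure is as follows:
1. Compute the distance between the current test sample and every record in the training data.
2. Select the k training records with the smallest distances as the nearest neighbors of the test sample.
3. Find the class that occurs most often among these k training records and assign it to the current test sample.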
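Before moving to MapReduce, the three steps above can be sketched in plain, single-machine Java. This is only an illustration, not the article's implementation: the class name `KnnSketch`, the `classify` method, and the sample data are all made up for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-machine sketch of the three kNN steps (no Hadoop involved).
class KnnSketch {

    // Euclidean distance between two feature vectors of equal length.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    static String classify(double[][] trainX, String[] trainY, double[] test, int k) {
        // Step 1: distance from the test sample to every training record.
        double[] dist = new double[trainX.length];
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < trainX.length; i++) {
            dist[i] = distance(trainX[i], test);
            order.add(i);
        }
        // Step 2: order the training indices by distance and keep the first k.
        order.sort(Comparator.comparingDouble(i -> dist[i]));
        int limit = Math.min(k, order.size());
        // Step 3: majority vote among the labels of those k neighbors.
        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        for (int n = 0; n < limit; n++) {
            String label = trainY[order.get(n)];
            int count = votes.merge(label, 1, Integer::sum);
            if (best == null || count > votes.get(best)) {
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] trainX = {{1, 1}, {1, 2}, {5, 5}, {6, 5}, {2, 1}};
        String[] trainY = {"A", "A", "B", "B", "A"};
        System.out.println(classify(trainX, trainY, new double[] {1.5, 1.5}, 3));
    }
}
```

Sorting all distances costs O(n log n) per test sample. That is acceptable for a sketch, but it is one of the places where a real implementation would keep only a bounded set of candidates.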
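That is the general kNN procedure. A MapReduce (MR) implementation that follows it directly is not very efficient and can be optimized further. This article only covers the MR implementation that follows the procedure as-is.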
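Mapper design:
Because the test data is much smaller than the training data, it is read through the Java API (the HDFS FileSystem) and kept in memory, which means it has to be initialized in setup. In map, the distance between each in-memory test sample and the current training record is computed. The mapper's type parameters are <Object, Text, IntWritable, MyWritable>. The map output key type is IntWritable and holds the index of the test sample. The output value type is MyWritable, a custom value type that holds the distance together with the class of the training record the test sample was compared with.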
```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class KnnMapper extends Mapper<Object, Text, IntWritable, MyWritable> {
    Logger log = LoggerFactory.getLogger(KnnMapper.class);
    private List<float[]> testData;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        conf.set("fs.defaultFS", "master:8020");
        // "testfilepath" must match the key under which the driver stores the test file path
        String testPath = conf.get("testfilepath");
        Path testDataPath = new Path(testPath);
        FileSystem fs = FileSystem.get(conf);
        this.testData = readTestData(fs, testDataPath);
    }

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Training record layout: feature1,feature2,...,label
        String[] line = value.toString().split(",");
        float[] trainData = new float[line.length - 1];
        for (int i = 0; i < trainData.length; i++) {
            trainData[i] = Float.valueOf(line[i]);
            log.info("training value: " + line[i] + " class: " + line[line.length - 1]);
        }
        // Emit one (test index, (distance, label)) pair per in-memory test sample
        for (int i = 0; i < this.testData.size(); i++) {
            float[] testI = this.testData.get(i);
            float distance = outh(testI, trainData);
            log.info("distance: " + distance);
            context.write(new IntWritable(i), new MyWritable(distance, line[line.length - 1]));
        }
    }

    // Reads the test file from HDFS; every line holds comma-separated feature values
    private List<float[]> readTestData(FileSystem fs, Path path) throws IOException {
        List<float[]> list = new ArrayList<>();
        try (FSDataInputStream data = fs.open(path);
             BufferedReader bf = new BufferedReader(new InputStreamReader(data))) {
            String line;
            while ((line = bf.readLine()) != null) {
                String[] items = line.split(",");
                float[] item = new float[items.length];
                for (int i = 0; i < items.length; i++) {
                    item[i] = Float.valueOf(items[i]);
                }
                list.add(item);
            }
        }
        return list;
    }

    // Euclidean distance; testData must not have more columns than inData
    private static float outh(float[] testData, float[] inData) {
        float distance = 0.0f;
        for (int i = 0; i < testData.length; i++) {
            distance += (testData[i] - inData[i]) * (testData[i] - inData[i]);
        }
        distance = (float) Math.sqrt(distance);
        return distance;
    }
}
```
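To make the mapper's output concrete, here is a small stand-in for a single map call that uses no Hadoop classes. It is a sketch under assumptions: `MapStepSketch`, `mapOneLine`, and the tab-separated record format are invented for the illustration, and the training line `1.0,2.0,A` is made-up data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for one map() call, without any Hadoop classes.
class MapStepSketch {

    // Euclidean distance, computed in float like the mapper's helper.
    static float euclidean(float[] a, float[] b) {
        float sum = 0.0f;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return (float) Math.sqrt(sum);
    }

    // Returns one "testIndex<TAB>distance,label" record per in-memory test vector.
    static List<String> mapOneLine(String trainLine, List<float[]> testData) {
        String[] fields = trainLine.split(",");
        String label = fields[fields.length - 1];
        float[] train = new float[fields.length - 1];
        for (int i = 0; i < train.length; i++) {
            train[i] = Float.parseFloat(fields[i]);
        }
        List<String> records = new ArrayList<>();
        for (int i = 0; i < testData.size(); i++) {
            records.add(i + "\t" + euclidean(testData.get(i), train) + "," + label);
        }
        return records;
    }

    public static void main(String[] args) {
        List<float[]> testData = new ArrayList<>();
        testData.add(new float[] {1f, 2f});
        testData.add(new float[] {4f, 6f});
        for (String record : mapOneLine("1.0,2.0,A", testData)) {
            System.out.println(record);
        }
    }
}
```

Every training line produces one record per test sample, so the shuffle carries roughly (number of training records) × (number of test samples) pairs. This is a large part of why the direct implementation is not very efficient.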
The custom value type MyWritable is as follows:
```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class MyWritable implements Writable {
    private float distance;
    private String label;

    // Hadoop needs a no-argument constructor to create instances during deserialization
    public MyWritable() {
    }

    public MyWritable(float distance, String label) {
        this.distance = distance;
        this.label = label;
    }

    @Override
    public String toString() {
        return this.distance + "," + this.label;
    }

    // Fields are written in a fixed order: distance first, then label
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeFloat(distance);
        out.writeUTF(label);
    }

    // Fields must be read back in the same order they were written
    @Override
    public void readFields(DataInput in) throws IOException {
        this.distance = in.readFloat();
        this.label = in.readUTF();
    }

    public float getDistance() {
        return distance;
    }

    public void setDistance(float distance) {
        this.distance = distance;
    }

    public String getLabel() {
        return label;
    }

    public void setLabel(String label) {
        this.label = label;
    }
}
```
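The write/readFields pair only works if both sides agree on field order. The round trip below shows this with the JDK's own `DataOutputStream` and `DataInputStream`, which implement the same `DataOutput` and `DataInput` interfaces that Hadoop passes in. The class `WritableRoundTripSketch` and its methods are hypothetical helpers for illustration, not part of the article's code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical round trip that mirrors the write/readFields field order.
class WritableRoundTripSketch {

    // Same order as write(): a 4-byte float, then a length-prefixed string.
    static byte[] serialize(float distance, String label) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeFloat(distance);
            out.writeUTF(label);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Same order as readFields(): float first, then the string.
    static String deserialize(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            float distance = in.readFloat();
            String label = in.readUTF();
            return distance + "," + label;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = serialize(2.5f, "A");
        System.out.println(data.length + " bytes -> " + deserialize(data));
    }
}
```

Swapping the two reads in `deserialize` would interpret the float's bytes as a string length, which is the kind of mistake that tends to surface as an `EOFException` or garbage labels at reduce time.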
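On the reducer side, the parameter k (the number of nearest neighbors to keep) is initialized in setup. In reduce, the k records with the smallest distances are selected, which is equivalent to sorting by distance in ascending order and taking the first k. The class that occurs most often among these k records is then determined and written to HDFS as a key/value pair, keyed by the index of the test sample.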
```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class KnnReducer extends Reducer<IntWritable, MyWritable, IntWritable, Text> {
    private int k;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        this.k = context.getConfiguration().getInt("k", 5);
    }

    /**
     * key    => 0
     * values => ([1,label1],[2,label2],[3,label2],[2.5,label2])
     */
    @Override
    protected void reduce(IntWritable key, Iterable<MyWritable> values, Context context)
            throws IOException, InterruptedException {
        // k slots, initialized with "infinitely far" placeholders
        MyWritable[] myWrit = new MyWritable[k];
        for (int i = 0; i < k; i++) {
            myWrit[i] = new MyWritable(Float.MAX_VALUE, "-1");
        }
        // Keep the k smallest distances: a new value replaces only the farthest kept slot
        for (MyWritable m : values) {
            int farthest = 0;
            for (int i = 1; i < k; i++) {
                if (myWrit[i].getDistance() > myWrit[farthest].getDistance()) {
                    farthest = i;
                }
            }
            if (m.getDistance() < myWrit[farthest].getDistance()) {
                myWrit[farthest].setDistance(m.getDistance());
                myWrit[farthest].setLabel(m.getLabel());
            }
        }
        // Collect the labels of the kept neighbors, skipping slots that were never filled
        List<String> testClass = new ArrayList<>();
        for (MyWritable m : myWrit) {
            if (m.getDistance() < Float.MAX_VALUE) {
                testClass.add(m.getLabel());
            }
        }
        String countMost = mostEle(testClass.toArray(new String[0]));
        context.write(key, new Text(countMost));
    }

    // Returns the string that occurs most often in the array
    public static String mostEle(String[] strArray) {
        HashMap<String, Integer> map = new HashMap<>();
        for (int i = 0; i < strArray.length; i++) {
            String str = strArray[i];
            if (map.containsKey(str)) {
                int tmp = map.get(str);
                map.put(str, tmp + 1);
            } else {
                map.put(str, 1);
            }
        }
        // Find the key with the largest count, i.e. the most frequent class
        Collection<Integer> count = map.values();
        int maxCount = Collections.max(count);
        String maxString = "";
        for (Map.Entry<String, Integer> entry : map.entrySet()) {
            if (maxCount == entry.getValue()) {
                maxString = entry.getKey();
            }
        }
        return maxString;
    }
}
```
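The selection step in the reducer can also be written with a bounded max-heap, a common alternative to scanning k slots for every value. The sketch below is plain Java with no Hadoop classes, and it is a different technique from the one used above, not the article's method. `TopKVoteSketch`, `Neighbor`, and `vote` are hypothetical names, and the input reuses the example values from the reducer's comment.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch: top-k selection with a bounded max-heap, then a majority vote.
class TopKVoteSketch {

    // Immutable (distance, label) pair standing in for the reducer's value type.
    static final class Neighbor {
        final float distance;
        final String label;

        Neighbor(float distance, String label) {
            this.distance = distance;
            this.label = label;
        }
    }

    static String vote(Iterable<Neighbor> candidates, int k) {
        // Max-heap on distance: the head is always the farthest neighbor kept so far.
        PriorityQueue<Neighbor> heap = new PriorityQueue<>(
                Comparator.comparingDouble((Neighbor n) -> n.distance).reversed());
        for (Neighbor n : candidates) {
            heap.offer(n);
            if (heap.size() > k) {
                heap.poll(); // drop the farthest so that at most k remain
            }
        }
        // Majority vote over the labels that survived.
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        for (Neighbor n : heap) {
            int c = counts.merge(n.label, 1, Integer::sum);
            if (best == null || c > counts.get(best)) {
                best = n.label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Iterable<Neighbor> values = Arrays.asList(
                new Neighbor(1f, "label1"), new Neighbor(2f, "label2"),
                new Neighbor(3f, "label2"), new Neighbor(2.5f, "label2"));
        System.out.println(vote(values, 3));
    }
}
```

Each insertion costs O(log k) instead of O(k). If this were ported into a real reducer, the distance and label would need to be copied out of each value as done here, since Hadoop reuses the same value object while iterating.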
The final output pairs the index of each test sample with its predicted class.
Original link: https://blog.csdn.net/Angelababy_huan/article/details/53045579