
Understanding TF-IDF and Its Java Implementation

2021-02-05 12:21 · ywl925 · JAVA Tutorials

This article explains TF-IDF and walks through a Java implementation: it briefly introduces the tf-idf algorithm and its formulas, then shares working Java code. Readers who want a compact reference implementation should find it useful.


Preface

A while ago I revisited my old notes on tf-idf, so I am publishing them here on the blog. Knowledge needs to be revisited regularly, or it grows rusty.

Understanding tf-idf

tf-idf (term frequency–inverse document frequency) is a weighting technique commonly used in information retrieval and text mining. Its main idea: if a word or phrase appears frequently in one document (high tf) but rarely in other documents, it has good discriminating power between categories and is well suited for classification. tf-idf is simply tf × idf, where tf is the term frequency and idf is the inverse document frequency. tf measures how often a term occurs in a document d. The idea behind idf is: the fewer the documents that contain term t (i.e., the smaller n is), the larger idf becomes, and the better t separates categories.

idf has a known weakness, though. Suppose m documents of some class c contain term t, and k documents of other classes contain it, so the total number of documents containing t is n = m + k. When m is large, n is also large, and the idf formula yields a small value, suggesting that t discriminates poorly. In reality, a term that appears frequently in the documents of one class is a good representative of that class's texts; such terms deserve a high weight and should be selected as feature words to distinguish that class from others. This is the shortcoming of idf.
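To see this weakness with concrete (made-up) numbers: suppose a corpus of 1,000 documents, 100 of which belong to the class "sports", and the term "football" occurs in 95 of those sports documents plus 5 documents of other classes. Then m = 95, k = 5, n = 100, and

idf(football) = \log(1000 / 100) = \log 10 \approx 2.30

a fairly small weight, even though "football" is an excellent feature word for the sports class: idf penalizes the term precisely because it is common inside its own class.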

tf formula:

tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}

In this expression, n_{i,j} is the number of occurrences of term t_i in document d_j, and the denominator is the total number of occurrences of all terms in document d_j.

idf formula:

idf_i = \log \frac{|D|}{|\{ d : t_i \in d \}|}

|D|: the total number of documents in the corpus.

|\{ d : t_i \in d \}|: the number of documents containing the term t_i (that is, the number of documents in which n_{i,j} ≠ 0). If the term never occurs in the corpus, this denominator would be zero, so in practice one uses

idf_i = \log \frac{|D|}{1 + |\{ d : t_i \in d \}|}

Then:

tfidf_{i,j} = tf_{i,j} \times idf_i
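A quick worked example with made-up numbers: if a document contains 100 terms in total and the word "nuclear" occurs 3 times, then tf = 3/100 = 0.03. If the corpus holds 1,000 documents and 10 of them contain "nuclear", then idf = \log(1000/10) = \ln 100 \approx 4.61 (natural logarithm, which is also what Math.log in the Java code below uses), giving tf-idf ≈ 0.03 × 4.61 ≈ 0.138.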

tf-idf implementation (Java)

The external library ikanalyzer-2012.jar is used here for Chinese word segmentation.

The full code is as follows:

package tfidf;

import java.io.*;
import java.util.*;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class ReadFiles {

    // the list of files; filled by readDirs and reused by idf and tf_idf below
    private static ArrayList<String> fileList = new ArrayList<String>();

    // collect the files under a directory, descending into sub-directories
    public static List<String> readDirs(String filepath) throws FileNotFoundException, IOException {
        try {
            File file = new File(filepath);
            if (!file.isDirectory()) {
                System.out.println("The given path is not a directory");
                System.out.println("filepath: " + file.getAbsolutePath());
            } else {
                String[] fList = file.list();
                for (int i = 0; i < fList.length; i++) {
                    File newFile = new File(filepath + "\\" + fList[i]);
                    if (!newFile.isDirectory()) {
                        fileList.add(newFile.getAbsolutePath());
                    } else { // if it is a directory, recurse into it
                        readDirs(filepath + "\\" + fList[i]);
                    }
                }
            }
        } catch (FileNotFoundException e) {
            System.out.println(e.getMessage());
        }
        return fileList;
    }

    // read a whole file into one string (the sample corpus is GBK-encoded)
    public static String readFile(String file) throws FileNotFoundException, IOException {
        StringBuffer strSb = new StringBuffer(); // String is immutable; StringBuffer can be appended to
        InputStreamReader inStrR = new InputStreamReader(new FileInputStream(file), "gbk"); // byte stream to character stream
        BufferedReader br = new BufferedReader(inStrR);
        String line = br.readLine();
        while (line != null) {
            strSb.append(line).append("\r\n");
            line = br.readLine();
        }
        br.close();
        return strSb.toString();
    }

    // word segmentation. The original listing called analyzer.split(text), which is not
    // part of the IKAnalyzer API; the standard Lucene TokenStream idiom is used instead.
    public static ArrayList<String> cutWords(String file) throws IOException {
        ArrayList<String> words = new ArrayList<String>();
        String text = ReadFiles.readFile(file);
        Analyzer analyzer = new IKAnalyzer();
        TokenStream ts = analyzer.tokenStream("", new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            words.add(term.toString());
        }
        ts.close();
        return words;
    }

    // term frequency within one file: raw count of each word
    public static HashMap<String, Integer> normalTF(ArrayList<String> cutwords) {
        HashMap<String, Integer> resTF = new HashMap<String, Integer>();
        for (String word : cutwords) {
            if (resTF.get(word) == null) {
                resTF.put(word, 1);
            } else {
                resTF.put(word, resTF.get(word) + 1);
            }
        }
        return resTF;
    }

    // term frequency within one file: relative frequency of each word
    public static HashMap<String, Float> tf(ArrayList<String> cutwords) {
        HashMap<String, Float> resTF = new HashMap<String, Float>();
        int wordLen = cutwords.size();
        HashMap<String, Integer> intTF = ReadFiles.normalTF(cutwords);
        for (Map.Entry<String, Integer> entry : intTF.entrySet()) {
            float freq = entry.getValue() / (float) wordLen;
            resTF.put(entry.getKey(), freq);
            System.out.println(entry.getKey() + " = " + freq);
        }
        return resTF;
    }

    // raw term counts for every file under a directory
    public static HashMap<String, HashMap<String, Integer>> normalTFAllFiles(String dirc) throws IOException {
        HashMap<String, HashMap<String, Integer>> allNormalTF = new HashMap<String, HashMap<String, Integer>>();
        List<String> filelist = ReadFiles.readDirs(dirc);
        for (String file : filelist) {
            ArrayList<String> cutwords = ReadFiles.cutWords(file); // segment one file
            allNormalTF.put(file, ReadFiles.normalTF(cutwords));
        }
        return allNormalTF;
    }

    // term frequencies for every file under a directory
    public static HashMap<String, HashMap<String, Float>> tfAllFiles(String dirc) throws IOException {
        HashMap<String, HashMap<String, Float>> allTF = new HashMap<String, HashMap<String, Float>>();
        List<String> filelist = ReadFiles.readDirs(dirc);
        for (String file : filelist) {
            ArrayList<String> cutwords = ReadFiles.cutWords(file); // segment one file
            allTF.put(file, ReadFiles.tf(cutwords));
        }
        return allTF;
    }

    // idf of every word: log(number of documents / number of documents containing the word)
    public static HashMap<String, Float> idf(HashMap<String, HashMap<String, Float>> allTF) {
        HashMap<String, Float> resIdf = new HashMap<String, Float>();
        HashMap<String, Integer> dict = new HashMap<String, Integer>(); // document frequency of each word
        int docNum = fileList.size();
        for (int i = 0; i < docNum; i++) {
            HashMap<String, Float> temp = allTF.get(fileList.get(i));
            for (String word : temp.keySet()) {
                if (dict.get(word) == null) {
                    dict.put(word, 1);
                } else {
                    dict.put(word, dict.get(word) + 1);
                }
            }
        }
        System.out.println("idf for every word is:");
        for (Map.Entry<String, Integer> entry : dict.entrySet()) {
            float value = (float) Math.log(docNum / (float) entry.getValue());
            resIdf.put(entry.getKey(), value);
            System.out.println(entry.getKey() + " = " + value);
        }
        return resIdf;
    }

    // tf-idf = tf * idf for every word of every file
    public static void tf_idf(HashMap<String, HashMap<String, Float>> allTF, HashMap<String, Float> idfs) {
        HashMap<String, HashMap<String, Float>> resTfIdf = new HashMap<String, HashMap<String, Float>>();
        for (String filepath : fileList) {
            HashMap<String, Float> tfidf = new HashMap<String, Float>();
            HashMap<String, Float> temp = allTF.get(filepath);
            for (Map.Entry<String, Float> entry : temp.entrySet()) {
                tfidf.put(entry.getKey(), entry.getValue() * idfs.get(entry.getKey()));
            }
            resTfIdf.put(filepath, tfidf);
        }
        System.out.println("tf-idf for every file is:");
        disTfIdf(resTfIdf);
    }

    // print the tf-idf map of every file
    public static void disTfIdf(HashMap<String, HashMap<String, Float>> tfidf) {
        for (Map.Entry<String, HashMap<String, Float>> entrys : tfidf.entrySet()) {
            System.out.println("FileName: " + entrys.getKey());
            System.out.print("{");
            for (Map.Entry<String, Float> entry : entrys.getValue().entrySet()) {
                System.out.print(entry.getKey() + " = " + entry.getValue() + ", ");
            }
            System.out.println("}");
        }
    }

    public static void main(String[] args) throws IOException {
        String file = "d:/testfiles";
        HashMap<String, HashMap<String, Float>> allTF = tfAllFiles(file);
        System.out.println();
        HashMap<String, Float> idfs = idf(allTF);
        System.out.println();
        tf_idf(allTF, idfs);
    }
}

The results are shown below (the original post included a console screenshot here).
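The exact values depend on the test corpus, but from the print statements in tf, idf, and disTfIdf the output takes roughly this shape (the file name and numbers below are invented for illustration):

wordA = 0.022727273
wordB = 0.011363637
...
idf for every word is:
wordA = 0.6931472
...
tf-idf for every file is:
FileName: d:\testfiles\doc1.txt
{wordA = 0.015753344, wordB = 0.007876672, ... }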

Common problems

The Lucene jar was not added to the project. IKAnalyzer extends Lucene's Analyzer, so lucene-core must be on the classpath; without it the program fails at class-loading time. (The original post illustrated this error with a screenshot.)

The versions of the lucene jar and the je jar do not match. In general, the analyzer jar must be built against the same major Lucene API version as the lucene-core jar on the classpath; a mismatch typically surfaces as NoSuchMethodError or AbstractMethodError at runtime. (Also shown as a screenshot in the original post.)
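As a rough sketch of compiling and running on Windows, the jar file names below are assumptions based on common release names of these libraries, not something specified in the post:

javac -cp .;lucene-core-3.6.2.jar;IKAnalyzer2012_u6.jar tfidf\ReadFiles.java
java -cp .;lucene-core-3.6.2.jar;IKAnalyzer2012_u6.jar tfidf.ReadFiles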

Summary

That is all of this article on understanding tf-idf and its Java implementation; I hope it is helpful. If you spot any shortcomings, please leave a comment.

Original article: https://www.cnblogs.com/ywl925/archive/2013/08/26/3275878.html
