tf-idf
Preface
A while ago I went back through my old notes on tf-idf, and I am publishing them here on the blog. Knowledge needs to be revisited regularly, or it starts to feel unfamiliar.
Understanding tf-idf
tf-idf (term frequency–inverse document frequency) is a weighting technique commonly used in information retrieval and text mining. Its main idea is this: if a word or phrase occurs frequently in one article (high tf) but rarely in other articles, it is considered to have good discriminating power between categories and is well suited for classification. tf-idf is simply tf * idf, where tf is the term frequency and idf is the inverse document frequency. tf measures how often a term occurs in a document d. The idea behind idf is that the fewer documents contain a term t (that is, the smaller n is), the larger idf becomes, and the better t separates categories.

idf has a known weakness, though. Suppose m documents of some category c contain the term t, and k documents of all other categories contain t, so the total number of documents containing t is n = m + k. When m is large, n is also large, and the idf formula then yields a small idf value, suggesting that t discriminates poorly. In reality, a term that occurs frequently across the documents of one category is exactly the kind of term that characterizes that category's texts; such terms should be given higher weight and chosen as feature words to distinguish the category from other categories. This is the shortcoming of idf.
TF formula:

$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

In the formula above, $n_{i,j}$ is the number of times the term $t_i$ occurs in document $d_j$, and the denominator $\sum_{k} n_{k,j}$ is the total number of occurrences of all terms in document $d_j$.
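As a hypothetical example (not from the original post): if a document contains 100 words in total and the word "cow" appears 3 times in it, then the term frequency of "cow" in that document is tf = 3 / 100 = 0.03.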
IDF formula:

$$\mathrm{idf}_{i} = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$

$|D|$ is the total number of documents in the corpus, and $|\{j : t_i \in d_j\}|$ is the number of documents containing the term $t_i$ (that is, the number of documents where $n_{i,j} \neq 0$). If the term does not occur in the corpus at all, the denominator becomes zero, so in practice $1 + |\{j : t_i \in d_j\}|$ is generally used instead.

Then:

$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_{i}$$
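Continuing the hypothetical example: if the corpus holds 10,000,000 documents and "cow" appears in 1,000 of them, then idf = log(10,000,000 / 1,000) = 4 with a base-10 logarithm, and the tf-idf weight of "cow" in that document is 0.03 × 4 = 0.12. The Java code below uses the natural logarithm (Math.log) instead; the base only rescales every score by the same constant factor, so it does not change the relative ranking of terms.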
tf-idf implementation (Java)
The external plugin ikanalyzer-2012.jar is used here to perform the word segmentation.
The full code is as follows:
```java
package tfidf;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.wltea.analyzer.lucene.IKAnalyzer;

public class ReadFiles {

    private static ArrayList<String> fileList = new ArrayList<String>(); // the list of files

    // get the list of files in the directory, including its sub-directories
    public static List<String> readDirs(String filepath) throws FileNotFoundException, IOException {
        try {
            File file = new File(filepath);
            if (!file.isDirectory()) {
                System.out.println("The input path is not a directory:");
                System.out.println("filepath: " + file.getAbsolutePath());
            } else {
                String[] flist = file.list();
                for (int i = 0; i < flist.length; i++) {
                    File newFile = new File(filepath + "\\" + flist[i]);
                    if (!newFile.isDirectory()) {
                        fileList.add(newFile.getAbsolutePath());
                    } else { // if the entry is a directory, recurse into it
                        readDirs(filepath + "\\" + flist[i]);
                    }
                }
            }
        } catch (FileNotFoundException e) {
            System.out.println(e.getMessage());
        }
        return fileList;
    }

    // read a whole file into a string (GBK-encoded input)
    public static String readFile(String file) throws FileNotFoundException, IOException {
        StringBuffer strSb = new StringBuffer(); // String is immutable, StringBuffer can be appended to
        InputStreamReader inStrR = new InputStreamReader(new FileInputStream(file), "gbk"); // byte stream to character stream
        BufferedReader br = new BufferedReader(inStrR);
        String line = br.readLine();
        while (line != null) {
            strSb.append(line).append("\r\n");
            line = br.readLine();
        }
        br.close();
        return strSb.toString();
    }

    // word segmentation; note: split(...) comes from the ikanalyzer-2012.jar build
    // used by the original post, not from the stock Lucene Analyzer API
    public static ArrayList<String> cutWords(String file) throws IOException {
        String text = ReadFiles.readFile(file);
        IKAnalyzer analyzer = new IKAnalyzer();
        ArrayList<String> words = analyzer.split(text);
        return words;
    }

    // term frequency in a file: raw count of each word
    public static HashMap<String, Integer> normalTF(ArrayList<String> cutwords) {
        HashMap<String, Integer> resTF = new HashMap<String, Integer>();
        for (String word : cutwords) {
            if (resTF.get(word) == null) {
                resTF.put(word, 1);
            } else {
                resTF.put(word, resTF.get(word) + 1);
            }
            System.out.println(word);
        }
        return resTF;
    }

    // term frequency in a file: relative frequency of each word
    public static HashMap<String, Float> tf(ArrayList<String> cutwords) {
        HashMap<String, Float> resTF = new HashMap<String, Float>();
        int wordLen = cutwords.size();
        HashMap<String, Integer> intTF = ReadFiles.normalTF(cutwords);
        Iterator iter = intTF.entrySet().iterator(); // iterate over the raw counts
        while (iter.hasNext()) {
            Map.Entry entry = (Map.Entry) iter.next();
            resTF.put(entry.getKey().toString(), Float.parseFloat(entry.getValue().toString()) / wordLen);
            System.out.println(entry.getKey().toString() + " = " + Float.parseFloat(entry.getValue().toString()) / wordLen);
        }
        return resTF;
    }

    // raw term counts for every file under a directory
    public static HashMap<String, HashMap<String, Integer>> normalTFAllFiles(String dirc) throws IOException {
        HashMap<String, HashMap<String, Integer>> allNormalTF = new HashMap<String, HashMap<String, Integer>>();
        List<String> filelist = ReadFiles.readDirs(dirc);
        for (String file : filelist) {
            ArrayList<String> cutwords = ReadFiles.cutWords(file); // segment one file
            allNormalTF.put(file, ReadFiles.normalTF(cutwords));
        }
        return allNormalTF;
    }

    // tf for every file under a directory
    public static HashMap<String, HashMap<String, Float>> tfAllFiles(String dirc) throws IOException {
        HashMap<String, HashMap<String, Float>> allTF = new HashMap<String, HashMap<String, Float>>();
        List<String> filelist = ReadFiles.readDirs(dirc);
        for (String file : filelist) {
            ArrayList<String> cutwords = ReadFiles.cutWords(file); // segment one file
            allTF.put(file, ReadFiles.tf(cutwords));
        }
        return allTF;
    }

    // idf = ln(docNum / documentFrequency); natural logarithm, no smoothing
    public static HashMap<String, Float> idf(HashMap<String, HashMap<String, Float>> all_tf) {
        HashMap<String, Float> resIdf = new HashMap<String, Float>();
        HashMap<String, Integer> dict = new HashMap<String, Integer>(); // document frequency of each word
        int docNum = fileList.size();
        for (int i = 0; i < docNum; i++) {
            HashMap<String, Float> temp = all_tf.get(fileList.get(i));
            Iterator iter = temp.entrySet().iterator();
            while (iter.hasNext()) {
                Map.Entry entry = (Map.Entry) iter.next();
                String word = entry.getKey().toString();
                if (dict.get(word) == null) {
                    dict.put(word, 1);
                } else {
                    dict.put(word, dict.get(word) + 1);
                }
            }
        }
        System.out.println("idf for every word is:");
        Iterator iterDict = dict.entrySet().iterator();
        while (iterDict.hasNext()) {
            Map.Entry entry = (Map.Entry) iterDict.next();
            float value = (float) Math.log(docNum / Float.parseFloat(entry.getValue().toString()));
            resIdf.put(entry.getKey().toString(), value);
            System.out.println(entry.getKey().toString() + " = " + value);
        }
        return resIdf;
    }

    // tf-idf = tf * idf for every word of every file
    public static void tf_idf(HashMap<String, HashMap<String, Float>> all_tf, HashMap<String, Float> idfs) {
        HashMap<String, HashMap<String, Float>> resTfIdf = new HashMap<String, HashMap<String, Float>>();
        int docNum = fileList.size();
        for (int i = 0; i < docNum; i++) {
            String filepath = fileList.get(i);
            HashMap<String, Float> tfidf = new HashMap<String, Float>();
            HashMap<String, Float> temp = all_tf.get(filepath);
            Iterator iter = temp.entrySet().iterator();
            while (iter.hasNext()) {
                Map.Entry entry = (Map.Entry) iter.next();
                String word = entry.getKey().toString();
                Float value = Float.parseFloat(entry.getValue().toString()) * idfs.get(word);
                tfidf.put(word, value);
            }
            resTfIdf.put(filepath, tfidf);
        }
        System.out.println("tf-idf for every file is:");
        disTfIdf(resTfIdf);
    }

    // print the tf-idf map of every file
    public static void disTfIdf(HashMap<String, HashMap<String, Float>> tfidf) {
        Iterator iter1 = tfidf.entrySet().iterator();
        while (iter1.hasNext()) {
            Map.Entry entrys = (Map.Entry) iter1.next();
            System.out.println("FileName: " + entrys.getKey().toString());
            System.out.print("{");
            HashMap<String, Float> temp = (HashMap<String, Float>) entrys.getValue();
            Iterator iter2 = temp.entrySet().iterator();
            while (iter2.hasNext()) {
                Map.Entry entry = (Map.Entry) iter2.next();
                System.out.print(entry.getKey().toString() + " = " + entry.getValue().toString() + ", ");
            }
            System.out.println("}");
        }
    }

    public static void main(String[] args) throws IOException {
        String file = "D:/testfiles";
        HashMap<String, HashMap<String, Float>> all_tf = tfAllFiles(file);
        System.out.println();
        HashMap<String, Float> idfs = idf(all_tf);
        System.out.println();
        tf_idf(all_tf, idfs);
    }
}
```
The results are shown in the figure below:

[Figure: screenshot of the console output, listing the tf of each word per file, the idf of every word, and the final tf-idf map for each file]
Common problems
The Lucene jar was not added to the project.
The versions of the Lucene jar and the je (word-segmentation) jar do not match.
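Since the cutWords step above depends on a split(...) helper from that specific ikanalyzer-2012.jar, which may be hard to obtain today, here is a minimal, dependency-free sketch of the same tf-idf pipeline for comparison. It is an illustration only, not the original author's code: the class name TfIdfDemo and the toy corpus are invented for the example, whitespace tokens stand in for IK Analyzer's Chinese segmentation, and it uses the smoothed variant idf = ln(N / (1 + df)) discussed earlier, which can go slightly negative for words that occur in every document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class TfIdfDemo {

    // term frequency: occurrences of each word divided by document length
    static Map<String, Double> tf(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        Map<String, Double> res = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            res.put(e.getKey(), e.getValue() / (double) words.size());
        }
        return res;
    }

    // smoothed inverse document frequency: ln(N / (1 + df))
    static Map<String, Double> idf(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>(); // document frequency of each word
        for (List<String> doc : docs) {
            for (String w : new HashSet<>(doc)) { // count each word once per document
                df.merge(w, 1, Integer::sum);
            }
        }
        Map<String, Double> res = new HashMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            res.put(e.getKey(), Math.log(docs.size() / (1.0 + e.getValue())));
        }
        return res;
    }

    public static void main(String[] args) {
        // toy corpus; whitespace tokens stand in for IK Analyzer's segmentation
        List<List<String>> docs = new ArrayList<>();
        docs.add(List.of("the", "cow", "jumped", "over", "the", "moon"));
        docs.add(List.of("the", "dog", "barked", "at", "the", "moon"));
        docs.add(List.of("the", "cat", "slept"));

        Map<String, Double> idfs = idf(docs);
        for (List<String> doc : docs) {
            Map<String, Double> tfidf = new HashMap<>();
            for (Map.Entry<String, Double> e : tf(doc).entrySet()) {
                tfidf.put(e.getKey(), e.getValue() * idfs.get(e.getKey()));
            }
            System.out.println(doc + " -> " + tfidf);
        }
    }
}
```

On this toy corpus, "the" occurs in all three documents and receives the lowest (here slightly negative) weight, while document-specific words such as "cow" or "barked" score highest, which is exactly the discriminating behavior described in the understanding section.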
Summary
That is all of this article on understanding tf-idf and its Java implementation. I hope it is helpful. If anything is missing or wrong, please leave a comment to point it out.
Original link: https://www.cnblogs.com/ywl925/archive/2013/08/26/3275878.html