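Decision trees suffer from high variance, which makes their results fragile with respect to the specific training data used. Bagging (short for bootstrap aggregating) builds multiple models from samples of the training data and can reduce this variance, but the resulting trees are highly correlated, which is not ideal.

Random Forest is an extension of bagging. In addition to building trees from multiple samples of the training data, it constrains the features that can be used to build each tree, which forces the trees to differ from one another and can, in turn, improve performance.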

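In this tutorial, you will learn how to implement the Random Forest algorithm in Python. After completing it, you will know:

The difference between bagged decision trees and the Random Forest algorithm;
How to construct bagged decision trees with more variance;
How to apply the Random Forest algorithm to a predictive modeling problem.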

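Description

This section gives a brief introduction to the Random Forest algorithm and to the Sonar dataset used in this tutorial.

Random Forest Algorithm

Decision trees involve the greedy selection of the best split point from the dataset at each step.

This makes decision trees prone to high variance if they are not pruned. The high variance can be harnessed and reduced by building multiple trees from different samples of the training dataset (different views of the problem) and combining their predictions. This approach is called bootstrap aggregating, or bagging for short. A limitation of bagging is that the same greedy algorithm is used to build every tree, so the same or very similar split points are likely to be chosen in each tree, which makes the trees very similar (correlated). This, in turn, makes their predictions similar and undercuts the variety among trees that bagging relies on.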
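To make the bagging idea concrete, here is a minimal sketch that is not part of the tutorial's listing. It assumes a deliberately simple stand-in for a decision tree: a one-feature threshold "stump" (the names fit_stump and bagged_predict and the toy data are hypothetical). Each stump is fitted on a bootstrap sample, and the final prediction is a majority vote.

```python
from random import Random

def fit_stump(sample):
    """Return a threshold halfway between the mean x of class 0 and class 1."""
    zeros = [x for x, label in sample if label == 0]
    ones = [x for x, label in sample if label == 1]
    if not zeros or not ones:
        # A bootstrap sample can miss a class entirely; fall back to its mean x.
        return sum(x for x, _ in sample) / len(sample)
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2.0

def bagged_predict(thresholds, x):
    """Majority vote: each stump votes 1 when x is at or above its threshold."""
    votes = [1 if x >= t else 0 for t in thresholds]
    return max(set(votes), key=votes.count)

rng = Random(1)  # local generator so the sketch is repeatable
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0),
        (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]

thresholds = []
for _ in range(5):
    # Bootstrap: draw len(data) rows with replacement.
    sample = [data[rng.randrange(len(data))] for _ in range(len(data))]
    thresholds.append(fit_stump(sample))

print(thresholds)
print(bagged_predict(thresholds, 0.05), bagged_predict(thresholds, 0.95))
```

Each stump sees a slightly different sample and so lands on a slightly different threshold. Because every stump here uses the same feature and the same fitting rule, the thresholds stay close together, which is the correlation problem described above.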

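We can force the decision trees to be different by limiting the features that the greedy algorithm is allowed to evaluate at each split point while building the tree. This is the Random Forest algorithm.

As in bagging, multiple samples of the training dataset are taken and a different tree is trained on each. The difference is that, at each point where a split is made in the data and added to the tree, only a fixed subset of the attributes can be considered.

For classification problems, the type of problem we look at in this tutorial, the number of attributes considered for each split is limited to the square root of the number of input features: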

`num_features_for_split = sqrt(total_input_features)`

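This one small change produces trees that differ more from each other (they are less correlated), which makes their predictions more diverse. A combination of diverse predictions often performs better than a single decision tree or bagging alone.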
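As a small worked example of the square-root rule, the sketch below computes the number of candidate features for a few input sizes. The helper name features_per_split is hypothetical. Truncating with int() follows the complete listing later in this tutorial, and the lower bound of 1 is an added assumption so that tiny inputs still get a feature to split on.

```python
from math import sqrt

def features_per_split(total_input_features):
    """Candidate features per split: square root, truncated, and at least 1."""
    return max(1, int(sqrt(total_input_features)))

for total in (4, 10, 60, 100):
    print(total, features_per_split(total))
```

For the 60 inputs of the Sonar dataset this gives 7 features per split.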

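Sonar Dataset

The dataset used in this tutorial is the Sonar dataset. It describes sonar chirp returns bouncing off different surfaces. The 60 input variables are the strength of the returns at different angles. It is a binary classification problem that requires a model to distinguish rocks from metal cylinders. There are 208 observations.

The dataset is well understood. All of the variables are continuous and generally in the range 0 to 1, which makes them easy to work with. The output variable is the string 'M' for mine (a metal cylinder) or 'R' for rock, and these need to be converted to the integers 1 and 0 respectively.

By predicting the class with the most observations in the dataset (M, mines), the Zero Rule algorithm achieves an accuracy of about 53%.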
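The Zero Rule baseline can be sketched in a few lines. The class counts used here (111 mines and 97 rocks) are an assumption taken from the UCI description of the dataset rather than from this tutorial, and zero_rule_accuracy is a hypothetical helper name.

```python
from collections import Counter

def zero_rule_accuracy(labels):
    """Accuracy (in percent) of always predicting the most common label."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels) * 100.0

# Assumed class counts for the Sonar data: 111 mines ('M') and 97 rocks ('R').
labels = ['M'] * 111 + ['R'] * 97
label, accuracy = zero_rule_accuracy(labels)
print(label, round(accuracy, 1))
```

Any model worth keeping should beat this figure, since it requires no learning at all.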

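You can learn more about this dataset at the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks)

Download the dataset for free, save it under the filename sonar.all-data.csv, and place it in your current working directory.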

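Tutorial

This tutorial is broken down into two steps.

1. Calculating Splits

2. Sonar Dataset Case Study

These steps provide the foundation you need to implement the Random Forest algorithm and apply it to your own predictive modeling problems.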

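1. Calculating Splits

In a decision tree, a split point is chosen by finding the attribute, and the value of that attribute, that results in the lowest cost.

For classification problems, the cost function is often the Gini index, which measures the purity of the groups of data created by the split point. For a two-class classification problem such as this one, a Gini index of 0 means perfect purity: the class values are perfectly separated into two groups.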
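The following sketch evaluates the Gini index on two hand-made splits. It mirrors the unweighted formulation used in this tutorial's listing, which sums p * (1 - p) over every class in every group; many references instead weight each group by its size. The toy rows are made up for illustration, with the class label in the last column.

```python
def gini_index(groups, class_values):
    """Sum p * (1 - p) over every class in every group (0.0 means pure groups)."""
    gini = 0.0
    for class_value in class_values:
        for group in groups:
            if not group:
                continue  # an empty group contributes nothing
            labels = [row[-1] for row in group]
            proportion = labels.count(class_value) / len(group)
            gini += proportion * (1.0 - proportion)
    return gini

# Each row is [feature, class]; each split is a (left, right) pair of groups.
pure = ([[0.1, 0], [0.2, 0]], [[0.8, 1], [0.9, 1]])
mixed = ([[0.1, 0], [0.2, 1]], [[0.8, 0], [0.9, 1]])

print(gini_index(pure, [0, 1]))   # 0.0
print(gini_index(mixed, [0, 1]))  # 1.0
```

With this formulation, the worst two-class split (both groups half and half) scores 1.0, and lower is better.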

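Finding the best split point in a decision tree involves evaluating the cost of each value in the training dataset for each input variable.

For bagging and Random Forest, this procedure is carried out on a sample of the training dataset that is drawn with replacement. Random Forest samples both the rows and the columns of the input data. Rows are sampled with replacement, meaning that the same row may be chosen and added to the sample more than once.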
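Sampling rows with replacement can be sketched as follows. The function name bootstrap_sample and its rng parameter are hypothetical; the listing later in this tutorial does the same job in its subsample() function using the module-level random generator.

```python
from random import Random

def bootstrap_sample(rows, ratio=1.0, rng=None):
    """Draw round(len(rows) * ratio) rows at random, with replacement."""
    rng = rng or Random()
    n_sample = round(len(rows) * ratio)
    return [rows[rng.randrange(len(rows))] for _ in range(n_sample)]

rows = list(range(10))  # stand-in for ten dataset rows
sample = bootstrap_sample(rows, rng=Random(1))

print(sorted(sample))
# The sample has as many entries as the original, but usually fewer distinct rows.
print(len(sample), len(set(sample)))
```

Rows that appear more than once in one sample are, in effect, given extra weight in the tree trained on that sample, while the rows that were not drawn are ignored by it.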

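We can adapt this procedure for Random Forest. Instead of enumerating the values of all input attributes in search of the split with the lowest cost, we can create a sample of the input attributes to consider.

This sample of input attributes can be chosen randomly and without replacement, meaning that each input attribute needs to be considered only once when looking for the split point with the lowest cost.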
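A short sketch of sampling attributes without replacement is shown below. It uses random.sample from the standard library as a compact stand-in for the loop around randrange() that this tutorial's get_split() uses, so the indices drawn will differ from the listing's. The helper name sample_features is hypothetical, and the figures of 60 features, 7 per split and 208 rows are the Sonar values quoted in this tutorial.

```python
from random import Random

def sample_features(n_total, n_features, rng):
    """Pick n_features distinct column indices out of n_total (no replacement)."""
    return rng.sample(range(n_total), n_features)

rng = Random(1)
features = sample_features(60, 7, rng)
print(sorted(features))

# Candidate splits scored at the root: one per (feature, row value) pair.
n_rows = 208
print('all features:', 60 * n_rows)                 # 12480
print('sampled features:', len(features) * n_rows)  # 1456
```

Besides making the trees less alike, restricting the features also cuts the number of candidate splits that have to be scored at each node.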

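The function get_split() below implements this procedure. It takes as arguments a dataset and the number of input features to evaluate, where the dataset may be a sample of the actual training dataset. The helper function test_split() splits the dataset at a candidate split point, and gini_index() evaluates the cost of a given split from the groups of rows it creates.

In the code, a list of features is built by randomly selecting feature indices. This list is then enumerated, and specific values in the training dataset are evaluated as candidate split points.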

    from random import randrange

    # Select the best split point for a dataset
    # (test_split() and gini_index() are defined in the complete listing)
    def get_split(dataset, n_features):
        class_values = list(set(row[-1] for row in dataset))
        b_index, b_value, b_score, b_groups = 999, 999, 999, None
        # randomly pick n_features distinct column indices (the last column is the class)
        features = list()
        while len(features) < n_features:
            index = randrange(len(dataset[0])-1)
            if index not in features:
                features.append(index)
        # score every value of every selected feature as a candidate split
        for index in features:
            for row in dataset:
                groups = test_split(index, row[index], dataset)
                gini = gini_index(groups, class_values)
                if gini < b_score:
                    b_index, b_value, b_score, b_groups = index, row[index], gini, groups
        return {'index':b_index, 'value':b_value, 'groups':b_groups}

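Now that we know how a decision tree algorithm can be modified for use with Random Forest, we can combine it with an implementation of bagging and apply it to a real-world dataset.

2. Sonar Dataset Case Study

In this section, we apply the Random Forest algorithm to the Sonar dataset. The example assumes that a CSV copy of the dataset is in the current working directory with the filename sonar.all-data.csv.

The dataset is first loaded, the string values are converted to numbers, and the output column is converted from strings to the integer values 0 and 1. This is done by the helper functions load_csv(), str_column_to_float() and str_column_to_int() respectively.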
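The loading and conversion steps can be tried without the real file. The sketch below parses two made-up rows (shortened to three inputs) from an in-memory string. The explicit LABELS mapping is an assumption added here: the tutorial's str_column_to_int() numbers the classes by enumerating a set, so which class receives 1 can vary from run to run, whereas a fixed mapping always gives 'M' the value 1.

```python
from csv import reader
from io import StringIO

# Two made-up rows in the Sonar layout, shortened to three inputs for illustration.
raw = StringIO("0.10,0.20,0.30,R\n0.40,0.50,0.60,M\n")

LABELS = {'M': 1, 'R': 0}  # explicit mapping, so 'M' is always 1

dataset = []
for row in reader(raw):
    if not row:
        continue  # skip blank lines
    inputs = [float(value.strip()) for value in row[:-1]]
    dataset.append(inputs + [LABELS[row[-1].strip()]])

print(dataset)  # [[0.1, 0.2, 0.3, 0], [0.4, 0.5, 0.6, 1]]
```

For accuracy scores the numbering does not matter, but a fixed mapping makes individual predictions easier to read.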

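We will use k-fold cross-validation to estimate the performance of the learned model on unseen data. This means that we will construct and evaluate k models and report the mean of their scores as the performance estimate. Classification accuracy is used to evaluate each model. These behaviors are provided by the helper functions cross_validation_split(), accuracy_metric() and evaluate_algorithm() respectively.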
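The fold bookkeeping can be illustrated on row indices alone. This is a sketch under the assumption that the fold size is the row count divided by k and truncated, so that a few rows may be left out of every fold; k_fold_indices is a hypothetical name.

```python
from random import Random

def k_fold_indices(n_rows, n_folds, rng):
    """Randomly partition row indices into n_folds folds of equal size."""
    remaining = list(range(n_rows))
    fold_size = n_rows // n_folds  # leftover rows are not assigned to any fold
    folds = []
    for _ in range(n_folds):
        fold = []
        while len(fold) < fold_size:
            fold.append(remaining.pop(rng.randrange(len(remaining))))
        folds.append(fold)
    return folds

folds = k_fold_indices(208, 5, Random(1))
for i, test_fold in enumerate(folds):
    # Train on every fold except the one held out for testing.
    train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
    print('fold', i, 'test rows:', len(test_fold), 'train rows:', len(train))
```

With 208 rows and 5 folds, each model is tested on 41 rows and trained on the other 164, and 3 rows are never used.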

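We will also use an implementation of the Classification and Regression Trees (CART) algorithm adapted for bagging. The helper function test_split() splits a dataset into groups; gini_index() evaluates a split point; the modified get_split() function discussed in the previous step selects a split point; to_terminal(), split() and build_tree() create a single decision tree; predict() makes a prediction with a decision tree; subsample() draws a subsample of the training dataset; and bagging_predict() makes a prediction with a list of decision trees.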
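It may help to see the shape of a finished tree before reading the code that builds one. In this tutorial a tree is a nested dictionary with the keys 'index', 'value', 'left' and 'right', where a child is either another dictionary or a class value. The sketch below walks a hand-built tree with made-up split values; its predict() is a condensed variant of the one in the listing.

```python
def predict(node, row):
    """Walk a nested-dict tree until a terminal (non-dict) value is reached."""
    branch = 'left' if row[node['index']] < node['value'] else 'right'
    child = node[branch]
    return predict(child, row) if isinstance(child, dict) else child

# Hand-built tree (hypothetical values): split on column 0, then on column 1.
tree = {
    'index': 0, 'value': 0.5,
    'left': 0,
    'right': {'index': 1, 'value': 0.3, 'left': 0, 'right': 1},
}

print(predict(tree, [0.2, 0.9, None]))  # 0
print(predict(tree, [0.7, 0.1, None]))  # 0
print(predict(tree, [0.7, 0.6, None]))  # 1
```

The last element of each row is the class column, which is set to None for rows being predicted.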

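A new function named random_forest() first creates a list of decision trees from subsamples of the training dataset and then uses them to make predictions.

As stated above, the key difference between Random Forest and bagged decision trees is one small change to the way trees are built, which is found in the get_split() function.

The complete example is listed below.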

    # Random Forest Algorithm on Sonar Dataset
    from random import seed
    from random import randrange
    from csv import reader
    from math import sqrt

    # Load a CSV file
    def load_csv(filename):
        dataset = list()
        with open(filename, 'r') as file:
            csv_reader = reader(file)
            for row in csv_reader:
                if not row:
                    continue
                dataset.append(row)
        return dataset

    # Convert string column to float
    def str_column_to_float(dataset, column):
        for row in dataset:
            row[column] = float(row[column].strip())

    # Convert string column to integer
    def str_column_to_int(dataset, column):
        class_values = [row[column] for row in dataset]
        unique = set(class_values)
        lookup = dict()
        for i, value in enumerate(unique):
            lookup[value] = i
        for row in dataset:
            row[column] = lookup[row[column]]
        return lookup

    # Split a dataset into k folds
    def cross_validation_split(dataset, n_folds):
        dataset_split = list()
        dataset_copy = list(dataset)
        # integer fold size; leftover rows are not assigned to any fold
        fold_size = int(len(dataset) / n_folds)
        for i in range(n_folds):
            fold = list()
            while len(fold) < fold_size:
                index = randrange(len(dataset_copy))
                fold.append(dataset_copy.pop(index))
            dataset_split.append(fold)
        return dataset_split

    # Calculate accuracy percentage
    def accuracy_metric(actual, predicted):
        correct = 0
        for i in range(len(actual)):
            if actual[i] == predicted[i]:
                correct += 1
        return correct / float(len(actual)) * 100.0

    # Evaluate an algorithm using a cross validation split
    def evaluate_algorithm(dataset, algorithm, n_folds, *args):
        folds = cross_validation_split(dataset, n_folds)
        scores = list()
        for fold in folds:
            # train on every fold except the one held out for testing
            train_set = list(folds)
            train_set.remove(fold)
            train_set = sum(train_set, [])
            test_set = list()
            for row in fold:
                # copy each test row and hide its class value
                row_copy = list(row)
                test_set.append(row_copy)
                row_copy[-1] = None
            predicted = algorithm(train_set, test_set, *args)
            actual = [row[-1] for row in fold]
            accuracy = accuracy_metric(actual, predicted)
            scores.append(accuracy)
        return scores

    # Split a dataset based on an attribute and an attribute value
    def test_split(index, value, dataset):
        left, right = list(), list()
        for row in dataset:
            if row[index] < value:
                left.append(row)
            else:
                right.append(row)
        return left, right

    # Calculate the Gini index for a split dataset
    def gini_index(groups, class_values):
        gini = 0.0
        for class_value in class_values:
            for group in groups:
                size = len(group)
                if size == 0:
                    continue
                proportion = [row[-1] for row in group].count(class_value) / float(size)
                gini += (proportion * (1.0 - proportion))
        return gini

    # Select the best split point for a dataset
    def get_split(dataset, n_features):
        class_values = list(set(row[-1] for row in dataset))
        b_index, b_value, b_score, b_groups = 999, 999, 999, None
        # randomly pick n_features distinct column indices (the last column is the class)
        features = list()
        while len(features) < n_features:
            index = randrange(len(dataset[0])-1)
            if index not in features:
                features.append(index)
        for index in features:
            for row in dataset:
                groups = test_split(index, row[index], dataset)
                gini = gini_index(groups, class_values)
                if gini < b_score:
                    b_index, b_value, b_score, b_groups = index, row[index], gini, groups
        return {'index':b_index, 'value':b_value, 'groups':b_groups}

    # Create a terminal node value
    def to_terminal(group):
        outcomes = [row[-1] for row in group]
        return max(set(outcomes), key=outcomes.count)

    # Create child splits for a node or make terminal
    def split(node, max_depth, min_size, n_features, depth):
        left, right = node['groups']
        del(node['groups'])
        # check for a no split
        if not left or not right:
            node['left'] = node['right'] = to_terminal(left + right)
            return
        # check for max depth
        if depth >= max_depth:
            node['left'], node['right'] = to_terminal(left), to_terminal(right)
            return
        # process left child
        if len(left) <= min_size:
            node['left'] = to_terminal(left)
        else:
            node['left'] = get_split(left, n_features)
            split(node['left'], max_depth, min_size, n_features, depth+1)
        # process right child
        if len(right) <= min_size:
            node['right'] = to_terminal(right)
        else:
            node['right'] = get_split(right, n_features)
            split(node['right'], max_depth, min_size, n_features, depth+1)

    # Build a decision tree
    def build_tree(train, max_depth, min_size, n_features):
        # split the training sample passed in, not the global dataset
        root = get_split(train, n_features)
        split(root, max_depth, min_size, n_features, 1)
        return root

    # Make a prediction with a decision tree
    def predict(node, row):
        if row[node['index']] < node['value']:
            if isinstance(node['left'], dict):
                return predict(node['left'], row)
            else:
                return node['left']
        else:
            if isinstance(node['right'], dict):
                return predict(node['right'], row)
            else:
                return node['right']

    # Create a random subsample from the dataset with replacement
    def subsample(dataset, ratio):
        sample = list()
        n_sample = round(len(dataset) * ratio)
        while len(sample) < n_sample:
            index = randrange(len(dataset))
            sample.append(dataset[index])
        return sample

    # Make a prediction with a list of bagged trees
    def bagging_predict(trees, row):
        predictions = [predict(tree, row) for tree in trees]
        return max(set(predictions), key=predictions.count)

    # Random Forest Algorithm
    def random_forest(train, test, max_depth, min_size, sample_size, n_trees, n_features):
        trees = list()
        for i in range(n_trees):
            sample = subsample(train, sample_size)
            tree = build_tree(sample, max_depth, min_size, n_features)
            trees.append(tree)
        predictions = [bagging_predict(trees, row) for row in test]
        return predictions

    # Test the random forest algorithm
    seed(1)
    # load and prepare data
    filename = 'sonar.all-data.csv'
    dataset = load_csv(filename)
    # convert input columns from strings to floats
    for i in range(0, len(dataset[0])-1):
        str_column_to_float(dataset, i)
    # convert class column to integers
    str_column_to_int(dataset, len(dataset[0])-1)
    # evaluate algorithm
    n_folds = 5
    max_depth = 10
    min_size = 1
    sample_size = 1.0
    n_features = int(sqrt(len(dataset[0])-1))
    for n_trees in [1, 5, 10]:
        scores = evaluate_algorithm(dataset, random_forest, n_folds, max_depth, min_size, sample_size, n_trees, n_features)
        print('Trees: %d' % n_trees)
        print('Scores: %s' % scores)
        print('Mean Accuracy: %.3f%%' % (sum(scores)/float(len(scores))))

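The parameter values assigned near the end of the listing, under the "# evaluate algorithm" comment, deserve a short explanation.

A k value of 5 was used for cross-validation, which gives 208/5 = 41.6, or just over 40 records in each fold to be evaluated on each iteration.

Deep trees were built with a maximum depth of 10 and a minimum of 1 training row per node. Samples of the training dataset were created with the same size as the original dataset, which is the default expectation for the Random Forest algorithm.

The number of features considered at each split point was set to the square root of the number of input features, sqrt(60) = 7.74, truncated to 7.

Three different numbers of trees were evaluated for comparison, to show that accuracy tends to improve as more trees are added.

Running the example prints the scores for each fold and the mean score for each configuration, as shown below: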

    Trees: 1
    Scores: [68.29268292682927, 75.60975609756098, 70.73170731707317, 63.41463414634146, 65.85365853658537]
    Mean Accuracy: 68.780%

    Trees: 5
    Scores: [68.29268292682927, 68.29268292682927, 78.04878048780488, 65.85365853658537, 68.29268292682927]
    Mean Accuracy: 69.756%

    Trees: 10
    Scores: [68.29268292682927, 78.04878048780488, 75.60975609756098, 70.73170731707317, 70.73170731707317]
    Mean Accuracy: 72.683%
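As a sanity check on these figures, the sketch below converts each fold score back into a count of correct predictions and recomputes the means. It assumes that each test fold holds 41 rows (208 divided by 5, truncated), which is consistent with every reported score being a multiple of 1/41.

```python
scores = {
    1: [68.29268292682927, 75.60975609756098, 70.73170731707317,
        63.41463414634146, 65.85365853658537],
    5: [68.29268292682927, 68.29268292682927, 78.04878048780488,
        65.85365853658537, 68.29268292682927],
    10: [68.29268292682927, 78.04878048780488, 75.60975609756098,
         70.73170731707317, 70.73170731707317],
}
fold_size = 208 // 5  # assumed: 41 rows per test fold

for n_trees, fold_scores in scores.items():
    # Recover how many of the 41 test rows were classified correctly in each fold.
    correct = [round(score / 100.0 * fold_size) for score in fold_scores]
    mean = sum(fold_scores) / len(fold_scores)
    print(n_trees, correct, '%.3f%%' % mean)
```

Going from 1 tree to 10 trees corresponds to 8 more correct predictions out of 205 test rows (149 versus 141), so the improvement is real but modest on a dataset this small.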