[Data Competitions] Competition Handbook Black Magic: An Advanced Ensembling Strategy Built on Public Results
2021-01-19 19:35
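Competition Handbook Black Magic: Ensembling Based on Public Results
(An Easy Route to a Silver Medal)
The idea in this article is simple: you do not need to train any model yourself. You only take the submission files that others have already shared publicly and apply a two-step "direct optimization" to get a result better than every one of those public submissions. Some less industrious Kaggle competitors have used exactly this strategy to take a silver medal in the last three days of a competition.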
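1. Basic blending
Collect all the submission files shared by the open-source community (suppose there are N of them);
sort them by leaderboard score from worst to best (the metric in the case study is lower-is-better, so the worst result has the highest score);
ensemble the M weakest results in some way to obtain a first blended result, so that the working set becomes this blend plus the remaining N - M public results;
then ensemble the blend with the public result whose score is closest to it;
repeat until every result has been folded in.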
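The procedure above can be sketched roughly as follows. This is a minimal illustration and not the notebook's code: the function name `sequential_blend`, the toy arrays, and the single fixed `coeff` are assumptions, and it assumes a lower-is-better metric so that "worst" means the highest score.

```python
import numpy as np

def sequential_blend(preds, scores, coeff=0.7):
    """Fold public predictions together from worst to best score.

    preds  : list of 1-D arrays, one per public submission (toy data here)
    scores : leaderboard score of each submission, lower is better (assumption)
    coeff  : weight given to the better submission at every step
    """
    # Order the submissions from worst (highest score) to best (lowest score).
    order = np.argsort(scores)[::-1]
    blend = np.asarray(preds[order[0]], dtype=float)
    for idx in order[1:]:
        better = np.asarray(preds[idx], dtype=float)
        # The better submission plays the "main" role and gets the larger weight.
        blend = coeff * better + (1.0 - coeff) * blend
    return blend

preds = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
scores = [0.70, 0.69, 0.68]
blended = sequential_blend(preds, scores, coeff=0.7)
print(blended)
```

The case study does not use one fixed coefficient; it picks a different value by hand at each step.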
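2. Upgraded blending
Take the result of the basic blending and then correct it row by row (see the case study below for details).
If a value is larger than the public results time after time, multiply it by a coefficient greater than 1; if it is smaller time after time, multiply it by a coefficient less than 1.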
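As a small illustration of this rule for a single prediction: the helper name `correct_value` and the toy numbers below are assumptions made for the example, while the default coefficients are the `pcoeff` and `mcoeff` values used later in the case study.

```python
def correct_value(value, public_values, majority, up=1.0016, down=0.9984):
    """Scale `value` when it is above (or not above) more than `majority`
    of the public predictions for the same row; otherwise leave it unchanged."""
    above = sum(1 for p in public_values if value > p)
    not_above = len(public_values) - above
    if above > majority:
        return value * up      # larger time after time: coefficient > 1
    if not_above > majority:
        return value * down    # smaller time after time: coefficient < 1
    return value

public = [7.90, 7.92, 7.95, 7.91]                # hypothetical public predictions for one row
high = correct_value(8.00, public, majority=3)   # above all four public values
low = correct_value(7.80, public, majority=3)    # below all four public values
mixed = correct_value(7.93, public, majority=3)  # above three, below one: unchanged
```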
This case study is excerpted from the Kaggle notebook "[results-driven] Tabular Playground Series - 201": https://www.kaggle.com/somayyehgholami/results-driven-tabular-playground-series-201
1. Collect the public submission files
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
dfk = pd.DataFrame({
    'Kernel ID': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K'],
    'Score': [0.69864, 0.69846, 0.69836, 0.69824, 0.69813, 0.69795, 0.69782, 0.69749, 0.69747, 0.69735, 0.69731],
    'File Path': ['../input/tps-jan-2021-gbdts-baseline/submission.csv', '../input/pseudo-labelling/submission.csv', '../input/v4-baseline-lgb-no-tune/sub_0.6971.csv', '../input/tps21-optuna-lgb-fast-hyper-parameter-tunning/submission.csv', '../input/gbdts-baseline-prevision-io-for-free/submission.csv', '../input/v41-eda-gbdts/res41.csv', '../input/v3-ensemble-lgb-xgb-cat/submission.csv', '../input/tabular-playground/sub_gbm.csv', '../input/v48tabular-playground-series-xgboost-lightgbm/V48-0.69747.csv', '../input/xgboost-hyperparameter-tuning-using-optuna/submission.csv', '../input/tabular-playground-some-slightly-useful-features/sub_gbm.csv']
})
dfk
| | Kernel ID | Score | File Path |
|---|---|---|---|
| 0 | A | 0.69864 | ../input/tps-jan-2021-gbdts-baseline/submissio... |
| 1 | B | 0.69846 | ../input/pseudo-labelling/submission.csv |
| 2 | C | 0.69836 | ../input/v4-baseline-lgb-no-tune/sub_0.6971.csv |
| 3 | D | 0.69824 | ../input/tps21-optuna-lgb-fast-hyper-parameter... |
| 4 | E | 0.69813 | ../input/gbdts-baseline-prevision-io-for-free/... |
| 5 | F | 0.69795 | ../input/v41-eda-gbdts/res41.csv |
| 6 | G | 0.69782 | ../input/v3-ensemble-lgb-xgb-cat/submission.csv |
| 7 | H | 0.69749 | ../input/tabular-playground/sub_gbm.csv |
| 8 | I | 0.69747 | ../input/v48tabular-playground-series-xgboost-... |
| 9 | J | 0.69735 | ../input/xgboost-hyperparameter-tuning-using-o... |
| 10 | K | 0.69731 | ../input/tabular-playground-some-slightly-usef... |
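2. Blending function
Compute (result with the better leaderboard score) * coeff + (result with the weaker leaderboard score) * (1 - coeff), where coeff is usually greater than 0.5.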
def generate(main, support, coeff):
    g = main.copy()
    for i in main.columns[1:]:  # every prediction column (column 0 is the id)
        res = []
        lm = main[i].tolist()
        ls = support[i].tolist()
        for j in range(len(main)):
            # main * coeff + support * (1 - coeff)
            res.append((lm[j] * coeff) + (ls[j] * (1. - coeff)))
        g[i] = res
    return g
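Since `generate` is a plain weighted average, the same result can be obtained without Python-level loops. The sketch below assumes, as the original does, that the first column is an identifier and that both frames hold the same rows in the same order; the name `generate_vectorized` and the toy frames are hypothetical.

```python
import pandas as pd

def generate_vectorized(main, support, coeff):
    """Weighted average of every prediction column; the id column is kept from `main`."""
    g = main.copy()
    cols = main.columns[1:]
    # .to_numpy() makes the arithmetic positional, like the list-based loop in `generate`.
    g[cols] = main[cols].to_numpy() * coeff + support[cols].to_numpy() * (1.0 - coeff)
    return g

main = pd.DataFrame({"id": [1, 2], "target": [8.0, 7.0]})
support = pd.DataFrame({"id": [1, 2], "target": [6.0, 9.0]})
out = generate_vectorized(main, support, 0.7)
print(out)
```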
def drawing(main, support, generated):
    # Two scatter plots: main vs. support, then main vs. generated.
    X = main.iloc[:, 1]
    Y1 = support.iloc[:, 1]
    Y2 = generated.iloc[:, 1]
    # Note: on Matplotlib 3.6+ this style is named 'seaborn-v0_8-whitegrid'.
    plt.style.use('seaborn-whitegrid')
    plt.figure(figsize=(8, 8), facecolor='lightgray')
    plt.title('\nOn the X axis >>> main\nOn the Y axis >>> support\n')
    plt.scatter(X, Y1, s=0.1)
    plt.show()
    plt.style.use('seaborn-whitegrid')
    plt.figure(figsize=(8, 8), facecolor='lightgray')
    plt.title('\nOn the X axis >>> main\nOn the Y axis >>> generated\n')
    plt.scatter(X, Y2, s=0.1)
    plt.show()
def drawing1(main, support, generated):
    # One scatter plot with both comparisons overlaid.
    X = main.iloc[:, 1]
    Y1 = support.iloc[:, 1]
    Y2 = generated.iloc[:, 1]
    plt.style.use('seaborn-whitegrid')
    plt.figure(figsize=(8, 8), facecolor='lightgray')
    plt.title('\nBlue | X axis >> main | Y axis >> support\n\nOrange | X axis >> main | Y axis >> generated\n')
    plt.scatter(X, Y1, s=0.1)
    plt.scatter(X, Y2, s=0.1)
    plt.show()
3. Initial blending
3.1 Blend 1: A through G -> best result 1
[ A: (Score: 0.69864), B: (Score: 0.69846), ... , F: (Score: 0.69795), G: (Score: 0.69782) ] >>> sub1: (Score: 0.69781)
support = pd.read_csv(dfk.iloc[0, 2])
for k in range(1, 7):
    # read the next-better kernel in turn
    main = pd.read_csv(dfk.iloc[k, 2])
    # blend it with the running result using generate
    support = generate(main, support, 0.7)
sub1 = support
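Because each pass through the loop keeps only 30% of the running blend, the seven kernels do not contribute equally to `sub1`. The short calculation below unrolls the recursion `support = 0.7 * main + 0.3 * support` to show the effective weight of each kernel; it is a worked illustration and not part of the original notebook.

```python
coeff = 0.7
kernels = ["A", "B", "C", "D", "E", "F", "G"]

# Start with all weight on A, then fold in B..G exactly as the loop does.
weights = {"A": 1.0}
for name in kernels[1:]:
    # Everything already in the blend is scaled down by (1 - coeff) ...
    weights = {k: w * (1.0 - coeff) for k, w in weights.items()}
    # ... and the newly added kernel enters with weight coeff.
    weights[name] = coeff

for name in kernels:
    print(f"{name}: {weights[name]:.6f}")
```

G alone carries 70% of the weight and A less than 0.1%, so `sub1` is dominated by the best-scoring kernels of the group.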
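3.2 Blend 2: blend result 1 with a kernel of similar score
Note that the submission with the better leaderboard score is main and the weaker one is support; [ H: (Score: 0.69749) , sub1: (Score: 0.69781) ] >>> sub2: (Score: improved)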
main = pd.read_csv(dfk.iloc[7, 2])
sub2 = generate(main, sub1, 0.8)
3.3 Blend 3: blend result 2 with a kernel of similar score
[ I: (Score: 0.69747) , sub2: (Score: -----) ] >>> sub3: (Score: improved)
main = pd.read_csv(dfk.iloc[8, 2])
sub3 = generate(main, sub2, 0.55)
3.4 Blend 4: blend result 3 with a kernel of similar score
[ J: (Score: 0.69735) , sub3: (Score: ------) ] >>> sub4: (Score: improved)
3.5 Blend 5: blend result 4 with a kernel of similar score
[ K: (Score: 0.69731) , sub4: (Score: -------) ] >>> sub5: (Score: 0.69688)
4. Upgraded blending
Correct the predictions that are too low, and correct the predictions that are too high.
sub5: (Score: 0.69688) >>> sub6: (Score: 0.69682)
We first compared the result of the previous step with the result of each kernel that went into it, looking for rows where all kernels (or a majority of them) fall on the same side of our blended value, either above it or below it. We also know that the result of the previous step scores better than every kernel used. So we can guess that these rows were distorted: in the earlier steps they were mistakenly pushed up or down. We compensate for these possible errors to some extent by applying the coefficients "pcoeff" and "mcoeff", and only in these rows. In the code below, a value that is greater than more than `majority` of the 11 kernel predictions is multiplied by pcoeff, and a value that is not greater than more than `majority` of them is multiplied by mcoeff. The plots in the original notebook illustrate the method.
main = sub5  # 0.69688
comp = main.copy()
majority = 9  # Hyper parameter
pcoeff = 1.0016  # Hyper parameter
mcoeff = 0.9984  # Hyper parameter
pxy = [[], [], []]
mxy = [[], [], []]
for i in main.columns[1:]:
    lm = main[i].tolist()
    ls = [[], [], [], [], [], [], [], [], [], [], []]
    res = []
    ## 1. Read all the public results
    for n in range(11):
        csv = pd.read_csv(dfk.iloc[n, 2])
        ls[n] = csv[i].tolist()
    ## 2. Compare row by row
    for j in range(len(main)):
        pcount = 0
        pvalue = 0.0
        mcount = 0
        mvalue = 0.0
        ## 2.1 Count how many kernels main is greater than (pcount)
        ##     and how many it is not greater than (mcount)
        ## 2.2 If pcount exceeds the threshold, multiply main's value by a coefficient (usually > 1);
        ##     if mcount exceeds the threshold, multiply main's value by a coefficient (usually < 1)
        for k in range(11):
            if lm[j] > ls[k][j]:
                pcount += 1
                pvalue += ls[k][j]
            else:
                mcount += 1
                mvalue += ls[k][j]
        if pcount > majority:
            res.append(lm[j] * pcoeff)
            pxy[0].append(lm[j])
            pxy[1].append(pvalue / pcount)
            pxy[2].append(lm[j] * pcoeff)
        elif mcount > majority:
            res.append(lm[j] * mcoeff)
            mxy[0].append(lm[j])
            mxy[1].append(mvalue / mcount)
            mxy[2].append(lm[j] * mcoeff)
        else:
            res.append(lm[j])
    comp[i] = res
sub6 = comp
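The row-by-row loop above can also be expressed with NumPy array operations. The following is a sketch under assumptions: `apply_majority_correction` is a hypothetical name, the public predictions are passed in as a 2-D array instead of being read from the CSV files, and the toy numbers are invented for the example. It uses the same comparison as the loop (strictly greater counts toward `pcount`, everything else toward `mcount`).

```python
import numpy as np

def apply_majority_correction(blend, public, majority, pcoeff=1.0016, mcoeff=0.9984):
    """blend: shape (n_rows,); public: shape (n_kernels, n_rows)."""
    blend = np.asarray(blend, dtype=float)
    public = np.asarray(public, dtype=float)
    pcount = (blend > public).sum(axis=0)   # per row: kernels the blend exceeds
    mcount = public.shape[0] - pcount       # per row: all remaining kernels
    scale = np.ones_like(blend)
    scale[mcount > majority] = mcoeff
    # Assigned last so it wins if both conditions hold, like the if/elif order in the loop.
    scale[pcount > majority] = pcoeff
    return blend * scale

# Toy data: 3 hypothetical kernels, 3 rows.
public = np.array([[7.9, 8.1, 8.0],
                   [7.8, 8.2, 8.1],
                   [7.7, 8.3, 7.9]])
blend = np.array([8.0, 8.0, 8.0])
corrected = apply_majority_correction(blend, public, majority=2)
print(corrected)
```

In the toy data the first row lies above all three kernels and is scaled up, the second lies below all three and is scaled down, and the third is left unchanged because neither count exceeds `majority`.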