How good is the model?
R², also called the coefficient of determination, summarizes the explanatory power of a regression model and is calculated from the sums of squares.
R² describes the proportion of the dependent variable's variance that is explained by the regression model.
If the regression model is "perfect", SSE is zero and R² equals 1. If the regression model fails completely, SSE equals SST, no variance is explained by the regression, and R² is zero. Keep in mind that a high R² does not by itself imply a causal relationship.
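In terms of these sums of squares, R² can be computed directly. The following is a minimal sketch; the function name r_squared and the NumPy-based implementation are illustrative, not part of the original example:

import numpy as np

def r_squared(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    sse = np.sum((y_true - y_pred) ** 2)            # error sum of squares (SSE)
    sst = np.sum((y_true - np.mean(y_true)) ** 2)   # total sum of squares (SST)
    return 1.0 - sse / sst                          # R² = 1 - SSE/SST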
How significant is the model?
The F-ratio estimates the statistical significance of the regression model and is computed from the mean square terms in the ANOVA table. The significance of the F-ratio is obtained by referring to the F distribution table using two degrees of freedom (df_MSR, df_MSE). p is the number of independent variables (e.g., p is one for simple linear regression).
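As a rough sketch of how the F-ratio follows from the ANOVA quantities (the function name f_ratio and its arguments are illustrative; n is the number of observations), the significance can be read from the F distribution with scipy:

from scipy import stats

def f_ratio(ssr, sse, n, p):
    msr = ssr / p                            # regression mean square, df_MSR = p
    mse = sse / (n - p - 1)                  # error mean square, df_MSE = n - p - 1
    f = msr / mse
    p_value = stats.f.sf(f, p, n - p - 1)    # upper-tail probability of the F distribution
    return f, p_value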
The advantage of the F-ratio over R² is that the F-ratio takes the sample size and the number of predictors into account when assessing the significance of the regression model. A model can have a high R² and still not be statistically significant.
How significant are the coefficients?
If the regression model is significant, we can use a t-test to evaluate the statistical significance of each individual coefficient.
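One way to obtain these per-coefficient t-tests in Python is through statsmodels. A minimal sketch, assuming X_train and y_train as prepared in the example at the end of this section:

import statsmodels.api as sm

X = sm.add_constant(X_train)    # add the intercept term
ols = sm.OLS(y_train, X).fit()
print(ols.summary())            # the 't' and 'P>|t|' columns give the per-coefficient t-tests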
Multicollinearity
A high degree of multicollinearity among the predictors produces unreliable regression coefficient estimates. Signs of multicollinearity include the following:
1. High correlation between pairs of predictor variables (see the sketch after this list).
2. Regression coefficients whose signs or magnitudes do not make good physical sense.
3. Statistically insignificant regression coefficients on important predictors.
4. Extreme sensitivity of the sign or magnitude of a regression coefficient to adding or removing a predictor.
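A quick way to check the first sign is the pairwise correlation matrix of the predictors. A minimal sketch, assuming trainset and the predictor columns used in the example at the end of this section:

cols = ['temp', 'humidity', 'windspeed']
print(trainset[cols].corr())   # values close to +1 or -1 indicate collinear pairs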
The diagonal values of the inverse of the predictors' correlation matrix are called variance inflation factors (VIFs), and they are a very useful measure of multicollinearity. If any VIF exceeds 5, multicollinearity is considered a problem.
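statsmodels provides a helper for computing VIFs. A minimal sketch, assuming X_train as prepared in the example at the end of this section:

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(X_train)                  # VIFs are computed with the intercept included
vifs = {name: variance_inflation_factor(X.values, i)
        for i, name in enumerate(X.columns) if name != 'const'}
print(vifs)                                   # any value above 5 signals a multicollinearity problem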
Model selection
A frequent problem in data mining is avoiding predictors that do not contribute significantly to the model's predictions. First, it has been shown that dropping predictors with insignificant coefficients can reduce the average prediction error. Second, in models with many variables, the estimation of the regression coefficients can become unstable because of multicollinearity. Finally, a simpler model is a better model, because the effect of the predictors in the model is easier to understand. There are two main methods of model selection:
Forward selection, in which the best predictors are entered into the model one at a time.
Backward elimination, in which the worst predictors are removed from the model one at a time.
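A greedy, cross-validation-based variant of both strategies is available in scikit-learn via SequentialFeatureSelector. A minimal sketch, assuming X_train and y_train as prepared in the example below; the choice of two features is illustrative:

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=2,   # illustrative choice
                                direction='forward')      # use 'backward' for backward elimination
sfs.fit(X_train, y_train)
print(sfs.get_support())   # boolean mask of the selected predictors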
Example:
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"import statistics\n",
"import pandas as pd\n",
"from matplotlib import pyplot as plt\n",
"import seaborn as sns\n",
"import numpy as np\n",
"from scipy import stats\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn import metrics\n",
"import scikitplot as skplt\n",
"from sklearn import tree\n",
"from sklearn.datasets import make_regression\n",
"from sklearn.metrics import mean_squared_error\n",
"import statsmodels.api as sm"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Download the data from http://www.saedsayad.com/datasets/BikeRental.zip\n",
"trainset=pd.read_csv('C:\\\\Users\\\\Lenovo\\\\Documents\\\\Python Codes\\\\BikeRental\\\\bike_rental_train.csv')\n",
"testset=pd.read_csv('C:\\\\Users\\\\Lenovo\\\\Documents\\\\Python Codes\\\\BikeRental\\\\bike_rental_test.csv')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"trainset.shape\n",
"testset.shape"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"trainset.columns\n",
"testset.columns"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# training\n",
"cols=['temp', 'humidity', 'windspeed']\n",
"tr=trainset.dropna(axis=0,how='any')\n",
"X_train=tr[cols]\n",
"y_train=tr['bike_rent_count']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mlr = LinearRegression()\n",
"mlr.fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# show the model parameters\n",
"print(mlr.intercept_, mlr.coef_, mlr.score(X_train, y_train))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# testing\n",
"cols=['temp', 'humidity', 'windspeed']\n",
"ts=testset.dropna(axis=0,how='any')\n",
"X_test=ts[cols]\n",
"y_test=ts['bike_rent_count']\n",
"y_pred = mlr.predict(X_test)\n",
"print('R-squared of multiple linear regression on test set: {:.2f}'.format(mlr.score(X_test, y_test)))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# mean squared error\n",
"mse = mean_squared_error(y_test, y_pred)\n",
"print(\"MSE: \", mse)\n",
"print(\"RMSE: \", mse**(1/2.0)) "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# plot original and predicted values\n",
"x_ax = range(len(y_test))\n",
"plt.plot(x_ax, y_test, label=\"original\")\n",
"plt.plot(x_ax, y_pred, label=\"predicted\")\n",
"plt.title(\"test and predicted data\")\n",
"plt.xlabel('X-axis')\n",
"plt.ylabel('Y-axis')\n",
"plt.legend(loc='best',fancybox=True, shadow=True)\n",
"plt.grid(True)\n",
"plt.show() "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# residulas\n",
"res=y_test-y_pred\n",
"sns.histplot(res)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.boxplot(x=res, showmeans=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mu, std = stats.norm.fit(res)\n",
"fig, ax = plt.subplots()\n",
"\n",
"# plot the residuals\n",
"sns.histplot(x=res, ax=ax, stat=\"density\", linewidth=0, kde=True)\n",
"ax.set(title=\"Distribution of residuals\", xlabel=\"residual\")\n",
"\n",
"# plot corresponding normal curve\n",
"xmin, xmax = plt.xlim() # the maximum x values from the histogram above\n",
"x = np.linspace(xmin, xmax, 100) # generate some x values\n",
"p = stats.norm.pdf(x, mu, std) # calculate the y values for the normal curve\n",
"sns.lineplot(x=x, y=p, color=\"orange\", ax=ax)\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Q-Q plot\n",
"sm.qqplot(res, line='s')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# scatter plot\n",
"plt.scatter(y_test,y_pred)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.11.0 64-bit",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.0"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "e956c49ae0caa224b5fdd754cbc353d7e61c6c61f7329ee680c7f7b4f1b71a8e" }
},
"nbformat": 4,
"nbformat_minor": 2}