In [1]:
# Import the required packages
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, KBinsDiscretizer
from sklearn.compose import make_column_transformer
import seaborn as sns

from keras.layers import Dense, Input, Activation
from keras.callbacks import Callback
from keras.models import Model
Using TensorFlow backend.
In [2]:
# Make sure we have all the required data
!mkdir -p data/boston_housing
!wget -nc -O data/boston_housing.zip https://www.dropbox.com/s/3jnf3000vwaxtcg/boston_housing.zip?dl=1
!unzip -oq -d data/boston_housing data/boston_housing.zip
--2019-08-30 10:19:57--  https://www.dropbox.com/s/3jnf3000vwaxtcg/boston_housing.zip?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.1.1, 2620:100:6016:1::a27d:101
Connecting to www.dropbox.com (www.dropbox.com)|162.125.1.1|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/dl/3jnf3000vwaxtcg/boston_housing.zip [following]
--2019-08-30 10:19:57--  https://www.dropbox.com/s/dl/3jnf3000vwaxtcg/boston_housing.zip
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc9771d3163c4c99d4769d4c9dff.dl.dropboxusercontent.com/cd/0/get/AnnGLY6MMQB0TYRRHQa0QWveQJTcHYvRQ_Ii1a1z12Ox8sAKSaIvMNRWmlWrcp_o91JYc3KA8_miGic8LiwbVmsVdVgRGvoQILhR6Ne30CYZyIeNtkJRtuZP1OuHujaFJ18/file?dl=1# [following]
--2019-08-30 10:19:57--  https://uc9771d3163c4c99d4769d4c9dff.dl.dropboxusercontent.com/cd/0/get/AnnGLY6MMQB0TYRRHQa0QWveQJTcHYvRQ_Ii1a1z12Ox8sAKSaIvMNRWmlWrcp_o91JYc3KA8_miGic8LiwbVmsVdVgRGvoQILhR6Ne30CYZyIeNtkJRtuZP1OuHujaFJ18/file?dl=1
Resolving uc9771d3163c4c99d4769d4c9dff.dl.dropboxusercontent.com (uc9771d3163c4c99d4769d4c9dff.dl.dropboxusercontent.com)... 162.125.1.6, 2620:100:601b:6::a27d:806
Connecting to uc9771d3163c4c99d4769d4c9dff.dl.dropboxusercontent.com (uc9771d3163c4c99d4769d4c9dff.dl.dropboxusercontent.com)|162.125.1.6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13500 (13K) [application/binary]
Saving to: ‘data/boston_housing.zip’

data/boston_housing 100%[===================>]  13.18K  --.-KB/s    in 0s      

2019-08-30 10:19:58 (265 MB/s) - ‘data/boston_housing.zip’ saved [13500/13500]

In [0]:
# Helper code
class NEpochLogger(Callback):
    """
    Callback that reports training progress less often: once every n_epochs epochs.
    """
    def __init__(self, n_epochs=100):
        super(NEpochLogger, self).__init__()
        self.n_epochs = n_epochs

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        
        if epoch % self.n_epochs == 0:
            curr_loss = logs.get('loss')
            print("epoch = {}; loss = {}".format(
                epoch, curr_loss))

Regression model for real-estate prices

In this notebook we apply regression based on artificial neural networks to the problem of predicting real-estate prices. We work with the Boston housing dataset. The goal is to predict the price of a property (medv, the median home value in $1000s) from the 13 remaining input variables.

Loading the dataset

Let us start by displaying the description of the data:

In [4]:
with open("data/boston_housing/description.txt", "r") as file:
    print(file.read())
Housing Values in Suburbs of Boston

The medv variable is the target variable.
Data description

The Boston data frame has 506 rows and 14 columns.

This data frame contains the following columns:

crim
per capita crime rate by town.

zn
proportion of residential land zoned for lots over 25,000 sq.ft.

indus
proportion of non-retail business acres per town.

chas
Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

nox
nitrogen oxides concentration (parts per 10 million).

rm
average number of rooms per dwelling.

age
proportion of owner-occupied units built prior to 1940.

dis
weighted mean of distances to five Boston employment centres.

rad
index of accessibility to radial highways.

tax
full-value property-tax rate per $10,000.

ptratio
pupil-teacher ratio by town.

black
1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

lstat
lower status of the population (percent).

medv
median value of owner-occupied homes in $1000s.
Source

Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.

Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.

Next, we load the dataset itself from the CSV file:

In [5]:
df = pd.read_csv("data/boston_housing/housing.csv")
df.head()
Out[5]:
crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222.0 18.7 396.90 5.33 36.2

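We split the data into a training set and a test set. The split is stratified: because the target medv is continuous, we first discretize it into 10 bins and stratify on the bin index, so that both sets cover the price range in similar proportions: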

In [0]:
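# Bin the continuous target into 10 ordinal bins; the bin index is used only for stratification
kbins = KBinsDiscretizer(n_bins=10, encode='ordinal')
y_stratify = kbins.fit_transform(df[["medv"]])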
In [0]:
df_train, df_test = train_test_split(df, stratify=y_stratify,
                        test_size=0.25, random_state=4)
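
The effect of stratifying on a binned continuous target can be checked on synthetic data. The following is a minimal sketch, assuming a made-up exponentially distributed target (it is not the housing data); the names `y`, `y_bins`, `idx_train` and `idx_test` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic, skewed target values (an assumption for illustration only).
rng = np.random.RandomState(0)
y = rng.exponential(scale=20.0, size=(400, 1))

# Quantile binning gives bins with roughly equal numbers of samples.
kbins = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
y_bins = kbins.fit_transform(y).ravel().astype(int)

# Split the sample indices, stratifying on the bin index.
idx_train, idx_test = train_test_split(
    np.arange(len(y)), stratify=y_bins, test_size=0.25, random_state=4)

# Share of each bin in the full data and in the test set.
all_share = np.bincount(y_bins, minlength=10) / len(y)
test_share = np.bincount(y_bins[idx_test], minlength=10) / len(idx_test)
print(np.round(all_share, 3))
print(np.round(test_share, 3))
```

The two printed rows should be close to each other, which is what stratification is meant to guarantee.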

We can check which columns are available:

In [8]:
df.columns
Out[8]:
Index(['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax',
       'ptratio', 'black', 'lstat', 'medv'],
      dtype='object')

Data preprocessing

Next, we preprocess the data. We proceed much as in the previous examples; the difference lies in how we handle categorical variables. So far we have typically recoded them by assigning an ordinal number to each value (using OrdinalEncoder). With neural networks, one-hot (1-of-n) encoding may be more suitable, because it does not impose an artificial order on the categories. The OneHotEncoder class serves this purpose; we also use it to encode the desired outputs in classification.
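
To make the difference between the two encodings concrete, here is a small sketch on a hypothetical toy column (the `colors` array is invented for illustration and is not part of the dataset). Both encoders sort the categories alphabetically by default.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical categorical column with three distinct values.
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

# Ordinal encoding: one integer code per category (blue=0, green=1, red=2).
ordinal = OrdinalEncoder().fit_transform(colors)

# One-hot encoding: one binary column per category; the encoder returns
# a sparse matrix, which we convert to a dense array for printing.
one_hot = OneHotEncoder(categories="auto").fit_transform(colors).toarray()

print(ordinal.ravel())   # [2. 1. 0. 1.]
print(one_hot)           # each row contains a single 1
```

With ordinal codes a network would see "red" as numerically twice "green", whereas each one-hot row is equally distant from every other category.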

In [0]:
categorical_inputs = ['chas']

numeric_inputs = ['crim', 'zn', 'indus', 'nox', 'rm', 'age', 'dis', 'rad', 'tax',
       'ptratio', 'black', 'lstat']

output = ["medv"]
In [0]:
input_preproc = make_column_transformer(
    (OneHotEncoder(categories='auto'), categorical_inputs),
    (StandardScaler(), numeric_inputs)
)

output_preproc = StandardScaler()
In [0]:
X_train_raw = df_train[categorical_inputs + numeric_inputs]
Y_train_raw = df_train[output]

X_test_raw = df_test[categorical_inputs + numeric_inputs]
Y_test_raw = df_test[output]
In [0]:
X_train = input_preproc.fit_transform(X_train_raw)
X_test = input_preproc.transform(X_test_raw)

Y_train = output_preproc.fit_transform(Y_train_raw)
Y_test = output_preproc.transform(Y_test_raw)

Task 1: Training the neural network

Using the keras package, create a neural network that performs regression on the dataset preprocessed above.


In [0]:


Task 2: Testing

Test the trained regression model on the test data. Evaluate the mean squared error and the mean absolute error, and display a histogram of the errors.
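
As a hint on the mechanics only, the sketch below computes these metrics on hypothetical numbers. It assumes predictions arrive on the standardized scale (as they would from a network trained on Y_train) and maps them back to $1000s with `inverse_transform`. The arrays `y_true_raw` and `y_pred_scaled` are invented stand-ins, and `scaler` plays the role of `output_preproc`; in the notebook the scaler is fitted on the training outputs, not on the test outputs as done here for brevity.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.preprocessing import StandardScaler

# Hypothetical true prices in $1000s (stand-in for Y_test_raw).
y_true_raw = np.array([[24.0], [21.6], [34.7], [33.4], [36.2]])

# Stand-in for output_preproc (assumption: fitted here only for illustration).
scaler = StandardScaler().fit(y_true_raw)

# Pretend model output on the standardized scale: truth plus small offsets.
offsets = np.array([[0.1], [-0.2], [0.0], [0.3], [-0.1]])
y_pred_scaled = scaler.transform(y_true_raw) + offsets

# Map predictions back to the original units before measuring errors.
y_pred_raw = scaler.inverse_transform(y_pred_scaled)

mse = mean_squared_error(y_true_raw, y_pred_raw)
mae = mean_absolute_error(y_true_raw, y_pred_raw)

# Signed errors and their histogram counts (the basis of an error histogram plot).
errors = (y_pred_raw - y_true_raw).ravel()
counts, bin_edges = np.histogram(errors, bins=5)

print("MSE = {:.3f}; MAE = {:.3f}".format(mse, mae))
print(counts, bin_edges)
```

The same `errors` array can be passed to a plotting function, for example matplotlib's `hist`, to draw the histogram. Reporting errors in the original units makes MAE directly readable as thousands of dollars.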


In [0]: