[Project Day 3]

rubus0304 2024. 12. 3. 21:03

1. Preprocessing

2. Derived variables (time information; see the sketch after this list)

3. Derived variables (spatial information)
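
The time-derived variables in step 2 are not shown in this post, so here is a minimal sketch of the usual pattern, assuming a raw accident frame with a '사고일시' (accident datetime) column; the frame and column names are assumptions for illustration only.

import pandas as pd

# Hypothetical raw accident frame; '사고일시' (accident datetime) is an assumed column name
accidents = pd.DataFrame({'사고일시': ['2021-01-01 13:00', '2021-06-15 22:30']})

accidents['사고일시'] = pd.to_datetime(accidents['사고일시'])
accidents['연'] = accidents['사고일시'].dt.year        # year
accidents['월'] = accidents['사고일시'].dt.month       # month
accidents['시간'] = accidents['사고일시'].dt.hour      # hour of day
accidents['요일'] = accidents['사고일시'].dt.dayofweek  # day of week, 0 = Monday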

 

train_df_9 (per-dong counts of Daegu CCTVs, security lights, child protection zones, and parking lots / per-dong deaths, serious injuries, minor injuries, ECLO / expressway presence / road type)

test_df_9 (per-dong counts of Daegu CCTVs, security lights, child protection zones, and parking lots / per-dong deaths, serious injuries, minor injuries, ECLO / expressway presence / road type)
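
The per-dong facility counts above would typically come from grouping each facility table by dong and merging the counts back onto the accident-level frame. A minimal sketch of that pattern with a hypothetical cctv_df and a '동' key column (the real table and column names in this project may differ):

import pandas as pd

# Hypothetical tables; '동' (dong) is the assumed neighbourhood key
cctv_df = pd.DataFrame({'동': ['수성동1가', '수성동1가', '범어동']})
accident_df = pd.DataFrame({'동': ['수성동1가', '범어동', '만촌동']})

# Count CCTVs per dong, then left-join the count onto the accident rows
cctv_counts = cctv_df.groupby('동').size().reset_index(name='동별CCTV수')
accident_df = accident_df.merge(cctv_counts, on='동', how='left').fillna({'동별CCTV수': 0})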

 

4. Modeling

# Drop the ID and non-feature columns from the test set
test_x_1 = test_df_9.drop(columns=['ID','군구','사고유형시']).copy()
 

 

# Align the training features with the test feature columns
train_x_1 = train_df_9[test_x_1.columns].copy()

# Candidate target columns; ECLO (train_y_5) is the one used for modeling below
train_y_1 = train_df_9['동사망자수'].copy()
train_y_2 = train_df_9['동중상자수'].copy()
train_y_3 = train_df_9['동경상자수'].copy()
train_y_4 = train_df_9['동부상자수'].copy()
train_y_5 = train_df_9['ECLO'].copy()
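
For reference, ECLO (Equivalent Casualty Loss Only) is, as far as I understand this competition, a weighted casualty score; the weights below are an assumption, not something stated in this post.

# Assumed ECLO definition per accident: weighted sum of casualty counts
def eclo(deaths, serious, minor, injured):
    return 10 * deaths + 5 * serious + 3 * minor + 1 * injured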
 

 

import numpy as np
from sklearn.preprocessing import LabelEncoder

# Extract the string (object-dtype) feature columns
categorical_features = list(train_x_1.dtypes[train_x_1.dtypes == "object"].index)

for i in categorical_features:
    le = LabelEncoder()
    le = le.fit(train_x_1[i])
    train_x_1[i] = le.transform(train_x_1[i])

    # Append categories that appear only in the test set so transform does not fail on them
    for case in np.unique(test_x_1[i]):
        if case not in le.classes_:
            le.classes_ = np.append(le.classes_, case)
    test_x_1[i] = le.transform(test_x_1[i])

 

 

import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
from sklearn.feature_selection import SelectFromModel
 

 

X = train_x_1
y = train_y_5

# Split the data into training and validation sets, log1p-transforming the ECLO target
X_train, X_test, y_train, y_test = train_test_split(X, np.log1p(y), test_size=0.2, random_state=42)

# Create an XGBoost Regressor
model = XGBRegressor(
            max_depth=8,
            learning_rate=0.01,
            subsample=0.9,
            colsample_bytree=0.9,
            random_state=42,
            min_child_weight=50,
            objective='reg:squarederror',
            eval_metric='rmse')

model.fit(X_train, y_train)

# Display feature importances
feature_importances = model.feature_importances_
feature_names = X.columns
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)
 

 

# Keep only the features with non-zero importance
sel_features = feature_importance_df[feature_importance_df['Importance']>0]['Feature']

train_x_1 = train_x_1[sel_features]
test_x_1 = test_x_1[sel_features]

train_x_1
 

 

5. CatBoost fit (handles categorical encoding automatically)

# CatBoost Regression Model
from catboost import CatBoostRegressor
 
# Initialize the CatBoostRegressor with RMSE as the loss function
model = CatBoostRegressor(loss_function='RMSE', iterations=5000, depth=9, l2_leaf_reg=3)

 
# Fit the model on the training data with verbose logging every 100 iterations
model.fit(X_train, y_train, verbose=100)
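
The heading above mentions CatBoost's automatic handling of categorical features, but this fit reuses the already label-encoded X_train. If the raw string columns were kept instead, the usual approach is to pass them via cat_features; a minimal sketch under that assumption:

# Sketch only: let CatBoost encode raw categorical columns itself
# (assumes the object-dtype columns were NOT label-encoded beforehand)
cat_cols = list(X_train.select_dtypes(include='object').columns)
cat_model = CatBoostRegressor(loss_function='RMSE', iterations=5000, depth=9,
                              l2_leaf_reg=3, cat_features=cat_cols)
cat_model.fit(X_train, y_train, verbose=100)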
 

 

# Import the mean squared error (MSE) function from sklearn and alias it as 'mse'
from sklearn.metrics import mean_squared_error as mse
 
# Generate predictions on the training and validation sets using the trained 'model'
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
 
# Calculate and print the Root Mean Squared Error (RMSE) for training and validation sets
print("Training RMSE: ", np.sqrt(mse(y_train, y_train_pred)))
print("Validation RMSE: ", np.sqrt(mse(y_test, y_test_pred)))
 

 

 

 

 
