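rubus0304's Blog
[Project Day 3]
1. Preprocessing
2. Derived variables (time information)
3. Derived variables (spatial information)
train_df_9 (per-dong counts of CCTV cameras, security lights, child protection zones, and parking lots in Daegu / per-dong deaths, serious injuries, minor injuries, and ECLO / highway presence / road type)
test_df_9 (per-dong counts of CCTV cameras, security lights, child protection zones, and parking lots in Daegu / per-dong deaths, serious injuries, minor injuries, and ECLO / highway presence / road type)
4. Modeling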
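# Test features: drop the ID and the columns not used for modeling
test_x_1 = test_df_9.drop(columns=['ID','군구','사고유형시']).copy()
# Keep the same columns in the training features
train_x_1 = train_df_9[test_x_1.columns].copy()
# Target candidates taken from the training frame
train_y_1 = train_df_9['동사망자수'].copy()
train_y_2 = train_df_9['동중상자수'].copy()
train_y_3 = train_df_9['동경상자수'].copy()
train_y_4 = train_df_9['동부상자수'].copy()
train_y_5 = train_df_9['ECLO'].copy()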
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Collect the string (object-dtype) columns
categorical_features = list(train_x_1.dtypes[train_x_1.dtypes == "object"].index)
# Check the extracted string columns
print(categorical_features)

for i in categorical_features:
    le = LabelEncoder()
    le = le.fit(train_x_1[i])
    train_x_1[i] = le.transform(train_x_1[i])

    # Append categories that appear only in the test set so transform() does not fail
    for case in np.unique(test_x_1[i]):
        if case not in le.classes_:
            le.classes_ = np.append(le.classes_, case)
    test_x_1[i] = le.transform(test_x_1[i])
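The loop above avoids the "unseen label" error by appending test-only categories to le.classes_. Here is a minimal sketch of that idea on toy data. The helper name encode_with_unseen and the sample values are made up for illustration, and it assumes object-dtype values, like the columns selected above. Keep in mind that the appended categories get codes the model never saw during training, so this only keeps transform() from raising an error.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder


def encode_with_unseen(train_values, test_values):
    """Label-encode train/test values, giving test-only categories new codes.

    Hypothetical helper for illustration; it mirrors the loop above.
    """
    train_values = np.asarray(train_values, dtype=object)
    test_values = np.asarray(test_values, dtype=object)

    le = LabelEncoder().fit(train_values)
    train_codes = le.transform(train_values)

    # Append categories that appear only in the test data to the known classes
    for case in np.unique(test_values):
        if case not in le.classes_:
            le.classes_ = np.append(le.classes_, case)
    test_codes = le.transform(test_values)
    return train_codes, test_codes, list(le.classes_)


# Toy data (made up): "C" appears only in the test values
train_codes, test_codes, classes = encode_with_unseen(
    ["A", "B", "A", "D"], ["B", "C", "D"]
)
print(classes)      # fitted classes first, then the appended test-only ones
print(test_codes)   # integer codes for the test values
```

The test-only category ends up at the end of the class list, so its code is simply the next unused integer.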
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
from sklearn.feature_selection import SelectFromModel
X = train_x_1
y = train_y_5
# Split into training and validation sets; the target is log1p-transformed
X_train, X_test, y_train, y_test = train_test_split(X, np.log1p(y), test_size=0.2, random_state=42)
# Create an XGBoost regressor
model = XGBRegressor(
    max_depth=8,
    learning_rate=0.01,
    subsample=0.9,
    colsample_bytree=0.9,
    random_state=42,
    min_child_weight=50,
    objective='reg:squarederror',
    eval_metric='rmse',
)
model.fit(X_train, y_train)
# Compute feature importances and sort them in descending order
feature_importances = model.feature_importances_
feature_names = X.columns
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)
# Keep only the features with non-zero importance
sel_features = feature_importance_df[feature_importance_df['Importance'] > 0]['Feature']
train_x_1 = train_x_1[sel_features]
test_x_1 = test_x_1[sel_features]
# Note: X_train / X_test were split before this step, so they still contain every column
train_x_1
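The selection step keeps only the features whose importance is above zero. Here is a small sketch on made-up numbers. The helper name select_nonzero_features and the toy frame are assumptions for illustration. It also shows one thing to watch for: a split made before the selection (like X_train and X_test above) keeps all of the original columns, so the data needs to be split again if the next model should use only the selected features.

```python
import pandas as pd


def select_nonzero_features(feature_names, importances):
    """Return the feature names with importance > 0, most important first.

    Hypothetical helper for illustration; it mirrors the selection above.
    """
    df = pd.DataFrame(
        {"Feature": list(feature_names), "Importance": list(importances)}
    )
    df = df.sort_values(by="Importance", ascending=False)
    return df.loc[df["Importance"] > 0, "Feature"].tolist()


# Toy frame and importances (made up)
X = pd.DataFrame({"f1": [1, 2, 3, 4], "f2": [0, 0, 0, 0], "f3": [5, 6, 7, 8]})
X_train_before = X.iloc[:3]  # stand-in for a split made before the selection

selected = select_nonzero_features(X.columns, [0.3, 0.0, 0.7])
X = X[selected]  # rebinding X does not change X_train_before

print(selected)                       # features kept after selection
print(list(X_train_before.columns))   # the earlier split still has every column
```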
5. CatBoost - fitting (automatic encoding)
# CatBoost Regression Model
from catboost import CatBoostRegressor
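# Initialize the CatBoostRegressor with RMSE as the loss function
# Note: cat_features is not passed and the string columns were label-encoded above,
# so CatBoost treats every column as numeric here
model = CatBoostRegressor(loss_function='RMSE', iterations=5000, depth=9, l2_leaf_reg=3)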
# Fit the model on the training data with verbose logging every 100 iterations
model.fit(X_train, y_train, verbose=100)
# Import the mean squared error (MSE) function from sklearn and alias it as 'mse'
from sklearn.metrics import mean_squared_error as mse
# Generate predictions on the training and validation sets using the trained 'model'
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
# Calculate and print the RMSE for the training and validation sets (on the log1p scale of the target)
print("Training RMSE: ", np.sqrt(mse(y_train, y_train_pred)))
print("Validation RMSE: ", np.sqrt(mse(y_test, y_test_pred)))
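Since the target was transformed with np.log1p before the split, the RMSE values printed above are on the log scale, which is the same as RMSLE on the original ECLO scale. Predictions need np.expm1 to get back to ECLO units. Here is a quick numeric check with made-up numbers; the helper names rmse and rmsle are hypothetical.

```python
import numpy as np


def rmse(a, b):
    """Root mean squared error between two arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on the original scale."""
    return rmse(np.log1p(y_true), np.log1p(y_pred))


# Made-up ECLO values and made-up log-scale predictions
y_true = np.array([3.0, 5.0, 10.0])
y_pred_log = np.log1p(y_true) + np.array([0.1, -0.2, 0.05])

# Convert the log-scale predictions back to the original ECLO scale
y_pred = np.expm1(y_pred_log)

log_scale_rmse = rmse(np.log1p(y_true), y_pred_log)
original_scale_rmsle = rmsle(y_true, y_pred)
print(log_scale_rmse, original_scale_rmsle)  # the two values agree
```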