Problem Statement

The task was to build a model for an insurance company that predicts the propensity to pay the renewal premium, and to design an incentive plan for its agents that maximizes the net revenue (i.e., renewals minus the incentives given to collect them) collected from policies after their issuance.

Information about policy holders' past transactions, along with their demographics, was given. The solution was evaluated on two components: the base probability of receiving a premium on a policy without any incentive, and the monthly policy-level incentive plan for agents that maximizes the net revenue. Source: Analytics Vidhya

Modeling

Data preprocessing: Label encoding followed by one-hot encoding was performed for the two categorical variables, sourcing channel and residence area type. Mean imputation was used for the application underwriting score, since it is a continuous variable, and median imputation was used for the count of late payments.
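
The preprocessing steps above could look roughly like the following minimal sketch. The column names and the toy values are assumptions made for illustration, not taken from the challenge data, and `pandas.get_dummies` is used here to one-hot encode the string categories directly rather than label encoding them first.

```python
import numpy as np
import pandas as pd

# Toy frame; column names and values are assumed for illustration only.
df = pd.DataFrame({
    "sourcing_channel": ["A", "B", "A", "C"],
    "residence_area_type": ["Urban", "Rural", "Urban", "Urban"],
    "application_underwriting_score": [99.0, np.nan, 98.0, 97.0],
    "count_late_payments": [0.0, 1.0, np.nan, 5.0],
})

# Mean imputation for the continuous underwriting score.
score = "application_underwriting_score"
df[score] = df[score].fillna(df[score].mean())

# Median imputation for the count of late payments.
late = "count_late_payments"
df[late] = df[late].fillna(df[late].median())

# One-hot encode the two categorical columns.
encoded = pd.get_dummies(df, columns=["sourcing_channel", "residence_area_type"])
```

In a real pipeline the imputation statistics would normally be computed on the training split only and then reused for the development and test data.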

Train-development split: 25% of the training data was used to create the development (validation) set.
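
A 25% hold-out of this kind can be sketched with scikit-learn's `train_test_split`. The placeholder data, the stratification on the label, and the fixed `random_state` are assumptions; the original write-up does not say whether the split was stratified or seeded.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # placeholder feature matrix
y = np.array([1] * 180 + [0] * 20)   # placeholder labels, 1 = premium renewed

# Hold out 25% of the training data as the development set.
# stratify=y keeps the renewal rate equal in both parts (an assumption here).
X_train, X_dev, y_train, y_dev = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
```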

Algorithms: Gradient boosting, XGBoost and logistic regression with TensorFlow were implemented for the data challenge. Hyperparameter tuning was performed for XGBoost and Gradient Boosting algorithms. The following parameters were tuned successively for the XGBoost algorithm:

  • Maximum depth of tree and minimum sum of weights of all observations required in a child
  • Gamma (specifies minimum loss reduction required to make a split)
  • Subsample (denotes fraction of observations to be randomly sampled for each tree) and fraction of columns to be randomly sampled for each tree
  • Lambda (L2 regularization term for weights) and alpha (L1 regularization term for weights)
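
One way to picture this successive tuning is as a list of small grids that are searched one after another, with the winners of each stage carried into the next. In the sketch below the keys are the parameter names used by XGBoost's scikit-learn wrapper, while the candidate values and the helper `tune_successively` are hypothetical and not taken from the project.

```python
from itertools import product

# One grid per tuning stage, in the order of the list above.
# Candidate values are illustrative assumptions.
xgb_tuning_stages = [
    {"max_depth": [3, 5, 7], "min_child_weight": [1, 3, 5]},
    {"gamma": [0.0, 0.1, 0.2]},
    {"subsample": [0.6, 0.8, 1.0], "colsample_bytree": [0.6, 0.8, 1.0]},
    {"reg_lambda": [0.1, 1.0, 10.0], "reg_alpha": [0.0, 0.1, 1.0]},
]


def tune_successively(stages, evaluate, base_params=None):
    """Search each stage's grid in turn, keeping the best values found so far.

    `evaluate(params)` is a user-supplied function returning a score
    where higher is better (for example a cross-validated ROC-AUC).
    """
    best = dict(base_params or {})
    for grid in stages:
        names = list(grid)
        candidates = [
            dict(zip(names, values))
            for values in product(*(grid[name] for name in names))
        ]
        # Score every candidate together with the parameters fixed earlier.
        winner = max(candidates, key=lambda c: evaluate({**best, **c}))
        best.update(winner)
    return best
```

Here `evaluate` might wrap a cross-validated fit of `xgboost.XGBClassifier(**params)`. Tuning in stages is much cheaper than a full grid over all seven parameters, at the cost of possibly missing interactions between stages.
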

The following parameters were tuned successively for the Gradient Boosting algorithm:

  • Maximum depth of tree and minimum number of samples required in a node to be considered for splitting
  • Minimum number of samples required in a terminal node or leaf
  • Maximum number of features to consider while searching for a best split
  • Subsample (fraction of observations to be selected for each tree)

The best model, selected based on the ROC-AUC score on the development set, was a gradient boosting algorithm with tuned hyperparameters. A function was written to estimate the policy-level incentive plan for the agents using the renewal premium predictions and the given relationships.
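
The staged tuning of the Gradient Boosting model and its evaluation on the development set could be sketched as follows with scikit-learn. The synthetic data, the grid values, the number of trees, and the 3-fold cross-validation are all assumptions chosen to keep the example small; they are not the settings used for the submission.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the policy data (assumption).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# One grid per tuning stage, in the order listed above; values are illustrative.
stages = [
    {"max_depth": [2, 3], "min_samples_split": [2, 20]},
    {"min_samples_leaf": [1, 10]},
    {"max_features": ["sqrt", None]},
    {"subsample": [0.8, 1.0]},
]

best_params = {"n_estimators": 30, "random_state": 0}
for grid in stages:
    search = GridSearchCV(
        GradientBoostingClassifier(**best_params),
        param_grid=grid,
        scoring="roc_auc",
        cv=3,
    )
    search.fit(X_train, y_train)
    # Freeze this stage's winners before tuning the next group.
    best_params.update(search.best_params_)

# Refit with the tuned parameters and score on the held-out development set.
model = GradientBoostingClassifier(**best_params).fit(X_train, y_train)
dev_auc = roc_auc_score(y_dev, model.predict_proba(X_dev)[:, 1])
print(f"development ROC-AUC: {dev_auc:.3f}")
```

The same development-set ROC-AUC can be computed for each candidate algorithm, and the one with the highest score kept as the final model.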

Ranking

The submission ranked 40 out of 884 submissions on the private leaderboard.

Repository