Loan Risk Analysis in the Banking Sector

London, UK

**Abstract**:

**Purpose**: A bank is a financial institution that takes deposits from the public and extends credit to customers, companies, and other borrowers. Banks make money mainly by lending at rates higher than their cost of funds, and they are one of the major pillars of the global economy. Because they play a crucial role in a country's financial stability, the risk attached to the credit they extend is always a significant concern. Loan risk analysis has therefore become imperative in the banking domain. It helps banks and economists forecast credit, market, liquidity, and operational risks, and so reduces exposure to unforeseen losses. It also enables a bank to estimate the probability that a debtor will repay a loan and the likelihood of a missed payment.

**Design/Approach**: Banks collect customer data, such as credit history and credit limit, for risk assessment. Techniques such as statistical predictive analysis are applied to the customer database to evaluate credit risk and creditworthiness, which helps the bank decide whether to issue the loan or reject the application.

**Findings/Results**: The study covers 924 credit files for loans granted to industrial firms by a commercial bank between 2003 and 2006. The naïve Bayesian classifier algorithm is used for classification and probability prediction, and it achieves a correct classification rate of about 63.85%. To evaluate the performance of the model, a Receiver Operating Characteristic (ROC) curve is plotted. The paper builds on the fact that the central bank obliged all commercial banks to conduct a survey to collect qualitative data for better credit rating of borrowers.

**Important terms**: Banking sector, Bayesian classifier algorithm, Default risk, Risk assessment, ROC Curve.


## I. INTRODUCTION[1]

Credit risk is one of the major risks in the banking system, which makes loan risk analysis essential. A variety of methods are used to calculate the level of risk. Committees on banking supervision define credit risk as the potential that a borrower fails to meet its obligations in accordance with the agreed terms and conditions. Banks therefore need to identify and classify their debtors or clients. Classification is based on customer information, which also facilitates the evaluation of related subjective factors. Financial ratio analysis is one of the key techniques of financial statement analysis, as it concisely summarizes the results of detailed and complicated computations. Ratios are further divided into objective and subjective ratios, which together indicate the financial condition of the business. Objective ratios are derived from sources such as cash flow statements and the balance sheet, while the bank determines the subjective ones.

For this binary classification problem, either an external mapping approach or an internal rating system can be used to calculate capital requirements for credit risk. The latter is easier to implement and is therefore used to evaluate credit scoring methods for loan approval or rejection. With this approach, client information is formalized and the scoring system forms a basis for credit approval. The data is first pre-processed (for example, cleaned) and then classified. Classification groups the data by type of client or borrower (individuals, firms, businesses, etc.), and a regression technique is used to evaluate the score. Linear regression is best suited to continuous rather than categorical data and can therefore be used to predict the score.

Recent research suggests that Bayesian classification with naïve Bayes classifiers can predict bankruptcy better than other existing techniques, because of its simplicity and its dynamic nature: it can incorporate more data in the future.

*The Business Problem*


i) The conceptual framework of agency theory: the main application of this theory to the lender-borrower problem is the determination of the optimal lending contract. Usually, the borrower has better information than the lender about the project, its finances, its returns, and its market risk. This information asymmetry leads to a principal-agent problem, which is modeled either as moral hazard (ex-post) or as adverse selection (ex-ante).

ii) Accurate assessment of the credit risk score to detect early warning signs of default.

iii) High-dimensional data.

*Solution*


i) Research has shown that the optimal contract that solves this problem is the so-called standard (or simple) debt contract. A standard debt contract is characterized by its face value, which should be repaid by the agent when the project is finished.

To overcome the asymmetric information problem and its consequences for credit risk assessment in practice, banks use collateral, bankruptcy prediction modeling, or both.

ii) There are two approaches to credit scoring:

a) the structural or market-based model: default occurs when the value of the firm's assets falls below some critical level.

b) the empirical or accounting-based model: the relationship between default and the characteristics of a firm is learned from the data.

iii) Naïve Bayesian classifiers work well on high-dimensional data.


*Sample and Data*

The training data set comprises a number of cases, each containing values for a range of input and output variables. The first decision to make is which variables to use. The second concerns the subjects whose behavior we want to predict. In our case, the variables are indicators of default risk and the subjects are borrowers. Banks were asked to provide credit risk classes for their borrowers. At the end of each quarter, the bank sorts these files into five groups, each corresponding to a risk class. The first class corresponds to healthy firms with no delay of payment; the remaining four classes correspond to riskier firms with payment delays of three months, six months, nine months, and one year or more, respectively. The dependent variable therefore captures default: we use a dummy variable, Y, which equals 0 if the firm is classified as healthy and 1 otherwise. Input variables were grouped into two categories: non-cash-flow ratios and cash-flow ratios.

__Dependent variable__: Y = 0 if there is no delay of payment

Y = 1 if there is a payment delay of three months or more
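The mapping from the five risk classes to the dummy variable can be sketched as follows. This is only an illustration: the integer class codes 0 to 4 and the name `to_dummy` are assumptions made here, not taken from the paper.

```python
# Hypothetical integer codes for the five risk classes described above:
# 0 = no delay, 1 = three months, 2 = six months, 3 = nine months, 4 = one year or more.
HEALTHY_CLASS = 0


def to_dummy(risk_class: int) -> int:
    """Return Y = 0 for the healthy class and Y = 1 for any riskier class."""
    return 0 if risk_class == HEALTHY_CLASS else 1


labels = [to_dummy(c) for c in range(5)]
```

Only the first class maps to Y = 0; all four delayed classes collapse into Y = 1.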

__Independent variables__: Default risk prediction depends, in general, on a good appraisal of a company's risk-return profile. Financial ratios drawn from financial statements (balance sheet, income statement, and cash flow statement) are commonly used. Financial ratio analysis groups the ratios into categories that describe different facets of a company's finances and operations (liquidity, activity or operational, leverage, and profitability).

In our experiment, we retain 24 financial and non-financial indicators: 22 are financial ratios and 2 are not.

## II. Methodology


*Data Mining Methodology:*


In this model, the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology has been used. It comprises phases such as business understanding (the business problem) and data understanding (the data collected about clients), as described in the problem statement. Moreover, the process is iterative rather than strictly linear: client demographics and other useful information are analyzed, and earlier phases can be revisited at any point if a conflict arises. This produces a robust and meaningful outcome from the collected data, which here is the prediction of default.

*Pre-Processing Methodology:*


The main tasks in data pre-processing are collecting and cleaning the data so that it can be used for prediction. The steps include importing the data, checking for missing values, identifying categorical data, and detecting outliers. Creating the training data is also an important part of pre-processing, which accounts for a major share of the data mining effort.

In this research, the collected data was a mix of textual, categorical, and continuous quantitative types. The data therefore had to be converted to numerical form so that the Bayesian classifier could be applied. The algorithm predicts a probability, which is a number between 0 and 1. Before the final probability was computed, a credit score was also evaluated, which is likewise a quantitative result. Linear regression was used to predict the credit score value.

There was also a need to group the data into clusters or classes, such as business class and industry class. Because the collected data is textual, an algorithm like k-means clustering, which works best on quantitative data, cannot be applied directly. The authors might have used one-hot encoding or another technique, but this is not clearly stated in the paper under review.
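One common way to turn textual categories into numbers is one-hot encoding. The sketch below shows the idea in plain Python; the paper does not confirm that this technique was used, and the function name and the example industry labels are hypothetical.

```python
def one_hot_encode(values):
    """Map each textual category to a 0/1 indicator vector.

    Returns the sorted list of categories and one indicator row per input value.
    """
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one position is set per row
        rows.append(row)
    return categories, rows


# Hypothetical industry labels, used only for illustration.
categories, rows = one_hot_encode(["textile", "food", "textile"])
```

Each row contains a single 1, so the encoded columns can be fed to algorithms that expect numeric input.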

To get a better idea of the data before running the naïve Bayes classifier models, a test of mean differences is performed between the two risk classes defined above (the healthy and risky groups). This analysis gives a first impression of the data, as it shows whether the two classes differ in terms of financial ratios.
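A mean-difference test for a single ratio can be sketched as below. The paper does not say which test statistic was used; Welch's t statistic is assumed here because it does not require equal variances, and the sample values are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(healthy, risky):
    """Welch's t statistic for the difference in means of two groups."""
    m1, m2 = mean(healthy), mean(risky)
    v1, v2 = variance(healthy), variance(risky)  # sample variances
    n1, n2 = len(healthy), len(risky)
    return (m1 - m2) / sqrt(v1 / n1 + v2 / n2)


# Made-up values of one financial ratio in each risk class.
t_stat = welch_t([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

A positive statistic means the ratio is higher on average in the healthy group; its significance would still have to be checked against the t distribution.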

*Algorithm*


In commercial banks, credit risk assessment is crucial for distinguishing reliable clients from unreliable ones. Models that can predict defaults accurately are therefore highly desirable. The simple Bayesian classifier was applied to estimate the posterior probabilities of default, under the assumption that the variables are independent given Y. It thus represents conditional relationships in the probabilistic sense.
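Under this conditional independence assumption, the posterior probability of default takes the standard naïve Bayes form. The notation $x_1, \dots, x_n$ for the input ratios is introduced here for illustration and does not come from the paper:

$$
P(Y = 1 \mid x_1, \dots, x_n) = \frac{P(Y = 1) \prod_{i=1}^{n} P(x_i \mid Y = 1)}{\sum_{y \in \{0, 1\}} P(Y = y) \prod_{i=1}^{n} P(x_i \mid Y = y)}
$$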

Naive Bayes performs well on multi-class problems and on text classification. When dealing with text, it is common to treat each unique word as a feature, which produces a large number of features. The relative simplicity of the algorithm and its feature-independence assumption make Naive Bayes a strong performer for classifying text. If the conditional independence assumption holds, a Naive Bayes classifier converges more quickly than discriminative models such as logistic regression, so it needs less training data; even when the assumption does not hold, it often still performs well in practice. It also requires less training time and works well on small training sets. The naïve Bayesian algorithm is therefore a reasonable fit for this data, as it predicts a probability that answers the question of whether the borrower will default. If the predicted probability of default exceeds 0.5, the model predicts that the client could be a defaulter and that credit risk is attached, so the bank can reject the application.
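The sketch below illustrates how a naïve Bayes classifier could turn financial ratios into a default probability. It assumes Gaussian likelihoods for each ratio, which the paper does not specify, and the two features and all values are invented for illustration.

```python
from math import exp, log, pi
from statistics import mean, pvariance


def fit(X, y):
    """Estimate a class prior and per-feature Gaussian parameters for each class."""
    model = {}
    for label in set(y):
        rows = [x for x, target in zip(X, y) if target == label]
        prior = len(rows) / len(X)
        # Small constant avoids a zero variance (an assumption of this sketch).
        stats = [(mean(col), pvariance(col) + 1e-9) for col in zip(*rows)]
        model[label] = (prior, stats)
    return model


def log_gauss(value, mu, var):
    """Log density of a normal distribution with mean mu and variance var."""
    return -0.5 * log(2 * pi * var) - (value - mu) ** 2 / (2 * var)


def default_probability(model, x):
    """Posterior P(Y = 1 | x), assuming features are independent given Y."""
    log_joint = {}
    for label, (prior, stats) in model.items():
        log_joint[label] = log(prior) + sum(
            log_gauss(v, mu, var) for v, (mu, var) in zip(x, stats)
        )
    top = max(log_joint.values())  # subtract the maximum for numerical stability
    weights = {label: exp(v - top) for label, v in log_joint.items()}
    return weights.get(1, 0.0) / sum(weights.values())


# Invented training data: [leverage ratio, liquidity ratio]; Y = 1 means default.
X = [[0.10, 2.0], [0.20, 1.8], [0.15, 2.2], [0.80, 0.5], [0.90, 0.7], [0.85, 0.4]]
y = [0, 0, 0, 1, 1, 1]
model = fit(X, y)
p_risky = default_probability(model, [0.88, 0.5])
p_safe = default_probability(model, [0.12, 2.1])
reject_risky = p_risky > 0.5  # decision rule with a 0.5 threshold
```

Working in log space keeps the product of many small likelihoods from underflowing when the number of ratios grows.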

A ROC curve is also plotted to visualize the performance of the binary classifier, that is, a classifier with two possible outcomes for default. In essence, it summarizes the confusion matrices produced across all classification thresholds.
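How a ROC curve and its area under the curve (AUC) are derived from predicted probabilities can be sketched as follows. The function names and the four example scores are illustrative and are not taken from the paper's data.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Each threshold yields one confusion matrix; only TP and FP are needed.
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Illustrative labels (1 = default) and predicted default probabilities.
points = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
area = auc(points)
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of defaulters above non-defaulters.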

## III. Result

The main results show that introducing cash-flow variables improves prediction quality: classification rates rise from 59.63 percent in the non-cash-flow model to 63.85 percent in the cash-flow model. To assess the performance of the model, a ROC curve was plotted. The result shows that the AUC criterion is of the order of 69 percent.

An example of ROC curve:

## IV. Conclusion

When we look at the significance of the mean differences, we find that, overall, the good indicators are higher in the healthy group, while the bad indicators are higher in the risky group.

To assess the performance of the model, a ROC curve was plotted. The result shows that the AUC criterion is of the order of 69 percent. By comparing ROC curves, we can learn about the difference in classification accuracy between two or more classifiers. The higher curve is closer to the ideal classifier and is more accurate. Credit scoring is thus essentially an application of classification techniques, which classify borrowers into different risk groups. The objective of scoring methods is to predict the probability that an applicant or existing borrower will default.

## V. Limitations & Critical Comments

The paper reviewed did not focus on data pre-processing techniques. There was no mention of how the data was cleaned or which algorithms or techniques were used to convert the data types to make them consistent. There were also problems such as multicollinearity among the ratios: the initial set contained 32 variables, of which 8 were removed, so the analysis was done on 24 variables. Had this been ignored, the results could have been poor.
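One possible way to screen out collinear ratios is a pairwise correlation filter, sketched below. The paper does not describe how its 8 variables were removed, so this procedure, the 0.9 threshold, and the ratio names and values are all assumptions made for illustration.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    spread_a = sqrt(sum((x - mean_a) ** 2 for x in a))
    spread_b = sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (spread_a * spread_b)


def drop_correlated(columns, threshold=0.9):
    """Keep a variable only if it is not highly correlated with one already kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Invented ratios: quick_ratio is almost a multiple of current_ratio.
columns = {
    "current_ratio": [1.0, 2.0, 3.0, 4.0],
    "quick_ratio": [2.0, 4.0, 6.1, 8.0],
    "leverage": [3.0, 1.0, 4.0, 2.0],
}
kept = drop_correlated(columns)
```

The filter is order-dependent: the first variable of a correlated pair is kept, so the column order encodes which ratio is preferred.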

The problem of dirty data or highly correlated variables could be addressed with techniques such as computing Euclidean distances or plotting the S-shaped (sigmoid) curve of a logistic regression fitted to the data. However, the paper did not discuss any such technique.

Bayesian networks such as Naive Bayes assume that all input variables are conditionally independent given the class. If that assumption does not hold, the accuracy of the Naive Bayes classifier can suffer.

Using the same data, the AUC criterion improves to 83 percent when a neural network (NN) methodology is used. Moreover, if the training data grows rapidly and becomes very large, neural networks would be expected to perform best and give more precise results than the Bayesian algorithm.

References

[1] Han, Jiawei, *ICDM 2008: Proceedings, Eighth IEEE International Conference on Data Mining, 15-19 December 2008, Pisa, Italy*. IEEE Computer Society, 2008.

[2] Koski, Timo and Noble, John M., *Bayesian Networks: An Introduction*, ch. 1, pp. 1–31.

[3] Löffler, G. and Posch, P. N., *Credit Risk Modeling using Excel and VBA*, ch. 2, pp. 27-30.

[4] Linoff, Gordon S., *Data Mining Techniques*.

[5] Tan, Pang-Ning and Kumar, Vipin, *Introduction to Data*, ch. 1-5, pp. 19-63, 98-110, 145-198.

[6] https://www.youtube.com/watch?v=RixQygYyDKI&t=1s

[7] https://app.pluralsight.com/library/courses/understanding-applying-logistic-regression/table-of-contents Modules 1,2,3.

[8] https://app.pluralsight.com/library/courses/understanding-applying-linear-regression/table-of-contents Modules 1,2.