How happy would you be knowing ALL segments whose default rate is more than twice the average? The advent of BigData and cloud technologies has made such credit anomaly (or fraud) detection possible. For instance, our algorithm has found all segments with at least 40 customers and at least a 50% default rate in the German credit data. This post suggests a new way of identifying patterns of highly bad-performing segments using BigData technology and compares the new method with existing ones. It will show that the new method not only works in a complementary way with the existing tools but also delivers new insights that the existing methods could not provide.



A new method
How it works
In a nutshell, it generates a list of all possible segments and evaluates whether each segment satisfies both of the conditions below (a minimal code sketch follows the list):
 Size: a segment should have enough entities (e.g. loans, transactions, customers, etc.); otherwise, a resulting solution won't be meaningful. For instance, a 100% default rate among 3 loans can easily be due to randomness.
 Event rate: a segment should have a sufficient number of events; otherwise, a resulting solution won't be useful. For example, a 30% default rate in a 100-loan segment of a loan portfolio with a 30% overall default rate does not bring any value to the business.
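To make the idea concrete, below is a minimal, single-machine sketch in Python. It only illustrates the exhaustive search, not the production implementation: the DataFrame `df`, the binary target column `default`, the `max_depth` cap, and the thresholds are all assumptions for the example.

```python
from itertools import combinations

import pandas as pd

def find_segments(df, target="default", min_size=40, min_rate=0.5, max_depth=3):
    """Brute-force sketch: enumerate segments defined by combinations of
    column values and keep those meeting both the size and event-rate bars."""
    feature_cols = [c for c in df.columns if c != target]
    results = []
    for depth in range(1, max_depth + 1):
        for cols in combinations(feature_cols, depth):
            # Every observed value combination for these columns is a candidate segment.
            for values, group in df.groupby(list(cols)):
                values = values if isinstance(values, tuple) else (values,)
                size, rate = len(group), group[target].mean()
                if size >= min_size and rate >= min_rate:
                    results.append({**dict(zip(cols, values)), "size": size, "rate": rate})
    return pd.DataFrame(results)
```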
Why now?
Though it sounds easy, generating a list of all candidates and evaluating each is computationally very expensive (in computer science terms, the time complexity is exponential). The recent advent of the two technologies below has made our algorithm practical.
BigData Technology
The algorithm involves a huge number of tasks that would take a single computer far too long to finish. The MapReduce framework distributes the tasks across many computers and orchestrates the distribution so that they complete much faster without loss of data.
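As a toy illustration of that split, the map step can evaluate each candidate column combination independently, and the reduce step can merge the qualifying segments. The sketch below uses Python's multiprocessing as a stand-in for a real MapReduce cluster; the helper names and the `default` target column are assumptions carried over from the sketch above, not the actual implementation.

```python
from functools import partial, reduce
from itertools import combinations
from multiprocessing import Pool

def evaluate_candidate(cols, df, target="default", min_size=40, min_rate=0.5):
    """Map step: check every value combination for one set of columns."""
    hits = []
    for values, group in df.groupby(list(cols)):
        if len(group) >= min_size and group[target].mean() >= min_rate:
            hits.append((cols, values, len(group), group[target].mean()))
    return hits

def find_segments_parallel(df, feature_cols, max_depth=3, workers=8):
    candidates = [cols for d in range(1, max_depth + 1)
                  for cols in combinations(feature_cols, d)]
    with Pool(workers) as pool:
        # Map: each worker evaluates a slice of the candidate segments.
        per_candidate_hits = pool.map(partial(evaluate_candidate, df=df), candidates)
    # Reduce: merge the per-candidate hit lists into one result set.
    return reduce(lambda a, b: a + b, per_candidate_hits, [])
```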
Cloud Technology
We only need the computers while the algorithm is running, and cloud technology makes exactly that possible: we can rent hundreds of powerful servers temporarily, run the algorithm quickly, and release them.
How is it different from decision tree?
Though it sounds similar to the decision tree algorithm, it is fundamentally different. The decision tree algorithm considers sample size when making branches, and the top branches are usually those that affect larger groups in a dataset. This effectively prevents the decision tree algorithm from considering or detecting small anomalous segments.
Pros & cons of the new method
This approach essentially automates the investigation work of manual reviewers and the descriptive analytics tasks of data scientists.
Pros
 Insight
  When patterns are not obvious (which is usually the case when trying to detect anomalies), data scientists often resort to more complicated algorithms. This can improve the performance of the resulting models, but the models often become so complicated that they effectively run as a black box. In contrast, the algorithm clearly shows the characteristics and magnitude of each anomalous segment.
 Early and complete detection
  Because it searches all segments exhaustively, it can detect small fraud segments as well, which greatly helps detect fraud early.
 Tailored
 The algorithm can work with any dataset (credit, fraud, marketing, you name it) and deliver customized results.
Cons
 Overfitting
  Because it finds all combinations of values that satisfy the two conditions (absolute volume and relative frequency of outlier events in a group), some of the results are due to overfitting and are not worth attention. This can be mitigated by first looking at the results with the highest lifts (i.e. those showing the strongest relationship with anomalous cases); a short sketch of lift-based ranking follows this list. Moreover, stricter conditions can be provided when running the algorithm to make the result set smaller and more meaningful.
 Time complexity
  The algorithm's time complexity is essentially exponential, so as the amount of data grows it will require a significant amount of time. This can be handled by renting more or stronger servers and/or by sampling the data and keeping only meaningful columns. In addition, for an initial run, a user can set very strict conditions to shorten the computation time.
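As a sketch of the lift-based ranking mentioned above: lift is simply a segment's event rate divided by the portfolio-wide rate, so a segment at the overall rate has a lift of 1.0. The `results` DataFrame below is assumed to come from the enumeration sketch earlier in this post.

```python
def rank_by_lift(results, overall_rate):
    """Rank segments by lift = segment event rate / overall event rate.
    Higher lift means a stronger relationship with the outlier outcome."""
    ranked = results.copy()
    ranked["lift"] = ranked["rate"] / overall_rate
    return ranked.sort_values("lift", ascending=False)

# Example: with a 30% portfolio default rate, a 75% segment has lift 2.5.
# ranked = rank_by_lift(segments, overall_rate=0.30)
```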
Examples & use cases
German credit data
 Data source: UCI Machine Learning Repository
 Number of rows: 1,000
 Frequency of event of interest: 300 (30.0%)
 Conditions provided:
 Size: at least 40 loans
 Rate: at least 50% default rate
 Number of segments found: 1,649
We applied the algorithm to the famous German credit data. The algorithm found 1,649 segments with at least 40 loans and at least a 50% default rate, and our tests confirmed that the algorithm's solutions are accurate (every solution satisfies both the size and rate requirements) and complete (there are no qualifying segments beyond those the algorithm yields). Below are the two groups with the highest default rates; a sketch for reproducing and spot-checking them follows the group descriptions.
 Group 1:
 Group size: 40
 Default rate: 75.0% (30 defaults)
 Descriptions
 Status of existing checking account: Less than zero balance
 Installment rate in percent of disposable income: 4
 Job: Skilled employee / official
  Number of existing credits at this bank: 1
  Other installment plans: None
 Other debtors and guarantors: None
 Foreign worker: Yes
 Group 2:
 Group size: 42
 Default rate: 73.8% (31 defaults)
 Descriptions
 Status of existing checking account: Less than zero balance
 Installment rate in percent of disposable income: 4
 Credit history: Existing credits paid back duly until now
 Telephone: None
 Other installment plans: None
 Other debtors and guarantors: None
 Foreign worker: Yes
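As a rough illustration of how these results could be reproduced and spot-checked, the snippet below runs the enumeration sketch from earlier with the stated conditions and then verifies Group 1 directly. It assumes the UCI german.data file has already been loaded into a DataFrame `df` with descriptive column names and decoded labels (the raw file uses codes such as A11 for a negative checking balance), so every column name and value string here is illustrative rather than the actual UCI header.

```python
# Run the enumeration sketch with the conditions used above.
# segments = find_segments(df, target="default", min_size=40, min_rate=0.5)

# Spot-check Group 1's reported size and default rate (column names and
# value labels are assumed; they are not the raw UCI attribute codes).
group1 = df[
    (df["checking_status"] == "less_than_zero")
    & (df["installment_rate"] == 4)
    & (df["job"] == "skilled_employee")
    & (df["existing_credits"] == 1)
    & (df["other_installment_plans"] == "none")
    & (df["other_debtors"] == "none")
    & (df["foreign_worker"] == "yes")
]
print(len(group1), group1["default"].mean())  # expected: 40 loans, 0.75 default rate
```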
Bank marketing campaigns
Find more in this blog post: Who responds to marketing campaigns?
Conclusion
Using the algorithm in conjunction with the existing methods will deliver earlier, more complete, and more robust risk management to your organization. If you have any questions or want to apply the algorithm to a dataset in your organization, please contact us by clicking the "Contact" button in the menu above.