Concepts plus practice: this article walks you through the RFM model for data analysis [with a hands-on Kaggle project]

Here is today's study share!

The RFM model is an important tool for measuring customer value and customer profitability. Among the many analysis approaches used in customer relationship management (CRM), the RFM model is one of the most widely cited.

The RFM model is therefore something every data analyst should master. This article explains the model in detail and then applies it in a Kaggle project. Bookmark it, and you will no longer need to worry about not understanding RFM or not knowing how to segment users.

The RFM model describes a customer's value with three indicators: how recently the customer purchased, how often they purchase overall, and how much they spend.

R value: Recency, the time since the last purchase

  • Recency is the time elapsed between the customer's last purchase and now. For example, when did you last buy a car, or when did you last buy a record?

In theory, customers who purchased more recently are better customers, and they are the most likely to respond to an immediate offer of goods or services. If marketers can only grow their results by taking market share from competitors, they need to watch consumers' purchase behavior closely, and recency is the first tool they should reach for.

Use: the R value is useful for more than targeting promotions. Recency reports also let marketers monitor the health of the business. Good marketers check them regularly to follow the trend. If the monthly report shows that the number of customers who purchased very recently (within the last month) is increasing, the company is growing steadily; conversely, if fewer and fewer customers have purchased within the last month, the company is heading in an unhealthy direction.

F value: Frequency, the purchase frequency

  • Frequency is the number of purchases a customer makes within a given period. The customers who buy most often are usually also the most satisfied ones. Getting customers to buy more often means taking market share from competitors and winning revenue that would otherwise go to them.

Based on this indicator, customers can be divided into five tiers: a customer who has purchased once is a new customer, twice a potential customer, three times an old customer, four times a mature customer, and five or more times a loyal customer. The operator's goal is to move customers up the tiers.
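The five tiers can be written down as a small lookup. This is only a sketch: the function name `frequency_tier` is made up for illustration, and the tier labels simply follow the text.

```python
def frequency_tier(purchase_count: int) -> str:
    """Map a customer's number of purchases to the tier named in the text."""
    if purchase_count < 1:
        raise ValueError("a customer needs at least one purchase")
    tiers = {1: "new", 2: "potential", 3: "old", 4: "mature"}
    # Five or more purchases all fall into the top tier.
    return tiers.get(purchase_count, "loyal")


print([frequency_tier(n) for n in (1, 2, 3, 4, 5, 12)])
```

In a real store the cut-offs would probably need tuning to the product category, as the note on cross-category comparison suggests.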

Note: purchase frequency often differs greatly between product categories. Take wedding products and snacks: the former are usually bought only once (any more often and society would be in chaos, haha), while the latter are consumables that get used up quickly, so repeat purchases come easily. The F value is therefore not suitable for comparisons across categories.

M value: Monetary, the amount spent

  • Like frequency, the amount spent is measured over a limited time frame: it is the amount spent within a given period (usually one year). It can also be used to check Pareto's law (commonly known as the 80/20 rule), namely that 80% of revenue comes from 20% of customers.
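As a rough illustration of checking the 80/20 rule, the sketch below computes the share of total spend contributed by the top 20% of customers. The function name `top_share` and the spend figures are invented for the example.

```python
def top_share(amounts, top_fraction=0.2):
    """Share of total spend contributed by the top `top_fraction` of customers."""
    ranked = sorted(amounts, reverse=True)
    # Keep at least one customer so tiny samples still work.
    top_n = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:top_n]) / sum(ranked)


# Made-up yearly spend for ten customers.
spend = [5000, 3000, 400, 350, 300, 250, 250, 200, 150, 100]
print(f"top 20% of customers contribute {top_share(spend):.0%} of spend")
```

Real data will rarely land on exactly 80%; the point is only to see how concentrated the revenue is.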

Of the three RFM indicators, the M value is the hardest to use but the most valuable. Take beauty products of a single brand: prices generally vary within a range that the brand's target consumers accept, and a single category is not purchased very often, so for an ordinary store the M value contributes relatively little to customer segmentation.

Customer segmentation based on the RFM model

In CRM practice you can segment customers using one to three of the RFM indicators, as shown in the table below. Keep the number of segments within a range you can actually manage; more is not better. Once users are split into too many groups, your own marketing plans become harder to carry out, and you risk either missing some user groups or contacting the same user repeatedly.

Two criteria guide the choice of indicators: the size of the store's customer base, and the store's product and customer structure.

 Insert picture description here

Size of the customer base: when the store has few customers, segmenting on one or two dimensions is enough; otherwise you can use two or three indicators.

Product and customer structure: if the store's product range is fairly uniform and spend per order varies little, purchase frequency (F value) and amount spent (M value) are highly correlated, and you can use only the easier-to-handle purchase frequency (F value) in place of the amount spent (M value). For a store that has just opened and has not yet built customer stickiness, you can drop purchase frequency (F value) and use recency (R value) or amount spent (M value) directly.
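One way to judge whether F and M are "highly correlated" is to compute their Pearson correlation. The sketch below does this in plain Python; the sample numbers are invented and the 0.9 cut-off is an arbitrary assumption, not something the text prescribes.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up per-customer purchase counts (F) and total spend (M).
frequency = [1, 2, 3, 5, 8]
monetary = [20, 41, 59, 102, 160]

r = pearson(frequency, monetary)
# 0.9 is an arbitrary cut-off chosen for this illustration.
use_f_only = r > 0.9
print(round(r, 3), use_f_only)
```

If the coefficient is close to 1, M adds little information beyond F and the simpler indicator can stand in for both.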

Scoring with the RFM model to output target users

Scoring with the RFM model involves three steps:

  1. Determine the segments for each of the three RFM indicators and the score of each segment;

  2. Calculate each customer's score on the three RFM indicators;

  3. Calculate each customer's total score and select high-quality customers according to the total.

Take the figure above as an example.
 Insert picture description here
We then add up the scores each user obtained under each indicator to get the final score.

Note, however, that the score assigned to each segment of each indicator should not simply be copied from the figure above; it should be set according to the particular store (other people online say the analytic hierarchy process (AHP) can be used for this, but I have not studied it).
Also, when adding the scores it is best to give each indicator a weight first. For example, the final formula could be: score = 0.5R + 0.3F + 0.2M.
For the specific weights, refer to the two criteria mentioned above.
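The weighted total can be sketched as follows. The function name `weighted_rfm_score` is hypothetical, the default weights are the example weights from the text, and the check that the weights sum to 1 is an added assumption that keeps the total on the same scale as the individual scores.

```python
def weighted_rfm_score(r, f, m, weights=(0.5, 0.3, 0.2)):
    """Combine three per-indicator scores into one weighted total."""
    w_r, w_f, w_m = weights
    # Assumption for this sketch: weights are expected to sum to 1.
    if abs(w_r + w_f + w_m - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1")
    return w_r * r + w_f * f + w_m * m


# A customer scoring 5 on recency, 2 on frequency and 3 on monetary value.
print(weighted_rfm_score(5, 2, 3))
```

A newly opened store might, for instance, shift weight away from F, in line with the second criterion above.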

Common strategies based on RFM

RFM is well suited to businesses that sell many kinds of goods whose unit prices are relatively low, or which complement each other and need to be bought repeatedly, such as consumer goods, clothing and small appliances. RFM also suits businesses that sell high-value durable goods together with supporting parts or maintenance services, such as precision machine tools, complete sets of production equipment and printers. It is likewise suitable for commodity wholesale, raw-material trading and some service industries (such as travel, insurance, transport, courier services and entertainment).

RFM can be used to increase the number of customer transactions. Direct mail (DM), widely used in the industry, often goes out to thousands of recipients at a time, which is really a waste of money. According to statistics (for mail-order daily necessities in general), if all customers are divided into five levels by R (Recency), the response rate of the best level, level 5, is three times that of the next level, level 4, because these customers have just completed a transaction and therefore pay more attention to product information from the same company. If customers are instead divided into five levels by M (Monetary), the average response rates of the best and second-best levels show almost no significant difference.
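The text does not say how the five levels are formed. One common reading is equal-sized fifths by rank, which the sketch below assumes; the function name `quintile_levels` and the sample data are invented.

```python
def quintile_levels(values, higher_is_better=True):
    """Assign each value a level from 1 (worst fifth) to 5 (best fifth) by rank."""
    order = sorted(
        range(len(values)),
        key=lambda i: values[i],
        reverse=not higher_is_better,
    )
    levels = [0] * len(values)
    for rank, idx in enumerate(order):
        # rank 0 is the worst value; scale ranks onto levels 1..5
        levels[idx] = rank * 5 // len(values) + 1
    return levels


# Days since last purchase for ten customers: fewer days is better.
recency_days = [3, 10, 25, 40, 60, 90, 120, 200, 300, 365]
print(quintile_levels(recency_days, higher_is_better=False))
```

Ties are broken by position in the input here; a production version would likely want an explicit tie rule.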

Some people use customers' absolute contribution to judge whether customers are being lost, but absolute amounts sometimes misrepresent customer behavior. Because each product may be priced differently and different promotions carry different discounts, it is better to use relative grading (for example, dividing R, F and M each into five levels) and compare how a consumer moves between levels, which reveals relative behavior. From changes in R and F a business can infer how a customer's purchasing is changing and list customers by their likelihood of churn. Then, from the perspective of M (amount spent), it can concentrate on customers with both high contribution and high churn risk, visiting or contacting them as a priority and recovering more business in the most effective way.

If each of the three indicators is split into 4 levels, we get 4x4x4=64 classes of users, each of which would then be marketed to precisely. Obviously 64 classes of users are more than an ordinary human brain can keep track of, let alone design 64 tailored marketing strategies for. In practice we only need to split each dimension in two once; across the 3 dimensions this still gives us 8 groups of users.

(Codes are in R, F, M order; 1 means high, 0 means low)
Important value customers (111): the last purchase is recent, and both purchase frequency and amount are high. These are definitely VIPs!
Important customers to maintain (011): the last purchase was a long time ago, but purchase frequency and amount are both high. This is a loyal customer who has not come by for a while, and we need to take the initiative to keep in touch.
Key development customers (101): the last purchase is recent and the amount spent is high, but the frequency is low. Loyalty is low but the potential is high, so these customers should be developed as a priority.
Important retention customers (001): the last purchase was a long time ago and the frequency is low, but the amount spent is high. These users may be about to churn or may have churned already, so retention measures are needed.
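The three-digit codes used above can be generated mechanically. The following sketch (the helper name `segment_code` is made up) encodes three high/low flags in R, F, M order and confirms that three two-way splits give eight groups.

```python
from itertools import product


def segment_code(r_high: bool, f_high: bool, m_high: bool) -> str:
    """Encode the three high/low flags in R, F, M order (1 = high, 0 = low)."""
    return "".join("1" if flag else "0" for flag in (r_high, f_high, m_high))


# Every combination of three two-way splits: 2 * 2 * 2 = 8 groups.
all_codes = [segment_code(*flags) for flags in product([True, False], repeat=3)]
print(len(all_codes), all_codes)

# Recent and high-spending but infrequent: the "key development" group.
print(segment_code(True, False, True))
```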

Project source :

Project overview:
This is a transnational data set containing all online retail transactions that took place between 1 December 2010 and 9 December 2011 for a store registered in the UK. The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers.

The main purpose of this project is to classify users with the RFM model.

PS: This project was run in Jupyter.

Importing modules

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei'] 

Load data

df = pd.read_csv('data.csv',encoding = 'ISO-8859-1', dtype = {

Next, let’s officially start the analysis !

1 Data exploration and data cleaning

1.1 Data exploration


The data contains 541910 rows and 8 fields. The fields are:

InvoiceNo: the order number. Each transaction has a 6-digit integer number; return order numbers start with the letter 'C'.
StockCode: the product code, a 5-digit integer.
Description: the product description.
Quantity: the product quantity; a negative value indicates a return.
InvoiceDate: the date and time of the order.
UnitPrice: the price per unit of product, in pounds sterling.
CustomerID: the customer number, a 5-digit integer.
Country: the name of each customer's country or region.

As you can see, we need to convert the date format and carry out the usual missing-value statistics, de-duplication, and outlier detection and handling.

1.2 Missing value statistics

df.apply(lambda x :sum(x.isnull())/len(x),axis=0)
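An equivalent and somewhat more idiomatic way to get the per-column missing ratio is `isnull().mean()`. The small frame below is made up purely to show the result.

```python
import numpy as np
import pandas as pd

# Invented sample with some missing CustomerID and Description values.
sample = pd.DataFrame({
    "CustomerID": ["17850", np.nan, "13047", np.nan],
    "Description": ["MUG", "LAMP", None, "CLOCK"],
    "Quantity": [6, 2, 3, 1],
})

# isnull() gives booleans; their column-wise mean is the missing fraction.
missing_ratio = sample.isnull().mean()
print(missing_ratio)
```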


# Fill missing IDs before casting: casting first can turn NaN into the string 'nan',
# which fillna would then miss
df['CustomerID'] = df['CustomerID'].fillna('unknown')

df['CustomerID'] = df['CustomerID'].astype('str')

1.3 Date format conversion

df['date'] = [x.split(' ')[0] for x in df['InvoiceDate']]
df['date'] = pd.to_datetime(df['date'])
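# First day of each month; astype('datetime64[M]') is rejected by recent pandas versions
df['month'] = df['date'].dt.to_period('M').dt.to_timestamp()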
df[['date', 'month']]

1.4 Removing duplicates

df = df.drop_duplicates()

1.5 Handling abnormal values

Here we treat return orders as abnormal data (that is, rows with a negative quantity or a negative unit price). The filter below keeps only rows where both values are strictly positive, so zero-priced rows are dropped as well.

df[(df['Quantity']<0) | (df['UnitPrice']<0)]

df = df[(df['Quantity']>0) & (df['UnitPrice']>0)]
df[(df['Quantity']<0) | (df['UnitPrice']<0)]
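The field list says return orders also carry an invoice number starting with 'C', which gives a cross-check on the filter above. The sketch below applies the same filter to a tiny made-up sample and verifies that no such invoice survives.

```python
import pandas as pd

# Tiny made-up sample using the dataset's column names.
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536381", "536382"],
    "Quantity": [6, -1, 3, 2],
    "UnitPrice": [2.55, 27.50, 0.0, 4.25],
})

# Same filter as above: keep rows with positive quantity and positive price.
clean = sample[(sample["Quantity"] > 0) & (sample["UnitPrice"] > 0)]

# Cross-check: no cancellation invoice (prefix 'C') survives the filter.
has_cancellations = clean["InvoiceNo"].str.startswith("C").any()
print(len(clean), has_cancellations)
```

Note that the zero-priced row is removed too, since the filter requires a strictly positive unit price.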

2 User classification

# R: days between each customer's last purchase and the latest date in the data
R_value = df.groupby('CustomerID')['date'].max()
R_value = (df['date'].max() - R_value).dt.days

# F: number of distinct invoices per customer
F_value = df.groupby('CustomerID')['InvoiceNo'].nunique()

# M: total amount spent per customer
df['amount'] = df['Quantity'] * df['UnitPrice']

M_value = df.groupby('CustomerID')['amount'].sum()
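The three indicators can also be computed in a single grouped aggregation. The sketch below shows this on a few invented transactions; the output column names `recency_days`, `frequency` and `monetary` are chosen for the example and are not used elsewhere in the project.

```python
import pandas as pd

# Made-up transactions: customer A has two invoices, customer B has one.
tx = pd.DataFrame({
    "CustomerID": ["A", "A", "A", "B"],
    "InvoiceNo": ["1001", "1001", "1002", "1003"],
    "date": pd.to_datetime(["2011-11-01", "2011-11-01", "2011-12-08", "2011-10-10"]),
    "amount": [10.0, 5.0, 20.0, 7.5],
})

snapshot = tx["date"].max()  # the latest date in the data stands in for "today"
summary = tx.groupby("CustomerID").agg(
    recency_days=("date", lambda d: (snapshot - d.max()).days),
    frequency=("InvoiceNo", "nunique"),
    monetary=("amount", "sum"),
)
print(summary)
```

Counting distinct invoice numbers rather than rows matters: customer A has three rows but only two orders.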

R_value.hist(bins = 30)

M_value.hist(bins = 30)

The distribution is clearly very skewed, so zoom in on customers who spent less than 2000:

M_value[M_value<2000].hist(bins = 30)

F_value.hist(bins = 30)

F_value[F_value<30].hist(bins = 30)

This distribution is also very skewed.

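# Segment boundaries for each indicator
R_bins = [0,30,90,180,360,720]
F_bins = [1,2,5,10,20,5000]
M_bins = [0,500,2000,5000,10000,200000]

First, the R value (the more recent the last purchase, the higher the score, so the labels run from 5 down to 1):

R_score = pd.cut(R_value,R_bins,labels=[5,4,3,2,1],right=False)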

Next, the F value:

F_score = pd.cut(F_value,F_bins,labels=[1,2,3,4,5],right=False)

And finally, the M value:

M_score = pd.cut(M_value,M_bins,labels=[1,2,3,4,5],right=False)
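It is worth seeing what `right=False` does at the bin edges. The sketch below runs the R bins on a handful of hand-picked values (invented for the demonstration): each bin includes its left edge and excludes its right edge, and anything outside the outermost edges gets no score at all.

```python
import pandas as pd

R_bins = [0, 30, 90, 180, 360, 720]
days = pd.Series([0, 29, 30, 359, 360, 720], name="recency_days")

# right=False makes each bin closed on the left: [0, 30), [30, 90), ...
scores = pd.cut(days, R_bins, labels=[5, 4, 3, 2, 1], right=False)
print(pd.concat([days, scores.rename("R_score")], axis=1))

# 720 equals the last edge, which is excluded, so it gets no score.
print(scores.isna().sum())
```

The same applies to the F and M bins: a customer whose value reaches or exceeds the last edge ends up with a missing score, so the outer edges should comfortably cover the data.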

Combine the three scores into a new DataFrame and take a look:

rfm = pd.concat([R_score,F_score,M_score],axis=1)
# Name the columns so they can be referenced below
rfm.columns = ['R_score','F_score','M_score']

Convert the scores to a numeric type:

rfm['R_score'] = rfm['R_score'].astype('float')
rfm['F_score'] = rfm['F_score'].astype('float')
rfm['M_score'] = rfm['M_score'].astype('float')

Split each indicator into high and low at its mean score (3.82, 2.03 and 1.89 are the column means):

rfm['R'] = np.where(rfm['R_score']>3.82,'high','low')
rfm['F'] = np.where(rfm['F_score']>2.03,'high','low')
rfm['M'] = np.where(rfm['M_score']>1.89,'high','low')

# Concatenate the three flags in R, F, M order, e.g. 'high-low-high'
rfm['RFM'] = rfm['R'] + '-' + rfm['F'] + '-' + rfm['M']

def rfm2grade(x):
    if x=='high-high-high':
        return 'High value customers'
    elif x=='high-low-high':
        return 'Focus on developing customers'
    elif x=='low-high-high':
        return 'Focus on keeping customers'
    elif x=='low-low-high':
        return 'Focus on retaining customers'
    elif x=='high-high-low':
        return 'General value customers'
    elif x=='high-low-low':
        return 'General development customers'
    elif x=='low-high-low':
        return 'General keeping customers'
    else:
        return 'General retaining customers'
rfm[' User level ']=rfm['RFM'].apply(rfm2grade)
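The thresholds above are typed in by hand. A sketch of deriving them from the data instead, and of building the 1/0 codes used earlier in the article, might look like this; the four sample customers and the column name `code` are invented for the example.

```python
import numpy as np
import pandas as pd

# Made-up scores for four customers.
scores = pd.DataFrame(
    {"R_score": [5, 1, 4, 2], "F_score": [1, 4, 2, 1], "M_score": [3, 5, 1, 1]},
    index=["A", "B", "C", "D"],
)

# Derive each threshold from the data instead of typing in the printed mean.
for col in ["R_score", "F_score", "M_score"]:
    scores[col[0]] = np.where(scores[col] > scores[col].mean(), "1", "0")

# Codes are in R, F, M order: 1 = above the mean, 0 = at or below it.
scores["code"] = scores["R"] + scores["F"] + scores["M"]
print(scores)
```

Computing the means in code keeps the split correct if the bins or the data change later.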

3 Classification results

rfm[' User level '].value_counts() 

rfm[' User level '].hist(figsize=(12,9))

rfm[' User level '].value_counts() / rfm[' User level '].value_counts().sum()

rfm[' User level '].value_counts().plot(kind = 'pie', 
        figsize = (15, 9),
        title = 'RFM The user classification ', 
        textprops = {
plt.legend(loc=2, bbox_to_anchor=(1.05,1.0),borderaxespad = 0.)  

4 Conclusions and suggestions

Judging by the proportions of the user classes, high value customers and key development customers together account for 47% of the total and are an important source of the company's revenue.

  • High value customers (111)
    All three RFM values are high. These customers should receive VIP service.

  • Focus on developing customers (101)
    Purchase frequency is low, but the other two values are high. We need to find ways to raise their purchase frequency; pushing the company's promotions or new product information to them in good time is a sensible way to attract them.

  • Focus on keeping customers (011)
    The last purchase was a long time ago, that is, the R value is low, but purchase frequency and amount are both high. This kind of user is a loyal customer who has not come by for a while. Take the initiative to keep in touch and raise the repurchase rate, for example by giving coupons or pushing discount information to increase the number of purchases.

  • Focus on retaining customers (001)
    The last purchase was a long time ago and purchase frequency is low, but the amount spent is high. This kind of user is about to churn. Contact them proactively, find out what went wrong, and try to win them back. You can of course also give coupons or push discount information to increase the number of purchases.

  • General development customers (100)
    The company should obtain detailed data on these customers and understand their purchasing profile. Precision marketing and timely product information are recommended for this group.

Of course, the final marketing strategy should depend on how much the company itself can invest.

RFM should not be overused either, or customers with many transactions will keep receiving mail. Every business should design a rule for how often it contacts customers. For example, make a thank-you call or send an email within three days or a week of a purchase and proactively ask whether the customer has any problems using the product; after one month, send a satisfaction inquiry; after three months, offer cross-selling suggestions and start watching for signs of churn, continually creating opportunities to contact the customer proactively. In this way the chance that customers buy again also rises considerably.
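A contact-frequency rule like this can be captured as a small schedule. The sketch below is illustrative only: the step names are invented, the first contact uses the three-day option, and "one month" and "three months" are approximated as 30 and 90 days.

```python
from datetime import date, timedelta

# Offsets follow the example in the text; the step names are illustrative,
# and "one month" / "three months" are approximated as 30 / 90 days.
CONTACT_PLAN = [
    (timedelta(days=3), "thank-you call or email"),
    (timedelta(days=30), "satisfaction inquiry"),
    (timedelta(days=90), "cross-selling suggestions"),
]


def contact_schedule(purchase_date: date):
    """Return (due date, action) pairs for one purchase."""
    return [(purchase_date + offset, action) for offset, action in CONTACT_PLAN]


for due, action in contact_schedule(date(2011, 12, 9)):
    print(due.isoformat(), action)
```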

To make it easy for anyone who wants to run the code, I have also put the complete code and data files on a network disk; help yourself if you need them.
link :
Extraction code :1024


CSDN@ The report , I have to study hard today
