Step 0: Clean Data
Obtain the data from Kaggle. Remove all duplicates. For the attribute "class", change 0 to -1 so that -1 represents normal and +1 represents fraud. After this, the resulting dataset should contain 473 fraud and 283,253 normal transactions.

Step 1: Scale Time & Amount
All other features were PCA-transformed, except for Time and Amount. However, the ranges of these 2 features differ a lot. Therefore, the data under 'Time' and 'Amount' need to be scaled so that no single feature dominates simply because of its scale.

Step 2: Re-sample Data
The data set is extremely unbalanced: only about 0.17% of the entries (473 of 283,726) are fraud transactions, which is expected since fraud should be abnormal. Partition the data into a testing set and a training set (make sure both parts contain fraud transactions!). Optionally, you can apply 5-fold cross-validation (see sklearn.model_selection.KFold). For the training data ONLY, choose appropriate resampling technique(s) (e.g., the under- and over-sampling methods in imbalanced-learn) to resample the training data. DO NOT resample the testing data.

Step 3: Train Model
Train the model using the resampled training data.

Step 4: Analyze Result
Use your testing data to score your model. At a minimum, compute the Accuracy, Precision & Recall. Feel free to explore more scoring options. Finally, choose 1 thing to adjust, and compare the results. That is, keep everything the same except:
– the scaler in step 1, to check the effect of different scalers (or the same scaler with different parameters);
– or the resampling method in step 2, to check the effect of different resampling methods (or the same method with different parameters);
– or the model in step 3, to check the effect of different models (or the same model with different parameters).
Adjusting & testing one of them is enough. If you do more, you may get some extra credit. Also, try to explain the reason for the different results you see.
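The steps above can be sketched end-to-end as follows. This is a minimal illustration, not the required solution: a small synthetic imbalanced dataset stands in for the Kaggle file so the script runs anywhere, random over-sampling via sklearn.utils.resample stands in for imbalanced-learn's methods, and StandardScaler / LogisticRegression are example choices of scaler and model (you should make and justify your own choices).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.utils import resample

# Synthetic stand-in for the Kaggle data: a heavily imbalanced binary problem.
X, y = make_classification(n_samples=20000, n_features=5,
                           weights=[0.995], random_state=0)
df = pd.DataFrame(X, columns=["Time", "Amount", "V1", "V2", "V3"])
df["Class"] = y

# Step 0: deduplicate and relabel 0 -> -1 (so -1 = normal, +1 = fraud)
df = df.drop_duplicates()
df["Class"] = df["Class"].replace(0, -1)

# Step 1: scale Time and Amount so neither dominates the other features
df[["Time", "Amount"]] = StandardScaler().fit_transform(df[["Time", "Amount"]])

# Step 2: stratified split keeps fraud in both parts; resample TRAIN only
X, y = df.drop(columns="Class"), df["Class"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)
train = X_tr.copy()
train["Class"] = y_tr
majority = train[train["Class"] == -1]
minority = train[train["Class"] == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up])
X_res, y_res = balanced.drop(columns="Class"), balanced["Class"]

# Step 3: train on the resampled training data
model = LogisticRegression(max_iter=1000).fit(X_res, y_res)

# Step 4: score on the untouched test data
pred = model.predict(X_te)
print("accuracy :", accuracy_score(y_te, pred))
print("precision:", precision_score(y_te, pred))
print("recall   :", recall_score(y_te, pred))
```

Note how the test set never passes through the resampling step: its class ratio stays at the original, realistic imbalance, which is exactly why accuracy alone looks deceptively high and precision/recall are required.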
Resources to Use
– You probably want to write code in Python, since the pre-written machine learning packages (scikit-learn, imbalanced-learn, etc.) are written in Python. A Jupyter notebook should be enough for you to write and run your code.
– There are a lot of tutorials about using these packages, and you can find plenty of nice examples under the Code and Discussion sections of the dataset. You can reference them, but be sure to cite them if you do (you don't have to use a formal citation; a comment with the URL is enough).
– And here's my poorly-written paper comparing LOF and SVM on this problem. The first several sections explain the ideas, which were all covered in class. The last 2 sections (IV. EXPERIMENTS AND RESULT & CONCLUSION) may give you some idea about how to analyze your result.

Submission
Please do NOT zip the following, but submit them as separate files.
– Your source code (in any language you like, but Python is recommended…)
– A report in .doc or .pdf that includes your choices in steps 2–4, screenshot(s) of the results, and your analysis.