Data Preprocessing

Essay by xinyi2 • February 3, 2018 • Research Paper • 314 Words (2 Pages) • 1,216 Views

Neural network

1) Do neural network on entire data.

We ran several models, starting from a single hidden layer and going up to two layers and 15 nodes, experimenting with the number of nodes under the default TanH activation function. We started with three nodes and gradually increased the number of nodes in the model. Examining the resulting confusion matrices, we found that twelve nodes give a good model. The inputs and results are shown below. [pic 1]
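The node-count search described above can be sketched in plain Python. This is only an illustration under stated assumptions: the toy data, the function names (`train_tanh_net`, `predict`, `confusion_matrix`), the learning rate, and the epoch count are all hypothetical, and the essay's own models were fitted in a statistics package, not with this code.

```python
import math
import random

def train_tanh_net(X, y, n_hidden, epochs=200, lr=0.1, seed=0):
    """Fit a one-hidden-layer TanH network with a sigmoid output by plain SGD."""
    rng = random.Random(seed)
    n_in = len(X[0])
    # Each weight row ends with a bias term.
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            h = [math.tanh(sum(w * v for w, v in zip(row, xi)) + row[-1]) for row in W1]
            z = sum(w * v for w, v in zip(W2, h)) + W2[-1]
            p = 1.0 / (1.0 + math.exp(-z))
            d_out = p - yi  # gradient of the log loss with respect to z
            for j in range(n_hidden):
                d_h = d_out * W2[j] * (1.0 - h[j] ** 2)  # back through tanh
                for k in range(n_in):
                    W1[j][k] -= lr * d_h * xi[k]
                W1[j][-1] -= lr * d_h
                W2[j] -= lr * d_out * h[j]
            W2[-1] -= lr * d_out
    return W1, W2

def predict(W1, W2, xi):
    """Predict class 1 when the output probability exceeds 0.5 (z > 0)."""
    h = [math.tanh(sum(w * v for w, v in zip(row, xi)) + row[-1]) for row in W1]
    z = sum(w * v for w, v in zip(W2, h)) + W2[-1]
    return 1 if z > 0 else 0

def confusion_matrix(y_true, y_pred):
    """Rows are actual classes, columns are predicted classes."""
    m = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

# Hypothetical toy data: class 1 when the two predictors share a sign.
rng = random.Random(1)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(60)]
y = [1 if a * b > 0 else 0 for a, b in X]

results = {}
for n_hidden in (3, 6, 9, 12, 15):
    W1, W2 = train_tanh_net(X, y, n_hidden)
    cm = confusion_matrix(y, [predict(W1, W2, xi) for xi in X])
    results[n_hidden] = (cm[0][1] + cm[1][0]) / len(y)
    print(n_hidden, cm, results[n_hidden])
```

The sketch scores each model on its own training rows for brevity; comparing confusion matrices on held-out validation rows is the safer way to pick the node count.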

[pic 2]

The misclassification rate is 9.5%. Its RSquare is 49.4% and its RMSE is 25.83%, which suggests this model fits relatively well.
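For readers who want to see how such fit statistics can be computed for a two-class response, here is a minimal sketch. The definitions are assumptions, since the essay does not state which formulas its software used: RMSE is taken over 1 minus the probability assigned to the class that occurred, and RSquare is taken as the entropy (likelihood-based) RSquare against a base-rate model. The function name and sample values are hypothetical.

```python
import math

def classification_fit_stats(y_true, p_event):
    """Return (misclassification rate, RMSE, entropy RSquare) for a binary response.

    y_true holds 0/1 labels; p_event holds predicted probabilities of class 1.
    """
    n = len(y_true)
    # Probability the model assigned to the class that actually occurred.
    p_actual = [p if y == 1 else 1.0 - p for y, p in zip(y_true, p_event)]
    # A row is misclassified when its actual class got less than 0.5.
    misclass = sum(1 for p in p_actual if p < 0.5) / n
    rmse = math.sqrt(sum((1.0 - p) ** 2 for p in p_actual) / n)
    # Null model: predict the overall event rate for every row.
    base = sum(y_true) / n
    ll_model = sum(math.log(p) for p in p_actual)
    ll_null = sum(math.log(base if y == 1 else 1.0 - base) for y in y_true)
    r_square = 1.0 - ll_model / ll_null
    return misclass, rmse, r_square

# Hypothetical labels and predicted probabilities.
y = [1, 0, 0, 1, 0, 0, 0, 1]
p = [0.9, 0.2, 0.1, 0.4, 0.3, 0.6, 0.1, 0.8]
misclass, rmse, r_square = classification_fit_stats(y, p)
print(f"misclassification={misclass:.3f} RMSE={rmse:.3f} RSquare={r_square:.3f}")
```

Under these definitions a lower misclassification rate and RMSE and a higher RSquare all point to a better fit, which is how the figures in the text are read.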

[pic 3]

2) Do neural network on oversampled data.[pic 4][pic 5]

Lift curve

[pic 6]

ROC curve

[pic 7]
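Since the lift and ROC curves survive here only as image placeholders, the following sketch shows how the points behind such curves can be derived from predicted scores. The function names and the sample labels and scores are hypothetical, and this is not the procedure the original software used.

```python
def roc_points(y_true, scores):
    """Return ROC points (false positive rate, true positive rate), one per cutoff."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    ranked = sorted(zip(scores, y_true), reverse=True)
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last row sharing this score (handles ties).
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def lift_at(y_true, scores, portion):
    """Lift in the top `portion` of rows ranked by score, relative to the base rate."""
    ranked = [label for _, label in sorted(zip(scores, y_true), reverse=True)]
    k = max(1, round(portion * len(ranked)))
    return (sum(ranked[:k]) / k) / (sum(ranked) / len(ranked))

# Hypothetical labels and model scores.
y = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.3, 0.1]
curve = roc_points(y, scores)
print("AUC:", auc(curve))
print("Lift in top 25%:", lift_at(y, scores, 0.25))
```

An ROC curve that hugs the top-left corner (area close to 1) and a lift well above 1 in the top-ranked rows both indicate a model that ranks the event class ahead of the non-event class.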

3) Do neural network on oversampled data after duplication.

After several trials, we found that the best model uses only the first layer, with eleven TanH nodes. The confusion matrices are shown below. [pic 8][pic 9]

The misclassification rate is 18.54% and the RMSE is 36.40%, both higher than for the model on the entire data. However, this model’s RSquare is 56.60%, compared with 49.4% on the entire data, so by that measure this model fits better.
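The essay does not show how the oversampled data with duplication was built. One plausible reading, assumed here, is that rows of the rarer class are duplicated at random until the classes are balanced. The function name, the `label_key` parameter, and the toy rows below are hypothetical.

```python
import random

def oversample_by_duplication(rows, label_key, seed=0):
    """Balance classes by duplicating randomly chosen rows of the smaller classes."""
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(row[label_key], []).append(row)
    target = max(len(g) for g in groups.values())
    balanced = []
    for g in groups.values():
        balanced.extend(g)
        # Draw duplicates with replacement until this class reaches the target size.
        balanced.extend(rng.choice(g) for _ in range(target - len(g)))
    rng.shuffle(balanced)
    return balanced

# Hypothetical data: 2 rows of class 1 and 8 rows of class 0.
rows = [{"x": i, "y": 1 if i < 2 else 0} for i in range(10)]
balanced = oversample_by_duplication(rows, "y")
counts = {c: sum(1 for r in balanced if r["y"] == c) for c in (0, 1)}
print(counts)
```

Because a balanced sample contains far more rare-class rows than the full data, misclassification rates measured on it are not directly comparable to rates measured on the entire data.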

Lift curve [pic 10]

ROC curve[pic 11]

[pic 12]

Discriminant analysis

1) Do discriminant analysis on entire data.

Because this method requires continuous predictors and a categorical response variable, we should convert all predictors into continuous variables.

So, we make indicator columns for the categorical predictors and use these indicators in the discriminant analysis.
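Making indicator columns can be sketched as follows. The column names (`age`, `region`) and the helper `add_indicator_columns` are hypothetical, since the essay's actual predictors appear only in the screenshots.

```python
def add_indicator_columns(rows, column):
    """Replace a categorical column with one 0/1 indicator column per level."""
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for level in levels:
            new_row[f"{column}_{level}"] = 1 if row[column] == level else 0
        encoded.append(new_row)
    return encoded

# Hypothetical rows with one continuous and one categorical predictor.
rows = [
    {"age": 34, "region": "north"},
    {"age": 51, "region": "south"},
    {"age": 29, "region": "north"},
]
encoded = add_indicator_columns(rows, "region")
print(encoded[0])
```

The indicators for one predictor always sum to 1, so it is common to drop one level per predictor before a discriminant analysis; otherwise the covariance matrix of the predictors can become singular.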

[pic 13]

Do discriminant analysis.

[pic 14]
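To make the discriminant analysis step concrete, here is a minimal sketch of two-class linear discriminant analysis with a pooled covariance matrix. It is an assumption that the linear variant was used, and the sketch is limited to two continuous predictors so the matrix inverse can be written by hand; the data and function names are hypothetical.

```python
import math

def fit_lda(X, y):
    """Two-class linear discriminant analysis for two continuous predictors."""
    stats = {}
    n = len(X)
    pooled = [[0.0, 0.0], [0.0, 0.0]]
    for c in (0, 1):
        rows = [x for x, label in zip(X, y) if label == c]
        mean = [sum(r[j] for r in rows) / len(rows) for j in range(2)]
        for r in rows:
            d = [r[0] - mean[0], r[1] - mean[1]]
            for a in range(2):
                for b in range(2):
                    pooled[a][b] += d[a] * d[b]
        stats[c] = (mean, len(rows) / n)  # class mean and prior probability
    # Pooled within-class covariance (n - 2 degrees of freedom) and its 2x2 inverse.
    s = [[v / (n - 2) for v in row] for row in pooled]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det], [-s[1][0] / det, s[0][0] / det]]
    return stats, inv

def lda_predict(model, x):
    """Assign x to the class with the larger linear discriminant score."""
    stats, inv = model
    scores = {}
    for c, (mean, prior) in stats.items():
        w = [inv[0][0] * mean[0] + inv[0][1] * mean[1],
             inv[1][0] * mean[0] + inv[1][1] * mean[1]]
        # Score: x'S^-1 m - 0.5 m'S^-1 m + log(prior).
        scores[c] = (x[0] * w[0] + x[1] * w[1]
                     - 0.5 * (mean[0] * w[0] + mean[1] * w[1]) + math.log(prior))
    return max(scores, key=scores.get)

# Hypothetical data: class 0 near (0, 0), class 1 near (3, 3).
X = [[0.0, 0.2], [0.5, -0.1], [-0.3, 0.4], [0.2, 0.6],
     [3.1, 2.8], [2.7, 3.3], [3.4, 3.0], [2.9, 2.5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
model = fit_lda(X, y)
print(lda_predict(model, [0.1, 0.1]), lda_predict(model, [3.0, 3.0]))
```

With indicator columns added as extra predictors, the same scores apply, but the covariance matrix grows and needs a general matrix inverse.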

2) Do discriminant analysis on oversampled data.

After revision

[pic 15]

Do discriminant analysis with dummy variables and continuous predictors

[pic 16]

ROC curve

[pic 17]

3) Do discriminant analysis on oversampled data after duplication.

Revise variables first.[pic 18]

Do discriminant analysis. [pic 19]

As the screenshot shows, predictions from this model have a 20.14% misclassification rate, and the RSquare is 36.03%.

So, we can try to use this model for prediction, but it is not recommended.

ROC curve

[pic 20]

...
