Business Intelligence Methods
Objective: Address a business intelligence challenge using logistic regression and a technique of your choice.
Background: A student advisory service is looking at the admission statistics of a prestigious graduate program. The service can provide value to students by recommending which graduate institutions they should apply to. Because applications are expensive and time-consuming, the probability of admission is an important factor. The service is also interested in other analyses, such as cluster analysis, association rules, or classification techniques, that would provide insight into the characteristics of the students who apply and are admitted to the program. The firm may also go back to the school to provide value for marketing and outreach to these students. For example, the school might decide it wants more high-performing students from lower-ranked schools.
Part 1: Logistic Regression
Your task: Run a logistic regression. Using the output, answer the following:
(I got the output by using the program SPSS; I put it in a different file.)
1) What is the probability of admission for a student who came from a rank 2 institution with a 750 GRE and a GPA of 3.33?
P(Y) = exp(a + β1X1 + β2X2 + … + βnXn) / [1 + exp(a + β1X1 + β2X2 + … + βnXn)]
P(Admission) = exp(a + β1X1 + β2X2 + … + βnXn) / [1 + exp(a + β1X1 + β2X2 + … + βnXn)]
P(Admission) = exp(-3.45 - 0.56*2 + 0.777*3.33 + 0.002*750) / [1 + exp(-3.45 - 0.56*2 + 0.777*3.33 + 0.002*750)] = 0.38
So P(Admission | rank = 2, GPA = 3.33, GRE = 750) = 0.38
What is the probability of admission for a student with an 800 GRE, 3.11 GPA and a Rank 3 school?
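The calculation in question 1 can be checked with a short script. This is only a sketch: it assumes the coefficients in the worked example (constant -3.45, rank -0.56, GPA 0.777, GRE 0.002) match the SPSS output, and the function name `admission_probability` is made up for illustration.

```python
import math

# Coefficients copied from the worked example in question 1. They are
# assumed to match the SPSS output; replace them if the output differs.
INTERCEPT = -3.45
B_RANK = -0.56
B_GPA = 0.777
B_GRE = 0.002


def admission_probability(rank, gpa, gre):
    """Return P(admission) from the logistic regression formula."""
    z = INTERCEPT + B_RANK * rank + B_GPA * gpa + B_GRE * gre
    return math.exp(z) / (1 + math.exp(z))


# Student from the worked example: rank 2, GPA 3.33, GRE 750
p_first = admission_probability(rank=2, gpa=3.33, gre=750)
print(round(p_first, 2))  # 0.38

# Second student in the question: rank 3, GPA 3.11, GRE 800
p_second = admission_probability(rank=3, gpa=3.11, gre=800)
print(round(p_second, 2))
```

The same function can be used for any other student by changing the three inputs.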
2) Select 10 individuals at random. Calculate the probability of admission for each.
3) Using the rule that p(admission) > .5 predicts admission, create a classification matrix for the ten observations you chose.
Ex: If we observed a student with a 750 GRE, a 3.33 GPA, and a rank 2 school who was admitted, that student would be a “false negative”: we predicted she would not be admitted (p = .38, which is less than .5), when in fact she was admitted. How many of your 10 observations were classified correctly?
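A classification matrix for the chosen observations can be counted as in the sketch below. It assumes the same coefficients as the worked example in question 1. Only the first row of `sample` comes from the text; the other rows are made-up placeholders and are not rows from the data file. The names `classification_matrix` and `sample` are made up for illustration.

```python
import math


def admission_probability(rank, gpa, gre):
    # Same assumed coefficients as the worked example in question 1.
    z = -3.45 - 0.56 * rank + 0.777 * gpa + 0.002 * gre
    return math.exp(z) / (1 + math.exp(z))


def classification_matrix(students, cutoff=0.5):
    """Count the four cells for rows of (rank, gpa, gre, admitted)."""
    counts = {"true_positive": 0, "false_positive": 0,
              "true_negative": 0, "false_negative": 0}
    for rank, gpa, gre, admitted in students:
        predicted = admission_probability(rank, gpa, gre) > cutoff
        if predicted and admitted:
            counts["true_positive"] += 1
        elif predicted and not admitted:
            counts["false_positive"] += 1
        elif not predicted and admitted:
            counts["false_negative"] += 1
        else:
            counts["true_negative"] += 1
    return counts


# First row is the example student from the text (admitted, p = .38).
# The other rows are made-up placeholders, NOT rows from the data file.
sample = [
    (2, 3.33, 750, True),
    (1, 4.00, 800, True),
    (4, 2.50, 400, False),
    (1, 3.90, 780, False),
]
matrix = classification_matrix(sample)
correct = matrix["true_positive"] + matrix["true_negative"]
print(matrix)
print("Classified correctly:", correct, "out of", len(sample))
```

To use it for the assignment, replace the rows in `sample` with the ten observations picked from the data file.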
Part 2: Analysis of your choice
Run an analysis of the data in Weka with a cluster analysis, association rule, or classification technique of your choice. (I have done the lab part: I got the output of the classification technique using J48 from Weka and put it in a different file. I didn’t choose association rules because Weka didn’t allow me to run them, since it was hard to find the data, so I chose the classification technique instead.)
1) Outline briefly a research question/business issue you are investigating.
2) Provide a brief overview of what technique you chose to investigate.
3) Analyze the results: describe two or more leaves, clusters, or rules in detail.
4) Describe the insights you gained from the model, and any recommendations or interesting findings you came away with.
How to use Weka:
1- Open the program
2- Click on Explorer
3- Open file
4- Files of type: choose (csv), and then choose the excel file
5- Click on Classify => Choose => trees => J48 => Start. Now you can see the output of the J48 decision tree.
Here is the link to download Weka:
use version weka-3-6-10jre-x64.exe
– It must be written in very simple language, using only simple and basic English words. I don’t want it to read like professional writing; I want it to read like first-year college student writing.
– I have uploaded the Excel file (data).
– You cannot use anything other than SPSS for the first part and Weka for the second part; you can, however, do the calculations in Excel or by hand.
– I have already done all the lab parts in SPSS and Weka, so you don’t have to do them; just look at the output, or run them yourself to see the output more clearly.
– Please see the next page!
I need your writing to match this writing exactly, so the reader feels that this piece and the one you are going to write have only one writer. (This writing is my older assignment about classification using the output of the J48 tree and the SimpleCart tree too. You can copy some sentences from my old assignment if they are helpful.)
(This old assignment uses different data.)
My completed old assignment:
Business Intelligence Methods: Classification: Decision Trees
Frame and Business Objective
The essence of this task is to use a J48 decision tree in Weka with adjustments for a maximum number of tree leaves. As such, adjusting the number assumes that attributes in Naive Bayes are equally important and statistically independent given the class value. Despite the fact that assuming the statistical independence of attributes might be inaccurate, it often works well in practice. It is imperative to note that classification does not require accurate probability estimates so long as the correct class receives the greatest probability. However, in practice, adding redundant attributes might cause problems. Thus, it is important to use distinct attributes in the selection process. It was also imperative to have a business objective at this point. As such, the objective was that an increase in the number of subscribers opening bank accounts would result in a subsequent increase in revenue.
There were two information gaps that the researcher discovered. First, there were instances where the information was unknown. For example, some people did not have contacts; thus, the metric outcome was significantly affected. In fact, in some cases, the result displayed was “other” or “failure,” indicating a lack of authenticity of data presented. Second, income presented an information gap. Account balances play an insignificant role in explaining why individuals open accounts. However, their income is imperative to the process because it depicts the possibility of opening a bank account. According to the scenario presented, it is worth noting that individuals with high income would be more likely to open bank accounts than low-income earners.
Measurements
The assignment entailed using a range of measures to describe data. First, ‘age’ was represented by a numeric attribute. Second, ‘job’ was categorical depending on an individual’s preference. In this case, individuals were described as ‘admin.’, ‘blue-collar’, ‘entrepreneur’, ‘housemaid’, ‘management’, ‘retired’, ‘self-employed’, ‘services’, ‘student’, ‘technician’, ‘unemployed’, and ‘unknown’. Third, ‘marital’ depicted their marital status. As such, in a similar representation to ‘job’, it was categorical. Thus, individuals were ‘single’, ‘married’, or ‘divorced’. It is imperative to note that the term ‘divorced’ applied to both widowed and separated individuals because they were no longer living with a partner, a factor that would likely contribute to opening a bank account.
Fourth, ‘education’ was a categorical attribute with ‘unknown’, ‘secondary’, ‘primary’, and ‘tertiary’ as options. Fifth, ‘default’ tested if the client had credit in default. As such, the entry was binary: ‘yes’ or ‘no’. Sixth, ‘balance’ was a numeric attribute that entailed the average yearly balance in Euros. Seventh, ‘housing’ was a binary attribute for a housing loan. Eighth, ‘loan’ was binary and inquired if the prospect had a personal loan.
The following were related to the last contact with the current campaign. First, ‘contact’ was categorical and it implied contact communication type, that is, ‘unknown’, ‘telephone’, or ‘cellular’. Second, ‘day’ was a numeric option that implied the last contact day of the month. Third, ‘month’ was a categorical option implying the last contact month of year, that is ‘jan’, ‘feb’, ‘mar’, ‘apr’, ‘may’, ‘jun’, ‘jul’, ‘aug’, ‘sep’, ‘oct’, ‘nov’, or ‘dec’. Lastly, ‘duration’ was a numeric input for the last contact duration in seconds. Other attributes used in the analysis were as follows.
First, ‘campaign’ was the number of contacts performed during this campaign and for this client in numeric form. Second, ‘pdays’ was the number of days that passed by after the customer was last contacted from a previous campaign in numeric format. Third, ‘previous’ implied the number of contacts performed before this campaign and for this client and it was numeric. Fourth, ‘poutcome’ entailed the outcome of the previous marketing campaign. It was categorical with options ‘unknown’, ‘other’, ‘failure’, or ‘success’. Lastly, the output variable ‘y’ was binary and it queried whether the client had subscribed to a term deposit or not.
The analytic method used in this case was J48. It is imperative to note that J48 is a machine-learning model that predicts the dependent variable, that is, the target value, from a set of data. It has internal nodes denoting distinct attributes and leaf nodes for classification. In this regard, the dependent variable was ‘poutcome’. On a different note, independent variables were ‘housing’, ‘loan’, ‘contact’, ‘day’, ‘month’, ‘duration’, ‘campaign’, and ‘pdays’.
The program was run as a single instance analysis using J48 and SimpleCart.
Presentation of Results
It is critical to note that individuals will open bank accounts based on the independent variables listed above. In this regard, the four most important are job, age, education, and loan. First, people in management are more likely to open bank accounts than technicians, retired, unemployed, administrators, or people in blue-collar jobs. Second, individuals in their 30s and 40s were more likely to open accounts because of current responsibilities, nature of jobs, and the need to save for the future. Third, education was imperative to making the choice. As such, those who had attained a degree from a tertiary institution were more likely to open bank accounts than their counterparts who achieved secondary school education, who in turn still outperformed primary school graduates. Finally, a loan was central to decision-making because most individuals with loans were unlikely to open an account for fear of incurring additional debt.