This is a collection of small datasets used in the course, classified by the type of statistical technique that may be used to analyze them. A couple of datasets appear in more than one category. The datasets are now available in Stata format as well as two plain text formats, as explained below. The copy with extension. This version is best for users of S-Plus or R and can be read using read. This version is better for users of Stata or other packages that prefer numerical codes.
However, Stata can read the character version if you specify the string width using str. To download any of these files using your browser, I recommend that you right-click and choose 'save as'. If you left-click, what happens next depends on how your browser is configured to handle these file types, and will often require an extra step.
The datasets are also available as Stata system files with extension. This is the easiest method for Stata users. You can also right-click on the links to save a local copy.
R users can read the Stata files using Tom Lumley's read.
Here are the famous program effort data from Mauldin and Berelson. This extract consists of observations on an index of social setting, an index of family planning effort, and the percent decline in the crude birth rate (CBR) over the study period, for 20 countries in Latin America. The data are available as plain text files effort. The data are also available in Stata format as effort.
Reference: P. Mauldin and B. Berelson. Conditions of fertility decline in developing countries. Studies in Family Planning, 9.
These are the salary data used in Weisberg's book, consisting of observations on six variables for 52 tenure-track professors in a small college.
The variables are: The file is available in the usual plain text formats as salary. Here's an excerpt of the "dat" file: Reference: S. Weisberg. Applied Linear Regression, Second Edition. New York: John Wiley and Sons.
The sample has observations after deleting 32 cases with incomplete information on five variables: The data are available in plain text format in the files phbirths. Reference: I. Elo, G.
In supervised machine learning, regression algorithms help us build a model that predicts the values of a dependent variable from the values of one or more independent variables.
For example, predicting future demand for a product based on previous demand. In this tutorial, we will use linear regression to predict salaries based on given input attributes. The salary data include the nine-month academic salary for Assistant Professors, Associate Professors and Professors in a college in the U.S. Let's look at sample data from the Salaries dataset by selecting every 5th row, starting from the 1st row.
Here is the description given by the car package for these data.
Predicting salaries using Linear Regression
Notice above the different attributes and their data type and the first few values of each attribute. As you know, input sample data may have missing values which can impact the overall prediction. First, we will visualize the variable Salaries against rank, discipline, yrs.
Let's fit the independent variables rank, discipline, yrs. To detect multicollinearity, we can calculate the variance inflation and generalized variance inflation factors for linear and generalized linear models with the vif function. Multicollinearity takes place when a predictor is highly correlated with others.
If multicollinearity exists in the model, you might see a high R-squared value even though individual variables are shown as insignificant. We can perform the Breusch-Pagan test for heteroscedasticity with the bptest function available in the lmtest package. The simple regression model assumes that the variance of the error is constant, or homogeneous, across observations. Heteroscedasticity means that the variance is unequal across observations.
We diagnose the heteroscedasticity of the regression model using the bptest function. A small p-value from this test leads us to reject the null hypothesis of constant variance, which implies that the standard errors of the parameter estimates are incorrect. However, we can use robust standard errors to correct the standard errors (this does not remove the heteroscedasticity) and increase the significance of truly significant parameters, using the robcov function available in the rms package.
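To make the idea behind the Breusch-Pagan test concrete, here is a minimal sketch in plain Python for a model with a single predictor. It is not the lmtest implementation: it uses the studentized (Koenker) form of the statistic, n times the R-squared of an auxiliary regression of the squared residuals on the predictor. The data are made up for illustration, and the helper names ols_simple and breusch_pagan are hypothetical.

```python
import math


def ols_simple(x, y):
    """Least-squares fit of y = b0 + b1 * x. Returns (b0, b1, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return b0, b1, 1.0 - ss_res / ss_tot


def breusch_pagan(x, y):
    """Studentized Breusch-Pagan test for one predictor.

    Returns (lm_statistic, p_value). With one predictor the statistic is
    compared with a chi-square distribution with 1 degree of freedom,
    whose upper tail is erfc(sqrt(stat / 2)).
    """
    b0, b1, _ = ols_simple(x, y)
    squared_residuals = [(yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)]
    # Auxiliary regression: do the squared residuals depend on x?
    _, _, r2_aux = ols_simple(x, squared_residuals)
    lm = max(0.0, len(x) * r2_aux)
    p_value = math.erfc(math.sqrt(lm / 2.0))
    return lm, p_value


# Made-up data whose error size grows with x (heteroscedastic by design).
x = list(range(1, 21))
y = [2.0 + 3.0 * xi + (0.5 * xi if xi % 2 == 0 else -0.5 * xi) for xi in x]

lm_stat, p_value = breusch_pagan(x, y)
print(f"LM statistic = {lm_stat:.2f}, p-value = {p_value:.4g}")
```

A small p-value here points to non-constant error variance. With several predictors the auxiliary regression would include all of them, and the degrees of freedom would equal the number of predictors.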
With a fitted regression model, we can apply the model to predict unknown values. For regression models, we can express the precision of prediction with a prediction interval and a confidence interval. Note that the predicted salary values, with their confidence intervals, are quite close to the real values in question.
Describes how to create a bar plot based on count data.
For an example of count data, see the email50 curated data set, which was taken from the Open Intro AHSS textbook (not affiliated). An example of count data in this dataset would be the spam column. Select one (1) column to create its barplot and then click 'Submit'. If you do not choose count data, you may get unexpected results. Students may also be interested in creating barplots for contingency tables. For a stacked side-by-side barplot, see the other barplot app.
Select one (1) column from a contingency table like the Gender and Politics or VADeaths curated datasets. If you do not choose a contingency table, you may get unexpected results. You can import a dataset if you are logged in. Shows the student how to create a single stacked bar plot based on a column in a contingency table. For a basic barplot (single column) based on count data, see the count data barplot app. For a stacked side-by-side barplot, see the other stacked barplot app for categorical data.
Select one (1) column from a contingency table. If you don't have your own dataset, you can choose the Gender and Politics or VADeaths curated datasets. If a contingency table is not chosen, you may get unexpected results.
A contingency table has columns like a regular dataset, but the first row contains row names that categorize and "split up" the dataset. An example of a contingency table would be something like this: This contingency table is taken from the Gender and Politics dataset. You can get a preview by selecting the dataset from the Curated Data dropdown above.
This app shows the student how to create a pie chart from a contingency table by hand using a Picostat dataset.
We believe that we offer a valuable resource to the academic community and to anyone who is interested in the shape of Computer Science in the most competitive institutes of the US.
You can download the database from here if you want to explore the data. To avoid conflicts you can only download and comment on the spreadsheet. You do not need to be signed in with a Google account. If you find any errors, you can right-click the cell you want to fix and add a comment.
We will double-check your correction and we will accept your change accordingly.
How to insert a new professor
If you cannot find a professor, you can right-click and add a comment in one of the empty lines at the bottom of the spreadsheet.
Try to insert as many of the required fields as possible. We do not make any guarantee on the quality of the data. As you can read below, we used crowdsourcing techniques to create this database. We need your help to fix any errors that are still left. These are the fields of research according to their popularity.
Supplemental Analysis by Prof. Jeff Huang
Professor Jeff Huang has released a detailed analysis that accompanies this dataset on his website. You can find his analysis here. They intend to release a ranking for other CS areas soon. You can find their ranking here.
PhD student M. Adil Yalcin of University of Maryland incorporated our dataset in keshif. You can perform various queries and see the results in an accessible GUI.
Our goal was to create a full record of all Computer Science professors in 50 top Graduate Programs, according to the popular ranking of US News. For now, we restrict the professors to only Full, Associate, or Assistant.
Lecturers, Researchers, and Teaching faculty are not included for the moment. We also only allow professors of the 50 provided universities. In the future we wish to expand this database to fully reflect the status of each Computer Science department.
One of the 20 fields reported by Microsoft Academic
Bachelors: University they acquired their BSc degree from
Masters: University they acquired their MSc degree from
Doctorate: University they acquired their PhD degree from
PostDoc: University or Company where they did their post-doctoral training
Sources: at least one link to the source that contains the above information
Since we ended up with 2 instances of each department we managed to increase the accuracy of the database by merging them into a single dataset. This work would not be possible without the hard work of 19 exceptional students.
How to use this dataset
Here is a list of suggestions:
Look up faculty in a particular area when applying to grad schools
Track alumni
Choose what field to study!
Look at the breakdown of research areas per department
Categorical variables (also known as factor or qualitative variables) are variables that classify observations into groups. They have a limited number of different values, called levels.
For example, the gender of individuals is a categorical variable that can take two levels: Male or Female. Regression analysis requires numerical variables. So, when a researcher wishes to include a categorical variable in a regression model, supplementary steps are required to make the results interpretable.
In these steps, the categorical variables are recoded into a set of separate binary variables. This is done automatically by statistical software, such as R. For simple demonstration purposes, the following example models the salary difference between males and females by computing a simple linear regression model on the Salaries data set [car package]. R creates dummy variables automatically. The p-value for the dummy variable sexMale is very significant, suggesting that there is statistical evidence of a difference in average salary between the genders.
The contrasts function returns the coding that R has used to create the dummy variables. R has created a sexMale dummy variable that takes on a value of 1 if the sex is Male, and 0 otherwise. The decision to code males as 1 and females as 0 (baseline) is arbitrary, and has no effect on the regression computation, but does alter the interpretation of the coefficients. If males are instead set as the baseline, the model contains a sexFemale dummy, and the fact that its coefficient in the regression output is negative indicates that being a Female is associated with a decrease in salary relative to Males.
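The effect of 0/1 dummy coding can be checked by hand. The sketch below, in plain Python rather than R, fits a simple linear regression of salary on a sex_male dummy and shows that the intercept equals the mean salary of the baseline group (females) while the slope equals the difference between the two group means. The salary figures are made up for illustration and are not taken from the Salaries data; fit_line is a hypothetical helper.

```python
def fit_line(x, y):
    """Least-squares fit of y = b0 + b1 * x. Returns (b0, b1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


# Hypothetical salaries in thousands of dollars (illustration only).
salary_female = [100.0, 110.0, 90.0]
salary_male = [120.0, 130.0, 110.0, 120.0]

# Dummy variable: 0 for the baseline group (female), 1 for male.
sex_male = [0] * len(salary_female) + [1] * len(salary_male)
salary = salary_female + salary_male

intercept, slope = fit_line(sex_male, salary)
mean_female = sum(salary_female) / len(salary_female)
mean_male = sum(salary_male) / len(salary_male)

print(f"intercept = {intercept:.2f} (female mean = {mean_female:.2f})")
print(f"slope     = {slope:.2f} (male mean minus female mean = {mean_male - mean_female:.2f})")
```

Swapping the coding, so that males are 0 and females are 1, would give the male mean as the intercept and the same difference with the opposite sign as the slope.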
This results in the model:.
Dataset of 2200 faculty in 50 top US Computer Science Graduate Programs
So, if the categorical variable is coded as -1 and 1, then if the regression coefficient is positive, it is subtracted from the group coded as -1 and added to the group coded as 1. If the regression coefficient is negative, then addition and subtraction are reversed. Generally, a categorical variable with n levels will be transformed into n-1 variables, each with two levels.
These n-1 new variables contain the same information as the single variable. This recoding creates a table called a contrast matrix. For example, a rank variable with three levels could be dummy coded into two variables, one called AssocProf and one Prof. This dummy coding is automatically performed by R. For demonstration purposes, you can use the function model. When building a linear model, there are different ways to encode categorical variables, known as contrast coding systems. The default option in R is to use the first level of the factor as a reference and interpret the remaining levels relative to this level.
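The n-1 dummy coding described above can be sketched in a few lines of plain Python. This is an illustration of the idea, not R's own implementation: the first level is used as the reference and gets zeros in every dummy column. The function name treatment_dummies is hypothetical, and AsstProf is assumed here to be the reference level.

```python
def treatment_dummies(values, levels):
    """Recode a categorical variable into len(levels) - 1 dummy columns.

    The first entry of `levels` is the reference level: it is represented
    by a row of zeros. Every other level gets its own 0/1 column.
    """
    unknown = set(values) - set(levels)
    if unknown:
        raise ValueError(f"values not listed in levels: {sorted(unknown)}")
    non_reference = levels[1:]
    return [[1 if value == level else 0 for level in non_reference]
            for value in values]


# Assumed level order; the first level acts as the reference.
levels = ["AsstProf", "AssocProf", "Prof"]

# Applying the coding to the levels themselves gives the contrast matrix:
# one row per level, one column per non-reference level.
contrast_matrix = treatment_dummies(levels, levels)
for level, row in zip(levels, contrast_matrix):
    print(f"{level:10s} {row}")

# Applying it to observations gives the dummy columns used in the regression.
ranks = ["AsstProf", "AssocProf", "Prof", "Prof", "AsstProf"]
dummy_rows = treatment_dummies(ranks, levels)
```

Each regression coefficient on a dummy column is then read as the difference between that level and the reference level.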
Note that ANOVA (analysis of variance) is just a special case of a linear model where the predictors are categorical variables. And, because R understands the fact that ANOVA and regression are both examples of linear models, it lets you extract the classic ANOVA table from your regression model using the R base anova function or the Anova function [in car package].
We generally recommend the Anova function because it automatically takes care of unbalanced designs. Taking other variables yrs. Significant variables are rank and discipline. For example, it can be seen that being from discipline B (applied departments) is significantly associated with an average increase in salary.
In this chapter we described how categorical variables are included in a linear regression model.
As regression requires numerical inputs, categorical variables need to be recoded into a set of binary variables. We provide practical examples for the situations where you have categorical variables containing two or more levels.
Note that, for categorical variables with a large number of levels it might be useful to group together some of the levels.
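One simple way to group levels is to lump every level that occurs fewer than a chosen number of times into a single catch-all level. The plain Python sketch below illustrates this. The function name lump_rare_levels, the "Other" label, the threshold, and the department names are all assumptions made for the example.

```python
from collections import Counter


def lump_rare_levels(values, min_count, other_label="Other"):
    """Replace levels that occur fewer than `min_count` times by `other_label`."""
    counts = Counter(values)
    return [value if counts[value] >= min_count else other_label
            for value in values]


# Hypothetical department labels with two rare levels.
departments = ["Math", "Math", "Math", "Physics", "Physics",
               "Classics", "Music"]

grouped = lump_rare_levels(departments, min_count=2)
print(Counter(grouped))
```

Fewer levels mean fewer dummy variables in the regression, at the cost of no longer distinguishing between the lumped groups.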
Faculty Compensation Survey
Some categorical variables have levels that are ordered.
Data from the survey are also available in several custom formats: results portal, complete datasets, peer compensation reports, and survey report tables. Results portal access is available for institutions to create and save peer lists and conduct fully customizable analysis on specific variables not included in peer compensation reports.
Complete datasets are provided in Excel and include all of the report appendix institutional listings plus additional fields.
Reports and data will be provided to active AAUP chapters and state conferences in good standing at no cost. Order data or find out more here.
A researcher investigates the factors that are associated with the salaries of professors who teach courses at a major university.
The researcher gathers data about the subject area and the highest degree obtained from a random sample of professors. The researcher uses Fit General Linear Model to determine whether subject area and degree are associated with salary. Then, she uses the stored model with Comparisons to assess salary differences by subject area.
Academic salaries data.
Worksheet columns, with description and variable type:
Subject: The subject matter that the professor teaches: humanities (1), social sciences (2), engineering (3), or management (4). Variable type: factor.
Degree: The highest degree that the professor obtained: bachelor's (1), master's (2), or Ph.D.
The remaining column holds the amount of payment received for teaching a course, in thousands of dollars.