Psychological Methods. Author manuscript; available in PMC 2017 Dec 1. PMCID: PMC5221569; NIHMSID: NIHMS821520

The introduction to this special issue on psychological research involving big data summarizes the highlights of 10 articles that address a number of important and inspiring
perspectives, issues, and applications. Four common themes that emerge in the articles with respect to psychological research conducted in the area of big data are mentioned, including: 1. The benefits of collaboration across disciplines, such as those in the social sciences, applied statistics, and computer science. Doing so assists in grounding big data research in sound theory and practice, as well as in affording effective data retrieval and analysis. 2. Availability of large datasets on
Facebook, Twitter, and other social media sites that provide a psychological window into the attitudes and behaviors of a broad spectrum of the population. 3. Identifying, addressing, and being sensitive to ethical considerations when analyzing large datasets gained from public or private sources. 4. The unavoidable necessity of validating predictive models in big data by applying a model developed on one dataset to a separate set of data or hold-out sample. Translational abstracts that
summarize the articles in very clear and understandable terms are included in Appendix A, and a glossary of terms relevant to big data research discussed in the articles is presented in Appendix B. Keywords: big data, machine learning, statistical learning theory, social
media data, digital footprint, decision trees and forests

Big data involves the storage, retrieval, and analysis of large amounts of information and has been gaining interest in the scientific literature writ large since the 1990s. As a catch-all term, big data has also been referred to by a number of other related terms such as: data mining, knowledge discovery in databases, data or predictive analytics, or data science. The domain has traditionally
been associated with computer science, statistics, and business, and now it is clearly, quickly, and usefully making inroads into psychological research and applied practice. There is a healthy and growing infrastructure for dealing with big data, some of it being open source and free to use. For example, Hadoop (a name originally based on that of a child’s toy elephant) is a widely used open source distributed file system and processing framework. Alongside such frameworks, MySQL is a widely used open source database management system built on the structured query language (SQL). SQL provides powerful capabilities to “Select” a specific group of entities, “From” a specific database or set of files, “Where” one or more specific conditions hold. For example, an academic researcher could select and analyze data based on student identification numbers from class records in several majors, where the GPA is less than 2.0. In turn, this could allow for the possibility of strategic data-driven interventions with these students to
offer enrichment or tutoring that would bolster their grades and improve their chances of staying in school and succeeding. Once big data are queried and refined, they can be analyzed with a number of tools, increasingly with commonly known software and programs such as R and Python. Who is using big data? Business industries in this area abound (e.g., insurance, manufacturing, retail, pharmaceuticals, transportation, utilities, law, gaming, eBay, telecommunication,
hotels). Social media is also prominently involved (e.g., Google, Facebook, LinkedIn, Yahoo, Twitter). Various academic disciplines also have a visible presence (e.g., genomics, medicine, and environmental sciences, the latter often using spatial geographic information systems, or GIS). There are several journals in this area, including the open access and peer-reviewed journal Big Data, founded in 2013 and currently edited by Dhar. Their web page
(http://www.liebertpub.com/overview/big-data/611/) boasts comprehensive coverage and a broad audience, yet it has not yet mentioned psychology or even the broader social sciences. At least two other journals were founded in 2014, the open access Journal of Big Data that
is edited by Furht and Khoshgoftaar, and Big Data Research that is edited by Wu and Palpanas. Likewise, these two journals also do not appear to be directed to those in psychology or the larger social sciences. Similarly, a quick Google search in September 2016 for “big data book” revealed more than 48 million results, although it is noteworthy that none of the big data books listed on the first page of results is specifically directed to the social sciences. Noting all of this is not to indict the
current state of big data for neglecting psychology—quite the opposite: Psychology and the social sciences should be proactive and take advantage of a real opportunity in front of them. The time is ripe, now that the big data movement has matured beyond many of its fads. So, where does psychology fit into the field of big data or related areas such as computational social science? There are a number of areas in which psychology can weigh in, and has begun to do so, such as wellness,
mental health, depression, substance use, behavioral health, behavior change, social media, workplace well-being and effectiveness, student learning and adjustment, and behavioral genetics. A number of recent books of interest to psychology researchers have been published (Alvarez, 2016; Cioffi-Revilla,
2014; Mayer-Schönberger & Cukier, 2013; McArdle & Ritschard, 2014, to name a few). Researchers are studying topics such as health and the human condition in big datasets comprising thousands of individuals, such as in the Kavli Human Project
(http://kavlihumanproject.org/; Azmak et al., 2015). In a similar vein,
Fawcett (2016) discusses the analysis of what is called the quantified self in which individuals collect data on themselves (e.g., number of steps, heart rate, sleep patterns) using personal trackers such as Fitbit, Jawbone, iPhone, and similar devices. Researchers envision studies that could link such personal data to health and productivity to reveal patterns or links between
behavior and various outcomes of interest. It is apparent that big data or data science is here to stay, with or without psychology. This broad-and-growing field offers a unique opportunity for interested psychological scientists to be involved in addressing the complex technical, substantive, and ethical challenges with regard to storing, retrieving, analyzing, and verifying large datasets. Big data science can be instrumental in collaboratively working to uncover and illuminate
cogent and robust patterns in psychological data that directly or indirectly involve human behavior, cognition, and affect over time and within sociocultural systems. These psychological patterns, in turn, give meaning to non-psychological data (e.g., medical data involving health-related interventions; booms and busts tied to financial investing behavior). The big data community, and big data themselves, can together propel psychological science forward. In this special issue, we
offer 10 articles that focus on various aspects of big data and how they can be used by applied researchers in psychology and other social science fields. One of the common themes of these articles is also clearly evident in federal funding announcements for big data projects: Psychologists and psychology benefit from the collaboration and contributions of other disciplines—and vice-versa. For example, such collaborations can incorporate cutting-edge breakthroughs from computer science that can
help access and analyze large amounts of data, as well as theory and behavioral science from across the social sciences that offer insight into the areas that are most in need of understanding, prediction, and intervention. A second theme is that data are widely available in open forums such as Facebook, Twitter, and other social media sites, and can offer the opportunity to identify trends and patterns that are important to address. For example, tapping the content of Google
activity could indicate geographic areas where users are inquiring about various flu or other symptoms, thus pointing to areas in which it may be important to focus health intervention efforts. The psychological nature of the query content might allow for early planning in targeting the intervention (e.g., judging the level of knowledge and concern about the health problem and its related symptoms and treatment). Note that when big data analyses incidentally detect a useful signal in the noise
of social media data, one’s discoveries and research efforts need not stop there; researchers can develop new construct-driven measures that help amplify those signals that may have initially been discovered serendipitously. A third general theme is that it is critically important to consider and carefully attend to the ethical issues of big data projects, including data acquisition and security, the protection of the identity of the users who often inadvertently provide extensive
data, and decisions about how the information will be used and interpreted vis-à-vis the nature of the audience or stakeholders involved. A fourth shared theme of these articles is that it is essential to develop theories and hypotheses on an initial training set of data and then verify those findings with other validation datasets, either from a hold-out sample of the original data or from separate, independent data. With the existence of large datasets that often may not have had
an overriding theory or set of hypotheses guiding their formation, an initial analysis of big data is often at the exploratory or data mining level. At least one or more subsequent analyses of separate data may be needed to be able to generalize past the initial data, particularly as there can be a large number of variables that are relevant to prediction, but not necessarily the best measures that one could obtain with additional foresight and planning. Given a large number of incidental
variables, and given the flexible modeling afforded by big data analyses, it is perhaps more important than ever to avoid over-interpreting what might be considered a modern-day version of the classic “crud factor” (Meehl, 1990, p. 108), namely where researchers could find the appearance of relationships between variables in a large dataset that are robustly upheld (e.g., through
cross-validation), yet these relationships may change or dissipate over time, as the nature of the relevant sample, population, and the phenomenon under study change as well. Each of the articles in this special issue addresses one or more of these four themes in relatively easy-to-understand presentations of how big data can be used by researchers in psychology. A summary of the highlights of the articles is presented below, followed by
Appendix A, which provides translational abstracts (TAs) of the articles, briefly describing the essence of the papers in clearly understandable language. Appendix B includes a Glossary of some of the major terms used in the 10 articles, providing brief descriptions of each and an indication of
which articles refer to these terms. To be clear, the Glossary is not intended to provide an exhaustive list of big data concepts; it is more of a summary of some of the ideas and practices that are referred to in these special issue articles so that readers can have a reference of the terminology and find out which special issue articles are discussing them. To help identify which terms are included in the Glossary in
Appendix B, these terms are italicized in this introductory article, although not necessarily in the separate articles themselves. The first article by Chen and Wojcik offers an excellent guide to conducting behavioral science research on large datasets. In addition to describing some background and concepts, they provide three tutorials in the supplemental materials in which
interested readers can move through the steps. Their first tutorial clearly indicates how to acquire the congressional speech data through application programming interfaces (APIs), which implement the specific procedures needed to request data from a site. Their second tutorial demonstrates how these data are analyzed using procedures known as latent semantic analysis (LSA) and latent Dirichlet allocation (LDA) topic modeling, both of which can be
used to assess the co-occurrence of words in a dataset based on underlying topics and relationships between documents. Other terms, common to the big data community and discussed in their main article and their third tutorial, include bag of words, stop words, support vector machines, machine learning, and supervised learning algorithms (see also our Glossary in
Appendix B of this article). Chen and Wojcik also provide two appendices to help apply the material they discuss. Their Appendix A provides the Python code for acquiring data from the Congressional Daily Digest that are discussed in the first and second tutorials, and the use of MySQL.
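The Select–From–Where pattern mentioned here, and illustrated earlier with the GPA example, can be sketched with Python’s built-in sqlite3 module standing in for MySQL; the table, columns, and records below are hypothetical:

```python
import sqlite3

# Build a small in-memory database of hypothetical class records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE class_records (student_id TEXT, major TEXT, gpa REAL)")
conn.executemany(
    "INSERT INTO class_records VALUES (?, ?, ?)",
    [("s001", "Psychology", 3.4),
     ("s002", "Biology", 1.8),
     ("s003", "Psychology", 1.6),
     ("s004", "Economics", 2.9)],
)

# Select a specific group of entities, From a specific table,
# Where a specific condition holds: students with a GPA below 2.0,
# who might then be offered enrichment or tutoring.
at_risk = conn.execute(
    "SELECT student_id, major, gpa FROM class_records "
    "WHERE gpa < 2.0 ORDER BY student_id"
).fetchall()
print(at_risk)  # [('s002', 'Biology', 1.8), ('s003', 'Psychology', 1.6)]
```

A real project would run the same Select–From–Where query against a MySQL server rather than an in-memory SQLite database; the query logic is identical.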
Their Appendix B offers a checklist for conducting research with big data. In the second article, Landers et al. discuss web scraping, an automated process that can quickly extract data from websites behind the scenes. Behavioral scientists are increasingly involved in this type of research, within academia and in organizations, determining the pulse of social consciousness
and norms on web sources such as Facebook, Twitter, Instagram, and Google. Along with delineating potential benefits of web scraping, Landers et al. also provide their expert advice on the need to emphasize theory in such a project. In particular, they discuss what they call theory of the data source or data source theory to help ensure the relevance and meaningfulness of data that are obtained from web scraping. Although there are not yet exact standards on the ethics of
scraping the web for data, Landers et al. suggest that the APA Ethical Principles of Psychologists and Code of Conduct (2010), along with those from the Data Science Association, can suggest policies and procedures for collecting data in a responsible manner that respects the participants and the research field in which conclusions will be shared. Assessing large datasets that are
gleaned or scraped from the web using the theory-driven method suggested by Landers et al. can help lessen the possibility that the findings are just happenstances of a large collection of information. The third article, by Kosinski et al., discusses how to use large databases collected from the web to understand and predict a relevant outcome. Their paper is a tutorial that describes an example of using Facebook digital footprint data, stored in what is called a
user-footprint matrix, to predict personality characteristics. The authors analyze input from over 100,000 Facebook users (see myPersonality project, http://www.mypersonality.org/; Kosinski, Matz, Gosling, Popov, & Stillwell, 2015) using dimension-reduction
procedures such as singular value decomposition (SVD), a computationally efficient method for conducting principal components analysis. The Kosinski et al. article also discusses a clustering procedure known as latent Dirichlet allocation (LDA) to help form dimensions with similar content from large datasets of text or counts of words or products. Findings from an LDA model can be visually depicted in a heatmap that shows darker
colors when a trait or characteristic is more correlated with one of the LDA clusters. Thus, you can see at a glance the patterns that characterize each cluster. In the fourth article, Kern et al. discuss the analysis of big data found on social media, such as on Facebook and Twitter. The authors discuss several steps in acquiring, processing, and quantifying these kinds of data, so as to make them more manageable for statistical analyses. The authors discuss the World Well-Being
Project and use LDA or latent semantic analysis, which helps reduce large amounts of text-based information into a smaller set of relevant dimensions. They also discuss a procedure known as differential language analysis, encouraging the use of database management systems that pervade the world of business and increasingly are being implemented in psychological research. Cautioning that results could be specific to a particular dataset and need to be further tested
with independent data, Kern et al. explain and implement the k-fold cross-validation method that tests a prediction model across repeated subsets of a large dataset to support the robustness of the findings. The authors also discuss prediction methods such as the lasso (i.e., least absolute shrinkage and selection operator) as a regression method for robust prediction, based on screening a large set of predictors and weighting predictors that were selected
conservatively (i.e., with lower magnitudes than traditional OLS regression). They also caution against ecological fallacies, whereby researchers derive erroneous conclusions about individuals and subgroups based on results from a larger group of data, and exception fallacies, when a conclusion is drawn based on outliers (exceptions) in the data that may stand out but may not fully represent the group. Not everyone uses social media, and some use it far more often or
idiosyncratically than others. Still, these authors are optimistic about the amount and richness of the data that can be gleaned from social media, and the insights that can be gained from such data. In the fifth article, Jones, Wojcik, Sweeting, and Silver examine the content of Twitter posts after three different traumatic events (violence in or near college campuses), applying linguistic analyses to the text for negative emotional responses. They discuss a procedure known as
Linguistic Inquiry and Word Count and the R-based twitteR package to analyze such data. Using an innovative approach, the authors recognize pertinent Twitter users by identifying people who follow relevant community networks tied to the geographical area of the event, and they are careful to compare results with control groups not similarly geographically situated, to help ensure that results were event-driven rather than reflecting other contemporaneous events that were more
geographically widespread. Overall, this work demonstrates how psychological themes can be reliably extracted and related to region- and time-dependent events, similar to prior related work in the health arena.

In the sixth article, Stanley and Byrne contribute a theory-driven approach to big-data modeling of human memory (i.e., long-term knowledge storage and retrieval), testing two theoretical models that predict the tags that users apply to Twitter and Stack Overflow posts. Incorporating but going beyond the psychological tenet that “past behavior predicts future behavior,” the current models robustly predict how and to what extent this tenet applies given the nature, recency, and frequency of past behavior. This paper exemplifies an important general point, that big-data analyses benefit from being theory-driven, demonstrating how theories can develop in their usefulness as a joint function of empirical competition (i.e., deciding which model affords better prediction) and empirical cooperation (i.e., demonstrating how model ensembles might account for the data more robustly than models taken individually). The authors discuss the use of an ACT-R based Bayesian model and a random permutation model to understand and clarify predictions about links between processes and outcomes.

The seventh article, by Brandmaier et al., discusses ensemble methods that the authors developed, one of which, called structural equation model (SEM) trees, combines decision trees (also called recursive partitioning methods) with SEM to understand the nature of a large dataset. These authors suggest an extended method called SEM forests that allows researchers to generate and test hypotheses, combining both data- and theory-based approaches. These and other methods, such as latent class analysis and multiple sample SEM, help in assessing distinct clusters in the data.
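The core step in the decision-tree (recursive partitioning) methods that SEM trees build on can be illustrated with a minimal sketch in plain Python: scan candidate cutoffs on a predictor and keep the one that best separates the two resulting groups on the outcome. The data and the scoring rule here are hypothetical simplifications, not the SEM-based criterion Brandmaier et al. use:

```python
def best_split(x, y):
    """Find the cutoff on predictor x that best separates outcome y,
    scored by the total within-group sum of squared deviations."""
    def sse(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    best = (None, float("inf"))
    for cut in sorted(set(x))[:-1]:  # candidate cutoffs at observed values
        left = [yi for xi, yi in zip(x, y) if xi <= cut]
        right = [yi for xi, yi in zip(x, y) if xi > cut]
        score = sse(left) + sse(right)
        if score < best[1]:
            best = (cut, score)
    return best

# Hypothetical data: the outcome jumps once the predictor exceeds 3.
x = [1, 2, 3, 4, 5, 6]
y = [0.9, 1.1, 1.0, 4.9, 5.1, 5.0]
cut, score = best_split(x, y)
print(cut)  # 3 -- the split that makes both groups most internally coherent
```

A full decision tree applies this step recursively within each of the two groups, which is how the method captures interactions among predictors.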
Several methods are described to gauge how effectively an SEM forest is modeling the data, such as examining variable importance based on out-of-bag samples from the SEM trees, as well as case proximity and, conversely, an average dissimilarity metric, the latter indicating a case’s novelty. Brandmaier et al. provide two examples to demonstrate the use of SEM forests. Interested researchers can conduct similar analyses using Brandmaier’s (2015) semtree package, which is written in R, with the supplemental material providing the R code for the examples.

In the eighth article, Miller, Lubke, McArtor, and Bergeman detail a new method for detecting robust nonlinearities and interactions in large data sets based on decision trees. Called multivariate gradient boosted trees, this method extends a well-established machine-learning or statistical learning theory method. Whereas most predictive models in the big data arena seek to predict a single criterion, the present approach considers multiple criteria to be predicted (as does the Beaton et al. partial least squares correspondence analysis method). Such exploration is useful for informing and refining theories, measures, and models that take a more deductive approach. To do this, a boosted tree-based model for each outcome is fit separately, where the goal is to minimize cross-validated prediction error across all outcomes. An advantage of tree-based methods comes in detecting complex predictive relationships (interactions and nonlinearities) without having to specify their functional form beforehand. In the current approach, tree models can be compared across outcomes, and the explained covariance between pairs of outcomes can also be explored. The authors illustrate this approach using measures of psychological well-being as predictors of multiple psychological and physical health outcomes. Interested readers can apply this method to their own data with Miller’s R-based mvtboost package.
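The boosting idea behind such tree methods can be conveyed with a deliberately simplified, hypothetical sketch (a single outcome and one-split “stump” models, in plain Python): each new model is fit to the residuals of the ensemble so far, and a small fraction of its predictions is added in:

```python
def fit_stump(x, r):
    """One-split regression tree fit to residuals r: choose the cutoff and
    left/right means that minimize squared error."""
    best = None
    for cut in sorted(set(x))[:-1]:
        left = [ri for xi, ri in zip(x, r) if xi <= cut]
        right = [ri for xi, ri in zip(x, r) if xi > cut]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = sum((ri - (ml if xi <= cut else mr)) ** 2 for xi, ri in zip(x, r))
        if best is None or err < best[0]:
            best = (err, cut, ml, mr)
    _, cut, ml, mr = best
    return lambda xi: ml if xi <= cut else mr

def boost(x, y, rounds=50, rate=0.1):
    """Gradient boosting for squared error: repeatedly fit stumps to the
    residuals of the current ensemble, adding shrunken predictions."""
    pred = [0.0] * len(y)
    for _ in range(rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, resid)
        pred = [pi + rate * stump(xi) for xi, pi in zip(x, pred)]
    return pred

# Hypothetical data with a two-level pattern the ensemble gradually learns.
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.8, 4.8, 5.2, 5.0]
pred = boost(x, y)
print([round(p, 1) for p in pred])
```

Miller et al.’s multivariate extension fits such boosted ensembles for several outcomes while minimizing cross-validated error across all of them; the sketch above shows only the residual-fitting mechanic.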
In the ninth article, Chapman, Weiss, and Duberstein consider measure-development models that focus squarely on predictive validity using a machine-learning approach that challenges—and complements—traditional approaches to measure development involving psychometric reliability. The proposed approach seeks out additional model complexity so long as it is justified by increased prediction; the approach incorporates k-fold cross-validation methods to avoid model overfitting. Almost two decades ago, McDonald’s (1999) classic book, Test theory: A unified treatment, also suggested that measures of a construct judged to be similar should not only demonstrate psychometric reliability, but also show similar relationships with measures of other constructs in a larger nomological net. The current big-data paper reflects one important step toward advancing this general idea, discussing procedures and terms such as elastic net, expected prediction error, generalized cross-validation error, stochastic gradient boosting, and supervised principal components analysis, as well as the R-based computer packages glmnet and superpc.

In the tenth and final article, Beaton, Dunlop, and Abdi jointly analyze genetic, behavioral, and structural MRI data in a tutorial for a generalized version of partial least squares called partial least squares correspondence analysis (PLSCA). The method can handle disparate data types that are on widely different scales, as might become increasingly common in large and complex data sets. In particular, their methods can accommodate categorical data when analyzing relationships between two sets of multivariate data, where traditional analyses assume the data for each variable are continuous (or even more strictly, multivariate normal). These authors have developed a freely available R package, TExPosition, which allows readers to apply the PLSCA method to their own data.
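The k-fold cross-validation that several of these articles rely on can be sketched in a few lines of plain Python; the data are hypothetical, and the “model” is just a mean fitted to the training folds, to keep the mechanics visible:

```python
def k_fold_indices(n, k):
    """Partition indices 0..n-1 into k roughly equal folds."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds

def cross_validated_error(y, k=5):
    """Each fold serves once as test data: the model (here, simply the mean
    of the training folds) is developed without it, then scored on it."""
    total, n = 0.0, len(y)
    for test in k_fold_indices(n, k):
        held_out = set(test)
        train = [y[i] for i in range(n) if i not in held_out]
        model = sum(train) / len(train)                   # fit on k-1 folds
        total += sum((y[i] - model) ** 2 for i in test)   # score on held-out fold
    return total / n

# Hypothetical outcome data.
y = [2.0, 2.1, 1.9, 2.2, 1.8, 2.0, 2.1, 1.9, 2.0, 2.0]
print(round(cross_validated_error(y, k=5), 4))
```

Because every observation is predicted by a model that never saw it, the resulting error estimate guards against the overfitting these authors caution about.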
In closing, we hope you find something of interest to you in one or more of the 10 articles we present in this special issue on the use of big data in psychology. We recognize that other articles may approach these topics differently, and likewise, many other big data topics will be discussed in the future. We look forward to continued tutorials and other research publications in Psychological Methods that share even more about how to apply innovative and informative big data methods to meaningful and relevant data of interest to researchers in psychology and related social science fields.

Acknowledgments

The co-editors (Harlow and Oswald) would like to thank the authors and reviewers who contributed to this special issue. We also would like to offer much appreciation and thanks to our manuscript coordinator, Meleah Ladd, who has played an integral part in helping to make every aspect of our work better and more enjoyable, and especially so with this special issue. Lisa Harlow also extends thanks to the National Institutes of Health grant G20RR030883.

Appendix A: Translational Abstracts (TAs) for the 10 Special Issue Articles
Appendix B: Glossary of Some of the Major Terms Used in the 10 Special Issue Articles

ACT-R based Bayesian models are based on the ACT-R theory of declarative memory that can be operationalized as a big data predictive model, reflecting how declarative memory processes (e.g., exposure, learning, recall, forgetting) affect behavioral outcomes. The predictive model incorporates a version of the Naïve Bayes method, such that any piece of knowledge is assigned a prior probability for being retrieved by the user, independent of all other pieces of available knowledge, which is then weighted by the information in the current context to yield a posterior distribution and prediction. See Stanley and Byrne.

APA Ethical Principles of Psychologists and Code of Conduct (2010), along with those from the Data Science Association, suggest policies and procedures for collecting data in a responsible manner that respects the participants and the research field in which conclusions will be shared. See Landers et al.

Application Programming Interfaces (APIs) refer to sets of procedures that software programs use to request and access data in a systematic way from other software sources (APIs can be web-based or platform-specific). See Chen and Wojcik; Jones et al.; Kern et al.; and Stanley and Byrne.

Average dissimilarity is a general term indicating how different a case tends to be from the rest of the data. See Brandmaier et al.

Bag of words conveys word frequency in a relevant text (e.g., sentence, paragraph, entire document), without retaining the ordering or context of the words. See Chen and Wojcik.

Case proximity is a general term for the similarity between entities in a data set, identifying any clear outliers. See Brandmaier et al.

Crud factor (Meehl, 1990, p. 108) is a general term used to indicate that in any psychological domain, measures of constructs are all correlated with one another, at some overall level.
Traditional analyses have dealt with this, as will big data analyses. See Harlow and Oswald.

Data source theory refers to a well-thought-out theoretical rationale, developed on the basis of the available variables in a given set, to support the nature of the data and the findings derived from them. Researchers working with big data projects are encouraged to have a data source theory to guide exploration, analyses, and empirical results in large data sets. See Landers et al.

Database management system (DBMS) is a structure that can store, update, and retrieve large amounts of data that can be accrued in research studies. See Kern et al.

Data Science Association (http://www.datascienceassn.org/) is an educational group that offers guidelines for researchers to follow regarding ethics and other matters relevant to organizations. See Landers et al.

Decision trees (also called recursive partitioning methods) are models that apply a series of cutoffs on predictor variables, such that at each stage of selecting a predictor and cutoff point, the two groups created by the cutoff are as separated (i.e., internally coherent and externally distinct) as possible on the outcome variable. Decision trees model complex interactions, because each split of the tree on a given predictor is dependent on all splits from the previous predictors. See Brandmaier et al., and Miller et al.

Differential language analysis (DLA) is an empirical method used to extract underlying dimensions of words or phrases without making a priori assumptions about the structure of the language, and then relating these dimensions to outcomes of interest. See Kern et al.

Digital footprint refers to data that can be obtained from various sources such as the web, the media, and other forums in which publicly available information is posted by or stored regarding individuals or events. These kinds of data can be stored in what is called a User-Footprint Matrix. See Kosinski et al.
Ecological fallacies are incorrect conclusions made about individual people or entities that are derived from information that summarizes a larger group. For example, if a census found that higher educational levels were associated with higher income, it would not necessarily be true that everyone with high income had a high level of education. Simpson’s paradox is an extreme example, where each within-group relationship may be different from or even the opposite of a between-group relationship. See Kern et al.

Elastic net refers to a regression model that linearly weights the penalty functions from two regression models: the lasso regression model (applying an L1 penalty that conducts variable selection and shrinkage of non-zero weights) and the ridge regression model (applying an L2 penalty that applies shrinkage, does not select variables, and will include correlated predictors, unlike lasso). See Chapman et al.

Ensemble methods involve the use of predictions across several models. The idea is that combining predictions across models tends to be an improvement over the predictions taken from any single model in isolation. An example of an ensemble method is the structural equation model random forests (see this term, below). See Brandmaier et al.

Exception fallacies involve mistaken conclusions about a group derived from a few unrepresentative instances in which an event, term, or characteristic occurs quite a lot. For example, if one or two participants in a dataset mention the word “sad” many times, it could falsely be surmised that the group of data as a whole experienced depression. See Kern et al.

Expected prediction error (EPE) is an index of accuracy for a predictive model, decomposed into: (a) squared bias (systematic model over- or under-prediction across data sets), (b) variance (fluctuation in the model parameter estimates across data sets), and (c) irreducible error variance (variance that cannot be explained by any model).
Expected prediction error captures the bias-variance tradeoff: Models that are too simple will under-fit the data and show high bias yet low variance in the EPE formula; models that are too complex will over-fit the data and show low bias yet high variance in the EPE formula. See Chapman et al.

Generalized cross-validation error indicates the target to be minimized (the loss function) in k-fold cross-validation: e.g., the sum of squared errors, the sum of absolute errors, or the Gini coefficient for dichotomous outcomes. See Chapman et al.

glmnet is a computer package written in R code by Friedman, Hastie, Simon, and Tibshirani (2016) that fits lasso and elastic-net models, with the ability to graph model solutions across the entire path of relevant tuning parameters. See Chapman et al.

Heatmaps plot the relationships among variables and/or clusters, using colors or shading to indicate the strength of relationship among variables. See Kosinski et al.

k-fold cross-validation involves partitioning a large dataset into k subsets of equal size. First, a model is developed on (k - 1) partitions of the data (the “training” data set); then predicted values from the model are obtained on the kth partition that was held out (the “test” data set). This process is repeated k times so that every partition serves once as test data, and all data therefore have predicted values from models in which they did not participate. See Chapman et al., and Kern et al.

Lasso (least absolute shrinkage and selection operator) is a regression method that helps screen out predictor variables that are not contributing much to a model relative to the others. See Kern et al.

Latent class analysis can help explain the heterogeneity in a set of data by clustering individuals into unobserved types, based on observed multivariate features. Features may be continuous or categorical in nature. See Brandmaier et al.
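In the special case of standardized, uncorrelated predictors, the lasso’s combination of selection and shrinkage reduces to a simple soft-thresholding rule, sketched below with hypothetical OLS coefficients (software such as glmnet handles the general, correlated case):

```python
def soft_threshold(b, lam):
    """Lasso estimate for one coefficient under an orthonormal design:
    shrink the OLS estimate b toward zero by lam, and select out (zero)
    any coefficient whose magnitude falls below lam."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0

ols = [2.5, -0.3, 0.8, 0.1]  # hypothetical OLS coefficients
lasso = [round(soft_threshold(b, 0.5), 4) for b in ols]
print(lasso)  # [2.0, 0.0, 0.3, 0.0]
```

The two small coefficients are screened out entirely, while the surviving coefficients are shrunk (weighted more conservatively) relative to their OLS values, matching the lasso definition above.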
Latent Dirichlet allocation (LDA) is a method that models words within a corpus as being attributable to a smaller set of unobserved categories (topics) that are empirically derived. See Chen and Wojcik.

Latent semantic analysis (LSA) involves the examination of different texts, where it is assumed that the use of similar words can reveal common themes across different sources. See Chen and Wojcik, and Kern et al.

Linguistic inquiry and word count (LIWC) is a commercial analysis tool for matching target words (words within the corpus being analyzed) to dictionary words (words in the LIWC dictionary). Target words are then characterized by the coded features of their matching dictionary words, such as their tense and part of speech, psychological characteristics (e.g., affect, motivation, cognition), and type of concern (e.g., work, home, religion, money). See Jones et al.

Machine learning, which has also been called statistical learning theory, is a generic term that refers to computational procedures for identifying patterns and developing models that improve the prediction of an outcome of interest. See Chapman et al.; Chen and Wojcik; Harlow and Oswald; Kern et al.; and Miller et al.

Multiple sample structural equation modeling (SEM) helps in testing differences across the different clusters that emerge, to identify the patterns of heterogeneity. See Brandmaier et al.

Multivariate gradient boosted trees involve a nonparametric regression method that applies the idea of stochastic gradient boosting to trees (see stochastic gradient boosting). Trees are fitted iteratively to the residuals obtained from previous trees, while seeking to optimize cross-validated prediction across multiple outcomes (not just one). See Miller et al.

mvtboost is a package written in R code by Miller that implements multivariate gradient boosted trees, allowing the user to tune and explore the model. See Miller et al.
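The dictionary-matching idea behind tools such as LIWC can be sketched in a few lines. The tiny category dictionary below is hypothetical and exists only to illustrate the matching step; the actual LIWC dictionary is far larger and proprietary.

```python
from collections import Counter

# Hypothetical category dictionary mapping words to the psychological
# categories they signal (NOT the real LIWC dictionary).
CATEGORY_DICT = {
    "sad": ["negative_affect"],
    "happy": ["positive_affect"],
    "work": ["work"],
    "money": ["money", "work"],
}

def categorize(text):
    """Count how often each category's dictionary words appear in a text."""
    counts = Counter()
    for word in text.lower().split():
        # Strip trailing punctuation before matching against the dictionary.
        for category in CATEGORY_DICT.get(word.strip(".,!?"), []):
            counts[category] += 1
    return counts
```

Each matched target word increments every category its dictionary entry belongs to, so one word can contribute to several category counts at once.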
MyPersonality project (http://www.mypersonality.org/; Kosinski, Matz, Gosling, Popov, and Stillwell, 2015) stores the scores from dozens of psychological questionnaires, as well as Facebook profile data, of over six million participants. See Kosinski et al.

MySQL is an open-source version of a structured query language for working with big data projects. See Harlow and Oswald, and Chen and Wojcik.

Novelty refers to how different a case is from the rest of the data, showing little proximity and more dissimilarity. See Brandmaier et al.

Out-of-bag samples are portions of a larger dataset that do not participate in the development of a predictive model and can therefore be used to generate predicted values (and prediction error). Out-of-bag samples are similar to the test-sample data referred to previously under k-fold cross-validation. See Brandmaier et al.

Partial least squares correspondence analysis (PLSCA) is a generalization of partial least squares that can extract relationships from two separate sets of data measured on the same sample. In particular, PLSCA is useful for handling both categorical and continuous data types (e.g., genetic single-nucleotide polymorphisms, which are categorical, and behavioral data, which are roughly continuous). Permutation tests and bootstrapping are applied to conduct statistical inference for the overall fit of the model, as well as inference on the stability of each obtained component. See Beaton et al.

Random permutation model is an approach for determining whether to preserve information about word order in text analytics, in case doing so provides additional predictive information. Permutations create uncorrelated vectors as a point of contrast with the actual ordering. See Stanley and Byrne.

semtree is a computer package developed by Brandmaier (2015; http://brandmaier.de/semtree/) and written in R. It can be used to analyze SEM tree and forest methods to help explore and discern clusters or subgroups within a large dataset.
See Brandmaier et al. and related references.

Singular value decomposition (SVD) is a procedure used to reduce a large set of variables or items to a smaller set of dimensions. It is one approach to conducting a principal components analysis. See Kosinski et al.

Stack Overflow is an online question-and-answer forum for programmers (using R, Python, and otherwise). See Stanley and Byrne.

Stochastic gradient boosting is a general term for an iterative method of regression, such that the predictor entered first has the highest functional relationship with the outcome; then residuals are created, and the same rule is applied (where the outcome now becomes the residuals). Also, at each iteration, only a subset of the data is used, to help develop more robust models (where out-of-bag prediction errors can be obtained from the data outside of the model). The learning rate and number of iterations are, loosely, inversely related (a low learning rate, or improvement in prediction at each step, generally means more iterations), and optimizing them can be explored and supported through cross-validation. See Chapman et al.

Stop words are words that are not essential to a phrase or text and therefore can be omitted to help keep a file more concise. Examples of stop words include "an" and "the" or other similarly nondescript words that can be deleted from a large database (e.g., Twitter, Facebook) and do not need to be analyzed. See Chen and Wojcik; Kern et al.; and Stanley and Byrne.

Structural equation model (SEM) forests are classification procedures that combine SEM and decision-tree (SEM-tree) methods to understand the nature of subgroups that exist in a large dataset. SEM forests extend the method of SEM trees by resampling the data to form aggregates of the SEM trees that should have less bias and more stability. See Brandmaier et al.

Structural equation model (SEM) trees combine the methods of decision trees and SEM to conduct theory-guided analysis of large datasets.
SEM trees are useful in examining a theoretically based prediction model, but can be unstable when random variation in the data is inadvertently featured in a decision tree. See Brandmaier et al.

superpc is a computer package written in R code by Bair and Tibshirani (2010) that conducts supervised principal components analysis, a term that is defined below. See Chapman et al.

Supervised learning algorithms are procedures that are developed on a training dataset and then used to build models that predict an outcome from one or more variables. See Chen and Wojcik.

Supervised principal components analysis (SPCA) is a generalization of principal components regression that first selects predictors with meaningful univariate relationships with the outcome and then performs principal components analysis. Cross-validation is used to determine the appropriate threshold for variable selection and the number of principal components to retain. See Chapman et al.

TExPosition is a computer package written in R code by Beaton and colleagues that implements partial least squares correspondence analysis (defined previously). See Beaton et al.

Theory of the data source is the process whereby a larger conceptual framework is adopted when analyzing and interpreting findings from a large dataset, particularly one obtained for another purpose, such as with web scraping of generally available data. See Landers et al.

twitteR is a package written in R code by Jeff Gentry that accesses the Twitter API (see the glossary entry on this term), allowing one to extract subsets of Twitter data found online, search the data, and subject the data to text analyses. See Jones et al.

User-footprint matrix holds information obtained from sources such as the web or various records and lists. See Kosinski et al.
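Stop-word removal, described earlier in this glossary, is the simplest of these text-preparation steps to sketch. The stop-word list below is a tiny illustrative subset; real analyses use much longer lists.

```python
# Tiny illustrative stop-word list (real lists contain hundreds of words).
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in", "is"}

def remove_stop_words(text):
    """Drop nondescript words from a text before analysis."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]
```

Filtering at this stage shrinks the corpus before more expensive steps, such as dictionary matching or topic modeling, are run.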
Variable importance is a term indicating how much the inclusion of a specific variable will reduce the degree of uncertainty in a model (or models) of interest. The uncertainty criterion and the model must, of course, be mathematically formalized. See Brandmaier et al.

Web scraping is a process that culls large amounts of data from web pages to be used in observational or archival data collection projects. See Landers et al.

World Well-Being Project (WWBP, http://www.wwbp.org/) involves a collaboration of researchers from psychology and computer science. The project draws on language data from social media to study evidence for well-being that can be revealed through themes of interpersonal relationships, successful achievements, involvement with activities, and indications of meaning and purpose in life. See Kern et al.

Footnotes

A draft of a portion of this introduction was previously presented in Harlow, L. L., & Spahn, R. (2014, October). Big data science: Is there a role for psychology? Abstract for Society of Multivariate Experimental Psychology, Nashville, TN.
Contributor Information

Lisa L. Harlow, Department of Psychology, University of Rhode Island.

Frederick L. Oswald, Department of Psychology, Rice University.