The popularity of social media, the digitization of large databases, apps for smartphones, laptops and tablets that gather user-generated data, the internet of things, and the reliance on cloud storage have increased the availability of very large data sets for both discovery and predictive analyses. These large data sets, usually referred to as big data, bring difficult problems with them: data mining and statistical analysis methods have to handle the enormous amount of data involved without introducing error into the analysis. The data are there, but making good use of them can be difficult, expensive and time-consuming.
The temptation to use big data without adequate concern for the possible problems involved is great. An internet economy driven by click counts, coupled with data owners such as Facebook and Twitter that are eager to increase profits by selling access to their data, can easily produce click-bait headlines, flawed analyses, or unsound conclusions that are based on poorly understood and badly analyzed big data.
In a recent article published in the journal Science, researchers from Carnegie Mellon and McGill Universities pointed out a number of ways data drawn from social media sites can be compromised. For example, data that are publicly displayed on a social media website (e.g., what’s trending on Twitter) may be filtered by the website owner, and the hows and whys of the filtering process are usually hidden from users. In addition, social media data from real users are likely to be combined with data from bots, paid or malicious spammers, and public relations firms that pretend to be real users as a way to market their clients.
In order to draw justified conclusions about a large group such as adults or voters in the US (usually referred to as the “population” in statistical analyses) based on data drawn from a smaller group such as the users of a social media site (the “sample”), the small group has to have the same characteristics as the large group with regard to the question you are trying to answer. Statisticians call a small group that has the appropriate characteristics a representative sample.
The size of the sample has nothing to do with whether it is representative of the population of interest. You can’t draw justified conclusions about whether men prefer to watch baseball or basketball on TV by sampling women’s TV viewing habits, no matter how many women you include in your sample. The fact that social media sites have an enormous amount of data at their disposal doesn’t mean that samples drawn from these data are representative of the populations that are of interest to their customers. If the sample isn’t representative of the population of interest, it doesn’t matter how big it is; it’s useless for answering questions about the population of interest.
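To make the point about sample size concrete, here is a minimal simulation sketch of the baseball example. The two preference rates are invented purely for illustration (they are assumptions, not survey results), and `estimate_share` is a hypothetical helper. The sketch compares a small sample drawn from the right population (men) with a very large sample drawn from the wrong one (women).

```python
import random

# Invented numbers for illustration only; these are assumptions, not real data.
TRUE_SHARE_MEN = 0.60    # assumed share of men who prefer baseball
TRUE_SHARE_WOMEN = 0.35  # assumed share of women who prefer baseball


def estimate_share(true_share, n, rng):
    """Survey n people from a group and return the observed share preferring baseball."""
    hits = sum(1 for _ in range(n) if rng.random() < true_share)
    return hits / n


rng = random.Random(42)  # fixed seed so the run is repeatable

# A small sample from the population we actually care about (men).
small_men = estimate_share(TRUE_SHARE_MEN, 200, rng)

# A very large sample from the wrong population (women).
huge_women = estimate_share(TRUE_SHARE_WOMEN, 100_000, rng)

print(f"200 men:       estimate {small_men:.3f}, error {abs(small_men - TRUE_SHARE_MEN):.3f}")
print(f"100,000 women: estimate {huge_women:.3f}, error {abs(huge_women - TRUE_SHARE_MEN):.3f}")
```

Under these assumptions, the estimate from 100,000 women settles very close to the women's rate, not the men's. Adding more women only makes the estimate more precisely wrong, while the 200 men give a noisier answer that is at least aimed at the right target.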
A simple and straightforward source of potential problems with social media data is that different social media sites attract different audiences. In an earlier post, The Info Monkey looked at a report from the Pew Research Center on social media demographics which showed that Facebook, Twitter, Instagram, Pinterest and LinkedIn are used by different, and sometimes markedly different, kinds of people. Drawing conclusions about a particular demographic group based on data from a different demographic is, at best, a way to guarantee that your conclusions are unjustified, and is, at worst, a recipe for disaster.
Here’s a fictitious example of the representativeness problem using social media data. Suppose you want to find out which pop music stars are most popular among teenage boys so you can hire these stars to market your product to male teenagers. You decide to answer this question by counting the number of times different stars’ pictures appear on Pinterest. Pinterest tends to be used by suburban women. A sample drawn from Pinterest users is not going to be representative of the population of teenage boys.
Could you answer the question about the popularity of pop music stars among teenage girls with Pinterest data? Pinterest users tend to be female, but they also tend to be 49 years of age or younger, have completed at least a college degree, and have annual incomes greater than $75,000. A sample drawn from Pinterest users is not going to be representative of the population of teenage girls because few, if any, teenage girls have completed college and have annual incomes over $75K. The generally older, more educated, and wealthier women who tend to use Pinterest are unlikely to have the same tastes in music as the younger women whose taste in pop music you are trying to determine.
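One rough way to spot this kind of mismatch before running an analysis is to compare the make-up of the sample with the make-up of the population you care about. The sketch below does that for age groups. The platform's age mix is invented for illustration (it is not the Pew data), and `max_share_gap` is a hypothetical helper, not a standard statistical test.

```python
def max_share_gap(sample_shares, population_shares):
    """Return the largest absolute difference in category shares
    between a sample and the target population."""
    categories = set(sample_shares) | set(population_shares)
    return max(
        abs(sample_shares.get(c, 0.0) - population_shares.get(c, 0.0))
        for c in categories
    )


# Target population: teenagers only, so the whole population sits in one age group.
target_population = {"13-17": 1.00}

# Invented age mix for a hypothetical platform; an assumption, not real data.
platform_sample = {"13-17": 0.05, "18-29": 0.35, "30-49": 0.40, "50+": 0.20}

gap = max_share_gap(platform_sample, target_population)
print(f"Largest gap in age-group share: {gap:.2f}")
```

A gap near zero on every characteristic that matters for your question is what you would hope to see. A large gap, as in this made-up case, is a warning that the sample is drawn mostly from people outside the population of interest. A small gap on age alone would not prove the sample is representative, since education, income, or anything else tied to the question could still differ.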
Here’s a real example of the representativeness problem using social media data. The graphic on the left was used to illustrate a supposed “correlation” between age, the percentage of people within each age group who vote, and the use of social media. You are supposed to draw the conclusion that, for example, people between 35 and 54 tend to use Facebook and Twitter and are more likely to vote than other age groups. This was offered as one piece of evidence that social media are effective in reflecting the views of potential voters in the US. The idea was that you could get an accurate picture of what 35- to 54-year-old voters in the US think about different candidates by seeing what they have to say about those candidates on Facebook and Twitter. Unfortunately, there is a serious representativeness problem here. Facebook users tend to be women who are 49 years of age or younger, have completed some college, and have annual incomes of $30K or less, while Twitter users tend to be non-rural, non-Hispanic blacks who are 29 years of age or younger. Neither group is representative of the population of US voters between the ages of 35 and 54.
In general, data drawn from US users of Twitter, Instagram, Pinterest or LinkedIn do not provide representative samples of the US population. The users of each of these social media platforms are skewed toward different segments of the population. Even Facebook, which according to the Pew report is used by somewhere between 60% and 65% of US adults, is skewed toward women who have completed some college, are aged 49 or younger, and make less than $30K per year. None of these social media platforms produces data that are generally representative of the US population.
If you care at all whether the evidence that is offered supports the conclusions that are reached, then analyses or predictions based on social media data should be considered with a great deal of caution. Anyone who uses social media data as the basis for their argument needs to address how they have dealt with the inherent limitations of these data. If they don’t, it means that either they’re clueless about this stuff, or they understand that their conclusions are not well supported by their data but they think you’re clueless and you’ll believe anything that has numbers in it because you saw it on the internet.