“Hindsight is an extremely useful thing but unfortunate thing about it is that, it is not available when it is really needed” – Thomas Cover
Information theorist and statistics professor
Abstract: Social media platforms have grown into engines of big data production. People share their personal, professional and social lives on platforms including Facebook, Twitter and Instagram. By analysing social media such as Twitter, we can uncover trending topics, a form of hidden knowledge that can be applied in domains such as politics, marketing and social science. This essay provides an overview of the existing literature and models on predictive analytics for social media, with an emphasis on Twitter data. First, we present statistical issues in predictive modelling with Twitter data. We then address the problems and outcomes of predicting Twitter trends with a time series classification model, with a distributed framework based on the Lambda architecture, with parametric and non-parametric models, and with different distance metrics. Lastly, we analyse our findings from the research papers reviewed.
A social network is the set of connections between social entities. It consists of individuals who are interdependent. Mathematically, we can treat it as a set of objects and the relationships between them, represented by a graph in which the objects are nodes and the relationships are edges. The small world effect (Stanley Milgram, 1969) is a property of social networks based on the six degrees of separation theory, which states that any two people are about six steps away from each other. In the experiment, a message was given to a group of people who were each asked to forward it to a friend they thought would bring it closer to the intended recipient; it was found that the message could be delivered along a short path of about six people. Likewise, any two nodes in a social network graph can be connected by a short path.
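To make the graph view concrete, the following minimal sketch stores a social network as an adjacency list and uses breadth-first search to count the steps on a shortest path between two people. The network `friends` and the function name `shortest_path_length` are hypothetical and chosen only for illustration; they do not come from any of the studies reviewed here.

```python
from collections import deque


def shortest_path_length(graph, source, target):
    """Return the number of edges on a shortest path, or None if unreachable."""
    if source == target:
        return 0
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Hypothetical toy network: keys are people (nodes), values are contacts (edges).
friends = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann", "dan"],
    "dan": ["bob", "cat", "eve"],
    "eve": ["dan"],
}

steps = shortest_path_length(friends, "ann", "eve")  # ann -> bob -> dan -> eve
```

In a small world network, this path length stays small even when the number of nodes is very large.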
Social networks produce big data rapidly and in an unplanned manner. Analysing such data can reveal trends and hidden knowledge. It supports the study of human behaviour and product sentiment and offers insights for marketing analytics. Social data analytics can inform politics by tracking shifts in public opinion (Sandner, Tumasjan, Welpe & Sprenger, 2010), marketing and product development by predicting sales, such as movie revenues ahead of release (Huberman & Asur, 2010), finance by predicting stock market trends (Mao, Zeng & Bollen, 2011), and public health by predicting disease outbreaks (Dredze & Paul, 2011).
Twitter is a global communication network for reaching a large audience. The scale and scope of Twitter data have driven methodological innovations for predicting social media outcomes. On Twitter, people post short messages (tweets) of no more than 140 characters. A topic is a phrase that appears in tweets; for example, “election day” can be a topic shared by many different tweets. A trending topic is a topic that becomes popular. Twitter reports trending topics itself, but as the rate of data production keeps increasing over time, we need to overcome quantitative issues, such as making the algorithm run faster, and qualitative issues, such as presenting relevant data better. Detecting trending topics on Twitter can be treated as a time series problem. The training data is a small percentage of all the tweets in a dataset, while the test data is dynamic, meaning that it is created from real-time tweets.
II. Literature Review
Earlier efforts treated trend detection as a time series classification problem, classifying topics by the Euclidean distance between their time series. Popularity alone is not enough for a topic to be trending; Twitter also weighs newness against popularity. What should be observed is the tempo of communication around a specific topic.
The pioneers of research into Twitter trend prediction (Stanislav Nikolov and Devavrat Shah, MIT, 2012) proposed a nonparametric approach, in which the number of model parameters grows with the data, that compares real-time tweets using a machine learning algorithm with weighted distance calculations. When Twitter announced that Barclays was trending in 2012, the researchers ran their algorithm on that dataset and found that it would have flagged the topic about an hour earlier. For trending topics the model should declare that the topic is trending, and for non-trending topics it should not. The algorithm had a success rate of 95% and an error rate of 4%. It was tested on 250 datasets from the trending and non-trending categories. The model also predicted that Miss USA 2012, Olivia Culpo, would become a trending topic before it trended on Twitter. The output of the nonparametric approach is shown in Figure 1.
Figure 1: Nonparametric time series analysis.
A distributed framework for predicting trending topics with the Lambda architecture (Athena Vakali, Kitmeridis Nikolaos, Panourgia Maria, 2016) combines batch processing and stream processing to build a view of online data. The architecture can cope with the size and scale of the data, which is one of the reasons the team selected it. The key features used are the timestamp and exact date of each topic, normalisation methods for the time series, and the squared Euclidean distance and cosine metrics for comparing time series. The main idea of the research was that, since tweet trends have a short life span, trending topics need to be identified from a real-time representation. The training and test data were therefore updated continuously by the framework and validated against the Twitter API. The architecture consists of three layers. The batch layer uses MapReduce and HDFS to process the dataset. The speed layer uses Apache Storm to handle the data stream and to balance the delay before real-time data is transferred as the test dataset. The serving layer uses Redis servers to store the time series output for the trending topics.
Figure 2: Lambda architecture.
The evaluation showed that the Lambda architecture improved prediction performance: about 80% of the actual trending topics were classified as likely trending topics by this method within an execution period of 48 hours. After 48 hours of execution, the true positive rate was 78% with the cosine metric and 52% with the squared Euclidean distance.
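The two comparison metrics named for this framework can be sketched as follows. This is an illustration only: the normalisation shown here (dividing each series by its total) is an assumption, since the normalisation methods used in the study are not detailed in this essay, and the function names are hypothetical.

```python
import math


def normalise(series):
    """Scale a series so its values sum to 1 (assumed normalisation method)."""
    total = sum(series)
    return [x / total for x in series] if total else list(series)


def squared_euclidean(a, b):
    """Sum of squared point-wise differences between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 minus the cosine of the angle between the two series as vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


# Two hypothetical tweet-count series with the same shape but different volume.
quiet = [1, 2, 3]
loud = [2, 4, 6]

shape_gap = cosine_distance(quiet, loud)      # close to 0: same shape
volume_gap = squared_euclidean(quiet, loud)   # 14: the volumes differ
```

The cosine metric ignores overall volume and reacts only to the shape of the series, while the squared Euclidean distance also penalises differences in volume unless the series are normalised first.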
In a further development, models for predicting the results of football matches from tweets were proposed (Stylianos Kampakis, Andreas Adamides, University College London, 2014), and the authors investigated whether these models can outperform predictive models that use historical data and statistics. A third model was constructed from both historical and Twitter data. When measured by Cohen’s kappa (a statistic that measures the agreement between two raters who each classify the same items, corrected for the agreement expected by chance), the Twitter-based model performed better than the model using historical data and simple statistics. Combining both models gave a performance higher than that of either individual model. Twitter data can therefore provide useful information for predicting football matches.
The datasets used were a Twitter dataset, a historical dataset and a combined dataset. The Twitter dataset was created with Twitter’s streaming API and consists of 2 million tweets by fans about their favourite clubs. A list of hashtags associated with each team was created. Tweets with hashtags for more than one team were discarded, and a tweet with more than one hashtag for the same team was assigned to that team. TwitterNLP was used to process the data. The outcome of each match was recorded as a win for the home team, a win for the away team or a draw. Each match, treated as one instance, has home features, away features and a response variable; the home and away features form the input. For the Twitter model, separate bags of words were used for the home team and the away team as the input dataset, pre-processed using the chi-square test. For example, Arsenal and Manchester City share the same bag of words when playing at home, but a different set of features applies when they play away. Naïve Bayes, random forest, SVM and logistic regression models were used.
The random forest was found to be the best classifier for the Twitter model, with higher accuracy than Naïve Bayes, whereas Naïve Bayes was the best for the historical dataset. Random forest was also the best classifier for the combined model, and it performed best when using bigrams. We can therefore conclude that Twitter contains enough information to help predict the results of a football game.
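Since the football models are compared with Cohen's kappa, a short sketch of the standard formula may help: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The match outcomes below are invented for illustration and are not data from the study; the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: share of items on which both sequences agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance overlap given each sequence's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical outcomes: actual results versus a model's predictions.
actual = ["home", "home", "away", "draw"]
predicted = ["home", "away", "away", "draw"]

kappa = cohens_kappa(actual, predicted)  # (0.75 - 0.3125) / 0.6875 = 7/11
```

A kappa of 0 means the predictions agree with the results no more often than chance would, and a kappa of 1 means perfect agreement, which makes it more informative than raw accuracy when one outcome is much more common than the others.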
The Hash Tag Prediction Algorithm (Tianxi Li & Yu Wu, Stanford University) predicts tags using machine learning. Twitter users use hash tags to classify their tweets. The problem differs from ordinary text classification because in hash tag classification we do not know in advance how many clusters we need to find. Frequent changes in the set of tags also make conventional classification or clustering impractical, since each new tag creates a new class and a new classification rule. By measuring the relation between tweets with a mathematical metric, the tweets can be treated as points in a high-dimensional space, and a latent space model can be used to create the network. A second method is to use the Euclidean distance between points as the measure of their resemblance. The model predicts tags from this distance. In its simplest form it selects the tag of the closest tweet. To increase accuracy, it instead takes several of the closest tweets and predicts from the tag ratio: the n closest tweets are selected first, and further tweets are added one at a time while checking whether one tag dominates. If the ratio of a tag is greater than 50%, it is selected as the predicted tag.
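The distance-based prediction step can be sketched as below. This is our reading of the description above, not the authors' implementation: how tweets are turned into numeric vectors is left open, and the names `predict_tag`, `n` and `ratio` are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_tag(query, labelled, n=3, ratio=0.5):
    """Predict a tag for `query` from (vector, tag) pairs, or return None.

    Start with the n closest tweets, then keep adding the next closest
    tweet until one tag accounts for more than `ratio` of those seen.
    """
    ranked = sorted(labelled, key=lambda item: euclidean(query, item[0]))
    counts = Counter()
    for seen, (_, tag) in enumerate(ranked, start=1):
        counts[tag] += 1
        if seen >= n:
            top_tag, top_count = counts.most_common(1)[0]
            if top_count / seen > ratio:
                return top_tag
    return None


# Hypothetical tweets already embedded as 2-D points with known tags.
tweets = [
    ([0.0, 0.0], "#election"),
    ([0.0, 1.0], "#election"),
    ([1.0, 0.0], "#football"),
    ([5.0, 5.0], "#football"),
    ([6.0, 5.0], "#football"),
]

tag = predict_tag([0.0, 0.2], tweets)  # two of the three nearest are #election
```

Returning `None` when no tag ever exceeds the ratio is a design choice of this sketch; a real system might fall back to the single nearest tweet instead.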
From the research papers discussed we can make the following analysis. In the parametric model of trend detection, we assume a type of activity that usually precedes a topic becoming trending and try to detect that activity. This is done by analysing whether the activity remains stable most of the time and occasionally shows steep jumps. A jump indicates that the topic is becoming popular. We declare a topic to be trending when the jumpiness parameter (p) exceeds a threshold value. This model cannot fully capture all the types of pattern that precede a trending topic: there could be several small jumps leading to a big jump, or a gradual rise without a clear jump. Different parametric models are needed to detect different pattern types.
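A minimal sketch of such a parametric detector is given below. The exact definition of the jumpiness parameter p is not specified in the sources reviewed, so here we assume p is the ratio of the latest tweet count to the mean of the preceding window; the window size and threshold are likewise hypothetical.

```python
def jumpiness(counts, window=5):
    """Assumed definition of p: latest count over the mean of the prior window."""
    if len(counts) < window + 1:
        raise ValueError("need at least window + 1 observations")
    baseline = sum(counts[-window - 1:-1]) / window
    # Guard against a zero baseline for topics with no earlier activity.
    return counts[-1] / max(baseline, 1.0)


def is_trending(counts, threshold=3.0, window=5):
    """Declare a trend when p exceeds the threshold."""
    return jumpiness(counts, window) > threshold


# Hypothetical tweet counts per time step for two topics.
steep_jump = [10, 12, 9, 11, 10, 50]
gradual_rise = [10, 12, 14, 17, 20, 24]

flag_jump = is_trending(steep_jump)      # True: p is about 4.8
flag_rise = is_trending(gradual_rise)    # False: p is about 1.6
```

The second topic shows the limitation described above: a gradual rise never produces a single jump large enough to cross the threshold, so a single-jump model misses it.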
In non-parametric models, the data itself defines the model. We collect data on the patterns that precede topics that trend and topics that do not, and then categorise all the patterns that are likely to occur. The data is used directly to decide whether a pattern will lead to a trend. A pattern type could be a gradual rise, a small jump and then a big jump, or a jump and then a gradual rise, and so on.
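One way to let the data define the model is a distance-weighted vote, sketched below in the spirit of the weighted distance calculations mentioned earlier. It is a simplified illustration rather than the published algorithm: each stored example votes with weight exp(-gamma * d), where d is its squared distance to the observed series, and the parameters `gamma` and `theta` and all example series are hypothetical.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def vote(observed, examples, gamma):
    """Total weight of the examples; closer examples count exponentially more."""
    return sum(math.exp(-gamma * sq_dist(observed, ex)) for ex in examples)


def looks_trending(observed, trending, non_trending, gamma=0.1, theta=1.0):
    """Compare the weighted votes of trending and non-trending examples."""
    return vote(observed, trending, gamma) > theta * vote(observed, non_trending, gamma)


# Hypothetical activity patterns recorded before topics did or did not trend.
trending_examples = [[1, 2, 4, 8], [1, 1, 3, 9]]
non_trending_examples = [[2, 2, 2, 2], [3, 2, 3, 2]]

rising = looks_trending([1, 2, 4, 7], trending_examples, non_trending_examples)
flat = looks_trending([2, 2, 3, 2], trending_examples, non_trending_examples)
```

No pattern type is hard-coded here: adding more examples of gradual rises or repeated small jumps to `trending_examples` extends what the detector recognises without changing the code.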
In the Lambda model, different distance metrics are used to compare time series: DTW (Dynamic Time Warping), which compares time series by dynamically choosing the alignment path so that each point is matched with the point that fits it best; ERP (Edit Distance with Real Penalty), which searches for a minimal path in a distance matrix built from the Euclidean distance between the points of the two series and, unlike DTW, permits gaps (points left unmatched), penalising each gap according to the distance of the unmatched point; and MSM (Move-Split-Merge), which is defined as the cost of the cheapest sequence of operations that transforms the first series into the second. These metrics are the reason the model can identify a share of trends similar to Twitter’s own trending topics and can detect trends early.
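Of these metrics, DTW is the simplest to show in code. The following is a minimal textbook sketch using dynamic programming with the absolute difference as the point-wise cost; it is not the implementation used in the framework discussed here, and it omits common refinements such as a warping window.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two numeric series."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a[i-1] is matched to b[j-1] again
                cost[i][j - 1],      # b[j-1] is matched to a[i-1] again
                cost[i - 1][j - 1],  # both series advance together
            )
    return cost[n][m]


# The second series lingers on the value 2 for one extra step.
stretched = dtw_distance([1, 2, 3], [1, 2, 2, 3])  # 0.0
shifted = dtw_distance([0, 0], [1, 1])             # 2.0
```

Because one point may be matched to several points of the other series, two topics that follow the same pattern at different speeds are still judged similar, which a point-by-point Euclidean comparison would not allow.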
Twitter also contains enough information to be useful for predicting outcomes, even for a football match. The fact that the algorithm performed best with bigrams (which consider two words at a time) suggests that the information contained in tweets is more complex than single words can capture.
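As a brief illustration of what a bigram feature is, the sketch below splits a tweet on whitespace and pairs each word with the next one. The tokenisation is deliberately naive and the example tweet is invented; the study itself processed tweets with TwitterNLP.

```python
from collections import Counter


def bigrams(text):
    """Return consecutive word pairs from a lower-cased, whitespace-split text."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


# Hypothetical tweet; the bag of bigrams counts how often each pair occurs.
pairs = bigrams("Not a great win")
bag = Counter(pairs)
```

The pair ("not", "a") followed by ("a", "great") keeps some of the word order that a bag of single words would lose, which is one plausible reason bigrams carried more predictive information.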
The Hash Tag Prediction Algorithm uses Euclidean distance to measure resemblance. Unlike other models, it has a false positive rate of less than 15%, and it has advantages such as uncomplicated data storage.
A large number of studies have sought to improve predictive analysis of Twitter data, asking whether trends can be predicted earlier and whether there is any information or pattern that can be identified just before a topic becomes trending. Predictive models offer different tools for forecasting trends in social media such as Twitter, ranging from simple models that help us understand the features of the data to complex distributed models. Short-term forecasting can be done effectively by a predictive model built according to the state of the art.
- Athena Vakali, Kitmeridis Nikolaos & Panourgia Maria (2017). A distributed framework for early trending topics detection on big social networks data threads. INNS Conference on Big Data. DOI: 10.1007/978-3-319-47898-2_20.
- Stylianos Kampakis & Andreas Adamides (2014). Using Twitter to predict football outcomes. University College London.
- Tianxi Li, Yu Wu & Yu Zhang (2011). Twitter Hash Tag Prediction Algorithm. Stanford University.
- George H. Chen, Stanislav Nikolov & Devavrat Shah (2012). A Latent Source Model for Nonparametric Time Series Classification.
- Asur, S., & Huberman, B. A. (2010). Predicting the Future with Social Media. Computers and Society.
- Niels Buus Lassen, Lisbeth la Cour & Ravi Vatrapu (2017). Predictive Analytics with Social Media Data. The SAGE Handbook of Social Media Research Methods.
- Lecture 6 – Trend Detection in Twitter Social Data (Analyzing Big Data with Twitter). Berkeley School of Information.
- J. Travers (1969). An experimental study of the small world problem, pp. 425–443.