
Correlation Between Tweet Clusters and Cultural Hotspots in Area of New York State

1     Introduction

Twitter is a social platform where numerous users express their personal opinions about a certain social or private event.

In a world where people constantly use numerous social media platforms to share their day-to-day activities, Twitter has remained very popular and has a large number of monthly active users, which in turn has helped a great deal in social media and industry research (Ahmed, 2017). Rochester, although a small city, is home to two major universities, Rochester Institute of Technology (RIT) and University of Rochester (U of R), and numerous business organizations (Rochester, 2017). These can be called hotspots for work and study related tweets. Similarly, Rochester has plenty of restaurants across the city, offering varied cuisines and serving all age groups (Rochester, 2017). These can be called hotspots for food and entertainment related tweets.

Since Twitter data has rich content about user opinions regarding an event, it is being used by researchers in various research fields such as sentiment analysis and disease surveillance (Song & Xia, 2016; Allen et al., 2016).

The aim of this project is to introduce a data mining approach based on geotags from Twitter data to determine the cultural hotspots in a city, so as to understand the correlation between the physical environment and the social phenomena of people's interests. MacEachren et al. (2011) developed a web-based geovisual analytics application through which information collected from social media, Twitter data in particular, can be analyzed to support situational awareness.

Another important area of research where numerous data mining approaches have been used on Twitter data is opinion mining and sentiment analysis. Gokulakrishnan et al. (2012) try out different preprocessing methods and classification algorithms on Twitter data to draw conclusions on opinion classification and sentiment analysis of Twitter data, which is very different from normal text data given the 140-character restriction and the wide usage of hashtags and abbreviations.

The University of Wisconsin-Madison Libraries research guide (2017) states that a geographic information system (GIS) is a system designed to capture, store, manipulate, analyze, manage, and present all types of geographical data. The research guide also mentions that GIS-related geospatial data and spatial analytics can be used in important analyses such as where a particular feature is abundant or scarce, activities in an area of interest (AOI), and events happening around a feature or phenomenon.

Given the in-depth analysis this project requires, GIS tools such as ArcGIS Maps and ArcGIS Pro will be used for multi-level analysis.

2     Problem

2.1    Problem statement

With numerous active social media users able to express their opinions, businesses find it challenging to tend to people's fast-growing cultural needs and expectations.

2.2    Motivation to study, significance of problem, and potential benefits

As stated in the problem statement, people are highly dependent on social media for their day-to-day activities. They increasingly depend on social media for, say, reviews of a particular restaurant, the business growth of a company, or the quality of courses offered at a university. So, it has become vital for businesses to be on their toes and keep up with the new technologies and trends that attract more people. When active users tweet about a particular place or event, the tweets consist of rich content and keywords regarding that place or event.

From the user's perspective, it is just another piece of information. But from the perspective of research and analysis, the keywords say a lot about the place or event and can be used for analytical purposes.

This project focuses on one such idea: determining cultural hotspots from tweet clusters.

Tweet clusters typically refer to groups of tweets that have been posted from a particular area. A denser cluster means that many users are interested in the event or place. These tweet clusters can be obtained once the preprocessed Twitter data is clustered using a suitable clustering algorithm like k-means. People tweet from restaurants, workplaces, social events like concerts or football games, and even from universities.

By finding a correlation between cultural hotspots and tweet clusters, it is possible to draw important conclusions about people's interests and tastes regarding an event or a place. For instance, when all the geotagged tweets from a particular restaurant's location are analyzed and clustered, a cluster of higher density may mean that more people are interested in that restaurant, or that the restaurant is located in a cultural hotspot, such as near a university, which attracts many customers. Oku and Hattori (2015) built a recommender system for tourist spots by mapping geotagged tweets to tourist spots based on the activity region of each spot. They used One-Class Support Vector Machines to detect areas of substantial activity near target spots based on tweets and photographs that were explicitly geotagged. Another instance is Zhao (2013), who used Twitter data to analyze the business development of the Australian Department of Immigration and Citizenship (DIAC) by studying the DIAC Twitter account and its tweets with text mining techniques to find out how the tweets spread over the Twitter network.

With such analysis, businesses that are looking to expand, or people who are considering starting a new business, can get a fair idea of customers' preferences in that area and draw conclusions about the optimum location to start or expand their business. Based on these conclusions, it is possible to further compare the analysis results of multiple locations so as to identify the key features that interest people.

Such findings can be used by businesses to improve their strategies and grow economically.

With the use of geotagged Twitter data, this analysis can answer important research questions, such as what the major areas of interest in a given location are, or which features are abundant or scarce in a given location. This project uses numerous Geographic Information Science (GIS) tools and concepts, such as ArcGIS Map, ArcGIS Pro, spatial analysis, and spatial clustering, and is thus expected to widen the applications of GIS. There can be cases where tweets are not explicitly geotagged.

In such cases, the Twitter Search API has a geocode parameter through which users can pass a specific latitude, longitude, and radius to cover areas of interest at a granular level. This way, even tweets that were not geotagged can be included in the data set.

2.3    Project Goals

The main goal of this project is to find a correlation between tweet clusters and significant cultural hotspots by analyzing geotagged Twitter data. The focus will be to analyze Twitter data using data mining techniques and GIS tools to find tweet clusters, and to compare the clusters to significant cultural hotspots of the area to draw conclusions about the trend of the tweets from a particular location.

3.0   Plan

The first step of this capstone project will be to collect geotagged Twitter data. This project requires Twitter data from New York State, and the data should consist of tweets related to food, entertainment, work, and study.

There are numerous ways by which researchers can obtain Twitter data. Some feasible methods are to access an existing Twitter dataset, retrieve data from the Twitter API, or purchase data from Twitter. The number of tweets required to build the data set will depend on the original tweets received and on whether they meet the required criteria for this project, such as whether the tweets are geotagged and what the major keywords of the original tweet are.
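The selection criteria described above could be checked with a small filter such as the following. This is a minimal sketch only: the record fields ("id", "text", "coordinates"), the keyword lists, and the sample tweets are assumptions made for illustration and do not reflect the exact Twitter API schema or the keywords the project will finally use.

```python
# Hypothetical keyword lists for the four topics named in the plan.
TOPIC_KEYWORDS = {
    "food": {"restaurant", "pizza", "dinner", "lunch"},
    "entertainment": {"concert", "movie", "game"},
    "work": {"office", "meeting", "job"},
    "study": {"exam", "lecture", "homework", "campus"},
}


def matched_topics(text):
    """Return the set of topics whose keywords appear in the tweet text."""
    words = {w.strip(".,!?#@").lower() for w in text.split()}
    return {topic for topic, kws in TOPIC_KEYWORDS.items() if words & kws}


def meets_criteria(tweet):
    """Keep a tweet only if it is geotagged and mentions a topic keyword."""
    has_geo = tweet.get("coordinates") is not None
    return has_geo and bool(matched_topics(tweet["text"]))


# Made-up sample records; coordinates are (latitude, longitude) or None.
sample = [
    {"id": 1, "text": "Great pizza near campus!", "coordinates": (43.0845, -77.6749)},
    {"id": 2, "text": "Long meeting today", "coordinates": None},
    {"id": 3, "text": "Nice weather", "coordinates": (43.1566, -77.6088)},
]

kept = [t["id"] for t in sample if meets_criteria(t)]
print(kept)
```

Only the first sample tweet is both geotagged and on topic, so it is the only one retained.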

Ideally, a data set consisting of 10,000 to 20,000 tweets will be optimum for this project.

The second step will be to analyse and classify the Twitter data. Tweets can contain redundancies because numerous users retweet a particular tweet; that is, users responding to a tweet will include the original tweet. So, the data has to be cleaned to make sure all the tweets in the data set are unique. Next, since the data set will be very broad, the data must be classified using a classifier tool like Weka with a suitable algorithm.
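The cleaning part of this step might look like the sketch below. It assumes that retweets carry the conventional "RT @user:" prefix and that two tweets are duplicates when their remaining text matches after lowercasing; both are simplifying assumptions, and the "id" and "text" fields and sample tweets are hypothetical.

```python
import re


def normalize(text):
    """Strip a leading 'RT @user:' marker, lowercase, and collapse whitespace."""
    text = re.sub(r"^\s*RT\s+@\w+:?\s*", "", text, flags=re.IGNORECASE)
    return " ".join(text.lower().split())


def deduplicate(tweets):
    """Keep the first tweet seen for each normalized text."""
    seen = set()
    unique = []
    for tweet in tweets:
        key = normalize(tweet["text"])
        if key not in seen:
            seen.add(key)
            unique.append(tweet)
    return unique


# Made-up sample: tweet 2 is a retweet of tweet 1.
tweets = [
    {"id": 1, "text": "Best wings in Rochester"},
    {"id": 2, "text": "RT @foodie: Best wings in Rochester"},
    {"id": 3, "text": "Finals week at the library"},
]

unique_ids = [t["id"] for t in deduplicate(tweets)]
print(unique_ids)
```

Keeping the first occurrence preserves the original tweet and drops later retweets of it.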

Once the tweets are classified, each category will contain tweets with particular keywords.

The third step of this project will be to perform spatial analysis and spatial clustering on the classified data using GIS tools like ArcGIS Pro with R. Using a suitable and efficient clustering algorithm, the tweet clusters will be obtained. The clusters will represent the activity in a particular location. A higher cluster density will mean that more users tweeted from that location.
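The clustering idea in this step can be illustrated with a plain k-means sketch. It is not the ArcGIS Pro or R workflow the project plans to use; it is a standalone illustration that assumes Euclidean distance on raw (latitude, longitude) pairs is an acceptable approximation over a small area, starts the centroids at the first k points so the result is repeatable, and uses made-up coordinates.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means on (lat, lon) pairs using Euclidean distance.

    Centroids start at the first k points so the result is deterministic.
    """
    centroids = list(points[:k])
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = (
                    sum(lat for lat, _ in members) / len(members),
                    sum(lon for _, lon in members) / len(members),
                )
    return centroids, assignment


# Made-up tweet locations forming two small groups.
points = [
    (43.084, -77.675),
    (43.128, -77.629),
    (43.085, -77.674),
    (43.129, -77.630),
    (43.083, -77.676),
]

centroids, assignment = kmeans(points, k=2)
# Cluster size stands in for density: more tweets means more activity.
sizes = [assignment.count(c) for c in range(2)]
print(assignment, sizes)
```

Here the number of tweets per cluster is used as a simple proxy for cluster density.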

The fourth step will be to use GIS tools like ArcGIS Map to analyse and find the cultural hotspots in the state of New York. ArcGIS Pro has a Mapping Clusters toolset consisting of tools like Hot Spot Analysis and Cluster and Outlier Analysis to evaluate the characteristics of the input data. Cultural hotspots can be places like universities, workplaces, stadiums, concerts, and theatres. A map of all these hotspots will be created.

Finally, the tweet clusters that are obtained based on locations will be overlaid on the map of New York State that was created to represent the cultural hotspots. This will reveal any correlation between the clusters which have higher density and the locations which are considered to be hotspots.
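Outside a GIS tool, the overlay could be approximated numerically by matching each cluster centroid to the nearest hotspot within some distance. The sketch below makes several assumptions: the hotspot coordinates are approximate placeholders, the 1 km threshold is arbitrary, and the centroids are made up; the haversine formula gives the great-circle distance.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def nearest_hotspot(centroid, hotspots, max_km=1.0):
    """Return the name of the closest hotspot within max_km, else None."""
    name, dist = min(
        ((n, haversine_km(centroid, loc)) for n, loc in hotspots.items()),
        key=lambda item: item[1],
    )
    return name if dist <= max_km else None


# Approximate placeholder coordinates for two hotspots (assumption).
hotspots = {
    "RIT": (43.0845, -77.6749),
    "U of R": (43.1284, -77.6289),
}

# Made-up cluster centroids; the last one is far from both hotspots.
cluster_centroids = [
    (43.084, -77.675),
    (43.1285, -77.6295),
    (42.6526, -73.7562),
]

matches = [nearest_hotspot(c, hotspots) for c in cluster_centroids]
print(matches)
```

A centroid that matches no hotspot may point to an area of activity that the hotspot map does not yet cover.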

From this comparison and visual correlation, important conclusions can be drawn regarding people's interests and their corresponding geographic locations.

References

Ahmed, W. (2017). Using Twitter as a data source: An overview of social media research tools (Updated for 2017). Blog post.

Allen, C., Tsou, M., Aslam, A., Nagel, A., & Gawron, J. (2016). Applying GIS and machine learning methods to Twitter data for multiscale surveillance of influenza. 1(7). http://dx.plos.org/10.1371/journal.pone.0157734

Gokulakrishnan, B. et al. (2012). Opinion mining and sentiment analysis on Twitter data stream. IEEE Xplore.

MacEachren, A. M. et al. (2011). SensePlace2: GeoTwitter analytics support for situational awareness. In IEEE Xplore.

Mapping and Geographic Information System (GIS): What is GIS? (2017). In Research Guides, University of Wisconsin-Madison Libraries. Retrieved from https://researchguides.

Oku, K. and Hattori, F. (2015). Mapping geotagged tweets to tourist spots considering activity region of spot. Retrieved from file:///C:/Users/rajashruthi/Downloads/9783662472262-c2.pdf

Rochester, New York. (2017). Retrieved from https://en.,_New_York

Song, Z. and Xia, J. (2016). Spatial and Temporal Sentiment Analysis of Twitter data. In: Capineri, C., Haklay, M., Huang, H., Antoniou, V., Kettunen, J., Ostermann, F. and Purves, R. (eds.) European Handbook of Crowdsourced Geographic Information, pp. 205–221. London: Ubiquity Press. DOI: 5334/bax.p. License: CC-BY 4.0.

Zhao, Y. (2013). Analyzing Twitter data with text mining and social network analysis. In Australian Data Mining Conference.