
Correlation Between Tweet Clusters and Cultural Hotspots in
Area of New York State


1     Introduction



Twitter is a social platform where users express personal opinions about
social or private events. In a world where people constantly use social media
to share their day-to-day activities, Twitter has remained popular and retains
a large number of monthly active users, which in turn has made it a valuable
data source for social media and industry research (Ahmed, 2017). Rochester,
although a small city, is home to two major universities, Rochester Institute
of Technology (RIT) and University of Rochester (U of R), and numerous
business organizations (Rochester, 2017). These can be regarded as hotspots
for work- and study-related tweets. Similarly, Rochester has many restaurants
across the city offering varied cuisines and serving people of all age groups
(Rochester, 2017). These can be regarded as hotspots for food- and
entertainment-related tweets.

Since Twitter data is rich in user opinions about events, researchers use it
in fields such as sentiment analysis and disease surveillance (Song & Xia,
2016; Allen et al., 2016). The aim of this project is to introduce a data
mining approach based on geotags in Twitter data to determine the cultural
hotspots in a city, so as to understand the correlation between the physical
environment and the social phenomenon of people’s interests. MacEachren et al.
(2011) developed a web-based geovisual analytics application through which
information collected from social media, Twitter data in particular, can be
analyzed to support situational awareness. Another important research area in
which numerous data mining approaches have been applied to Twitter data is
opinion mining and sentiment analysis. Gokulakrishnan et al. (2012) tried
different preprocessing methods and classification algorithms on Twitter data
to draw conclusions about opinion classification and sentiment analysis.
Twitter data differs considerably from ordinary text, given the 140-character
restriction and the wide use of hashtags and abbreviations.


The research guide of the University of Wisconsin-Madison Libraries (2017)
states that a geographic information system (GIS) is a system designed to
capture, store, manipulate, analyze, manage, and present all types of
geographical data. The guide also notes that GIS-related geospatial data and
spatial analytics can be used to answer questions such as where a particular
feature is abundant or scarce, what activities occur in an area of interest
(AOI), and what events happen around a feature or phenomenon. Given the
in-depth analysis this project requires, GIS tools such as ArcGIS Maps and
ArcGIS Pro will be used for multi-level analysis.


2     Problem

2.1    Problem Statement


With so many active social media users able to express their opinions,
businesses find it challenging to keep up with people’s fast-growing cultural
needs and expectations.




2.2    Motivation to study, significance of problem, and potential benefits

As stated in the problem statement, people depend heavily on social media in
their day-to-day activities. They increasingly rely on it for, say, reviews of
a particular restaurant, news about the business growth of a company, or the
quality of courses offered at a university. It has therefore become vital for
businesses to stay on their toes and keep up with the new technologies and
trends that attract more people. When active users tweet about a particular
place or event, the tweets contain rich content and keywords regarding that
place or event. From the user’s perspective, a tweet is just another piece of
information. From the perspective of research and analysis, however, the
keywords reveal a lot about the place or event and can be used for analytical
purposes.


This project focuses on one such idea: determining cultural hotspots from
tweet clusters. A tweet cluster is a group of tweets posted from a particular
area. A denser cluster indicates that more users are interested in the event
or place. These clusters can be obtained by clustering the preprocessed
Twitter data with a suitable algorithm such as k-means. People tweet from
restaurants, workplaces, social events such as concerts or football games, and
even universities. By finding a correlation between cultural hotspots and
tweet clusters, it is possible to draw conclusions about people’s interests
and tastes regarding an event or a place. For instance, when all the geotagged
tweets from restaurants in a particular location are analyzed and clustered, a
denser cluster may mean that more people are interested in that restaurant, or
that the restaurant is located in a cultural hotspot, such as near a
university, which attracts many customers. Oku and Hattori (2015) built a
recommender system for tourist spots by mapping geotagged tweets to tourist
spots based on the activity region of each spot. They used one-class support
vector machines to detect areas of substantial activity near target spots
based on explicitly geotagged tweets and photographs. In another instance,
Zhao (2013) used Twitter data to analyze the business development of the
Australian Department of Immigration and Citizenship (DIAC), studying the DIAC
Twitter account and its tweets with text mining techniques to find out how the
tweets spread over the Twitter network.

With such analysis, businesses looking to expand, or people considering
starting a new business, can get a fair idea of customers’ preferences in an
area and draw conclusions about the optimum location in which to start or
expand. Based on these conclusions, the analysis results of multiple locations
can be compared to identify the key features that interest people. Businesses
can use such findings to improve their strategies and grow economically.


With geotagged Twitter data, this analysis can answer research questions such
as what the major areas of interest in a given location are, or which features
are abundant or scarce there. This project uses numerous geographic
information system (GIS) tools and concepts, such as ArcGIS Map, ArcGIS Pro,
spatial analysis and spatial clustering, and is thus expected to widen the
applications of GIS. Some tweets are not explicitly geotagged. In such cases,
the Twitter Search API offers a geocode parameter to which users can pass a
specific latitude, longitude and radius to cover areas of interest at a
granular level. In this way, even tweets that were not geotagged can be
included in the data set.
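
As a minimal sketch of how such a query might be assembled, the Python snippet
below formats a "latitude,longitude,radius" value and encodes it with a search
keyword. The parameter names q, geocode and count follow the v1.1 standard
search API and should be verified against the current Twitter documentation.
The helper names, the example keyword and the approximate Rochester
coordinates are assumptions made for illustration only.

```python
from urllib.parse import urlencode

# Approximate centre of Rochester, NY (illustrative values, not project data).
ROCHESTER_LAT = 43.1566
ROCHESTER_LON = -77.6088


def build_geocode(lat: float, lon: float, radius: float, unit: str = "mi") -> str:
    """Format a 'latitude,longitude,radius' string with a mi or km suffix."""
    if unit not in ("mi", "km"):
        raise ValueError("unit must be 'mi' or 'km'")
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("latitude or longitude out of range")
    if radius <= 0:
        raise ValueError("radius must be positive")
    return f"{lat:.4f},{lon:.4f},{radius:g}{unit}"


def build_search_query(keyword: str, lat: float, lon: float,
                       radius: float, unit: str = "mi", count: int = 100) -> str:
    """Return a URL-encoded query string (hypothetical helper, no network call)."""
    params = {
        "q": keyword,
        "geocode": build_geocode(lat, lon, radius, unit),
        "count": count,
    }
    return urlencode(params)


if __name__ == "__main__":
    # Tweets mentioning "pizza" within 10 miles of the assumed city centre.
    print(build_search_query("pizza", ROCHESTER_LAT, ROCHESTER_LON, 10))
```

A smaller radius around a single venue would narrow the query to one candidate
hotspot, at the cost of fewer matching tweets.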




2.3    Project Goals

The main goal of this project is to find a correlation between tweet clusters
and significant cultural hotspots by analyzing geotagged Twitter data. The
focus will be on analyzing Twitter data with data mining techniques and GIS
tools to find tweet clusters, and on comparing those clusters with the
significant cultural hotspots of the area to draw conclusions about the trend
of tweets from a particular location.







3     Plan


The first step of this capstone project will be to collect geotagged Twitter
data. The project requires Twitter data from New York State, and the data
should consist of tweets related to food, entertainment, work and study.
Researchers can obtain Twitter data in several ways: accessing an existing
Twitter dataset, retrieving tweets from the Twitter API, or purchasing data
from Twitter. The number of tweets needed to build the data set will depend on
how many of the tweets received meet the criteria for this project, such as
whether a tweet is geotagged and what its major keywords are. Ideally, a data
set of 10,000 to 20,000 tweets will be used for this project.
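
The selection criteria described above could be sketched as follows. The
snippet assumes each tweet is a dictionary with a "text" field and a
"coordinates" field holding a GeoJSON-style point, which mirrors the v1.1
tweet object but should be checked against the data actually collected. The
keyword list is hypothetical.

```python
import re

# Hypothetical keywords for the four themes named in the plan.
TOPIC_KEYWORDS = {
    "food", "restaurant", "pizza", "concert", "movie",
    "work", "office", "study", "exam",
}


def is_geotagged(tweet: dict) -> bool:
    """True when the record carries an explicit [longitude, latitude] point."""
    point = tweet.get("coordinates") or {}
    return len(point.get("coordinates") or []) == 2


def mentions_topic(tweet: dict, keywords: set) -> bool:
    """True when any keyword appears as a word (or hashtag) in the tweet text."""
    words = set(re.findall(r"[a-z]+", tweet.get("text", "").lower()))
    return not words.isdisjoint(keywords)


def select_tweets(tweets: list, keywords: set = TOPIC_KEYWORDS) -> list:
    """Keep only tweets that are geotagged and mention a topic keyword."""
    return [t for t in tweets if is_geotagged(t) and mentions_topic(t, keywords)]
```

Counting how many tweets survive this filter gives an early estimate of how
many raw tweets must be collected to reach the target data set size.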

The second step will be to analyze and classify the Twitter data. Tweets often
contain redundancies because many users retweet a particular tweet, and a
retweet or reply can include the text of the original tweet. The data
therefore has to be cleaned so that all the tweets in the data set are unique.
Next, since the data set will be very broad, the data must be classified with
a tool such as Weka and a suitable algorithm. Once the tweets are classified,
each category will contain tweets with particular keywords.
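
The cleaning and classification step might look like the sketch below. It
assumes retweets carry the conventional "RT @user:" prefix, and it uses a
simple keyword lookup purely as a stand-in for a trained classifier in a tool
such as Weka. The category keywords are hypothetical.

```python
import re

RETWEET_PREFIX = re.compile(r"^RT @\w+:\s*")

# Hypothetical keyword sets; a trained classifier would replace this lookup.
CATEGORY_KEYWORDS = {
    "food": {"pizza", "restaurant", "dinner", "lunch"},
    "entertainment": {"concert", "movie", "game", "theatre"},
    "work": {"office", "meeting", "job", "deadline"},
    "study": {"exam", "library", "lecture", "homework"},
}


def normalize(text: str) -> str:
    """Strip a leading retweet marker, collapse whitespace, lowercase."""
    text = RETWEET_PREFIX.sub("", text.strip())
    return re.sub(r"\s+", " ", text).lower()


def deduplicate(texts: list) -> list:
    """Keep the first occurrence of each tweet, treating retweets as copies."""
    seen = set()
    unique = []
    for text in texts:
        key = normalize(text)
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


def classify(text: str) -> str:
    """Return the category with the most keyword hits, or 'other' if none."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {cat: len(words & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"
```

Tweets labelled "other" fall outside the four themes and could be dropped
before the spatial analysis.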

The third step of this project will be to perform spatial analysis and spatial
clustering on the classified data using GIS tools such as ArcGIS Pro with R.
The tweet clusters will be obtained with a suitable and efficient clustering
algorithm. The clusters will represent the activity in a particular location:
a higher cluster density will mean that more users tweeted from that location.
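
To illustrate the clustering idea outside ArcGIS Pro, the sketch below runs a
plain k-means on (latitude, longitude) pairs and counts the tweets in each
cluster. It is only one possible choice of algorithm, not the project's
confirmed method. It treats coordinates as planar, which is a rough
approximation acceptable only at city scale, and the function names and
parameters are assumptions.

```python
import random
from collections import Counter
from math import dist


def kmeans(points: list, k: int, iterations: int = 50, seed: int = 0):
    """Cluster 2-D points; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append((sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster in place
        if updated == centroids:
            break  # converged: assignments no longer move the centroids
        centroids = updated
    return centroids, labels


def cluster_sizes(labels: list) -> dict:
    """Number of tweets per cluster, a simple proxy for cluster density."""
    return dict(Counter(labels))
```

A density-based algorithm may suit tweet data better when the number of
clusters is not known in advance; k-means is used here only because it is easy
to follow.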

The fourth step will be to use GIS tools such as ArcGIS Map to analyze and
find the cultural hotspots in the state of New York. ArcGIS Pro has a Mapping
Clusters toolset with tools such as Hot Spot Analysis and Cluster and Outlier
Analysis for evaluating the characteristics of the input data. Cultural
hotspots can be places such as universities, workplaces, stadiums, concert
venues and theatres. A map of all these hotspots will be created.
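
The Hot Spot Analysis tool is documented as computing the Getis-Ord Gi*
statistic. The sketch below shows that statistic in a simplified form, with
binary neighbor weights over a list of cell counts. It is an illustrative
approximation of the idea, not a reproduction of the ArcGIS implementation,
and the neighbor structure passed in is an assumption the caller must supply.

```python
from math import sqrt


def getis_ord_gi_star(values: list, neighbors: list) -> list:
    """Gi* z-score per cell, using binary weights (1 for self and neighbors)."""
    n = len(values)
    mean = sum(values) / n
    std = sqrt(sum(v * v for v in values) / n - mean ** 2)
    scores = []
    for i in range(n):
        cells = set(neighbors[i]) | {i}  # Gi* includes the cell itself
        w_sum = len(cells)               # sum of binary weights
        local_sum = sum(values[j] for j in cells)
        # With binary weights the sum of squared weights equals w_sum.
        denom = std * sqrt((n * w_sum - w_sum ** 2) / (n - 1))
        scores.append((local_sum - mean * w_sum) / denom if denom else 0.0)
    return scores


def line_neighbors(n: int) -> list:
    """Neighbors on a 1-D strip of cells (toy layout for demonstration)."""
    return [[j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)]
```

Cells with a large positive score have high counts surrounded by high counts,
which is the pattern a cultural hotspot would be expected to produce.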


Finally, the location-based tweet clusters will be overlaid on the map of New
York State that was created to represent the cultural hotspots. This will
reveal any correlation between the denser clusters and the locations
considered to be hotspots. From this comparison and visual correlation,
conclusions can be drawn regarding people’s interests and their corresponding
geographic locations.
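
A numerical complement to the visual overlay could match each cluster centroid
to its nearest hotspot by great-circle distance, as sketched below. The
hotspot names and coordinates in the usage example are made-up placeholders,
and the 1 km matching threshold is an assumption that would need tuning.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(a: tuple, b: tuple) -> float:
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def match_clusters(centroids: list, hotspots: dict, max_km: float = 1.0) -> list:
    """For each centroid, return (nearest hotspot name or None, distance in km)."""
    matches = []
    for centroid in centroids:
        name, location = min(hotspots.items(),
                             key=lambda item: haversine_km(centroid, item[1]))
        distance = haversine_km(centroid, location)
        matches.append((name if distance <= max_km else None, distance))
    return matches


if __name__ == "__main__":
    # Placeholder hotspots and centroids, for illustration only.
    hotspots = {"Campus A": (43.0845, -77.6745), "Venue B": (43.1300, -77.6260)}
    centroids = [(43.0850, -77.6750), (43.3000, -77.9000)]
    for name, km in match_clusters(centroids, hotspots):
        print(name, round(km, 2))
```

The share of dense clusters that match a hotspot gives a simple measure of how
strongly the two layers coincide.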















Ahmed, W. (2017). Using Twitter as a data source: An overview of social media
research tools (updated for 2017) [Blog post]. Retrieved from


Allen, C., Tsou, M., Aslam, A., Nagel, A., & Gawron, J. (2016). Applying GIS
and machine learning methods to Twitter data for multiscale surveillance of
influenza. 1(7).


Gokulakrishnan, B., et al. (2012). Opinion mining and sentiment analysis on
Twitter data stream. IEEE Xplore. Retrieved


MacEachren, A. M., et al. (2011). SensePlace2: GeoTwitter analytics support
for situational awareness. In IEEE Xplore. Retrieved



Mapping and
Geographic Information System (GIS): What is GIS? (2017) In Research Diaries,
University of Wisconsin-Madison Libraries. Retrieved from


Oku, K., & Hattori, F. (2015). Mapping geotagged tweets to tourist spots
considering activity region of spot. Retrieved from file:///C:/Users/rajashruthi/Downloads/9783662472262-c2.pdf


Rochester, New
York. (2017). Retrieved from,_New_York


Song, Z., & Xia, J. (2016). Spatial and temporal sentiment analysis of Twitter
data. In Capineri, C., Haklay, M., Huang, H., Antoniou, V., Kettunen, J.,
Ostermann, F., & Purves, R. (Eds.), European Handbook of Crowdsourced
Geographic Information, pp. 205–221. London: Ubiquity Press. DOI: License: CC-BY 4.0.


Zhao, Y. (2013). Analyzing Twitter data with text mining and social network
analysis. In Australian Data Mining Conference. Retrieved