##########################################################################################################
##########################################################################################################
# Course: Using Digital Trace Data in the Social Sciences
# Instructor: Andreas Jungherr, U Konstanz
# July 18 - July 22, 2016, U Konstanz
# Session 7: Extracting Data for Typical Analyses
##########################################################################################################
##########################################################################################################

# For a detailed tutorial on how to collect and work with Twitter data in the social sciences, see
# Jürgens & Jungherr (2016a), pp. 42-79.
# The script examples below illustrate the workings of the scripts provided in Jürgens & Jungherr (2016b).

### Further Reading:
# Pascal Jürgens and Andreas Jungherr. 2016a. A Tutorial for Using Twitter Data in the Social Sciences:
# Data Collection, Preparation, and Analysis. Social Science Research Network (SSRN).
# doi: 10.2139/ssrn.2710146
# Pascal Jürgens and Andreas Jungherr. 2016b. twitterresearch [Computer software].
# Available at https://github.com/trifle/twitterresearch
##########################################################################################################
##########################################################################################################

# In this session, we focus on how to extract summary statistics from our database, enabling a series
# of standard analytical approaches to Twitter data.

# First, let's point the command line to your working directory for this project:

cd "/Users/(...)/twitterresearch"

# Now, let's get some data. If you are participating in the course directly, you will be provided with
# a data file allowing for a shared analysis. If you are following this course without physically
# attending, have a look at Jürgens & Jungherr (2016a), pp. 43f.
# After saving the data file in your working directory under the name "tweets.db", we are ready to
# load the data into our database, as discussed in our last session.

ipython

run database

##########################################################################################################
### Now, let's count some entities!

Tweet.select().count()
User.select().count()

# OK, now let's export a ranked list of the accounts most often mentioned in the tweets in our
# database. For this, we need a function defined in the file "examples.py":

import examples

examples.export_mention_totals()

# Check your working directory. You'll find a new file there called "mention_totals.csv". In the file
# you find a list of the 50 accounts most often mentioned in the tweets collected in the database.

# Or maybe you are interested in the most often retweeted accounts:

examples.export_retweet_totals()

# After focusing on users, let's have a look at dominating objects:

# Hashtags:
examples.export_hashtag_totals()

# Retweets:
examples.export_retweet_text()

# URLs:
examples.export_url_totals()

##########################################################################################################
### How does this look over time?

### Time Series

# While counting entities alone can already provide you with interesting research topics (see the
# tutorial for a more detailed discussion), examining the development of entities over time provides
# even richer source material.

# Here, we show you how to extract data documenting temporal trends in the appearance of specific
# Twitter entities.

# Let's start by exporting the daily message count during the week's worth of messages in our example
# data set:

examples.export_total_counts()

# Now you should find a new .csv file in your working directory listing the total message count for
# each day covered in our data set.
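# Under the hood, exports like examples.export_mention_totals() boil down to counting entities and
# writing a ranked CSV. The following is a minimal sketch of that idea, not the actual examples.py
# implementation; the tweet dicts and field names are made up for illustration (in the real scripts,
# texts and mentions come from the database behind tweets.db):

```python
import csv
import io
from collections import Counter

# Hypothetical sample tweets; in the real scripts these come from the database.
tweets = [
    {"text": "Hello @alice", "mentions": ["alice"]},
    {"text": "RT @bob: hi", "mentions": ["bob"]},
    {"text": "Thanks @alice!", "mentions": ["alice"]},
]

# Count how often each account is mentioned.
mention_counts = Counter(m for t in tweets for m in t["mentions"])

# Write a ranked CSV (here to an in-memory buffer), most-mentioned accounts first.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["user", "mentions"])
for user, n in mention_counts.most_common(50):
    writer.writerow([user, n])

print(mention_counts.most_common(2))  # [('alice', 2), ('bob', 1)]
```

# The same counting pattern carries over to hashtags, retweeted accounts, and URLs; only the
# extracted entity changes.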
# To work with this and other output files, load them into R:

# Leave Python for now and start RStudio.

# If you haven't done so already, now it is time to install a small selection of necessary packages:

install.packages(c("ggplot2", "scales"))

# After installing them in R, make sure you load them into your workspace:

library(ggplot2)
library(scales)

# As with Python, point R to your working directory:

setwd("/Users/(...)/twitterresearch")

# Now, we have to load the data exported before into R objects that allow analysis and plotting.

# As a first step, load the complete file into a data frame:

message_counts_df <- data.frame(read.csv("total_counts.csv"))

# Now extract the column containing date information in a date format,...

dates <- as.POSIXct(message_counts_df[, 1])

# ...load the column with daily message counts...

all_messages <- message_counts_df[, 2]

# ...and combine both in a new data frame ready for plotting:

plot_all_messages_df <- data.frame(dates, all_messages)

# Now we are ready to plot the data.

# For plotting we use the R package ggplot2 by Hadley Wickham (2016). This package supports you
# tremendously in the creation of simple as well as very complicated plots. Its notation might take
# a little getting used to, but it is definitely worth your time if you aim to keep working with data.
# For an easy-to-follow introduction with a lot of examples, see Chang (2012).

# Now we load the data frame into ggplot and specify our preferred layout.
# For the detailed workings of this command, see Wickham (2016) or Chang (2012).
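# If you prefer to stay in Python, the same loading step can be sketched with the standard library.
# The column names and sample values below are assumptions for illustration, not the actual format
# written by examples.export_total_counts():

```python
import csv
import io
from datetime import datetime

# A small in-memory stand-in for total_counts.csv; the real file is produced by
# examples.export_total_counts(), and the header names here are assumptions.
raw = "date,count\n2016-07-18,120\n2016-07-19,95\n"

# Parse dates and counts, analogous to the R data-frame steps above.
rows = list(csv.DictReader(io.StringIO(raw)))
dates = [datetime.strptime(r["date"], "%Y-%m-%d") for r in rows]
counts = [int(r["count"]) for r in rows]

print(counts)  # [120, 95]
```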
plot_all_messages <- ggplot(plot_all_messages_df, aes(x = dates, y = all_messages)) +
  geom_line(stat = "identity") +
  geom_point(size = 2) +
  theme_bw() +
  xlab("") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ylab("Message Volume, Daily")

plot_all_messages

ggsave("Message Volume Daily.pdf", width = 170, height = 90, units = "mm", dpi = 300)

dev.off()

# Make sure to check out the tutorial and our scripts in greater detail to see what types of
# time series exports we have already implemented. If these sample commands do not cover your
# interests, it's time to get to work and adapt our code accordingly.

# Winston Chang (2012). R Graphics Cookbook. O'Reilly Media, Inc.
# Hadley Wickham (2016). ggplot2: Elegant Graphics for Data Analysis. 2nd ed. Springer.

##########################################################################################################
### How about networks?

# Before we completely run out of time, a quick note on networks: Twitter data offer a very promising
# basis for network analyses, as @-mentions and retweets form a very intuitive network structure.
# In our sample scripts we provide you with a series of easy-to-use commands extracting data from
# the database so that they can serve as a basis for standard network analyses.
# If you are interested in analyses like this, see Jürgens & Jungherr (2016a), pp. 70-77.

##########################################################################################################
##########################################################################################################
# In this session, we only had time for the most cursory of glances at exporting data and using them
# in analyses. But not to worry: in the tutorial we cover this step of the research process with
# digital trace data in much greater detail. There, we discuss potential research designs, list
# exemplary studies illustrating potential analytical approaches, and provide detailed code examples
# in Python and R.
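# To make the network idea concrete: an @-mention network boils down to a weighted, directed edge
# list of who mentions whom. The following is a minimal sketch with made-up tweets, not the actual
# twitterresearch code; author and mention fields are assumptions for illustration:

```python
from collections import Counter

# Hypothetical tweets with an author and mentioned accounts; in the real
# scripts, authors and mentions are read from the database.
tweets = [
    {"author": "carol", "mentions": ["alice"]},
    {"author": "carol", "mentions": ["alice", "bob"]},
    {"author": "dave", "mentions": ["alice"]},
]

# Build a weighted directed edge list: (author, mentioned account) -> count.
edges = Counter((t["author"], m) for t in tweets for m in t["mentions"])

for (source, target), weight in sorted(edges.items()):
    print(source, "->", target, weight)
# carol -> alice 2
# carol -> bob 1
# dave -> alice 1
```

# An edge list in this form can be written to CSV and loaded into standard network-analysis tools.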
# So make sure you check out Jürgens & Jungherr (2016a), pp. 42-79, and go through the examples
# given there step by step.

# For more examples and advice on how to prepare and analyze Twitter data, please see the tutorial:
# Pascal Jürgens and Andreas Jungherr. 2016a. A Tutorial for Using Twitter Data in the Social Sciences:
# Data Collection, Preparation, and Analysis. Social Science Research Network (SSRN).
# doi: 10.2139/ssrn.2710146

# If you run into trouble with, or find any bugs in, the code provided in Jürgens & Jungherr (2016b),
# please report an issue in our GitHub repository: https://github.com/trifle/twitterresearch
##########################################################################################################
##########################################################################################################