Text Mining in Twitter with Spark and Scala
dc.contributor.author | Adam, Simitos | en |
dc.date.accessioned | 2017-05-09T07:15:24Z | |
dc.date.available | 2017-05-10T00:00:17Z | |
dc.date.issued | 2017-05-09 | |
dc.identifier.uri | https://repository.ihu.edu.gr//xmlui/handle/11544/15305 | |
dc.rights | Default License | |
dc.title | Text Mining in Twitter with Spark and Scala | en |
heal.abstract | This dissertation was written as a part of the MSc in “Mobile and Web Computing” at the International Hellenic University, Thessaloniki, Greece. Text Mining is a research area that tries to solve the document overabundance problem by using Data Mining, Machine Learning, Natural Language Processing, Information Retrieval, and Knowledge Management techniques. Text Mining’s main purpose is the automated categorization of documents into classes. People’s thoughts and opinions have always been studied by the sciences of sociology and history. The Social Media revolution has made expressing an opinion easy, simple, and quick. Thanks to Social Media, an Internet user can propagate their opinion and read other users’ opinions as well. As a result, the Internet is “flooded” by a vast volume of data that is difficult to manage. Social Media is one of the factors that contribute to the phenomenon called “Big Data” in computer science. The object of this master thesis is the collection and manipulation of social media users’ opinions about the political situation in Greece by using text mining methods. Specifically, the application developed crawls opinions about Greek parliament members from Twitter and categorizes them as positive, neutral, or negative. The statistics produced are indicative of each member’s popularity. | en |
heal.academicPublisher | IHU | en |
heal.academicPublisherID | ihu | en_US |
heal.access | free | en_US |
heal.advisorName | Papadopoulos, Apostolos | en |
heal.classification | Information Technology | en |
heal.committeeMemberName | Berberidis, Christos | en |
heal.committeeMemberName | Ampatzoglou, Apostolos | en |
heal.committeeMemberName | Gatzianas, Marios | en |
heal.creatorID.dhareID | a.simitos@ihu.edu.gr | |
heal.fileFormat | en_US | |
heal.keywordURI.LCSH | Data mining | |
heal.keywordURI.LCSH | Data mining--Computer programs | |
heal.keywordURI.LCSH | Data mining--Data processing | |
heal.keywordURI.LCSH | Data mining--Social aspects | |
heal.keywordURI.LCSH | Data mining--Statistical methods | |
heal.keywordURI.LCSH | Social media | |
heal.keywordURI.LCSH | Social media--Political aspects | |
heal.keywordURI.LCSH | Twitter--Political aspects--Greece | |
heal.keywordURI.LCSH | Twitter--Social aspects | |
heal.keywordURI.LCSH | Spark (Electronic resource : Apache Software Foundation) | |
heal.keywordURI.LCSH | Scala (Computer program language) | |
heal.keywordURI.LCSH | SPARK (Computer program language) | |
heal.keywordURI.LCSH | Information retrieval | |
heal.keywordURI.LCSH | Information retrieval--Data processing | |
heal.keywordURI.LCSH | Information retrieval--Technological innovations | |
heal.language | en | en_US |
heal.license | http://creativecommons.org/licenses/by-nc/4.0 | en_US |
heal.numberOfPages | 78 | en_US |
heal.publicationDate | 2016-12-23 | |
heal.recordProvider | School of Science and Technology, MSc in Mobile and Web Computing | en_US |
heal.secondaryTitle | Twitter as Political Barometer in Greece | en |
heal.spatialCoverage | Greece | en |
heal.tableOfContents | Abstract; Contents; List of Pictures; List of Tables; 1 Introduction; 2 Big Data; 2.1 What is Big Data; 2.2 Big Data Challenges; 2.3 Managing Big Data; 2.3.1 Spark; Spark stack; Spark Core; Spark SQL; Spark Streaming; MLlib; GraphX; Cluster Managers; Spark Runtime Architecture; The Driver; Executors; Cluster Manager; 2.3.2 Scala; 3 Twitter; 3.1 Twitter Analytics; 3.2 Crawling Twitter Data; 3.2.1 Open Authentication; 3.2.2 Collecting search results; Collecting tweets using REST API; Collecting tweets using Streaming API; 3.3 Tweets Sentiment Analysis; 3.4 Twitter and Politics; 3.4.1 Twitter for political communication; 3.4.2 Twitter users as voters; 3.4.3 Twitter in Greek political reality; 4 Text Mining; 4.1 Text Retrieval Methods; 4.2 Finding Similar Documents; 4.3 Document Classification Analysis; 4.4 Text retrieval evaluation methods; 4.5 Latent Semantic Indexing; 5 The PolBar Application; 5.1 Tweets Collection; 5.1.1 Communicating with Twitter API; 5.1.2 Choosing the suitable search keyword; 5.1.3 Organizing keywords; 5.1.4 Crawling and preprocessing tweets; 5.2 Tweets Storage; 5.3 Tweets Analysis and Classification; 5.3.1 Creating the training dataset; Stopwords; 5.3.2 Classifiers evaluation; Logistic Regression; Naïve Bayes; Decision Tree; Random Forest; 5.4 Results Presentation; 5.5 Extra Experiments; 5.5.1 Experiment with different datasets types; 5.5.2 Experiment with different datasets size; 6 Conclusions; 7 Future Prospects; Bibliography; Appendix A: Instance of Data Table; Appendix B: Instance of Month Statistics Table; Appendix C: Instance of Total Statistics Table | en |
heal.type | masterThesis | en_US |
Files
Original bundle
- Name: Simitos_Master_Thesis.pdf
- Size: 2.91 MB
- Format: Adobe Portable Document Format
License bundle
- Name: license.txt
- Size: 2.58 KB
- Format: Item-specific license agreed upon to submission