Anal Kanti Roy successfully defended his master’s thesis, “Automation of Crawling Blogosphere based on Pattern Recognition,” on Thursday, November 21, 2019. In attendance were committee members Dr. Nitin Agarwal (chair), Dr. Elizabeth Pierce, and Dr. Serpil Tokdemir Yuce, as well as members of COSMOS.
Over the past two years, Roy has become very familiar with extracting online content. As COSMOS’ leading crawling expert, Roy has played a fundamental role in data extraction for various COSMOS research projects.
The motivation behind his work was not only the sheer volume of content published on blogs, but also the influence these blogs can have on the public. Blogs are not restricted to a specific character limit, nor are they moderated by a hosting platform, so authors can express themselves freely through blog posts. COSMOS has previously identified numerous highly influential blogs that are being used as instruments of influence operations to spread misinformation and propaganda. Blogs are also closely linked to social media, as social media platforms are often used to disseminate blog content.
Because blogs are dynamic and constantly evolving, computational analysis of blog sites is crucial for quickly identifying and studying online influence campaigns. For his master’s thesis, Roy worked on a crawler program able to extract and categorize data from any blog. Current crawler programs are limited by their blog-specific nature; none of them is generic. Using tools such as Python, Scrapy, XPath, regular expressions, and MySQL, Roy crawled ten different blogs, ranging from fewer than 20 posts to several thousand. The complex structure of scripts made it challenging to define patterns. Roy set up priority measures for the blog components, such as title, date, author, content, tags, categories, and comments. The HTML within a page is broken down into pieces of text, and each piece is analyzed against a list of patterns to calculate a score. The piece with the highest score, provided that score exceeds a certain threshold, is identified as the target blog data for a particular component and extracted as such.
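The scoring step described above can be illustrated with a minimal Python sketch. It is not Roy's implementation: it uses only the standard library rather than Scrapy and XPath, and the pattern list, weights, threshold value, and all names (`PATTERNS`, `THRESHOLD`, `split_html`, `score`, `extract`) are hypothetical choices made for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical patterns and weights per blog component (not taken from the thesis).
MONTHS = (r"(?:January|February|March|April|May|June|July|August|"
          r"September|October|November|December)")
PATTERNS = {
    "date": [
        (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), 2.0),
        (re.compile(r"\b" + MONTHS + r"\s+\d{1,2},\s+\d{4}\b"), 2.0),
        (re.compile(r"\b(?:posted|published)\b", re.I), 0.5),
    ],
    "author": [
        (re.compile(r"^\s*(?:by|author:)\s+\S+", re.I), 2.0),
    ],
}
THRESHOLD = 1.5  # assumed value; a score must exceed it to count


class TextPieces(HTMLParser):
    """Collects the non-empty text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.pieces = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.pieces.append(text)


def split_html(html):
    """Break a page's HTML down into pieces of text."""
    parser = TextPieces()
    parser.feed(html)
    parser.close()
    return parser.pieces


def score(text, component):
    """Sum the weights of every pattern for `component` that matches `text`."""
    return sum(weight for pattern, weight in PATTERNS[component]
               if pattern.search(text))


def extract(pieces, component, threshold=THRESHOLD):
    """Return the highest-scoring piece if its score exceeds the threshold, else None."""
    best = max(pieces, key=lambda t: score(t, component), default=None)
    if best is None or score(best, component) <= threshold:
        return None
    return best


SAMPLE_HTML = (
    "<article><h1>Why crawlers matter</h1>"
    "<span>Posted on November 21, 2019</span>"
    "<span>By Jane Doe</span>"
    "<p>Blogs change often.</p></article>"
)

if __name__ == "__main__":
    pieces = split_html(SAMPLE_HTML)
    for component in PATTERNS:
        print(component, "->", extract(pieces, component))
```

Because a piece is accepted only when its score exceeds the threshold, a page with no convincing match for a component yields nothing for that component rather than a wrong value.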
Roy’s paper on this crawling process, titled “Automating Blog Crawling Using Pattern Recognition,” will be presented at The Ninth International Conference on Social Media Technologies, Communication, and Informatics (SOTICS 2019), which will take place from November 24 to 28 in Spain.