Use
You can scrape websites and blogs and store their content in the Social Intelligence tables using a Python script. You can then use this stored information to analyze sentiment and draw further conclusions.
System Details
These details need to be added in the script:
- Server
- Port
- Username
- Password
- Schema
- Client
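These details correspond to the placeholder variables in the section marked "To be Filled before executing the script" at the top of the script. Before editing, that section looks like this:

# HANA System details
server = ''
port = 0  # enter the SQL port of the SAP HANA instance (integer)
username_hana = ''
password_hana = ''
schema = ''
client = ''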
Prerequisites
- You have installed Python 2.7.
- You have installed the following modules using pip (the urllib2, httplib, and urlparse modules used by the script ship with the Python 2.7 standard library and do not need a separate installation):
- pyhdb
- validators
- google (provides the search function that the script uses to query Google Search)
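To confirm that all modules used by the script are available to your Python 2.7 installation, you can run a quick import check such as the following sketch:

# Quick check that every module imported by the scraping script is available
import urllib2, httplib, re, sys, random, datetime, string, os
from urlparse import urlparse
import pyhdb
import validators
from google import search
print 'All required modules are available.'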
How the Python Script Works
When the script is run, you are asked to enter a search term. Based on the entered search term, the script fetches the top three results from Google Search using the google module and stores the result links in the Top_Results.txt file. These top three sites are then crawled, and the data scraped from them is stored in the SOCIALDATA table. The links found on these sites are also crawled, scraped, and stored in the SOCIALDATA table.
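As a minimal sketch of the first step, this is how the script calls the google module's search function; the search term shown is only an example:

from google import search

# Fetch the top three Google Search results for an illustrative search term
for url in search('example term', num=3, start=1, stop=3):
    print url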
Steps
1. Copy the script below into a file at your desired location.
import urllib2
import httplib
import re
import sys
import pyhdb
import random
import datetime
import string
from google import search
from urlparse import urlparse
import validators
import os

################### To be Filled before executing the script #########################
# HANA System details
server = ''
port = 0  # enter the SQL port of the SAP HANA instance (integer)
username_hana = ''
password_hana = ''
schema = ''
client = ''
######################################################################################

# Function to fetch the top results from Google Search for the passed search term.
def top_links(searchterm):
    top_res = open('Top_Results.txt', 'w')
    # The number of results fetched can be changed by changing the parameters in the call below.
    for url in search(searchterm, num=3, start=1, stop=3):
        print url
        top_res.write(url)
        top_res.write('\n')
    top_res.close()

# Function to scrape the content of a specific website. This is achieved using regex functions.
def scrape(resp, searchterm):
    # Check if the link is a valid one or not.
    pattern = re.compile(r'^(?:http|ftp)s?://')
    mat = re.match(pattern, resp)
    if mat is None:
        print 'Nothing there'
    else:
        try:
            response = urllib2.urlopen(resp)
            # Write the response body into a file called crawled.txt
            html = response.read()
            file1 = open('crawled.txt', 'w')
            file1.write(html)
            file1.close()
            f1 = open('crawled.txt', 'r').read()
            f2 = open('regex.txt', 'w')
            # Since the main content of any website is stored in the body of the html, we extract and store only that part of it.
            res1 = re.search('(<body.*</body>)', f1, flags=re.DOTALL)
            if res1:
                print 'Found'
                # Further, the unnecessary tags are removed, like the script and style tags.
                scripts = re.sub(r'(<script type="text/javascript".*?</script>)|(<script type=\'text/javascript\'.*?</script>)|(<script>.*?</script>)', '', res1.group(0), flags=re.DOTALL)
                next = re.sub(r'|(<style.*</style>)|(.*//.*)|(^/*.*?\*\))', '', scripts, flags=re.DOTALL)
                n1 = re.sub(r'<style.*</style>', '', next, flags=re.DOTALL)
                f2.write(n1)
                f2.close()
                f3 = open('regex.txt', 'r').read()
                # Parse through the file removing html tags and other unnecessary characters and store the result in a file called Scraped.txt
                f4 = open('Scraped.txt', 'w')
                res3 = re.sub(r'<.*?>|</.*?>', '', f3)
                spaces = re.sub(r'\s\s+', '\n', res3, flags=re.DOTALL)
                f4.write(spaces)
                f4.close()
                # The final scraped content is stored in a file called 'Scraped_Final.txt'
                lines = [line.rstrip('\n') for line in open('Scraped.txt')]
                f5 = open('Scraped_Final.txt', 'w')
                for i in lines:
                    if len(i) > 10:
                        f5.write(i)
                f5.close()
                file_scraped = open('Scraped_Final.txt', 'r').read()
                print 'Scraped'
                # This content is then inserted into the database
                insert_into_db(file_scraped, searchterm)
            else:
                print 'No match'
        # Error handling
        except urllib2.HTTPError as e:
            print e.code, ' Skipping..'
            # print e.read()
        except urllib2.URLError as e:
            print e.reason
        except httplib.HTTPException:
            print 'HTTPException, Skipping..'

# Function to extract the internal links in each website.
def get_links(base_url, scheme):
    print 'Base url', base_url
    f1 = open('crawled.txt', 'r').read()
    # All the anchor tags and link tags are found and the links are extracted from them
    links = re.findall('(<a.*?>)', f1, flags=re.DOTALL)
    links2 = re.findall('(<link.*?>)', f1, flags=re.DOTALL)
    li = open('li1.txt', 'w')
    tmp_list = []
    for j in links:
        if j not in tmp_list:
            tmp_list.append(j)
            li.write(j)
            li.write('\n')
    for k in links2:
        if k not in tmp_list:
            tmp_list.append(k)
            li.write(k)
            li.write('\n')
    li.close()
    f5 = open('li1.txt', 'r').read()
    links1 = re.findall('(href=\'.*?\')', f5, flags=re.DOTALL)
    links5 = re.findall('(href=".*?")', f5, flags=re.DOTALL)
    li2 = open('li2.txt', 'w')
    list1 = []
    list2 = []
    for i in links1:
        if i not in list1:
            list1.append(i)
            reg1 = re.search('\'.*\'', i)
            if reg1:
                reg2 = re.sub(r'\'', '', reg1.group(0))
                list2.append(reg2)
                li2.write(reg2)
                li2.write('\n')
    for m in links5:
        if m not in list1:
            list1.append(m)
            reg1 = re.search('".*"', m)
            if reg1:
                reg2 = re.sub(r'"', '', reg1.group(0))
                list2.append(reg2)
                li2.write(reg2)
                li2.write('\n')
    li2.close()
    print 'Opening Links'
    li4 = open('Links.txt', 'w')
    list3 = []
    # Handle relative URLs as well by adding the base url of the website.
    with open('li2.txt', 'r') as f12:
        for line in f12:
            if line not in list3:
                rel_urls = re.sub(r'^/\.', '', line, flags=re.DOTALL)
                if (re.match(r'^#', line) is None) or (re.match(r'^/\.', line) is None):
                    rel_urls = re.sub(r'^//', scheme + '://', rel_urls, flags=re.DOTALL)
                    rel_urls = re.sub(r'^(/)', base_url + '/', rel_urls, flags=re.DOTALL)
                list3.append(rel_urls)
                li4.write(rel_urls)
    li4.close()
    final_list = []
    li5 = open('Links_Final.txt', 'w')
    # Check if the formed URL is valid using the python module 'validators'.
    with open('Links.txt', 'r') as f:
        for line in f:
            if line not in final_list:
                if validators.url(line) is True:
                    final_list.append(line)
                    li5.write(line)
                else:
                    print 'Removing invalid urls..'
    li5.close()
    print 'Links extracted'
    # Return the list of links.
    return final_list

# Function to get the current date time and format it.
def getCreatedat():
    current_time = str(datetime.datetime.now())
    d = current_time.split()
    yymmdd = d[0].split("-")
    hhmmss = d[1].split(".")[0].split(":")
    createdat = yymmdd[0] + yymmdd[1] + yymmdd[2] + hhmmss[0] + hhmmss[1] + hhmmss[2]
    return createdat

# Function to get the UTC date time and format it.
def get_creationdatetime_utc():
    current_time = str(datetime.datetime.utcnow())
    d = current_time.split()
    yymmdd = d[0].split("-")
    hhmmss = d[1].split(".")[0].split(":")
    creationdatetime_utc = yymmdd[0] + yymmdd[1] + yymmdd[2] + hhmmss[0] + hhmmss[1] + hhmmss[2]
    return creationdatetime_utc

# Function to insert the scraped content into the FND tables.
# Ensure that you have WRITE privileges in the HANA system.
def insert_into_db(sclpsttxt, searchterm):
    socialmediachannel = 'CR'
    dummy_createdat = '20151204'
    creationdatetime = str(datetime.datetime.now())
    creationdatetime_utc = get_creationdatetime_utc()
    # The connection to the system is made with the appropriate credentials
    connection = pyhdb.connect(host=server, port=port, user=username_hana, password=password_hana)
    cursor = connection.cursor()
    socialdatauuid = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(32))
    socialpost = ''.join(random.choice(string.digits) for _ in range(16))
    language = 'EN'
    createdbyuser = username_hana
    createdat = getCreatedat()
    sclpsttxt = sclpsttxt.decode('ascii', 'replace')
    sclpsttxt = sclpsttxt.replace("'", "\"")
    socialposttext = sclpsttxt
    creationusername = username_hana
    socialpostactionstatus = '3'
    # socialposttype = 'Blog'
    values = ("'" + client + "','" + socialdatauuid + "','" + socialpost + "'," +
              "'" + language + "','" + socialmediachannel + "','" + createdbyuser + "'," +
              "'" + creationdatetime + "','','','','','','','','" + socialpostactionstatus + "'," +
              "'','" + creationusername + "','','" + searchterm + "','" + createdat + "','" + socialposttext + "'," +
              "'" + creationdatetime_utc + "','','','',''")
    # The SQL query is formed by entering the necessary values.
    sql = 'Insert into ' + schema + '.SOCIALDATA values(' + values + ')'
    try:
        # Execute the sql query
        cursor.execute(sql)
        print 'Stored successfully\n\n'
    except Exception as e:
        print e
    # Commit and close the connection
    connection.commit()
    connection.close()

def main():
    print 'Enter the search term'
    searchterm = raw_input()
    # The top N results from google search are fetched for the specified searchterm.
    top_links(searchterm)
    with open('Top_Results.txt', 'r') as f:
        for line in f:
            print 'Content', line
            # The content of each of these links is scraped and stored in the DB
            scrape(line, searchterm)
            line_ch = line.rstrip()
            n = urlparse(line_ch)
            base_url = n.scheme + '://' + n.hostname
            scheme = n.scheme
            # Further, the links inside each of the top results are found and scraped similarly
            links = get_links(base_url, scheme)
            if not links:
                print 'No internal links found'
            else:
                for i in links:
                    pattern = re.compile(r'^(?:http|ftp)s?://')
                    mat = re.match(pattern, i)
                    if mat is not None:
                        print 'Link url', i
                        # The scrape function is called in order to scrape the internal links as well
                        scrape(i, searchterm)
    print 'Scraping done.'
    # Once the scraping and storing is done, the files created internally are deleted.
    # Only the file 'Top_Results.txt' persists, since the user can change it according to need.
    if os.path.isfile('li1.txt'):
        os.remove('li1.txt')
    if os.path.isfile('li2.txt'):
        os.remove('li2.txt')
    if os.path.isfile('Links.txt'):
        os.remove('Links.txt')
    if os.path.isfile('Links_Final.txt'):
        os.remove('Links_Final.txt')
    os.remove('crawled.txt')
    os.remove('regex.txt')
    os.remove('Scraped.txt')
    os.remove('Scraped_Final.txt')

if __name__ == '__main__':
    main()
2. Edit the script to enter your SAP HANA system details (server, SQL port, user name, password, schema, and client) in the section marked "To be Filled before executing the script". These values are used when the connection is opened in the function insert_into_db().
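For example, after editing, the section at the top of the script might look as follows; all values shown are purely illustrative placeholders and must be replaced with the details of your own system:

# HANA System details (illustrative values only)
server = 'myhanahost.example.com'  # host name of the SAP HANA system
port = 30015                       # SQL port of the instance
username_hana = 'MYUSER'
password_hana = 'MyPassword1'
schema = 'MYSCHEMA'
client = '100'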
3. Open a command prompt at that location.
4. Run the Python script as shown below:
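Assuming the script was saved as scrape_social.py (the file name is your choice), the call is:

python scrape_social.py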
5. Once the script has run, the scraped content is inserted into the SOCIALDATA table of the specified schema.
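To verify the inserts, you can run a quick row count against the table, for example with pyhdb; this is only a sketch and assumes the same system details that you entered in the script:

import pyhdb

# Fill in the same SAP HANA system details used in the scraping script
server = ''
port = 0
username_hana = ''
password_hana = ''
schema = ''

connection = pyhdb.connect(host=server, port=port, user=username_hana, password=password_hana)
cursor = connection.cursor()
# Count the rows inserted into the SOCIALDATA table
cursor.execute('SELECT COUNT(*) FROM ' + schema + '.SOCIALDATA')
print 'Rows in SOCIALDATA:', cursor.fetchone()[0]
connection.close()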
Note
- Based on your requirements, you can modify the number of results you want to receive. You can do this by changing the parameters of the search call in the top_links() function (see the example after these notes).
- If you want to scrape a custom list of websites, add these links to the Top_Results.txt file (one URL per line) and comment out the call to the function top_links() in main(), as shown below.
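As an illustration of both notes, the relevant parts of the script would change roughly as follows; the value 5 is only an example:

# In top_links(), change the parameters of the search call to fetch five results instead of three:
for url in search(searchterm, num=5, start=1, stop=5):
    print url
    top_res.write(url)
    top_res.write('\n')

# In main(), when scraping a custom list of sites maintained in Top_Results.txt,
# comment out the call that would otherwise overwrite the file with Google results:
# top_links(searchterm)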