python - Using multiple web pages in a web scraper


I've been working on Python code to extract the links to social media accounts from government websites, in order to research how easily municipalities can be contacted. I've managed to adapt the code to work in 2.7, and it prints the links to Facebook, Twitter, LinkedIn and Google+ that are present on a given input website. The issue I'm experiencing is that I'm not looking for links on just one web page, but on a list of about 200 websites, which I have in an Excel file. I have no experience with importing these sorts of lists into Python, so I was wondering if anyone could take a look at the code and suggest a proper way to set all these web pages as the base_url, if that is possible:

import cookielib
import mechanize

base_url = "http://www.amsterdam.nl"

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.set_handle_redirect(True)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent',
                  'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
page = br.open(base_url, timeout=10)

links = {}
for link in br.links():
    if link.url.find('facebook') >= 0 or link.url.find('twitter') >= 0 or link.url.find('linkedin') >= 0 or link.url.find('plus.google') >= 0:
        links[link.url] = {'count': 1, 'texts': [link.text]}

# printing
for link, data in links.iteritems():
    print "%s - %s - %s - %d" % (base_url, link, ",".join(data['texts']), data['count'])

You mentioned that you have an Excel file with the list of websites, right? In that case you can export the Excel file to a CSV file and read the values from it in your Python code.

Here's more information regarding that.

And here's how to work directly with Excel files.
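If you prefer to skip the CSV step, a minimal sketch using the xlrd library could look like this (the filename urls.xls and the single-column layout are assumptions):

import xlrd

book = xlrd.open_workbook('urls.xls')  # hypothetical filename
sheet = book.sheet_by_index(0)         # first worksheet
# read the first column of every row into a list of URLs
links = [sheet.cell_value(row, 0) for row in range(sheet.nrows)]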

For the CSV route, you can do something along these lines:

import csv

links = []

with open('urls.csv', 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    # simple example: a single column of URLs is present,
    # so take the first field of every row
    links = [row[0] for row in csv_reader]

Now links is a list of URLs. You can loop over this list inside a function that fetches each page and scrapes the data:

def extract_social_links(links):
    for base_url in links:
        br = mechanize.Browser()
        cj = cookielib.LWPCookieJar()
        br.set_cookiejar(cj)
        br.set_handle_robots(False)
        br.set_handle_equiv(False)
        br.set_handle_redirect(True)
        br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
        br.addheaders = [('User-agent',
                          'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
        page = br.open(base_url, timeout=10)

        # use a name that does not shadow the parameter
        social_links = {}
        for link in br.links():
            if link.url.find('facebook') >= 0 or link.url.find('twitter') >= 0 or link.url.find('linkedin') >= 0 or link.url.find('plus.google') >= 0:
                social_links[link.url] = {'count': 1, 'texts': [link.text]}

        # printing
        for link, data in social_links.iteritems():
            print "%s - %s - %s - %d" % (base_url, link, ",".join(data['texts']), data['count'])
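Putting the two pieces together, a sketch of the call site (urls.csv is the assumed filename from above):

import csv

with open('urls.csv', 'r') as csv_file:
    links = [row[0] for row in csv.reader(csv_file)]

extract_social_links(links)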

As an aside, you should split up those if conditions to make them more readable, as shown below.
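For instance, a minimal sketch of one way to do that with the any() builtin (the SOCIAL_SITES name is just illustrative):

# tuple of substrings that identify social media URLs
SOCIAL_SITES = ('facebook', 'twitter', 'linkedin', 'plus.google')

for link in br.links():
    # any() replaces the long chain of "or" conditions
    if any(site in link.url for site in SOCIAL_SITES):
        social_links[link.url] = {'count': 1, 'texts': [link.text]}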

