python - Using multiple web pages in a web scraper
I've been working on Python code to find links to the social media accounts of government websites, to research how easily municipalities can be contacted. I've managed to adapt the code to work in 2.7, and it prints the links to Facebook, Twitter, LinkedIn and Google+ that are present on a given input website. The issue I'm experiencing is that I'm not looking for links on just one web page, but on a list of about 200 websites I have in an Excel file. I have no experience with importing these sorts of lists into Python, so I was wondering if someone could take a look at the code and suggest a proper way to set all of these web pages as the base_url, if possible:
import cookielib
import mechanize

base_url = "http://www.amsterdam.nl"

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.set_handle_redirect(True)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-Agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

page = br.open(base_url, timeout=10)

links = {}
for link in br.links():
    if link.url.find('facebook') >= 0 or link.url.find('twitter') >= 0 or link.url.find('linkedin') >= 0 or link.url.find('plus.google') >= 0:
        links[link.url] = {'count': 1, 'texts': [link.text]}

# printing
for link, data in links.iteritems():
    print "%s - %s - %s - %d" % (base_url, link, ",".join(data['texts']), data['count'])
You mentioned that you have an Excel file with the list of websites, right? You can export that Excel file to a CSV file, and then read the values from that file in your Python code.
The documentation for Python's csv module has more information regarding that. There are also libraries for working with Excel files directly.
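For instance, the xlrd library is a common choice for reading .xls files in Python 2. A small sketch (the filename 'urls.xls' and the assumption that the URLs sit in the first column of the first sheet are mine):

import xlrd

book = xlrd.open_workbook('urls.xls')
sheet = book.sheet_by_index(0)
# one URL per row, taken from the first column
links = [sheet.cell_value(row, 0) for row in range(sheet.nrows)]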
you can along lines :
import csv

links = []
with open('urls.csv', 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    # simple example, assuming a single column of URLs is present;
    # take the first cell of each row so links holds plain strings
    links = [row[0] for row in csv_reader]
Now links is a list of URLs. You can loop over that list inside a function that fetches each page and scrapes the data.
def extract_social_links(links):
    for base_url in links:
        br = mechanize.Browser()
        cj = cookielib.LWPCookieJar()
        br.set_cookiejar(cj)
        br.set_handle_robots(False)
        br.set_handle_equiv(False)
        br.set_handle_redirect(True)
        br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
        br.addheaders = [('User-Agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

        page = br.open(base_url, timeout=10)

        # use a name other than links so the parameter is not shadowed
        found = {}
        for link in br.links():
            if link.url.find('facebook') >= 0 or link.url.find('twitter') >= 0 or link.url.find('linkedin') >= 0 or link.url.find('plus.google') >= 0:
                found[link.url] = {'count': 1, 'texts': [link.text]}

        # print the collected links for this site
        for link, data in found.iteritems():
            print "%s - %s - %s - %d" % (base_url, link, ",".join(data['texts']), data['count'])
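One practical note: with ~200 sites, a single unreachable host will raise an exception and abort the whole run, so you may want to call the function one URL at a time and skip failures. A minimal driver sketch (the try/except is my addition, not part of the answer above):

for url in links:
    try:
        extract_social_links([url])
    except Exception as e:  # in practice you might narrow this to urllib2.URLError
        print "skipping %s: %s" % (url, e)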
As an aside, you should split up those if conditions to make them more readable; see the sketch below.
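For example, a tuple of keywords plus any() keeps the check to one readable line (a sketch; found is the dict from the function above):

SOCIAL_SITES = ('facebook', 'twitter', 'linkedin', 'plus.google')

for link in br.links():
    if any(site in link.url for site in SOCIAL_SITES):
        found[link.url] = {'count': 1, 'texts': [link.text]}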