python - Using multiple web pages in a web scraper
I've been working on Python code to find links to the social media accounts of government websites, to research how easily municipalities can be contacted. I've managed to adapt the code to work in 2.7, and it prints the links to Facebook, Twitter, LinkedIn and Google+ that are present on a given input website. The issue I'm experiencing is that I'm not looking for links on just one web page, but on a list of about 200 websites I have in an Excel file. I have no experience with importing these sorts of lists into Python, so I was wondering if someone could take a look at the code and suggest a proper way to set all of these web pages as the base_url, if possible:
import cookielib
import mechanize

base_url = "http://www.amsterdam.nl"

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.set_handle_redirect(True)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-Agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

page = br.open(base_url, timeout=10)

links = {}
for link in br.links():
    if link.url.find('facebook') >= 0 or link.url.find('twitter') >= 0 or link.url.find('linkedin') >= 0 or link.url.find('plus.google') >= 0:
        links[link.url] = {'count': 1, 'texts': [link.text]}

# printing
for link, data in links.iteritems():
    print "%s - %s - %s - %d" % (base_url, link, ",".join(data['texts']), data['count'])
You mentioned that you have an Excel file with the list of websites, right? You can export that Excel file to a CSV file, and then read the values from that file in your Python code.
The documentation for Python's csv module has more information regarding that. There are also libraries for working with Excel files directly.
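For instance, the xlrd library is a common choice for reading .xls files in Python 2. A small sketch (the filename 'urls.xls' and the assumption that the URLs sit in the first column of the first sheet are mine):

import xlrd

book = xlrd.open_workbook('urls.xls')
sheet = book.sheet_by_index(0)
# one URL per row, taken from the first column
links = [sheet.cell_value(row, 0) for row in range(sheet.nrows)]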
you can along lines :
import csv

links = []
with open('urls.csv', 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    # simple example, assuming a single column of URLs is present;
    # take the first cell of each row so links holds plain strings
    links = [row[0] for row in csv_reader]
Now links is a list of URLs. You can loop over that list inside a function that fetches each page and scrapes the data.
def extract_social_links(links):
    for base_url in links:
        br = mechanize.Browser()
        cj = cookielib.LWPCookieJar()
        br.set_cookiejar(cj)
        br.set_handle_robots(False)
        br.set_handle_equiv(False)
        br.set_handle_redirect(True)
        br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
        br.addheaders = [('User-Agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

        page = br.open(base_url, timeout=10)

        # use a name other than links so the parameter is not shadowed
        found = {}
        for link in br.links():
            if link.url.find('facebook') >= 0 or link.url.find('twitter') >= 0 or link.url.find('linkedin') >= 0 or link.url.find('plus.google') >= 0:
                found[link.url] = {'count': 1, 'texts': [link.text]}

        # print the collected links for this site
        for link, data in found.iteritems():
            print "%s - %s - %s - %d" % (base_url, link, ",".join(data['texts']), data['count'])
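One practical note: with ~200 sites, a single unreachable host will raise an exception and abort the whole run, so you may want to call the function one URL at a time and skip failures. A minimal driver sketch (the try/except is my addition, not part of the answer above):

for url in links:
    try:
        extract_social_links([url])
    except Exception as e:  # in practice you might narrow this to urllib2.URLError
        print "skipping %s: %s" % (url, e)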
As an aside, you should split up those if conditions to make them more readable; see the sketch below.
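For example, a tuple of keywords plus any() keeps the check to one readable line (a sketch; found is the dict from the function above):

SOCIAL_SITES = ('facebook', 'twitter', 'linkedin', 'plus.google')

for link in br.links():
    if any(site in link.url for site in SOCIAL_SITES):
        found[link.url] = {'count': 1, 'texts': [link.text]}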