Web scraping - non href

by To Thanh Nhat Khang   Last Updated August 14, 2019 05:26 AM

I have a list of websites in a CSV file, and I'd like to capture all the PDFs linked from them.

BeautifulSoup's select works fine on <a href>, but one website starts its PDF links with <data-url="https://example.org/abc/qwe.pdf"> and soup doesn't catch anything.

Is there any code I could use to get everything that starts with "data-url" and ends with .pdf?

I apologize for the messy code. I'm still learning. Please let me know if I can provide clarification.

Thank you :D

The CSV looks like this:

123456789 https://example.com

234567891 https://example2.com

import os
import requests
import csv
from urllib.parse import urljoin
from bs4 import BeautifulSoup

#Write csv into tuples
with open('links.csv') as f:
    url=[tuple(line) for line in csv.reader(f)]
print(url)

#If there is no such folder, the script will create one automatically
folder_location = r'C:\webscrapping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

def url_response(prefix, url):
    # Download every PDF linked through <a href="....pdf"> on the page at `url`
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    for i, link in enumerate(soup.select("a[href$='.pdf']")):
        #Translating captured URLs into local addresses
        filename = os.path.join(folder_location, link['href'].split('/')[-1])
        print(filename)
        #Writing files into said addresses
        with open(filename, 'wb') as f:
            f.write(requests.get(urljoin(url, link['href'])).content)
        #Rename files to <id>_<index>.pdf, keeping them in the same folder
        os.rename(filename, os.path.join(folder_location, str(prefix) + "_" + str(i) + ".pdf"))

#Loop the csv: first column is the id used as filename prefix, second is the URL
for a, b in url:
    url_response(a, b)

Tags : python-3.x


Answers 1


If BeautifulSoup is not helping you, a regex solution to find the links would be as follows:

Sample HTML:

txt = """
        <html>
        <body>
        <p>
        <data-url="https://example.org/abc/qwe.pdf">
        </p>
        <p>
        <data-url="https://example.org/def/qwe.pdf">
        </p>
        </html>
        """

Regex code to extract links inside data-url:

import re

# Raw strings avoid doubled backslashes in the patterns
re1 = r'(<data-url=")'  # starts with
re2 = r'((?:http|https)(?::\/{2}[\w]+)(?:[\/|\.]?)(?:[^\s"]*))'  # HTTP URL
re3 = r'(">)'  # ends with

rg = re.compile(re1 + re2 + re3, re.IGNORECASE | re.DOTALL)
links = rg.findall(txt)

# Each match is a tuple of the three groups; the URL is the second one
for match in links:
    print(match[1])

Output:

https://example.org/abc/qwe.pdf
https://example.org/def/qwe.pdf
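
The pattern above accepts any http(s) URL inside <data-url="...">, while the question asks only for links ending in .pdf. A minimal sketch of a narrower variant follows, using only the standard library. The helper name find_pdf_data_urls, the sample markup, and the base URL are hypothetical; it also assumes the attribute value is quoted, contains no whitespace, and may appear on an ordinary tag such as <a data-url="...">, which the original post does not confirm.

```python
import re
from urllib.parse import urljoin

# Matches data-url="..." or data-url='...' whose value ends in .pdf.
# Assumption: the value is quoted and contains no whitespace.
DATA_URL_PDF = re.compile(
    r"""data-url\s*=\s*["']([^"'\s]+\.pdf)["']""", re.IGNORECASE
)


def find_pdf_data_urls(html, base_url):
    """Return absolute URLs for every data-url value in `html` ending in .pdf."""
    # urljoin leaves absolute URLs untouched and resolves relative ones
    return [urljoin(base_url, value) for value in DATA_URL_PDF.findall(html)]


# Hypothetical sample: the first line mirrors the markup quoted in the question,
# the other two are assumed variations for illustration.
sample = """
<p><data-url="https://example.org/abc/qwe.pdf"></p>
<p><a data-url="/def/report.PDF">report</a></p>
<p><a data-url="https://example.org/page.html">not a pdf</a></p>
"""

for pdf_url in find_pdf_data_urls(sample, "https://example.org/"):
    print(pdf_url)
```

In the question's download loop, response.text and the page URL would take the place of sample and the base URL. If data-url turns out to be an attribute on a regular tag, BeautifulSoup's CSS attribute selector, for example soup.select("[data-url$='.pdf']"), may also be worth trying before falling back to a regex.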
Ankur Sinha
August 14, 2019 05:25 AM
