I have this simple code to scrape the latest number of confirmed coronavirus cases in Italy from a site with Python.
It is just some basic web scraping. We start by importing:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
Now that we have the tools, let’s use them
url = "https://lab24.ilsole24ore.com/coronavirus/"
# access the site
response = requests.get(url)
print(response)
We made a request to the site of ilsole24ore. If we get the following output, the request succeeded:
<Response [200]>
We parse the html that we get as response
soup = BeautifulSoup(response.text, "html.parser")
We get all the <h2> tags, because the data is inside an h2 tag, as you can see in the code below.
To see the inspector window on the right, right click with the mouse and choose Inspect from the menu, then click the arrow at the top left (click 1), then click the number of confirmed cases (click 2). You will see on the right the <h2> tag that contains the number.
h2 = soup.findAll("h2")
I can see that the first h2 is the one that I need; let’s see what we have
print(h2)
We will see this
[<h2 class="timer count-number" data-speed="1000" data-to="2706" id="num_1"></h2>, <h2 class="timer count-number" data-speed="1000" data-to="107" id="num_2"></h2>, <h2 class="timer count-number" data-speed="1000" data-to="276" id="num_3"></h2>, <h2 class="timer count-number" data-speed="1000" data-to="3089" id="num_4"></h2>, <h2 class="chartTitle"><center>I numeri complessivi</center></h2>, <h2 class="chartTitle">Il trend giorno per giorno</h2>, <h2 class="chartTitle">L’andamento nelle province con più contagi</h2>, <h2 class="chartTitle">L’andamento delle 5 regioni con più contagi</h2>, <h2 class="chartTitle">I dati per provincia</h2>, <h2 class="chartTitle">I nuovi tamponi giornalieri</h2>, <h2 class="chartTitle">Ricoveri e terapie intensive</h2>, <h2 class="chartTitle">Come crescono i ricoveri</h2>, <h2 class="chartTitle">Il contagio nei Paesi europei</h2>, <h2 class="chartTitle">I primi dieci Paesi al mondo per contagio</h2>]
As you can see, the first one is the one we are looking for, with the number 2706.
I am going to transform the first h2 element (h2[0]) into a string and then print the sixth element (index 5) of the list obtained by splitting that string on double quotes.
h2 = str(h2[0])
print("Last data of confirmed case in Italy:")
print(h2.split("\"")[5])
Just to be clear: if I split with h2.split("\"") I get this:
['<h2 class=', 'timer count-number', ' data-speed=', '1000', ' data-to=', '2706', ' id=', 'num_1', '></h2>']
And the sixth element is "2706", the number I was looking for.
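As a side note, instead of converting the tag to a string and splitting on quotes, BeautifulSoup can read the data-to attribute directly, which is less fragile if the attribute order ever changes. A minimal sketch, using the first counter tag copied from the output above:

```python
from bs4 import BeautifulSoup

# Sample markup copied from the page's first counter tag
html = '<h2 class="timer count-number" data-speed="1000" data-to="2706" id="num_1"></h2>'
soup = BeautifulSoup(html, "html.parser")

# Tags support dictionary-style access to their attributes
h2 = soup.find("h2")
print(h2["data-to"])  # -> 2706
```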
The whole code
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
# the site to scrape
url = "https://lab24.ilsole24ore.com/coronavirus/"
# access the site
response = requests.get(url)
# now we parse the html with BS
soup = BeautifulSoup(response.text, "html.parser")
h2 = soup.findAll("h2")
h2 = str(h2[0])
print("Last data of confirmed case in Italy:")
print(h2.split("\"")[5])
Let’s get the result with a shell here
# press RUN below to see the data
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
# the site to scrape
url = "https://lab24.ilsole24ore.com/coronavirus/"
# access the site
response = requests.get(url)
# now we parse the html with BS
soup = BeautifulSoup(response.text, "html.parser")
h2 = soup.findAll("h2")
h2 = str(h2[0])
print("Last data of confirmed case in Italy:")
print(h2.split("\"")[5])
The data of the previous days
# press RUN below to see the data
import pandas as pd
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
days = []
days2 = []
for n in range(5, 0, -1):
day = datetime.today() - timedelta(n)
days2.append(datetime(2020, day.month, day.day))
days.append(str(day.month) + "/" + str(day.day))
month = str(datetime.today().month)
c = []
def check(what, day):
url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-{}.csv".format(what)
df = pd.read_csv(url, error_bad_lines=False)
print(what, end=" ")
result = df.loc[df["Country/Region"]=="Italy"]["{}/20".format(day)]
print(list(result)[0], end=" - ")
if what == "Confirmed":
c.append(list(result)[0])
what = "Confirmed", "Recovered", "Deaths"
for d in days:
print("{}".format(d), end=": ")
for w in what:
check(w, d)
print()
#sorted(days, key=lambda d: map(int, d.split('/')))
ax = plt.subplot(111)
ax.bar(days2, c)
ax.xaxis_date()
plt.xlabel("days: ")
plt.ylabel("Confirmed / positivi")
plt.show()
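The manual str(day.month) + "/" + str(day.day) concatenation in the loop above can be wrapped in a small helper. This is just a sketch (the function name is mine, not from the original script), producing the same "month/day" labels that the Johns Hopkins CSV uses for its date columns:

```python
from datetime import datetime, timedelta

def last_n_day_labels(n, today=None):
    """Return "month/day" labels for the n days before `today`, oldest first."""
    today = today or datetime.today()
    return ["{}/{}".format(d.month, d.day)
            for d in (today - timedelta(k) for k in range(n, 0, -1))]

# With a fixed date, the labels match what the loop above builds:
print(last_n_day_labels(3, datetime(2020, 3, 6)))  # -> ['3/3', '3/4', '3/5']
```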
More data scraped
Here we add some other data:
<script async src="https://cdn.datacamp.com/dcl-react.js.gz"></script>
<div class="exercise">
  <div data-datacamp-exercise data-lang="python">
    <code data-type="pre-exercise-code">
    </code>
    <code data-type="sample-code">
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
# the site to scrape
url = "https://lab24.ilsole24ore.com/coronavirus/"
# access the site
response = requests.get(url)
# now we parse the html with BS
soup = BeautifulSoup(response.text, "html.parser")
h2 = soup.findAll("h2")
print("Today's data about coronavirus in Italy:")
print("Confirmed: ", str(h2[0]).split("\"")[5])
print("Deaths: ", str(h2[1]).split("\"")[5])
print("Recovered: ", str(h2[2]).split("\"")[5])
    </code>
    <code data-type="solution"></code>
    <code data-type="sct"></code>
    <div data-type="hint">Just press 'Run'.</div>
  </div>
</div>
Utility to create your interactive shell
Here is a little utility to build an interactive Python shell like the ones above for your own site, faster.
You need this template file, called code_shell.txt, in your working directory:
<script async src="https://cdn.datacamp.com/dcl-react.js.gz"></script>
<div class="exercise">
  <div data-datacamp-exercise data-lang="python">
    <code data-type="pre-exercise-code">
    </code>
    <code data-type="sample-code">
{{code}}
    </code>
    <code data-type="solution"></code>
    <code data-type="sct"></code>
    <div data-type="hint">Just press 'Run'.</div>
  </div>
</div>
Then you save your Python script (whatever you want to run on the web page), and you run this other script:
import os
from tkinter import filedialog

filename = filedialog.askopenfilename(
    initialdir=".",
    filetypes=[("Python files", ".py")])

def openfile(filename):
    with open(filename) as filepy:
        filepy = filepy.read()
    return filepy

filepy = openfile(filename)
filetxt = openfile("code_shell.txt")

# Create file to go into wordpress
filetxt = filetxt.replace("{{code}}", filepy)
filename = filename[:-3] + ".html"  # strip the ".py" extension
with open(filename, "w") as file:
    file.write(filetxt)
os.startfile(filename)
When you run this, you will be asked to choose the Python file with the code you want to run in a web page, and after you choose it, the generated page opens in your browser with the code running in a shell.
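The core of the utility is just the {{code}} substitution; here is that step isolated, without the tkinter file dialog (the template string below is a shortened stand-in for code_shell.txt):

```python
# Shortened stand-in for the code_shell.txt template
template = """<code data-type="sample-code">
{{code}}
</code>"""

script = 'print("hello from the shell")'

# Same substitution the utility performs on the full template
page = template.replace("{{code}}", script)
print(page)
```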
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
# the site to scrape
url = "https://lab24.ilsole24ore.com/coronavirus/"
# access the site
response = requests.get(url)
# now we parse the html with BS
soup = BeautifulSoup(response.text, "html.parser")
h2 = soup.findAll("h2")
print("Today's data about coronavirus in Italy:")
print("Confirmed: ", str(h2[0]).split("\"")[5])
print("Deaths: ", str(h2[1]).split("\"")[5])
print("Recovered: ", str(h2[2]).split("\"")[5])