I have this simple code to scrape the latest number of confirmed coronavirus cases in Italy from a site with Python.
It is just some basic web scraping. We start by importing:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
Now that we have the tools, let’s use them
url = "https://lab24.ilsole24ore.com/coronavirus/"
# access the site
response = requests.get(url)
print(response)
We made a request to the site of ilsole24ore. If we get the following output, the request succeeded:
<Response [200]>
We parse the html that we get as response
soup = BeautifulSoup(response.text, "html.parser")
We get all the <h2> tags, because the data is inside an h2 tag, as you can see in the code below.
To see the inspector window on the right, right click with the mouse and choose Inspect from the menu, then click the arrow at the top left (click 1), then click the number of confirmed cases (click 2). You will see on the right the <h2> tag that contains the number.
h2 = soup.findAll("h2")
I can see that the first h2 is the one that I need; let’s see what we have
print(h2)
We will see this
[<h2 class="timer count-number" data-speed="1000" data-to="2706" id="num_1"></h2>, <h2 class="timer count-number" data-speed="1000" data-to="107" id="num_2"></h2>, <h2 class="timer count-number" data-speed="1000" data-to="276" id="num_3"></h2>, <h2 class="timer count-number" data-speed="1000" data-to="3089" id="num_4"></h2>, <h2 class="chartTitle"><center>I numeri complessivi</center></h2>, <h2 class="chartTitle">Il trend giorno per giorno</h2>, <h2 class="chartTitle">L’andamento nelle province con più contagi</h2>, <h2 class="chartTitle">L’andamento delle 5 regioni con più contagi</h2>, <h2 class="chartTitle">I dati per provincia</h2>, <h2 class="chartTitle">I nuovi tamponi giornalieri</h2>, <h2 class="chartTitle">Ricoveri e terapie intensive</h2>, <h2 class="chartTitle">Come crescono i ricoveri</h2>, <h2 class="chartTitle">Il contagio nei Paesi europei</h2>, <h2 class="chartTitle">I primi dieci Paesi al mondo per contagio</h2>]
As you can see, the first one is the one we are looking for, with the number 2706.
I am going to transform the first h2 element (h2[0]) into a string and then print the sixth element (index 5) of the list obtained by splitting that string on double quotes.
h2 = str(h2[0])
print("Last data of confirmed case in Italy:")
print(h2.split("\"")[5])
Just to be clear: if I split with h2.split("\"") I get this:
['<h2 class=', 'timer count-number', ' data-speed=', '1000', ' data-to=', '2706', ' id=', 'num_1', '></h2>']
And the sixth element is "2706", the number I was looking for.
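As a side note, instead of converting the tag to a string and splitting on quotes, BeautifulSoup can read the data-to attribute directly, which is less fragile if the attribute order ever changes. A minimal sketch, using the first counter tag copied from the output above:

```python
from bs4 import BeautifulSoup

# Sample markup copied from the page's first counter tag
html = '<h2 class="timer count-number" data-speed="1000" data-to="2706" id="num_1"></h2>'
soup = BeautifulSoup(html, "html.parser")

# Tags support dictionary-style access to their attributes
h2 = soup.find("h2")
print(h2["data-to"])  # -> 2706
```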
The whole code
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
# the site to scrape
url = "https://lab24.ilsole24ore.com/coronavirus/"
# access the site
response = requests.get(url)
# now we parse the html with BS
soup = BeautifulSoup(response.text, "html.parser")
h2 = soup.findAll("h2")
h2 = str(h2[0])
print("Last data of confirmed case in Italy:")
print(h2.split("\"")[5])
Let’s get the result with a shell here
# press RUN below to see the data
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
# the site to scrape
url = "https://lab24.ilsole24ore.com/coronavirus/"
# access the site
response = requests.get(url)
# now we parse the html with BS
soup = BeautifulSoup(response.text, "html.parser")
h2 = soup.findAll("h2")
h2 = str(h2[0])
print("Last data of confirmed case in Italy:")
print(h2.split("\"")[5])
The data of the previous days
# press RUN below to see the data
import pandas as pd
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
days = []
days2 = []
for n in range(5, 0, -1):
day = datetime.today() - timedelta(n)
days2.append(datetime(2020, day.month, day.day))
days.append(str(day.month) + "/" + str(day.day))
month = str(datetime.today().month)
c = []
def check(what, day):
url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-{}.csv".format(what)
df = pd.read_csv(url, error_bad_lines=False)
print(what, end=" ")
result = df.loc[df["Country/Region"]=="Italy"]["{}/20".format(day)]
print(list(result)[0], end=" - ")
if what == "Confirmed":
c.append(list(result)[0])
what = "Confirmed", "Recovered", "Deaths"
for d in days:
print("{}".format(d), end=": ")
for w in what:
check(w, d)
print()
#sorted(days, key=lambda d: map(int, d.split('/')))
ax = plt.subplot(111)
ax.bar(days2, c)
ax.xaxis_date()
plt.xlabel("days: ")
plt.ylabel("Confirmed / positivi")
plt.show()
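The manual str(day.month) + "/" + str(day.day) concatenation in the loop above can be wrapped in a small helper. This is just a sketch (the function name is mine, not from the original script), producing the same "month/day" labels that the Johns Hopkins CSV uses for its date columns:

```python
from datetime import datetime, timedelta

def last_n_day_labels(n, today=None):
    """Return "month/day" labels for the n days before `today`, oldest first."""
    today = today or datetime.today()
    return ["{}/{}".format(d.month, d.day)
            for d in (today - timedelta(k) for k in range(n, 0, -1))]

# With a fixed date, the labels match what the loop above builds:
print(last_n_day_labels(3, datetime(2020, 3, 6)))  # -> ['3/3', '3/4', '3/5']
```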
More data scraped
Here we add some other data:
<script async src="https://cdn.datacamp.com/dcl-react.js.gz"></script>
<div class="exercise">
  <div data-datacamp-exercise data-lang="python">
    <code data-type="pre-exercise-code">
    </code>
    <code data-type="sample-code">
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
# the site to scrape
url = "https://lab24.ilsole24ore.com/coronavirus/"
# access the site
response = requests.get(url)
# now we parse the html with BS
soup = BeautifulSoup(response.text, "html.parser")
h2 = soup.findAll("h2")
print("Today's data about coronavirus in Italy:")
print("Confirmed: ", str(h2[0]).split("\"")[5])
print("Deaths: ", str(h2[1]).split("\"")[5])
print("Recovered: ", str(h2[2]).split("\"")[5])
    </code>
    <code data-type="solution"></code>
    <code data-type="sct"></code>
    <div data-type="hint">Just press 'Run'.</div>
  </div>
</div>
Utility to create your interactive shell
Here is a little utility to build an interactive Python shell like the ones above for your own site, faster.
You need this template file, called code_shell.txt, in your working directory:
<script async src="https://cdn.datacamp.com/dcl-react.js.gz"></script>
<div class="exercise">
  <div data-datacamp-exercise data-lang="python">
    <code data-type="pre-exercise-code">
    </code>
    <code data-type="sample-code">
{{code}}
    </code>
    <code data-type="solution"></code>
    <code data-type="sct"></code>
    <div data-type="hint">Just press 'Run'.</div>
  </div>
</div>
Then you save your Python script (whatever you want to run on the web page), and you run this other script:
import os
from tkinter import filedialog

filename = filedialog.askopenfilename(
    initialdir=".",
    filetypes=[("Python files", ".py")])

def openfile(filename):
    with open(filename) as filepy:
        filepy = filepy.read()
    return filepy

filepy = openfile(filename)
filetxt = openfile("code_shell.txt")

# Create file to go into wordpress
filetxt = filetxt.replace("{{code}}", filepy)
filename = filename[:-3] + ".html"  # strip the ".py" extension
with open(filename, "w") as file:
    file.write(filetxt)
os.startfile(filename)
When you run this, you will be asked to choose the Python file with the code you want to run in a web page, and after you choose it, the generated page opens in your browser with the code running in a shell.
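The core of the utility is just the {{code}} substitution; here is that step isolated, without the tkinter file dialog (the template string below is a shortened stand-in for code_shell.txt):

```python
# Shortened stand-in for the code_shell.txt template
template = """<code data-type="sample-code">
{{code}}
</code>"""

script = 'print("hello from the shell")'

# Same substitution the utility performs on the full template
page = template.replace("{{code}}", script)
print(page)
```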
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
# the site to scrape
url = "https://lab24.ilsole24ore.com/coronavirus/"
# access the site
response = requests.get(url)
# now we parse the html with BS
soup = BeautifulSoup(response.text, "html.parser")
h2 = soup.findAll("h2")
print("Today's data about coronavirus in Italy:")
print("Confirmed: ", str(h2[0]).split("\"")[5])
print("Deaths: ", str(h2[1]).split("\"")[5])
print("Recovered: ", str(h2[2]).split("\"")[5])