Web Scraping Popular Indonesian News Using Python, Requests, and BeautifulSoup4

Aris Febrianto
5 min read · Jan 12, 2021

Web scraping is the process of retrieving specific semi-structured data from a web page. The page is generally built with a markup language such as HTML or XHTML, and the scraper parses the document before it starts extracting data. In other words, it is a way of collecting data from a web page so that it can be turned into information.

Data can be collected manually by copying and pasting it into a text file or a spreadsheet (Excel), but that becomes impractical when the amount of data is large.

Web scraping automates this work, so the same data can be collected much faster.

In this article we will scrape the most popular news section of a news portal. We will use the Python programming language with the requests and BeautifulSoup4 libraries, and display the scraping results on a minimalist website built with the Flask framework and Bootstrap 4.

Before scraping, complete these setup steps:
1. Download and install Python 3.7
2. Download an IDE such as PyCharm
3. Create and activate a virtual environment
4. Install the Requests library: pip install requests
5. Install the beautifulsoup4 library: pip install beautifulsoup4
6. Install Flask: pip install flask
7. Freeze the installed packages to a requirements file: pip freeze > requirements.txt
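
A quick way to confirm that the setup worked is to check that the three packages can be found from inside the virtual environment. The sketch below uses only the standard library; `missing_packages` and the `REQUIRED` mapping are hypothetical names chosen for this example.

```python
from importlib.util import find_spec

# pip package name -> module name used in import statements
REQUIRED = {
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "flask": "flask",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [pkg for pkg, module in required.items() if find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Missing packages, try: pip install " + " ".join(missing))
else:
    print("All required packages are installed.")
```

If the script reports missing packages, the virtual environment is probably not activated in the terminal where it was run.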

After completing the steps above, we can start scraping the most popular news from detiknews. The steps are:

1. Open the detiknews page

Open the detiknews site and find the most popular news section. Its URL looks like this: https://www.detik.com/terpopuler?tag_from=wp_cb_mostPopular_more

2. Call and use the requests library

Steps to call the requests library:

1. Create a Python file, for example: scraping_detik_news.py

2. Import the requests library:

import requests

3. Create a variable that holds the result of requests.get and print it:

req = requests.get('https://www.detik.com/terpopuler', params={'tag_from': 'wp_cb_mostPopular_more'})
print(req)

4. Run scraping_detik_news.py and check the output. If it prints <Response [200]>, the request succeeded and we can continue. If the status is not 200, check the code above again.
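
The `params` argument is what produces the `?tag_from=...` part of the URL from step 1. The sketch below builds the request without sending it, so it runs offline. `build_url` and `fetch_html` are hypothetical helper names, and the 10-second timeout is an assumed value rather than something the script above uses.

```python
import requests

URL = "https://www.detik.com/terpopuler"
PARAMS = {"tag_from": "wp_cb_mostPopular_more"}


def build_url(url, params):
    """Return the final URL requests would call, without sending anything."""
    prepared = requests.Request("GET", url, params=params).prepare()
    return prepared.url


def fetch_html(url, params, timeout=10):
    """Send the GET request and return the page HTML.

    raise_for_status() turns any 4xx/5xx response into an exception,
    so a failed request is never parsed as if it were a real page.
    """
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()
    return response.text


# Only the offline helper is called here; fetch_html needs network access.
print(build_url(URL, PARAMS))
```

Calling `fetch_html(URL, PARAMS)` would then replace the manual check for `<Response [200]>`, because any non-success status raises an error instead.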

3. Call and use the BeautifulSoup4 library

Once the previous step succeeds, we move on to the beautifulsoup4 library.

1. Import the beautifulsoup4 library:

from bs4 import BeautifulSoup

2. Create a variable that holds the parsed document and print it:

soup = BeautifulSoup(req.text, 'html.parser')
print(soup)

3. Run scraping_detik_news.py again. If the output is a set of HTML tags, we can continue to the next step.
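
To get a feel for the parsed object without touching the network, here is a minimal sketch that parses a small HTML string. The markup in `SAMPLE_HTML` is made up for illustration and is not detik's real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to demonstrate the parser.
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <a href="/first">First link</a>
    <a href="/second">Second link</a>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

page_title = soup.title.string                      # text inside <title>
links = [a["href"] for a in soup.find_all("a")]     # every href, in document order
first_text = soup.find("a").get_text(strip=True)    # text of the first <a> only

print(page_title)
print(links)
print(first_text)
```

`find` returns the first match (or `None`), while `find_all` returns a list, which is why the later steps loop over its result.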

4. Inspect the elements of the popular area and scrape them in code

We will take the image, title, and URL of each of the most popular news items so we can display them again on our own web page. First inspect the popular area in the browser, then scrape it and collect the items it contains:

popular_area = soup.find(attrs={'class': 'grid-row list-content'})
content_item = popular_area.find_all(attrs={'class': 'list-content__item'})

5. Loop over the scraped content_item

content_item is a list, so we loop over it to get the image, title, and URL of each popular news item:

for index in content_item:
    print(f"image : {index.find('a').find('img')}")
    print(f"title : {index.find_all(attrs={'class': 'media__title'})[0].find('a').text}")
    print(f"href : {index.find_all(attrs={'class': 'media__title'})[0].find('a')['href']}")
scraping_detik_news.py

The loop prints the image tag, title, and URL of each news item.
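
Chained calls such as `find('a').find('img')` raise an error as soon as one item lacks the expected tag. A more defensive variant is sketched below. It reuses the class names from the steps above, but `SAMPLE_HTML` is invented markup and `extract_items` is a hypothetical helper, so treat it as an illustration rather than a drop-in match for the live page.

```python
from bs4 import BeautifulSoup

# Invented markup that mimics the class names used in this article.
SAMPLE_HTML = """
<div class="grid-row list-content">
  <article class="list-content__item">
    <a href="https://example.com/a"><img src="https://example.com/a.jpg" alt="A"></a>
    <h3 class="media__title"><a href="https://example.com/a">First headline</a></h3>
  </article>
  <article class="list-content__item">
    <h3 class="media__title"><a href="https://example.com/b"> Second headline </a></h3>
  </article>
  <article class="list-content__item"><p>No title here</p></article>
</div>
"""


def extract_items(html):
    """Return a list of dicts with title, url, and image for each news item."""
    soup = BeautifulSoup(html, "html.parser")
    popular_area = soup.find(attrs={"class": "grid-row list-content"})
    if popular_area is None:
        return []  # layout changed or the page did not load as expected

    items = []
    for item in popular_area.find_all(attrs={"class": "list-content__item"}):
        title_tag = item.find(attrs={"class": "media__title"})
        link = title_tag.find("a") if title_tag else None
        if link is None:
            continue  # skip items without a title link
        img = item.find("img")
        items.append({
            "title": link.get_text(strip=True),
            "url": link.get("href"),
            "image": img.get("src") if img else None,
        })
    return items


for news in extract_items(SAMPLE_HTML):
    print(news)
```

Returning plain dictionaries also keeps the scraping logic separate from whatever displays the data later.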

6. Create a templates folder under the project

Flask looks for templates in this folder by default, so no extra configuration is needed.

7. Create a base HTML template

Create a base.html template that holds the HTML shared by every page (the head, the navigation bar, and the closing tags). Later, when we add a new page, it extends base.html and we only need to fill in the body block.

<!DOCTYPE html>
<html lang="en">
<head>
<title>Indonesian News Popular</title>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.16.0/umd/popper.min.js"></script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/4.5.2/js/bootstrap.min.js"></script>
</head>
<body>

<nav class="navbar navbar-expand-sm bg-dark navbar-dark">
<!-- Brand/logo -->
<a class="navbar-brand" href="/">Indonesian News Popular</a>

<!-- Links -->
<ul class="navbar-nav">
<li class="nav-item">
<a class="nav-link" href="/detik-popular">Detik News Popular</a>
</li>
<li class="nav-item">
<a class="nav-link" href="/kompas-popular">Kompas News Popular</a>
</li>
<li class="nav-item">
<a class="nav-link" href="/tempo-popular">Tempo News Popular</a>
</li>
</ul>
</nav>

<div class="container-fluid" style="margin-top:10px;">
{% block body %}

{% endblock %}
</div>
</body>
</html>
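
The `{% block body %}` placeholder is what child pages fill in. As a rough illustration of how this inheritance works, the sketch below renders a child template with Jinja2 directly (Jinja2 is installed together with Flask). The in-memory templates, the `child.html` name, and the `message` variable are assumptions made for this example only.

```python
from jinja2 import DictLoader, Environment

# Tiny in-memory stand-ins for files in the templates folder.
TEMPLATES = {
    "base.html": (
        "<html><body><nav>Indonesian News Popular</nav>"
        "{% block body %}{% endblock %}"
        "</body></html>"
    ),
    "child.html": (
        "{% extends 'base.html' %}"
        "{% block body %}<p>{{ message }}</p>{% endblock %}"
    ),
}

env = Environment(loader=DictLoader(TEMPLATES), autoescape=True)

# Rendering the child produces the base layout with the body block replaced.
page = env.get_template("child.html").render(message="Hello from the child page")
print(page)
```

The child template only defines the block content; everything outside the block comes from base.html.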

8. Create the file run.py

Type the following code:

import requests
from bs4 import BeautifulSoup
from flask import Flask, render_template

app = Flask(__name__)

@app.route('/')
def home():
    # Render the shared layout as the home page
    return render_template('base.html')

if __name__ == '__main__':
    app.run(debug=True)

@app.route('/') registers the home route; when that URL is opened, Flask renders the base.html file.
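
Routes can also be checked without starting the server, using Flask's built-in test client. This is a minimal sketch, assuming a stripped-down app that returns a plain string so that no templates folder is needed; the `/not-a-route` path is a made-up URL used to show the 404 case.

```python
from flask import Flask

app = Flask(__name__)


@app.route("/")
def home():
    # A plain string keeps this example independent of the templates folder.
    return "home page"


# The test client sends requests to the app in-process, with no server running.
with app.test_client() as client:
    response = client.get("/")
    missing = client.get("/not-a-route")

print(response.status_code)
print(missing.status_code)
```

A registered route answers with status 200, while an unknown path answers with 404.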

9. Create a route to display the list of most popular news

Add this route to run.py above the if __name__ == '__main__': block, so it is registered before the server starts:

@app.route('/detik-popular')
def detik_popular():
    # Fetch the popular page and parse it
    html_doc = requests.get('https://www.detik.com/terpopuler', params={'tag_from': 'wp_cb_mostPopular_more'})
    soup = BeautifulSoup(html_doc.text, 'html.parser')

    # Collect every news item inside the popular area
    popular_area = soup.find(attrs={'class': 'grid-row list-content'})
    content_item = popular_area.find_all(attrs={'class': 'list-content__item'})

    return render_template('detik-scraping.html', content_item=content_item)
run.py

10. Create a file named detik-scraping.html in the templates folder and extend the base.html we made earlier:

{% extends 'base.html' %}

11. Next, fill in the body block like this:

{% block body %}
{% for index in content_item %}
<div class="card card card-type-1">
<div class="wrapper clearfix">
{{ index.find('a').find('img')|safe }}
<a href="{{ index.find_all(attrs={'class': 'media__title'})[0].find('a')['href']|safe }}">{{ index.find_all(attrs={'class': 'media__title'})[0].find('a').text|safe }}</a>
</div>
</div>
{% endfor %}
{% endblock %}
detik-scraping.html
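
The template above calls BeautifulSoup methods directly and marks scraped values as safe. One possible alternative, sketched here under assumptions, is to convert the items to plain dictionaries in Python and let the template autoescape them. `SAMPLE_HTML`, `LIST_TEMPLATE`, and `to_plain_items` are hypothetical names, and the route serves the sample markup instead of fetching the live page so that the example runs offline.

```python
from bs4 import BeautifulSoup
from flask import Flask, render_template_string

# Invented markup that mimics the class names used in this article.
SAMPLE_HTML = """
<div class="grid-row list-content">
  <article class="list-content__item">
    <h3 class="media__title"><a href="https://example.com/a">Rain &amp; floods</a></h3>
  </article>
  <article class="list-content__item">
    <h3 class="media__title"><a href="https://example.com/b">Second headline</a></h3>
  </article>
</div>
"""

# Inline template so the example does not depend on a templates folder.
LIST_TEMPLATE = """
{% for item in items %}
<div class="card"><a href="{{ item['url'] }}">{{ item['title'] }}</a></div>
{% endfor %}
"""


def to_plain_items(html):
    """Convert scraped tags into plain dicts the template can display."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for title in soup.find_all(attrs={"class": "media__title"}):
        link = title.find("a")
        if link is not None:
            items.append({
                "title": link.get_text(strip=True),
                "url": link.get("href", ""),
            })
    return items


app = Flask(__name__)


@app.route("/detik-popular")
def detik_popular():
    # In the real app the HTML would come from requests.get(...).text
    return render_template_string(LIST_TEMPLATE, items=to_plain_items(SAMPLE_HTML))


with app.test_client() as client:
    body = client.get("/detik-popular").get_data(as_text=True)

print(body)
```

Because the values are plain strings and no safe filter is used, special characters in a scraped title are escaped when the page is rendered.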

12. When everything is in place, run run.py

Check the console output. If there are no errors, open http://127.0.0.1:5000/ in the browser to see the site. The scraped list is served at http://127.0.0.1:5000/detik-popular.

Web scraping result on the minimalist website

The full source code is available in my GitHub repository.
