Scraping in a Background Process and Scheduling Your Task Using Python and a VPS

Muhammad Faisal
4 min read · Jan 30, 2022


In this article, I'm going to show you how to do web scraping in a background process.

Why should we do that?

In web scraping, you sometimes find that the data on the website you're scraping is updated every week, every day, or even every hour.

It's annoying to have to run the program again and again.

Why don't we run it automatically, 24/7?

What tools do we need?

1. VPS

A VPS (virtual private server) is, as the name suggests, a server we can use virtually: we can run programs and save, edit, or delete files on it.

It is the core tool for running the task in the background.

2. Python

There are lots of programming languages, but in this case I'm using Python because it's simple and has the packages we need for web scraping.

3. BeautifulSoup

BeautifulSoup is a Python library that makes it easy to scrape websites; it provides an HTML parser and can search and modify the parse tree.

4. requests

requests is also a Python library; an HTTP request returns a response object with all the response data, such as content, status, encoding, etc.

5. schedule

schedule is a Python library that lets us run a function periodically (every minute, hour, day, or week).

Getting started

First, we should install the packages we need.

You can simply install Python from the official website here,

and install all the packages we need using the command:

pip install beautifulsoup4 requests

Once they are installed, you can import the packages:

from bs4 import BeautifulSoup
import requests

Then create a variable called url with the website we want to scrape:

url='https://www.ebay.com/sch/i.html'

If you look at the URL, there are parameters such as _nkw and _ipg:
_nkw holds the search keyword and _ipg sets the number of items per page.

After that we can create a variable called params:

params = {"_nkw": "laptop", "_ipg": "50"}

HTTP request

Next, we request the URL with our parameters and parse the page with html.parser:

req = requests.get(url, params=params)
soup = BeautifulSoup(req.text, 'html.parser')
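To check that the parameters were applied, you can print the status code and the final URL that requests built:

print(req.status_code)   # 200 if the request succeeded
print(req.url)           # https://www.ebay.com/sch/i.html?_nkw=laptop&_ipg=50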

Then we find every product:

products = soup.find_all('li', 's-item s-item__pl-on-bottom s-item--watch-at-corner')

Because we use the find_all method, we get all the li elements in a list, so we need to loop over every element in the products variable:

for product in products:

and get the child elements of each product:

for product in products:
    link = product.find('a', 's-item__link')['href']
    print(link)

And that's what we get.

We will scrape the details of every product in the same way.

Here is the code:
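The original embedded this as a snippet; here is a minimal sketch, assuming the listing uses the class names s-item__title and s-item__price for the product name and price (these class names are my assumption, so check the page's HTML):

for product in products:
    link = product.find('a', 's-item__link')['href']
    # assumed class names; inspect the page to confirm them
    title = product.find('div', 's-item__title').text
    price = product.find('span', 's-item__price').text
    print(title, price, link)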

Now let's wrap the code in a function, export the data to JSON, and name the file by the time it was created.
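A minimal sketch of what that function could look like, reusing the class names assumed above and a timestamped filename format of my own choosing:

import json
from datetime import datetime

def get_product():
    req = requests.get(url, params=params)
    soup = BeautifulSoup(req.text, 'html.parser')
    products = soup.find_all('li', 's-item s-item__pl-on-bottom s-item--watch-at-corner')

    data = []
    for product in products:
        data.append({
            'title': product.find('div', 's-item__title').text,   # assumed class name
            'price': product.find('span', 's-item__price').text,  # assumed class name
            'link': product.find('a', 's-item__link')['href'],
        })

    # name the output file by the time it was created
    filename = datetime.now().strftime('products_%Y-%m-%d_%H-%M.json')
    with open(filename, 'w') as f:
        json.dump(data, f, indent=2)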

Now we should install the schedule package using the command:

pip install schedule

There are many options in schedule: you can run the function every day, week, hour, or even every minute, so check the schedule documentation here.
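For example, a few of the patterns schedule supports (using our get_product function):

schedule.every().day.at("10:30").do(get_product)   # every day at 10:30
schedule.every().monday.do(get_product)            # every Monday
schedule.every(10).minutes.do(get_product)         # every 10 minutes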

So we can import schedule and run the function; in this case we run it every hour, so we add the code:

import schedule
import time

schedule.every(1).hours.do(get_product)
while True:
    schedule.run_pending()
    time.sleep(1)

Now we deploy the code to the VPS. There are several ways to do this: you can deploy it manually using WinSCP or clone it from GitHub.
In this case we clone it from GitHub.
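For example (the repository URL below is just a placeholder for your own repo):

git clone https://github.com/your-username/your-scraper.git
cd your-scraper
pip install beautifulsoup4 requests schedule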

In this case we are using an Ubuntu VPS, and we can use the screen command:

screen -S scraping

Now we have a screen session called scraping, and we can attach to it using the command:

screen -r scraping

and run the code.
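Assuming the script is saved as scraper.py (the filename is just an example):

python3 scraper.py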

We can detach from the screen by pressing Ctrl+A, then D.

Then you can leave the VPS and do something else for a few hours.

After a few hours you can check the screen again using the command:

screen -r scraping

and you will see that the program is still running,

and the data is created every hour in JSON format.


