Node.js parsing technologies in the task of aggregating information and evaluating the parameters of cargo routes by extracting data from open sources

  • Anastasia Andreevna Korepanova St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), 14-th Linia, VI, 39, 199178, Saint Petersburg, Russia
  • Fedor Bushmelev St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), 14-th Linia, VI, 39, 199178, Saint Petersburg, Russia
  • Artem Sabrekov St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), 14-th Linia, VI, 39, 199178, Saint Petersburg, Russia
Keywords: web scraping, web crawling, web technologies, Node.js, HTML

Abstract

This article surveys web scraping (web crawling) technologies for Node.js as applied to the task of aggregating information and estimating the parameters of cargo routes by extracting data from open sources. Web scraping problems arise in many contexts, both scientific and industrial; they have wide practical application and significant educational value. However, the existing material on web scraping is scattered and unstructured. Using the scientific and technical problem of aggregating information and evaluating cargo route parameters from open sources as an example, this paper presents an overview of web scraping technologies for Node.js, describes a classification of sites by complexity, systematizes the site features that obstruct scraping, and outlines possible ways to bypass them. The article thereby achieves its didactic goal of systematizing the material on parsing websites.

Author Biographies

Anastasia Andreevna Korepanova, St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), 14-th Linia, VI, 39, 199178, Saint Petersburg, Russia

Junior Researcher, Laboratory of Theoretical and Interdisciplinary Problems of Informatics, SPC RAS, aak@dscs.pro

Fedor Bushmelev, St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), 14-th Linia, VI, 39, 199178, Saint Petersburg, Russia

Junior Researcher, Laboratory of Theoretical and Interdisciplinary Problems of Informatics, SPC RAS, fvb@dscs.pro

Artem Sabrekov, St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), 14-th Linia, VI, 39, 199178, Saint Petersburg, Russia

Junior Researcher, Laboratory of Theoretical and Interdisciplinary Problems of Informatics, SPC RAS, sabrekov2@gmail.com

Published
2022-03-09
How to Cite
Korepanova, A. A., Bushmelev, F., & Sabrekov, A. (2022). Node.js parsing technologies in the task of aggregating information and evaluating the parameters of cargo routes by extracting data from open sources. Computer Tools in Education, (3), 41-56. https://doi.org/10.32603/2071-2340-2021-3-41-56
Section
Informational systems