Volume 4, Issue 4

Intelligent and Adaptive Web Data Extraction System Using Convolutional and Long Short-Term Memory Deep Learning Networks

Sudhir Kumar Patnaik, C. Narendra Babu, Mukul Bhave
Department of Computer Science and Engineering, M. S. Ramaiah University of Applied Sciences, Bangalore 560054, India
Gibraltar India Solutions LLP, Bangalore 560103, India

Abstract

Data are crucial to the growth of e-commerce in today’s world of highly demanding, hyper-personalized consumer experiences, and they are collected using advanced web scraping technologies. However, core data extraction engines fail because they cannot adapt to dynamic changes in website content. This study investigates an intelligent and adaptive web data extraction system built on convolutional and Long Short-Term Memory (LSTM) networks. The system uses the You Only Look Once (YOLO) algorithm for automated web page detection and Tesseract LSTM to extract product details, which are detected as images from web pages. Because this state-of-the-art system does not need a core data extraction engine, it can adapt to dynamic changes in website layout. Experiments conducted on real-world retail cases demonstrate an image detection precision of 97% and a character extraction precision of 99%. In addition, a mean average precision of 74% is obtained with an input dataset of 45 objects or images.
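
The detect-then-read flow described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Detection`, `crop`, and `extract_fields`, the `min_confidence` threshold, and the label names are all hypothetical. The detector and the text reader are passed in as plain callables; in practice they might wrap a YOLO model and Tesseract, but no real library API is used here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# A page screenshot, modelled as rows of pixel values.
# This is a hypothetical stand-in for a real image type.
Image = List[List[int]]


@dataclass(frozen=True)
class Detection:
    """One labelled region proposed by the object detector (hypothetical type)."""
    label: str         # e.g. "price" or "title"; label names are assumptions
    confidence: float  # detector score in [0, 1]
    left: int
    top: int
    right: int         # exclusive
    bottom: int        # exclusive


def crop(image: Image, det: Detection) -> Image:
    """Cut the detected region out of the page image."""
    return [row[det.left:det.right] for row in image[det.top:det.bottom]]


def extract_fields(
    image: Image,
    detect: Callable[[Image], Sequence[Detection]],
    read_text: Callable[[Image], str],
    min_confidence: float = 0.5,  # assumed threshold, not taken from the paper
) -> Dict[str, str]:
    """Detect labelled regions, then read the text inside each one.

    Only the highest-confidence detection per label is kept, and
    detections below min_confidence are discarded.
    """
    best: Dict[str, Detection] = {}
    for det in detect(image):
        if det.confidence < min_confidence:
            continue
        current = best.get(det.label)
        if current is None or det.confidence > current.confidence:
            best[det.label] = det
    # Crop each winning region and hand it to the text reader.
    return {label: read_text(crop(image, det)).strip() for label, det in best.items()}


if __name__ == "__main__":
    # Stub detector and reader, standing in for trained models.
    page = [[0] * 8 for _ in range(4)]
    stub_detect = lambda img: [Detection("price", 0.9, 0, 0, 4, 2)]
    stub_read = lambda region: " 19.99 "
    print(extract_fields(page, stub_detect, stub_read))
```

Because the extraction step depends only on what the detector finds in the rendered page, and not on selectors tied to a page's markup, a layout change would require no change to this flow as long as the detector still locates the regions.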

Keywords:

adaptive web scraping, deep learning, Long Short-Term Memory (LSTM), web data extraction, You Only Look Once (YOLO)
Received: 18 April 2021; Revised: 30 May 2021; Accepted: 28 June 2021; Published: 26 August 2021; Issue date: December 2021
Copyright

© The author(s) 2021

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).

Reprints and permission requests may be sought directly from the editorial office.