Intelligent and Adaptive Web Data Extraction System Using Convolutional and Long Short-Term Memory Deep Learning Networks

Sudhir Kumar Patnaik; C. Narendra Babu; Mukul Bhave

doi:10.26599/BDMA.2021.9020012


Open Access


Department of Computer Science and Engineering, M. S. Ramaiah University of Applied Sciences, Bangalore 560054, India

Gibraltar India Solutions LLP, Bangalore 560103, India


Abstract

Data, collected using advanced web scraping technologies, are crucial to the growth of e-commerce in today’s world of highly demanding, hyper-personalized consumer experiences. However, core data extraction engines fail because they cannot adapt to dynamic changes in website content. This study investigates an intelligent and adaptive web data extraction system built on convolutional and Long Short-Term Memory (LSTM) networks. The system uses the You Only Look Once (YOLO) algorithm for automated web page detection and Tesseract LSTM to extract product details, which are detected as images from web pages. This state-of-the-art system does not need a core data extraction engine and can thus adapt to dynamic changes in website layout. Experiments conducted on real-world retail cases demonstrate an image detection accuracy (precision) of 97% and a character extraction accuracy (precision) of 99%. In addition, a mean average precision of 74% is obtained with an input dataset of 45 objects or images.
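To make the reported detection precision concrete, the following minimal Python sketch shows one common way such a figure can be computed: a predicted bounding box counts as correct when its intersection over union (IoU) with a not-yet-matched ground-truth box reaches a threshold. The function names, the box format, and the 0.5 threshold are illustrative assumptions; the abstract does not state the matching rule or threshold the authors used.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_precision(predicted: List[Box], truth: List[Box],
                        threshold: float = 0.5) -> float:
    """Fraction of predicted boxes matched to a distinct ground-truth box.

    The 0.5 IoU threshold is a common convention, assumed here.
    """
    unmatched = list(truth)
    true_positives = 0
    for box in predicted:
        # Greedily pick the remaining ground-truth box with the highest IoU.
        best = max(unmatched, key=lambda t: iou(box, t), default=None)
        if best is not None and iou(box, best) >= threshold:
            unmatched.remove(best)
            true_positives += 1
    return true_positives / len(predicted) if predicted else 0.0
```

For example, with two ground-truth product regions and three predictions of which two overlap the ground truth well, `detection_precision` returns 2/3. Mean average precision extends this idea by averaging precision over recall levels and object classes.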

Keywords

deep learning; Long Short-Term Memory (LSTM); adaptive web scraping; web data extraction; You Only Look Once (YOLO)


Big Data Mining and Analytics

Volume 4, Issue 4, December 2021

Pages 279-297

DOI: 10.26599/BDMA.2021.9020012

Cite this article:

Patnaik SK, Babu CN, Bhave M. Intelligent and Adaptive Web Data Extraction System Using Convolutional and Long Short-Term Memory Deep Learning Networks. Big Data Mining and Analytics, 2021, 4(4): 279-297. https://doi.org/10.26599/BDMA.2021.9020012


Received: 18 April 2021

Revised: 30 May 2021

Accepted: 28 June 2021

Published: 26 August 2021

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).