Abstract

In today’s World there exists various source of data in various formats (file formats), different structure, different types and etc. which is a hug collection of unstructured over the internet or social media. This gives rise to categorization of data as unstructured, semi structured and structured data. Data that exist in irregular manner without any particular schema are referred as unstructured data which is very difficult to process as it consists of irregularities and ambiguities. So, we are focused on Intelligent Processing Unit which converts unstructured big data into intelligent meaningful information. Intelligent text extraction is a technique that automatically identifies and extracts text from file format. The system consists of different stages which include the pre-processing, keyphase extraction techniques and transformation for the text extraction and retrieve structured data from unstructured data. The system consists multiple method/approach give better result. We are currently working in various file formats and converting the file format into DOCX which will come in the form of the un-structure Form, and then we will obtain that file in the structure form with the help of intelligent Pre-processing. The pre-process stages that triggers the unstructured data/corpus into structured data converting into meaning full. The Initial stage is the system remove the stop word, unwanted symbols noisy data and line spacing. The second stage is Data Extraction from various sources of file or types of files into proper format plain text. The then in third stage we transform the data or information from one format to another for the user to understand the data. The final step is rebuilding the file in its original format maintaining tag of the files. The large size files are divided into sub small size file to executed the parallel processing algorithms for fast processing of larger files and data. Parallel processing is a very important concept for text extraction and with its help; the big file breaks in a small file and improves the result. Extraction of data is done in Bilingual language, and represent the most relevant information contained in the document. Key-phase extraction is an important problem of data mining, Knowledge retrieval and natural speech processing. Keyword Extraction technique has been used to abstract keywords that exclusively recognize a document. Rebuilding is an important part of this project and we will use the entire concept in that file format and in the last, we need the same format which we have done in that file. This concept is being widely used but not much work of the work has been done in the area of developing many functionalities under one tool, so this makes us feel the requirement of such a tool which can easily and efficiently convert unstructured files into structured one.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call