Real-Time Semantic Search Using Approximate Methodology for Large-Scale Storage Systems
ABSTRACT:
The challenges of handling the explosive growth in data volume and
complexity create an increasing need for semantic queries. Semantic
queries can be interpreted as correlation-aware retrieval that tolerates
approximate results. Existing cloud storage systems largely
fail to offer adequate support for semantic queries. Since the
true value or worth of data heavily depends on how efficiently semantic
search can be carried out on the data in (near-) real-time, large
fractions of data end up with their values being lost or significantly
reduced due to data staleness. To address this problem, we propose a
near-real-time, cost-effective methodology based on semantic queries,
called FAST. The idea behind FAST is to explore and exploit the semantic
correlation within and among datasets via correlation-aware hashing and
manageable flat-structured addressing to significantly reduce the
processing latency, while incurring acceptably small loss of data-search
accuracy. The near-real-time property of FAST enables rapid
identification of correlated files and the significant narrowing of the
scope of data to be processed. FAST supports several types of data
analytics, which can be implemented in existing searchable storage
systems. We present a real-world use case in which children reported
missing in an extremely crowded environment (e.g., a highly popular
scenic spot on a peak tourist day) are identified in a timely fashion by
analyzing 60 million images using FAST. FAST is further improved by
using a semantic-aware namespace to provide dynamic and adaptive namespace
management for ultra-large storage systems. Extensive experimental
results demonstrate the efficiency and efficacy of FAST in improving
performance.
EXISTING SYSTEM:
- ISABELA-QA is a parallel query processing engine designed and optimized for analyzing and processing spatiotemporal, multivariate scientific data.
- MixApart uses an integrated data caching and scheduling solution to allow MapReduce computations to analyze data stored on enterprise storage systems. Its front-end caching layer provides the local storage performance required by data analytics, while the shared storage back-end simplifies data management.
- Spyglass exploits the locality of file namespace and skewed distribution of metadata to map the namespace hierarchy into a multi-dimensional K-D tree and uses multilevel versioning and partitioning to maintain consistency.
- Glance, a just-in-time sampling-based system, can provide accurate answers for aggregate and top-k queries without prior knowledge.
DISADVANTAGES OF EXISTING SYSTEM:
- Existing content-based analysis tools not only incur high complexity and cost, but also fail to handle massive numbers of files effectively.
- The high complexity routinely leads to very slow processing operations and very high and often unacceptable latency. Due to the unacceptable latency, the staleness of data severely diminishes the value of data.
- Existing approaches to unstructured data search and analytics rely on system-based chunks of data files.
- Due to the long latency incurred in data processing and the resulting data staleness, the value/worth of data becomes diminished and eventually nullified.
PROPOSED SYSTEM:
- In the context of this paper, searchable data analytics are interpreted as obtaining data value/worth via queried results, such as finding a valuable record, a correlated process ID, an important image, a rebuild system log, etc.
- We propose a novel near-real-time methodology for analyzing massive data, called FAST, with a design goal of efficiently processing such data in a near-real-time manner.
- The key idea behind FAST is to explore and exploit the correlation property within and among datasets via improved correlation-aware hashing and flat-structured addressing to significantly reduce the processing latency of parallel queries, while incurring an acceptably small loss of accuracy.
- The approximate scheme for real-time performance has been widely recognized in system design and high-end computing. In essence, FAST goes beyond the simple combination of existing techniques to offer efficient data analytics via significantly increased processing speed. Through the study of the FAST methodology, we aim to make the following contributions for near-real-time data analytics.
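The combination of correlation-aware hashing and flat-structured addressing described above can be illustrated with a minimal sketch. It assumes random-hyperplane locality-sensitive hashing as a stand-in for the correlation-aware hash and a plain hash map as the flat address space; the text above does not specify either choice. The class name CorrelationIndex, its methods, and the feature vectors are hypothetical and are not taken from the FAST implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch: random-hyperplane LSH feeding a flat bucket table.
// This is an assumed stand-in, not the hashing scheme used by FAST itself.
class CorrelationIndex {
    // One random hyperplane per signature bit (at most 31 bits fit in an int).
    private final double[][] hyperplanes;
    // Flat address space: signature -> ids of files that hashed to it.
    private final Map<Integer, List<String>> buckets = new HashMap<>();

    CorrelationIndex(int dimensions, int bits, long seed) {
        Random rng = new Random(seed);
        hyperplanes = new double[bits][dimensions];
        for (double[] plane : hyperplanes) {
            for (int d = 0; d < dimensions; d++) {
                plane[d] = rng.nextGaussian();
            }
        }
    }

    // Bit i is set when the vector lies on the non-negative side of hyperplane i,
    // so vectors pointing in similar directions tend to share a signature.
    int signature(double[] features) {
        int sig = 0;
        for (int i = 0; i < hyperplanes.length; i++) {
            double dot = 0.0;
            for (int d = 0; d < features.length; d++) {
                dot += hyperplanes[i][d] * features[d];
            }
            if (dot >= 0) {
                sig |= 1 << i;
            }
        }
        return sig;
    }

    // Store a file id under the bucket addressed by its signature.
    void insert(String fileId, double[] features) {
        buckets.computeIfAbsent(signature(features), k -> new ArrayList<>()).add(fileId);
    }

    // A query touches one bucket only, which narrows the data to be processed.
    List<String> candidates(double[] query) {
        return buckets.getOrDefault(signature(query), Collections.emptyList());
    }
}
```

A query vector is hashed once and resolved with a single bucket lookup, so latency does not depend on traversing a hierarchy. The cost is approximation: correlated files that fall on opposite sides of some hyperplane land in different buckets and are missed, which mirrors the small accuracy loss traded for speed.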
ADVANTAGES OF PROPOSED SYSTEM:
- Space-efficient summarization
- Energy efficiency via hashing
- Semantic-aware namespace
- Real system implementation
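As an illustration of what space-efficient summarization can look like, the sketch below uses a Bloom filter, a common compact membership summary. Treating the summary as a Bloom filter is an assumption made for this example; the list above does not name the structure. The class name FileSummary, the sizes, and the seeded FNV-1a-style hash are hypothetical choices.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Hypothetical sketch of a Bloom filter used as a compact summary of a set of keys.
// It can report false positives but never false negatives.
class FileSummary {
    private final BitSet bits;
    private final int size;       // number of bits in the summary
    private final int hashCount;  // number of bit positions set per key

    FileSummary(int size, int hashCount) {
        this.size = size;
        this.hashCount = hashCount;
        this.bits = new BitSet(size);
    }

    // FNV-1a-style hash with a caller-supplied starting value.
    private static int fnv1a(byte[] data, int seed) {
        int hash = seed;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= 16777619;
        }
        return hash;
    }

    // Double hashing: the i-th position is (h1 + i * h2) mod size.
    private int position(byte[] data, int i) {
        int h1 = fnv1a(data, 0x811C9DC5);
        int h2 = fnv1a(data, 0x01000193) | 1; // force odd so the step is never zero
        return Math.floorMod(h1 + i * h2, size);
    }

    // Record a key by setting hashCount bits.
    void add(String key) {
        byte[] data = key.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(data, i));
        }
    }

    // True means "possibly present"; false means "definitely absent".
    boolean mightContain(String key) {
        byte[] data = key.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(data, i))) {
                return false;
            }
        }
        return true;
    }
}
```

The summary occupies a fixed number of bits regardless of how long the keys are, which is the source of the space saving. The false-positive rate grows as more keys are added, so the bit count and hash count would need to be sized for the expected number of entries.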
SYSTEM ARCHITECTURE:
SYSTEM REQUIREMENTS:
HARDWARE REQUIREMENTS:
- System : Pentium Dual Core
- Hard Disk : 120 GB
- Monitor : 15" LED
- Input Devices : Keyboard, Mouse
- RAM : 1 GB
SOFTWARE REQUIREMENTS:
- Operating System : Windows 7
- Coding Language : Java/J2EE
- Tool : NetBeans 7.2.1
- Database : MySQL
REFERENCE:
Yu Hua, Senior Member, IEEE, Hong Jiang, Fellow, IEEE, and Dan Feng, Member, IEEE, “Real-Time Semantic Search Using Approximate Methodology for Large-Scale Storage Systems”, IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 4, April 2016.