Identifying and managing the patients most
at risk within the health care system is vital for governments, hospitals, and
health insurers, but each uses different metrics for identifying the patients
they perceive to be most at risk. Health insurers are mostly concerned with
insurance risk, because they agree to reimburse health-related services in
exchange for a fixed monthly premium. A poor risk measure could result in
exceeding a financial budget. It is often said that 90% of the world’s data was
generated in the last few years. Due to the advent of new technologies, devices,
and communication means such as social networking sites, the amount of data
produced by mankind is growing rapidly every year. The amount of data produced
from the beginning of time until 2003 was 5 billion gigabytes; piled up in the
form of disks, it could fill an entire football field. The same amount was
created every two days in 2011, and every ten minutes in 2013, and this rate is
still growing enormously. Although all of this information is meaningful and can
be useful when processed, much of it is neglected. Big Data is a collection of
large datasets that cannot be processed using traditional computing techniques.
It is not a single technique or tool; rather, it involves many areas of business
and technology.
Keywords: health care, Big Data, patient, insurance claims, predicting days in
hospital, database analysis.
Healthcare administrators worldwide are striving to lower the cost of care whilst
improving the quality of care given. Hospitalization is the largest component
of health expenditure. Therefore, earlier identification of those at higher
risk of being hospitalized would help healthcare administrators and health
insurers to develop better plans and strategies. In this paper, a method was
developed, using large-scale health insurance claims data, to predict the
number of hospitalization days in a population. We utilized a regression
decision tree algorithm, along with insurance claim data from 242,075
individuals over three years, to provide predictions of the number of days in
hospital in the third year, based on hospital admissions and procedure claims
data from the first two years. The proposed method performs well in the general
population as well as in sub-
populations. Results indicate that the proposed
model significantly improves predictions over two established baseline methods
(predicting a constant number of days for each customer and using the number of
days in hospital of the previous year as the forecast for the following year).
A reasonable predictive accuracy (AUC = 0.843) was achieved for the whole
population. Analysis of two sub-populations – namely, elderly persons aged
63 years or older in 2011, and patients hospitalized for at least one day in the
previous year – revealed that the medical information contributed more to
predictions in these two sub-populations than in the population as a whole.
In 2009 and 2010, hospitals comprised by far
the largest component of health expenditure in Australia, consuming 40% of
regular health spending. Furthermore, the Australian Productivity Commission,
an independent research and advisory body of the Australian Government, pointed
out (in their report on government services in 2013) that around AUD $3 billion
(Australian dollars) was spent on unnecessary public hospital
admissions annually. Earlier identification of those at risk would also help
reduce unnecessary hospitalizations and potentially save taxpayers billions of
dollars every year. From various perspectives, better prediction of
hospitalizations will enable earlier intervention, reducing costs and improving
quality of life.
The aim of this paper is to develop a model that predicts the total number of days
spent in hospital during a calendar year for individuals from a general
population, using large scale health insurance claims data. Since insurance
claims have strong socioeconomic characteristics, their power in predicting
clinical targets, such as hospitalizations, is seldom investigated. Relevant
recent work has been performed by the authors in a data mining competition
called the ‘Heritage Health Prize (HHP)’. This competition also aimed to
predict the number of days in hospital for a calendar year. It focused on
reducing the prediction error, calculated with a given equation, by optimizing
the predictive power of various sophisticated algorithms. A final ranking of
40th out of 1300+ teams was achieved by the authors, and – in terms of the
measure of prediction quality – the gap between their performance (0.467) and
the best score (0.461) was very small in absolute terms. For confidentiality
reasons, the data set was strongly pseudonymized and simplified, with
considerable detail missing.
In this paper, we present a method, developed using large-scale health insurance
claims data, to predict the number of hospitalization days in a population. The
proposed method performs well in the general population as well as in
sub-populations, and significantly improves predictions over two established
baseline methods. Fig. 1 describes the demographic statistics of age and gender
in the year 2015.
Fig. 1. Sample demographic statistics of age and gender in year 2015
A) Patient Registration Details
In this module we design the windows for the project; these windows are used to
send messages from one peer to another. We use the Swing package available in
Java to design the user interface. Swing is a widget toolkit for Java; it is
part of Sun Microsystems’ Java Foundation Classes (JFC), an API for providing a
graphical user interface for Java programs. In this module we focus mainly on
the design of the login page. To view the application, users need to log in
through the user interface; the GUI is the medium connecting the user and the
database. On the login screen, the user inputs his/her username and password,
which are checked against the database; if the username and password are valid,
the user can access the application.
B) Admin maintaining the records
This is the second module of our project. With the advent of web applications,
the admin alone is responsible for securely maintaining the storage and for each
request and response service. Big Data support includes unlimited support and
upgrades, and the system is fully managed by the admin. Whenever a request is
received from a client, this module alone responds to it. Our database-backed
EHR system maintains details for different departments, such as surgery and
nursing, which communicate as parallel processes in the environment.
C) Data Migration Module With Sqoop
The traditional application management system, that is, the interaction of
applications with a relational database using an RDBMS, is one of the sources that
generate Big Data. Such Big Data, generated by RDBMS, is stored in Relational
Database Servers in the relational database structure. When Big Data storages
and analyzers such as MapReduce, Hive, HBase, Cassandra, Pig, etc. of the
Hadoop ecosystem came into picture, they required a tool to interact with the
relational database servers for importing and exporting the Big Data residing
in them. Here, Sqoop occupies a place
in the Hadoop ecosystem to provide feasible interaction between relational
database server and Hadoop’s HDFS. Sqoop: “SQL to Hadoop and Hadoop to SQL”
Sqoop is a tool designed to transfer data between Hadoop and relational
database servers. It is used to import data from relational databases such as
MySQL, Oracle to Hadoop HDFS, and export from Hadoop file system to relational
databases. It is provided by the Apache Software Foundation. Now that the
dataset is ready, our aim is to transfer it into Hadoop (HDFS); that transfer
happens in this module. Sqoop is a command-line interface application for
transferring data between relational databases and Hadoop. In this module we
fetch the dataset into Hadoop (HDFS) using the Sqoop tool. Sqoop supports many
such functions; for example, we can fetch a particular column, or fetch the
dataset subject to a specific condition, and the data will be stored in
Hadoop (HDFS).
i) Sqoop Import:
The import tool imports individual tables from an RDBMS to HDFS. Each row in a table
is treated as a record in HDFS. All records are stored as text data in text
files or as binary data in Avro and Sequence files.
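As an illustrative sketch (the JDBC URL, credentials, table name, columns, and
target directory below are hypothetical placeholders, not values from this
project), an import of a claims table into HDFS might look like:

```shell
# Import a hypothetical 'claims' table from MySQL into HDFS.
# Connection string, table, columns, and paths are illustrative only.
sqoop import \
  --connect jdbc:mysql://dbserver/healthcare \
  --username hadoop -P \
  --table claims \
  --columns "patient_id,admission_date,days_in_hospital" \
  --where "claim_year >= 2010" \
  --target-dir /user/hadoop/claims \
  -m 1
```

The --columns and --where flags correspond to the column selection and
conditional fetches described above; -m 1 runs a single map task, which is
appropriate for a small table.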
ii) Sqoop Export:
The export tool exports a set of files from HDFS back to an RDBMS. The files
given as input to Sqoop contain records, which are called rows in the table.
These are read and parsed into a set of records and delimited with a
user-specified delimiter.
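A corresponding export sketch, again with hypothetical database, table, and
path names, could be:

```shell
# Export analysis results from HDFS back into a hypothetical
# 'hospital_predictions' table; all names are illustrative only.
sqoop export \
  --connect jdbc:mysql://dbserver/healthcare \
  --username hadoop -P \
  --table hospital_predictions \
  --export-dir /user/hadoop/predictions \
  --input-fields-terminated-by ',' \
  -m 1
```

Here --input-fields-terminated-by supplies the user-specified delimiter
mentioned above.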
D) Data Analytic Module With Hive
Hive is a
data warehouse infrastructure tool to process structured data in Hadoop. It
resides on top of Hadoop to summarize Big Data, and makes querying and
analyzing easy. Initially Hive was developed by Facebook, later the Apache
Software Foundation took it up and developed it further as an open source under
the name Apache Hive. It is used by different companies. For example, Amazon
uses it in Amazon Elastic MapReduce. Hive is not a relational database, a design
for OnLine Transaction Processing (OLTP), or a language for real-time queries
and row-level updates.
i) Features of Hive:
1. It stores schema in a database and processed data into HDFS.
2. It is designed for OLAP.
3. It provides an SQL-type language for querying, called HiveQL or HQL.
4. It is familiar, fast, scalable, and extensible.
Hive is a data warehouse system for Hadoop. It runs SQL-like queries, written
in HQL (Hive Query Language), which are internally converted to MapReduce jobs.
Hive was developed by Facebook. Hive supports Data Definition Language (DDL),
Data Manipulation Language (DML), and user-defined functions. In this module we
analyze the dataset stored in Hadoop (HDFS) using the Hive tool; Hive analyzes
the dataset using the HQL language. Using Hive, we perform table creation,
joins, partitioning, and bucketing. Hive analyzes structured data only.
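As an illustrative sketch (the table name, columns, and HDFS location are
hypothetical, not the actual claims schema), a HiveQL aggregation over the
imported data might be run from the command line as:

```shell
# Define an external table over the imported claims files and
# aggregate days in hospital per patient; names are illustrative.
hive -e "
  CREATE EXTERNAL TABLE IF NOT EXISTS claims (
    patient_id INT,
    admission_date STRING,
    days_in_hospital INT
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/user/hadoop/claims';

  SELECT patient_id, SUM(days_in_hospital) AS total_days
  FROM claims
  GROUP BY patient_id;
"
```

An external table leaves the files imported by Sqoop in place, so Hive queries
them without copying the data.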
E) Data Analytic Module With Pig
Pig is an abstraction over MapReduce. It is a tool/platform which is used to
analyze larger sets of data representing them as data flows. Pig is generally
used with Hadoop; we can perform all the data manipulation operations in Hadoop
using Apache Pig. To write data analysis programs, Pig provides a high-level
language known as Pig Latin. This language provides various operators using
which programmers can develop their own functions for reading, writing, and
processing data. To analyze data using Apache Pig, programmers need to write
scripts using Pig Latin language. All these scripts are internally converted to
Map and Reduce tasks. Apache Pig has a component known as Pig Engine that
accepts the Pig Latin scripts as input and converts those scripts into
MapReduce jobs. Using Pig Latin, programmers can perform MapReduce tasks easily
without having to type complex codes in Java.
Apache Pig uses a multi-query approach, thereby reducing the length of
code. For example, an operation that would require you to type 200 lines of
code (LoC) in Java can easily be done by typing as few as 10 LoC in
Apache Pig; ultimately, Apache Pig reduces the development time by almost 16
times. Pig Latin is an SQL-like language, and it is easy to learn Apache Pig
when you are familiar with SQL. Apache Pig provides many built-in operators to
support data operations like joins, filters, ordering, etc. In addition, it
also provides nested data types like tuples, bags, and maps that are missing
from MapReduce. Apache Pig is a high-level data flow platform for executing
MapReduce programs on Hadoop; the language for Pig is Pig Latin. Pig handles
both structured and unstructured data, and it runs on top of the MapReduce
process in the background. This module is likewise used for analyzing the
dataset through Pig, using the Pig Latin data flow language; here we apply
operators, functions, and joins to the data and observe the results.
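A short Pig Latin sketch in the same spirit as the Hive analysis (the input
path and field names are hypothetical placeholders):

```shell
# Write a small Pig Latin script and run it in local mode;
# the input path and schema are illustrative only.
cat > claims_summary.pig <<'EOF'
claims  = LOAD '/user/hadoop/claims' USING PigStorage(',')
          AS (patient_id:int, admission_date:chararray, days:int);
grouped = GROUP claims BY patient_id;
totals  = FOREACH grouped GENERATE group AS patient_id,
          SUM(claims.days) AS total_days;
DUMP totals;
EOF
pig -x local claims_summary.pig
```

Pig Engine converts the LOAD/GROUP/FOREACH operators above into MapReduce jobs
when run against a cluster; -x local executes the same script on one machine.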
MapReduce is a processing technique and a program model for distributed
computing based on Java. The MapReduce algorithm contains two important tasks,
namely Map and Reduce. Map takes a set of data and converts it into another set
of data, where individual elements are broken down into tuples (key/value
pairs). Second, the reduce task takes the output from a map as input and
combines those data tuples into a smaller set of tuples. As the name MapReduce
implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over
multiple computing nodes. Under the MapReduce model, the data processing
primitives are called mappers and reducers. Decomposing a data processing
application into mappers and reducers is sometimes nontrivial. But, once we
write an application in the MapReduce form, scaling the application to run over
hundreds, thousands, or even tens of thousands of machines in a cluster is
merely a configuration change. This simple scalability is what has attracted
many programmers to use the MapReduce model.
Generally, the MapReduce paradigm is based on sending the computation to where
the data resides. A MapReduce program executes in three stages, namely the map
stage, the shuffle stage, and the reduce stage.
i) Map stage:
The map or mapper’s job is to process the input
data. Generally the input data is in the form of file or directory and is
stored in the Hadoop file system (HDFS). The input file is passed to the mapper
function line by line. The mapper processes the data and creates several small
chunks of data.
ii) Reduce stage:
This stage is the combination of the Shuffle stage and the Reduce stage. The
Reducer’s job is to process the data that comes from the mapper. After
processing, it produces a new set of output, which will be stored in the HDFS.
Fig. 3 Map Reduce Data Flow
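The map → shuffle → reduce sequence can be mimicked on a single machine with a
classic Unix pipeline. This is only an analogy, not a Hadoop job: tr plays the
mapper (emitting one key per line), sort plays the shuffle (grouping identical
keys together), and uniq -c plays the reducer (aggregating a count per key):

```shell
# Word count as a MapReduce analogy on one machine:
#   map:     tr splits each input line into one word per line
#   shuffle: sort brings identical words together
#   reduce:  uniq -c emits a (count, word) pair per distinct word
printf 'hadoop stores data\nhive queries data\n' \
  | tr ' ' '\n' \
  | sort \
  | uniq -c
# 'data' appears twice; every other word appears once
```

In Hadoop, the same three roles are distributed across the cluster, with the
framework performing the sort/shuffle between the map and reduce tasks.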
The performance was measured on four
different sub-populations. Group 1 included the whole population. Group 2
included customers born on or after the year 1948, while Group 3 was composed
of customers born before the year 1948. These sub-groupings were chosen since
the average number of days in hospital (DIH) increased substantially between the
ages of 61 and 65 years, as Fig. 2 indicates, and the median age of 63 years in
the year 2011 was taken, corresponding to a birth year of 1948 in this data set.
Fig. 2 Average days in hospital
per person by age for each of the three years of HCF data
In the whole population, 85% of
customers were born on or after 1948 and 15% of customers were born before
1948. In addition, the model was evaluated on a sub-population (Group 4) in
which customers had at least one day (1+ days) in hospital in the year before
the prediction year (2012). It needs to be pointed out that, different from the
other three groups, the 1+ days group was the only group in which the cohorts of
customers used for training and prediction were different. In this group,
training was performed using those customers who had at least one day in
hospital in 2010, and prediction was performed on customers who had at least one
day in hospital in 2011. These two subsets of customers were not the same
cohort, but there was overlap. In the prediction data set of the 1+ days group,
70% of customers were born on or after 1948 and 30% of customers were born
before 1948.
A method for predicting future
days in hospital has been developed using features extracted from customer
demographics, past hospital admission and hospital procedure claim data. The
model was developed using data from an observation period of two years and was
later evaluated on data from the following year. Beyond yearly forecasts, we are
also interested in forecasting days in hospital on shorter time scales, such as
by season or month, by using Big Data.
The accuracy of forecasting would greatly depend on the density of information
available. If the claim information gets too sparse, the prediction accuracy is
expected to decrease when doing shorter time scale forecasting, such as weekly
predictions. It would also be interesting to know what time resolution of
forecasting can be supported by claim data sets. From this point of view,
telehealth data, such as physiological monitoring or self-reported data on a
weekly or daily basis, would significantly increase the temporal data density.
REFERENCES
[1] Y. Xie, G. Schreier, D. C. W. Chang, S. Neubauer, Y. Liu, S. J. Redmond, and N. H. Lovell, “Predicting days in hospital using health insurance claims,” 2015.
[2] J. Donze, D. Aujesky, D. Williams, and J. L. Schnipper, “Potentially avoidable 30-day hospital readmissions in medical patients: Derivation and validation of a prediction model,” JAMA Internal Medicine, vol. 173, pp. 632–638, 2013.
[3] O. Hasan, D. O. Meltzer, S. A. Shaykevich, C. M. Bell, P. J. Kaboli, A. D. Auerbach, T. B. Wetterneck, V. M. Arora, J. Zhang, and J. L. Schnipper, “Hospital readmission in general medicine patients: A prediction model,” Journal of General Internal Medicine, vol. 25, pp. 211–219, 2010.
[4] E. Coiera, Y. Wang, F. Magrabi, O. P. Concha, B. Gallego, and W. Runciman, “Predicting the cumulative risk of death during hospitalization by modeling weekend, weekday and diurnal mortality risks,” BMC Health Services Research, vol. 14, 2014.
[5] R. B. Cumming, D. Knutson, B. A. Cameron, and B. Derrick, “A comparative analysis of claims-based methods of health risk assessment for commercial populations,” Final Report to the Society of Actuaries, 2002.
[6] Y. Zhao, A. S. Ash, R. P. Ellis, J. Z. Ayanian, G. C. Pope, B. Bowen, and L. Weyuker, “Predicting pharmacy costs and other medical costs using diagnoses and drug claims,” Medical Care, vol. 43, pp. 34–43, 2005.
[7] D. Bertsimas, M. V. Bjarnadottir, M. A. Kane, J. C. Kryder, R. Pandey, S. Vempala, and G. Wang, “Algorithmic prediction of health-care costs,” Operations Research, vol. 56, pp. 1382–1392, 2008.
[8] K. Pietz, C. M. Ashton, M. McDonell, and N. P. Wray, “Predicting healthcare costs in a population of veterans affairs beneficiaries using diagnosis-based risk adjustment and self-reported health status,” Medical Care, vol. 42, pp. 1027–1035, 2004.
[9] C. A. Powers, C. M. Meyer, M. C. Roebuck, and B. Vaziri, “Predictive modeling of total healthcare costs using pharmacy claims data – A comparison of alternative econometric cost modeling techniques,” Medical Care, vol. 43, no. 11, pp. 1065–1072, 2005.
[10] (2014) How much do we spend on health? Australian Institute of Health and Welfare, Australian Government.
[11] H. G. Dove, I. Duncan, and A. Robb, “A prediction model for targeting low-cost, high-risk members of managed care organizations,” American Journal of Managed Care, vol. 9, no. 5, pp. 381–389, 2003.
[12] B. Fireman, J. Bartlett, and J. Selby, “Can disease management reduce health care costs by improving quality?” Health Affairs, vol. 23, no. 6, pp. 63–75, 2004.
[13] E. Seto, “Cost comparison between telemonitoring and usual care of heart failure: A systematic review,” Telemedicine Journal and E-health, vol. 14, no. 7, pp. 679–686, 2008.
[14] J. Polisena, D. Coyle, K. Coyle, and S. McGill, “Home telehealth for chronic disease management: A systematic review and an analysis of economic evaluations,” International Journal of Technology Assessment in Health Care, vol. 25, pp. 339–349, 2009.
[15] “Report on government services 2013, volume 2: Health; community services; housing and homelessness,” Steering Committee for the Review of Government Service Provision, Canberra: Productivity Commission, 2013.
[16] P. Brierley, D. Vogel, and R. Axelrod. (2014) Heritage Provider Network health prize round 1 milestone prize: How we did it – Team ‘Market Makers’. [Online].
[17] (2014) Heritage Provider Network health prize private leaderboard – Heritage Health Prize. [Online].
[18] “The international statistical classification of diseases and related health problems, 10th revision, Australian modification (ICD-10-AM),” National Centre for Classification in Health, 1998.
[19] “The Australian classification of health interventions (ACHI), seventh edition – Tabular list of interventions and alphabetic index of interventions,” National Centre for Classification in Health (NCCH), 2010.
[20] (2013) Australian refined diagnosis-related groups (AR-DRG) data cubes. Australian Institute of Health and Welfare, Australian Government. [Online].
[21] (2014, Sep.) Medicare benefits schedule (MBS) online. Department of Health, Australian Government. [Online]. Available: http://www.mbsonline.gov.au/
[22] V. Sundararajan, T. Henderson, C. Perry, A. Muggivan, H. Quan, and W. A. Ghali, “New ICD-10 version of the Charlson comorbidity index predicted in-hospital mortality,” Journal of Clinical Epidemiology, vol. 57, no. 12, pp. 1288–1294, 2004.
[23] H. Quan, B. Li, C. M. Couris, K. Fushimi, P. Graham, P. Hider, J.-M. Januel, and V. Sundararajan, “Updating and validating the Charlson comorbidity index and score for risk adjustment in hospital discharge abstracts using data from 6 countries,” American Journal of Epidemiology, vol. 173, no. 6, pp. 676–682, 2011.
[24] (2010) International statistical classification of diseases and related health problems, 10th revision. World Health Organization. [Online].