Tuesday, April 28 2015
10:45 a.m. – 11:30 a.m.
Our information environment is rapidly changing. With the collection of large-scale datasets, the tools and methods for working with large-scale data are changing as well. While older technologies can be adapted for some purposes, new tools such as NoSQL databases, the Hadoop processing environment, and programming languages such as Pig are becoming important tools for the data and information analyst. In this session, learn what all the terminology means and what tools to use to begin to develop your Big Data and analytics environment.
Dr. Frank Cervone – A201_Cervone.pdf (2 MB) – Username/Password — CIL2015
Me: My intent in attending this workshop is to learn the jargon and terminology used in Big Data. I support researchers and scientists creating tools for analysis and visualization of disparate Big Data sources. I really need tools for collection development in this area.
Will geek out this morning.
Confusion in Big Data and Analytics.
Big Data – incomprehensible amounts of data.
No consistency in definition.
New kinds of data and analysis – the best definition of Big Data.
Neither snake oil nor silver bullet.
Big Data/Analytics tells us what is happening not why.
Can find hidden correlations. Visualizations are valuable; Google Flu Trends is a good example. Does not say why!!
Apache Hadoop ecosystem – open-source tools working together to process big data. Not a traditional relational database system.
Querying Big Data in Distributed File System (HDFS).
Hadoop Distributed File System – a main (name) node distributes data blocks among the cluster nodes that are available. Creates an index of where data is sent.
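The block-placement idea above can be sketched in a few lines. This is a toy illustration, not the real HDFS NameNode logic: the names, block size, and round-robin placement here are assumptions for demonstration (real HDFS uses 128 MB blocks and rack-aware placement).

```python
import itertools

BLOCK_SIZE = 8          # bytes per block for the toy example (HDFS default is 128 MB)
REPLICATION = 2         # copies kept of each block

def place_blocks(data: bytes, nodes: list[str]) -> dict[int, list[str]]:
    """Toy 'name node': split data into blocks, assign each block to
    REPLICATION nodes round-robin, and return the placement index."""
    index = {}
    node_cycle = itertools.cycle(nodes)
    num_blocks = (len(data) + BLOCK_SIZE - 1) // BLOCK_SIZE
    for block_id in range(num_blocks):
        index[block_id] = [next(node_cycle) for _ in range(REPLICATION)]
    return index

index = place_blocks(b"some data to distribute!", ["node1", "node2", "node3"])
print(index)   # {0: ['node1', 'node2'], 1: ['node3', 'node1'], 2: ['node2', 'node3']}
```

The index is what lets the system route computation to where the data already lives, rather than moving the data.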
ZooKeeper service – coordinates the work within the cluster. More than one master server. Redundancy.
Core of big data analysis – YARN / MapReduce:
- splits the input
- process local data and create key-value pairs.
- organizes and merges the data via reducers.
- Map => Combine (data on individual cluster) => Shuffle (combines results from clusters) => Reduce => Output
- iterative and segmented.
Components of Hadoop environment:
- Flume – acts as an agent (consume – hold – deliver) – real-time unstructured data
- Sqoop – collects data in a batch process. HBase – for structured data. NoSQL – for unstructured data
- HBase – non-relational (NoSQL), column-oriented database component
- Pig Latin – data-flow language for unstructured data – used for loading data into MapReduce
- Oozie – decision making for routing to different types of analysis
- Mahout – machine learning process – math library for statistical uses
- Ambari – control panel for the Hadoop environment.
- RapidMiner – open source
- KNIME – open source
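Flume's consume–hold–deliver agent pattern from the list above can be sketched as a small buffering loop. This is a hedged illustration of the pattern only; the class and method names are invented for the example and are not the Flume API.

```python
from collections import deque

class Agent:
    """Toy Flume-style agent: a source consumes events, a channel holds
    them, and a sink delivers them downstream in batches."""

    def __init__(self, batch_size: int):
        self.channel = deque()        # "hold": buffered events
        self.batch_size = batch_size
        self.delivered = []           # stand-in for a downstream sink

    def consume(self, event: str) -> None:
        """Source side: take in a real-time event and buffer it."""
        self.channel.append(event)
        if len(self.channel) >= self.batch_size:
            self.deliver()

    def deliver(self) -> None:
        """Sink side: flush all buffered events downstream as one batch."""
        batch = [self.channel.popleft() for _ in range(len(self.channel))]
        self.delivered.append(batch)

agent = Agent(batch_size=3)
for event in ["log1", "log2", "log3", "log4"]:
    agent.consume(event)
agent.deliver()   # flush the remaining partial batch
print(agent.delivered)   # [['log1', 'log2', 'log3'], ['log4']]
```

The buffer decouples fast, bursty producers (log sources) from slower downstream consumers (HDFS writes), which is the point of the agent's "hold" stage.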
Sources for big data researchers and developers
- ACM and IEEE – already use those. Looking for more focused stuff.
- Frustrated with the answer whenever I ask this question.