A201: Analytics & Big Data: Terms & Tools for Info Pros #CILDC

Tuesday, April 28 2015
10:45 a.m. – 11:30 a.m.

Frank Cervone, Director of Information Technology and CISO, School of Public Health, University of Illinois at Chicago; Lecturer, School of Information, San Jose State University

Our information environment is rapidly changing. With the collection of large-scale datasets, the tools and methods related to large scale data are changing as well. While older technologies can be adapted for some purposes, new tools such as NoSQL databases, the Hadoop processing environment, and programming languages such as Pig are becoming important tools for the data and information analyst. In this session, learn what all the terminology means and what tools to use to begin to develop your Big Data and analytics environment.

Dr. Frank Cervone – A201_Cervone.pdf (2 MB) – Username/Password — CIL2015

Me: My intent in attending this workshop is to learn the jargon and terminology used in Big Data. I support researchers and scientists creating tools for analysis and visualization of disparate sources of Big Data. I really need tools for collection development in this area.


Will geek out this morning.

Confusion in Big Data and Analytics.

Big Data – incomprehensible amounts of data.

No consistency in definition.

New kinds of data and analysis – the best definition of Big Data.

Neither snake oil nor silver bullet.

Big Data/Analytics tells us what is happening not why.

Can find hidden correlations. Visualizations are valuable; Google Flu Trends is a good example. Does not say why!!

Apache Hadoop ecosystem – open-source projects working together to process big data. Not a traditional relational database system.

Querying Big Data in the Hadoop Distributed File System (HDFS).

Hadoop Distributed File System – a main (name) node distributes data among the nodes that are available and creates an index of where data is sent.
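The "distribute blocks, keep an index" idea above can be sketched in a few lines of Python. This is a hypothetical toy, not the real HDFS API: the block size, node names, and round-robin placement are all stand-ins for illustration.

```python
# Toy sketch of a name node (hypothetical, NOT the real HDFS API):
# split a file into fixed-size blocks, assign each block to a data
# node round-robin, and keep an index of where each block went.

BLOCK_SIZE = 4  # bytes per block (tiny, for illustration only)
DATA_NODES = ["node-1", "node-2", "node-3"]

def place_blocks(data: bytes):
    """Return an index mapping block number -> (node, block bytes)."""
    index = {}
    for i in range(0, len(data), BLOCK_SIZE):
        block_no = i // BLOCK_SIZE
        node = DATA_NODES[block_no % len(DATA_NODES)]  # round-robin placement
        index[block_no] = (node, data[i:i + BLOCK_SIZE])
    return index

index = place_blocks(b"hello big data world")
for block_no, (node, chunk) in index.items():
    print(block_no, node, chunk)
```

To read the file back, a client would consult this index and fetch each block from the node that holds it; real HDFS also replicates each block to several nodes for fault tolerance.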

ZooKeeper service – coordinates the work within the cluster. More than one master server, for redundancy.

Core of big data analysis – YARN / MapReduce:

  • splits the input
  • processes local data and creates key-value pairs
  • organizes and merges the data via reducers
  • Map => Combine (data on individual cluster node) => Shuffle (combines results from nodes) => Reduce => Output
  • iterative and segmented
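The Map => Combine => Shuffle => Reduce pipeline above can be sketched as a word count in plain Python. No Hadoop required; the splits, mappers, and reducers here are hypothetical stand-ins for what the framework does across a cluster.

```python
# Word-count sketch of the MapReduce pipeline: map emits key-value
# pairs, combine pre-aggregates locally, shuffle groups by key across
# nodes, reduce merges the partial results.
from collections import Counter, defaultdict

def map_phase(split: str):
    """Map: emit a (word, 1) key-value pair for each word in a split."""
    return [(word, 1) for word in split.lower().split()]

def combine(pairs):
    """Combine: pre-aggregate counts locally on one cluster node."""
    return Counter(word for word, _ in pairs)

def shuffle(combined_results):
    """Shuffle: group partial counts from all nodes by key."""
    grouped = defaultdict(list)
    for counter in combined_results:
        for word, count in counter.items():
            grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    """Reduce: merge the partial counts into final totals."""
    return {word: sum(counts) for word, counts in grouped.items()}

splits = ["big data is big", "data about data"]   # split the input
mapped = [map_phase(s) for s in splits]           # map on each split
combined = [combine(pairs) for pairs in mapped]   # combine locally
counts = reduce_phase(shuffle(combined))          # shuffle + reduce
print(counts)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```

The point of the combine step is network efficiency: each node ships a few aggregated counts across the cluster instead of one pair per word occurrence.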

Components of Hadoop environment:

  • Flume – acts as an agent (consume – hold – deliver) for real-time unstructured data
  • Sqoop – collects data in a batch process. HBase – for structured data. NoSQL – for unstructured data
  • Hive – SQL-like query layer (HiveQL) over data in Hadoop
  • HBase – non-relational (NoSQL), column-oriented database component
  • Pig Latin – data language for unstructured data, used for loading into MapReduce
  • Oozie – decision making for routing to different types of analysis
  • Mahout – machine learning; math library for statistical uses
  • Ambari – control panel for the Hadoop environment
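Flume's "consume – hold – deliver" agent pattern from the list above can be sketched with a plain queue as the hold buffer. The class and names here are hypothetical illustrations; real Flume agents are configured from sources, channels, and sinks rather than coded this way.

```python
# Toy sketch of a Flume-style agent: consume events from a source,
# hold them in a buffer (the "channel"), deliver them in batches to a
# sink. Hypothetical illustration only, not the Flume API.
from collections import deque

class Agent:
    def __init__(self, batch_size=3):
        self.buffer = deque()        # hold: events wait here until delivery
        self.batch_size = batch_size

    def consume(self, event):
        """Consume: take in one event from a source (e.g. a log line)."""
        self.buffer.append(event)

    def deliver(self):
        """Deliver: hand off up to batch_size buffered events to a sink."""
        batch = []
        while self.buffer and len(batch) < self.batch_size:
            batch.append(self.buffer.popleft())
        return batch

agent = Agent()
for line in ["evt1", "evt2", "evt3", "evt4"]:
    agent.consume(line)
print(agent.deliver())  # first batch of three events
print(agent.deliver())  # the remaining event
```

The buffer is what makes the pattern useful for real-time data: the source can keep producing even when the sink is temporarily slow or unavailable.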

Analytics software:

  • Tableau
  • RapidMiner – open source
  • KNIME – open source

Sources for big data researchers and developers:

  • ACM and IEEE – already use those. Looking for more focused stuff.
  • Frustrated with the answer whenever I ask this question.