Posts

Different types of clustering for textual documents

Clustering Model 1: Nearest Neighbour Algorithms: KD-Trees and Locality Sensitive Hashing
Core machine learning concepts: distance metrics and approximation algorithms.
Problem domain: finding similar documents.
Nearest neighbour: compute the distance between the query document and every other document, then retrieve the closest one.
Critical components: document representation and computing distances.
Scaling to large datasets: KD-trees are not appropriate for very high-dimensional documents, so locality sensitive hashing is used for approximate nearest neighbour search.
Clustering Model 2: Capturing Uncertainty in Clustering: Mixture Models
Any document can be related to topic t_i by x%.
Can learn a user's topic preferences.
Soft assignments to clusters are made with Expectation Maximization.
Clustering Model 3: Latent Dirichlet Allocation
Mixed membership.
Probability of words in the vocabulary.
Unsupervised learning task.
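The brute-force nearest neighbour search described under Clustering Model 1 can be sketched as follows. This is a minimal illustration, not the method from the original post: the bag-of-words representation, the cosine distance, the function names and the sample documents are all assumptions made for the example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a document as lowercase word counts (assumed representation)."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def nearest_neighbour(query, documents):
    """Compare the query with every document and return (index, distance) of the closest."""
    query_vec = bag_of_words(query)
    distances = [cosine_distance(query_vec, bag_of_words(d)) for d in documents]
    best = min(range(len(documents)), key=distances.__getitem__)
    return best, distances[best]


# Hypothetical sample corpus, for illustration only.
documents = [
    "clustering groups similar documents together",
    "the football match ended in a draw",
    "mixture models give soft cluster assignments",
]
index, distance = nearest_neighbour(
    "soft assignments to clusters with mixture models", documents
)
print(index, round(distance, 3))
```

Every query is compared against every document, so the cost grows linearly with the size of the collection. That is the scaling problem KD-trees and locality sensitive hashing are meant to address.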

Schedule for Machine Learning

Books to be followed: Hands-On Machine Learning with Scikit-Learn and TensorFlow
1.4.2018 Introduction to machine learning
2.4.2018 Introduction to statistics and probability
3.4.2018 Understanding basics of Python and the modules required
4.4.2018 Regression (linear and logistic)
5.4.2018 Implementing regression techniques
6.4.2018 Other regression techniques and differences
7.4.2018 Implementing other techniques
8.4.2018 Classification (Decision Tree, Naïve Bayes and SVM)
9.4.2018 Implementing
10.4.2018 Other classification techniques
11.4.2018 Implementing
12.4.2018 Discussions with doubts
13.4.2018 Clustering techniques (k-means and hierarchical)
14.4.2018 Implementing
15.4.2018 Other cl

Significance of Women in Data Science

Research scientists, regardless of gender, are doing great work in this field. One sign of this is the Global Women in Data Science conference, hosted by Stanford University and livestreamed to 75+ locations all over the world on 3rd February 2017. Many talented academic researchers and practitioners connected there to turn their dreams and goals into reality. It was sponsored by Microsoft, Google, WalmartLabs, SAP and many more top-tier firms.
Fig 1: Global Women in Data Science
Another great initiative, Women in Machine Learning, is hosted on Facebook. Many aspiring female researchers have joined hands there for research in machine learning. Many of them have achieved great milestones and many are still on the way. The world is full of talented, successful and fantastic female speakers. Women in machine learning and data science often go unnoticed by traditional recruiters, and we can change that. Another initiative at Meetup.c

Analysis and Research trends using Word Co-occurrence Network

Co-occurrence networks are used to give a graphic visualization of potential relationships between people, communities, organizations, concepts or other entities represented in written material. The generation and visualization of co-occurrence networks has become practical with the advent of electronically stored text amenable to text mining. By way of definition, co-occurrence networks are the collective interconnection of terms based on their paired presence within a specified unit of text. Networks are generated by connecting pairs of terms using a set of criteria defining co-occurrence. For example, terms A and B may be said to “co-occur” if they both appear in a particular article. Another article may contain terms B and C. Linking A to B and B to C creates a co-occurrence network of these three terms. Rules to define co-occurrence within a text corpus can be set according to desired criteria. For example, more str
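The A, B, C example from the excerpt can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `cooccurrence_edges`, the choice of one article as the unit of text, and the edge-weight convention (number of articles in which a pair appears together) are not taken from the original post.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(articles):
    """Count, for each pair of terms, how many articles contain both.

    `articles` is a list of term lists; each article is one unit of text.
    Returns a Counter mapping sorted (term, term) pairs to edge weights.
    """
    edges = Counter()
    for terms in articles:
        # Deduplicate and sort so each unordered pair has one canonical key.
        for a, b in combinations(sorted(set(terms)), 2):
            edges[(a, b)] += 1
    return edges


# One article mentions A and B, another mentions B and C.
articles = [["A", "B"], ["B", "C"]]
edges = cooccurrence_edges(articles)
for (a, b), weight in sorted(edges.items()):
    print(a, "-", b, "weight", weight)
```

A and C are linked only indirectly through B, because no single article contains both. A stricter co-occurrence rule would change which pairs are counted, not the overall structure of the sketch.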

Statistics and Graphical Models in Data Science

Statistics: For any aspiring data scientist, I would highly recommend learning statistics with a heavy focus on coding up examples, preferably in Python or R. My favorite series is the Statistical Learning series. The SL series is a great primer on statistical modeling / machine learning with applications in R. (Reference: Quora.com)
· Crucial component: Statistics is a crucial component of data science. At Twitch, the professional data science team brings together three things: the first is statistics, the second is programming, and the third is product knowledge. “And we would never hire someone who wasn’t strong in stats. You can be a great programmer, but if you don’t know what Bayes’ Rule is, then we have an engineering department I can point you to.” The origin of data science in statistics is largely undeniable.
· Programming: Python: Python is a widely used high-level language for general-purpose programming; it was created by Gui

Different Fields for Data Science

Data science, also known as data-driven science, is an interdisciplinary field about scientific methods, processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured, similar to Knowledge Discovery in Databases (KDD). Data science is a "concept to unify statistics, data analysis and their related methods" in order to "understand and analyze actual phenomena" with data. It employs techniques and theories drawn from many fields within the broad areas of mathematics, statistics, information science, and computer science, in particular from the subdomains of machine learning, classification, cluster analysis, data mining, databases, and visualization. Turing Award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational and now data-driven) and asserted that "everything about science is chan

Big Boss 10 Bani J and Manveer Social Media Analysis

This has been the most interesting analysis that I have worked on. Tweets were extracted for Bani Judge and Manveer Gurjar, the top two contestants of Big Boss Season 10 in India. The user-generated social media data was analyzed using topic detection and tracking, and Bani Judge was found to be the topic of discussion in 68.45% of the tweets, whereas Manveer Gurjar was the topic in 61.23% of the tweets. This clearly indicates that Bani Judge is trending. Further analysis was done using sentiment analysis, counting the number of positive and negative tweets. Out of the total number of tweets, Bani Judge received 89.5% positive tweets and Manveer Gurjar received 62.19% positive tweets. The remainder is neutral. This clearly indicates that Bani Judge won Indian hearts, whereas Manveer Gurjar won the BB10 trophy. As per the analysis, many people were in the favor of the fact tha