Python Programming for Machine Learning, Data Science and Data Analysis:
This course has three objectives:
1- Quickly discover the essential concepts of Python programming.
2- Learn to master the libraries most used for data analysis (Data Science).
3- Become familiar with the basic concepts of machine learning.
This course is therefore mainly aimed at people who want to get started quickly in this exciting field, or at anyone whose professional activity, related to the data sciences, requires a concrete overview of this area and of what it can do.
Data Analysis – Basics of Python Programming:
This course is intended for beginners who wish to learn the basics of data analysis and machine learning.
1- Understand the basics of big data, machine learning, and data science.
2- Learn the basics of Python programming, multidimensional data handling with NumPy, and data preprocessing with Pandas.
3- Understand how to manage different data formats in Python.
4- Learn effective data visualization methods and the key concepts of machine learning, to reach the intermediate level in data science.
Introduction to the Python programming language:
Why Python?
A- It is an interpreted language: as soon as you write an instruction, you can execute it directly. A compiled language, by contrast, requires a complete program to be written and compiled before it can be tested. An interpreted language makes it easier to program interactively and to test a piece of code. It is therefore an ideal language for quickly developing and testing prototypes or, more particularly, algorithms for data analysis.
B- It is a high-level language: in some cases a single line of code can carry out complex processing, hiding details such as how data is managed and represented in the computer's memory, or low-level arithmetic and binary operations. A high-level language makes it easier to develop a program and thus increases productivity (usually at the expense of execution speed).
C- It is open source: Python is therefore free for everyone, including businesses.
D- It is a popular language: the developer community is very active, and documentation and examples of use for any type of application are easy to find on the internet. In addition, the community develops and maintains many libraries written for or compatible with Python, which increase productivity when writing a program, especially in the fields of mathematics and data science.
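As a small illustration of points A and B, the lines below can be typed one at a time into an interactive session (a Python prompt or a Google Colab cell) and run immediately. This is only a sketch: the sample sentence and the variable names are made up for the example.

```python
from collections import Counter

# Made-up sample data, for illustration only
sentence = "the cat and the dog and the bird"

# One high-level line: split the text into words and count each of them.
# No explicit loop and no memory management is needed.
word_counts = Counter(sentence.split())

# Another single line: the most frequent word and its number of occurrences
most_common_word, occurrences = word_counts.most_common(1)[0]

print(most_common_word, occurrences)
```

Each statement can be changed and executed again on its own, which is what makes prototyping quick in an interpreted language.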
Python libraries for data analysis:
One of the great strengths of the Python programming language is its large community of developers, who build and maintain a large number of libraries that make life easier for programmers. These libraries provide new types of specialized objects for a wide variety of applications, such as importing, exporting, processing and visualizing specific data, advanced mathematical, statistical and scientific analysis, interfacing and connectivity with other IT tools, etc.
In the field of data analysis in particular, here is a list of well-established libraries that we will see in the course:
numpy: scientific computing library (linear algebra, matrix computation, random numbers, etc.)
scipy: scientific and statistical computation library (based on numpy)
pandas: library for the representation, manipulation and analysis of data in the form of tables (or DataFrames)
matplotlib: graphical data visualization library (see Chapter 6)
To import a package, it must first be installed on the machine (which is already the case most of the time in Google Colab). Then use the following keywords:
import package as alias
or:
from package import subpackage as alias
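Both forms can be tried with modules from the Python standard library, which need no installation. This is a minimal sketch: the aliases `m` and `p` and the file path are arbitrary choices made for the example.

```python
# Form 1: import a whole package under a shorter alias
import math as m

# Form 2: import one submodule of a package under an alias
from os import path as p

# The alias is then used as a prefix to reach the package's functions
root = m.sqrt(16)                        # square root from the math module
filename = p.basename("data/iris.csv")   # last component of a file path

print(root, filename)
```

With the data analysis libraries, the same syntax is conventionally written `import numpy as np`, `import pandas as pd` and `from matplotlib import pyplot as plt`.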
Numpy:
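A minimal sketch of the ndarray object in use, assuming numpy is installed (as it normally is in Google Colab). The variable names and values are illustrative only.

```python
import numpy as np

# A vector (1 dimension) and a matrix (2 dimensions) are both ndarray objects
vector = np.array([1.0, 2.0])
matrix = np.array([[1.0, 2.0],
                   [3.0, 4.0]])

print(type(matrix).__name__)        # name of the object type
print(matrix.ndim, matrix.shape)    # number of dimensions, size along each one

# Algebraic operations: matrix-vector product and transpose
product = matrix @ vector
transposed = matrix.T

# Random number generation, with a fixed seed so that the run is repeatable
rng = np.random.default_rng(seed=0)
samples = rng.random(3)             # three values drawn uniformly from [0, 1)

print(product)
```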
Numpy is a library for mathematical computation. It includes the ndarray (n-dimensional array) object, which represents a multidimensional array (a tensor). This object can be used to represent a matrix or a vector. The package has many functions and methods, including algebraic operations on these objects and the generation of random numbers.
Introduction to Machine Learning:
An important sub-area of data analysis is machine learning. Machine learning brings together a set of techniques which automatically learn different types of relationships between variables from the data.
The following are generally distinguished:
Supervised machine learning, which learns relationships between data and labels. More specifically, we usually train an algorithm to predict labels (also called dependent variables, or targets) from a set of variables (called predictive variables, or features). A model consists of an algorithm with parameters which, based on operations between the parameters and the predictive variables, predicts a label for new data. Learning then consists in adjusting the model parameters so that the predicted labels are as close as possible to the actual labels. In other words, the mean error (absolute or quadratic) of the predictions is minimized. This error minimization is usually done iteratively, often by gradient descent.
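The training loop described above can be sketched in plain Python, so that every step stays visible. The model here is a straight line, label = weight * feature + bias, which is an assumption made for the example. The toy data, the learning rate and the number of steps are also hypothetical choices.

```python
# Toy data, made up for the example: the labels follow y = 2*x + 1 exactly
features = [0.0, 1.0, 2.0, 3.0, 4.0]
labels = [1.0, 3.0, 5.0, 7.0, 9.0]

# Model parameters, starting from an arbitrary guess
weight, bias = 0.0, 0.0
learning_rate = 0.05   # step size (hypothetical choice)
n = len(features)

def mean_squared_error(w, b):
    """Mean quadratic error between predicted and actual labels."""
    return sum((w * x + b - y) ** 2 for x, y in zip(features, labels)) / n

initial_error = mean_squared_error(weight, bias)

for step in range(2000):
    # Prediction errors with the current parameters
    errors = [weight * x + bias - y for x, y in zip(features, labels)]
    # Gradient of the mean squared error with respect to each parameter
    grad_weight = 2 * sum(e * x for e, x in zip(errors, features)) / n
    grad_bias = 2 * sum(errors) / n
    # Move each parameter a small step against its gradient
    weight -= learning_rate * grad_weight
    bias -= learning_rate * grad_bias

final_error = mean_squared_error(weight, bias)
print(round(weight, 2), round(bias, 2), final_error < initial_error)
```

On this data the parameters end up very close to 2 and 1, the values used to build the labels. In practice a library would do this work, but the principle is the same.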
There are two main applications of supervised learning:
a- Classification: we have a set of data, for example images, from which we would like to predict a class. The label of each image is then a categorical variable, for example "cat" or "dog". The predictive variables used to train the algorithm are, for example, the pixels of the images, or the relationships between these pixels.
b- Regression: the label to be predicted from the data is a continuous value: for example the price of a house that can be predicted from different characteristics, or the age of a person to be predicted from an image.
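To make the classification case concrete, here is a minimal sketch using a 1-nearest-neighbour rule, a technique chosen for this example and not one prescribed by the course. Instead of image pixels, it uses two made-up numeric features per animal (body weight and height). All names and values are hypothetical.

```python
# Made-up training data: (body weight in kg, height in cm) -> class label
training_features = [(4.0, 25.0), (5.0, 23.0), (20.0, 55.0), (30.0, 60.0)]
training_labels = ["cat", "cat", "dog", "dog"]

def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def predict_class(sample):
    """Return the label of the training example closest to `sample`."""
    distances = [squared_distance(sample, f) for f in training_features]
    nearest_index = distances.index(min(distances))
    return training_labels[nearest_index]

print(predict_class((4.5, 24.0)))    # a small animal
print(predict_class((25.0, 58.0)))   # a large animal
```

The predicted label is a category. In a regression problem the code would have the same shape, but the labels and the prediction would be numbers, such as a price or an age.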
Unsupervised machine learning, which learns the structure of the data, or relationships within the data, such as similarities between different variables or between different observations (e.g. clustering, dimensionality reduction, anomaly detection). These techniques do not require labelled data.
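As an illustration of clustering, the sketch below groups unlabelled values with a small k-means procedure using two clusters. The choice of k-means, the number of clusters, the starting centres and the data are all assumptions made for the example.

```python
# Made-up, unlabelled measurements forming two visible groups
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]

# Start with two arbitrary centre guesses (here the smallest and largest value)
centres = [min(values), max(values)]

for iteration in range(10):
    # Assignment step: attach each value to its nearest centre
    clusters = [[], []]
    for v in values:
        nearest = 0 if abs(v - centres[0]) <= abs(v - centres[1]) else 1
        clusters[nearest].append(v)
    # Update step: move each centre to the mean of its cluster
    centres = [sum(c) / len(c) for c in clusters]

print(clusters)
print(centres)
```

No label is supplied at any point: the two groups emerge only from the distances between the values.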