What is the difference between studying Data Science and a real job

The main work is done on a remote server

Most people begin their Data Science journey on a personal computer. Real-world projects, however, often require far more processing power than a laptop or even a gaming PC can provide. Data Science researchers therefore use their computers to access a remote server via SSH (Secure Shell). SSH establishes a secure connection to a computing machine; once the connection is made, you can work in the remote server's command shell from your own computer. So it helps to know basic Linux commands and to be comfortable working in a terminal.
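As a rough sketch of what that workflow looks like, the snippet below assembles an `ssh` invocation that would run a command on a remote machine. The user name, host, and remote command are hypothetical placeholders, not anything from the article:

```python
import shlex

def build_ssh_command(user, host, remote_cmd, port=22):
    """Assemble an ssh invocation that runs a command on a remote server.

    All names here are hypothetical placeholders.
    """
    return ["ssh", "-p", str(port), f"{user}@{host}", remote_cmd]

# e.g. check GPU availability on a (hypothetical) compute server
cmd = build_ssh_command("researcher", "gpu01.example.com", "nvidia-smi")
print(shlex.join(cmd))  # ssh -p 22 researcher@gpu01.example.com nvidia-smi
```

In practice you would type such a command directly into your terminal, or pass the list to `subprocess.run` from a script.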


SQL is the king of data

Every Data Science project starts with data. And most of the time, the data needed to solve a problem is not easy to get: you have to collect it from separate datasets and combine it across several database tables.

SQL is the standard query language for databases. It lets you quickly join, aggregate, and retrieve the information you need, and it makes working with datasets easy. The problem is that most Data Science enthusiasts don't work with databases, because training datasets are usually already prepared by someone else. In fact, about 90% of project time is spent collecting and preparing data. Yes, it sounds frustrating, but without data there would be no data science. It should be noted that SQL has many dialects, but they are similar to one another: knowing one, you can easily adapt to another. Just pick any dialect and start learning it.
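As a minimal illustration, here is the kind of join-and-aggregate query the paragraph describes, run against an in-memory SQLite database. The tables and values are invented for the example:

```python
import sqlite3

# Invented example tables: users and their orders.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER, city TEXT)")
cur.execute("CREATE TABLE orders (user_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO users VALUES (?, ?)",
                [(1, "Berlin"), (2, "Berlin"), (3, "Paris")])
cur.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, 10.0), (1, 5.0), (2, 7.5), (3, 20.0)])

# Join the two tables and aggregate order totals per city.
rows = cur.execute("""
    SELECT u.city, SUM(o.amount) AS total
    FROM users AS u
    JOIN orders AS o ON o.user_id = u.id
    GROUP BY u.city
    ORDER BY total DESC
""").fetchall()
print(rows)  # [('Berlin', 22.5), ('Paris', 20.0)]
```

The same `JOIN`/`GROUP BY` pattern carries over to any SQL dialect, which is why learning one dialect transfers so well.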


Features are more important than the model

Linear models are usually considered too simple and unsuitable for real-world machine learning tasks. But can you get good results from a linear model simply by increasing the number of features? Actually, you can.

More sophisticated models, such as Random Forest, XGBoost, SVMs, and deep neural networks, look for nonlinear boundaries in the feature space. They do this either by partitioning the space into small regions, or by mapping the features to a higher-dimensional space where the boundaries look linear. Simply put, the model-building process can be seen as fitting a straight line to newly generated data points. Since the models do not know the true meaning of the features, they try to create these new points based on some kernel or by optimizing a pseudo-likelihood function.
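A tiny sketch of the "map to a higher-dimensional space" idea: one-dimensional points that no single threshold can separate become linearly separable after adding a squared feature. The data is invented for illustration:

```python
# Invented 1-D data: class 1 sits in the middle, so no single
# threshold on x separates the two classes.
xs = [-3, -2, -1, 0, 1, 2, 3]
labels = [0, 0, 1, 1, 1, 0, 0]

# Map each point into a higher-dimensional space: x -> (x, x**2).
mapped = [(x, x * x) for x in xs]

# In the new space the boundary x**2 < 2.5 is linear and separates perfectly.
preds = [1 if x_sq < 2.5 else 0 for _, x_sq in mapped]
print(preds == labels)  # True
```

This is the same trick kernel methods perform implicitly, without materializing the extra dimensions.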

Sounds pretty complicated, right? That's why such models are often called gray or black boxes. On the other hand, if you understand what the features actually mean, you can construct informative new features from the data yourself. The process of generating, transforming, and preprocessing features is called feature engineering. Its basic techniques include computing standard deviations, discretization, feature aggregation, and so on.
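A minimal sketch of such feature engineering, assuming a hypothetical per-user list of transaction amounts: aggregation (a count), spread (a standard deviation), and discretization (a spending bucket):

```python
import statistics

# Hypothetical raw data: transaction amounts per user.
transactions = {
    "user_a": [10.0, 12.0, 11.5, 9.5],
    "user_b": [300.0, 5.0, 250.0, 1.0],
}

def engineer_features(amounts):
    """Turn a raw list of amounts into model-ready features."""
    mean = statistics.mean(amounts)
    return {
        "count": len(amounts),                        # aggregation
        "mean": mean,
        "std": statistics.stdev(amounts),             # spread of spending
        "bucket": "high" if mean > 100.0 else "low",  # discretization
    }

for user, amounts in transactions.items():
    print(user, engineer_features(amounts))
```

Each engineered feature has an obvious real-world reading, which is exactly what makes the resulting model interpretable.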

With properly engineered features, even a linear model can achieve excellent results. Linear algorithms are also easier to interpret: you can see the significance of the engineered features and judge from their coefficients how sensible the model is. If a coefficient that logically should be positive turns out to be negative, there's probably something wrong with the model, the data, or the initial assumptions.
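As a small illustration of that coefficient sanity check, the sketch below fits a one-feature linear model by ordinary least squares and inspects the slope's sign. The data points are invented:

```python
# Invented data: e.g. years of experience vs. salary in $k.
xs = [1, 2, 3, 4, 5]
ys = [30, 35, 42, 48, 55]

# Ordinary least squares for y = slope * x + intercept.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# Sanity check: salary should grow with experience, so slope must be positive.
print(round(slope, 2), round(intercept, 2))  # 6.3 23.1
```

If the slope had come out negative here, that would signal a problem with the data or the assumptions rather than a genuine effect.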


The experiment and the product launch are completely different things

More often than not, Data Scientists work in Jupyter Notebook, a simple and convenient environment for experimentation and visualization. You can quickly try something new, train a model, or see the result of a calculation just by adding a cell and running a few lines of code.

But once the model is ready to go into production, the reign of the Jupyter Notebook ends, and the power passes to plain Python files. Production is where your algorithms do their work in real life. The end user judges the quality of the final product, so production code must be fast, clean, readable, fault-tolerant, and easy to debug.

Code speed is not so important if you are just experimenting and running the program once or twice. However, in production, your code will probably run several times a day and affect other parts of the product, so speed will become important.

Let's face it, most of your .ipynb files probably have a lot of unordered lines, unused imports, and unnecessary cells. And that's fine when you're just experimenting. But you need to clean up your code before releasing it to production. Consider your code clean enough if someone on your team can review it and easily understand the purpose of each line. This is why you should give your variables and functions descriptive names.

It is worth logging the important steps of code execution to the console as well as to a log file. This will help you quickly identify possible problems. A good log file should contain start and end times, results, and exception records.
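A minimal sketch of such logging with Python's standard `logging` module: each step records its start, duration, result, and any exception. The step name is hypothetical, and in production you would also attach a `FileHandler` so the same records end up in a log file:

```python
import logging
import time

# Console logging; add a FileHandler in production to also keep a log file.
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("pipeline")

def run_step(name, fn, *args):
    """Run one pipeline step, logging start/end times, result, and exceptions."""
    log.info("step %s started", name)
    start = time.time()
    try:
        result = fn(*args)
        log.info("step %s finished in %.3fs, result=%r",
                 name, time.time() - start, result)
        return result
    except Exception:
        log.exception("step %s failed", name)  # records the full traceback
        raise

run_step("sum_inputs", sum, [1, 2, 3])  # hypothetical pipeline step
```

Wrapping every step this way gives you exactly the start/end times, results, and exception records the paragraph recommends, with no extra effort per step.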

Thanks to the CLLAX - Business Software Reviews for writing this article.