How to screen data science skills
Data science. A modern-day buzzword. In our present-day digital world, it’s common to discover titles assigned to roles and disciplines that are not yet universally defined and accepted. Few are more prominent than data science and the skills attributed to data scientists.
In this article, we’re going to break down the meaning of data science, outline data scientist skills, and give you our advice on how best to screen for a data science position.
The down-low on data science
According to market research company Forrester, by 2021, insight-driven businesses will be collectively worth $1.8 trillion, which is up from $333 billion in the year 2015. These ‘insights’ are derived from data, which plays a pivotal role in helping the world’s most successful companies become more profitable. The same report found that data-driven organizations are growing 8x faster than the global GDP. Food for thought.
The ability to interpret data and harness its usefulness is clearly a pretty serious job. But there is more or less a consensus about the lack of consensus regarding a clear definition of data science.
The field’s difficulties in defining itself haven’t slowed down the creation of new graduate programs with “data science” in their names. To confirm that, a recent survey analysis by KDNuggets has shown that graduate degrees with the name ‘data science’ began to emerge in 2007, with an enormous spike in enrolments in 2012.
It’s evident that data science positions are on a critical trajectory of their lifespan. Due to the field’s scalability, it’s receiving the attention it demands. But without being able to properly understand what it is, how are we supposed to hire for it?
DevSkiller’s got you covered on both fronts.
What is Data Science?
In its simplest form, data science is the discipline of making data useful. The concept of data science is ‘to unify statistics, data analysis, machine learning, and their related methods’ in order to ‘understand and analyze actual phenomena’ with data.
Traditionally, the data we could evaluate was mostly structured and small in size, and could be analyzed using simple BI tools. Today, most data is unstructured or semi-structured. This shift has accelerated the rise of the data scientist’s role.
1.1 What is the role of a data scientist?
A data scientist should be setting the data strategy of the company, which involves setting everything up, from the engineering and infrastructure for collecting and logging data to privacy concerns. They decide what data will be user-facing, how data is going to be used to make decisions, and how it’s going to be built back into the product. They will also be concerned with patenting innovative solutions and setting research goals. Their basic responsibilities include:
- Synthesizing all available information, statistics, and data of an organization,
- Compiling information about the AI needs in an organization,
- Analyzing data and finding potential uses with AI (sometimes called Exploratory Data Analysis),
- Explaining data patterns to business-oriented colleagues and clients (a process known as data storytelling),
- Designing and preparing machine learning models,
- Evaluating models’ efficacy in the production environment.
In case you didn’t know, a machine learning model is a program that has been trained to recognize certain types of patterns. It’s possible to train a model over a set of data, providing it an algorithm that it can use to reason over and learn from those data.
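To make that idea concrete, here is a minimal sketch of "training a model" in plain Python. It assumes made-up numbers (hours of machine use and a wear score) and uses ordinary least squares as the learning algorithm; real projects would normally reach for a library such as scikit-learn, so treat this purely as an illustration.

```python
# Toy training data, invented for illustration: hours of use -> wear score.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
wear = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_line(xs, ys):
    """Learn a slope and intercept from the data with ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the learned pattern to a value the model has not seen."""
    slope, intercept = model
    return slope * x + intercept


model = fit_line(hours, wear)  # "training": the pattern is learned from data
print(predict(model, 6.0))     # "inference": the pattern is applied to new input
```

The point for a recruiter is the separation between the two steps: the data and the algorithm produce the model, and the model is then what gets used in the product.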
A chief data scientist should manage a team of engineers, scientists, and analysts and should communicate with leadership across the company, including the CEO, CTO, and product leadership. She’ll also be concerned with patenting innovative solutions and setting research goals.
A popular Twitter definition has described a data scientist as ‘someone who is better at statistics than any software engineer and better at software engineering than any statistician’.
1.2 Is a data scientist similar to any other positions?
Many different kinds of analysts are able to ‘make data useful’, starting from a data engineer, all the way to a qualitative expert. While all these roles participate in data science, to refer to someone as a data scientist they should have expertise in all three areas (analytics, statistics, and ML/AI).
To offer an example, a machine learning developer does a subset of the data scientist’s tasks but focuses only on machine learning models. The position of data scientist really is an umbrella term, although job titles have never really been an accurate reflection of one’s responsibilities.
What is important for an IT recruiter to know about Data Science?
2.1 How often does the environment/challenges faced change?
One thing an IT recruiter should note is that the landscape is changing constantly. The data is always getting bigger, and problems are getting harder; so new techniques are developed and new frameworks are sure to follow.
2.2 Are there many resources/tools/technologies (libraries, frameworks, etc.) available?
Being familiar with certain resources and tools will certainly be a big advantage. Currently, a lot of tools are available in the Python language, while far fewer are available for R (another programming language). Some deep-learning frameworks are available in C++, as it’s faster and more memory-efficient than Python. In Python, some of the most popular libraries include pandas, Seaborn, plotly, scikit-learn, PyTorch, and TensorFlow.
2.3 What should a data scientist know about and what are the most important data scientist skills?
Data scientists are expected to know a lot — machine learning, computer science, statistics, mathematics, data visualization, communication, and deep learning. Within those areas, there are dozens of languages, frameworks, and technologies data scientists can learn.
Data science requires statistics and computer science skills — no surprise there. It is interesting that communication is mentioned in nearly half of the data science job listings these days. Data scientists need to be able to communicate insights and work with others. A basic list of what makes a good data scientist is below:
- Data analysis capability
- Skilled at machine learning
- Has good communication skills
- Has mastered a deep learning framework
- Is fluent in Python or R
2.4 What type of experience is important to look for in a data scientist (commercial, open-source, scientific, academic)?
For research-only projects, academic or scientific experience will be the most crucial and well-rounded. For creating production models, previous experience working with other production models will give you the best insight.
How to verify data scientist skills in the screening phase?
Growing data means growing opportunities; it all just needs good management. Verifying skills in the screening phase is tricky, but focusing on a candidate’s soft skills can also help you single out talent. Finding data scientists who are already great decision-makers can save a lot of hassle for your business.
3.1 What to take into account when screening a CV?
The most important thing to consider is whether the candidate has a detailed background in the most relevant areas. A history of exposure to mathematics, statistics, computer science, programming, and machine learning libraries is absolutely key here. Previous experience with data science analytics and programming is vital too.
What will separate a good data scientist from a great one are interpersonal communication skills, i.e. the ability to converse and cooperate with a wide variety of people. The candidate should also have good business acumen or a well-rounded understanding of business fundamentals and principles.
Be sure to check whether the candidate has indicated how their work contributed to an increase in sales, ROI, etc. It’s essential for top candidates to include quantitative evidence of their achievements.
If the candidate you’re looking for is a recent graduate, focus on their skills and relevant coursework or internships they may have done to assess their breadth of knowledge.
3.2 What glossary terms are important to know?
- Exploratory data analysis – this consists of data cleanup, exploration of data patterns, and the manual discovery of patterns in data
- Data storytelling – this refers to the description and visualization of data patterns for persons without the technical knowledge
- Classical Machine Learning – solving tasks using models like linear or logistic regression, decision trees, random forests, boosting, support vector machines, non-negative matrix factorization, K-means, k-nearest neighbors
- Deep Learning – solving tasks using neural networks. Some types of neural networks include Convolutional Neural Networks and Recurrent Neural Networks
Category | Libraries
Data analysis and manipulation libraries | In Python: NumPy, pandas; in R: dplyr, tidyr
Distributed data analysis and manipulation libraries | In Python: Dask; in Scala, Java, and Python: Spark
Data visualization libraries | In Python: Seaborn, Plotly, Matplotlib; in R: ggplot2
General Machine Learning libraries | In Python: scikit-learn; in R: caret, e1071
Deep Learning libraries | In Python: Keras, TensorFlow, PyTorch; in R: nnet; in C++: Caffe
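To give a feel for what the first glossary term, exploratory data analysis, looks like in practice, here is a minimal sketch in plain Python. The customer records, column names, and the 0 to 120 age range are all hypothetical; a working data scientist would more likely run equivalent checks with pandas.

```python
from collections import Counter

# Hypothetical customer records; None marks a missing value.
records = [
    {"age": 34, "country": "PL", "monthly_spend": 120.0},
    {"age": None, "country": "pl", "monthly_spend": 80.5},
    {"age": 29, "country": "DE", "monthly_spend": None},
    {"age": 290, "country": "DE", "monthly_spend": 95.0},
]


def missing_counts(rows):
    """Count missing (None) values per column."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None:
                counts[column] += 1
    return dict(counts)


def out_of_range(rows, column, low, high):
    """Return the indices of rows whose value falls outside [low, high]."""
    return [
        index
        for index, row in enumerate(rows)
        if row[column] is not None and not low <= row[column] <= high
    ]


def inconsistent_labels(rows, column):
    """Find labels that differ only by letter case, such as 'PL' and 'pl'."""
    variants = {}
    for row in rows:
        value = row[column]
        if value is not None:
            variants.setdefault(value.upper(), set()).add(value)
    return {key: sorted(forms) for key, forms in variants.items() if len(forms) > 1}


print(missing_counts(records))                  # which columns have gaps
print(out_of_range(records, "age", 0, 120))     # implausible ages
print(inconsistent_labels(records, "country"))  # the same country spelled two ways
```

Each check maps to one of the classic signs of messy data: missing values, errors, and inconsistencies.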
3.3 Which certifications are available and respected? How useful are they in determining data scientist skills?
Let’s get one thing clear upfront: you do not need any kind of data science certificate to get a job in data science. It helps, but recruiters aren’t overly fussed.
However, around half of machine learning knowledge is theoretical, so certifications in this area are highly applicable. The other 50% comes from experience, such as production models the candidate has created or Kaggle competitions they have entered. Certifications usually don’t check for business analysis skills or general people skills. The top certifications we have found are below.
- Certified Analytics Professional (CAP)
- Cloudera Certified Associate: Data Analyst
- Cloudera Certified Professional: CCP Data Engineer
- Data Science Council of America (DASCA) Senior Data Scientist (SDS)
- Data Science Council of America (DASCA) Principal Data Scientist (PDS)
- Dell EMC Data Science Track
- Google Certified Professional Data Engineer
- Google Data and Machine Learning
- IBM Data Science Professional Certificate
- Microsoft MCSE: Data Management and Analytics
- Microsoft Certified Azure Data Scientist Associate
- Open Certified Data Scientist (Open CDS)
- SAS Certified Advanced Analytics Professional
- SAS Certified Big Data Professional
- SAS Certified Data Scientist
Certifications obtained from Coursera, edX, or Udacity are also highly respected.
3.4 What other lines on a CV can show data scientist skills?
A candidate’s participation in conferences as a speaker can indicate storytelling ability, an important requirement in data science. It is obviously imperative to be an expert on the technical side of things, but the ability to explain your findings to those without your technical knowledge is just as crucial.
Taking part in machine learning competitions can also be a great advantage. Platforms such as Kaggle.com, topcoder.com, crowdai.org, and knowledgepit.ml all offer the chance to compete for awards in the space.
In today’s world, a good resume alone might not be enough to land that coveted interview call, especially for a data scientist role. As we are living and thriving in the midst of a digital revolution, it stands to reason that the recruiting process would incorporate that as well.
Browsing a candidate’s LinkedIn and GitHub accounts can be useful to get an outline of the candidate as well as to view their proficiency in open-source projects. You can decide whether the projects are relevant to the current role. This helps you to visualize the candidate’s profile so you are able to structure questions in a certain manner. You will also be able to determine whether the data scientist skills mentioned in the candidate’s resume are reflected in their GitHub profile.
Technical screening of data science skills during a phone/video technical interview
It’s difficult to rely on just the words of a resume. After all, it’s important to challenge the candidate to determine whether they really have the skills they claim to have. Even if it’s just a phone interview, it can help you understand how the candidate thinks and goes about solving problems related to their craft.
4.1 Questions that you should ask about a data scientist’s experience. Why should you ask each of those questions?
- What kind of DS projects did you do, and what was the extent of your engagement in the projects?
Reason: Data science is an extremely broad position, often with differing responsibilities. Some candidates may have worked only in data analysis and storytelling, or only gathered requirements and created machine learning models. The candidate’s experience should match the responsibilities of the position you’re recruiting for. This question is really aimed at checking the extent of the candidate’s skills.
- How did your work have a positive financial impact on the organization with the projects you played a part in?
Reason: The data scientist role is a position that requires a good understanding of business requirements and conditions. Look for answers that show specific measurements, such as ‘the marketing team was able to cut costs by 10% due to our results’, or ‘we have lowered customer turnover by 5% due to our new retention capabilities’.
- What kinds of libraries and programming techniques did you use?
Reason: Data scientists can use a wide variety of tools to achieve the same results. These can depend on the programming language one chooses, the internal company infrastructure, and the size of the dataset the candidate has worked with. The candidate will likely perform best with tools they have previous experience with.
4.2 Questions that you should ask about a data scientist’s knowledge and opinions. Why should you ask each of those questions?
- How would you check that a model is functioning properly?
Reason: The ideal methodology is to split the dataset into three sections: a training set, a validation set, and a test set. The training set is the only one the model sees during training and is the basis of the training process. The model’s hyperparameters are tuned using the validation set, and the model’s performance is then measured on the test set.
- How would you check if the data in the dataset is of good quality?
Reason: A data scientist will most likely have to work with a dataset collected within the company that might contain missing values, errors or inconsistencies – these are the signs of messy data. To find such problems, a data scientist should perform Exploratory Data Analysis to summarize the dataset’s main characteristics.
- What is boosting and what are the benefits of using it?
Reason: Boosting models are typically tree-based models consisting of groups of trees that are trained sequentially, with each new tree correcting the errors of the ones before it. Boosting models are currently among the most efficient ones, with great accuracy, relatively short training times, reduced memory usage, and medium-sized required training datasets (in comparison to deep learning techniques).
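If you want to see what "trained sequentially" means for boosting, here is a minimal sketch in plain Python. It assumes one particular variant, gradient boosting for squared error with one-split trees (stumps), and uses made-up data; the function names and the learning rate are illustrative choices, and production work would use a library such as XGBoost, LightGBM, or scikit-learn instead.

```python
def fit_stump(xs, residuals):
    """Find the one-split tree (stump) that best fits the current residuals."""
    best = None
    # Skip the largest value so the right-hand side of a split is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted(xs, ys, n_trees=20, learning_rate=0.5):
    """Train stumps one after another, each on the errors left by the previous ones."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [
            p + learning_rate * stump_predict(stump, x)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, learning_rate


def boosted_predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


def mean_squared_error(model, xs, ys):
    return sum((boosted_predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Invented data with three rough "steps" in it.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]

one_tree = fit_boosted(xs, ys, n_trees=1)
many_trees = fit_boosted(xs, ys, n_trees=20)
print(mean_squared_error(one_tree, xs, ys))
print(mean_squared_error(many_trees, xs, ys))
```

A strong candidate should be able to explain why the error on the training data falls as trees are added, and why that alone doesn't prove the model will do well on new data.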
A tip from our expert is to ask questions that are related to business problems you’re currently recruiting for. Like anyone, data scientists will work best in areas they’re familiar with.
For example, not every candidate may have a “feel” for (or be interested in, or willing to learn) the inner-workings of factory equipment (problems of predictive maintenance), medical terms (creating AI for the medical industry), or client preferences (recommender systems for e-commerce).
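Whatever the business domain, the answer to the first question in 4.2 should look much the same. Here is a minimal sketch of the three-way split in plain Python, assuming a 60/20/20 split and a fixed seed chosen only for illustration; the function name is hypothetical, and many teams would use a library helper such as scikit-learn's `train_test_split` instead.

```python
import random


def three_way_split(rows, train_share=0.6, validation_share=0.2, seed=0):
    """Shuffle once, then cut the data into training, validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    train_end = round(len(shuffled) * train_share)
    validation_end = train_end + round(len(shuffled) * validation_share)
    train = shuffled[:train_end]
    validation = shuffled[train_end:validation_end]
    test = shuffled[validation_end:]
    return train, validation, test


rows = list(range(100))  # stand-in for 100 labelled examples
train, validation, test = three_way_split(rows)

# Intended use of each part:
#   train      -> fit the model
#   validation -> compare settings (hyperparameters) and pick the best one
#   test       -> measure the chosen model once, at the very end
print(len(train), len(validation), len(test))
```

The detail to listen for in an interview is that the test set stays untouched until the end; a candidate who tunes the model against it has no honest measure of quality left.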
4.3 Behavioral questions that you should ask a data scientist. Why should you ask each of those questions?
- How do you deal with differences of opinion with colleagues?
Reason: A data scientist must have good communication and interpersonal skills (e.g. empathy) as their role is based on compiling data from colleagues and finding areas for improvement within their organization or society.
- Where do you find information about new data science techniques or cases?
Reason: As the data science field is constantly evolving and growing, the role requires constant research to stay up to date with the latest developments and to solve problems in the most efficient manner. Any of these sources are worthy: conference papers, workshop papers, MOOCs, blogs of companies dealing with data science, meetups of the DS community, Facebook or mail groups with a DS theme, or learning from a mentor.
- What do you consider to be your greatest success and biggest failure in the DS field?
Reason: This is a pretty generic question but it shows the self-recognition and self-reflection skills of the candidate. Both are necessary in the learning process which is a major part of being a great data scientist.
Technical screening of a data scientist’s skills using an online coding test
Hiring a data scientist can be a tricky process. The actual definition of a data scientist is vague, and the day-to-day job of someone with ‘data scientist’ in their job title varies dramatically between organizations. Also, people come to the field from a wide variety of backgrounds. Examining the past of a data scientist candidate is a science in itself, one worthy of a blog post of its own. We’re going to stick to showing you how best to screen for a data scientist!
5.1 Which online test for data scientist skills should you choose?
When looking for the right data science skills test, you should make sure it meets the following criteria:
- The test reflects the quality of professional work being done
- The duration is not too long, one to two hours max
- The test can be sent automatically and is straightforward in nature
- The difficulty level matches the candidate’s abilities
- The test goes beyond checking whether the solution works: it checks the quality of the code and how well it works in edge cases
- It’s as close as possible to the natural programming environment and allows the candidate to access relevant resources
- It gives the candidate the opportunity to use all the libraries, frameworks, and other tools they regularly encounter
5.2 DevSkiller ready-to-use online data science skills tests
DevSkiller coding tests use our RealLifeTesting™ methodology to mirror the actual coding environment that your candidate works in. Rather than using obscure algorithms, DevSkiller tests require candidates to build applications or features. They are graded completely automatically and can be taken anywhere in the world. At the same time, the candidate has access to all of the resources that they would normally use, including libraries, frameworks, StackOverflow, and even Google.
Companies use DevSkiller to test candidates using their own codebase from anywhere in the world. To make it easy, DevSkiller also offers a number of pre-made data science skills tests, like the ones here:
- Duration: 110 minutes max.
- Evaluation: Automatic
- Test overview:
Choice questions assessing knowledge of Python 3.x, Logical thinking, Sequence, Soft skills
Programming task - Level: Hard
Python | NumPy | Graph convolutional networks - Implement a simple graph convolutional network.
- Duration: 70 minutes max.
- Evaluation: Automatic
- Test overview:
Choice questions assessing knowledge of Python, Spark
Programming task - Level: Medium
Python | PySpark | Customer Preference Model - Implement a data engineering application for preprocessing marketing data.
- Duration: 65 minutes max.
- Evaluation: Automatic
- Test overview:
Choice questions assessing knowledge of Python
Programming task - Level: Easy
Python | PySpark | ML Logs Transformer - Complete the implementation of the logs’ transformation pipeline.
- Duration: 66 minutes max.
- Evaluation: Automatic
- Test overview:
Choice questions assessing knowledge of Scala
Programming task - Level: Easy
Scala | Spark | ML Logs Transformer - Complete the implementation of the logs’ transformation pipeline.
- Duration: 45 minutes max.
- Evaluation: Automatic
- Test overview:
Task - Level: Easy
SQL | Stamps Catalogue | Three highest prices - Select the three stamps (price and name) with the highest price.
Programming task - Level: Easy
Python | Pandas | HTML table parser - Implement a function to convert an HTML table into a CSV-format file.
- Duration: 35 minutes max.
- Evaluation: Automatic
- Test overview:
Choice questions assessing knowledge of Python
Programming task - Level: Easy
Python | Pandas | HTML table parser - Implement a function to convert an HTML table into a CSV-format file.
- Duration: 120 minutes max.
- Evaluation: Automatic
- Test overview:
Choice questions assessing knowledge of Python
Programming task - Level: Medium
Python | Vehicle sales report - Implement an application to create reports based on the vehicle sales data warehouse.
- Duration: 96 minutes max.
- Evaluation: Automatic
- Test overview:
Choice questions assessing knowledge of Python
Programming task - Level: Medium
Python | Pandas | A food delivery startup - Transform a database of orders by reducing its dimensionality and creating an additional analytical table.
- Duration: 45 minutes max.
- Evaluation: Automatic
- Test overview:
Choice questions assessing knowledge of Python
Programming task - Level: Easy
Python | Client Base Creator - Implement the application to retrieve customers’ contact data from chat messages.
- Duration: 70 minutes max.
- Evaluation: Automatic
- Test overview:
Choice questions assessing knowledge of Machine Learning, Python
Programming task - Level: Medium
Python | DNA analyzer | Create and clean DNA strands - Implement 2 methods in Python that create and clean DNA strands.
Test Data Science skills with our built-in PyCharm IDE
You can now assess your candidates’ Data Science skills with the use of our built-in PyCharm IDE.
Given how hard it is to attract skilled data scientists, creating the most candidate-friendly assessment environment possible is a huge asset. Letting Data Scientists work exactly the way they normally do during the recruitment process is a game-changer.
What this means for you and your candidates:
- Your candidates can now work directly in the browser, without having to download any components or wait for the program to load,
- They no longer have to clone the code, wait for the dependencies to install or indexes to build,
- Instead, they can start coding as soon as they open the test invitation.
This speeds up the process, resulting in lower candidate drop-off and a more positive candidate experience overall. Our PyCharm IDE is hosted on our own server in the cloud. Candidates can run tests, preview their solutions, and run their code.
We aim to make the screening process as close to a data scientist’s normal working environment as possible.
This is the second in-browser IDE from JetBrains that we’ve added to our platform, following the addition of IntelliJ IDEA for all Java tests earlier this year.
We’ll soon be rolling out more IDEs to the platform to make the testing environment universally enjoyable to candidates across all tech stacks.