Famous in-memory data format
Apache Arrow is a holy grail of analytics that appeared fairly recently. It is a columnar in-memory data format. It lets you move objects from one process to another very quickly: from pandas to PyTorch, from pandas to TensorFlow, from CUDA to PyTorch, from one node to another node, and so on. This makes it the workhorse of a large number of frameworks in both analytics and big data.
I actually don't know of any other in-memory format that combines complex data types, dynamic schemas, performance, and broad platform support.
Apache Arrow itself is not a storage or execution engine. It is designed to serve as a foundation for the following types of systems:
- SQL execution engines (Drill, Impala etc)
- Data analysis systems (Pandas, Spark etc)
- Streaming and queueing systems (Kafka, Storm etc)
- Storage systems (Parquet, Kudu, Cassandra etc)
- Machine Learning libraries (TensorFlow, Petastorm, Rapids etc)
Please do not think that this is part of the Parquet format or part of PySpark. It is a separate, self-contained format that I think is a bit undervalued and should be taught alongside all the other big data formats.
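To make it less abstract, here is a minimal sketch with pyarrow (assumes the pyarrow and pandas packages are installed): a pandas DataFrame goes into an Arrow table and back, which is roughly the same mechanism frameworks use to hand columnar data to each other.

```python
# Minimal pyarrow sketch: pandas -> Arrow -> pandas
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"user_id": [1, 2, 3], "score": [0.5, 0.9, 0.1]})

table = pa.Table.from_pandas(df)   # pandas -> Arrow table
print(table.schema)                # Arrow keeps the column names and types

df_back = table.to_pandas()        # Arrow -> pandas
print(df_back.head())
```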
https://arrow.apache.org/overview/
#big_data
Where do I start to learn AWS?
So, if you go to the AWS documentation, you will see an endless list of services, and it's really just a global table of contents of other tables of contents! That's right: Amazon is huge right now. At the time of writing there are around two hundred and fifty services under the hood. It is not realistic to learn them all, and there is no reason to.
John Markoff said, "The Internet is entering its Lego era." AWS services are similar to Lego: you find the right pieces and combine them together. A reasonable way to single out the most essential pieces is to take the ones that came first historically. They are:
- S3 — storage
- EC2 — virtual machines + EBS drives
- RDS — databases
- Route53 — DNS
- VPC — network
- ELB — load balancers
- CloudFront — CDN
- SQS/SNS — messages
- IAM — main access rights to everything
- CloudWatch — logs/metrics
Then there are modern serverless pieces (Lambda, DynamoDB, API Gateway, CloudFront, IAM, SNS, SQS, Step Functions, EventBridge).
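To give a taste of what combining these pieces looks like in code, here is a minimal sketch using boto3, the AWS SDK for Python (assumes boto3 is installed and AWS credentials are configured, e.g. via `aws configure`; the bucket name and file names are made-up placeholders).

```python
# Minimal boto3 sketch: use the "storage" piece of the Lego set (S3)
import boto3

s3 = boto3.client("s3")

# Upload a local file to S3
s3.upload_file("report.csv", "my-example-bucket", "reports/report.csv")

# List what's in the bucket under the same prefix
response = s3.list_objects_v2(Bucket="my-example-bucket", Prefix="reports/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```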
#aws
Rapids
Nvidia has been developing an open source platform, Rapids, whose goal is to accelerate data processing and machine learning algorithms on the GPU. Developers using Rapids don't have to juggle different libraries: they write familiar Python code, and Rapids runs it on the GPU. All data is stored in memory in the Apache Arrow format.
I already wrote about GPU vs CPU. The problem is that the memory available to a CPU is now measured in terabytes, while a single GPU tops out at around 50 GB of memory. This is where Dask comes to the rescue: integration with Dask gives Rapids GPU clusters and multi-GPU support.
The Rapids repository has the cuDF library for data preparation (including preparing data for neural network training), and the cuML library lets you use machine learning algorithms without going into the details of CUDA programming.
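As an illustration, here is a minimal cuDF sketch (it needs an Nvidia GPU and the RAPIDS/cudf packages installed; the file and column names are placeholders). The API deliberately mirrors pandas.

```python
# Minimal cuDF sketch: pandas-like operations executed on the GPU
import cudf

gdf = cudf.read_csv("events.csv")          # loads straight into GPU memory
top = (
    gdf.groupby("user_id")["amount"]
       .sum()
       .sort_values(ascending=False)
       .head(10)
)
print(top.to_pandas())                      # move the small result back to the CPU
```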
Sounds cool, doesn't it? But there is always a but:
- it's still not production ready
- porting any complex UDF is very hard (at the very least you need to know CUDA, which I don't)
- no CPU version of the libraries for inference
- no automatic memory management
- it's Nvidia only
https://github.com/rapidsai
#ml
https://youtu.be/-P28LKWTzrI
MLOps
Our ML algorithms are fine, but good results require a sizable team of data specialists, data engineers, domain experts, and other support staff. And as if the number and cost of expert staff were not constraint enough, our understanding of how to optimize nodes, layers, and hyperparameters is still primitive. Finally, moving models into production and keeping them up to date is the last hurdle, given that serving predictions from a model often requires the same expensive and complex architecture that was used for training. It should be understood that moving to production is a process, not a step, and it starts long before model development. Its first step is to define the business objective, the hypothesis about the value that can be extracted from the data, and the business ideas for applying it.
MLOps is a combination of machine learning technologies and processes with approaches to embedding the resulting models into business processes. The concept emerged as an analogy to DevOps applied to ML models and ML practices. DevOps is an approach to software development that increases the speed of delivering individual changes while maintaining flexibility and reliability through a number of practices: continuous delivery, splitting functionality into independent microservices, automated testing and deployment of individual changes, global performance monitoring, a system for responding promptly to detected failures, and so on.
MLOps, or DevOps for machine learning, allows data science and IT teams to collaborate and accelerate model development and implementation by monitoring, validating, and managing machine learning models.
Of course, there is nothing new here: everyone has been doing this in one way or another for a while. It's just that now there is a hype word, behind which there are usually ready-made solutions like Seldon, Kubeflow, or MLflow.
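As a small illustration of what these tools cover, here is a minimal sketch of the experiment-tracking part of MLOps with MLflow (assumes the mlflow package is installed; the parameter and metric values are made up).

```python
# Minimal MLflow sketch: track a training run's parameters and metrics
import mlflow

with mlflow.start_run(run_name="baseline-model"):
    mlflow.log_param("model_type", "logistic_regression")
    mlflow.log_param("max_iter", 200)
    mlflow.log_metric("val_accuracy", 0.87)
    # model artifacts, data versions, etc. can be logged the same way
```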
#ml
Spark NLP is a Natural Language Processing library built on top of Apache Spark ML. The framework's creators actively promote it: they say it's pretty cool and promise SOTA results in NLP. I haven't tried it myself, but it would be interesting to compare the claims with its real capabilities. So far it looks promising.
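For a taste of the API, here is a minimal sketch using one of its pretrained pipelines (assumes pyspark and spark-nlp are installed; the pipeline is downloaded on first use).

```python
# Minimal Spark NLP sketch: run a pretrained pipeline on a sentence
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()  # starts a Spark session with the Spark NLP jars

pipeline = PretrainedPipeline("explain_document_dl", lang="en")
result = pipeline.annotate("Spark NLP promises state-of-the-art NLP on top of Spark.")

print(result["pos"])       # part-of-speech tags
print(result["entities"])  # named entities
```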
https://github.com/JohnSnowLabs/spark-nlp
#spark #ml
In short, I wrote a book to add one more title to my name: author. I got a lot of experience out of it and not much else. I will be glad if you recommend it, read it, or write a review.
https://www.amazon.com/dp/B08KG1DNRD/ref=cm_sw_r_cp_awdb_t1_K-rDFbH218AY4
Hit the like button if you want to know more about the tech writing topic; even though I did everything myself, I can dive into it a little deeper.
A friend asked me an interesting question: what skills are worth learning for Data Management specialists, and how do you build a growth roadmap? The question actually made me think, because I didn't have a clear picture in my head. These are just my thoughts on the topic, and for the most part I'm speculating about the current state and the future of Data Management.
https://luminousmen.com/post/data-management-skills
A great piece about how betting on computer games and AI development made Nvidia America's biggest semiconductor company.
https://www.wsj.com/articles/how-nvidias-ceo-cooked-up-americas-biggest-semiconductor-company-11600184856
Give me 10 seconds to explain what ML engineers are doing
A very cool talk on why and how everything (doesn't) work, and what to do about it:
https://youtu.be/xA5U85LSk0M
#dev
The UK fucked up the covid-19 statistics from September 25 to October 2 because they stored daily case data in an Excel spreadsheet, with each case put in a separate column, and at some point it ran out of columns. Because of that, the data was not loaded into the official dashboard in real time, and the British found out about the huge increase in infections when it was already too late.
P.S. As a solution, they are thinking of keeping several Excel spreadsheets. Lol
https://www.dailymail.co.uk/news/article-8805697/Furious-blame-game-16-000-Covid-cases-missed-Excel-glitch.html.
A cool writeup on zero-shot and few-shot learning techniques in NLP. It's hard to create a model that performs well on unseen data without fine-tuning it; the author comprehensively explains the methods and gives examples of how this problem can be solved.
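One of the approaches discussed in the post, treating classification as textual entailment, is easy to try with the Hugging Face transformers pipeline; here is a minimal sketch (assumes transformers and a backend such as torch are installed; the model is downloaded on first use).

```python
# Minimal zero-shot classification sketch with Hugging Face transformers
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
result = classifier(
    "The quarterly revenue grew by 20% year over year.",
    candidate_labels=["finance", "sports", "politics"],
)
print(result["labels"][0], result["scores"][0])  # most likely label and its score
```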
https://joeddav.github.io/blog/2020/05/29/ZSL.html
#ds
Technical debt
Any reasonably experienced engineer has more than once run into a situation where "dirty" work has to be done: writing intentionally non-scalable code, skipping decomposition, intentionally skipping tests, deploying manually, hardcoding configuration. Because it is "temporary". Because "we have to ship something now, we'll fix it later". Cut corners, code smells, undocumented changes: all of this accumulates over time. And it is very difficult to fix later; it's often easier to start with a clean slate.
There is no escape from this, and a small amount of technical debt will haunt you until retirement. To somehow live with it, you can apply the same "golden" 80/20 rule. If you work in a project team, allocate 80% of your resources to the project and 20% to "paying off" the debt. You are both the creditor and the payer here, and if you miss a couple of "payments" (no matter for what reasons), you end up in bondage, where, on the contrary, 80% of the time goes to servicing technical debt and only 20% to the project.
A useful article about the causes, consequences, and ways to avoid technical "debt": https://www.extremeuncertainty.com/technical-debt-technical-bankruptcy
#dev
Privacy is an emerging topic in the Machine Learning community. There are no canonical guidelines for producing a private model, and there is a growing body of research showing that a machine learning model can leak sensitive information from its training dataset, creating a privacy risk for the users in the training set.
Cost-efficient "membership inference attacks" predict whether a specific piece of data was used during training. If an attacker can make this prediction with high accuracy, they will likely succeed in figuring out whether a data point was in the training set. The biggest advantage of a membership inference attack is that it is easy to perform, i.e., it does not require any re-training.
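To make the idea concrete, here is a toy sketch of the simplest confidence-threshold flavor of membership inference (not the shadow-model attack from the paper below): points from the training set tend to get higher-confidence predictions than unseen points. Assumes scikit-learn and numpy; the data is synthetic.

```python
# Toy confidence-threshold membership inference sketch
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

def true_label_confidence(model, X, y):
    # probability the model assigns to the true label of each point
    proba = model.predict_proba(X)
    return proba[np.arange(len(y)), y]

train_conf = true_label_confidence(model, X_train, y_train)
test_conf = true_label_confidence(model, X_test, y_test)

# Guess "member" when confidence exceeds a threshold; a gap between the two
# rates means the attack does better than random guessing.
threshold = 0.9
print("flagged as members (train):", (train_conf > threshold).mean())
print("flagged as members (test): ", (test_conf > threshold).mean())
```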
A few years ago, Cornell researchers investigated the privacy properties of machine learning models. Interesting read:
https://www.cs.cornell.edu/~shmat/shmat_oak17.pdf
Definitely panic if there's caviar
I ran into the Valve handbook for new employees and got caught up reading it for a while. At times I wanted to steal pieces of the text, it is that well written. I advise you to take a look at it, it's very interesting.
"We usually don't do any formalized employee "development" (course work, mentor assignment), because for senior people it's mostly not effective. We believe that high-performance people are generally self-improving."
https://steamcdn-a.akamaihd.net/apps/valve/Valve_NewEmployeeHandbook.pdf
#stuff
Daily Standup
Scrum (one of the Agile frameworks) has a procedure called the "Daily Scrum" (or Daily Standup). It is a simple team meeting where everyone talks about yesterday's achievements and what they will be doing today. It is supposed to keep everyone in sync and give people a place to raise critical questions.
The name "standup" comes from the fact that at this meeting everyone, as you guessed, is standing, which is supposed to make it go faster. Supposedly, people get tired of standing, want to finish as soon as possible and run away, whereas when you sit at a meeting you are comfortable and a lot of time is spent on chitchat.
False. Yes, people get tired, but they don't start giving out information any more concisely. And if you have 10 people (it is believed that everyone speaks for no longer than a minute or two), the last one on the list will be so tired that they will mumble a few words just to plant their ass back on a comfortable, soft chair. As a result, we have a meeting that everyone hates and nobody wants to go to.
Scrum requires some discipline and self-control; it is perfect for small teams with a new product whose future is unclear. In turn, it is completely unsuitable for large monolithic projects and engineering teams, especially reactive ones working on incoming tasks.
There are a lot of disputes about the right and wrong ways to run Scrum, and about the importance of the Scrum Master role and who should be given it, but I have not seen any discussions about morning standups, although I often meet people who do not like them.
For those who missed the article on the scrum - https://luminousmen.com/post/11-steps-of-scrum
David Beazley recently released Practical Python Programming, a Python course he created and has taught for 13 years. Definitely recommended for newbies 👌
#python
DeepMind shared their new curated list of learning resources for many different areas of DS, ML and AI
https://storage.googleapis.com/deepmind-media/research/New_AtHomeWithAI%20resources.pdf
#ds #ml