DataFrames can be partially cached, but partitions cannot be partially cached. When you use cache() or persist(), a DataFrame is not fully cached until you invoke an action that goes through every record, e.g. count(). So if an action like take(1) is used, only one partition will be cached, because Catalyst understands that you don't need to compute all partitions just to get one record. #spark
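A minimal PySpark sketch of this behavior (the session setup and numbers are mine, just for illustration); the Storage tab of the Spark UI shows how much of the cache actually got materialized:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partial-cache-demo").getOrCreate()
df = spark.range(0, 10_000_000, numPartitions=16)

df.cache()   # lazy: nothing is materialized yet
df.take(1)   # only the partition needed for this single row gets cached
             # Spark UI -> Storage: "Fraction Cached" shows ~1/16, not 100%

df.count()   # touches every record, so now all 16 partitions are cached
             # Spark UI -> Storage: "Fraction Cached" shows 100%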
Let robots yell at people
As a technical lead, you are at your best as a productivity multiplier for your entire team. Anything you can do to increase your team's velocity has compounding effects for the company. Think automation/instrumentation/documentation. (rephrasing some twitter guy, not my words)
For example, I often debug Python code with plain built-in print statements. Yes, I rarely use pdb and all that stuff, and I don't suffer much. But sometimes I forget to remove those debugging prints before opening a PR. Instead of having colleagues come back to me after every review to point it out, we can add a linter rule that catches it automatically (a minimal sketch below). Nobody's relationship gets strained, repeated mistakes are prevented at the source, and newcomers get trained by the tooling.
What do you think?
#dev #management
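A minimal sketch of such a rule as a standalone AST checker that a pre-commit hook could run on changed files (the script layout and message are mine for illustration; as far as I know, a ready-made flake8-print plugin exists for the same job):

import ast
import sys

def find_prints(path):
    """Yield line numbers of print() calls in a Python file."""
    with open(path, encoding="utf-8") as f:
        tree = ast.parse(f.read(), filename=path)
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "print"):
            yield node.lineno

if __name__ == "__main__":
    failed = False
    for path in sys.argv[1:]:  # pre-commit passes the changed files as arguments
        for lineno in find_prints(path):
            sys.stderr.write(f"{path}:{lineno}: leftover debugging print()\n")
            failed = True
    sys.exit(1 if failed else 0)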
How are you debugging in Python?
Anonymous Poll
6% - pdb
29% - print()
48% - PyCharm/Spyder/etc debugger
6% - breakpoint()
6% - coding in REPL/Jupyter/IPython
3% - other
I expect that for you, Airflow is more than just a fancy word for a fart. Airflow DAGs often become too big and complicated to understand. They get split between different teams within a company for implementation and support, and you end up with the problem of wiring the different DAGs back into one pipeline.
https://luminousmen.com/post/airflow-dag-dependencies
#big_data
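One common way to wire such DAGs together (the DAG ids, task ids and schedule below are placeholders of mine; imports are for Airflow 2.x) is TriggerDagRunOperator, which lets an upstream DAG kick off a downstream one; ExternalTaskSensor is the usual alternative when the downstream DAG should wait instead:

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="team_a_ingestion",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = DummyOperator(task_id="ingest")  # stands in for the real ingestion work

    trigger_downstream = TriggerDagRunOperator(
        task_id="trigger_team_b_processing",
        trigger_dag_id="team_b_processing",   # the other team's DAG
    )

    ingest >> trigger_downstream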
In Wired, a cool article about how ML combined with good physical simulation leads to order-of-magnitude faster experiments. Here it drives the development of new batteries. My respect.
https://www.wired.com/story/ai-is-throwing-battery-development-into-overdrive/
#stuff
New York University (NYU) has made available a course on deep learning (https://cds.nyu.edu/deep-learning/) by Yann LeCun.
He is one of the creators of modern deep learning, a Turing Award winner, and the founding head of AI research at Facebook.
You don't need Hadoop
In the article the author argues that there is no need to use Hadoop for 600MB of data. He puts a clear emphasis on data size, and his point is understandable, but in my opinion data size matters less than data structure and query patterns. Yes, Big Data is when the data doesn't fit on one machine. But not every task that requires horizontal scaling is big data.
If the processing is trivial, even 100 terabytes may be "small" for Hadoop. It depends on what you do and how you do it.
Everything boils down to a trivial conclusion: use a tool suitable for the task.
#big_data
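To make the 600MB point concrete, a tiny sketch (the file name and column are placeholders of mine): a dataset of that size fits in memory on an ordinary laptop and is a one-liner in pandas, no cluster required:

import pandas as pd

# a hypothetical ~600MB CSV: loads into memory on a laptop in seconds
df = pd.read_csv("events.csv")

# the kind of aggregation people sometimes reach for Hadoop to do
daily_counts = df.groupby("event_date").size()
print(daily_counts.head())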
Apple released a TensorFlow fork with support for the new M1 chip on macOS Big Sur. So TensorFlow now has a build for Apple's own processor, with hardware acceleration. Users can now get up to 7x faster training on the new 13-inch MacBook Pro with M1.
I like the idea of moving ML closer to the user; that makes a lot of sense, especially for Apple.
Link
#ml
Evolution of trust
COVID is still brutal on the streets, but there is an epidemic that has been raging far longer. Using game theory, the author tries to explain the epidemic of mistrust and to find a way to fix it.
https://ncase.me/trust/
#soft_skills
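The core mechanic behind the guide is the repeated prisoner's dilemma; here is a tiny sketch of it (the payoff numbers and strategies are my own simplification, not taken from the site):

# True = cooperate, False = cheat; payoff is from the first player's point of view
PAYOFF = {(True, True): 2, (True, False): -1, (False, True): 3, (False, False): 0}

def tit_for_tat(history):
    return True if not history else history[-1][1]  # copy the opponent's last move

def always_cheat(history):
    return False

def play(strategy_a, strategy_b, rounds=10):
    history_a, history_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a, move_b = strategy_a(history_a), strategy_b(history_b)
        score_a += PAYOFF[(move_a, move_b)]
        score_b += PAYOFF[(move_b, move_a)]
        history_a.append((move_a, move_b))
        history_b.append((move_b, move_a))
    return score_a, score_b

print(play(tit_for_tat, always_cheat))  # one-off betrayal pays, then both stall: (-1, 3)
print(play(tit_for_tat, tit_for_tat))   # repeated cooperation compounds: (20, 20)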
Today we're gonna deep dive into Spark memory management.
https://luminousmen.com/post/dive-into-spark-memory
If you have been writing Spark applications for a while, you have surely come across tuning its configuration parameters. Now you can do it automatically with a tool that optimizes the cluster resources for you. Made by me, for you.
http://spark-configuration.luminousmen.com/
#spark #big_data
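For context, these are the kinds of knobs the tool computes for you; a minimal sketch of setting them by hand in PySpark (the values are arbitrary examples, not recommendations):

from pyspark.sql import SparkSession

# the usual suspects when sizing a Spark job manually (example values only)
spark = (
    SparkSession.builder
    .appName("manually-tuned-job")
    .config("spark.executor.instances", "10")
    .config("spark.executor.cores", "5")
    .config("spark.executor.memory", "14g")
    .config("spark.executor.memoryOverhead", "2g")
    .config("spark.driver.memory", "8g")
    .config("spark.sql.shuffle.partitions", "400")
    .getOrCreate()
)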
Who here is professionally connected to, or just interested in, DS/ML?
Anonymous Poll
27% - I'm an ML engineer
35% - I'm a Data Scientist
38% - Never did any, but interested
AWS S3 is Now Strongly Consistent!
Effective immediately, all S3 GET, PUT, and LIST operations, as well as operations that change object tags, ACLs, or metadata, are now strongly consistent.
https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/
#aws
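What that means in practice, as a small boto3 sketch (bucket and key names are placeholders): a read issued right after a successful write now returns the new data, and LIST reflects it too, with no eventual-consistency window:

import boto3

s3 = boto3.client("s3")

# write, then immediately read back: the GET now always reflects the PUT
s3.put_object(Bucket="my-bucket", Key="reports/latest.json", Body=b'{"ok": true}')
body = s3.get_object(Bucket="my-bucket", Key="reports/latest.json")["Body"].read()

# LIST is strongly consistent too: the new key shows up right away
listing = s3.list_objects_v2(Bucket="my-bucket", Prefix="reports/")
keys = [obj["Key"] for obj in listing.get("Contents", [])]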
A preview version of Learning Spark, updated for the latest 3.0 release. I'm saving you the time of filling out the forms; enjoy the book.
#big_data #spark
When I got my first computer, it had Windows installed on it. I don't know about you, but every time Windows threw an error that meant nothing to me, I went to the lower Internet to figure it out. And I found out that there are many more interesting things on the lower Internet...
Windows has created a whole layer of programmers who can solve problems. Thanks to it for that.
But Windows is still a pile of horse shit.
That's why you can buy a New Year's ugly sweater with it: https://gear.xbox.com/pages/windows
Andreessen Horowitz published a detailed guide on the state of architectures for data infrastructure.
It covers data sources, ingestion and transformation, storage, historical analytics, predictive analytics, and outputs, along with the different tools that can be used for each, plus case studies from several large companies on their data infrastructure setups.
Highly recommended.
https://a16z.com/2020/10/15/the-emerging-architectures-for-modern-data-infrastructure/