Today we're going to take a deep dive into Spark memory management
https://luminousmen.com/post/dive-into-spark-memory
Blog | iamluminousmen
Deep Dive into Spark Memory Management
Discover why your Spark cluster is losing money with a deep dive into Spark memory management. Uncover the complexities of memory allocation, off-heap memory, and task management for optimal performance.
If you have been writing Spark applications for a while, you have inevitably run into tuning its configuration parameters. Now you can do it automatically with a tool that optimizes cluster resources for you. Made by me for you.
http://spark-configuration.luminousmen.com/
#spark #big_data
Apache Spark Configuration Optimization
Tool for automatic Apache Spark cluster resource optimization
Who here is professionally connected to, or interested in, DS/ML?
Anonymous Poll
27%
I'm ML engineer
35%
I'm Data Scientist
38%
I've never done any, but I'm interested
AWS S3 is Now Strongly Consistent!
Effective immediately, all S3 GET, PUT, and LIST operations, as well as operations that change object tags, ACLs, or metadata, are now strongly consistent.
https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/
#aws
Amazon
Amazon S3 Update – Strong Read-After-Write Consistency | Amazon Web Services
When we launched S3 back in 2006, I discussed its virtually unlimited capacity (“…easily store any number of blocks…”), the fact that it was designed to provide 99.99% availability, and that it offered durable storage, with data transparently stored in multiple…
Preview version of Learning Spark, updated for the latest 3.0 release. I'm saving you the time of filling out the forms; enjoy the book.
#big_data #spark
When I got my first computer, it had Windows installed. I don't know about you, but every time Windows showed an error that meant nothing to me, I went digging into the depths of the Internet to solve it. And I found out that there are many more interesting things down there...
Windows has created a whole layer of programmers who can solve problems. Thanks for that.
But Windows OS is still a pile of horse shit.
That's why you can buy a New Year's ugly sweater with it: https://gear.xbox.com/pages/windows
Andreessen Horowitz published a detailed guide on the state of architectures for data infrastructure.
It covers data sources, ingestion and transformation, storage, historical (analytical) and predictive processing, and outputs, along with the different tools that can be used for each, as well as case studies from several large companies on their data infrastructure setups.
Highly recommended.
https://a16z.com/2020/10/15/the-emerging-architectures-for-modern-data-infrastructure/
The report of the World Economic Forum has been published with the forecast of how the labor market will change in the next five years. The main topic is the mass transition to online and digitalization of all professions, including those influenced by COVID-19.
WEF analysts predict that by 2025 the world will have 97 million new jobs, while 85 million will disappear or be automated.
https://www.weforum.org/reports/the-future-of-jobs-report-2020
Demand for Data Engineers Up 50%
Following up on the previous post: the Dice 2020 Tech Job Report labeled data engineer the fastest-growing job in technology in 2019, with 50% year-over-year growth in the number of open positions. Check out the report:
https://techhub.dice.com/Dice-2020-Tech-Job-Report.html
The Difference Between Amateurs and Professionals
Interesting read, take a look. Basically, you can read it as a set of soft-skills rules.
Those are the things I like in particular:
Amateurs have a goal. Professionals have a process.
Amateurs go faster. Professionals go further.
https://fs.blog/2017/08/amateurs-professionals/
#soft_skills
Farnam Street
Turning Pro: The Difference Between Amateurs and Professionals
Learn how to go pro and unlock the next level by uncovering the hidden differences in mindset between amateurs and professionals.
Fuck it
FuckIt.py uses state-of-the-art technology to make sure your Python code runs whether it has any right to or not. Does some code have an error? Fuck it.
https://github.com/ajalt/fuckitpy
I want to draw your attention to the tests where you can see full proof that P ≠ NP.
#python
GitHub
GitHub - ajalt/fuckitpy: The Python error steamroller.
The Python error steamroller. Contribute to ajalt/fuckitpy development by creating an account on GitHub.
This is a whole new sport
Visual artist removed bikes from BMX tricks video:
https://www.instagram.com/p/CFKV2pbg3Wb/
Instagram
Fernando Livschitz
Best tricks @nitrocircus @nitroworldgamesofficial 1) @rwillyofficial 2) @kurtisdowns 3) @gavgodfrey
How to pass the interview, if you are an ML engineer
https://joelgrus.com/2016/05/23/fizz-buzz-in-tensorflow/
#ml
Joel Grus
Fizz Buzz in Tensorflow
interviewer: Welcome, can I get you coffee or anything? Do you need a break? me: No, I've probably had too much coffee already! interviewer: Great, great. And are you OK with writing code on the...
Complexity in distributed systems
The complexity of applications is often reduced to a single concept: Big O time complexity. Simply put, Big O notation describes how runtime scales with respect to some input variable. With Big O notation, you can mathematically describe how a function (application) behaves as the input size goes to infinity.
Although this makes sense academically, in real life engineers have more indicators of application complexity.
In addition to time complexity (measured in the number of operations), we also have memory complexity. Time is unlimited: in theory, we can solve anything given an unlimited amount of time. Memory, on the other hand, is always a limited resource, and we should not forget about it when solving practical tasks. For example, when you have a huge array to sort and only one spare memory cell, you can't use the QuickSort algorithm; you'll have to look for alternatives.
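The point about memory as a limited resource shows up even in toy problems: checking a list for duplicates lets you trade memory for time. A minimal sketch (the function names are mine, for illustration only):

```python
# Two ways to check a list for duplicates, trading memory for time.

def has_duplicates_low_memory(items):
    """Compare every pair: O(n^2) time, but O(1) extra memory."""
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False

def has_duplicates_fast(items):
    """Remember everything seen so far: O(n) expected time,
    but O(n) extra memory for the set."""
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False
```

Both return the same answer; which one is "better" depends entirely on whether your bottleneck is CPU time or memory.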
But apart from the limitations on the number of operations and memory, we also have other resources that we need to consider.
In the current world of microservice architecture, distributed systems increasingly dominate development. The number of nodes, the number of executors on those nodes, and the network capacity of a system all affect the complexity of a distributed algorithm.
Informally, data engineers understand that the number and types of inter-node communications affect the complexity of distributed algorithms. This is why the Big Data world singles out shuffle operations and tries to move from a synchronous model to a concurrent one where possible. But when reading academic papers, you rarely see such practical considerations, without which it sometimes makes no sense to implement one or another SOTA algorithm from paper.
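A toy sketch of why shuffles dominate this cost, and why map-side pre-aggregation (what Spark's reduceByKey does before shuffling) helps. The partitions and counts here are made up for illustration; the "network cost" is just the number of records crossing the shuffle boundary:

```python
from collections import Counter

# Toy model of a distributed word count across two partitions.
partitions = [
    ["spark", "spark", "memory", "spark"],
    ["memory", "spark", "shuffle", "shuffle"],
]

# Naive shuffle: every (word, 1) pair crosses the network.
naive_records = [(w, 1) for part in partitions for w in part]

# Map-side combine: each partition pre-aggregates locally,
# shipping one record per distinct word instead of per occurrence.
combined_records = [
    (w, c) for part in partitions for w, c in Counter(part).items()
]

def reduce_side(records):
    """Final aggregation on the reducer side."""
    totals = Counter()
    for word, count in records:
        totals[word] += count
    return dict(totals)

# Both strategies produce identical final counts...
assert reduce_side(naive_records) == reduce_side(combined_records)
# ...but the combined one sends fewer records over the network.
print(len(naive_records), len(combined_records))  # 8 vs 5
```

Same result, fewer bytes on the wire: that difference is invisible to Big O over operations, yet it is often what decides whether a distributed job is feasible at all.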
#ml #dev
Distributed systems are hard because:
▪️Engineers can’t combine error conditions. Instead, they must consider many permutations of failures. Most errors can happen at any time, independently of (and therefore, potentially, in combination with) any other error condition.
▪️The result of any network operation can be UNKNOWN, in which case the request may have succeeded, failed, or been received but not processed.
▪️Distributed problems occur at all logical levels of a distributed system, not just low-level physical machines.
▪️Distributed problems get worse at higher levels of the system, due to recursion.
▪️Distributed bugs often show up long after they are deployed to a system.
▪️Distributed bugs can spread across an entire system.
▪️Many of the above problems derive from the laws of physics of networking, which can’t be changed.
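The UNKNOWN outcome in the second bullet is typically handled by making operations idempotent and retrying. A minimal sketch, with all names invented for illustration:

```python
# Toy server that deduplicates by request id, so a retried request that
# actually succeeded the first time is not applied twice.
class Server:
    def __init__(self):
        self.balance = 0
        self.seen = set()  # request ids already applied

    def deposit(self, request_id, amount):
        if request_id not in self.seen:  # idempotency check
            self.seen.add(request_id)
            self.balance += amount
        return "ok"

def flaky_call(server, request_id, amount, fail_reply):
    """Simulate a call whose reply is lost AFTER the server has
    processed the request: the classic UNKNOWN outcome."""
    reply = server.deposit(request_id, amount)
    if fail_reply:
        raise TimeoutError("reply lost; outcome UNKNOWN")
    return reply

server = Server()
# First attempt: the server processes the deposit, but the reply is lost.
try:
    flaky_call(server, request_id="req-1", amount=100, fail_reply=True)
except TimeoutError:
    # Safe to blindly retry, because the operation is idempotent.
    flaky_call(server, request_id="req-1", amount=100, fail_reply=False)

print(server.balance)  # 100, not 200
```

Without the request-id check, the retry would double-apply the deposit; with it, the client can treat UNKNOWN as "retry until acknowledged".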
The author changes the term "independent failure" to "sharing fate". Lol
https://aws.amazon.com/builders-library/challenges-with-distributed-systems/
#big_data
Amazon
Challenges with distributed systems
Introducing properties of distributed systems that make them so challenging, including non-determinism and testing.